G3. Gauge data fitness

This step asks whether the available EHR data are suitable for estimating the target estimand. EHR data may be large, but size alone does not mean the data are appropriate for the question.

Researchers should assess whether the sample, variables, measurement timing, follow-up period, and data quality are sufficient for the intended research task. If the data are not fit for purpose, the question, estimand, data source, or analysis plan may need to be revised.

Description

Sample Requirements

  1. The sample should be drawn from the target population or contain sufficient information about the selection mechanism to enable standardisation or reweighting to that population.

  2. Sample size should be sufficient to estimate the estimand with the desired level of precision

  3. Period of observation is sufficient for the target estimand (e.g. calendar time coverage sufficient for trend analysis, or follow-up long enough for cumulative incidence estimation)
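
As a rough illustration of the precision requirement for a descriptive estimand, the sketch below computes the sample size needed to estimate a prevalence to within a chosen margin. It assumes a simple random sample and a Wald-type confidence interval; the function name and the example prevalence and margin are hypothetical, and clustered or weighted EHR samples would typically need more.

```python
import math
from statistics import NormalDist


def n_for_proportion(expected_p: float, half_width: float, confidence: float = 0.95) -> int:
    """Smallest n for which a Wald interval for a proportion has the given half-width."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * expected_p * (1 - expected_p) / half_width**2)


# Hypothetical target: prevalence near 10%, estimated to within +/- 1 percentage point.
n_required = n_for_proportion(expected_p=0.10, half_width=0.01)
print(n_required)
```

Comparing this figure with the number of eligible patients in the extract gives a quick first check before any detailed analysis planning.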

Variable Requirements

  1. Key health state, event, exposure, or practice of interest and all key auxiliary variables are available and accurately measured

  2. Variable definitions, coding practices, and recording completeness are consistent across data sources, sites, and time periods (e.g. diagnostic codes are harmonised across datasets, and any secular changes in coding or recording practices are unlikely to produce artefactual trends)
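
One simple way to screen for artefactual trends is to look for abrupt year-on-year changes in how often a code is recorded. The sketch below is an assumption-laden illustration, not a validated method: the function name, the ratio threshold of 1.5, and the example rates are all hypothetical, and a flagged year only indicates that coding or recording practice deserves a closer look.

```python
def flag_recording_shifts(rate_by_year: dict[int, float], max_ratio: float = 1.5) -> list[int]:
    """Return years whose recording rate differs from the previous year by more than max_ratio."""
    years = sorted(rate_by_year)
    flagged = []
    for prev, cur in zip(years, years[1:]):
        a, b = rate_by_year[prev], rate_by_year[cur]
        if a == b:
            continue  # no change at all, nothing to flag
        # A jump from or to zero is always flagged; otherwise compare the ratio.
        if min(a, b) == 0 or max(a, b) / min(a, b) > max_ratio:
            flagged.append(cur)
    return flagged


# Hypothetical recording rates per 1,000 patient-years for one diagnostic code.
rates = {2012: 4.1, 2013: 4.3, 2014: 8.9, 2015: 9.2}
print(flag_recording_shifts(rates))
```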

Signal Discovery

Sample Requirements

  1. The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.

  2. Sample size should be sufficient to detect desired effect sizes after multiple testing correction, accounting for case-control imbalance where applicable (e.g. effective sample size in case-control designs).
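
The two adjustments named above can be sketched as follows. The effective sample size formula used here, 4 / (1/cases + 1/controls), is one common convention for case-control data, and Bonferroni is only one of several multiple testing corrections; both are assumptions of this sketch, as are the function names and example counts.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Size of a balanced case-control study carrying roughly the same information."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests


# Hypothetical scan: 1,000 cases, 99,000 controls, one million tests.
n_eff = effective_sample_size(n_cases=1_000, n_controls=99_000)
alpha_per_test = bonferroni_alpha(family_alpha=0.05, n_tests=1_000_000)
print(round(n_eff), alpha_per_test)
```

With strong imbalance the effective sample size is driven almost entirely by the number of cases, which is why a very large control pool does not by itself make a dataset fit for signal discovery.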

Variable Requirements

  1. Key variables (i.e. exposures/variants and outcomes/traits) are consistently defined, available, and accurately measured across all data sources and cohorts

  2. Where the signal of interest takes time to emerge, the period of observation is sufficient to detect it (e.g. the duration of follow-up in a pharmacovigilance analysis is sufficient for relevant adverse outcomes to occur)

Factual Prediction

Sample Requirements

  1. The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.

  2. Sample size should be sufficient to estimate baseline risk with desired precision, while minimising overfitting and optimism (e.g. using pmsampsize or equivalent)

  3. Period of observation is sufficient to observe the outcome within the intended prediction horizon (e.g. follow-up long enough to observe 5-year cardiovascular events)
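
To give a feel for how overfitting enters the sample size requirement, the sketch below implements a single shrinkage-based criterion of the kind described by Riley and colleagues: n = P / ((S - 1) ln(1 - R2 / S)), where P is the number of candidate predictor parameters, S the target shrinkage factor, and R2 the anticipated Cox-Snell R-squared. This is not the pmsampsize implementation, which combines several criteria; the function name and inputs are hypothetical, and a dedicated tool should be used for real planning.

```python
import math


def n_for_target_shrinkage(n_parameters: int, r2_cox_snell: float, shrinkage: float = 0.9) -> int:
    """Minimum n so that expected uniform shrinkage is at least `shrinkage` (one criterion only)."""
    if not 0 < r2_cox_snell < shrinkage < 1:
        raise ValueError("expected 0 < r2_cox_snell < shrinkage < 1")
    denominator = (shrinkage - 1) * math.log(1 - r2_cox_snell / shrinkage)
    return math.ceil(n_parameters / denominator)


# Hypothetical model: 20 candidate parameters, anticipated Cox-Snell R-squared of 0.10.
n_min = n_for_target_shrinkage(n_parameters=20, r2_cox_snell=0.10)
print(n_min)
```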

Variable Requirements

  1. Outcome, treatments, and all key predictors are available, accurately measured, and consistently defined across the study period and any external validation samples

  2. All predictors are available at the reference time point (i.e. measured at or before the landmark time for prognostication, or available at the point of screening for classification)
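
The timing requirement can be checked mechanically once each predictor has a measurement date. The sketch below is a minimal illustration; the function name, predictor names, and dates are hypothetical.

```python
from datetime import date


def predictors_after_landmark(measured_on: dict[str, date], landmark: date) -> list[str]:
    """Return the predictors whose measurement date falls after the landmark time."""
    return sorted(name for name, when in measured_on.items() if when > landmark)


# Hypothetical patient record with a landmark (prediction) date of 1 March 2020.
landmark = date(2020, 3, 1)
measured_on = {
    "systolic_bp": date(2020, 2, 14),
    "smoking_status": date(2019, 11, 3),
    "hba1c": date(2020, 3, 20),  # recorded after the landmark, so not usable
}
late = predictors_after_landmark(measured_on, landmark)
print(late)
```

Any predictor returned here would leak future information into the model and would not be available when the prediction is actually made.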

Counterfactual Prediction

Sample Requirements

  1. The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.

  2. Sample size should be sufficient to estimate counterfactual risks with desired precision, while minimising overfitting and optimism, and with adequate observations per confounder to avoid sparse data bias

  3. Period of observation is sufficient to observe the outcome within the intended prediction horizon, with sufficient prior observation time to establish baseline exposure status

Variable Requirements

  1. The outcome, all hypothetical treatment strategies, and all confounders are available and accurately measured, with consistent definitions and coding practices across data sources, sites, and time periods

  2. Variables are measured with sufficient timing and frequency to establish the correct causal ordering between the hypothetical treatment strategies and the outcome.

  3. All hypothetical treatment strategies of interest are observed across all relevant confounder strata
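
The last requirement can be screened by cross-tabulating observed strategies against confounder strata and listing the empty cells. The sketch below assumes confounders have already been coarsened into discrete strata; the function name, strata, and strategies are hypothetical, and strata with no patients at all will not appear in the output.

```python
from collections import defaultdict


def unobserved_strategies(records, strategies):
    """Map each observed confounder stratum to the strategies never seen in it.

    `records` is an iterable of (stratum, strategy) pairs, one per patient.
    """
    wanted = set(strategies)
    seen = defaultdict(set)
    for stratum, strategy in records:
        seen[stratum].add(strategy)
    return {
        stratum: sorted(wanted - observed)
        for stratum, observed in seen.items()
        if wanted - observed
    }


# Hypothetical data: stratum = (age band, kidney disease status).
records = [
    (("<65", "no CKD"), "statin"),
    (("<65", "no CKD"), "no statin"),
    (("65+", "CKD"), "no statin"),
    (("65+", "CKD"), "no statin"),
]
gaps = unobserved_strategies(records, ["statin", "no statin"])
print(gaps)
```

A non-empty result means counterfactual risks in those strata would rest on model extrapolation rather than observed data.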

Causal Effect Estimation

Sample Requirements

  1. The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.

  2. Sample size should be sufficient to estimate the estimand with desired precision, with adequate observations per confounder to avoid sparse data bias 

  3. Period of observation is sufficient to observe the outcome following exposure (e.g. follow-up long enough for the outcome to accrue, and sufficient prior observation time to establish baseline exposure status and exclude prevalent users)
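
The prior observation requirement is often operationalised as a washout window: a patient counts as a new user only if they were observable for the whole window before their first recorded exposure. The sketch below assumes a 365-day window and uses hypothetical field names and dates; the appropriate window depends on the exposure and the data source.

```python
from datetime import date, timedelta


def has_washout(registration_start: date, first_prescription: date, washout_days: int = 365) -> bool:
    """True if the patient was under observation for the full washout window
    before the first recorded prescription."""
    return first_prescription - registration_start >= timedelta(days=washout_days)


# Hypothetical patients: (start of observable record, first recorded prescription).
long_history = has_washout(date(2015, 1, 1), date(2016, 6, 1))
short_history = has_washout(date(2016, 3, 1), date(2016, 6, 1))
print(long_history, short_history)
```

Patients failing the check may be prevalent users whose earlier prescriptions were simply never captured, so they are usually excluded rather than treated as newly exposed.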

Variable Requirements

  1. Exposure, outcome, any mediators, and all key confounders are available and accurately measured, with consistent definitions and coding practices across data sources, sites, and time periods.

  2. Variables are measured with sufficient timing and frequency to establish the correct causal ordering between exposure and outcome

  3. Exposure (and any mediators) varies within the sample and across all relevant confounder strata

By the end of this step, you should have:

  • Assessed whether the sample reflects the target population, or whether estimates can be transported or reweighted to it

  • Confirmed that key variables are available and measured with sufficient accuracy

  • Checked whether timing and follow-up support the estimand

  • Identified gaps in variable availability, measurement, or observation periods

  • Decided whether to proceed, refine the question, enrich the data, or seek another data source
