G3. Gauge data fitness
This step asks whether the available EHR data are suitable for estimating the target estimand. EHR data may be large, but size alone does not mean the data are appropriate for the question.
Researchers should assess whether the sample, variables, measurement timing, follow-up period, and data quality are sufficient for the intended research task. If the data are not fit for purpose, the question, estimand, data source, or analysis plan may need to be revised.
Description
Sample Requirements
The sample should be drawn from the target population or contain sufficient information about the selection mechanism to enable standardisation or reweighting to that population.
Sample size should be sufficient to estimate the estimand with the desired level of precision
Period of observation is sufficient for the target estimand (e.g. calendar time coverage sufficient for trend analysis, or follow-up long enough for cumulative incidence estimation)
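As a rough illustration of the precision requirement above, the sketch below estimates how many records are needed for the confidence interval around a proportion (e.g. a prevalence) to have a chosen half-width. It assumes simple random sampling and the Wald (normal approximation) interval, which is only one of several possible approaches; the function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def n_for_proportion(expected_p, half_width, confidence=0.95):
    """Approximate sample size so that a Wald confidence interval for a
    proportion has the requested half-width.

    Assumption: simple random sampling and the normal approximation,
    n = z^2 * p * (1 - p) / d^2.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * expected_p * (1 - expected_p) / half_width ** 2)


# Hypothetical example: expected prevalence of 10%, target precision of
# +/- 1 percentage point at 95% confidence.
required_n = n_for_proportion(expected_p=0.10, half_width=0.01)
print(required_n)
```

Comparing the result with the number of eligible records gives a first indication of whether the sample can support the estimand. Clustering by site or practice would usually increase the required size.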
Variable Requirements
Key health state, event, exposure, or practice of interest and all key auxiliary variables are available and accurately measured
Variable definitions, coding practices, and recording completeness are consistent across data sources, sites, and time periods (e.g. diagnostic codes are harmonised across datasets, and any secular changes in coding or recording practices are unlikely to produce artefactual trends)
Signal Discovery
Sample Requirements
The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.
Sample size should be sufficient to detect desired effect sizes after multiple testing correction, accounting for case-control imbalance where applicable (e.g. effective sample size in case-control designs)
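The two quantities mentioned above can be sketched as follows. The effective sample size formula used here, 4 / (1/cases + 1/controls), is a commonly used approximation for unbalanced case-control data, and Bonferroni is only one of several multiple testing corrections; both are assumptions for illustration, as are the function names and example counts.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective sample size of an unbalanced case-control sample.

    Assumption: the common approximation 4 / (1/cases + 1/controls),
    which equals the total N when cases and controls are balanced.
    """
    return 4 / (1 / n_cases + 1 / n_controls)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical example: 1,000 cases, 9,000 controls, one million tests.
n_eff = effective_sample_size(1_000, 9_000)
threshold = bonferroni_threshold(0.05, 1_000_000)
print(round(n_eff), threshold)
```

In this example 10,000 participants carry roughly the information of 3,600 balanced participants, which is the figure to use when checking power against the corrected threshold.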
Variable Requirements
Key variables (i.e. exposures/variants and outcomes/traits) are consistently defined, available, and accurately measured across all data sources and cohorts
Where the signal of interest takes time to emerge, the period of observation is sufficient to detect it (e.g. the duration of follow-up in a pharmacovigilance analysis is sufficient for relevant adverse outcomes to occur)
Factual Prediction
Sample Requirements
The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.
Sample size should be sufficient to estimate baseline risk with desired precision, while minimising overfitting and optimism (e.g. using pmsampsize or equivalent)
Period of observation is sufficient to observe the outcome within the intended prediction horizon (e.g. follow-up long enough to observe 5-year cardiovascular events)
Variable Requirements
Outcome, treatments, and all key predictors are available, accurately measured, and consistently defined across the study period and any external validation samples
All predictors are available at the reference time point (i.e. measured at or before the landmark time for prognostication, or available at the point of screening for classification)
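One way to check predictor availability at the reference time point is to list every patient and predictor pair that has no measurement on or before the landmark date. The sketch below assumes a simple long-format extract of (patient, predictor, measurement date) tuples; the data layout, function name, and example values are hypothetical.

```python
from datetime import date


def unavailable_at_landmark(measurements, landmarks, required):
    """Return (patient_id, predictor) pairs with no measurement recorded
    on or before that patient's landmark date.

    measurements: iterable of (patient_id, predictor, measured_on)
    landmarks:    dict mapping patient_id -> landmark date
    required:     list of predictor names the model needs
    """
    available = set()
    for patient_id, predictor, measured_on in measurements:
        landmark = landmarks.get(patient_id)
        if landmark is not None and measured_on <= landmark:
            available.add((patient_id, predictor))
    return sorted(
        (pid, pred)
        for pid in landmarks
        for pred in required
        if (pid, pred) not in available
    )


# Hypothetical example data.
landmarks = {"p1": date(2020, 1, 1), "p2": date(2020, 6, 1)}
measurements = [
    ("p1", "sbp", date(2019, 12, 1)),
    ("p1", "hba1c", date(2020, 2, 1)),  # recorded after the landmark
    ("p2", "sbp", date(2020, 6, 1)),
    ("p2", "hba1c", date(2020, 5, 1)),
]
gaps = unavailable_at_landmark(measurements, landmarks, ["sbp", "hba1c"])
print(gaps)
```

A large number of gaps for a predictor suggests it may not be usable at the intended moment of prediction, even if it is well recorded overall.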
Counterfactual Prediction
Sample Requirements
The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.
Sample size should be sufficient to estimate counterfactual risks with desired precision, while minimising overfitting and optimism, and with adequate observations per confounder to avoid sparse data bias
Period of observation is sufficient to observe the outcome within the intended prediction horizon, with sufficient prior observation time to establish baseline exposure status
Variable Requirements
The outcome, all hypothetical treatment strategies, and all confounders are available and accurately measured, with consistent definitions and coding practices across data sources, sites, and time periods
Variables are measured with sufficient timing and frequency to establish the correct causal ordering between the hypothetical treatment strategies and the outcome.
All hypothetical treatment strategies of interest are observed across all relevant confounder strata
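The requirement that every strategy is observed in every confounder stratum can be screened with a simple cross-tabulation. The sketch below flags stratum and strategy combinations with fewer than a chosen number of observations. It assumes confounders have already been collapsed into discrete strata; the function name, threshold, and example records are hypothetical.

```python
from collections import Counter


def sparse_cells(records, strategies, min_count=1):
    """Return (stratum, strategy) combinations observed fewer than
    min_count times.

    records:    list of (stratum, strategy) tuples, one per patient
    strategies: list of all treatment strategies of interest
    """
    counts = Counter(records)
    strata = sorted({stratum for stratum, _ in records})
    return [
        (stratum, strategy)
        for stratum in strata
        for strategy in strategies
        if counts[(stratum, strategy)] < min_count
    ]


# Hypothetical example: no untreated patients aged 65 or over.
records = [
    ("age<65", "statin"),
    ("age<65", "no statin"),
    ("age>=65", "statin"),
]
violations = sparse_cells(records, ["statin", "no statin"])
print(violations)
```

Empty or near-empty cells indicate strata where counterfactual risks under some strategy would rest on extrapolation rather than observed data.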
Causal Effect Estimation
Sample Requirements
The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.
Sample size should be sufficient to estimate the estimand with desired precision, with adequate observations per confounder to avoid sparse data bias
Period of observation is sufficient to observe the outcome following exposure (e.g. follow-up long enough for the outcome to accrue, and sufficient prior observation time to establish baseline exposure status and exclude prevalent users)
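The prior observation requirement above can be checked by comparing each patient's first recorded exposure with the start of their observable period. The sketch below flags patients whose look-back window is shorter than a chosen washout, and who therefore cannot be reliably classified as new users. The 365-day default, the data layout, and the function name are assumptions for illustration.

```python
from datetime import date


def insufficient_lookback(patients, washout_days=365):
    """Return IDs of patients with fewer than washout_days of observation
    before their first recorded exposure.

    patients: dict mapping patient_id -> (observation_start, first_exposure)
    """
    return sorted(
        patient_id
        for patient_id, (obs_start, first_exposure) in patients.items()
        if (first_exposure - obs_start).days < washout_days
    )


# Hypothetical example data.
patients = {
    "a": (date(2018, 1, 1), date(2019, 6, 1)),  # 516 days of look-back
    "b": (date(2019, 1, 1), date(2019, 3, 1)),  # 59 days of look-back
}
flagged = insufficient_lookback(patients)
print(flagged)
```

If a large share of exposed patients is flagged, the data may not support a new-user design without a shorter washout or an additional data source.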
Variable Requirements
Exposure, outcome, any mediators, and all key confounders are available and accurately measured, with consistent definitions and coding practices across data sources, sites, and time periods.
Variables are measured with sufficient timing and frequency to establish the correct causal ordering between exposure and outcome
Exposure (and any mediators) varies within the sample and across all relevant confounder strata
By the end of this step, you should have:
Assessed whether the sample reflects or can be transported to the target population
Confirmed that key variables are available and measured with sufficient accuracy
Checked whether timing and follow-up support the estimand
Identified gaps in variable availability, measurement, or observation periods
Decided whether to proceed, refine the question, enrich the data, or seek another data source