R5. Run Appropriate Analysis
The analysis should follow from the research task, estimand, data fitness assessment, and identified threats to validity. Different EHR research tasks require different analytical strategies.
This step helps researchers select and conduct analyses that are suitable for the target estimand and available data, while documenting key analytical decisions clearly.
Description
Choose between design-based and model-based approaches
When stratifying or standardizing, explain the purpose of the stratification or standardization (and, for standardization, select an appropriate target population)
For model-based approaches, select an appropriate model family for the data structure (e.g., multilevel models for hierarchically structured data, small area estimation for sparse geographies)
When the analysis requires follow-up over time, specify how competing events and censoring are handled and state the assumptions of the chosen approach
For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2
Compare estimates with external benchmarks where available
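The standardization and benchmarking points above can be illustrated with a minimal sketch. All stratum labels, rates, and population weights below are hypothetical; the point is only the mechanics of re-weighting stratum-specific estimates to an external standard population:

```python
# Hypothetical example: direct age-standardization of a crude prevalence
# estimate to an external standard population. All numbers are illustrative.
import numpy as np

# Stratum-specific prevalence observed in the EHR cohort (e.g., by age band)
strata = ["18-44", "45-64", "65+"]
ehr_prevalence = np.array([0.05, 0.12, 0.30])   # observed within each stratum
ehr_weights = np.array([0.60, 0.25, 0.15])      # cohort's own age distribution

# Target (standard) population age distribution, e.g., from census data
standard_weights = np.array([0.45, 0.35, 0.20])

crude = float(ehr_prevalence @ ehr_weights)           # weighted by cohort mix
standardized = float(ehr_prevalence @ standard_weights)  # weighted by target mix

print(f"Crude prevalence:        {crude:.3f}")
print(f"Standardized prevalence: {standardized:.3f}")
```

Here the cohort over-represents younger strata, so the crude estimate understates prevalence in the target population; the standardized figure is the one to compare against external benchmarks.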
Signal Discovery
Choose an appropriate association model based on the outcome type and study design (e.g., linear or logistic regression for unrelated samples, linear mixed models for samples with population structure or relatedness, Cox models for time-to-event phenotypes)
For time-to-event phenotypes, specify how competing events will be handled
For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2
Choose an appropriate multiple testing correction for the inferential context (FDR vs. FWER)
Pre-specify falsification strategies to detect residual confounding or other systematic biases (e.g., genomic inflation diagnostics, negative control outcomes)
Seek replication in independent data where feasible; where replication is not possible, state the reasons and acknowledge the signals as preliminary
Use phenome-wide or exposure-wide visualization to contextualize signals
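As a concrete illustration of the multiple testing point, the Benjamini-Hochberg FDR adjustment can be written out directly. The p-values below are hypothetical; in practice a library routine such as statsmodels' `multipletests(..., method="fdr_bh")` serves the same purpose:

```python
# Illustrative Benjamini-Hochberg FDR adjustment for association p-values
# from a hypothetical signal-discovery scan, implemented explicitly in numpy.
import numpy as np

def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # ranks by ascending p-value
    ranked = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # enforce monotonicity from the largest rank downward, then cap at 1
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    ranked = np.minimum(ranked, 1.0)
    q = np.empty(m)
    q[order] = ranked                            # restore original order
    return q

pvals = [0.001, 0.008, 0.039, 0.041, 0.30, 0.72]
q = bh_adjust(pvals)
discoveries = [p for p, qi in zip(pvals, q) if qi < 0.05]
```

Under FWER control (e.g., Bonferroni at 0.05/6) only the first p-value would survive; BH retains the first two, which is why the choice between FDR and FWER should be stated up front.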
Factual Prediction
Choose a modelling approach based on outcome type, sample size relative to number of candidate predictors, interpretability requirements, and deployment constraints (e.g., regression-based, machine learning, or ensemble approaches)
Choose and implement appropriate analytical methods for the competing event and intercurrent event strategies specified in the estimand (e.g., cause-specific or subdistribution models for competing events; composite endpoint or treatment policy strategies for intercurrent events)
For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2
Apply shrinkage or penalization in high-dimensional settings
Use bootstrap resampling or cross-validation for internal validation and optimism correction
Assess discrimination and calibration in internal and external validation samples
Evaluate and address miscalibration revealed by external validation
When the goal is to evaluate the incremental predictive value of additional variables, use nested model comparison methods (e.g., likelihood ratio tests, C-statistic improvement, net reclassification, decision curve analysis)
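The internal validation and optimism-correction points can be sketched as follows, using Harrell-style bootstrap optimism correction of the AUC. The synthetic data, logistic model, and resample count are illustrative assumptions, not recommendations:

```python
# Sketch of bootstrap optimism correction for a prediction model's
# discrimination (AUC), on synthetic data. Everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.8, 0.5] + [0.0] * (p - 3))   # mostly noise predictors
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta))))

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

optimisms = []
for _ in range(50):                       # use 200+ resamples in practice
    idx = rng.integers(0, n, n)           # bootstrap resample with replacement
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])  # apparent
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])            # tested on original
    optimisms.append(auc_boot - auc_orig)

corrected_auc = apparent_auc - float(np.mean(optimisms))
```

The same loop structure extends to calibration slope and overall prediction error; the apparent performance minus the average optimism is the estimate to report internally, before any external validation.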
Counterfactual Prediction
Choose and implement the modelling approach for generating counterfactual predictions, including the outcome model specification (e.g., regression-based, machine learning, or ensemble approaches) and the g-method used to generate predictions under the specified intervention (e.g., parametric g-formula, inverse probability weighting)
Choose and implement appropriate analytical methods for the competing event and intercurrent event strategies specified in the estimand
For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2
For internal validation, use counterfactual-adjusted performance measures for discrimination, calibration, and overall prediction error, as standard performance measures are biased when applied to counterfactual predictions (Boyer et al. 2025, Statistics in Medicine). Bootstrap resampling or cross-validation should incorporate counterfactual-adjusted measures at each stage
For external validation of time-to-event outcomes, use artificial censoring and inverse probability weighting approaches to assess counterfactual calibration, discrimination, and prediction error (Keogh & van Geloven 2024, Epidemiology)
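A minimal point-treatment sketch of counterfactual prediction via the parametric g-formula follows. The data are synthetic, with a single baseline confounder and no censoring or time-varying treatment; this is a deliberately simplified setting, not the full longitudinal machinery the items above call for:

```python
# Minimal point-treatment parametric g-formula sketch: fit an outcome model,
# then predict each patient's risk under "everyone treated" vs "no one
# treated". Data-generating process and model are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
L = rng.normal(size=n)                           # baseline confounder
A = rng.binomial(1, 1 / (1 + np.exp(-L)))        # treatment depends on L
logit_y = -1.0 + 0.8 * L - 1.2 * A               # treatment lowers risk
Y = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))

X = np.column_stack([A, L])
outcome_model = LogisticRegression(max_iter=1000).fit(X, Y)

# Counterfactual risk predictions: set A=1 and then A=0 for every patient,
# keeping each patient's own covariates fixed
risk_treated = outcome_model.predict_proba(np.column_stack([np.ones(n), L]))[:, 1]
risk_untreated = outcome_model.predict_proba(np.column_stack([np.zeros(n), L]))[:, 1]

# Standardized (marginal) counterfactual risks under each strategy
mean_r1, mean_r0 = risk_treated.mean(), risk_untreated.mean()
```

Note that validating `risk_treated` against observed outcomes in treated patients alone is exactly the biased comparison the counterfactual-adjusted performance measures cited above are designed to avoid.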
Causal Effect Estimation
Choose an identification strategy (e.g., backdoor adjustment, IV estimation, or other quasi-experimental approaches) and state the identification assumptions for the target estimand
Choose an appropriate estimation approach for the estimand and identification strategy (e.g., for backdoor approach with point treatment: outcome regression, propensity score methods, or doubly-robust estimators; for IV approach: two-stage regression or ratio of coefficients; for sustained or time-varying treatments, or causal mediation analyses: g-computation or inverse probability weighting of marginal structural models*)
Identify variables that need conditioning, weighting, and/or standardizing using causal reasoning (e.g., a directed acyclic graph)
Choose and implement appropriate analytical methods for the competing event and intercurrent event strategies specified in the estimand (e.g., cause-specific or subdistribution models for competing events; g-methods for hypothetical strategies)
For each source of error, bias, or validity threat identified in the O4 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2
Pre-specify falsification strategies, including negative control exposures and/or outcomes, to detect residual confounding or other biases
* Standard regression adjustment is inappropriate when time-varying confounders are affected by prior treatment, as conditioning on these variables simultaneously blocks causal pathways and introduces collider bias.
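To make the estimation choices concrete, here is a minimal inverse probability weighting sketch for a point treatment under backdoor adjustment, on synthetic confounded data. A real analysis would add weight diagnostics, truncation, and proper variance estimation; all numbers and models here are hypothetical:

```python
# Illustrative IPW estimate of a marginal risk difference under a backdoor
# identification strategy. Synthetic, confounded data; stabilized weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
L = rng.normal(size=n)                            # measured confounder
A = rng.binomial(1, 1 / (1 + np.exp(-1.5 * L)))   # confounded treatment
Y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + L + 0.7 * A))))  # true harmful effect

# Propensity score model and stabilized inverse-probability weights
Lmat = L.reshape(-1, 1)
ps = LogisticRegression(max_iter=1000).fit(Lmat, A).predict_proba(Lmat)[:, 1]
w = np.where(A == 1, A.mean() / ps, (1 - A.mean()) / (1 - ps))

# Weighted (pseudo-population) risks under each treatment level
risk1 = np.sum(w * Y * (A == 1)) / np.sum(w * (A == 1))
risk0 = np.sum(w * Y * (A == 0)) / np.sum(w * (A == 0))
risk_difference = risk1 - risk0

naive_difference = Y[A == 1].mean() - Y[A == 0].mean()  # confounded comparison
```

Because L raises both treatment probability and outcome risk, the naive contrast overstates the effect; weighting by the propensity score recovers a marginal contrast closer to the target estimand, conditional on the identification assumptions stated above holding.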
By the end of this step, you should have:
Selected an analytical approach that matches the research task
Explained why the method is appropriate for the estimand
Specified how competing events, censoring, missing data, and identified biases will be handled
Planned validation, replication, falsification, or sensitivity checks where appropriate
Documented the final analysis strategy