R5. Run Appropriate Analysis

The analysis should follow from the research task, estimand, data fitness assessment, and identified threats to validity. Different EHR research tasks require different analytical strategies.

This step helps researchers select and conduct analyses that are suitable for the target estimand and available data, while documenting key analytical decisions clearly.

Description

  • Choose between design-based and model-based approaches

  • When stratifying or standardizing, explain the purpose of the stratification or standardization (and, for standardization, select an appropriate target population)

  • For model-based approaches, select an appropriate model family for the data structure (e.g., multilevel models for hierarchically structured data, small area estimation for sparse geographies)

  • When the analysis requires follow-up over time, specify how competing events and censoring are handled and state the assumptions of the chosen approach 

  • For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2 

  • Compare estimates with external benchmarks where available
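To illustrate the standardization point above, the following sketch computes a directly standardized rate: stratum-specific rates from the study cohort are weighted by the stratum shares of a chosen target population, so that rates become comparable across groups with different structures. The function name, strata, and all numbers are hypothetical.

```python
# Direct standardization: weight stratum-specific rates by a chosen
# target ("standard") population so that rates are comparable across
# groups with different compositions. All numbers are hypothetical.

def directly_standardized_rate(stratum_rates, standard_population):
    """Weighted average of stratum-specific rates, with weights taken
    from the standard population's stratum shares."""
    total = sum(standard_population.values())
    return sum(
        rate * standard_population[stratum] / total
        for stratum, rate in stratum_rates.items()
    )

# Age-specific event rates (per person-year) observed in the cohort
rates = {"18-39": 0.01, "40-64": 0.03, "65+": 0.08}
# Stratum sizes of the chosen target population
standard = {"18-39": 5000, "40-64": 3000, "65+": 2000}

print(directly_standardized_rate(rates, standard))
```

The choice of standard population changes the standardized rate, which is why the target population should be stated explicitly alongside the result.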

Signal Discovery

  • Choose appropriate association model based on the outcome type and study design (e.g., linear or logistic regression for unrelated samples, linear mixed models for samples with population structure or relatedness, Cox models for time-to-event phenotypes) 

  • For time-to-event phenotypes, specify how competing events will be handled

  • For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2

  • Choose appropriate multiple testing correction for the inferential context (FDR vs. FWER)

  • Pre-specify falsification strategies to detect residual confounding or other systematic biases (e.g., genomic inflation diagnostics, negative control outcomes)

  • Seek replication in independent data where feasible; where replication is not possible, state the reasons and acknowledge the signals as preliminary

  • Use phenome-wide or exposure-wide visualization to contextualize signals
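The FDR-versus-FWER choice above can be made concrete with a minimal sketch of the two standard procedures: Bonferroni (FWER control) and Benjamini-Hochberg (FDR control). The p-values are hypothetical and chosen so the two procedures disagree.

```python
# Two multiple-testing corrections for different inferential goals:
# Bonferroni controls the family-wise error rate (FWER), while the
# Benjamini-Hochberg step-up procedure controls the false discovery
# rate (FDR). P-values below are hypothetical.

def bonferroni_rejections(pvals, alpha=0.05):
    """Reject any hypothesis with p < alpha / m (controls FWER)."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p < alpha / m]

def benjamini_hochberg_rejections(pvals, alpha=0.05):
    """Reject the k hypotheses with the smallest p-values, where k is
    the largest rank with p_(k) <= (k / m) * alpha (controls FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.012, 0.041, 0.20]
print(bonferroni_rejections(pvals))         # [0, 1]
print(benjamini_hochberg_rejections(pvals)) # [0, 1, 2]
```

FWER control is typically appropriate when any single false positive is costly (e.g., confirmatory analyses); FDR control is often preferred in discovery settings where a tolerable proportion of false leads is acceptable.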

Factual Prediction

  • Choose modelling approach based on outcome type, sample size relative to number of candidate predictors, interpretability requirements, and deployment constraints (e.g., regression-based, machine learning, or ensemble approaches)

  • Choose and implement appropriate analytical methods for the competing event and intercurrent event strategies specified in the estimand (e.g., cause-specific or subdistribution models for competing events; composite endpoint or treatment policy strategies for intercurrent events)

  • For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2

  • Apply shrinkage or penalization in high-dimensional settings

  • Use bootstrap resampling or cross-validation for internal validation and optimism correction

  • Assess discrimination and calibration in internal and external validation samples

  • Evaluate and address miscalibration revealed by external validation

  • When the goal is to evaluate the incremental predictive value of additional variables, use nested model comparison methods (e.g., likelihood ratio tests, C-statistic improvement, net reclassification, decision curve analysis)
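As a minimal illustration of the discrimination assessment mentioned above, the following sketch computes the concordance (c) statistic for a binary outcome: the proportion of event/non-event pairs in which the event case received the higher predicted risk. The data are hypothetical, and real analyses would use an established implementation.

```python
# Discrimination via the concordance (c) statistic for a binary
# outcome: among all (event, non-event) pairs, the share in which the
# event case has the higher predicted risk. Ties count as one half.
# Data are hypothetical.

def c_statistic(risks, outcomes):
    """All-pairs c-statistic for binary outcomes (0/1)."""
    pairs = concordant = 0.0
    for ri, yi in zip(risks, outcomes):
        for rj, yj in zip(risks, outcomes):
            if yi == 1 and yj == 0:
                pairs += 1
                if ri > rj:
                    concordant += 1
                elif ri == rj:
                    concordant += 0.5
    return concordant / pairs

risks = [0.9, 0.7, 0.4, 0.4, 0.1]
outcomes = [1, 1, 0, 1, 0]
print(c_statistic(risks, outcomes))
```

Discrimination alone is insufficient: a model can rank patients well yet be badly miscalibrated, which is why the checklist asks for calibration to be assessed alongside the c-statistic in both internal and external validation samples.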

Counterfactual Prediction

  • Choose and implement the modelling approach for generating counterfactual predictions, including the outcome model specification (e.g., regression-based, machine learning, or ensemble approaches) and the g-method used to generate predictions under the specified intervention (e.g., parametric g-formula, inverse probability weighting) 

  • Choose and implement appropriate analytical methods for the competing event and intercurrent event strategies specified in the estimand

  • For each source of error, bias, or validity threat identified in the O1 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2 

  • For internal validation, use counterfactual-adjusted performance measures for discrimination, calibration, and overall prediction error, as standard performance measures are biased when applied to counterfactual predictions (Boyer et al. 2025, Statistics in Medicine). Bootstrap resampling or cross-validation should incorporate counterfactual-adjusted measures at each stage 

  • For external validation of time-to-event outcomes, use artificial censoring and inverse probability weighting approaches to assess counterfactual calibration, discrimination, and prediction error (Keogh & van Geloven 2024, Epidemiology)
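The g-formula step mentioned above can be sketched for the simplest case: a single binary treatment A and one binary confounder L, where the counterfactual mean under "treat everyone" is obtained by averaging stratum-specific outcome means over the marginal distribution of L. Here the outcome model is a stratified mean for transparency; in practice it would be a fitted regression or machine learning model. Function name and data are hypothetical.

```python
# Counterfactual mean outcome under a fixed treatment assignment via
# the g-formula for a point treatment A and one binary confounder L:
#   E[Y^a] = sum over l of E[Y | A = a, L = l] * P(L = l).
# The outcome model here is a simple stratified mean; in practice it
# would be a fitted regression or ML model. Data are hypothetical.

def g_formula_mean(records, a):
    """records: list of (L, A, Y) tuples. Returns the standardized
    mean outcome had everyone received treatment A = a."""
    n = len(records)
    total = 0.0
    for l_value in {r[0] for r in records}:
        stratum = [r for r in records if r[0] == l_value]
        treated_y = [y for (l, trt, y) in stratum if trt == a]
        mean_y = sum(treated_y) / len(treated_y)  # E[Y | A=a, L=l]
        total += mean_y * len(stratum) / n        # weight by P(L=l)
    return total

data = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 1),
        (1, 0, 0), (1, 0, 0), (1, 1, 1), (1, 1, 0)]
print(g_formula_mean(data, a=1))  # 0.75
```

The counterfactual-adjusted validation measures cited above are needed precisely because predictions like this one refer to outcomes under an intervention, so naive comparison against observed outcomes mixes in the factual treatment pattern.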

Causal Effect Estimation

  • Choose identification strategy (e.g., backdoor adjustment, IV estimation, or other quasi-experimental approaches) and state the identification assumptions for the target estimand

  • Choose appropriate estimation approach for the estimand and identification strategy (e.g., for backdoor approach with point treatment: outcome regression, propensity score methods, or doubly-robust estimators; for IV approach: two-stage regression or ratio of coefficients; for sustained or time-varying treatments, or causal mediation analyses: g-computation or inverse probability weighting of marginal structural models*)

  • Identify variables that need conditioning, weighting, and/or standardizing using causal reasoning (e.g., a directed acyclic graph)

  • Choose and implement appropriate analytical methods for the competing event and intercurrent event strategies specified in the estimand (e.g., cause-specific or subdistribution models for competing events; g-methods for hypothetical strategies) 

  • For each source of error, bias, or validity threat identified in the O4 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2 

  • Pre-specify falsification strategies, including negative control exposures and/or outcomes, to detect residual confounding or other biases

* Standard regression adjustment is inappropriate when time-varying confounders are affected by prior treatment, as conditioning on these variables simultaneously blocks causal pathways and introduces collider bias.
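To make the estimation options above concrete, the following sketch shows inverse probability weighting for a point treatment with one binary confounder: each subject is weighted by the inverse probability of the treatment actually received given L, and weighted outcome means are then compared between arms. The propensity score here is the stratum treatment share; in practice it would come from a fitted model. Function name and data are hypothetical.

```python
# Inverse probability weighting (IPW) for a point treatment: weight
# each subject by 1 / P(A = observed treatment | L), then compare
# weighted outcome means between arms. The propensity score here is
# the stratum treatment share; in practice it would come from a
# fitted model. Data are hypothetical.

def ipw_means(records):
    """records: list of (L, A, Y) tuples. Returns (mean_Y_a0,
    mean_Y_a1), each reweighted to the full-sample distribution of L."""
    # Propensity P(A=1 | L=l) estimated as the stratum treatment share
    prop = {}
    for l_value in {r[0] for r in records}:
        stratum = [r for r in records if r[0] == l_value]
        prop[l_value] = sum(a for (_, a, _) in stratum) / len(stratum)
    sums = {0: 0.0, 1: 0.0}
    weights = {0: 0.0, 1: 0.0}
    for l, a, y in records:
        p = prop[l] if a == 1 else 1 - prop[l]
        w = 1 / p
        sums[a] += w * y
        weights[a] += w
    return sums[0] / weights[0], sums[1] / weights[1]

# Treatment assignment depends on L, so crude arm means are confounded
data = [(0, 0, 0), (0, 1, 1), (0, 1, 1), (0, 1, 1),
        (1, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
m0, m1 = ipw_means(data)
print(m0, m1)
```

The same weighting idea extends to marginal structural models for sustained or time-varying treatments, which is where the footnote's warning applies: plain regression adjustment for time-varying confounders affected by prior treatment is not a valid substitute.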

By the end of this step, you should have:

  • Selected an analytical approach that matches the research task

  • Explained why the method is appropriate for the estimand

  • Specified how competing events, censoring, missing data, and identified biases will be handled

  • Planned validation, replication, falsification, or sensitivity checks where appropriate

  • Documented the final analysis strategy

RIGOROUS