Counterfactual Prediction

Predict outcomes under specified hypothetical treatment strategies.

R1: Research task identification

Hypothetical Prognostication
Estimating the probability of an outcome occurring in the future - or the expected future value of an outcome - conditional on individual characteristics and under one or more specified hypothetical treatment conditions.

I2: Identify estimand(s)

Carefully describe the target quantity/quantities of interest and all relevant criteria using the appropriate estimand framework

Clinical decision, action, or policy to be informed by the prediction (e.g., which intervention to provide, optimal dose or treatment sequence, personalised treatment selection)
Target population (including the population definition, sampling frame, and eligibility criteria for the specified intervention)
Outcome definition (including the prediction horizon where appropriate, e.g. 5-year risk of heart attack)
Intended deployment context, including the proposed user (e.g. by a primary care doctor during a routine appointment at age 40 years)
Reference time point (i.e. landmark time) when the prediction will be made
Hypothetical treatment strategies (a precise definition of each treatment strategy under which the outcome is predicted, e.g. if all individuals initiated statins at the reference time point)
Handling of competing events (e.g. whether competing events such as death are handled via cause-specific or subdistribution approaches)
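
As a sketch of how the elements above combine into a single target quantity, a binary outcome could be written as follows. The notation is an assumption introduced here for illustration and does not come from the framework itself.

```latex
% Assumed notation (illustrative only):
%   t_0      = reference time point (landmark time)
%   \tau     = prediction horizon (e.g. 5 years)
%   X_{t_0}  = individual characteristics available at t_0
%   a        = one of the K hypothetical treatment strategies
%   Y^{a}    = outcome that would occur under strategy a
\[
  \pi^{a}(x) \;=\; \Pr\!\left( Y^{a}_{t_0+\tau} = 1 \,\middle|\, X_{t_0} = x \right),
  \qquad a \in \{a_1, \dots, a_K\}
\]
```

For the statin example, a_1 could denote "all individuals initiate statins at the reference time point", and the quantity would be the 5-year risk of heart attack under that strategy for an individual with characteristics x in the target population.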

G3: Gauge data fitness

Assess whether the available data meet the sample and variable requirements for estimating the target estimand(s)

Sample Requirements

The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.
Sample size should be sufficient to estimate counterfactual risks with desired precision, while minimising overfitting and optimism, and with adequate observations per confounder to avoid sparse data bias
Period of observation is sufficient to observe the outcome within the intended prediction horizon, with sufficient prior observation time to establish baseline exposure status

Variable requirements

The outcome, all hypothetical treatment strategies, and all confounders are available and accurately measured, with consistent definitions and coding practices across data sources, sites, and time periods
Variables are measured with sufficient timing and frequency to establish the correct causal ordering between the hypothetical treatment strategies and the outcome.
All hypothetical treatment strategies of interest are observed across all relevant confounder strata
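
One simple way to check the last requirement is to tabulate observed treatment strategies within confounder strata and flag empty or sparse cells. The sketch below is a minimal illustration; the records, stratum labels, and the `sparse_cells` helper are hypothetical.

```python
from collections import Counter

# Hypothetical records: (confounder stratum, observed treatment strategy)
records = [
    ("age<60", "statin"), ("age<60", "no_statin"), ("age<60", "statin"),
    ("age>=60", "statin"), ("age>=60", "statin"), ("age>=60", "statin"),
]

def sparse_cells(records, min_count=1):
    """Return (stratum, strategy) cells with fewer than min_count observations."""
    counts = Counter(records)  # missing cells count as 0
    strata = sorted({s for s, _ in records})
    strategies = sorted({t for _, t in records})
    return [(s, t) for s in strata for t in strategies
            if counts[(s, t)] < min_count]

missing = sparse_cells(records)
print(missing)  # [('age>=60', 'no_statin')]
```

In practice `min_count` would be set well above 1, and continuous confounders would need to be binned or assessed through a propensity model instead.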

O4: Outline and consider key sources of error, bias & threats to validity

Consider all potential sources of error, bias and threats to validity and outline mitigation strategies. Table 2 contains prompt questions to help identify major sources of bias and select mitigation strategies.

Selection

Type 2 selection bias (generalizability bias)
Arises when the sample is neither a census nor a random sample of the target population, the distribution of certain effect modifiers differs between the sample and the target population, and the analytic sample cannot be reweighted to represent the target population (e.g., due to absence of survey weights or insufficient auxiliary data to construct them). In this situation, counterfactual predictions will not generalise to the intended deployment population.

Type 1 selection bias (collider restriction bias)
When both the hypothetical intervention and outcome of interest are related to presence in the data, whether directly or through shared or intermediate causes, the predicted outcomes under the intervention will be biased. Common instances include Berkson's bias*, index event bias**, survivorship bias***, and M-bias****.

Measurement

Outcome measurement error
The outcome is measured with systematic error, so the model's predictions of the outcome under the specified intervention conditions are systematically incorrect.

Intervention measurement error
The intervention variables are measured with error, meaning the counterfactual predictions reflect a different intervention than the one specified and may be biased in any direction.

Dependent measurement errors
There is correlated measurement error in both the intervention variables and outcome, e.g. because both are measured by the same clinician during the same clinical encounter. Dependent errors can bias counterfactual predictions in any direction.

Missing Data

Data are missing for the outcome, intervention variables, or other adjustment variables, and the probability of missingness is related to the outcome and/or intervention (e.g., due to informative observation processes). When the analysis requires follow-up over time, loss to follow-up or informative censoring occurs when individuals leave the observation window for reasons related to the intervention or outcome (e.g., transferring care, death captured in a different system). Informative missingness can bias counterfactual predictions in any direction.

Data Source Heterogeneity

Data are pooled from multiple healthcare systems or across calendar periods with different measurement practices and/or case-mix. The estimated causal effects underpinning the counterfactual predictions may reflect an average across heterogeneous settings, biasing predictions for any specific deployment context.

Data sparsity

Insufficient sample size, or strong determination of one or more interventions of interest by covariates*, leads to few observations for certain intervention-covariate combinations, producing unstable counterfactual predictions.

Footnote: *Poor covariate overlap may arise because the sample is too small to adequately represent all covariate patterns or because certain covariate patterns strongly predict exposure/intervention status. This second issue cannot be resolved by simply collecting more data.

Confounding

Unobserved baseline confounding
One or more baseline common causes of the intervention and outcome are not captured in the data, leading to biased predictions. Important instances in EHR data include confounding by indication (where the clinical reason for prescribing an intervention itself influences the outcome) and protopathic bias (where early undiagnosed symptoms of the outcome influence intervention status).

Residual baseline confounding
One or more available baseline confounders are poorly measured, or have been coarsened (e.g., dichotomised), meaning conditioning does not fully remove confounding (e.g., dichotomised ‘obesity’ does not capture confounding by BMI). Predictions will remain biased, even after conditioning, though typically less so.

Unobserved time-varying confounding
For predictions involving multiple interventions, or sustained or dynamic intervention regimes, one or more time-varying common causes of subsequent intervention decisions and outcome are not captured in the data, leading to biased predictions.

Residual time-varying confounding
For predictions involving multiple interventions, or sustained or dynamic intervention regimes, one or more available time-varying confounders are poorly measured or have been coarsened, meaning predictions will remain biased even after appropriate handling, though typically less so.

Time Zero Alignment

Lead-time bias
When the hypothetical intervention influences the timing of the index event (e.g., a screening intervention leads to earlier diagnosis), counterfactual predictions of time-to-event from the index event may overestimate the benefit of the intervention.

Immortal time bias
Intervention definitions that require a period of time to be satisfied (e.g., "at least 7 days of treatment") guarantee that individuals assigned to that intervention have survived event-free for that period, biasing the counterfactual predictions in favour of the intervention.

Prevalent user bias
Basing predictions on prevalent users of the intervention means that adverse outcomes occurring before the observation window are uncaptured. The counterfactual predictions reflect a selected population of survivors and tolerators rather than all individuals who would initiate the intervention.
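
Immortal time bias can be illustrated with a small simulation. The sketch below assumes a treatment with no effect at all, exponential event times, and a 7-day qualifying period; all values are illustrative and chosen only to make the mechanism visible.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
horizon, grace = 30.0, 7.0  # days; illustrative values

# Simulated truth: treatment has NO effect on the event time.
event_time = rng.exponential(scale=20.0, size=n)
assigned = rng.random(n) < 0.5          # strategy assigned at time zero
event = event_time <= horizon           # event within the prediction horizon

# Flawed definition: "treated" requires completing at least 7 days of
# treatment, so anyone with an event before day 7 is counted as untreated.
completed = assigned & (event_time > grace)
naive_treated = event[completed].mean()
naive_untreated = event[~completed].mean()

# Aligned definition: classify by the strategy assigned at time zero.
aligned_treated = event[assigned].mean()
aligned_untreated = event[~assigned].mean()

print(f"naive:   {naive_treated:.3f} vs {naive_untreated:.3f}")
print(f"aligned: {aligned_treated:.3f} vs {aligned_untreated:.3f}")
```

Under the flawed definition the treated group appears to have a lower 30-day risk even though the simulated treatment does nothing, whereas classifying by the strategy assigned at time zero gives near-identical risks.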

_______________________

* Berkson’s bias = a type of selection bias that occurs when both primary variables of interest (e.g. an exposure and outcome) directly influence entry into the sample.

** Index event bias = a type of selection bias that occurs when a primary variable of interest (e.g. the outcome) is only possible among people who have experienced a qualifying event that is directly influenced by another variable of interest (e.g. the exposure), and the primary variable is also related to the qualifying event through shared causes.

*** Survivorship bias = a type of selection bias that occurs when a primary variable of interest (e.g. the exposure) directly influences survival to study entry, and another variable of interest (e.g. the outcome) is also related to survival through shared causes.

**** M-bias = a type of selection bias that occurs when there are unmeasured causes of two primary variables of interest (e.g. the exposure and outcome) that also cause study entry. In EHR data, this often arises through informed presence bias, where presence in the dataset is influenced by factors (e.g. healthcare utilisation, socioeconomic position) that are also linked to the primary variables of interest.

R5: Run appropriate analysis

Select and conduct analyses that are suitable for estimating your target estimands in the available data

Choose and implement the modelling approach for generating counterfactual predictions, including the outcome model specification (e.g., regression-based, machine learning, or ensemble approaches) and the g-method used to generate predictions under the specified intervention (e.g., parametric g-formula, inverse probability weighting)*
Choose and implement appropriate analytical methods for the competing event and intercurrent event strategies specified in the estimand
For each source of error, bias, or validity threat identified in the O4 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2
For internal validation, use counterfactual-adjusted performance measures for discrimination, calibration, and overall prediction error, as standard performance measures are biased when applied to counterfactual predictions (Boyer et al. 2025, Statistics in Medicine). Bootstrap resampling or cross-validation should incorporate counterfactual-adjusted measures at each stage
For external validation of time-to-event outcomes, use artificial censoring and inverse probability weighting approaches to assess counterfactual calibration, discrimination, and prediction error (Keogh & van Geloven 2024, Epidemiology)

*Standard regression adjustment is inappropriate when time-varying confounders are affected by prior treatment, as conditioning on these variables simultaneously blocks causal pathways and introduces collider bias.
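
For a single point intervention with one measured baseline confounder, the parametric g-formula reduces to fitting an outcome model and predicting for each individual with the treatment set to the strategy of interest. The sketch below is a minimal illustration on simulated data under that simplifying assumption. The data-generating values, the `fit_logistic` helper, and the variable names are hypothetical, and it does not cover time-varying confounding, which needs the sequential g-formula or weighting.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def design(a, l):
    """Design matrix: intercept, treatment, confounder."""
    return np.column_stack([np.ones_like(l), a, l])

def true_risk(a_, l_):
    """Simulated truth, used only to generate and check the example."""
    return 1.0 / (1.0 + np.exp(-(-1.0 - 0.7 * a_ + 1.0 * l_)))

rng = np.random.default_rng(0)
n = 50_000
l = rng.normal(size=n)                               # baseline confounder
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-l)))        # treatment depends on l
y = rng.binomial(1, true_risk(a, l))                 # outcome depends on a and l

beta = fit_logistic(design(a, l), y)

def predict_risk(a_value, l_values):
    """Predicted risk if everyone followed strategy a_value."""
    X = design(np.full_like(l_values, a_value, dtype=float), l_values)
    return 1.0 / (1.0 + np.exp(-X @ beta))

risk_treated = predict_risk(1.0, l)      # everyone treated
risk_untreated = predict_risk(0.0, l)    # no one treated
print(risk_treated.mean(), risk_untreated.mean())
```

The individual-level predictions are only as good as the outcome model specification and the assumption that `l` captures all confounding.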

O6: Outline and assess assumptions

Clearly outline the assumptions behind your results and conduct appropriate sensitivity analyses

Exchangeability
Identify and highlight all possible ways that exchangeability may be violated. For single point interventions, covariate balance diagnostics should be presented where possible. For sustained or dynamic intervention regimes, assess sequential exchangeability at each decision point where feasible. Quantitative bias analyses should be performed where specific unmeasured confounders are identified but unavailable in the data.
Positivity
Examine covariate overlap between intervention groups. For single point interventions, assess the distribution of propensity scores or analytic weights for extreme values or high variability. For sustained or dynamic intervention regimes, or when multiple interventions are compared, assess weight distributions for evidence of near-positivity violations at each decision point where feasible. Structural violations may indicate that certain counterfactual predictions are not reliably estimable from the available data.
Consistency
Discuss whether there are multiple versions of the intervention that could differ in their effect on the outcome (e.g., different formulations, doses, or routes of administration captured under a single intervention definition), and how this might affect the counterfactual predictions.
No interference
State whether the assumption that one individual's intervention does not affect another individual's outcome is plausible in the study context. Where interference is likely, discuss the implications for the counterfactual predictions.
Deployment validity
Assess whether the causal effects underpinning the counterfactual predictions are likely to hold in the intended deployment population. Consider whether effect modifiers are distributed differently between the development and deployment populations, and whether selection mechanisms or measurement processes specific to the development setting may limit validity in the deployment context.
Temporal stability
Assess whether the intervention-outcome relationships underpinning the counterfactual predictions are likely to remain stable over time, considering changes in clinical practice, coding practices, and evolving patient populations.
Competing events and censoring assumptions
For time-to-event outcomes, state the assumptions underlying the chosen approach to competing events and censoring (e.g., non-informative censoring, independence of competing events), and discuss whether these assumptions are plausible given the causal structure.
Missing data assumptions
State the assumed missing data mechanism (MCAR, MAR, or MNAR) and justify the chosen analytical approach. Assess sensitivity of predictions to missing data assumptions and analytical approach if missing data are substantial.
Model assumptions
Report and check all model-specific assumptions (e.g., correct specification of the outcome model and treatment model in g-formula or IPW implementations).
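
The positivity checks described above can be sketched for a single point intervention as follows. The example assumes one discrete confounder and estimates the propensity score as the treated proportion within each stratum. The simulated data, the 0.05 flagging threshold, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
stratum = rng.integers(0, 4, size=n)                 # discrete confounder, 4 levels
true_ps = np.array([0.5, 0.3, 0.1, 0.02])[stratum]   # last stratum is rarely treated
treated = rng.random(n) < true_ps

# Estimate the propensity score as the treated proportion in each stratum.
ps_by_stratum = np.array([treated[stratum == s].mean() for s in range(4)])
ps = ps_by_stratum[stratum]

# Unstabilised inverse probability weights for the treatment actually received.
weights = np.where(treated, 1.0 / ps, 1.0 / (1.0 - ps))

# Kish effective sample size among the treated: (sum w)^2 / sum(w^2).
w_t = weights[treated]
ess_treated = w_t.sum() ** 2 / (w_t ** 2).sum()

# Flag strata where either treatment level is rare (threshold is arbitrary).
flagged = [s for s in range(4)
           if min(ps_by_stratum[s], 1.0 - ps_by_stratum[s]) < 0.05]

print(f"max weight: {weights.max():.1f}")
print(f"treated n: {treated.sum()}, effective n: {ess_treated:.0f}")
print(f"strata with near-positivity violations: {flagged}")
```

A large gap between the number treated and the effective sample size, or a handful of very large weights, suggests that predictions under treatment rely heavily on a few individuals in the flagged strata.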

U7: Use appropriate language

Describe and interpret aims, methods, and results in terms of predicted outcomes under hypothetical treatment strategies and counterfactual model performance, avoiding associational, risk factor, or definitive causal language.

Examples:

Aim: 'We aimed to predict 5-year risk of Y under hypothetical treatment strategies X and X* in population Z'
Methods: 'Counterfactual model discrimination and calibration were assessed using the C-statistic and calibration plots under each treatment strategy'
Results: 'The predicted 5-year risk of Y under strategy X* was lower than under strategy X (A% vs B%); the model showed good discrimination and calibration under both strategies'
Discussion: 'These findings suggest our model can accurately predict risk of Y under hypothetical treatment strategies X and X*'

Examples to avoid:

Throughout:
- 'X was associated with Y' (associational language obscures the counterfactual estimand)
- 'X was a risk factor for Y' (risk factor language is unclear and should be avoided)
- 'X caused Y' (definitive causal language presents model-based estimates as established facts rather than quantitative estimates based on assumptions)

S8: Satisfy reporting and transparency standards

Follow current best practice and relevant reporting guidelines for reporting study details and results.

Pre-register a study protocol and statistical analysis plan (e.g. on OSF) before data access, clearly stating the counterfactual prediction estimands of interest.
Make analytical code available (e.g., as a supplement to the publication, alongside the protocol, or in a public repository), having reviewed it for disclosive content
Provide a data availability statement describing the process for obtaining access to the source data. Report summary-level information including sample flow diagrams and baseline sample characteristics
Follow RECORD reporting guidelines, plus relevant items from TRIPOD+AI as appropriate.