Description
Estimate the occurrence or distribution of a health state, event, exposure, or practice in a defined population.
R1: Research task identification
Descriptions in a population
Estimating the occurrence or distribution of a health state, event, exposure, or practice in a defined population
Descriptions in subgroups
Estimating and comparing the occurrence or distribution of a health state, event, exposure, or practice in multiple populations or predefined subgroups.
Descriptions over time
Estimating the occurrence or distribution of a health state, event, exposure, or practice over time.
Descriptions in subgroups and over time
Estimating and comparing the occurrence or distribution of a health state, event, exposure, or practice over time and in multiple populations or predefined subgroups.
I2: Identify estimand(s)
Carefully describe the target quantity/quantities of interest and all relevant criteria using the appropriate estimand framework
Purpose or intended use of the description (e.g., disease burden estimation, resource allocation, health service planning, public health surveillance)
Target population (including the population definition and sampling frame)
Health state, event, exposure, or practice of interest
Summary measure and its time frame (e.g. point prevalence on a specified date, cumulative incidence over a specified period, rate per population per year, proportion or coverage at a specified time point)
Auxiliary variables and their role (e.g. age-sex standardisation, or stratification by country)
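The summary measures named above differ mainly in their denominators and time frames. The following is a minimal sketch of three of them, assuming simple aggregate counts are available; the function names and the counts are hypothetical and chosen only for illustration.

```python
def point_prevalence(cases_on_date: int, population_on_date: int) -> float:
    """Proportion of the defined population with the health state on one date."""
    if population_on_date <= 0:
        raise ValueError("population_on_date must be positive")
    return cases_on_date / population_on_date


def cumulative_incidence(new_cases: int, at_risk_at_start: int) -> float:
    """Proportion of an initially at-risk closed cohort with a first event
    during the period. Assumes complete follow-up and no competing events."""
    if at_risk_at_start <= 0:
        raise ValueError("at_risk_at_start must be positive")
    return new_cases / at_risk_at_start


def incidence_rate(new_cases: int, person_years: float, per: float = 1000.0) -> float:
    """Events per `per` person-years at risk."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return per * new_cases / person_years


# Hypothetical counts, for illustration only.
prevalence = point_prevalence(cases_on_date=150, population_on_date=10_000)
risk = cumulative_incidence(new_cases=30, at_risk_at_start=2_000)
rate = incidence_rate(new_cases=30, person_years=6_000.0)
```

Stating which of these is the target, and over what time frame, determines which denominator the data must supply.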
G3: Gauge data fitness
Assess whether the available data are fit for estimating the target estimand(s)
Sample Requirements
The sample should be drawn from the target population or contain sufficient information about the selection mechanism to enable standardisation or reweighting to that population.
The sample size should be sufficient to estimate the estimand with the desired level of precision
The period of observation is sufficient for the target estimand (e.g. calendar time coverage sufficient for trend analysis, or follow-up long enough for cumulative incidence estimation)
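For a proportion, the precision requirement above can be roughed out before data access. This is a sketch only, assuming a normal-approximation (Wald) confidence interval and simple random sampling; the `design_effect` argument is a hypothetical multiplier for clustered or weighted designs, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def required_sample_size(expected_proportion: float,
                         half_width: float,
                         confidence: float = 0.95,
                         design_effect: float = 1.0) -> int:
    """Approximate n so that the confidence interval for a proportion has
    the requested half-width: n = deff * z^2 * p * (1 - p) / d^2."""
    if not 0.0 < expected_proportion < 1.0:
        raise ValueError("expected_proportion must be strictly between 0 and 1")
    if half_width <= 0.0:
        raise ValueError("half_width must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n_srs = z ** 2 * expected_proportion * (1.0 - expected_proportion) / half_width ** 2
    return math.ceil(design_effect * n_srs)


# Hypothetical planning values: prevalence near 50%, +/- 5 percentage points.
n_simple = required_sample_size(0.5, 0.05)
n_clustered = required_sample_size(0.5, 0.05, design_effect=2.0)
```

The approximation is poor for proportions close to 0 or 1 and for small samples; stratum-specific estimands need the calculation repeated within each stratum.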
Variable requirements
Key health state, event, exposure, or practice of interest and all key auxiliary variables are available and accurately measured
Variable definitions, coding practices, and recording completeness are consistent across data sources, sites, and time periods (e.g. diagnostic codes are harmonised across datasets, and any secular changes in coding or recording practices are unlikely to produce artefactual trends)
O4: Outline and consider key sources of error, bias & threats to validity
Consider all potential sources of error, bias and threats to validity and outline mitigation strategies. Table 2 contains prompt questions to help identify major sources of bias and select mitigation strategies.
Selection
Non-representative sampling and/or participation
The study sample does not represent the target population, either because the contributing healthcare systems are a non-random subsample of the population of interest, or because of informative presence. When the probability of presence in the data is related to the health state, event, exposure, or practice of interest, occurrence estimates will be biased.
Collider restriction fallacy
When both the primary variable of interest and a stratifying variable are related to presence in the data, whether directly or through shared or intermediate causes, the apparent pattern of occurrence across strata may be misleading. Common instances include Berkson's bias*, index event bias**, survivorship bias***, and M-bias****.
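The collider restriction problem described above can be illustrated with a small simulation. This is a sketch under invented assumptions: a primary variable `x` and a stratifying variable `z` are independent in the population, each with prevalence 20%, and each raises the probability of presence in the data. All probabilities are hypothetical.

```python
import random


def simulate_selected_prevalence(n: int = 200_000, seed: int = 1) -> dict:
    """Prevalence of x within strata of z, in the population and in the data."""
    rng = random.Random(seed)
    # counts[z] = [people, people with x]
    population = {0: [0, 0], 1: [0, 0]}
    sample = {0: [0, 0], 1: [0, 0]}
    for _ in range(n):
        x = int(rng.random() < 0.20)  # primary variable, independent of z
        z = int(rng.random() < 0.20)  # stratifying variable
        # Presence in the data depends on both x and z (assumed values).
        p_present = 1.0 - (1.0 - 0.05) * (1.0 - 0.40 * x) * (1.0 - 0.40 * z)
        population[z][0] += 1
        population[z][1] += x
        if rng.random() < p_present:
            sample[z][0] += 1
            sample[z][1] += x
    return {
        "population": {k: c[1] / c[0] for k, c in population.items()},
        "sample": {k: c[1] / c[0] for k, c in sample.items()},
    }


result = simulate_selected_prevalence()
```

In the simulated population the prevalence of `x` is the same in both strata of `z`, whereas among those present in the data it is much lower where `z = 1`, a pattern produced entirely by the selection mechanism.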
Measurement
Systematic measurement error in the primary variable
The health state/event of interest is subject to systematic measurement error, leading to misleading estimates of occurrence.
Differential measurement error across an auxiliary variable
The health state/event of interest is subject to systematic measurement error that varies across levels of one or more auxiliary variables (including data source, time, or calendar period), leading to misleading estimates within strata and misleading comparisons between strata.
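One common correction for misclassification of a binary health state is the Rogan-Gladen estimator, shown here as a sketch rather than a recommended default. It assumes sensitivity and specificity are known (for example from a validation study); the values below are hypothetical, with coding sensitivity assumed to improve between two calendar periods while true prevalence stays at 10%.

```python
def apparent_prevalence(true_prevalence: float, sensitivity: float, specificity: float) -> float:
    """Prevalence that would be observed given imperfect classification."""
    return true_prevalence * sensitivity + (1.0 - true_prevalence) * (1.0 - specificity)


def rogan_gladen(observed_prevalence: float, sensitivity: float, specificity: float) -> float:
    """Corrected prevalence: (observed + specificity - 1) / (sensitivity + specificity - 1)."""
    youden = sensitivity + specificity - 1.0
    if youden <= 0.0:
        raise ValueError("sensitivity + specificity must exceed 1")
    corrected = (observed_prevalence + specificity - 1.0) / youden
    # Sampling error can push the estimate outside [0, 1]; truncate it.
    return min(1.0, max(0.0, corrected))


# Hypothetical periods: identical true prevalence, different coding sensitivity.
early_observed = apparent_prevalence(0.10, sensitivity=0.60, specificity=0.99)
late_observed = apparent_prevalence(0.10, sensitivity=0.90, specificity=0.99)
early_corrected = rogan_gladen(early_observed, 0.60, 0.99)
late_corrected = rogan_gladen(late_observed, 0.90, 0.99)
```

The uncorrected series suggests rising occurrence, while the corrected values coincide, which is the artefactual-trend scenario that differential measurement error can create.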
Missing Data
Data are missing for the health state/event of interest or for auxiliary/stratifying variables, and the probability of missingness is related to the variable itself or to other variables of interest (e.g., due to informative observation processes). When the analysis requires follow-up over time (e.g., describing incidence or survival), loss to follow-up or informative censoring occurs when individuals leave the observation window for reasons related to the health state or event of interest (e.g., transferring care, death captured in a different system). Informative missingness can bias occurrence estimates and distort comparisons across strata.
Data Source Heterogeneity
Data are pooled from multiple healthcare systems or across calendar periods with different measurement practices and/or case-mix. This introduces uncertainty in pooled estimates, and can bias estimates within strata or distort comparisons across strata when these differences are related to one or more auxiliary variables of interest (e.g., temporal trends in occurrence may be artefacts of changing coding practices rather than genuine secular changes)
Data Sparsity
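Insufficient observations within strata lead to unstable or biased estimates of stratum-specific occurrence (e.g., in standardisation, multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA), or multilevel regression and post-stratification (MRP))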
Confounding
N/A
Time zero alignment
Lead-time Bias
When comparing time-to-event between groups or over time, differences may be misleading if the timing of the index event (e.g., diagnosis) itself varies between groups or over time. For example, if diagnosis is made progressively earlier due to screening, apparent survival time may appear to improve over time even if the true course of disease is unchanged.
Immortal Time Fallacy
When describing outcomes within strata of a time-dependent variable (e.g., 30-day mortality by duration of treatment), differences between strata may be misleading because strata requiring longer durations are only observable for individuals who survived to that point.
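The lead-time example above can be made concrete with a toy simulation. It is a sketch under invented assumptions: every person has a fixed interval from disease onset to death, drawn uniformly between 4 and 10 years, and diagnosis occurs 3 years after onset in a symptomatic era and 1 year after onset in a screening era. The time of death is the same in both eras.

```python
import random


def five_year_survival(onset_to_death_years: list, onset_to_diagnosis_years: float) -> float:
    """Proportion alive five years after diagnosis, with time zero at diagnosis."""
    survival_from_diagnosis = [
        d - onset_to_diagnosis_years for d in onset_to_death_years
    ]
    return sum(s >= 5.0 for s in survival_from_diagnosis) / len(survival_from_diagnosis)


rng = random.Random(0)
# Hypothetical disease course, identical in both eras.
onset_to_death = [rng.uniform(4.0, 10.0) for _ in range(50_000)]

symptomatic_era = five_year_survival(onset_to_death, onset_to_diagnosis_years=3.0)
screening_era = five_year_survival(onset_to_death, onset_to_diagnosis_years=1.0)
```

Five-year survival from diagnosis is roughly twice as high in the screening era even though no one lives longer, because time zero has moved earlier.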
* Berkson’s bias = A type of selection bias that occurs when two primary variables of interest (e.g. an exposure and outcome) both directly influence entry into the sample
** Index event bias = A type of selection bias that occurs when a primary variable of interest (e.g. the outcome) is only possible among people who have experienced a qualifying event that is directly influenced by another variable of interest (e.g. the exposure), and the primary variable is also related to the qualifying event through shared causes.
*** Survivorship bias = A type of selection bias that occurs when a primary variable of interest (e.g. the exposure) directly influences survival to study entry, and another variable of interest (e.g. the outcome) is also related to survival through shared causes.
**** M-bias = A type of selection bias that occurs when there are unmeasured causes of two primary variables of interest (e.g. the exposure and outcome) that also cause study entry. In EHR data, this often arises through informative presence, where presence in the dataset is influenced by factors (e.g. healthcare utilisation, socioeconomic position) that are also linked to the primary variables of interest.
R5: Run appropriate analysis
Select and conduct analyses that are suitable for estimating your target estimands in the available data
Choose between design-based and model-based approaches
When stratifying or standardising, explain the purpose of the stratification or standardisation (and select an appropriate target population for standardisation).
For model-based approaches, select an appropriate model family for the data structure (e.g., multilevel models for hierarchically structured data, small area estimation for sparse geographies)
When the analysis requires follow-up over time, specify how competing events and censoring are handled and state the assumptions of the chosen approach
For each source of error, bias, or validity threat identified in the O4 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2
Compare estimates with external benchmarks where available
*Standard regression adjustment is inappropriate when time-varying confounders are affected by prior treatment, as conditioning on these variables simultaneously blocks causal pathways and introduces collider bias.
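The standardisation step mentioned in R5 can be sketched with direct standardisation, one design-based option among several. The example assumes event counts and person-years are available by age stratum; the two populations, their counts, and the standard weights are all hypothetical.

```python
def crude_rate(events: dict, person_years: dict) -> float:
    """Total events divided by total person-years, ignoring age structure."""
    return sum(events.values()) / sum(person_years.values())


def directly_standardised_rate(events: dict, person_years: dict, standard_weights: dict) -> float:
    """Weighted average of stratum-specific rates using a standard population.
    Every stratum in the standard needs observed person-years (an empty
    stratum raises ZeroDivisionError, flagging a sparsity problem)."""
    total_weight = sum(standard_weights.values())
    weighted = sum(
        standard_weights[s] * events[s] / person_years[s] for s in standard_weights
    )
    return weighted / total_weight


# Hypothetical populations with identical age-specific rates (1% and 5%)
# but different age structures.
events_a = {"young": 90, "old": 50}
person_years_a = {"young": 9_000, "old": 1_000}
events_b = {"young": 50, "old": 250}
person_years_b = {"young": 5_000, "old": 5_000}
standard = {"young": 0.7, "old": 0.3}  # assumed reference age distribution

crude_a = crude_rate(events_a, person_years_a)
crude_b = crude_rate(events_b, person_years_b)
dsr_a = directly_standardised_rate(events_a, person_years_a, standard)
dsr_b = directly_standardised_rate(events_b, person_years_b, standard)
```

The crude rates differ only because of age structure, and the standardised rates agree. Which figure is reported depends on the stated purpose: crude rates describe actual burden, standardised rates support comparison.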
O6: Outline and assess assumptions
Clearly outline the assumptions behind your results and conduct appropriate sensitivity analyses
Representativeness
Assess whether the sample is representative of the target population for the health state or event of interest. Discuss potential sources of discrepancy between sample and target population. Consider whether standardisation or weighting to a reference population is appropriate.
Positivity
Confirm that all strata of interest are sufficiently observed in the data. Discuss implications of sparse or empty strata for model-based descriptive estimates (e.g. standardisation, MAIHDA, small area estimation).
Competing events assumptions
For time-to-event outcomes, state the assumptions underlying the chosen approach to competing events and censoring (e.g., non-informative censoring, independence of competing events).
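The choice of approach to competing events changes the estimate itself. As a sketch, the code below computes a nonparametric cumulative incidence (the Aalen-Johansen estimator for a single cause) alongside the naive one-minus-Kaplan-Meier estimate that treats competing events as censoring. The status coding (0 = censored, 1 = event of interest, 2 = competing event) and the ten follow-up records are hypothetical.

```python
def cumulative_incidence_curves(times: list, statuses: list, cause: int = 1) -> list:
    """Return (time, aalen_johansen_cif, one_minus_km) at each event time."""
    data = sorted(zip(times, statuses))
    n_at_risk = len(data)
    overall_survival = 1.0  # free of any event just before the current time
    km_naive = 1.0          # Kaplan-Meier treating competing events as censored
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_cause = d_other = 0
        j = i
        while j < len(data) and data[j][0] == t:
            status = data[j][1]
            if status == cause:
                d_cause += 1
            elif status != 0:
                d_other += 1
            j += 1
        if d_cause or d_other:
            cif += overall_survival * d_cause / n_at_risk
            overall_survival *= 1.0 - (d_cause + d_other) / n_at_risk
            km_naive *= 1.0 - d_cause / n_at_risk
            curve.append((t, cif, 1.0 - km_naive))
        n_at_risk -= j - i  # everyone observed at time t leaves the risk set
        i = j
    return curve


# Hypothetical follow-up data for ten people.
follow_up_years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
status = [1, 2, 1, 0, 2, 1, 2, 0, 1, 0]
curve = cumulative_incidence_curves(follow_up_years, status)
final_time, final_cif, final_naive = curve[-1]
```

The naive estimate is never lower than the Aalen-Johansen estimate; it answers a different question (occurrence in a hypothetical setting without the competing event), so the estimand should determine which is reported.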
Missing data assumptions
State the assumed missing data mechanism (MCAR, MAR, or MNAR) and justify the chosen analytical approach. Assess sensitivity of estimates to missing data assumptions and analytical approach if missing data are substantial.
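A simple sensitivity analysis for a prevalence estimate with missing outcome data is to report the complete-case value alongside worst-case bounds and one or more assumed scenarios. The sketch below does this; the function name, the counts, and the assumed prevalence among those with missing data are hypothetical.

```python
from typing import Optional


def prevalence_under_missingness(cases_observed: int,
                                 n_observed: int,
                                 n_missing: int,
                                 prevalence_among_missing: Optional[float] = None) -> dict:
    """Complete-case prevalence, worst-case bounds, and an optional
    scenario with an assumed prevalence among people with missing data."""
    n_total = n_observed + n_missing
    result = {
        # Valid if data are missing completely at random (MCAR).
        "complete_case": cases_observed / n_observed,
        # Bounds: nobody, or everybody, with missing data has the health state.
        "lower_bound": cases_observed / n_total,
        "upper_bound": (cases_observed + n_missing) / n_total,
    }
    if prevalence_among_missing is not None:
        # Pattern-mixture style scenario under a stated MNAR assumption.
        result["scenario"] = (
            cases_observed + prevalence_among_missing * n_missing
        ) / n_total
    return result


# Hypothetical study: 800 people with recorded status, 200 without.
estimates = prevalence_under_missingness(
    cases_observed=120, n_observed=800, n_missing=200, prevalence_among_missing=0.30
)
```

Wide bounds indicate that conclusions rest heavily on the assumed mechanism; methods such as multiple imputation or inverse probability weighting narrow them only by adding assumptions that should be stated.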
Model assumptions
Report and check all model-specific assumptions where model-based approaches are used (e.g. linearity, distributional assumptions in multilevel models).
U7: Use appropriate language
Describe and interpret aims, methods, and results in terms of target occurrence measures and how these vary between pre-specified strata, avoiding associational, predictive, risk factor, or causal language.
Examples:
Aim: 'We aimed to estimate the prevalence of X in population Y and how this varied by Z'
Methods: 'We estimated the age-standardised prevalence of X and calculated prevalence ratios across strata of Z'
Results: 'The prevalence of X was higher in group Z than group Z*, with five-year survival rates of…'
Discussion: 'These findings suggest that the prevalence of X is higher among Z than Z*, increasing with age from… to…'
Examples to Avoid:
Aim: 'We aimed to estimate the association between X and Y' (associational language obscures the descriptive aim and target occurrence measure)
Methods: 'We modelled the association between X and Y adjusting for Z' (associational language obscures the descriptive aim and target occurrence measure)
Results: 'X was associated with Y' (associational language obscures the descriptive aim and target occurrence measure)
Discussion: 'These findings suggest X is associated with Y' (associational language obscures the descriptive aim and target occurrence measure)
Throughout:
'X predicted Y' (predictive language implies a model-based prediction task)
'X caused/influenced/modified Y' (causal language implies a causal effect estimation task)
'X was a risk factor for Y' (risk factor language is unclear and should be avoided)
S8: Satisfy reporting and transparency standards
Follow current best practice and relevant reporting guidelines for reporting study details and results.
Consider pre-registering a study protocol and statistical analysis plan (e.g. on OSF) before data access, clearly stating the descriptive estimands of interest.
Make analytical code available (e.g., as a supplement to the publication or in a public repository) after reviewing it for disclosive content
Provide a data availability statement describing the process for obtaining access to the source data. Report relevant summary-level information including sample flow diagrams and sample characteristics
Follow the Lesko et al (2022, AJE) methodological and reporting framework for descriptive epidemiology.