Signal Discovery

Scan broadly to identify candidate patterns or signals for further study.

R1: Research task identification

  • Omic-wide associations: multiple molecular features, one outcome
    Scanning high-dimensional omic spaces for associations with a defined outcome, including genetic variants (GWAS), gene expression (TWAS), epigenetic markers (EWAS), and metabolites, as well as interaction effects (e.g. gene-environment, gene-gene).

  • Phenome-wide associations: one exposure, multiple phenotype outcomes
    Scanning the phenome for associations with a defined exposure, typically a genetic variant (PheWAS).

  • Exposome-wide associations: multiple environmental exposures, one outcome
    Scanning the exposome for associations with a defined outcome, including environmental, lifestyle, and occupational exposures (ExWAS).

  • Pharmacovigilance and drug repurposing: multiple drug exposures, multiple outcomes
    Scanning drug-outcome spaces for previously unrecognised adverse effects (pharmacovigilance) or new therapeutic indications (drug repurposing).

I2: Identify estimand(s)

Carefully describe the target quantity/quantities of interest and all relevant criteria using the appropriate estimand framework

  1. Purpose or intended use of the signals (e.g., hypothesis generation, prioritisation of candidates for confirmatory studies, regulatory safety actions, drug repurposing, biomarker identification)

  2. Target population (including the population definition and sampling frame)

  3. Definition of the feature space and outcome space being screened (e.g. all genotyped variants and a defined phenotype in GWAS; a defined variant and all available phenotypes in PheWAS; all drug exposures and a defined adverse event in drug-wide signal detection)

  4. Summary measures (e.g. odds ratio, beta coefficient, log fold change, hazard ratio)

  5. Signal detection criteria (e.g. genome-wide significance threshold or false discovery rate target)

  6. Replication strategy (pre-specified plan for independent replication, including the replication cohort or data source, significance threshold for replication, and criteria for declaring a validated versus exploratory signal; if independent replication is not planned, this must be explicitly stated)
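One way to make these components concrete is to capture them in a pre-specification template before data access. A hypothetical sketch in Python follows; every field value is illustrative rather than prescriptive, and the exact fields should match the estimand framework chosen for the study.

```python
# Illustrative pre-specification of the six estimand components for a
# hypothetical GWAS of type 2 diabetes. All values are examples only.
estimand = {
    "purpose": "hypothesis generation for confirmatory follow-up",
    "target_population": "adults aged 40-69 in a national biobank",
    "feature_space": "all genotyped variants passing QC",
    "outcome_space": "clinically ascertained type 2 diabetes",
    "summary_measure": "odds ratio per effect allele",
    "signal_criteria": {"p_threshold": 5e-8, "correction": "genome-wide"},
    "replication": {
        "planned": True,
        "cohort": "independent biobank (named in protocol)",
        "p_threshold": 0.05,
    },
}

# A simple pre-registration check: no component may be left unspecified.
assert all(v not in (None, "") for v in estimand.values())
```

A template like this can be deposited alongside the protocol (see S8) so that deviations are auditable.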

G3: Gauge data fitness

Assess whether the available data are fit for purpose, i.e. capable of supporting estimation of the target estimands defined in the previous step

Sample Requirements

  • The sample should be drawn from the target population, or contain sufficient information to enable transportation or reweighting of estimates to that population.

  • Sample size should be sufficient to detect desired effect sizes after multiple testing correction, accounting for case-control imbalance where applicable (e.g. effective sample size in case-control designs)

Variable Requirements

  • Key variables (i.e. exposures/variants and outcomes/traits) are consistently defined, available, and accurately measured across all data sources and cohorts

  • Where the signal of interest takes time to emerge, the period of observation is sufficient to detect it (e.g. the duration of follow-up in a pharmacovigilance analysis is sufficient for relevant adverse outcomes to occur)

O4: Outline and consider key sources of error, bias & threats to validity

Consider all potential sources of error, bias and threats to validity and outline mitigation strategies. Table 2 contains prompt questions to help identify major sources of bias and select mitigation strategies.

Selection

Type 2 selection bias (generalisability bias)
The sample is not a census or random sample from the target population, the distribution of certain effect modifiers differs between the sample and the target population, and the analytic sample cannot be reweighted to represent the target population. Detected signals may not replicate in the broader population, or true signals may be missed due to insufficient representation of relevant subgroups.

Type 1 selection bias (collider restriction bias)
When both the scanning variable (exposure, genotype, or phenotype) and the outcome of interest are related to presence in the data, whether directly or through shared or intermediate causes, spurious signals may be detected. Common instances include Berkson's bias*, index event bias**, survivorship bias***, and M-bias****.

Measurement

Outcome measurement error
The outcome is measured with error, which may be non-differential or differential. Non-differential error can attenuate true associations below detection thresholds after multiple testing correction; differential error can generate spurious signals or mask true signals.

Scanning variable measurement error
The scanning variable (exposure, genotype, or phenotype) is measured with error, which may be non-differential or differential. Non-differential error attenuates associations, increasing the risk of missed signals; differential error can generate spurious signals in any direction.

Dependent measurement errors
There is correlated measurement error in both the scanning variable and outcome, e.g. because both are derived from the same clinical documentation process. Dependent errors can generate spurious signals or mask true signals.
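The attenuation from non-differential error described above can be demonstrated with a small simulation. This sketch assumes classical measurement error on the scanning variable; the estimated slope shrinks by the reliability ratio var(x) / (var(x) + error variance), and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta_true = 0.5

x = rng.normal(size=n)                  # true scanning variable
y = beta_true * x + rng.normal(size=n)  # outcome

# Classical, non-differential error added to the scanning variable.
noise_sd = 1.0
x_obs = x + rng.normal(scale=noise_sd, size=n)

# OLS slope of y on the mismeasured x_obs.
beta_hat = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# Expected attenuation: beta_true * var(x) / (var(x) + noise_sd**2),
# i.e. roughly 0.5 * 0.5 = 0.25 here, half the true effect.
```

An association attenuated this way can fall below a multiple-testing-corrected threshold even when the true effect is detectable at the available sample size.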

Missing Data

Data are missing for the outcome, scanning variable, or covariates, and the probability of missingness is related to the signal of interest. Where the analysis requires follow-up over time, informative loss to follow-up or censoring arises if individuals leave the observation window for reasons related to the scanning variable or outcome. Informative missingness can generate spurious signals or mask true signals.

Data Source Heterogeneity

Data are pooled from multiple healthcare systems or across calendar periods with different measurement practices and/or case-mix. This introduces uncertainty and can generate spurious signals or mask true signals when these differences covary with the scanning variable or outcome.

Data Sparsity

Insufficient observations for specific exposure-phenotype combinations lead to unstable association estimates, inflated effect sizes (i.e. ‘winner’s curse’), and an excess of false positives.
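As an illustration of why sparse cells destabilise estimates, consider a 2x2 table with an empty cell: the raw odds ratio is undefined, and the Haldane-Anscombe correction shown here (one common, imperfect fix) yields a large and unstable estimate. All counts are hypothetical.

```python
def odds_ratio(a: float, b: float, c: float, d: float,
               correction: float = 0.5) -> float:
    """Odds ratio for a 2x2 table (a=exposed cases, b=exposed non-cases,
    c=unexposed cases, d=unexposed non-cases). When any cell is zero,
    the Haldane-Anscombe correction adds 0.5 to every cell, at the
    cost of biasing the estimate."""
    if min(a, b, c, d) == 0:
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)

# Three exposed cases and zero exposed non-cases: the corrected OR is
# large and driven almost entirely by the correction term.
or_sparse = odds_ratio(3, 0, 1_000, 9_000)   # (3.5 * 9000.5) / (0.5 * 1000.5)
or_dense = odds_ratio(10, 10, 10, 10)        # 1.0
```

Minimum cell-count filters before scanning, or shrinkage estimators, are common mitigations for this failure mode.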

Confounding

Unobserved baseline confounding
One or more common causes of the scanning variable and outcome are not captured in the data, generating spurious signals. Important instances include population stratification in GWAS (where ancestry-related genetic variation correlates with both the variant and the outcome through shared environmental or demographic pathways), confounding by indication in pharmacovigilance, and shared lifestyle or environmental confounders in ExWAS.

Residual baseline confounding
One or more available baseline confounders are poorly measured or have been coarsened (e.g. dichotomised), meaning conditioning does not fully remove confounding. For example, principal component adjustment for population stratification in GWAS may not fully capture ancestry structure, leading to residual confounding by population stratification.

Time Zero Alignment

Lead-time bias
When the scanning variable influences the timing of the index event (e.g., a drug triggers earlier diagnostic investigation), spurious signals may be detected that reflect differential detection timing rather than genuine associations with the outcome.

Immortal time bias
Scanning variable definitions that require a period of time to be satisfied (e.g., "at least 7 days of drug exposure") guarantee that exposed individuals have survived event-free for that period, potentially generating spurious protective signals or masking harmful signals.

Prevalent user bias
Scanning for signals among prevalent users of an exposure means that adverse outcomes occurring before the observation window are uncaptured, selecting for individuals who survived and tolerated the exposure. This can mask true signals or attenuate genuine associations.

___________________________

* Berkson’s bias = A type of selection bias that occurs when the primary variables of interest (e.g. an exposure and an outcome) both directly influence entry into the sample.

** Index event bias = A type of selection bias that occurs when a primary variable of interest (e.g. the outcome) is only possible among people who have experienced a qualifying event that is directly influenced by another variable of interest (e.g. the exposure), and the primary variable is also related to the qualifying event through shared causes. 

*** Survivorship bias = a type of selection bias that occurs when a primary variable of interest (e.g. the exposure) directly influences survival to study entry, and another variable of interest (e.g. the outcome) is also related to survival through shared causes.

**** M-bias = a type of selection bias that occurs when there are unmeasured causes of two primary variables of interest (e.g. the exposure and outcome) that also cause study entry. In EHR data, this often arises through informed presence bias, where presence in the dataset is influenced by factors (e.g. healthcare utilisation, socioeconomic position) that are also linked to the primary variables of interest.

R5: Run appropriate analysis

Select and conduct analyses that are suitable for estimating your target estimands in the available data

  • Choose appropriate association model based on the outcome type and study design (e.g., linear or logistic regression for unrelated samples, linear mixed models for samples with population structure or relatedness, Cox models for time-to-event phenotypes)

  • For time-to-event phenotypes, specify how competing events will be handled

  • For each source of error, bias, or validity threat identified in the O4 step (above), specify the analytical strategy and mitigation approach, documenting the details in Table 2

  • Choose appropriate multiple testing correction for the inferential context (FDR vs. FWER)

  • Pre-specify falsification strategies to detect residual confounding or other systematic biases (e.g., genomic inflation diagnostics, negative control outcomes)

  • Seek replication in independent data where feasible; where replication is not possible, state the reasons and acknowledge the signals as preliminary

  • Use phenome-wide or exposure-wide visualisation to contextualise signals
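The FDR-vs-FWER choice above can be sketched with textbook implementations of the Bonferroni and Benjamini-Hochberg procedures; the p-values are illustrative, and in practice a vetted library implementation (e.g. statsmodels' multipletests) would normally be used.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """FWER control: reject any test with p < alpha / m."""
    p = np.asarray(pvals)
    return p < alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the
    largest i such that p_(i) <= (i / m) * alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * (np.arange(1, m + 1) / m)
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [1e-6, 0.0004, 0.0019, 0.0095, 0.02, 0.2, 0.7]
# Bonferroni (threshold 0.05 / 7 ~ 0.0071) flags 3 signals;
# Benjamini-Hochberg flags 5, the expected pattern for a
# hypothesis-generating screen that tolerates some false positives.
```

FWER control suits contexts where any false positive is costly (e.g. regulatory action); FDR control suits discovery screens feeding a confirmatory pipeline.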

*Standard regression adjustment is inappropriate when time-varying confounders are affected by prior treatment, as conditioning on these variables simultaneously blocks causal pathways and introduces collider bias.
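One of the falsification diagnostics listed under R5, genomic inflation, can be sketched as follows. This is a minimal version of the genomic control lambda (median observed test statistic over the null chi-square(1) median); under the null it should sit near 1, and the simulation is purely illustrative.

```python
import numpy as np

CHI2_1_MEDIAN = 0.4549364  # median of the chi-square(1) distribution

def genomic_inflation(chi2_stats) -> float:
    """Genomic control lambda: median observed association test
    statistic divided by the null chi-square(1) median. Values well
    above 1 suggest inflation from population stratification or other
    systematic bias spread across the whole scan."""
    return float(np.median(chi2_stats) / CHI2_1_MEDIAN)

# Under the null, lambda should be close to 1.
rng = np.random.default_rng(1)
null_stats = rng.chisquare(df=1, size=500_000)
lam = genomic_inflation(null_stats)
```

A lambda well above 1 computed on real scan statistics would prompt re-examination of the confounding adjustment (e.g. the principal components used for ancestry).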

O6: Outline and assess assumptions

Clearly outline the assumptions behind your results and conduct appropriate sensitivity analyses

  • Confounding conditioning adequacy

    State and justify the assumption that the chosen confounding adjustment strategy (e.g., principal component adjustment for population stratification, covariate adjustment for confounding by indication) is sufficient. Where residual confounding is plausible, assess the potential impact on findings using quantitative bias analysis.

  • Multiple testing assumptions

    State the assumed dependence structure between tests and justify the chosen error rate control procedure (FDR vs FWER). Assess sensitivity of findings to the choice of correction approach where feasible.

  • Replicability

    Where replication in independent data has been conducted, assess whether signals replicate and discuss potential reasons for non-replication. Where replication has not been conducted, acknowledge the preliminary nature of the findings and discuss the plausibility of the signals in light of existing evidence.

  • Competing events and censoring assumptions

    For time-to-event outcomes, state the assumptions underlying the chosen approach to competing events and censoring (e.g., non-informative censoring, independence of competing events).

  • Missing data assumptions

    State the assumed missing data mechanism (MCAR, MAR, or MNAR) and justify the chosen analytical approach. Assess sensitivity of findings to missing data assumptions and analytical approach if missing data are substantial.

  • Model assumptions 

    Report and check all model-specific assumptions (e.g. linearity, distributional assumptions).
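The consequences of a wrong missing-data assumption can be illustrated with a small simulation in which missingness depends on the unobserved value itself (MNAR), so complete-case analysis is biased. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
y = rng.normal(loc=10.0, scale=2.0, size=n)  # true outcome values

# Missingness depends on the value itself: observations above the
# mean are far more likely to be missing (MNAR).
p_missing = np.where(y > 10.0, 0.6, 0.1)
observed = rng.random(n) > p_missing

full_mean = y.mean()          # close to the true mean of 10.0
cc_mean = y[observed].mean()  # biased downwards under MNAR
```

Complete-case analysis recovers the truth only under MCAR (and, with appropriate conditioning, some MAR settings); multiple imputation and weighting approaches themselves rely on MAR, which is why the assumed mechanism must be stated and probed in sensitivity analyses.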

U7: Use appropriate language

Describe and interpret aims, methods, and results in terms of detected signals and their statistical strength, avoiding predictive, causal, or definitive language, and emphasising the hypothesis-generating nature of findings.

Examples:

  • Aim: 'We aimed to identify candidate signals for Y across a broad set of exposures/variants/drugs in population Z'

  • Methods: 'Associations between each exposure and Y were estimated using Method Z, with signals identified based on a pre-specified threshold of X'

  • Results: 'A potential signal was detected between X and Y (effect estimate, 95% CI…)'

  • Discussion: 'These findings suggest X may be a candidate signal for Y'

Examples to Avoid:

  • Throughout:

    • 'X predicted Y' (predictive language implies a model-based prediction task)

    • 'X caused/influenced/modified Y' (causal language implies a causal effect estimation task)

    • 'X was a risk factor for Y' (risk factor language is unclear and should be avoided)

S8: Satisfy reporting and transparency standards

Follow current best practice and relevant reporting guidelines for reporting study details and results.

  • Consider pre-registering a study protocol and statistical analysis plan (e.g. on OSF) before data access, clearly stating the signal discovery estimands of interest.

  • Make analytical code available (e.g., as a supplement to the publication, alongside the protocol, or in a public repository), after reviewing it for disclosive content.

  • Provide a data availability statement describing the process for obtaining access to the source data. Report summary-level information including sample flow diagrams and baseline sample characteristics. 

  • Follow RECORD reporting guidelines, along with appropriate task-specific reporting and methodological guidance (STREGA for GWAS, CIOMS Working Group VIII for pharmacovigilance signal detection).