Corporate document

Methods for real-world studies of comparative effects

Key messages

  • Non-randomised studies can be used to provide evidence on comparative effects in the absence of randomised controlled trials or to complement trial evidence to answer a broader range of questions about the effects of interventions in routine settings.

  • The recommendations presented here focus predominantly on cohort studies including those using real-world data to form external control arms.

  • Study design

    • Design studies to emulate the preferred randomised controlled trial (target trial approach).

    • Avoid time-related biases arising from misalignment between the times at which patient eligibility criteria are met, treatment is assigned, and follow up starts.

    • For studies using external control, select and curate data to minimise differences between data sources including availability and operational definitions of key study variables, data collection processes, patient characteristics, treatment settings, care pathways, and time periods, and consider the implications for study quality and relevance.

  • Analysis

    • Identify potential confounders (including time-varying confounders) using a systematic approach and clearly articulate causal assumptions.

    • Use a statistical method that addresses confounding considering observed and unobserved confounders.

    • Consider the impact of bias from informative censoring, missing data, and measurement error, and address these appropriately if needed.

    • Assess the external validity of findings to the target population and consider if adjustment methods are suitable or needed.

    • Use sensitivity and bias analysis to assess the robustness of results to main risks of bias and uncertain data curation and analysis decisions.

  • Reporting

    • Justify the need for non-randomised evidence.

    • Provide a study protocol and statistical analysis plan before performing final analyses.

    • Report studies in sufficient detail to enable independent researchers to reproduce the study and understand what was done and why.

    • Assess the risk of bias and relevance of the study to the research question.

  • The acceptable quality of evidence may depend on the application and various contextual factors (see the section on considerations for the quality and acceptability of real-world evidence).


We previously outlined principles for the robust and transparent conduct of quantitative real-world evidence studies across different use cases. In this section we provide more detailed recommendations for the conduct of studies of comparative effects using real-world data. This includes traditional observational studies based on primary or secondary data collection and trials in which real-world data is used to form an external control. We do not provide specific considerations for purely interventional studies (whether randomised or not) or external control studies using only interventional data. We focus here on quantitative studies but recognise that qualitative evidence can play an important role in improving our understanding of the value of interventions.

Randomised controlled trials are the preferred study design for estimating comparative effects. Non-randomised evidence may add value if randomised controlled trials are absent, not directly relevant to the research question or of poor quality (see the section on uses and challenges of randomised controlled trials). They can also complement trial evidence to answer a broader range of questions (see the section on estimating intervention effects using real-world data).

If real-world evidence on comparative effects may improve the evidence base, it is essential that studies are done using robust and transparent methods. We recommend designing real-world evidence studies to emulate the randomised trial that would ideally have been done (see the section on study design), using appropriate statistical methods to address confounding and informational biases (see the section on analysis), and assessing the robustness of results using sensitivity and bias analysis (see the section on assessing robustness). This approach is summarised in figure 1, a visual summary of key considerations for planning and reporting cohort studies using real-world data.

The recommendations provided here are intended to improve the quality of real-world studies of comparative effects, both in terms of methodological quality and validity, and the transparency of study conduct. They were derived from best-practice guidance from the published literature, international research consortia, and international regulatory and payer bodies, and will be updated regularly in line with developing methodologies. They build on NICE Decision Support Unit's technical support document 17, which presents statistical methods for analysing observational data.

We recognise that not all studies will be able to meet all recommendations in full. The ability to perform studies of the highest quality will depend on the availability of suitable data (see the section on assessing data suitability) and characteristics of the condition and intervention. Simpler methods may be appropriate for other applications including assessing non-health outcomes like user experience or some system outcomes. In addition, the acceptability and contribution of specific studies to decisions will depend on the application as well as several contextual factors (see the section on considerations for the quality and acceptability of real-world evidence studies).

Figure 1

Visual summary of key considerations for planning and reporting cohort studies using real-world data

Types of non-randomised study design


A large variety of study designs can be used to estimate the effects of interventions, exposures or policies. The preferred study design will be context dependent. It may depend on whether variation in the exposure is within individuals over time, between individuals, or between other groups such as healthcare providers. In general, confidence in non-randomised study results is strengthened if results are replicated using different study designs or analytical methods, known as triangulation (Lawlor et al. 2016).

One important distinction is between interventional and observational studies. In interventional studies, individuals (or groups of individuals) are allocated to 1 or more interventions according to a protocol. Allocation to interventions can be random, quasi-random or non-random. In observational studies, interventions are not determined by a protocol but instead according to the preferences of health and social care professionals and patients. Hybrid studies may make use of both interventional and observational data. In this section we focus on observational and hybrid studies only.

Both interventional and observational studies can be uncontrolled. Uncontrolled studies are appropriate only in rare cases, in which the natural course of the disease is well understood and highly predictable and the treatment effect is very large (see ICH E10 choice of control group in clinical trials and Deeks et al. 2003). In most cases a comparison group is needed to generate reliable and informative estimates of treatment effects. Controlled studies can make use of variation in exposures and outcomes across individuals (or groups), within individuals (or groups) over time, or both. In this section we focus on controlled studies.

Below we discuss types of comparative studies. Some taxonomies distinguish between prospective studies (involving primary data collection) and retrospective studies (based on already collected data). This distinction does not necessarily convey information about study quality and so we advise against its use (Dekkers and Groenwold 2020).

Cohort studies

In cohort studies, individuals are identified based on their exposures, and their outcomes are compared during follow up. Usually, cohort studies will compare individuals subject to different exposures from the same data source. However, they can also combine data from different sources including from interventional and observational data sources. In this case, the observational data is used to form an external control to the intervention used in the trial. The trial will often be an uncontrolled single-arm trial but could also be an arm from a controlled trial. External data can also be used to augment concurrent controls within a randomised controlled trial.

External controls can also be formed from data from previous clinical trials. A potential advantage of such studies is greater similarity in patient inclusion criteria, follow up and outcome determination. Often only aggregate rather than individual patient-level data will be available from previous trials. NICE Decision Support Unit's technical support document 18 describes methods for unanchored indirect comparisons with aggregated data.

In the following study design and analysis sections, we focus on cohort studies, including those using external control from real-world data sources, which are the most common non-randomised study designs informing NICE guidance. Other study designs, including quasi-experimental designs and self-controlled studies, may be relevant in some contexts as outlined below.

Self-controlled studies

Self-controlled, or 'within-subject', designs make use of variation in exposure status within individuals over time. These include case-crossover, self-controlled case series, and variants of these designs. They are most appropriate for transient exposures with acute-onset events (Hallas and Pottegard 2014). While primarily used in studies of adverse effects of medicines (including vaccines), they have been used to assess the effects of oncology medicines using the experiences of individuals on prior lines of therapy (Hatswell and Sullivan 2020). This is most relevant if appropriate standard-of-care comparators are not available.

A key advantage of self-controlled methods is the ability to control for confounders (including unmeasured or unknown confounders) that do not vary over time, such as genetic inheritance, or vary slowly like many health behaviours. However, it is still necessary to adjust for covariates that may change over time (for example, disease severity). Such methods generally either assume no time-based trends in outcomes or try to model the trend statistically. These approaches can often be strengthened by the addition of control groups of people not exposed to the interventions.

Cross-sectional studies

In cross-sectional studies information on current exposures and outcomes is collected at a single time point. While they can be used to estimate intervention effects, they are less reliable than longitudinal studies (such as cohort studies) if there is a need for a clear temporal separation of exposures and outcomes.

Case-control studies

In case-control studies individuals are selected based on outcomes, and odds of exposures are compared. Case-control studies embedded within an underlying cohort are known as nested case-control studies. Case-control studies conducted within existing database studies are generally not recommended because they use less information than cohort studies (Schuemie et al. 2019). Case-control studies are most useful for rare outcomes or if there is a need to collect further information on exposures, for example, from manual medical record review or primary data collection.

Quasi-experimental studies

Quasi-experimental studies and natural experiments exploit external variation in exposure across people or over time (an 'instrument') that is otherwise unrelated to the outcome to estimate causal effects (Reeves et al. 2017, Matthay et al. 2019). Common quasi-experimental methods include instrumental variable analysis, regression discontinuity, interrupted time series and difference-in-difference estimation. They are frequently used in public health settings when randomisation is not always feasible but have also been used in medical technologies evaluations (see NICE medical technologies guidance on Sleepio to treat insomnia and insomnia symptoms).

Instrument-based approaches may be useful if:

  • confounding because of unknown or poorly measured confounders is expected

  • an appropriate instrument is available that is associated with the exposure of interest and does not affect the outcome except through the exposure.

Examples of instruments that have been used in healthcare applications include variation in physician treatment preferences or hospital formularies, genes, distance to healthcare providers or geographic treatment rates, arbitrary thresholds for treatment access, or time (for example, time of change to clinical guidelines that have immediate and substantial impacts on care patterns).

A key advantage of these approaches is in addressing confounding due to unobserved or poorly measured covariates. However, consideration needs to be given to the validity of the instrument in addition to other methodological challenges depending on the particular design used (see NICE Decision Support Unit's technical support document 17). Applications of these methods are usually strongly dependent on assumptions that are difficult to test, and a clear case for validity based on substantive knowledge and empirical justification is required.

Study design

In this section we present study design considerations for cohort and external control studies using real-world data. These approaches may also be useful for other non-randomised study designs.

The target trial approach

Non-randomised studies should be designed to mimic the randomised trial that would ideally have been performed unconstrained by ethical or feasibility challenges (Hernán and Robins 2016, Gomes et al. 2022). This process, known as the target trial approach (or trial emulation), requires developers to clearly articulate the study design and helps avoid selection bias because of poor design (Bykov et al. 2022). Usually, the target trial would be a pragmatic randomised trial representing the target population of interest and reflecting routine care. This approach forms the basis of the Cochrane ROBINS-I risk of bias tool for non-randomised studies (Sterne et al. 2016).

Studies should aim to emulate the target trial as closely as possible and, if this is not possible, trade-offs should be clearly described. In some cases, a data source may not be of sufficient relevance or quality to allow trial emulation. This can be particularly problematic for studies using real-world data to form an external control because differences in terms of patients, settings, care, data collection and time periods can limit the comparability between the trial and the real-world data (Gray et al. 2020, Pocock 1976). Sometimes it will not be possible to adequately emulate a target trial with real-world data and bespoke data collection may be needed.

The target trial can be defined across 7 dimensions: eligibility criteria, treatment strategies, assignment procedure, follow-up period, outcomes, causal effect of interest and analysis plan. We describe each dimension below and provide considerations for those developing evidence to inform NICE guidance.

Eligibility criteria

For most studies, the eligibility criteria should mimic a hypothetical pragmatic trial by reflecting the clinical pathways (including diagnostic tests) and patients seen in routine care in the NHS. For external control studies, the focus should be on matching the eligibility criteria from the interventional study rather than the broader target population. As in a trial, eligibility criteria should be based on variables recorded before treatment assignment.

If heterogeneity is anticipated in the intervention effects, subgroup analysis can be done. The subgroups should be defined upfront when planning the study.

Treatment strategies

Treatment strategies include the intervention of interest and any comparators. Comparators could be different levels of an exposure (for example, different doses of a medicine), a different intervention, or the absence of intervention. In observational data it is very difficult to emulate a placebo-controlled trial because of higher risk of selection bias and intractable confounding.

Comparators that are for the same (or similar) treatment indication (that is, active comparators) are preferred to comparison with those not receiving an intervention. Active comparators reduce the risk of confounding by indication by ensuring greater similarity of patients having different interventions. If routine follow-up procedures are similar across interventions this also reduces the risk of detection bias. The active comparator should ideally reflect established practice in the NHS.

For studies of interventions, new (or incident) user designs are generally preferred to studies of prevalent users (those who have already been using the intervention for some time) because of the lower risk of selection bias and better emulation of trial designs. Prevalent users have, by definition, remained on-treatment and survived for some period of follow up. When making use of already collected data, new users are typically defined using an initial period in which the individual was not observed to use the intervention of interest (known as the 'washout' period in pharmacoepidemiology). A further advantage of new-user designs is the ability to estimate time-varying hazards from treatment initiation. The inclusion of prevalent users may be needed if the effects of interventions are cumulative, there are too few incident users in the data, or follow up is limited (Vandenbroucke and Pearce 2015, Suissa et al. 2016).
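As an illustration of the new-user logic described above, the following sketch (with hypothetical data structures and a 365-day washout chosen purely for illustration) classifies patients as incident users only when a treatment-free washout period can be verified in their observable data before the first prescription:

```python
from datetime import date, timedelta

def incident_users(first_observed, prescriptions, washout_days=365):
    """Identify new (incident) users of an intervention.

    first_observed: dict mapping patient id -> date the patient's data begins
    prescriptions: dict mapping patient id -> list of prescription dates

    A patient counts as a new user only if their first recorded prescription
    is preceded by at least `washout_days` of observable, treatment-free time,
    so the washout period can actually be verified in the data.
    Returns a dict mapping patient id -> index date (treatment initiation).
    """
    washout = timedelta(days=washout_days)
    new_users = {}
    for pid, dates in prescriptions.items():
        if not dates:
            continue
        index_date = min(dates)
        if index_date - first_observed[pid] >= washout:
            new_users[pid] = index_date
    return new_users
```

In this sketch the index date doubles as the start of follow up, consistent with the alignment of eligibility, assignment and follow up described under the target trial approach.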

Data on comparators would ideally come from the same period as the intervention as well as from the same healthcare system and settings. This is to minimise any differences between treatment groups resulting from differences in care access, pathways (including diagnostic tests) or time-based trends in outcomes.

Assignment procedure

In randomised controlled trials, individuals (or groups) are randomly assigned to interventions. If possible, providers, patients and analysts are blinded to this assignment. Neither random assignment nor blinding is possible in observational studies. With sufficient information on confounders, random assignment can, however, be approximated through various analytical approaches (see the section on analysis).

In some applications, individuals will meet eligibility criteria at multiple time points. For example, they may start treatment more than once after a sufficient period without exposure (or 'washout' period). There are several approaches to deal with this including using only the first eligible time point, a random eligible time or all eligible time points (Hernán and Robins 2016).

Follow-up period

The start and end of follow up must be defined. The start of follow up should ideally begin at the same time at which all eligibility criteria are met and the intervention is assigned (or just after). If a substantial latency period is expected between treatment initiation and outcomes, it may be necessary to define an induction period before which outcomes are not counted. This can reduce the risk of reverse causation, in which the outcome influences the exposure.

The follow-up period should be long enough to capture the outcomes of interest but should not exceed the period beyond which outcomes could be reasonably impacted by the intervention (known as the exposure-effect window). Censoring events should be clearly defined and will depend on the causal effect of interest.

Outcomes

Primary and secondary outcomes should be defined and can include both patient and health system outcomes (such as resource use or costs). Patient outcomes should reflect how a patient feels, functions, or how long a patient lives. This includes quality of life and other patient-reported outcome measures. Objective clinical outcomes (such as survival) are typically at lower risk of bias than subjective outcomes, whose detection or reporting could be influenced by knowledge of treatment history.

For a surrogate outcome there should be good evidence that changes in the surrogate outcome are causally associated with changes in the final patient outcomes of interest (Ciani et al. 2017).

While outcome ascertainment is not blinded in observational data, analysts can be blinded to outcomes before finalising the analysis plan (see the section on analysis).

Causal effect of interest

Researchers should describe the causal effect of interest. Trials are usually designed to estimate 1 of 2 causal effects: the effect of assignment to an intervention (intention-to-treat) or the effect of adhering to treatment protocols (per-protocol). It is not usually possible to estimate the effect of treatment assignment using observational data because this is not typically recorded. However, it can be proxied using treatment initiation (the as-started effect). The equivalent of the per-protocol effect is sometimes called the on-treatment effect.

The as-started effect is usually of primary interest to NICE. However, if treatment discontinuation (or switching) is substantial or is not expected to reflect routine practice or outcomes in the NHS, it is important to present results from the on-treatment analysis. On-treatment analyses may also be most appropriate for the analysis of safety and adverse events. The on-treatment effect can also be extended to cover dynamic treatment strategies such as treatment sequences or other complex interventions which are of interest to NICE.

Analysis plan

The analysis plan should describe how the causal effect of interest is to be estimated, taking into account intercurrent events. Intercurrent events are events occurring after treatment initiation (such as treatment switching or non-adherence) that affect the interpretation of the outcome of interest. This is supported by the estimand framework (for further information, see ICH E9 [R1] addendum on estimands and sensitivity analysis in clinical trials).

The relevance of intercurrent events will depend on the causal effect of interest. In an as-started analysis, treatment discontinuation, switching or augmentation can usually be ignored. However, if these changes are substantial there is a risk of increasing exposure misclassification over time. In most cases this would bias estimates of effect towards the null.

In an on-treatment analysis or when modelling dynamic treatment strategies, the follow up is often censored once the patient stops adhering to the treatment plan plus some biologically informed effect window. For medicines (and some devices) continued exposure is proxied by dates of prescriptions and expected period of use (for example, derived from number of days' supply), with some grace period between observations permitted. Particular attention needs to be given to the possibility of informative censoring, which causes bias if censoring depends on outcomes and differs across interventions, and time-varying confounding.
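To make the prescription-based exposure definition concrete, a minimal sketch (hypothetical record format; the 30-day grace period is an illustrative choice, not a recommendation) stitches successive prescriptions into a continuous exposure episode and returns the date at which on-treatment follow up would be censored:

```python
from datetime import date, timedelta

def exposure_end(prescriptions, grace_days=30):
    """Estimate when continuous exposure ends from prescription records.

    prescriptions: list of (start_date, days_supply) tuples.
    A new prescription extends the episode if it starts no more than
    `grace_days` after the previous supply runs out; otherwise the
    patient is treated as having discontinued.
    """
    episodes = sorted(prescriptions)
    end = episodes[0][0] + timedelta(days=episodes[0][1])
    for start, supply in episodes[1:]:
        if start <= end + timedelta(days=grace_days):
            # Overlapping or near-contiguous refill: extend the episode
            end = max(end, start + timedelta(days=supply))
        else:
            break  # gap exceeds the grace period: discontinuation
    return end
```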

Further content on statistical analysis including addressing confounding, informative censoring, missing data and measurement error is presented in the analysis section.

Panel 1 shows examples of using the target trial approach:

Panel 1: examples of the target trial approach

Example 1: What is the effect of initiating HRT on coronary heart disease in postmenopausal women?

The Women's Health Initiative randomised controlled trial showed that initiating treatment with hormone replacement therapy increased the risk of coronary heart disease in postmenopausal women. This contradicted earlier observational studies that found a reduction in the risk of coronary heart disease. Hernán et al. 2008 followed a target trial approach, replicating as far as possible the Women's Health Initiative trial using data from the Nurses' Health Study. They were able to show that the difference in results between the trial and observational studies resulted from the inclusion of prevalent users of hormone replacement therapy in the observational cohort. These women had already survived a period of time on-treatment without experiencing the outcome. Following a new-user design (as well as other principles of the target trial approach) they were able to produce effect estimates consistent with the trial.

Example 2: What is the optimal estimated glomerular filtration rate (eGFR) at which to initiate dialysis treatment in people with advanced chronic kidney disease?

The IDEAL randomised controlled trial showed a modest reduction in mortality and cardiovascular events for early versus late initiation of dialysis. The average eGFR scores in the early and late treatment arms were 9.0 and 7.2 mL/min/1.73 m², respectively. There therefore remains considerable uncertainty about the optimal time to initiate dialysis. Fu et al. 2021 emulated the IDEAL trial using data from the National Swedish Renal Registry and were able to produce similar results over the narrow eGFR separation achieved in the trial. They were then able to extend the analysis to a wider range of eGFR values to identify the optimal point at which to initiate dialysis therapy.

Example 3: What is the effect of initiating treatment with fluticasone propionate plus salmeterol (FP-SAL) versus 1) no FP-SAL or 2) salmeterol only on COPD exacerbations in people with COPD?

The TORCH trial found that treatment with FP-SAL was associated with a reduction in the risk of COPD exacerbations compared with no FP-SAL or salmeterol only. However, the trial excluded adults aged above 80 years and those with asthma or mild COPD. There is uncertainty about the extent to which results from the TORCH trial apply to these patients. Wing et al. 2021 were able to replicate the findings of the TORCH trial for COPD exacerbations using primary care data from Clinical Practice Research Datalink in England for the comparison with salmeterol only but not with no FP-SAL. This reflects the challenge in emulating a trial with placebo control. By extending their analysis to a wider target population they were able to demonstrate evidence of treatment effect heterogeneity by COPD severity but not by age or asthma diagnosis.

Analysis

Addressing risk of confounding bias

Identification and selection of confounders

Potential confounders should be identified before analysis, based on a transparent, systematic and reproducible process. Key sources of evidence are published literature and expert opinion. Consideration should be given to the presence of time-varying confounders. These affect the outcome and future levels of the exposure and can be affected by previous levels of the exposure. They are especially relevant when modelling time-varying interventions or dynamic treatment strategies or addressing informative censoring.

Developers should outline their assumptions about the causal relationships between interventions, covariates and outcomes of interest. Ideally, this would be done using causal diagrams known as directed acyclic graphs (Shrier and Platt 2008).

Inappropriate adjustment for covariates should be avoided. This may result from controlling for variables on the causal pathway between exposure and outcomes (overadjustment), colliders or instruments. Confounders that may change value over time should be recorded before the index date, except when using statistical methods that appropriately address time-varying confounding.

The selection of covariates may use advanced computational approaches such as machine learning to identify a sufficient set of covariates, for example, when the number of potential covariates is very large (Ali et al. 2019, Tazare et al. 2022). The use of these methods should be clearly justified and their consistency with causal assumptions examined. Choosing covariates based on statistical significance should be avoided.

Selecting methods for addressing confounding

Adjusted comparisons based on clear causal assumptions are preferred to naive (or unadjusted) comparisons. Statistical approaches should be used to address confounding and approximate randomisation (see the section on assignment procedure).

Various approaches can be used to adjust for observed confounders including stratification, matching, multivariable regression and propensity score methods, or combinations of these. These methods assume no unmeasured confounding. Simple adjustment methods, such as stratification, restriction and exact matching, may be appropriate for research questions in which confounding is well understood and there are only a small number of confounders that are well recorded.

If there are many potential confounders, more complex methods such as multivariable regression and propensity score (or disease risk score) methods are preferred. Propensity scores give the probability of receiving an intervention based on observed covariates. Several methods use propensity scores including matching, stratification, weighting and regression (or combinations of these). General discussions of the strengths and weaknesses of these different approaches can be found in Ali et al. 2019. The choice of method should be justified and should be aligned with the causal effect of interest.
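As a minimal illustration of one of these approaches, the sketch below computes inverse probability of treatment weights from already-estimated propensity scores (in practice the scores would come from a fitted model, such as a logistic regression of treatment on the identified confounders; the function and its interface are hypothetical):

```python
def iptw_weights(treated, ps, stabilised=True):
    """Inverse probability of treatment weights from propensity scores.

    treated: list of 0/1 treatment indicators
    ps: list of estimated propensity scores P(treated | covariates)
    Stabilised weights multiply by the marginal probability of the
    treatment actually received, which reduces weight variability.
    """
    p_treat = sum(treated) / len(treated)  # marginal treatment probability
    weights = []
    for t, p in zip(treated, ps):
        w = 1 / p if t == 1 else 1 / (1 - p)
        if stabilised:
            w *= p_treat if t == 1 else (1 - p_treat)
        weights.append(w)
    return weights
```

Weighting of this kind targets an average effect in the whole study population; other propensity score methods (for example, matching) target different estimands, which is one reason the choice of method should align with the causal effect of interest.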

There is mixed evidence on the relative performance of regression and propensity score methods for addressing confounding bias (Stürmer et al. 2006). However, using propensity score methods may have advantages in terms of the transparency of study conduct:

  • Propensity scores are developed without reference to outcome data, which can reduce the risk of selective reporting of results when combined with strong research governance processes.

    • With certain propensity score methods it is possible to examine the similarity of intervention groups in terms of observed covariates, providing evidence on the extent to which comparability was achieved. Absolute standardised differences of less than 0.1 are generally considered to indicate good balance, although small absolute differences may still be important if the variable has a strong effect on the outcome.
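The balance diagnostic referred to above can be sketched for a continuous covariate as follows (a hypothetical helper for illustration, not part of this framework):

```python
from math import sqrt

def abs_std_diff(x_treat, x_ctrl):
    """Absolute standardised difference for a continuous covariate:
    the absolute difference in means divided by the pooled standard
    deviation. Values below 0.1 are conventionally taken to indicate
    good balance between intervention groups.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):  # sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    pooled_sd = sqrt((var(x_treat) + var(x_ctrl)) / 2)
    return abs(mean(x_treat) - mean(x_ctrl)) / pooled_sd
```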

Regression and propensity score methods may also exclude some participants to enhance the similarity of people across intervention arms or levels. When using such methods, trade-offs between internal validity, power and generalisability should be considered. For studies of comparative effects, internal validity should generally be prioritised.

Time-varying confounders should typically not be adjusted for using the above methods. Doing so may be acceptable in on-treatment analyses if the confounders that vary over time are not affected by previous levels of the intervention, but this is uncommon. G-methods, including marginal structural models with weighting, are preferred (Pazzagli et al. 2017, Mansournia et al. 2017). Adjustment for time-varying confounders requires high-quality data over the whole follow-up period.

Various sensitivity and bias analyses can be used to adjust for bias because of residual confounding or to explore its likely impact (see the section on assessing robustness of studies). This may be informed by external data on confounder-outcome relationships or data from a data-rich subsample of the analytical database, if available (Ali et al. 2019). Negative controls (that is, outcomes that are not expected to be related to the intervention) may also be useful (Lipsitch et al. 2010).

If there are multiple potential sources of suitable real-world data to provide external control to trial data, developers should consider whether to estimate effects separately for each data source or to increase power by pooling data sources. Data sources should only be pooled when there is limited heterogeneity between sources in terms of coverage and data quality. Individual estimates of effects for each data source should always be provided.

External controls can also be used to supplement internal (or concurrent) controls in randomised controlled trials. There are several methods available to combine internal and external controls, which place different weight on the external data (NICE Decision Support Unit report on sources and synthesis of evidence).

Instrument-based methods (or quasi-experimental designs) can be used to address unobserved confounding (Matthay et al. 2019). Further technical guidance on methods for addressing baseline confounding due to observed and unobserved characteristics using individual patient-level data is given in NICE's Decision Support Unit technical support document 17.

Addressing information bias

Limitations in data quality, including missing data, measurement error, or misclassification, can cause bias and loss of precision. Here we describe analytical approaches to address information bias. The information needed to understand data suitability will provide an insight into the likely importance of information bias (see the section on assessing data suitability).

Informative censoring

Censoring occurs in longitudinal studies if follow up ends before the outcome is fully observed. It can happen because of the end of the data collection period (administrative censoring), loss to follow up, or the occurrence of events such as treatment switching, non-adherence, or death, depending on the analysis. It may be induced by analytical strategies such as cloning to avoid time-related biases in studies without active comparators (Hernán and Robins 2016).

Censoring can create bias if it is informative (that is, it is related to the outcomes and treatment assignment). For example, in on-treatment analyses, if people on an experimental drug were less likely to adhere to the treatment protocol because of a perceived lack of benefit this could lead to informative censoring. When modelling effects on-treatment or dynamic treatment strategies, censoring because of treatment switching is likely to be informative. Methods to address informative censoring are similar to those for time-varying confounding such as marginal structural models with weighting or other G-methods (Pazzagli et al. 2017). Methods for dealing with missing data may also be used (see the section on missing data).

Missing data

The impact of missing data depends on the amount of missing data, the variables that have missing data, and the missing data mechanism. Developers should compare patterns of missingness across exposure groups and over time, if relevant, considering causes of missingness and whether these are related to outcomes of interest. Missing data on outcomes may arise for a number of reasons including non-response to questionnaires or censoring.

If the amount of missing data is low and likely to be missing completely at random, complete records analysis will be sufficient. Advanced methods for handling missing data include imputation, inverse probability weighting and maximum likelihood estimation. Most of these methods assume the missing data mechanism can be adequately modelled using available data (that is, missing at random). If this is not the case, sensitivity or bias analysis may be preferred (see the section on assessing robustness). A framework for handling missing data is provided in Carpenter and Smuk 2021.
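A first step recommended above is comparing patterns of missingness across exposure groups. The sketch below does this for one variable; the field names and records are hypothetical.

```python
def missingness_by_group(records, variable, group_var):
    """Proportion of missing values (None) for a variable, by exposure group."""
    counts = {}
    for r in records:
        g = r[group_var]
        n_total, n_missing = counts.get(g, (0, 0))
        counts[g] = (n_total + 1, n_missing + (r[variable] is None))
    return {g: miss / total for g, (total, miss) in counts.items()}

# Hypothetical patient records with a partially missing baseline covariate
patients = [
    {"arm": "drug_a", "ecog": 1},
    {"arm": "drug_a", "ecog": None},
    {"arm": "drug_b", "ecog": None},
    {"arm": "drug_b", "ecog": None},
]
print(missingness_by_group(patients, "ecog", "arm"))
```

Markedly different missingness between arms, as here, would suggest the data are unlikely to be missing completely at random, so complete records analysis alone would not be sufficient.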

Measurement error and misclassification

Measurement error describes the extent to which measurements of study variables deviate from the truth. For categorical variables, this is known as misclassification. The impact of measurement error depends on the size and direction of the error, the variables measured with error, and whether error varies across intervention groups. Measurement error can induce bias or reduce the precision of estimates.

Random measurement error in exposures tends to (but does not always) bias estimates of treatment effects towards the null (van Smeden et al. 2020). Random measurement error in continuous outcomes reduces the precision of estimates but provides unbiased estimates of comparative effects. For risk ratios and rate ratios, non-differential misclassification of a categorical outcome provides unbiased estimates of comparative effects when specificity is 100%, even if sensitivity is low. So, it is often recommended to define outcome variables to achieve high specificity.
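The claim about perfect specificity can be verified with a small worked example. Under non-differential misclassification, the observed risk is the true risk scaled by sensitivity plus a false-positive term; with specificity of 100% the false-positive term vanishes, so both arms are scaled equally and the risk ratio is preserved. The risks below are hypothetical.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Observed outcome risk under non-differential misclassification."""
    return true_risk * sensitivity + (1 - true_risk) * (1 - specificity)

true_rr = 0.20 / 0.10  # true risks of 20% (exposed) and 10% (unexposed)

# Specificity 100%, sensitivity only 60%: risk ratio is unchanged (2.0)
perfect_spec_rr = observed_risk(0.20, 0.6, 1.0) / observed_risk(0.10, 0.6, 1.0)

# Specificity 95%: false positives dilute the risk ratio towards the null
imperfect_spec_rr = observed_risk(0.20, 0.6, 0.95) / observed_risk(0.10, 0.6, 0.95)
```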

Differential measurement error in exposures, covariates or outcomes generally produces biased estimates of comparative effects but the direction of bias can be hard to predict. If data is available on the likely structure and magnitude of measurement error (for example, through an internal or external validation study), this information can be incorporated into analyses using calibration or other advanced methods (van Smeden et al. 2020).

Addressing external validity bias

Assessing external validity

This section focuses on methods to assess and address external validity bias resulting from differences in patient characteristics (for example, age, disease risk scores) between the analytical sample and the target population. Importantly, differences in patient characteristics may not be the only, or most important, sources of external validity bias. Developers should also consider differences in: setting (for example, hospital type and access to care), treatment (for example, dosage or mode of delivery, timing, comparator therapies, concomitant and subsequent treatments) and outcomes (for example, follow-up, measurements, or timing of measurements). Identifying a suitable data source, using a target trial approach and using internally valid analysis methods remain the primary approaches by which external validity can be achieved.

To assess external validity, an explicit definition of the target population and suitable reference information are needed. Information can be drawn from published literature, context-relevant guidelines, or bespoke analysis of data from the target population, alongside information gathered during the data suitability assessment.

To assess differences between the analytical sample and target population for patient characteristics, several approaches are available:

  • averages and distributions of individual variables can be compared (for example, using absolute standardised mean differences);

  • multiple variables can be compared simultaneously using propensity scores (here, reflecting a patient's propensity for being selected into the study) which also support measures of differences arising from joint distributions of patient characteristics (for example, Tipton 2014).

In studies of relative treatment effects, differences observed between the analytical sample and target population do not necessarily lead to concerns about external validity bias unless those differences are considered to be important treatment effect modifiers. This depends on the causal effect of interest, the extent of heterogeneity in the treatment effect, and whether this has been adequately modelled. Assumptions about the causal relationships between interventions, outcomes, and other covariates can be outlined to help identify potential treatment effect modifiers, for example, using directed acyclic graphs (Shrier and Platt 2008). Under certain conditions, treatment effect modification can also be investigated statistically (Degtiar and Rose 2022).

In studies of absolute treatment effects, assessment of external validity requires consideration of all differences that are prognostic of the outcome of interest, not only treatment effect modifiers.

Methods to minimise external validity bias

Methods to adjust for external validity bias are similar to those which adjust for confounding bias, including matching, weighting, and outcome regression methods. These approaches can also be combined for additional robustness.

  • Matching and weighting methods balance individual characteristics associated with selection into the sample (for example, using propensity scores).

  • Regression methods model outcomes in the analytical sample and then standardise model predictions to the distribution of covariates in the target population.

Degtiar and Rose 2022 provides further guidance on these methods, including approaches for when only summary-level data is available for the target population. Adjustment approaches are unlikely to perform well when the target population is poorly represented in the analytical sample, that is, where there is insufficient overlap for important covariates, or across strata of these variables. Successful application of these methods also depends on good internal validity of analyses and consistency in measurements of outcomes, treatments, and covariates across settings.
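The regression standardisation step can be sketched as below: predictions from an outcome model fitted in the analytical sample are averaged over the covariate distribution of the target population. The model, coefficients, and populations are purely illustrative.

```python
def standardised_risk(predict, target_covariates):
    """Average model-predicted risk over the target population's covariates."""
    predictions = [predict(x) for x in target_covariates]
    return sum(predictions) / len(predictions)

# Hypothetical outcome model fitted in the analytical sample:
# risk increases with age (illustrative coefficients only)
def predict(x):
    return min(1.0, 0.05 + 0.004 * (x["age"] - 50))

sample_pop = [{"age": 55}, {"age": 60}]               # younger analytical sample
target_pop = [{"age": 70}, {"age": 75}, {"age": 80}]  # older target population

risk_sample = standardised_risk(predict, sample_pop)
risk_target = standardised_risk(predict, target_pop)
```

Here the standardised absolute risk in the target population is higher than in the sample, illustrating why absolute estimates in particular can need this adjustment when the populations differ on prognostic factors.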

Where the sample is drawn from an entirely different population to the target population, judgements of similarity will require stronger assumptions. Pre-specified, empirical assessments of 'transportability' for the decision context could provide supportive evidence (for example, see Ling et al. 2023). In all cases, sensitivity analyses are recommended to explore potential violation of study assumptions (for example, see Dahabreh et al. 2023, Nguyen et al. 2018).

Assessing robustness of studies

The complexity of studies of comparative effects using real-world data means developers must make many uncertain decisions and assumptions during data curation and analysis. These decisions can have a large impact, individually or collectively, on estimates of comparative effects. It is therefore essential that the robustness of results to deviations from these assumptions is demonstrated. We describe key sensitivity analyses across several domains in table 3. Which sensitivity analyses to focus on will vary across use cases depending on the strengths and weaknesses of the data as well as the areas in which the impact of bias, study assumptions and uncertainty are greatest. These approaches can be applied directly to measures of clinical effectiveness or propagated through to cost-effectiveness analyses.

For key risks of bias (for example, those arising because of unmeasured confounding, missing data or measurement error in key variables), quantitative bias analysis may be valuable. Quantitative bias analysis describes a set of techniques that can be used to:

  • examine the extent to which bias would have to be present to change results or affect a threshold for decision making, or

  • estimate the direction, magnitude and uncertainty of bias associated with measures of effect.

Methods that examine the extent to which bias would have to be present to change study conclusions tend to be simpler and include the e-value approach. These approaches are most useful when exploring a single unmeasured source of bias, however sources of bias are often multiple and may interact. Developers should consider and pre-specify a plausible level of bias in the parameter before application of these methods. More sophisticated approaches look to model bias and incorporate it into the estimation of effects (Lash et al. 2014). Bias parameters can be informed by external information or data-rich subsamples of the analytical data source. The identification and validity of external bias information should be clearly described and justified.
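For a risk ratio, the standard e-value formula (VanderWeele and Ding 2017) is RR + sqrt(RR × (RR − 1)), applied after orienting the estimate so that RR > 1. A minimal sketch, with a hypothetical hazard ratio:

```python
import math

def e_value(rr):
    """E-value for a risk ratio estimate (VanderWeele and Ding 2017).

    The minimum strength of association, on the risk ratio scale, that an
    unmeasured confounder would need with both treatment and outcome to
    fully explain away the observed association.
    """
    rr = max(rr, 1 / rr)  # orient protective effects so that RR > 1
    return rr + math.sqrt(rr * (rr - 1))

# A hypothetical protective hazard ratio of 0.55
print(round(e_value(0.55), 2))  # → 3.04
```

An unmeasured confounder would need associations of this size with both treatment and outcome to explain away the estimate; whether that is plausible should be judged against the strength of the observed confounders, as in panel 2.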

Bias analysis may be particularly valuable in studies using real-world data external controls, where differences in data collection, settings, or time periods may reduce the comparability of data. Panel 2 shows an example of bias analysis in practice, and table 3 shows examples of sensitivity analysis.

Panel 2: example of bias analysis

What is the effectiveness of the ALK-inhibitor alectinib compared with ceritinib in crizotinib-refractory, ALK-positive non-small-cell lung cancer?

The comparative effectiveness of alectinib versus ceritinib on overall survival in patients with ALK-positive non-small-cell lung cancer is uncertain because of a lack of head-to-head trials. Wilkinson et al. 2021 used real-world data on ceritinib from the Flatiron Health database (derived from US electronic health records) to form an external control to the alectinib arm of a phase 2 trial. The authors found a significant improvement in survival for those initiating alectinib. However, the study was at risk of residual bias from unmeasured confounding and missing baseline data on Eastern Cooperative Oncology Group Performance Status (ECOG PS) in patients having ceritinib (47% of patients had missing data).

Bias analysis methods were used to explore these risks. The e-value approach was used to estimate the strength of association (on the relative risk scale) that an unobserved confounder would need with both the intervention and mortality to remove the treatment effect. The estimated relative risk of 2.2 was substantially higher than for any observed confounders and was considered unlikely given the estimated imbalance for important but poorly captured confounders.

For missing ECOG PS data, they assumed the causes of missingness were non-random and that missing values in the ceritinib arm were likely to be worse than those predicted by multiple imputation. They argued that no plausible assumptions about the missing data could explain away the observed association between the intervention and mortality.

Table 3: examples of sensitivity analyses to examine robustness of results to data curation, study design, and analysis decisions

Each domain of bias is followed by example sensitivity or bias analyses.

Exposure misclassification

  • On-treatment analyses

  • Vary exposure definitions including, if relevant, days' supply, grace period, washout period, exposure-effect window and latency

Outcome misclassification

  • Adjust for known performance metrics

  • Quantitative bias analysis

Selection bias

  • Alternative patient eligibility criteria

Detection bias

  • Include measures of healthcare use as covariates

  • Restrict to those with regular contact with the health system before baseline

Follow-up time

  • As-started and on-treatment analyses

  • Restrict outcome period so it is similar between groups for informative censoring

  • Prevalent-user and new-user analyses

Reverse causation

  • Introduce or change lag time between exposure end and start of follow up for outcomes

Confounding

  • Add or remove selected confounders

  • Extend look-back period over which covariates are identified

  • Use negative controls (also known as falsification endpoints or probe variables) to estimate comparative effects using the same model on outcomes that should not be related to treatment (results can also be used to calibrate effect estimates)

  • Propensity score calibration to adjust observed effect estimates for unmeasured bias using variables observed in a validation study

  • Quantitative bias analysis

Missing data

  • Use alternative methods for handling missing data

  • Include missing variable indicators for covariates in statistical models

  • Quantitative bias analysis (for instance, assuming missing not at random mechanisms)

Model specification

  • Vary model specifications

  • Use analytical approaches with different assumptions (triangulation)

Data curation

  • Alternative categorisations of continuous variables or alternative data exclusions


Reporting

We provide general principles for the transparent reporting and good conduct of real-world evidence studies in the section on conduct of quantitative real-world evidence studies. The following reporting considerations are especially important for comparative effects studies:

  • Justification of the use of real-world evidence. This should cover, as relevant, the reasons for the absence of randomised evidence, the limitations of existing trials and the ability to produce meaningful real-world evidence for the specific research question.

  • Publish a study protocol (including statistical analysis plan) on a publicly accessible platform before the analysis is done.

  • Report studies in sufficient detail to enable the study to be reproduced by an independent researcher.

  • Present study design diagrams.

  • For each data source, provide the information needed to understand data provenance and fitness for purpose (see the section on assessing data suitability).

  • Justify the statistical methods used for addressing confounding and report them clearly (see appendix 3).

  • Clearly describe the exclusion of patients from the original data source to the final analytical sample, including reasons for exclusion, using patient flow (or attrition) diagrams.

  • Present characteristics of patients across treatment groups, before and after statistical adjustment if possible. For external control studies, differences in variable definitions and data collection should be clearly described.

  • Present results for adjusted and unadjusted analyses, and for all subgroup, sensitivity, and bias analyses.

Quality appraisal

Evidence developers should identify risks of bias at the study planning stage. These should be described alongside how design and analytical methods have been used to address them, and how robust the results are to deviations from assumptions in the main analysis using sensitivity or bias analysis. This can be done for specific domains of bias using the reporting methods in appendix 2. This information will help those completing (or critically appraising) risk of bias tools. The preferred risk of bias tool for non-randomised studies is ROBINS-I (Sterne et al. 2016), although it may not cover all risks of bias (D'Andrea et al. 2021). It should also be recognised that the uncertainty in non-randomised studies will not typically be fully captured by the statistical uncertainty in the estimated intervention effect (Deeks et al. 2003).

Developers should comment on the generalisability of study results to the target population in the NHS. This may draw on differences in patients, care settings, treatment pathways, or time periods, supported by information from the data suitability assessment. Developers should also discuss any methods used to address external validity bias, presenting the results of both adjusted and unadjusted analyses.