Appendix 1 – Data Suitability Assessment Tool (DataSAT)

See tools and resources for a downloadable DataSAT assessment template.

DataSAT assessment template

Research question

Add the research question here.

Data provenance

Item

Response

Data sources

For each contributing data source provide the name, version and date of data cut. Provide links to their websites, if available.

Data linkage and data pooling

Report which datasets were linked, how these were linked, and performance characteristics of the linkage. Note whether linkage was done by a third party (such as NHS Digital).

Clearly describe which data sources were pooled.

Type of data source

Describe the types of data source (for example, electronic health record, registry, audit, survey).

Purpose of data collection

Describe the main purpose of data collection (for example, clinical care, reimbursement, device safety, research study).

Data collection

Describe the main types of data collected (for example, clinical diagnoses, prescriptions, procedures, patient experience data), how data were recorded (for example, clinical coding systems, free text, remote monitoring, survey response), and who collected the data (for example, healthcare professional, self-reported, digital health technology). If the nature of data collection changed during the data period (for instance, a change in coding system or practices, or in data capture systems), describe the changes clearly. Describe any differences between data providers in how and what data were collected, and in data quality.

If additional data collection was done for a research study, please describe it, including how the validity and consistency of data collection were assured (for example, training).

Care setting

State the setting of care for each dataset used (for example, primary care, secondary care, specialist health centres, social services, home use [for wearable devices, or self-reported data on apps or websites]).

Geographical setting

State the geographical coverage of the data sources.

Population coverage

State how much of the target population is represented by the dataset (for example, population representativeness or patient accrual).

Time period of data

State the time period covered by the data.

Data preparation

Provide details of whether raw data were accessed for analysis, or whether the data owner had undertaken any data preparation steps such as cleansing or transformation. Mention whether centralised transformation to a common data model was undertaken. Include links to any relevant information including common data model type and version number and details of mapping.

Full details of data preparation specific to addressing the research question are covered in the section on reporting on data curation.

Data governance

Provide the details of the data controller and funding for each source. Describe the information governance processes for data access and use.

Data specification

Note whether a data specification document is available. This may include a data model, data dictionary, or both.

Data management plan and quality assurance methods

Note whether a data management plan or documentation of source quality assurance methods is available, with links to relevant documents.

Other documents

Note whether any other documentation is available. Provide hyperlinks or citations to key publications, if available.

If the dataset is available from the Health Data Research UK (HDRUK) innovation gateway, provide the hyperlink to its profile on the HDRUK website.

Data quality

Details of data quality should be provided for key study variables including population eligibility criteria, outcomes, interventions or exposures, and covariates.

Study variable: What type of variable (for example, population eligibility, outcome)
Target concept: Define the target concept (for example, myocardial infarction [MI])
Operational definition: Define operational definition. For example, MI defined by an ICD-10 code of I21 in the primary diagnosis position
Quality dimension: Choose: accuracy or completeness
How assessed: Describe how quality was assessed. Provide reference to previous validation studies if applicable.
Assessment result: Provide quantitative assessment of quality if available. For example, 'positive predictive value 85% (75% to 95%)'
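As an illustration of the operational definition and assessment result entries above, the sketch below flags an episode as MI when the first listed diagnosis is an ICD-10 I21 code, and computes a positive predictive value with a confidence interval. It is a minimal example under stated assumptions, not a prescribed method: the function names are hypothetical, the diagnosis list is assumed to be ordered with the primary diagnosis first, and the Wilson score interval is one of several reasonable interval choices.

```python
from math import sqrt
from statistics import NormalDist


def is_mi_episode(diagnoses):
    """Return True if the primary (first) diagnosis is an ICD-10 I21 code.

    Assumes `diagnoses` is an ordered list of ICD-10 code strings with the
    primary diagnosis in position 0 (a hypothetical data layout).
    """
    return bool(diagnoses) and diagnoses[0].upper().startswith("I21")


def ppv_with_wilson_ci(true_positives, false_positives, level=0.95):
    """Positive predictive value with a Wilson score confidence interval.

    true_positives: algorithm-positive records confirmed by the reference
    standard. false_positives: algorithm-positive records not confirmed.
    """
    n = true_positives + false_positives
    if n == 0:
        raise ValueError("no algorithm-positive records to assess")
    p = true_positives / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half_width, centre + half_width


# Illustrative counts only; these are not taken from any validation study.
ppv, lower, upper = ppv_with_wilson_ci(true_positives=85, false_positives=15)
print(f"PPV {ppv:.0%} ({lower:.0%} to {upper:.0%})")
```

Reporting the interval alongside the point estimate matches the format suggested for the assessment result.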

Data relevance

Please see recommendations for reporting data relevance.

Item

Response

Population

Describe the extent to which the analytical sample reflects the target population. This should consider any data exclusions (for example, because of missing data on key prognostic variables).

Care setting

Describe how well the care settings reflect routine care in the NHS.

Treatment pathway

Describe how the treatment pathways experienced by people in the data reflect routine care pathways in the NHS (including any diagnostic tests).

Availability of key study elements

Note how the dataset met the requirements of the research question in terms of availability of the necessary data variables including key population eligibility criteria, outcomes, intervention and covariates (including confounders and effect modifiers).

Study period

State the extent to which the time period covered by the data provides relevant information to decisions. This should cover any important changes to care pathways (including tests) or background changes in outcome rates.

Timing of measurements

Describe whether the timing of measurements meets the needs of the research question.

Follow up

Note whether the follow-up period available in the dataset is sufficient for assessing the outcomes.

Sample size

Provide the sample size of the target population in the dataset and demonstrate that it is adequate to generate robust results.

DataSAT – case study

Please note that the reporting for this case study is based on publicly available information in Wing et al. 2021.

Research question

What is the effect of the long-acting beta-2 agonist and inhaled corticosteroid combination product fluticasone propionate plus salmeterol compared with no exposure or exposure to salmeterol only in people with chronic obstructive pulmonary disease (COPD)?

Data provenance

Item

Response

Data sources

Clinical Practice Research Datalink (CPRD) GOLD

Hospital Episode Statistics (HES) Admitted Patient Care data.

Data linkage and data pooling

CPRD and HES are linked. Patients are matched using a centralised linkage algorithm run by NHS Digital. This is an 8-step deterministic linkage algorithm based on 4 identifiers: NHS number, sex, date of birth and postcode.

Linkage to HES data is possible for 75% of enrolled patients.

See information on linked data for CPRD.

Type of data source

HES = administrative records

CPRD = electronic health records

Purpose of data collection

Hospital Episode Statistics (HES) is derived from the Secondary Uses Service (SUS) data based on information submitted to NHS Digital by healthcare providers. Data collection is primarily intended to support the reimbursement of hospitals for the provision of services in England.

CPRD collects anonymised patient data from a network of GP practices across the UK. These data are originally recorded during a patient's contact with primary care services.

Data collection

CPRD = demographics, clinical diagnoses (Read v2 or SNOMED-CT), tests (medcode or SNOMED-CT), prescriptions (prodcode) including dose, route of administration and duration. CPRD GOLD collects fully coded patient electronic health records from GP practices using the Vision software system. Data are recorded by health and care staff working within the Vision software.

HES = diagnoses (ICD-10), procedures (OPCS-4), admission, discharge, type of care, basic demographics. HES data are collected during a patient's time at hospital. They may be recorded during the patient's interactions with health and care staff in the hospital, and are assembled by teams of clinical coders.

Care setting

HES = secondary care

CPRD = primary care

Geographical setting

HES = England

CPRD = a representative sample of UK general practices using Vision software. HES-linked CPRD data is available for England only.

Population coverage

CPRD GOLD has data for about 3 million currently registered people (around 4.74% of the UK population). See CPRD data highlights.

HES data covers all NHS Clinical Commissioning Groups in England.

Time period of data

The CPRD-linked HES dataset covers from January 2000 to January 2017.

Data preparation

No details available for CPRD. However, general practices are included only after demonstrating their records are of research quality.

HES applies centralised processing before the data are released for research:

Rules are run during the processing of the HES data set. These are in place to improve the value and quality of the data and include rules that validate the data within certain fields, derive additional fields and values, and remove records that are invalid or out of scope for the HES data set.

Data governance

CPRD is a centre of the MHRA, which is an executive agency of the Department of Health & Social Care (DHSC). DHSC is therefore the data controller for CPRD data.

HES data is controlled by the Health and Social Care Information Centre (also known as NHS Digital).

CPRD has received funding from the MHRA, Wellcome Trust, Medical Research Council, NIHR Health Technology Assessment programme, Innovative Medicines Initiative, UK Department of Health, Technology Strategy Board, Seventh Framework Programme EU, and various universities, contract research organisations and pharmaceutical companies.

HES data collection is mandated and funded by the UK Government.

Data protection and processing notice for CPRD.

Hospital episode statistics GDPR webpage.

Data specification

Fields in HES are derived from the NHS data model and the NHS data dictionary.

CPRD GOLD data specification document.

Data management plan and quality assurance methods

HES undertakes processing and data quality checks (see The processing cycle and HES data quality).

No data quality assurance information was identified for CPRD GOLD. However, records from individual general practices are assessed and only included in CPRD after being deemed of research quality.

Other documents

None.
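The stepwise deterministic linkage described under 'Data linkage and data pooling' can be illustrated with a short sketch. The source states only that the algorithm has 8 steps and uses NHS number, sex, date of birth and postcode, so the 3 match rules below are hypothetical stand-ins ordered from strictest to most lenient. They are not the actual NHS Digital steps, and the record layout and function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    nhs_number: Optional[str]
    sex: str
    date_of_birth: str  # ISO format, for example "1950-03-21"
    postcode: Optional[str]


# Hypothetical match rules, strictest first.
# These are NOT the real 8 steps of the NHS Digital algorithm.
MATCH_STEPS = [
    ("nhs_number", "sex", "date_of_birth", "postcode"),
    ("nhs_number", "sex", "date_of_birth"),
    ("sex", "date_of_birth", "postcode"),
]


def match_step(a, b):
    """Return the 1-based step at which two records match, or None."""
    for step, fields in enumerate(MATCH_STEPS, start=1):
        values_a = [getattr(a, f) for f in fields]
        values_b = [getattr(b, f) for f in fields]
        # A missing identifier can never contribute to an exact match.
        if None in values_a or None in values_b:
            continue
        if values_a == values_b:
            return step
    return None


def linkage_rate(primary, hospital):
    """Proportion of primary care records matching any hospital record."""
    linked = sum(
        1 for p in primary
        if any(match_step(p, h) is not None for h in hospital)
    )
    return linked / len(primary)
```

Recording the step at which each pair matched gives a simple indicator of match strength, which is one way to report the performance characteristics of the linkage.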

Data quality

Study variable: Population
Target concept: COPD
Operational definition: CPRD diagnostic (Read v2) codes for COPD (see codelist in supplementary material of Quint et al. 2014)
Quality dimension: Accuracy
How assessed: Previously published validation study comparing algorithms for identifying people with COPD with physician review questionnaire as gold standard (Quint et al. 2014)
Assessment result: Positive predictive value (PPV): 87% (95% confidence interval [CI] 78% to 92%)

Study variable: Population
Target concept: Disease severity
Operational definition: Global Initiative for Chronic Obstructive Lung Disease (GOLD) stage derived from spirometry measurements (see codelist)
Quality dimension: Completeness
How assessed: Proportion of patients with missing spirometry data
Assessment result: 20%

Study variable: Intervention
Target concept: Fluticasone propionate + salmeterol
Operational definition: CPRD prescribing record matching definition of drug treatment determined by codelist
Quality dimension: Accuracy
How assessed: CPRD prescribing data is expected to be highly accurate
Assessment result: n/a

Study variable: Outcome
Target concept: COPD exacerbation
Operational definition: Any of the following: (1) CPRD diagnostic (Read) code for lower respiratory tract infection or acute exacerbation of COPD (AECOPD); (2) a prescription of a COPD-specific antibiotic combined with oral corticosteroid (OCS) for 5 to 14 days; (3) a record (Read code) of 2 or more respiratory symptoms of AECOPD with a prescription of COPD-specific antibiotics and/or OCS on the same day. See codelist
Quality dimension: Accuracy
How assessed: Previously published validation study comparing algorithms for identifying people with COPD exacerbations with physician review questionnaire as gold standard (Rothnie et al. 2016)
Assessment result: PPV: 86% (95% CI 83% to 88%); Sensitivity: 63% (95% CI 55% to 70%)

Study variable: Outcome
Target concept: All-cause mortality
Operational definition: Record in Office for National Statistics (ONS) mortality statistics (centrally linked to CPRD data)
Quality dimension: Accuracy
How assessed: ONS mortality records are the gold standard data for deaths
Assessment result: n/a

Study variable: Covariate (confounder)
Target concept: Alcohol intake
Operational definition: Reported directly in CPRD (closest to index date)
Quality dimension: Completeness
How assessed: Proportion of patients with missing data on alcohol intake
Assessment result: 30%
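The 3-part operational definition of a COPD exacerbation in the table above can be sketched as a simple rule. This is an illustrative reading only, not the validated algorithm from Rothnie et al. 2016: the per-day summary structure and its field names are hypothetical, codelist matching is assumed to have been done upstream, and the 5 to 14 day window is assumed to apply to the length of the prescribed course.

```python
from dataclasses import dataclass


@dataclass
class DaySummary:
    """Hypothetical per-patient, per-day flags derived from codelists."""
    lrti_or_aecopd_code: bool = False    # diagnostic Read code recorded
    antibiotic_prescribed: bool = False  # COPD-specific antibiotic
    ocs_prescribed: bool = False         # oral corticosteroid
    course_days: int = 0                 # assumed length of prescribed course
    symptom_count: int = 0               # respiratory symptom codes recorded


def is_exacerbation(day):
    """Return True if any of the 3 criteria in the definition is met."""
    # Criterion 1: diagnostic code for LRTI or acute exacerbation of COPD.
    if day.lrti_or_aecopd_code:
        return True
    # Criterion 2: antibiotic combined with OCS for 5 to 14 days.
    if (day.antibiotic_prescribed and day.ocs_prescribed
            and 5 <= day.course_days <= 14):
        return True
    # Criterion 3: 2 or more symptoms plus antibiotic and/or OCS that day.
    if day.symptom_count >= 2 and (
            day.antibiotic_prescribed or day.ocs_prescribed):
        return True
    return False
```

Writing each criterion as a separate branch keeps the rule auditable against the written definition and the codelist.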

Data relevance

Item

Response

Population

Patients in CPRD have similar demographic characteristics to the wider UK population. Results from CPRD are generally expected to generalise to the wider eligible population.

Complete records analysis was done excluding records with missing data on socioeconomic status, alcohol consumption and BMI. All these variables had less than 5% of the data missing.

Around one-fifth of patients were excluded because they did not have spirometry measurements recorded in the CPRD. Those without measurements tend to have less contact with health services, which could impact on the generalisability of results.

Care setting

Appropriate. COPD drugs are typically administered in primary care (CPRD) while relevant events may be observed in primary or secondary care (CPRD or HES).

Treatment pathway

The data represents routine practice in the NHS.

Availability of key study elements

Sufficient data on exposures and outcomes are available. Only prescribing data, not dispensing data, is available from CPRD, but prescribing is expected to be a good proxy for dispensing.

No information was available on negative reversibility spirometry results which may be a key confounder.

Dosage information is limited in CPRD.

Study period

There have been no major changes to UK clinical practice for the management of COPD since the study period.

Timing of measurements

The longitudinal nature of the analysis allows for the research question to be answered. The date of entry is expected to reflect the actual timing of clinical events well.

Follow up

The average follow up of 2 years is sufficient for the primary outcome of COPD exacerbations to have occurred.

Sample size

The required sample size for COPD exacerbations was estimated to be 600 per arm at 80% power and 5% significance (see Wing et al. 2021 for details). The actual sample size of about 2,500 per arm far exceeds this.
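For readers less familiar with how a per-arm sample size follows from power and significance level, the sketch below uses the standard normal-approximation formula for comparing 2 proportions. It is a generic illustration only: the function name and the example proportions are hypothetical, the formula is not necessarily the method used in Wing et al. 2021, and it is not expected to reproduce the figure of 600 per arm.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p1, p2, power=0.80, alpha=0.05):
    """Per-arm sample size to detect a difference between 2 proportions.

    Uses the unpooled normal approximation with a two-sided test:
    n = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p1-p2)^2
    """
    if p1 == p2:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical event proportions, chosen only to show the calculation.
n = sample_size_per_arm(p1=0.50, p2=0.40)
print(f"Required sample size per arm: {n}")
```

Comparing the required figure with the achieved sample size, as done above, is the kind of evidence the sample size item asks for.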