4 Approach to evidence generation

An approach to addressing the evidence gaps through real-world data collection is considered, and any strengths and weaknesses are highlighted.

Most of the technologies do not have ongoing studies that will address the evidence gaps, although Lunit INSIGHT CXR has ongoing research that may address some of them. So, for these technologies, additional evidence generation is needed.

qXR has ongoing research that may address all the essential and important evidence gaps, so it may not need additional evidence generation.

4.1 Evidence generation plan

For technologies lacking information on diagnostic accuracy and technical failure rates, diagnostic accuracy studies should be done to provide this evidence.

Other evidence gaps can be addressed through a real-world historical control study alongside a qualitative survey.

Diagnostic accuracy study

This could be done as a diagnostic cross-sectional study. The study would assess agreement between a clinical reviewer alone and a clinical reviewer aided by the software in identifying abnormal X-rays (those needing CT follow-up). It would be possible to report diagnostic accuracy (including sensitivity, specificity, and negative and positive predictive values), variation across reviewers, and technical failure rates.
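
To illustrate how the headline accuracy outcomes could be derived from such a study, a minimal sketch follows. The counts and function name are hypothetical placeholders, not part of this guidance; any real analysis would follow the study's agreed statistical analysis plan.

```python
# Minimal sketch: diagnostic accuracy metrics from 2x2 counts against the
# reference standard (abnormal finding confirmed on follow-up CT).
# All counts below are hypothetical placeholders.

def accuracy_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return sensitivity, specificity, PPV and NPV from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # abnormal X-rays correctly flagged
        "specificity": tn / (tn + fp),  # normal X-rays correctly cleared
        "ppv": tp / (tp + fp),          # flagged X-rays truly abnormal
        "npv": tn / (tn + fn),          # cleared X-rays truly normal
    }

# Example: reviewer aided by the software versus the reference standard.
aided = accuracy_metrics(tp=90, fp=30, fn=10, tn=370)

# Technical failure rate: images the software could not process.
failed, total = 12, 512
technical_failure_rate = failed / total

print(aided, technical_failure_rate)
```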

Real-world historical control study

A historical control study could compare outcomes before and after the implementation of artificial intelligence (AI) software. This could assess the number and proportion of chest X-rays referred for CT scan, time from chest X-ray to completion of the report, number of chest X-rays assessed per reviewer per day, and time from receipt of chest X-ray to CT scan report. The grade of NHS staff reviewing and reporting should also be collected.

This study could also collect additional diagnostic outcomes comparing AI-assisted review with reviewer alone. The study should assess whether abnormal findings on an X-ray correspond to disease-related abnormal findings on a follow-up CT scan (the reference standard). This would measure the positive predictive value aspect of diagnostic accuracy. Technical failure rates should also be reported. Information on the number of cancers detected and the stage of cancer at detection could also be collected.

The study could also collect information on missed cancers among people who were not referred for chest CT during the study period, although this would give a biased estimate of false negatives because not all missed cancers will be picked up during the observation period.

Data collection for each technology could be at a single centre or, ideally, across multiple centres. The study should also collect data on the costs of implementing these technologies in routine clinical practice.

Qualitative survey

A qualitative survey is suggested to collect information on the ease of use and acceptability of the software to clinicians. The survey should include open-ended questions to give respondents the freedom to provide detailed insights. A range of views and perspectives should be collected that is representative of the clinical reviewers at the sites where the technology is implemented.

4.2 Real-world data collections

The NHS England Secure Data Environment (SDE) service could potentially support evidence generation. This platform provides access to high-quality NHS health and social care data for research and analysis. The Diagnostic Imaging Data Set within this service may be useful because it collects information about the diagnostic imaging tests that people have and can be linked to other datasets.

There may be local or regional data collections that capture the outcome measures specified in the research recommendation. The sub-national secure data environments could be a regional alternative for data collection.

The quality and coverage of real-world data collections are of key importance when they are used in generating evidence. Active monitoring and follow-up through a central coordinating point is an effective and viable approach to ensuring good-quality data with high coverage. NICE's real-world evidence framework also provides detailed guidance on assessing the suitability of a real-world data source to answer a specific research question.

4.3 Data to be collected

The following outcomes have been identified for collection through the suggested studies:

Quantitative

  • time from chest X-ray to report

  • time from chest X-ray to CT scan report

  • time from chest X-ray to diagnosis

  • number of chest X-rays reviewed per reviewer and centre per day

  • of those who had a chest X-ray, the number and proportion of people referred to have a chest CT scan

  • grade of NHS staff reviewing and reporting chest X-ray

  • agreement between the AI-derived software and clinician review for normal and abnormal interpretation of chest X-ray (an illustrative agreement calculation is sketched after this list)

  • number and proportion of chest X-rays identified as abnormal that are confirmed as abnormal by CT

  • number of cancers detected

  • stage of cancer at detection

  • number of cancers missed, that is, those initially not picked up as abnormal, later referred to chest CT in the study period, and any subsequent cancer diagnosis

  • technical failure and rejection rates

  • all training and software implementation costs

  • characteristics of patients, including age, sex, weight and height or body mass index (BMI), comorbidities such as asthma, scoliosis, interstitial lung disease and chronic obstructive pulmonary disease (COPD), family history of lung cancer, and risk groups such as young people who do not smoke.
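
For the agreement outcome above, one common way to summarise it is overall percent agreement alongside a chance-corrected statistic such as Cohen's kappa between the software output and the clinician's normal or abnormal reading. The sketch below is illustrative only; the example readings are hypothetical and not drawn from any study data.

```python
# Minimal sketch: agreement between AI output and clinician review on
# normal/abnormal chest X-ray interpretation (hypothetical readings).

from typing import Sequence

def percent_agreement(ai: Sequence[str], clinician: Sequence[str]) -> float:
    """Proportion of X-rays where the AI and the clinician give the same reading."""
    return sum(a == c for a, c in zip(ai, clinician)) / len(ai)

def cohens_kappa(ai: Sequence[str], clinician: Sequence[str]) -> float:
    """Chance-corrected agreement for two raters and two categories."""
    n = len(ai)
    po = percent_agreement(ai, clinician)                       # observed agreement
    p_ai_abn = sum(a == "abnormal" for a in ai) / n             # AI abnormal rate
    p_cl_abn = sum(c == "abnormal" for c in clinician) / n      # clinician abnormal rate
    pe = p_ai_abn * p_cl_abn + (1 - p_ai_abn) * (1 - p_cl_abn)  # agreement expected by chance
    return (po - pe) / (1 - pe)

ai_reads = ["abnormal", "normal", "normal", "abnormal", "normal"]
clinician_reads = ["abnormal", "normal", "abnormal", "abnormal", "normal"]

print(percent_agreement(ai_reads, clinician_reads), cohens_kappa(ai_reads, clinician_reads))
```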

Qualitative

  • ease of use and acceptability of the software

  • perceived accuracy of the technology in identifying abnormalities

  • perceived appropriateness of image triage

  • perceived impact on speed of review and reporting

  • perceived performance of the software for people with underlying conditions and high-risk groups

  • clinician perspective on the use of AI-derived software.

Other information

The company should describe the process for monitoring the performance of the technologies while they are used in clinical practice. See NICE's evidence standards framework for digital health technologies for guidance on post-deployment reporting of changes in performance. This should include:

  • future plans for updating the technology, including how regularly the algorithms are expected to be retrained, re-versioned or have their functionality changed

  • the sources of retraining data, and how the quality of this data will be assessed

  • processes in place for measuring performance over time, to detect any effects of planned changes or environmental factors that may affect performance

  • processes in place to detect decreasing performance in certain groups of people over time (an illustrative monitoring sketch follows this list)

  • whether there is an independent overview process for reviewing changes in performance

  • an agreement on how and when changes in performance should be reported and to whom (evaluators, patients, carers and healthcare professionals).
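
As one illustration of how decreasing performance over time or in particular groups could be flagged, a minimal sketch follows. The threshold, subgroups and records are hypothetical assumptions, not requirements of this guidance; real monitoring would use the metrics and alerting rules agreed for the technology.

```python
# Minimal sketch: flag a drop in positive predictive value over time and by
# subgroup against a pre-agreed threshold. All data and thresholds are
# hypothetical placeholders.

from collections import defaultdict

# Each record: (month, subgroup, flagged_abnormal_by_ai, confirmed_on_ct)
records = [
    ("2024-01", "COPD", True, True),
    ("2024-01", "COPD", True, False),
    ("2024-01", "no comorbidity", True, True),
    ("2024-02", "COPD", True, False),
    ("2024-02", "no comorbidity", True, True),
]

PPV_THRESHOLD = 0.6  # hypothetical alerting threshold

ppv_counts = defaultdict(lambda: [0, 0])  # (month, subgroup) -> [confirmed, flagged]
for month, subgroup, flagged, confirmed in records:
    if flagged:
        ppv_counts[(month, subgroup)][1] += 1
        if confirmed:
            ppv_counts[(month, subgroup)][0] += 1

for (month, subgroup), (confirmed, flagged) in sorted(ppv_counts.items()):
    ppv = confirmed / flagged
    status = "ALERT" if ppv < PPV_THRESHOLD else "ok"
    print(f"{month} {subgroup}: PPV={ppv:.2f} ({status})")
```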

The company should describe any actions taken in the design of the technology to mitigate algorithmic bias that could lead to unequal impacts between different groups of people.