3 Committee discussion

The diagnostics advisory committee considered evidence on artificial intelligence (AI) technologies to help detect or characterise colorectal polyps from several sources. This included evidence submitted by the companies, a review of clinical and cost evidence by the external assessment group (EAG), and responses from stakeholders. Full details are available in the project documents for this guidance.

The condition

3.1

Colorectal polyps may cause no symptoms (asymptomatic), but some may bleed or cause abdominal pain or changes in bowel habits. Risk factors for colorectal polyps include older age, family medical history, lifestyle factors and conditions such as inflammatory bowel disease (IBD). Most colorectal cancers develop from polyps, so finding and removing them early is vital.

3.2

Colonoscopy allows healthcare professionals to detect and remove polyps before they become cancerous. This can reduce the risk of bowel cancer by up to 90%. Two types of polyps are especially important: adenomatous polyps, which are precancerous, and sessile serrated lesions (SSLs), which are harder to spot but can also develop into cancer. Polyps that are smaller than 5 mm are less likely to be cancerous and are more likely to be missed in standard care. Early and accurate detection helps ensure that people get the right treatment at the right time.

Diagnostic pathways

3.3

In England, colonoscopy is used:

to screen for colorectal cancer in adults aged 50 and over after a positive faecal immunochemical test
for surveillance in people with a higher risk of developing colorectal cancer, such as people with IBD, a family medical history of colorectal cancer, or previous polyps or cancer
to help investigate symptoms consistent with colorectal cancer, such as bleeding, pain or anaemia.

3.4

Therapeutic colonoscopy allows the endoscopist to remove polyps and take tissue samples during the procedure so that a diagnosis can be confirmed under a microscope (histology). But this has a cost, so the bowel cancer screening programme recommends that in certain cases, where the polyp is small and appears benign, the polyp can be removed but not analysed. This is known as the 'resect-and-discard' strategy. Similarly, the European Society of Gastrointestinal Endoscopy recommends that diminutive (smaller than 5 mm) rectosigmoid polyps that are predicted to be non-adenomatous with high confidence can be left in place. This is known as the 'diagnose-and-leave' strategy.

Unmet need

3.5

The demand for colonoscopy in the NHS is high, and missed polyps can lead to a delayed diagnosis of colorectal cancer. The clinical experts explained that smaller polyps and particular types of polyps, such as SSLs, can be more difficult to detect. AI software using computer-aided detection (CADe) may help by acting as a second observer, improving polyp detection and potentially leading to better clinical outcomes. Some software also uses computer-aided diagnosis (CADx) to support diagnosis and characterisation, potentially helping guide decisions about polyp removal and reducing unnecessary procedures. Using AI software could also help to reduce variation in polyp detection between endoscopists with different levels of skill and experience.

Clinical effectiveness

Overview of evidence base

3.6

The EAG searched for, identified and reviewed studies that used AI technologies during live colonoscopy, which reflects real-world use. It used studies in which AI software supported, not replaced, endoscopists. A wide range of different outcomes was reported across the studies. For CADe, diagnostic accuracy data was limited and often not relevant to clinical practice. Adenoma detection rate (ADR) was the most widely reported outcome and was used as the main outcome by the EAG in its economic modelling. ADR is the percentage of screening colonoscopies in which at least 1 adenomatous (precancerous) polyp is detected. For CADx, the EAG used diagnostic accuracy studies that supported endoscopists to characterise polyps and compared this to histology as the reference standard.
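As a minimal sketch of the ADR definition above, the following Python function computes ADR from the number of adenomas found in each procedure. The function name and the example counts are illustrative assumptions and are not taken from the EAG's analysis.

```python
def adenoma_detection_rate(adenomas_per_procedure):
    """Return ADR: the percentage of colonoscopies with at least 1 adenoma detected."""
    if not adenomas_per_procedure:
        raise ValueError("at least one colonoscopy is required")
    with_adenoma = sum(1 for n in adenomas_per_procedure if n >= 1)
    return 100.0 * with_adenoma / len(adenomas_per_procedure)


# Hypothetical counts of adenomas found in 8 screening colonoscopies
counts = [0, 2, 0, 1, 0, 0, 3, 0]
adr = adenoma_detection_rate(counts)  # 3 of the 8 procedures found at least 1 adenoma
```

Note that a procedure counts once towards ADR however many adenomas were found in it.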

3.7

The EAG identified and reviewed 70 studies on AI-assisted colonoscopy, which focused mainly on polyp detection (CADe). Most of the studies were randomised controlled trials (RCTs) comparing AI-assisted colonoscopy with standard colonoscopy. The quality and quantity of evidence varied across AI technologies, with some supported by more robust data than others.

3.8

The EAG did meta-analyses for each AI technology individually, focusing mainly on ADR as the key outcome.

Suitability of ADR as an effectiveness measure

3.9

ADR was the only outcome that was widely reported in RCTs for all the technologies. The EAG explained that there is evidence to show that an improvement in ADR leads to a reduction in the number of colorectal cancer cases detected after colonoscopy. The EAG added that diagnostic accuracy data was rarely reported for software used to aid detection. It said this is because the software aims to improve on the gold standard, which is colonoscopy without AI software.

3.10

The clinical experts noted that ADR only provides information about the proportion of colonoscopies in which at least 1 adenomatous polyp was detected and not whether the software is accurate in detecting all polyps. So, some potentially cancerous polyps could still be left behind even when ADR is improved. The clinical experts explained that the size and type of polyp detected are very important because smaller polyps (less than 5 mm) are much less likely to become cancerous, whereas advanced or larger adenomas are associated with a higher risk. They also noted that SSLs are important because they are harder to detect so could be missed, but they also have the potential to become cancerous. The clinical experts expressed concern that AI software may primarily pick up smaller, less advanced polyps and said that an endoscopist is less likely to miss more advanced or larger polyps without AI. In addition, the clinical experts were concerned that AI may not help detect SSLs, which can also cause cancer. They also commented on the observational evidence that demonstrates the link between increased ADR and reduced colorectal cancer rates. They explained that this evidence was collected before the introduction of AI, so the increases in ADR were likely to be due to improved endoscopy technique, which can increase visualisation of the colon so that more polyps are identified. AI would not improve endoscopist technique or increase visualisation. Therefore, this evidence may not be generalisable to AI technologies.

3.11

The committee concluded that ADR is a useful outcome to provide evidence on whether the software improves detection overall. But it said that reporting ADR separately by polyp type and size is important to assess whether the software may be able to reduce colorectal cancer rates. It said that this is because only increasing the detection of smaller polyps, which are less likely to become cancerous, may have a very small effect on colorectal cancer rates. The clinical experts noted that other outcomes, such as adenomas per colonoscopy, may also be informative and should be collected.
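The distinction between overall ADR, size-specific ADR and adenomas per colonoscopy can be illustrated with a short Python sketch. The function names, the 10 mm cut-off used in the example and all polyp sizes are hypothetical and chosen only to show how the measures can diverge.

```python
def adr(procedures, min_size_mm=0):
    """Percentage of procedures with at least 1 adenoma of at least min_size_mm."""
    hits = sum(1 for sizes in procedures if any(s >= min_size_mm for s in sizes))
    return 100.0 * hits / len(procedures)


def adenomas_per_colonoscopy(procedures):
    """Mean number of adenomas detected per procedure."""
    return sum(len(sizes) for sizes in procedures) / len(procedures)


# Each inner list holds the sizes (mm) of adenomas found in one hypothetical procedure
standard = [[], [12], [], [3], []]
with_ai = [[2], [12, 3], [], [3, 4], []]

overall = (adr(standard), adr(with_ai))          # overall ADR rises
large = (adr(standard, 10), adr(with_ai, 10))    # ADR for adenomas of 10 mm or more is unchanged
apc = (adenomas_per_colonoscopy(standard), adenomas_per_colonoscopy(with_ai))
```

In this made-up example the overall ADR and adenomas per colonoscopy both increase, but only because more small adenomas are found.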

Effectiveness for detection

3.12

The EAG reported pooled ADR data for each technology, which showed a statistically significant improvement for most technologies. It also acknowledged the nationwide study of AI in adenoma detection during colonoscopy (NAIAD) as an important piece of UK-based real-world evidence supporting 1 of the technologies, noting its large sample size and practical relevance. Where possible, the EAG reported ADR separately for advanced and non-advanced adenomas, SSLs and by polyp size. It also reported other outcomes when available, such as the number of adenomas found per colonoscopy. For advanced adenomas and SSLs, the AI software improved the ADR, but this was not statistically significant for any technology. For non-advanced adenomas, the impact of AI was greater, suggesting that the software is better at detecting smaller polyps. AI technologies may improve detection of smaller polyps (5 mm or less, and 6 mm to 9 mm) compared with polyps of 10 mm or more, but the evidence is inconsistent.

3.13

The committee concluded that using the AI technologies improves ADR overall. But it said that it is uncertain whether the technologies improve the detection of more clinically significant polyps, such as advanced or larger adenomas or SSLs. So, there is uncertainty about whether the increase in ADR overall would translate into a reduction in colorectal cancer rates.

Quality of evidence for different technologies

3.14

The EAG said that all the studies reported that the addition of AI led to improved ADRs, as measured by risk ratios. But it said that this was not statistically significant for 3 of the 10 technologies (Argus, Discovery and Endoscopic Multimedia Information System [EMIS]). These 3 technologies generally had fewer trials with fewer people enrolled.
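To illustrate why a risk ratio above 1 may still not be statistically significant when few people are enrolled, the sketch below computes a risk ratio and an approximate 95% confidence interval for a single 2-arm trial. It uses a standard Wald interval on the log scale, which is an assumption made for illustration. It is not the EAG's meta-analysis method, and the trial counts are hypothetical.

```python
import math


def risk_ratio_ci(events_ai, n_ai, events_std, n_std, z=1.96):
    """Risk ratio of adenoma detection (AI versus standard) with a Wald CI on the log scale."""
    rr = (events_ai / n_ai) / (events_std / n_std)
    se_log = math.sqrt(1 / events_ai - 1 / n_ai + 1 / events_std - 1 / n_std)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


def is_significant(lower, upper):
    """True if the interval excludes a risk ratio of 1 (no difference)."""
    return lower > 1.0 or upper < 1.0


# Same hypothetical detection rates (45% versus 38%) at two trial sizes
small = risk_ratio_ci(45, 100, 38, 100)
large = risk_ratio_ci(450, 1000, 380, 1000)
```

With these invented numbers the point estimate is identical in both trials, but only the larger trial gives an interval that excludes 1.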

3.15

The committee concluded that these 3 technologies could not be recommended for use with evidence generation because it was too uncertain whether they could improve ADR. It said that more research is needed.

Evidence in screening, symptomatic and surveillance populations

3.16

Most of the studies were done in mixed populations. The EAG explored subgroup analyses of CADe by colonoscopy indication (screening, symptomatic and surveillance). No strong differences in AI software performance were found across patient subgroups, although the EAG considered that subtle differences could not be ruled out.

3.17

The clinical experts discussed differences between populations and noted that the risk of having polyps or cancer would differ between them, but that, for most populations, the software should work the same. They noted that in the bowel cancer screening programme, endoscopists must have special accreditation and could be expected to be more skilled. So, the scope to benefit from adding AI software for polyp detection may be lower for this population.

3.18

The EAG noted that it had explored endoscopist experience and skill in another subgroup analysis. It said that the evidence appeared mixed, with some studies suggesting greater benefit for less experienced endoscopists and others suggesting the opposite. But the EAG said that inconsistent reporting, differing definitions of experience and limited data made interpretation difficult, which limited confidence in the conclusions. The clinical experts added that it is difficult to define experience and skill because baseline ADR can be affected by the population and case mix. They also said that years of experience do not necessarily correlate with increased skill.

3.19

The committee concluded that there is no clear evidence to suggest that the AI technologies are less effective in some populations.

Populations excluded from studies

3.20

The committee noted that some high-risk groups for colorectal cancer, such as people with Lynch syndrome, familial adenomatous polyposis (FAP) or IBD, were excluded from most studies. The clinical experts discussed whether the software might work differently in these groups. They noted that people with IBD or Lynch syndrome may have differences in their bowel appearance and have different polyp types. So, they thought that the AI technologies may not work as well for these people, particularly if people with these conditions were not represented well in datasets used to train the technologies. Also, because these groups are often excluded from clinical trials, the effectiveness of AI technologies for people with these conditions is more uncertain. The clinical experts said that people with FAP would usually present with lots of smaller polyps, but these would look the same as polyps in someone without this condition. So, they said it is more likely that the AI technology would perform in a similar way for people with this condition as for people without it.

3.21

The committee concluded that more research is needed to understand how well the AI technologies work for people with IBD or Lynch syndrome before they can be used in the NHS for this population.

Overreliance

3.22

The committee discussed the potential for deskilling among endoscopists because of overreliance on AI software. The EAG noted that the large real-world NAIAD study evaluated 1 of the technologies in an NHS setting across 3 phases. Phase 1 was before the introduction of the software and phase 3 was after withdrawal of the software. But the committee judged that it would be difficult to draw conclusions on whether any changes seen in ADR between phases 1 and 3 were caused by overreliance or by other factors, such as changes in case mix. The clinical experts also noted studies done in other countries that showed a reduction in ADR from baseline after AI was removed. The committee concluded that it is difficult to control for other factors that may affect detection rates. So, it said that it is difficult to draw any strong conclusions from studies not specifically designed to investigate the potential of AI software to deskill endoscopists.

False positives

3.23

The committee raised the lack of specificity data for computer-aided detection. It noted that there was very little information available on how many times the technologies incorrectly flagged something as a polyp, known as a false positive. The EAG confirmed there was limited data on this for detecting polyps. But it noted that what was reported did not suggest this was a problem. It added that definitions varied, but it was usually defined as disagreement between the AI software and the endoscopist. The committee concluded that false positives should be identified by the endoscopist. But it said that an increase in detection, and therefore removal of polyps that are benign, may increase laboratory costs and could affect surveillance intervals and lead to an increase in colonoscopies. The committee concluded that more evidence is needed on the impact of introducing the technologies on the management of polyps. It said that more evidence is particularly needed on changes in decisions on patient follow up, surveillance intervals, and additional excision and testing of polyps.

Effectiveness for characterisation

3.24

The EAG found that evidence on the CADx functionality of the technologies was limited and inconsistent, with variable diagnostic accuracy and several methodological concerns. Key concerns included:

the autonomous use of the software (without endoscopist review), which does not reflect how the technologies would be used in the NHS
only reporting the results for polyps diagnosed with 'high confidence', which could limit the clinical relevance and bias the results.

As a result, the EAG considered the evidence insufficient to support strong conclusions about the effectiveness of the technologies used in this context.

3.25

The clinical experts also expressed concern about how some of the technologies classify SSLs as non-significant polyps. They said that this is not an accurate characterisation because they have the potential to cause cancer.

3.26

The committee concluded that more research is needed to address some of the methodological concerns identified by the EAG. It stated that the diagnostic accuracy (sensitivity and specificity) of the AI technologies when used alongside endoscopist judgement to characterise different types of polyps would be an important outcome. It also noted that the AI technologies should be able to accurately differentiate between polyps that may and may not potentially cause cancer, including SSLs.
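For reference, sensitivity and specificity against histology can be computed from a 2-by-2 table as sketched below. Treating a 'positive' as a polyp that histology confirms as potentially cancerous is an assumption made for illustration, and the counts are hypothetical.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Return (sensitivity, specificity) with histology as the reference standard.

    tp: characterised as potentially cancerous, confirmed by histology
    fp: characterised as potentially cancerous, histology disagrees
    fn: characterised as not potentially cancerous, histology disagrees
    tn: characterised as not potentially cancerous, confirmed by histology
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts for 150 characterised polyps
sensitivity, specificity = diagnostic_accuracy(tp=80, fp=5, fn=20, tn=45)
```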

Procedure time and usability

3.27

The patient experts noted that procedure length is a key consideration for patients and questioned whether AI technologies would influence this. The EAG explained that there was some evidence that procedure length increases with use of the technologies but that this was only by 1 to 2 minutes. The committee concluded that the AI technologies do not appear to have a substantive impact on procedure length. The committee noted comments received during consultation stating that some endoscopists have found the technologies distracting. It agreed that data on procedure time and endoscopist experience of using the technologies would be useful.

Cost effectiveness

Clinical-effectiveness inputs

3.28

The EAG developed a decision-tree model to simulate the impact of the AI technologies on colonoscopy. A hypothetical blended cohort of people (to reflect screening, surveillance and symptomatic groups) was assigned to 1 of 5 disease states. These were: no significant pathology, IBD, low-risk adenoma present, advanced adenoma present, and colorectal cancer. It was assumed that the colonoscopy would either detect all adenomas present or miss some. Sensitivity for standard colonoscopy without AI was estimated using published adenoma miss rates. The AI effectiveness was modelled using ADR as a proxy for sensitivity by applying the risk ratio from the meta-analyses to the colonoscopy sensitivity. Where data allowed, the EAG used ADR for advanced and non-advanced adenomas to reflect the improvements in detection for high- and low-risk adenomas. But data limitations meant that a single ADR estimate was applied for some technologies, which would be likely to overstate the effect of the software on the detection rate for higher-risk adenomas.
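One way to read 'applying the risk ratio to the colonoscopy sensitivity' is as a simple multiplication, sketched below in Python. The multiplicative form, the cap at 1 and all input values are assumptions made for illustration. The EAG's exact implementation is not described here.

```python
def sensitivity_with_ai(miss_rate, risk_ratio):
    """Scale standard colonoscopy sensitivity by the ADR risk ratio, capped at 1.

    miss_rate: assumed adenoma miss rate for standard colonoscopy (0 to 1)
    risk_ratio: assumed pooled ADR risk ratio for AI versus standard colonoscopy
    """
    standard_sensitivity = 1.0 - miss_rate
    return min(1.0, standard_sensitivity * risk_ratio)


# Hypothetical inputs: separate risk ratios by adenoma type, where reported
non_advanced = sensitivity_with_ai(miss_rate=0.25, risk_ratio=1.2)
advanced = sensitivity_with_ai(miss_rate=0.10, risk_ratio=1.05)

# Applying a single overall risk ratio to advanced adenomas gives a higher value
advanced_single_rr = sensitivity_with_ai(miss_rate=0.10, risk_ratio=1.2)
```

The last line shows, with made-up numbers, how applying a single overall estimate to higher-risk adenomas could overstate the benefit.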

3.29

The committee considered the use of ADR as a surrogate outcome for sensitivity and noted it may not fully capture clinical benefit. It said that this was particularly in relation to long-term outcomes, such as post-colonoscopy colorectal cancer. The committee highlighted its concerns that the clinical evidence suggested that ADR improvements may be driven by increased detection of small, low-risk polyps. It recalled that these may not translate into meaningful health gains if the improvement in ADR does not result in fewer cases of colorectal cancer. This view was confirmed by a clinical expert, who stated that less than 0.1% of small (less than 5 mm) polyps develop into colorectal cancer. The committee noted that this would not be captured in the model for technologies that did not report ADR for advanced and non-advanced adenomas separately. The committee also recalled that even when ADR was reported for advanced adenomas separately, the results were not statistically significant (see section 3.12). So, the committee said that there was still uncertainty in the results for these technologies.

Model structure

3.30

The decision-tree model developed by the EAG modelled only the initial colonoscopy procedure. Lifetime-cost and quality-adjusted life year (QALY) outcomes were applied at the end of each branch to reflect the lifetime consequences of missing polyps. These were adjusted to reflect a delayed diagnosis for people with missed or misdiagnosed adenomas or colorectal cancer. The values for lifetime-cost and QALY outcomes were ultimately derived from the Microsimulation Model in Cancer of the Bowel (MiMiC-Bowel) individual patient simulation model developed by the Sheffield Centre for Health and Related Research.
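The general idea of a decision tree with lifetime payoffs at the end of each branch can be sketched as follows. This is a simplified illustration with 2 disease states rather than the 5 in the EAG's model. All prevalences, costs and QALYs are hypothetical placeholders, and technology costs are left out.

```python
def expected_outcomes(states, sensitivity):
    """Expected lifetime cost and QALYs per person for one colonoscopy strategy.

    states: {name: (prevalence, payoff_if_detected, payoff_if_missed)},
            where each payoff is a (cost, qalys) pair
    sensitivity: {name: probability that pathology in that state is detected}
    """
    cost = qalys = 0.0
    for name, (prevalence, detected, missed) in states.items():
        p_detect = sensitivity.get(name, 1.0)  # states with nothing to miss
        cost += prevalence * (p_detect * detected[0] + (1 - p_detect) * missed[0])
        qalys += prevalence * (p_detect * detected[1] + (1 - p_detect) * missed[1])
    return cost, qalys


# Hypothetical cohort and payoffs; a missed adenoma carries a delayed-diagnosis payoff
states = {
    "no significant pathology": (0.6, (1000.0, 15.0), (1000.0, 15.0)),
    "advanced adenoma present": (0.4, (2000.0, 14.0), (5000.0, 13.0)),
}
standard = expected_outcomes(states, {"advanced adenoma present": 0.8})
with_ai = expected_outcomes(states, {"advanced adenoma present": 0.9})
```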

3.31

This approach of using lifetime 'pay offs' meant that additional or future colonoscopies were not modelled to include AI. The committee questioned the impact of the assumption in the model that all colonoscopies after the initial index colonoscopy would be done without AI. The committee judged that the assumption undermined the ability to assess the full impact of AI technologies on long-term outcomes and did not reflect reality. The EAG explained that this was a limitation of the model that could not be easily resolved. It said that this was because it would need access to the MiMiC-Bowel model, which it did not have. The EAG added that addressing this concern would be likely to improve estimated QALYs in the intervention arm because AI would continue to detect polyps that would otherwise be missed. But it said that, similarly, costs may also increase to account for ongoing costs of the technologies. The EAG estimated that the impact would be small.

3.32

The clinical experts also noted another limitation of using the MiMiC-Bowel model in that it does not take SSLs into account. So, they said it may not reflect the outcomes associated with detection and non-detection of different types of polyps.

3.33

The committee also noted that the inputs used for long-term QALYs were key drivers of the model results. The committee agreed that within these inputs, a reduction in the rates of post-colonoscopy colorectal cancer and earlier detection would be driving the model results.

Uncaptured costs

3.34

The committee also raised concerns that the economic model did not fully account for the downstream impact of diagnostic decisions. For instance, increased detection of small polyps, driven by AI technologies, could lead to more people being placed onto surveillance pathways. This could result in additional colonoscopies and follow up and increased histology costs with no clear corresponding clinical benefit. The need for surveillance triggered by guideline thresholds (for example, 5 or more polyps) was also not modelled, because ADR does not capture the incidence of multiple polyps.

3.35

The committee concluded that there are likely to be some costs related to the use of AI technologies that have not been captured in the model.

Uncertain costs for some technologies

3.36

Nine of the 10 technologies were included in the analysis, with 1 technology excluded because of a lack of technology cost data. So, no health-economic results were available for ENDOANGEL. The committee concluded that the cost effectiveness of this technology was too uncertain to be able to make a recommendation for use in the NHS outside of research.

Cost-effectiveness conclusions

3.37

All the modelled technologies were estimated to be cost effective compared with standard colonoscopy without the use of AI. All but 1 were estimated to be less costly and more effective compared with standard colonoscopy. But, in all cases, the differences in costs and QALYs were small, and the EAG advised caution in interpreting the results. It noted that the results were very uncertain, with the probability of each intervention being cost effective estimated as approximately 50% at NICE's usual willingness-to-pay range (£20,000 to £30,000 per QALY gained).
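The probability of being cost effective is commonly estimated as the share of probabilistic model runs with a positive net monetary benefit at a given threshold. The sketch below shows that calculation under this assumption. The sample values are hypothetical and are not outputs of the EAG's model.

```python
def probability_cost_effective(samples, threshold):
    """Share of (delta_cost, delta_qalys) samples with positive net monetary benefit.

    Net monetary benefit = threshold * delta_qalys - delta_cost, where deltas are
    AI-assisted colonoscopy minus standard colonoscopy.
    """
    favourable = sum(
        1 for d_cost, d_qalys in samples if threshold * d_qalys - d_cost > 0
    )
    return favourable / len(samples)


# Hypothetical incremental costs (£) and QALYs from 4 probabilistic runs
samples = [(-10.0, 0.001), (15.0, 0.0002), (-5.0, -0.0005), (20.0, 0.002)]
p_20k = probability_cost_effective(samples, 20_000)
p_30k = probability_cost_effective(samples, 30_000)
```

When incremental costs and QALYs are both close to zero, small changes in either can flip the sign of the net monetary benefit, which is one way a probability near 50% can arise.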

3.38

The committee noted that the cost-effectiveness estimates were highly sensitive to inputs and assumptions in the model, with some members noting that even minor changes could reverse the conclusions. The committee recalled the uncertainty in the clinical evidence about whether AI can improve the detection of clinically significant polyps. It recalled that this may not be well represented in the model because of a lack of clinical data to inform the differences in improvement between different polyp types and sizes. The committee questioned whether the technologies would remain cost effective if this was truly reflected in the model, along with other costs that may not be fully captured (see section 3.13 and section 3.29). The committee concluded that the economic case for adopting the technologies remains uncertain.

3.39

The committee explored how this uncertainty could be reduced. It agreed that the key clinical benefit of detecting more polyps, and the key driver in the model, is a reduction in rates of post-colonoscopy colorectal cancer. It noted that collecting data on this outcome would need studies with long follow ups. But the clinical experts advised that a study on 1 of the technologies was already collecting this data. They said that this could be informative about the link between increased polyp detection with AI and any reduction in cancer rates.

3.40

The committee concluded that the structure and inputs of the economic model developed by the EAG were broadly appropriate. It said that the results suggested that the technologies could be cost effective, but they were still very uncertain. The committee agreed that data on post-colonoscopy colorectal cancer rates would be beneficial and could reduce the uncertainty about the clinical and economic impact of the AI technologies before routine adoption.

Equality considerations

3.41

Many studies excluded key groups, such as people with IBD, Lynch syndrome, FAP or previous colorectal cancer. The EAG noted that AI technologies may not be adequately validated for these populations because of poor reporting of training data and limited subgroup representation, making reliable conclusions difficult. The committee recommended more research before the technologies can be used in some of these groups (see section 3.21).

3.42

A clinical expert highlighted that people from ethnic minority backgrounds often present for colonoscopy later and experience delays in diagnosis. Another clinical expert noted that there was a lower uptake of screening among these groups, and this could potentially lead to underrepresentation in training datasets and biased performance. Representatives from 2 of the companies confirmed that their technologies had been trained on datasets derived from a broad range of providers and countries to ensure equal representation.