Endoscopy 2021; 53(09): 941-942
DOI: 10.1055/a-1381-7849

Addressing false-positive findings with artificial intelligence for polyp detection

Referring to Holzwanger E et al. p. 937–940
1  Clinical Effectiveness Research Group, Institute of Health and Society, Faculty of Medicine, University of Oslo, Oslo, Norway
2  Digestive Disease Center, Showa University Northern Yokohama Hospital, Yokohama, Japan
Michael Bretthauer
1  Clinical Effectiveness Research Group, Institute of Health and Society, Faculty of Medicine, University of Oslo, Oslo, Norway
3  Department of Transplantation Medicine, Oslo University Hospital, Oslo, Norway

The adoption of artificial intelligence (AI) with deep learning technology is revolutionizing daily practice in clinical medicine. In gastroenterology, AI for polyp detection in colonoscopy is leading the field both for clinical trial testing and implementation.

“...the authors propose that the false-positive definition of a duration of ≥ 2.0 seconds might be the best threshold because 2 seconds is sufficient time for endoscopists to rule out most false-positive frames by allowing folds to be flattened and by clearing bubbles/debris.”

Major endoscopy manufacturers have recently launched AI tools that assist in colorectal polyp detection in Europe, Japan, and other Asian countries. Commercialization is supported by recent evidence from rigorously designed clinical trials. A recent meta-analysis shows that AI during colonoscopy increases the adenoma detection rate by as much as 50 % [1]. Although AI does not significantly increase the detection rates of advanced adenomas, the technology may play an important role in more effective colorectal cancer prevention by colonoscopy.

Academic endoscopy societies are increasingly interested in AI guidance. The European Society of Gastrointestinal Endoscopy published its first guideline endorsing the use of AI during colonoscopy in 2019 [2], and the American Society for Gastrointestinal Endoscopy recently published a position statement to accelerate the implementation of AI in endoscopy practice [3]. However, as clinical implementation of AI polyp detection products has just started, it is uncertain how they will perform in real-world practice.

One concern that may arise among clinicians is false-positive alarms, which occur when the AI tool flags areas suspicious for polyps where no polyps are actually present. False-positive alarms are annoying and distracting for the endoscopist. They may induce stress, lead to unnecessary checking of flagged areas with no pathology, and thus prolong procedure times. False-positive alarms may also cause psychological distress in patients who are alerted during colonoscopy, owing to uncertainty about whether the flagged alarm really is false, or whether a polyp is present but the endoscopist simply cannot find it.

Therefore, many endoscopists are interested in systems with low false-positive rates. However, as with most diagnostic tools, a decrease in false-positive alarms is accompanied by a decrease in the number of correct alarms. The challenge is to find the optimum trade-off: few false-positive alarms while maintaining satisfactory sensitivity for polyp detection. Moreover, it is difficult to compare different AI algorithms, settings, and products owing to the lack of an established definition and measurement criteria for false-positive alarms in AI-assisted colonoscopy. This problem was also emphasized as one of the most crucial issues at a recent international expert meeting [4].

In this issue of Endoscopy, Holzwanger et al. clarify the importance of establishing consensus regarding the definition of false positives and provide important data on performance of AI tools at different false-positive thresholds [5]. The investigators evaluated the frequencies of false-positive alarms by assessing videos of 62 consecutive colonoscopies for various clinical indications at a single center in Costa Rica. The authors counted false-positive alarms according to the number of false-positive events rather than the number of false-positive frames. This approach is clinically relevant because counting the number of events is more intuitive and easier to assess for clinicians compared with counting every false-positive frame. The latter is very resource-demanding and involves annotation of hundreds of thousands of frames in the entire endoscopy video to obtain the frame-based false-positive rate.

The frequency of false-positive events is greatly affected by their definition, namely the required duration of the false-positive frames. In the present study, the authors counted the frequencies of false-positive events according to the following time thresholds: false-positive duration of > 0 seconds, > 0.5 seconds, > 1.0 seconds, or > 2.0 seconds. They found that increasing the duration used to define a false positive dramatically reduced the frequency of false positives. With a false-positive duration of > 0 seconds (meaning that a single frame indicating a polyp is enough to trigger an alarm), the mean number of alarms per colonoscopy was 26.3. Counting only alarms longer than 0.5 seconds reduced that number to 1.8; with thresholds of 1 and 2 seconds, the mean number of false-positive alarms fell to 0.4 and 0.05, respectively.
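The event-counting logic described above can be sketched in a few lines of code: consecutive false-positive frames are grouped into a single event, and an event counts as an alarm only if it exceeds the chosen duration threshold. This is a minimal illustration, not the study's actual software; the per-frame flags and the 30 frames-per-second rate are assumptions for the example.

```python
def count_fp_events(frame_flags, fps=30.0, min_duration=0.0):
    """Group consecutive false-positive frames into events and count
    only those events lasting longer than `min_duration` seconds.

    frame_flags: sequence of booleans, True where the AI flagged a
    frame as containing a polyp although none was present (assumed input).
    """
    durations = []
    run = 0  # length (in frames) of the current run of flagged frames
    for flagged in frame_flags:
        if flagged:
            run += 1
        elif run:
            durations.append(run / fps)  # run ended: record event duration
            run = 0
    if run:
        durations.append(run / fps)      # close a run at end of video
    return sum(1 for d in durations if d > min_duration)

# Example: a 3-frame blip (0.1 s) and a 45-frame run (1.5 s) at 30 fps
flags = [False] * 10 + [True] * 3 + [False] * 10 + [True] * 45 + [False] * 10
print(count_fp_events(flags))                    # threshold > 0 s: 2 events
print(count_fp_events(flags, min_duration=1.0))  # threshold > 1.0 s: 1 event
```

The example shows why raising the threshold shrinks the event count so sharply: brief single-frame or sub-second misrecognitions, which dominate the raw count, are simply merged away or filtered out.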

Interestingly, for all false-positive thresholds between 0 and 2 seconds, the AI software applied in the study picked up all polyps; thus, the sensitivity was 100 % for all scenarios. Reasonably, the authors propose that the false-positive definition of a duration of ≥ 2.0 seconds might be the best threshold because 2 seconds is sufficient time for endoscopists to rule out most false-positive frames by allowing folds to be flattened and by clearing bubbles/debris (most of the false-positive events were associated with misrecognition of shrunken colonic folds and bubbles/debris).

It is crucially important to obtain a consensus definition of a false positive because this will allow objective comparison of the currently available and varied AI modalities. To our understanding, a consensus definition may be beneficial not only for endoscopists who are using AI tools but also for public authorities, such as regulatory bodies and health insurance systems. Public authorities are asked to objectively assess newly developed AI systems and to issue permission-to-use or reimbursement-for-use decisions within their health care systems. These decisions are obviously important; however, the assessment process of these authorities has not been standardized, owing to the lack both of defined performance measures, such as false positives, and of common test data. Some research groups have been addressing the latter issue by providing publicly accessible video databases [3] [6] [7]; however, the former issue, namely the false-positive definition, has not been addressed. Therefore, the authors’ unique attempt is notable in that it potentially launches an active discussion of how to objectively define AI performance. The scientific rationale of the authors’ proposal to adopt 2 seconds as the false-positive threshold is sound given the data from this first study. However, the data need to be replicated and confirmed before decision making and guideline recommendations can be considered. We encourage endoscopy societies to move this discussion forward to determine appropriate definitions and propose recommendations.

Considering that increasing numbers of AI tools are available for polyp detection in colonoscopy practice, it seems an appropriate time to open the door to the next stage. Once AI performance has been transparently evaluated using a consensus definition and its use is appropriately guided by health care authorities, AI will become a more reliable player in colonoscopy.

Publication History

Publication Date:
26 August 2021 (online)

© 2021. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany