Introduction
Barrett’s esophagus (BE) patients have an increased risk of developing esophageal adenocarcinoma (EAC). Timely detection and treatment of EAC is associated with improved outcomes [1]. BE patients therefore undergo regular surveillance endoscopies [2]; however, early neoplasia is difficult to recognize and may be missed [3] [4] [5] [6] [7]. Computer-aided detection (CADe) systems may assist endoscopists in the detection of early BE neoplasia [8] [9] [10] [11].
Recently, we developed a CADe system for Barrett’s neoplasia on the largest BE dataset to date and extensively evaluated this system in an ex vivo benchmarking study [12]. The system outperformed a large group of general endoscopists for neoplasia detection and performed on a par with BE experts. More importantly, the detection rate of general endoscopists significantly increased when they received CADe assistance. However, as with nearly every endoscopic AI system, all training and test data in this study were acquired by expert endoscopists in academic hospitals according to standardized protocols. This resulted in high quality imaging conditions that are unrepresentative of clinical practice.
The majority of endoscopic BE surveillance is performed by general endoscopists in community centers, where the endoscopic image quality is subject to considerable variation. Such variation can be attributed to endoscopist-dependent factors (e.g. skill, experience), as well as endoscopist-independent factors (e.g. endoscopy equipment, specific endoscopic software settings). This results in a so-called “domain gap,” a phenomenon where an AI system underperforms because of a mismatch between the dataset used for its development and the data on which it is deployed [13] [14] [15]. We have previously shown that endoscopic AI systems are vulnerable to these domain gaps [16] [17] [18]. Ensuring robustness to data heterogeneity is therefore crucial for successful implementation of AI systems in clinical practice.
In this study, we aimed to develop a more robust CADe system for the detection of
Barrett’s neoplasia, from now on referred to as CADe 2.0, designed for use in routine
clinical practice. We evaluated its performance under heterogeneous imaging conditions
representative of real-world clinical practice and compared it to its predecessor,
from now on referred to as CADe 1.0.
Methods
Study design
In this study, we first integrated multiple improvements into our CADe 2.0 system
in order to enhance its performance and robustness to data heterogeneity. These improvements
included self-supervised pretraining, increasing the quantity and diversity of the
training data, model-architecture optimization, specific data augmentation methods,
and optimized usage of ground truth segmentations. We then compared its performance
to CADe 1.0 for the detection of Barrett’s neoplasia using three prospectively collected,
independent test sets. These test sets aimed to reflect the image diversity encountered
in daily endoscopic practice.
The study adhered to the QUAIDE reporting guidelines for preclinical studies in diagnostic endoscopy [19]. A checklist is provided in Table 1s (see online-only Supplementary materials).
Data collection
Both CADe systems have been developed by the BONS-AI consortium (Barrett’s OesophaguS
imaging for Artificial Intelligence). We collected white-light endoscopy images from
consecutively enrolled, treatment-naïve BE patients undergoing regular endoscopic
surveillance, or endoscopic treatment of a lesion in the Barrett segment. Patients
with a visible lesion and histologically confirmed neoplasia, as well as patients
without a visible lesion and no histologic evidence of neoplasia, were included. Images
were excluded if substantial non-neoplastic esophageal changes were present (e.g. ulceration, scarring, or severe esophagitis).
All data were collected in a strictly anonymized manner and either originated from previous studies [10] [12] or were newly acquired for this study; the data collection protocols were identical. The participating BONS-AI centers collected prospective images using a standardized workflow for image and video acquisition, which has been described previously [10].
CADe 1.0 system
The CADe 1.0 system was pretrained on GastroNet-5M using a semi-supervised learning method [20]. GastroNet-5M is a largely unlabeled dataset comprising over 5 million endoscopic images from the complete gastrointestinal tract [21]. The system was subsequently trained and internally validated on a dataset that consisted of both retrospectively and prospectively collected BE images. The training data comprised exclusively high quality images acquired by expert endoscopists from 15 international tertiary hospitals. It included 6337 neoplastic images and 7695 nondysplastic images originating from 1362 and 1139 BE patients, respectively.
All images used in this study were obtained with X1 and 190-series gastroscopes and processors from Olympus (Tokyo, Japan) using high definition white-light endoscopy. Data were captured using Medicapture USB300 and Sony HVO-4000MT recorders. The CADe 1.0 system was constructed using an EfficientNet-Lite1 encoder and a MobileNetV2-DeepLabV3+ decoder. A comprehensive description of the development of CADe 1.0 has been published elsewhere [12].
CADe 2.0 system
The CADe system underwent six substantial improvements. These aimed to increase the
performance and robustness of the CADe system against endoscopist-dependent and endoscopist-independent
image quality variation.
[Table 1] summarizes the improvements implemented in CADe 2.0. Further technical details are provided in Appendix 1s.
Table 1 Overview of improvements integrated into the CADe 2.0 system.

Aspect               CADe 1.0                  CADe 2.0
Pretraining          GastroNet-5M,             GastroNet-5M,
                     semi-supervised           self-supervised
Training data        ± 2500                    ± 2900
                     ± 14 000                  ± 39 000
                     ± 2800                    ± 4300
                     High quality              Diverse quality
Model architecture   CNN                       Hybrid CNN-ViT
                     5.2 MB                    116.2 MB
                     4.6 million               29.7 million
Data augmentation    Regular augmentations     Regular augmentations + image
                                               enhancement-based augmentations
Ground truth use     Single consensus mask     Multiple consensus masks

CNN, convolutional neural network; ViT, vision transformer.
Pretraining method
Before training on application-specific data (e.g. BE images), computer vision algorithms are often pretrained on large publicly available datasets such as ImageNet [22]. These datasets include generic images (e.g. animals, vehicles, buildings), helping the algorithm recognize basic image features such as shapes, edges, and colors. This approach preserves valuable application-specific data for more advanced tasks (e.g. detecting Barrett’s neoplasia). Recent studies suggest that pretraining with large domain-specific datasets (e.g. general endoscopic images) instead of generic images can lead to even better results [23] [24]. For this purpose, our group recently presented the GastroNet-5M dataset [25]. GastroNet-5M comprises roughly 5 million general endoscopic images from 500 000 endoscopy procedures and is largely unlabeled. Based on an extensive experimental framework including validation on nine separate endoscopic AI applications (e.g. colorectal polyp diagnosis and gastric cancer invasion-depth prediction) and 17 datasets, we found that pretraining using GastroNet-5M resulted in higher diagnostic accuracy, required less application-specific training data, and led to more robust performance against data heterogeneity [25].
For the development of CADe 2.0, we adopted a fully self-supervised learning approach called DINO [26]. DINO, like most self-supervised methods, learns meaningful representations of unlabeled data by making predictions on one part of an image based on another part (i.e. pretext tasks). Compared with the semi-supervised learning strategy, DINO performs significantly better on well-established benchmarks [26].
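As an illustration of the general principle only (the BONS-AI pretraining code is not public), the sketch below implements a minimal DINO-style student-teacher loop in PyTorch. The toy backbone, projection sizes, temperatures, and EMA momentum are illustrative assumptions; full DINO additionally uses multi-crop augmentation and a running center for the teacher output.

```python
# Minimal DINO-style self-distillation sketch (assumes PyTorch).
# The toy backbone and hyperparameters are placeholders, not the authors' setup.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    # Sharpen the teacher distribution and train the student to match it;
    # gradients flow through the student branch only.
    t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    return -(t * F.log_softmax(student_logits / tau_s, dim=-1)).sum(-1).mean()

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))  # stand-in encoder
head = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 4096))
student = nn.Sequential(backbone, head)
teacher = copy.deepcopy(student)            # teacher = EMA copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
for step in range(10):                      # toy loop over random "images"
    v1, v2 = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)  # two augmented views
    loss = dino_loss(student(v1), teacher(v2)) + dino_loss(student(v2), teacher(v1))
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                   # exponential moving average update
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(0.996).add_(ps, alpha=0.004)
```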
Dataset size (patients)
The training set of CADe 2.0 was comparable to CADe 1.0 in terms of the number of
patients. CADe 2.0 contained data from 1402 neoplastic and 1201 nondysplastic patients,
compared with 1296 neoplastic and 1095 nondysplastic patients for CADe 1.0. In contrast,
the internal validation set of CADe 2.0 was substantially larger. It included data
from 97 neoplastic and 197 nondysplastic patients, compared with CADe 1.0, which had
data from only 58 neoplastic and 36 nondysplastic patients. This larger internal validation
set allowed for more informed developmental decisions during the training phase.
Dataset diversity (images/frames)
CADe 1.0 was exclusively trained and evaluated on high quality, expert-acquired images, which may lead to decreased performance on more diverse quality data [16] [17]. For CADe 2.0, we introduced greater data heterogeneity by incorporating video data alongside high quality still images. Videos, even when acquired by experts, inherently capture a broader range of image quality conditions, including quality artifacts such as a partially blurred lens or suboptimal mucosal cleaning. In our earlier studies, adding video frames to a homogeneous training dataset of high quality still images resulted in significantly more robust performance to data heterogeneity [27]. We therefore extracted an additional 6725 neoplastic and 9729 nondysplastic video frames, originating from specific video sequences of 162 and 188 patients, respectively. Frames were automatically sampled using a custom-designed algorithm that ensures a wide variety of content and image quality (Appendix 2s). The internal validation set was similarly enriched with diverse quality video frames. An overview of the CADe 1.0 and CADe 2.0 training and internal validation sets is given in Table 2s.
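The sampling algorithm itself is described in Appendix 2s and is not reproduced here. Purely as a hypothetical illustration of the underlying idea, the sketch below stratifies candidate frames by a simple sharpness proxy (variance of the Laplacian) and samples from each stratum, so that the selected frames span low to high image quality.

```python
# Hypothetical quality-stratified frame sampling (assumes OpenCV and NumPy);
# the actual BONS-AI algorithm (Appendix 2s) may differ substantially.
import cv2
import numpy as np

def sample_diverse_frames(frames, per_stratum=2, seed=0):
    """frames: list of BGR uint8 arrays; returns sorted indices of kept frames."""
    # Variance of the Laplacian is a common, cheap sharpness/blur proxy.
    scores = [cv2.Laplacian(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), cv2.CV_64F).var()
              for f in frames]
    order = np.argsort(scores)                 # blurriest -> sharpest
    strata = np.array_split(order, 3)          # low / moderate / high quality
    rng = np.random.default_rng(seed)
    picked = [int(i) for s in strata if len(s)
              for i in rng.choice(s, size=min(per_stratum, len(s)), replace=False)]
    return sorted(picked)
```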
Model architecture
CADe 1.0 was designed to operate within the computational constraints of a typical endoscopy processor. Therefore, a computationally efficient convolutional neural network (CNN) was selected (i.e. a quantized int8-based EfficientNet-Lite1 architecture paired with a MobileNetV2-DeepLabV3+ decoder). In contrast, CADe 2.0 was not subjected to such computational limitations and employed a more advanced architecture with a CaFormer-S18 encoder [28] [29], a hybrid model that integrates both convolutional layers and transformer-based self-attention mechanisms, alongside a DeepLabV3+ decoder. This transition from a purely CNN-based model to a hybrid architecture was made deliberately, supported by recent developments in model architectures in general computer vision and by findings from our group, which empirically demonstrated that transformer-based models may offer improved performance and robustness for endoscopic image analysis [30].
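To make the encoder-decoder pairing concrete, here is a hedged sketch assuming the timm library exposes caformer_s18 with multi-scale feature extraction (features_only); the decoder is a deliberately simplified DeepLabV3+-style head for illustration, not the authors' implementation.

```python
# Illustrative hybrid CNN-ViT segmentation model in the spirit of CADe 2.0:
# CaFormer-S18 encoder + simplified DeepLabV3+-style head. Assumes a recent
# timm release; the real decoder and training details are not public.
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = timm.create_model("caformer_s18", pretrained=False, features_only=True)
chs = encoder.feature_info.channels()          # channels per feature stage

class SimpleDeepLabHead(nn.Module):
    """ASPP-like dilated branches on deep features plus a low-level skip."""
    def __init__(self, low_ch, high_ch, n_classes=1):
        super().__init__()
        self.aspp = nn.ModuleList(
            nn.Conv2d(high_ch, 128, 3, padding=d, dilation=d) for d in (1, 6, 12))
        self.low_proj = nn.Conv2d(low_ch, 32, 1)
        self.classifier = nn.Sequential(
            nn.Conv2d(3 * 128 + 32, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, n_classes, 1))

    def forward(self, low, high):
        x = torch.cat([branch(high) for branch in self.aspp], dim=1)
        x = F.interpolate(x, size=low.shape[-2:], mode="bilinear", align_corners=False)
        return self.classifier(torch.cat([x, self.low_proj(low)], dim=1))

head = SimpleDeepLabHead(low_ch=chs[0], high_ch=chs[-1])
feats = encoder(torch.rand(1, 3, 256, 256))    # multi-scale feature maps
mask_logits = head(feats[0], feats[-1])        # per-pixel neoplasia logits
frame_logit = mask_logits.amax(dim=(-2, -1))   # crude frame-level detection score
```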
Targeted data augmentation
Current endoscopy systems offer a broad range of post-processing enhancement settings (Fig. 1s). These settings improve perceived image quality by enhancing image characteristics such as contrast, color, and texture. They can be adjusted based on endoscopist preferences or preinstalled by the technical services department, often resulting in significant differences between endoscopy units. In previous work, we showed that the performance of endoscopic AI systems can vary substantially depending on the post-processing enhancement settings of the endoscopy system [18] [31]. We also showed that, by using all available enhancement settings to augment our training data, performance variability significantly decreased.
For CADe 1.0, we only used generic data augmentation techniques (e.g. flipping, rotation, and random alteration of color and contrast). For CADe 2.0, we added domain-specific data augmentation by including the complete spectrum of available enhancement settings of Olympus processors in the training set. This was facilitated by a proprietary software tool capable of accurately simulating different enhancement settings, as detailed in the original publication [18].
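The enhancement-simulation tool is proprietary and cannot be shown; as a generic stand-in, the sketch below composes conventional geometric augmentations with photometric ones (contrast, color, sharpness) that crudely mimic processor enhancement variation, assuming torchvision.

```python
# Generic stand-in augmentation pipeline (assumes torchvision); the actual
# CADe 2.0 pipeline replays real Olympus enhancement settings via a
# proprietary simulation tool [18], which this does not reproduce.
from torchvision import transforms

train_augment = transforms.Compose([
    # "Regular" augmentations, as used for CADe 1.0:
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    # Photometric jitter loosely approximating enhancement-setting variation:
    transforms.ColorJitter(brightness=0.2, contrast=0.3, saturation=0.2, hue=0.02),
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),
    transforms.ToTensor(),
])
```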
Segmentation ground truth use
Early BE lesions and particularly their outer margins are often subtle. Assessment
of the outer margins of early BE lesions on images obtained with white-light endoscopy
in overview may therefore be challenging. This leads to substantial interobserver
variability in ground truth segmentations. For CADe 2.0, each neoplastic lesion was
independently delineated by two expert endoscopists, who provided both lower likelihood
and higher likelihood annotations, reflecting their graded confidence in the presence
of neoplasia (Appendix 3s). In previous studies, this approach substantially reduced variability between annotators [10] [12].
In the development of CADe 1.0, this nuanced information was not fully used. Instead, one single consensus segmentation was created by combining the union of both higher likelihood areas with the overlap of lower likelihood areas, resulting in a binary ground truth delineation that categorized regions as either 100 % neoplastic or 100 % nondysplastic. This approach oversimplifies clinical reality, where the transition zone of the likelihood of neoplasia is often more gradual. For CADe 2.0, we adopted a more refined approach, in which varying degrees of annotator certainty were preserved during training. This method has been extensively described in a separate study [32]. An example case is depicted in Fig. 2s.
Test sets for evaluation of CADe 1.0 versus CADe 2.0
Both CADe systems were evaluated using three prospectively collected, independent test sets. These test sets are summarized in [Table 2], with example cases shown in [Fig. 1]. Further details on the types of lesions in our test sets, including their macroscopic appearance and pathologic outcome, are provided in Table 3s. As only lesions suitable for endoscopic resection are referred to our academic centers, our dataset consists predominantly of early stage neoplastic lesions.
Table 2 Overview of test sets used to evaluate the CADe 1.0 and CADe 2.0 systems.

Data set                                Quality level   Total number of            Neoplastic images/   Nondysplastic BE images/
                                                        images[1]/videos;          videos; patients     videos; patients
                                                        patients
Peak performance test set               –               428; 114                   84; 46               344; 68
Robustness to endoscopist-dependent     High            396; 122                   119; 48              277; 74
variation test set                      Moderate        396; 122                   119; 48              277; 74
                                        Low             396; 122                   119; 48              277; 74
Robustness to endoscopist-independent   –               396; 122                   119; 48              277; 74
variation test set

BE, Barrett’s esophagus.
[1] Images in this row include both still images and video frames.
Fig. 1 Example images from the three test sets used to evaluate the CADe systems. Test set
1 featured high quality videos recorded by expert endoscopists to assess performance
under ideal imaging conditions. Test set 2 included matched video frames from the
same patient, showing variations in endoscopist-dependent image quality. Test set
3 presented images displayed with different post-processing enhancement settings to
evaluate robustness to endoscopist-independent image quality.
Test set 1: peak performance
Test set 1 comprised 428 10-second stationary videos of 114 BE patients, including
84 videos of 46 neoplastic patients and 344 videos of 68 nondysplastic patients.
Videos were collected in five BE referral centers between April 2022 and January 2024. As
all videos were recorded by expert endoscopists, this test set evaluates an algorithm’s
performance under ideal conditions (i. e. peak performance).
Test set 2: robustness to endoscopist-dependent variation
Test set 2 evaluated the CADe system's robustness to variation in endoscopist-dependent
image quality. This variation includes factors such as esophageal expansion, mucosal
cleaning, and the presence of blur. Test set 2 was composed of three subsets representing
different levels of endoscopist-dependent image quality: high, moderate, and low.
For each patient, a matched triplet of video frames was manually selected by a research fellow (R.A.H.vE.vH) and subsequently confirmed by one of two expert Barrett’s endoscopists (J.J.B., A.J.G.). Each triplet contained a high quality, moderate quality, and low
quality frame, all captured from the same patient and at the same position within
the Barrett’s segment. In total, each subset comprised 119 neoplastic and 277 nondysplastic
video frames, derived from 48 and 74 patients, respectively.
Test set 3: robustness to endoscopist-independent variation
The third test set evaluated the CADe system's robustness to variation in endoscopist-independent
image quality. Modern endoscopy systems offer a range of post-processing enhancement
settings designed to improve the perceived image quality, such as contrast and color
patterns. These settings can vary significantly between hospitals, often without the
awareness of endoscopists. In this study, we focused on two enhancement setting types
(A and B) available in current Olympus processors. Each processor can display an image
using either setting A or B, each with intensity levels ranging from 1 to 8. Both
settings enhance image sharpness and contrast, but differ in their specific adjustments.
This test set was derived from the high quality subset of test set 2. The subset was duplicated for every available enhancement setting (A1–A8 and B1–B8) using a proprietary software tool, creating 16 subsets in total. This tool was specifically developed to exactly replicate original endoscopic images with different enhancement settings [18].
Ground truth development
For both training and test data, the ground truth for classification of images was
based on expert assessment and the corresponding histopathologic outcome. Images were
labeled neoplastic if there was a visible lesion and the endoscopic resection specimen
revealed high grade dysplasia or adenocarcinoma, or as nondysplastic BE if there was
no visible lesion and all random biopsies were negative for neoplasia.
To provide a ground truth for lesion segmentation, neoplastic images were delineated by 16 expert endoscopists using proprietary software (Meducati, Göteborg, Sweden). All experts had both a scientific track record (i.e. had authored > 10 peer-reviewed studies) and a clinical track record (i.e. had been working for > 5 years in a tertiary referral center) in the diagnosis and treatment of Barrett’s neoplasia. Each image was delineated by at least two experts, as described in the “Segmentation ground truth use” section and in Fig. 2s.
Performance metrics
Classification performance
Classification was considered correct when the CADe system correctly identified an
image or video as either neoplastic or nondysplastic BE. For videos, classification
as neoplastic required the system to classify all frames as neoplastic for a time
interval of at least 1 second.
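The decision rule above is straightforward to state in code. The sketch below is a minimal implementation under assumed values for the frame rate and per-frame threshold, which are not specified here.

```python
# Video-level decision rule: a video counts as neoplastic only if all frames
# in some window of at least 1 second are classified neoplastic. The frame
# rate (25 fps) and per-frame probability threshold are assumptions.
def video_is_neoplastic(frame_probs, fps=25.0, threshold=0.5, min_duration_s=1.0):
    needed = int(round(fps * min_duration_s))   # consecutive positives required
    run = 0
    for p in frame_probs:
        run = run + 1 if p >= threshold else 0
        if run >= needed:
            return True
    return False

# 25 consecutive positive frames at 25 fps constitute a 1-second detection.
print(video_is_neoplastic([0.9] * 25 + [0.1] * 10))  # True
```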
Localization performance
Localization performance was assessed for correctly classified neoplastic images.
Scores were calculated using the Dice similarity coefficient, which measures the overlap
between the CADe system’s segmentation and the consensus segmentation of two expert
endoscopists. Localization assessment was only performed on high quality images, as
ground truth segmentation on lower quality images is inherently unreliable.
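For reference, the Dice similarity coefficient has a standard definition: twice the intersection of the two masks divided by the sum of their areas. A minimal sketch:

```python
# Dice similarity coefficient between a predicted mask and the experts'
# consensus mask (standard definition; shapes/thresholding are illustrative).
import numpy as np

def dice(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0      # convention: two empty masks overlap perfectly
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```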
Outcome measures
Peak performance (test set 1)
A comparison of the classification performance of the CADe 1.0 and CADe 2.0 systems
in terms of their sensitivity and specificity was performed on test set 1.
Robustness to endoscopist-dependent variation (test set 2)
A comparison of the classification performance of the CADe 1.0 and CADe 2.0 systems
across high, moderate, and low quality images, reporting sensitivity, specificity,
and area under the curve (AUC), was performed on test set 2. Localization (Dice) scores
were calculated for the high quality subset.
Robustness to endoscopist-independent variation (test set 3)
A comparison of the classification performance of the CADe 1.0 and CADe 2.0 systems
across different enhancement settings, in terms of their median sensitivity, specificity,
and AUC scores, was performed on test set 3. Localization performance was assessed
in terms of the median Dice scores.
Statistical analysis
For descriptive statistics, categorical data are presented as frequencies and percentages.
Continuous data are presented as mean (SD) or median with interquartile range (IQR)
for normally distributed and skewed data, respectively. Sensitivity was defined as
the proportion of all patients with a visible lesion and confirmed neoplastic pathology
correctly identified as neoplastic by the CADe system. Specificity was defined as
the proportion of all patients without a visible lesion and confirmed non-neoplastic
pathology correctly classified as non-neoplastic by the CADe system. Sensitivity and
specificity are expressed in percentages with 95 %CI, using the Wilson method.
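For reference, a minimal implementation of the Wilson score interval follows (the same interval is available in standard statistics packages); the worked example uses the CADe 2.0 video sensitivity reported in the Results (81/84).

```python
# Wilson score interval for a binomial proportion (z = 1.96 for a 95 %CI).
import math

def wilson_ci(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Example with the CADe 2.0 video sensitivity from the Results (81/84):
print(wilson_ci(81, 84))   # approximately (0.90, 0.99)
```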
For test sets 1 and 2, McNemar tests were used to compare sensitivity and specificity
scores, DeLong tests were used for AUC comparisons, and Wilcoxon signed-rank tests
for pairwise comparisons of Dice scores. For test set 3, comprising 16 copies of a
single test set with varying enhancement settings, Wilcoxon signed-rank tests were
applied to compare performance metrics across all settings. Correction for multiple
testing was not applied, as all comparisons were predefined and performed on separate
test sets targeting distinct evaluation objectives.
While the study aimed to confirm findings from previous work, it remained early phase
and primarily exploratory in nature, and the results should be interpreted with some
caution. Statistical analysis was performed using Python 3.8 (Python Software Foundation).
Results
Test set 1: peak performance
When lesions were evaluated under ideal imaging conditions, CADe 1.0 correctly classified
73/84 neoplastic videos and 252/345 nondysplastic BE videos, resulting in a sensitivity
and specificity of 87 % (95 %CI 78 %–93 %) and 73 % (95 %CI 68 %–77 %), respectively.
CADe 2.0 correctly classified 81/84 neoplastic videos, corresponding to a significantly
higher sensitivity of 96 % (95 %CI 90 %–99 %) compared with CADe 1.0 (P = 0.02). For nondysplastic BE cases, CADe 2.0 correctly classified 256/345 videos, corresponding to a specificity of 74 % (95 %CI 69 %–79 %), with no significant difference compared with CADe 1.0 (P = 0.73).
Given the arbitrary nature of stand-alone performance scores for CADe systems on videos,
we further assessed the performance of the CADe 1.0 and CADe 2.0 systems across various
detection time cutoffs (i. e. 0.5, 1, 2, 3, and 4 seconds). CADe 2.0 consistently
outperformed CADe 1.0 across all of the evaluated cutoffs. More detailed results for
the individual detection time intervals are provided in Table 4s.
Test set 2: robustness to endoscopist-dependent variation
Both CADe systems were subsequently tested for robustness against endoscopist-dependent
image quality variation. As shown in [Table 3] and [Fig. 2], performance decreased for both systems as image quality declined from the high quality to the moderate and low quality test sets.
Table 3 Performance of the CADe 1.0 and CADe 2.0 systems per test set.

Test set                       Quality    Metric                  CADe 1.0           CADe 2.0           P value
Peak performance test set      –          Sensitivity (95 %CI)    86.9 (78.1–92.5)   96.4 (90.0–98.8)   0.02
                                          Specificity (95 %CI)    73.0 (68.1–77.4)   74.2 (69.3–78.5)   0.73
Robustness to endoscopist-     High       Sensitivity (95 %CI)    93.2 (87.2–96.5)   94.1 (88.3–97.1)   > 0.99
dependent variation test set              Specificity (95 %CI)    67.3 (61.6–72.6)   82.4 (77.5–86.4)   < 0.001
                                          AUC (95 %CI)            91.7 (88.2–95.3)   95.3 (92.6–98.0)   0.34
                                          Dice score (95 %CI)     0.48 (0.44–0.53)   0.66 (0.62–0.71)   < 0.001
                               Moderate   Sensitivity (95 %CI)    74.8 (66.3–81.7)   82.4 (74.6–88.2)   0.10
                                          Specificity (95 %CI)    71.5 (65.9–76.5)   84.1 (79.3–87.9)   < 0.001
                                          AUC (95 %CI)            79.4 (74.2–84.6)   90.3 (86.5–94.1)   0.02
                               Low        Sensitivity (95 %CI)    61.3 (52.3–69.6)   78.2 (70.0–84.7)   0.002
                                          Specificity (95 %CI)    77.3 (72.0–81.8)   89.2 (85.0–92.3)   < 0.001
                                          AUC (95 %CI)            75.2 (69.6–80.8)   89.0 (85.0–93.0)   0.05
Robustness to endoscopist-     –          Sensitivity (IQR)       89.0 (81.4–93.6)   90.8 (89.1–92.4)   0.28
independent variation                     Specificity (IQR)       73.7 (67.5–78.4)   85.4 (82.6–86.7)   < 0.001
test set                                  AUC (IQR)               88.6 (88.2–90.5)   95.5 (95.3–95.7)   < 0.001
                                          Dice (IQR)              0.45 (0.43–0.48)   0.64 (0.64–0.66)   < 0.001

IQR, interquartile range.
Fig. 2 Bar graphs depicting the classification and localization performance of the CADe 1.0
and CADe 2.0 systems across the test sets, with error bars indicating confidence intervals
calculated using the Wilson method. Boxplots for test set 3 illustrate classification
and localization performance with differing image enhancement settings.
On the high quality test set, CADe 1.0 achieved a sensitivity of 93 % (95 %CI 87 %–97 %),
specificity of 67 % (95 %CI 62 %–73 %), and an AUC of 92 % (95 %CI 88 %–95 %). CADe 2.0 yielded a comparable sensitivity of 94 % (95 %CI 88 %–97 %; P > 0.99), with significantly increased specificity of 82 % (95 %CI 78 %–86 %; P < 0.001) and a numerically higher AUC of 95 % (95 %CI 93 %–98 %; P = 0.34).
On the moderate quality test set, CADe 1.0’s sensitivity and AUC dropped to 75 % (95 %CI
66 %–82 %) and 79 % (95 %CI 74 %–85 %), respectively, with similar specificity of
72 % (95 %CI 66 %–77 %). CADe 2.0 outperformed CADe 1.0, with sensitivity, AUC, and
specificity of 82 % (95 %CI 75 %–88 %; P = 0.10), 90 % (95 %CI 87 %–94 %; P = 0.02), and 84 % (95 %CI 79 %–88 %; P < 0.001), respectively.
On the low quality test set, CADe 1.0’s performance further declined to a sensitivity
of 61 % (95 %CI 52 %–70 %) and an AUC of 75 % (95 %CI 70 %–81 %), with an increase
in specificity to 77 % (95 %CI 72 %–82 %). CADe 2.0 demonstrated significantly improved
robustness, achieving 78 % (95 %CI 70 %–85 %; P = 0.002), 89 % (95 %CI 85 %–93 %; P = 0.05), and 89 % (95 %CI 85 %–92 %; P < 0.001), respectively.
CADe 2.0’s ability to localize neoplasia also significantly improved compared with
CADe 1.0, with respective Dice scores of 0.66 (95 %CI 0.62–0.71) and 0.48 (95 %CI
0.44–0.53; P < 0.001) on the high quality test set.
Test set 3: robustness to endoscopist-independent variation
When assessing the performance of the two CADe systems across all 16 enhancement settings offered by the endoscopy system in a pairwise manner, substantial differences in performance were observed ([Fig. 2]; [Table 3]). CADe 1.0 achieved a median sensitivity, specificity, AUC, and Dice score of 89 % (IQR 81 %–94 %), 74 % (IQR 68 %–78 %), 89 % (IQR 88 %–91 %), and 0.45 (IQR 0.43–0.48), respectively. In comparison, CADe 2.0 reached a median sensitivity, specificity, AUC, and Dice score of 91 % (IQR 89 %–92 %; P = 0.28), 85 % (IQR 83 %–87 %; P < 0.001), 96 % (IQR 95 %–96 %; P < 0.001), and 0.64 (IQR 0.64–0.66; P < 0.001). An example case illustrating the improved performance of CADe 2.0 is shown in Fig. 3s.
Discussion
Most endoscopic AI systems are currently developed in academic centers based on data acquired by expert endoscopists using standardized protocols. However, these imaging conditions do not reflect the variability encountered in community practice, posing significant challenges for clinical implementation [17] [18] [27].
In this study, we present multiple improvements in an updated CADe system for Barrett’s
neoplasia detection (CADe 2.0). This system was specifically designed to be more robust
against data heterogeneity compared with its predecessor (CADe 1.0) [12]. To evaluate these improvements, we assessed the performance of both CADe systems
using three prospectively collected, independent test sets.
Our findings show that CADe 2.0 significantly outperformed CADe 1.0 in detecting Barrett’s
neoplasia under ideal imaging circumstances. More importantly, CADe 2.0 exhibited
enhanced robustness across a three-tiered test set simulating varying levels of endoscopist-dependent
image quality and every post-processing enhancement setting offered by the latest
Olympus endoscopy processor, simulating varying levels of endoscopist-independent
image quality. These results highlight the potential of the CADe 2.0 system to maintain
reliable performance when confronted with the diverse imaging conditions of clinical
practice. Some visual examples are given in [Video 1].
Video 1 Examples illustrating how the CADe 2.0 system is able to improve detection of subtle
lesions, while also coping with lower quality images and differing enhancement settings.
First, the CADe systems were evaluated using high quality videos originating from
five BE referral centers. CADe 1.0 achieved sensitivity and specificity scores of
87 % and 73 %, respectively, consistent with its originally reported performance on
another test set [12]. CADe 2.0 significantly outperformed CADe 1.0, achieving sensitivity and specificity
scores of 96 % and 74 %.
To account for the arbitrary nature of stand-alone video performance metrics (e. g.
a detection lasting 0.9 seconds would not qualify as a detection, while one lasting
1.1 seconds would), we also assessed CADe performance across a range of different
detection time cutoffs. Across all cutoffs, CADe 2.0 consistently demonstrated higher
detection rates (Table 4s). Moreover, sensitivity and specificity scores remained relatively stable across
different cutoffs, highlighting the improved consistency of the CADe 2.0 system compared
with CADe 1.0. For example, with a detection time cutoff of 4 seconds, the sensitivity
of CADe 1.0 dropped to 54 %, whereas CADe 2.0 retained a sensitivity of 82 %.
This improved consistency also suggests the potential to bridge the performance gap
between stand-alone CADe performance and the joint performance of endoscopists assisted
by CADe, as reported in the original CADe 1.0 publication [12 ]. In that study, although endoscopists’ performance significantly improved with CADe
assistance, their sensitivity lagged by over 10 % compared with the stand-alone performance
of the CADe system. A possible explanation is that endoscopists may not act on very
short CADe detections, while CADe systems are evaluated purely based on such arbitrary
cutoffs. By providing more consistent detections, the CADe 2.0 system may address
this limitation; however, this hypothesis requires formal evaluation in a benchmark
study before definitive conclusions can be drawn.
Further to the evaluation under ideal circumstances in test set 1, the two CADe systems
were also evaluated under more heterogeneous imaging conditions, which better reflect
routine practice.
Test set 2 examined robustness to endoscopist-dependent variation using three subsets
representing different levels of endoscopic image quality. The CADe 2.0 system
outperformed CADe 1.0 in both neoplasia classification and localization on the high
quality subset. Notably, as image quality decreased, the performance gap between the
systems widened, further emphasizing the increased robustness of the CADe 2.0 system.
For example, sensitivity for high quality images was 93 % for CADe 1.0 and 94 % for
CADe 2.0, whereas for low quality images, sensitivity was 61 % and 78 %, respectively.
An example case is provided in Fig. 4s.
Test set 3 assessed robustness to endoscopist-independent variation using 16 paired
subsets of identical images displayed using the full range of enhancement settings
offered by Olympus endoscopy processors. The CADe 2.0 system not only outperformed
CADe 1.0 for all performance metrics, but also demonstrated substantially greater
consistency across different enhancement settings (Fig. 3s). For instance, the IQR of specificity for CADe 1.0 spanned 10 percentage points (68 %–78 %), compared with only 4 percentage points (83 %–87 %) for CADe 2.0. This is highly relevant as these settings, often
preconfigured by technical services and unfamiliar to most endoscopists, can vary
widely between hospitals or even endoscopy suites.
These findings highlight that the updated CADe 2.0 system is more robust under suboptimal
imaging conditions, whether caused by endoscopist-dependent factors such as suboptimal
esophageal expansion, motion blur, or inadequate cleaning, or by processor-specific
enhancement settings. This improved robustness enhances the likelihood of successful
clinical implementation of the CADe 2.0 system.
The observed improvements in both absolute performance and robustness of the CADe
2.0 system can be attributed to two key factors: the underlying training data and
technological advancements in the system. First, the updated system benefited from
a larger and more diverse training dataset. While the CADe 1.0 system was trained
on a substantial dataset, it relied on a relatively small internal validation set.
In contrast, the CADe 2.0 system allocated a greater proportion of new patient data
for internal validation, enabling better decision-making during the system’s development.
Additionally, the training data for CADe 2.0 was more heterogeneous, encompassing
a broader spectrum of video frames representing varying levels of image quality and
content. In previous work, we have demonstrated that such diversity in training data
improves algorithm robustness [27]. Frames were carefully selected using an algorithm designed to maximize variability
in image characteristics, while adhering to predefined quality thresholds. Furthermore,
the CADe 2.0 system incorporated a wide range of image enhancement settings used by
current endoscopy systems. Modern endoscopy processors often include software settings
that subtly alter image characteristics such as contrast, color, and sharpness to
improve perceived image quality. While these changes are often imperceptible to human
observers, a prior study by our group showed that such settings can significantly
impact AI performance [18]. By including these variations in its training data, CADe 2.0 became more robust
to differences in processor-specific settings.
The CADe 2.0 system also benefited from significant technological advancements. Its
predecessor was designed to operate on a standard endoscopy processor, with limited
dedicated computational resources to run an AI algorithm. However, the current trend in endoscopy is toward the increasing availability of more powerful local computational resources, alongside the introduction of cloud-based AI systems [33]. Therefore, our new system was developed using a state-of-the-art architecture with
significantly greater computational capacity. Combined with an optimized pretraining
approach, this advancement greatly contributed to the improved performance and robustness.
This study has several unique features. First, our CADe system was trained on the
largest dataset for BE to date. The structure of the BONS-AI consortium, involving
15 BE referral centers across seven countries, facilitated extensive and heterogeneous
data collection, including a large number of prospectively acquired cases and a comprehensive
set of ground truth segmentations provided by 14 international Barrett’s experts.
Second, this is among the first studies to address the critical issue of domain shift
and its implications for AI implementation in endoscopy. Moreover, it is the first
to present an endoscopic AI system specifically designed to improve robustness against
the data heterogeneity encountered in routine clinical practice. Importantly, all
improvements implemented to bridge the domain gap were individually evaluated in prior
studies, significantly enhancing the validity of our findings [17] [18] [21] [25] [26] [27] [29] [30] [32]. Third, all test datasets were prospectively collected, each dedicated to a specific
evaluation target. Robustness test sets were carefully designed to encompass a wide
range of endoscopic image qualities and paired on a patient basis, ensuring reliable
and reproducible results.
This study also has limitations. First, we did not individually assess each performance-
or robustness-enhancing method. Consequently, it is challenging to determine which
factor contributed most to the improved performance of the CADe 2.0 system; however,
as previously noted, all methods have been independently evaluated in separate studies.
Furthermore, our group will publish an ablation study of all technical design choices
in a separate, more technically oriented journal. Second, this study focused on directly
comparing the two CADe systems. Unlike previous studies by our group, all evaluations
were conducted in a stand-alone setting, without a benchmarking performance of endoscopists
when assisted by the CADe system. It is well established that high stand-alone AI
performance does not necessarily translate to improved performance of endoscopists
supported by AI [9] [12]; however, we have already demonstrated that the CADe 1.0 system significantly improved
detection rates when used in collaboration with endoscopists. The updated CADe system
detected more lesions, provided more stable predictions, and displayed more robust
performance under various sources of image quality heterogeneity. This strongly suggests
that endoscopists using the updated system will also benefit from its improvements.
Third, although substantially more diverse, all data in this study still originated
from expert centers. Ideally, algorithms should be trained and tested using datasets
that include images from nonexpert centers; however, several practical limitations,
primarily the low prevalence of Barrett’s neoplasia in community hospitals, make this
approach unfeasible [34]. Fourth, while this study addressed several relevant sources of data heterogeneity
encountered in clinical practice, other potential factors remain unexplored. For instance,
all data in this study were captured using Olympus equipment, while variations between endoscope manufacturers and equipment series are likely to similarly impact generalizability.
Finally, even robust AI systems depend on a certain level of endoscopic quality. Lesions obscured by mucus or hidden in blind spots can still lead to missed diagnoses, so maintaining high procedural standards remains critical. Computer-aided quality (CAQ) systems could play a key role in supporting endoscopists here and should be investigated in future studies [35].
In conclusion, we present an updated CADe system for early Barrett’s neoplasia detection,
aimed at bridging the domain gap between performance in expert centers and community-based
centers. The system demonstrated significant performance improvement compared with
its predecessor. More importantly, this CADe system also displayed better robustness
against real-world image quality variation, which is crucial for successful implementation
into clinical practice. Building on this robust algorithm, we are currently evaluating its real-time performance in clinical practice within a multicenter prospective study.
Data availability
Results can be shared upon reasonable request by contacting m.jong3@amsterdamumc.nl.