Introduction
Gastric cancer is the third most lethal and the fifth most common malignancy worldwide [1]. An estimated 1 million new gastric cancer cases were diagnosed and about 700 000 people died of gastric cancer in 2012, representing up to 10 % of cancer-related deaths worldwide [1]. The 5-year survival rate for gastric cancer is 5 % – 25 % in its advanced stages, but reaches 90 % in the early stages [2] [3]. Early detection is therefore a key strategy to improve patient survival.
In recent decades, endoscopic technology has seen remarkable advances and endoscopy
has been widely used as a screening test for early gastric cancer (EGC) [4]. Nevertheless, in one series, 7.2 % of patients with gastric cancer had been misdiagnosed at an endoscopy performed within the previous year, and 73 % of these cases arose from endoscopist errors [5].
The performance quality of esophagogastroduodenoscopy (EGD) varies significantly because of cognitive and technical factors [6]. In the cognitive domain, EGC lesions are difficult to recognize because the mucosa often shows only subtle changes, which requires endoscopists to be well trained and armed with thorough knowledge [4] [7]. In addition, endoscopists can be affected by their subjective state during endoscopy, which substantially limits the detection of EGC [8]. In the technical domain, guidelines to map the entire stomach exist, but are often not well followed, especially in developing countries [9] [10]. Therefore, it is important to develop a feasible and reliable method to alert endoscopists to possible EGC lesions and blind spots.
A potential solution to mitigate the skill variations is to apply artificial intelligence
(AI) to EGD examinations. The past decades have seen an explosion of interest in the
application of AI in medicine [11]. More recently, deep convolutional neural networks (DCNNs), which transform the representation at one level into progressively more abstract representations to make predictions, have opened the door to sophisticated image analysis [12]. Recent studies have successfully used DCNNs in the field of endoscopy. Chen et al. achieved accurate classification of diminutive colorectal polyps based on colonoscopy images [13], and Byrne et al. achieved real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps using colonoscopy videos [14]. However, the application of DCNNs to real-time EGC detection combined with blind-spot monitoring has not yet been investigated.
In this work, we first developed a novel system using DCNN to analyze EGD images of
EGC and gastric locations. Furthermore, we exploited an activation map to proactively
track suspicious cancerous regions and built a grid model for the stomach to indicate
the existence of blind spots on unprocessed EGD videos.
Methods
Datasets, data preparation, and sample distribution
The flowchart of the data preparation and training/testing procedure of the DCNN is
shown in [Fig. 1]. Networks playing different roles were independently trained, and their functions,
inclusion criteria, exclusion criteria, image views, data sources, and data preparation
are described in [Table e1] ([Fig. e2]; [15]
[16]), available online in Supplementary materials. The sample distribution is presented
in [Fig. e3]. Images of the same lesion from multiple viewpoints, or of similar lesions from the same person, were included. Extensive attention was paid to ensure that images from
the same person were not split between the training, validation, and test sets.
Fig. 1 Flowchart of the data preparation and training/testing procedure of the deep convolutional
neural network (DCNN) model. The functions of networks 1, 2, and 3 are filtering blurry
images, early gastric cancer identification, and classification of gastric location,
respectively. The three networks were independently trained. Blurry images and some clear images were used for the training of network 1. Clear images were further classified into malignant or benign according to the pathology evidence for the training and testing of network 2. In parallel, clear images were classified into 10 or 26 gastric locations by two endoscopists with more than 10 years of esophagogastroduodenoscopy (EGD) experience for the training and testing of network 3. When running on videos, all frames are first filtered by network 1, so that only clear images enter networks 2 and 3.
Table e1
Datasets, data preparation, and sample distribution.
| | Function | Inclusion criteria | Exclusion criteria | Image views | Data sources | Data preparation |
| Network 1 | Filter unqualified images | Patients undergoing EGD examination | Age < 18 years and residual stomach | NBI, BLI, and white light | Stored images from patients between 10 February 2016 and 10 March 2018 in Renmin Hospital of Wuhan University | Two doctoral students classified images into blurry or clear |
| Network 2 | Identify EGC | Patients with gastric cancer, superficial gastritis, and mild erosive gastritis | Poor gastric preparation, age < 18 years, and residual stomach | NBI, BLI, and white light | Stored images from patients between 10 February 2016 and 10 March 2018 in Renmin Hospital of Wuhan University, images in the published EGC atlas of the Olympus company, and open-access EGC repositories | Two doctoral students classified images into malignancy or non-malignancy by pathology evidence; two endoscopists with more than 10 years of EGD experience reviewed these images |
| Network 3 | Classify gastric locations | Patients undergoing EGD examination | Poor gastric preparation, age < 18 years, and residual stomach | White light | Stored images from patients between 10 February 2018 and 10 March 2018 in Renmin Hospital of Wuhan University | Two endoscopists with more than 10 years of EGD experience classified images into their corresponding locations according to the European ESGE [15] and Japanese SSS [16] guidelines ([Fig. e2]) |
EGD, esophagogastroduodenoscopy; NBI, narrow-band imaging; BLI, blue-laser imaging;
EGC, early gastric cancer; ESGE, European Society of Gastrointestinal Endoscopy; SSS,
systematic screening protocol for the stomach.
Fig. e2 A schematic illustration of the classification of gastric locations. In line with
the guidelines of the European Society of Gastrointestinal Endoscopy (ESGE) Quality
Improvement Initiative, images were generally classified into 10 parts including esophagus,
squamocolumnar junction, antrum, duodenal bulb, descending duodenum, lower body in
forward view, middle-upper body in forward view, fundus, middle-upper body in retroflexed
view, and angulus. To refine the model in monitoring blind spots, we further classified
esophagogastroduodenoscopy (EGD) images into 26 parts (22 for stomach, 2 for esophagus,
and 2 for duodenum) following the “systematic screening protocol for the stomach”
(SSS), a minimum required standard in Japan. G, greater curvature; P, posterior wall;
A, anterior wall; L, lesser curvature.
Fig. e3 Sample distribution of the deep convolution neural network (DCNN) in different networks.
A total of 11 505 blurry and 11 960 clear images were enrolled to train the DCNN to
filter unqualified images; 3170 gastric cancer and 5981 benign images were collected
to train the DCNN to detect early gastric cancer (EGC); 24 549 images from different
parts of the stomach were collected to train the DCNN to monitor blind spots. The
training and validation datasets were divided in a ratio of 4 : 1, and the images
in test sets were independent from those in the training and validation datasets.
The videos used came from stored data at Renmin Hospital of Wuhan University. The instruments used included gastroscopes with an optical magnification function (CVL-290SL, Olympus Optical Co. Ltd., Tokyo, Japan; VP-4450HD, Fujifilm Co., Kanagawa, Japan).
The number of enrolled images was based on the data availability, which led to malignant
images being relatively rare compared with non-malignant images and the number of
images from different locations varying widely. The standards of the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) were used to justify these numbers [17].
Training algorithm
VGG-16 [18] and ResNet-50 [19], two state-of-the-art DCNN architectures pretrained with 1.28 million images from
1000 object classes, were used to train our system. Using transfer learning [20], we replaced the final classification layer with another fully connected layer,
retrained it using our datasets, and fine-tuned the parameters of all layers. Images were resized to 224 × 224 pixels to match the input dimensions of the models. Google’s TensorFlow [21] deep learning framework was used to train, validate, and test our system. Confusion matrices, learning curves, and the methods used to avoid overfitting are described in Appendix e1 ([Table e2]; [Figs. e4 – e6]; [22] [23] [24] [25] [26]), available online in Supplementary materials.
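As a rough illustration of this transfer-learning setup, the following sketch (in Keras/TensorFlow) loads an ImageNet-pretrained VGG-16, replaces its classification layer, and fine-tunes all layers; the head architecture, optimizer, and learning rate shown are assumptions for illustration, not the authors' exact configuration.
```python
# Minimal transfer-learning sketch (hypothetical hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 26  # gastric locations for network 3 (use 2 for networks 1 and 2)

# VGG-16 pretrained on ImageNet (1.28 million images, 1000 classes),
# with its original classification layer removed.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Replace the final classification layer with a new fully connected layer
# and fine-tune the parameters of all layers on the EGD images.
x = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs=base.input, outputs=outputs)
for layer in model.layers:
    layer.trainable = True

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```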
Table e2
Data augmentation methods and parameters used in the three independent networks.
| | Width shift range | Height shift range | Shear range | Zoom range | Horizontal flip | Fill mode |
| Image qualification | 0.2 | 0.2 | 0.2 | 0.2 | True | Nearest |
| EGC identification | 0.2 | 0.2 | 0.2 | 0.2 | True | Nearest |
| Gastric location classification | 0.1 | 0.1 | NA | 0.1 | NA | NA |
EGC, early gastric cancer; NA, not applicable.
Fig. e4 A schematic illustration of the k-fold cross-validation procedure, which was implemented with k = 5, dividing the dataset into five subsets and validating on each subset individually, with the remaining subsets used for training.
Fig. e5 The confusion matrices and learning curves of the three different networks.
a The loss and accuracy of network 1 (the network filtering unqualified images) in the validation dataset after different numbers of training epochs. Early stopping occurred at the 14th training epoch. Loss was calculated by binary crossentropy.
b The confusion matrix of network 1 in the validation dataset. Element (x, y) in the matrix represents the number of images of true class (y) predicted as class (x), where the numbers represent: 0, blurry; 1, clear.
c The loss and accuracy of network 2 (the network identifying early gastric cancer) in the validation dataset after different numbers of training epochs. Early stopping occurred at the 6th training epoch. Loss was calculated by binary crossentropy.
d The confusion matrix of network 2 in the validation dataset. Element (x, y) as above, where the numbers represent: 0, non-malignant mucosa; 1, gastric cancer.
e The loss and accuracy of network 3 (the network classifying gastric locations) in the validation dataset after different numbers of training epochs. Early stopping occurred at the 7th training epoch. Loss was calculated by categorical crossentropy.
f The confusion matrix of network 3 in classifying gastric locations into 26 parts in the validation dataset. Element (x, y) as above, where the numbers represent: class 0, esophagus; 1, squamocolumnar junction; 2 – 5, antrum (greater curvature [G], posterior wall [P], anterior wall [A], lesser curvature [L], respectively); 6, duodenal bulb; 7, descending duodenum; 8 – 11, lower body in forward view (G, P, A, L, respectively); 12 – 15, middle-upper body in forward view (G, P, A, L, respectively); 16 – 19, fundus (G, P, A, L, respectively); 20 – 22, middle-upper body in retroflexed view (P, A, L, respectively); 23 – 25, angulus (P, A, L, respectively).
g The confusion matrix of network 3 in classifying gastric locations into 10 parts. The confidence of each category was derived from the sum of its branches. Class 0, esophagus; 1, squamocolumnar junction; 2, antrum; 3, duodenal bulb; 4, descending duodenum; 5, lower body in forward view; 6, middle-upper body in forward view; 7, fundus; 8, middle-upper body in retroflexed view; 9, angulus.
Fig. e6 The process of building a synthesized model. a Flowchart illustrating the synthesis of VGG-16 and Resnet-50 models. In the task
of identifying early gastric cancer, after typical k-fold cross-validation, candidate
VGG-16 and Resnet-50 were differently combined (5 VGG; 4 VGG + 1 Resnet; 3 VGG + 2
Resnet; 2 VGG + 3 Resnet; 1 VGG + 4 Resnet; and 5 Resnet) and tested on the independent test set. The combination with the best performance was selected for the subsequent experiments. b The performance of the different combined models. The accuracy of the models synthesized
with 5 VGG, 4 VGG + 1 Resnet, 3 VGG + 2 Resnet, 2 VGG + 3 Resnet, 1 VGG + 4 Resnet,
and 5 Resnet were 90.5 %, 92.0 %, 92.5 %, 90.5 %, 90.0 %, and 89.0 %, respectively.
The combination of 3 VGG-16 and 2 Resnet-50 achieved the highest accuracy.
Comparison between DCNN and endoscopists
To evaluate the DCNN’s diagnostic ability for EGC, 200 images independent from the
training/validation sets were selected as the test set. The performance of the DCNN
was compared with that of six expert endoscopists, eight seniors, and seven novices.
Lesions that are easily missed, including EGC types 0-I, 0-IIa, 0-IIb, 0-IIc, and 0-mixed, were selected ([Table 3]; [Fig. 7]). Two endoscopists with more than 10 years of EGD experience reviewed these images. In the test, endoscopists were asked whether a suspicious malignant lesion was shown in each image. The calculation formulas of the comparison metrics are described in Appendix e1, available online in Supplementary materials.
Table 3
Lesion characteristics in the test set for the detection of early gastric cancer (EGC).
| | EGC | Control |
| 0-I | 2 | |
| 0-IIa | 10 | |
| 0-IIb | 25 | |
| 0-IIc | 31 | |
| 0-mixed (0-IIa + IIc, IIc + IIb, IIc + III) | 32 | |
| Normal | | 18 |
| Superficial gastritis | | 40 |
| Mild erosive gastritis | | 42 |
| Total | 100 | 100 |
Fig. 7 Representative images predicted by the deep convolution neural network (DCNN) in
the test set for the classification of gastric locations into 26 parts, showing the
gastric locations determined by the DCNN and their prediction confidence.
L, lesser curvature; A, anterior wall; G, greater curvature; P, posterior wall.
Class 0, esophagus; 1, squamocolumnar junction; 2 – 5, antrum (G, P, A, L, respectively);
6, duodenal bulb; 7, descending duodenum; 8 – 11, lower body in forward view (G, P,
A, L, respectively); 12 – 15, middle-upper body in forward view (G, P, A, L, respectively);
16 – 19, fundus (G, P, A, L, respectively); 20 – 22, middle-upper body in retroflexed
view (P, A, L, respectively); 23 – 25, angulus (P, A, L, respectively).
To evaluate the DCNN’s ability to classify gastric locations, we compared the accuracy
of the DCNN with that of 10 experts, 16 seniors, and 9 novices. The test dataset consisted
of 170 images, independent from the training/validation sets, randomly selected from
each gastric location. In the test, endoscopists were asked which location a picture
referred to. Tests were performed using Document Star, a Chinese online test service
company. The description of the endoscopists participating in the experiment is presented
in Appendix e1, available in Supplementary materials.
Class activation maps
Class activation maps (CAMs) indicating suspicious cancerous regions were established,
as described previously [27]. In brief, before the final output layer of Resnet-50, a global average pooling
was performed on the convolutional feature maps and these were used as the features
for a fully-connected layer that produced the desired output. The color depth of CAMs
is positively correlated with the confidence of the prediction.
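The following sketch shows how a CAM of this kind can be computed for a single image, assuming a ResNet-50-style backbone with global average pooling before the final dense layer; the layer name and helper function are illustrative assumptions, not the authors' code.
```python
# Sketch of a class activation map (CAM): a weighted linear sum of the last
# convolutional feature maps, weighted by the dense-layer weights of the
# predicted class (after Zhou et al. [27]).
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, conv_layer_name="conv5_block3_out"):
    """Return a normalized CAM (H x W) for the predicted class of one image."""
    conv_layer = model.get_layer(conv_layer_name)
    cam_model = tf.keras.Model(model.input, [conv_layer.output, model.output])
    feature_maps, preds = cam_model(image[np.newaxis, ...])
    class_idx = int(np.argmax(preds[0]))
    # Weights of the final dense layer connecting the pooled features
    # to the score of the predicted class.
    class_weights = model.layers[-1].get_weights()[0][:, class_idx]
    cam = np.tensordot(feature_maps[0].numpy(), class_weights, axes=([-1], [0]))
    cam = np.maximum(cam, 0)            # keep only positive evidence
    return cam / (cam.max() + 1e-8)     # deeper color = higher confidence
```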
A grid model for the stomach
In order to automatically remind endoscopists of blind spots during EGD, a grid model
for the stomach was developed to present the covered parts. The onion-skin display
method was used to build a grid model for the stomach, as previously described [28]. The characteristics of each part of the stomach were extracted and put together
to generate a virtual stomach model. The model was set to be transparent before EGD.
As soon as the scope was inserted into the stomach, the DCNN model began to capture images and mapped them onto the corresponding parts of the model, coloring each observed part.
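A minimal sketch of the bookkeeping behind such a grid model is given below; the class name, confidence threshold, and 26-site indexing follow [Fig. e2] but are otherwise illustrative assumptions (the onion-skin rendering itself is omitted).
```python
# Each of the 26 sites starts "transparent" and is filled (colored) once the
# location network has localized a clear frame to it; the remainder are blind spots.
GASTRIC_SITES = list(range(26))  # site indices as defined in Fig. e2

class StomachGrid:
    def __init__(self, confidence_threshold=0.5):
        self.observed = set()
        self.threshold = confidence_threshold

    def update(self, site_index, confidence):
        """Record a site as covered when the DCNN localizes a frame to it."""
        if confidence >= self.threshold:
            self.observed.add(site_index)

    def blind_spots(self):
        """Sites still transparent in the grid model, i.e. not yet observed."""
        return [s for s in GASTRIC_SITES if s not in self.observed]
```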
Running the DCNN on videos
Frame-wise prediction was used on unprocessed videos using client-server interaction
([Fig. e8a]). Images were captured at 2 frames per second (fps). Noise was smoothed by the random forest classifier model [29] and the rule of “output a result only when three of the five consecutive images show the same result” ([Fig. e8b]). The time needed to output a prediction per frame on the videos in the clinical
setting includes time consumed in the client (image capture, image resizing, and rendering
images based on predicted results), network communication, and the server (reading
and loading images, running the three networks, and saving images).
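The three-of-five temporal smoothing rule can be illustrated with a short sketch; the random forest noise filter is not shown, and the class name is hypothetical.
```python
# A prediction is emitted only when at least three of the five most recent
# frame-wise predictions agree; otherwise the frame is treated as noise.
from collections import Counter, deque

class MajorityOfFiveSmoother:
    def __init__(self, window=5, required=3):
        self.buffer = deque(maxlen=window)
        self.required = required

    def push(self, prediction):
        """Add the latest frame-wise prediction; return a label or None."""
        self.buffer.append(prediction)
        if len(self.buffer) < self.buffer.maxlen:
            return None
        label, count = Counter(self.buffer).most_common(1)[0]
        return label if count >= self.required else None
```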
Fig. e8 Frame-wise prediction was used on unprocessed videos using client-server interaction.
a A schematic of a client communicating with a server via the internet. The network
is running in the background using client-server interaction; there was a network
delay of 100 – 400 milliseconds in the videos. b Results were outputted only when three of the five consecutive images showed the
same result. Images were captured at 2 frames per second.
The speed of the DCNN in the clinical setting was evaluated by 926 independent tests,
calculating the total time used to output a prediction per frame in the endoscopy
center of Renmin Hospital of Wuhan University.
Human subjects
Endoscopists who participated in our tests provided informed consent. This study
was approved by the Ethics Committee of Renmin Hospital of Wuhan University, and was
registered as trial number ChiCTR1800014809 of the Primary Registries of the WHO Registry
Network.
Statistical analysis
A two-tailed unpaired Student's t test with a significance level of 0.05 was used to compare differences in the accuracy,
sensitivity, specificity, and positive and negative predictive values (PPV and NPV,
respectively) of the DCNN and endoscopists. Interobserver and intraobserver agreement
of the endoscopists and intraobserver agreement of the DCNN were evaluated using Cohen’s
kappa coefficient. All calculations were performed using SPSS 20 (IBM, Chicago, Illinois,
USA).
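For readers without SPSS, roughly equivalent calculations can be expressed in Python as follows; all numeric values in this example are placeholders, not study data.
```python
# Illustrative equivalents of the statistics computed in SPSS.
from scipy import stats
from sklearn.metrics import cohen_kappa_score

group_a = [0.90, 0.88, 0.91, 0.89]   # e.g. per-reader accuracies of one group (made up)
group_b = [0.84, 0.86, 0.83, 0.85]   # e.g. per-reader accuracies of another group (made up)
t, p = stats.ttest_ind(group_a, group_b)  # two-tailed unpaired Student's t test

first_reading = [1, 0, 1, 1, 0, 1]    # labels from a reader's first test round (made up)
second_reading = [1, 0, 1, 0, 0, 1]   # labels from the repeat round (made up)
kappa = cohen_kappa_score(first_reading, second_reading)  # intraobserver agreement
```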
Results
The performance of DCNN on identification of EGC
Comparison between the performance of DCNN and endoscopists
[Table 4] shows the predictions of the DCNN and endoscopists for identifying EGC. Among 200
gastroscope images, with or without malignant lesions, the DCNN correctly diagnosed
malignancy with an accuracy of 92.5 %, a sensitivity of 94 %, a specificity of 91 %,
a PPV of 91.3 %, and an NPV of 93.8 %. The six experts, eight seniors, and seven novices attained accuracies of 89.7 % (standard deviation [SD] 2.2 %), 86.7 % (SD 5.6 %), and 81.2 % (SD 5.7 %), respectively. The accuracy of the DCNN was significantly higher than that of all endoscopist groups. [Fig. 9] shows representative images predicted by the model in the test set. The CAMs highlighted
the cancerous regions after images were evaluated by the model.
Table 4
Performance of the deep convolution neural network (DCNN) and endoscopists in the
detection of early gastric cancer. All results are given as a percentage (standard
deviation).
| | DCNN | Experts (n = 6) | Seniors (n = 8) | Novices (n = 7) |
| Accuracy | 92.50 | 89.73 (2.15)[1] | 86.68 (5.58)[2] | 81.16 (5.72)[1] |
| Sensitivity | 94.00 | 93.86 (7.65) | 90.00 (6.05) | 75.33 (6.31)[1] |
| Specificity | 91.00 | 87.33 (7.43) | 85.05 (16.18) | 88.83 (6.03) |
| PPV | 91.26 | 91.75 (4.15) | 90.91 (5.69) | 80.47 (8.75)[1] |
| NPV | 93.81 | 92.52 (5.76) | 88.01 (6.55)[2] | 82.32 (11.46) |
PPV, positive predictive value; NPV, negative predictive value.
[1] P < 0.01
[2] P < 0.05
Fig. 9 Representative images predicted by the deep convolution neural network (DCNN) in
the test set for detection of early gastric cancer (EGC). a The displayed normal mucosa, superficial gastritis, and mild erosive gastritis were
predicted to be non-malignant by the DCNN with confidence levels of 0.98, 0.95, and
0.91, respectively. b The mucosal images from EGC type 0-IIa, 0-IIb, and 0-IIc were predicted to be malignant
by the DCNN with confidence of 0.86, 0.82, and 0.89, respectively. c Cancerous regions were indicated in the images from b after establishing class activation maps (CAMs). The color depth of the CAMs was
positively correlated with the prediction confidence.
Comparison between the stability of DCNN and endoscopists
To evaluate the stability of the DCNN and the endoscopists in identifying EGC, we shuffled all of the test pictures and randomly selected six endoscopists (two experts, two seniors, and two novices) to repeat the test. As shown in [Table 5], the experts had substantial interobserver agreement (kappa 0.80), and the seniors
and novices achieved moderate interobserver agreement (kappa 0.49 and 0.42, respectively).
The intraobserver agreement of experts and nonexperts was moderate or better (kappa
0.84 in the expert group and 0.54 – 0.77 in the nonexpert group). The DCNN achieved
perfect intraobserver agreement (kappa 1.0).
Table 5
Intra- and interobserver agreement (kappa value) of endoscopists in identifying early
gastric cancer.
| | Expert 1 | Expert 2 | Senior 1 | Senior 2 | Novice 1 | Novice 2 |
| Expert 1 | 0.84 | | | | | |
| Expert 2 | 0.80 | 0.84 | | | | |
| Senior 1 | | | 0.77 | | | |
| Senior 2 | | | 0.50 | 0.66 | | |
| Novice 1 | | | | | 0.66 | |
| Novice 2 | | | | | 0.42 | 0.54 |
The performance of DCNN on classification of gastric locations
Comparison between the performance of DCNN and endoscopists
[Table 6] shows the predictions of the DCNN and endoscopists for classifying gastric locations.
A group of 10 experts, 16 seniors, and 9 novices classified EGD images into 10 stomach
parts with an accuracy of 90.2 % (SD 5.1 %), 86.8 % (5.2 %), and 83.3 % (10.3 %),
respectively, and into 26 sites with an accuracy of 63.8 % (6.9 %), 59.3 % (6.4 %),
and 46.5 % (7.2 %), respectively. The DCNN correctly classified EGD images into 10 parts with an accuracy of 90 % and into 26 parts with an accuracy of 65.9 %, showing no significant difference from endoscopists at any level. [Fig. 7] and [Fig. 10] show representative images from the test set that were predicted by the DCNN in the
task of classifying gastric locations into 26 parts and 10 parts, respectively.
Table 6
Accuracy of the deep convolution neural network (DCNN) and endoscopists in classifying
gastric location. All results are given as percentage (standard deviation).
| | 10 parts | 26 parts |
| DCNN | 90.00 | 65.88 |
| Experts (n = 10) | 90.22 (5.09) | 63.76 (6.87) |
| Seniors (n = 16) | 86.81 (5.19) | 59.26 (6.36) |
| Novices (n = 9) | 83.30 (10.27) | 46.47 (7.23)[*] |
* P < 0.01
Fig. 10 Representative images predicted by the deep convolution neural network (DCNN) in
the test set for the classification of gastric locations into 10 parts, showing the
gastric locations determined by the DCNN and their prediction confidence.
Class 0, esophagus; 1, squamocolumnar junction; 2, antrum; 3, duodenal bulb; 4, descending
duodenum; 5, lower body in forward view; 6, middle-upper body in forward view; 7,
fundus; 8, middle-upper body in retroflexed view; 9, angulus.
Comparison between the stability of DCNN and endoscopists
In the task of classifying gastric locations into 10 parts, all endoscopists achieved
substantial interobserver or intraobserver agreement (kappa 0.75 – 0.96). In the 26-part
classification, all endoscopists achieved moderate interobserver or intraobserver
agreement (kappa 0.50 – 0.68) ([Table 7]). The DCNN achieved perfect intraobserver agreement (kappa 1.0).
Table 7
Intra- and interobserver agreement (kappa value) of endoscopists in classifying gastric
location into 10 or 26 parts.
| | Expert 1 | Expert 2 | Senior 1 | Senior 2 | Novice 1 | Novice 2 |
| 10 parts | | | | | | |
| Expert 1 | 0.96 | | | | | |
| Expert 2 | 0.91 | 0.90 | | | | |
| Senior 1 | | | 0.86 | | | |
| Senior 2 | | | 0.75 | 0.89 | | |
| Novice 1 | | | | | 0.89 | |
| Novice 2 | | | | | 0.85 | 0.93 |
| 26 parts | | | | | | |
| Expert 1 | 0.68 | | | | | |
| Expert 2 | 0.61 | 0.61 | | | | |
| Senior 1 | | | 0.55 | | | |
| Senior 2 | | | 0.52 | 0.60 | | |
| Novice 1 | | | | | 0.53 | |
| Novice 2 | | | | | 0.55 | 0.50 |
Testing of the DCNN in unprocessed gastroscope videos
To explore the ability of the DCNN to detect EGC and monitor blind spots in a real-time clinical setting, we tested the model on two unprocessed gastroscope videos. In [Video 1], which contained no cancerous lesion, the DCNN accurately presented the covered parts
synchronized with the process of EGD to verify that the entire stomach was mapped.
Video 1 Testing of the deep convolution neural network (DCNN) in an unprocessed esophagogastroduodenoscopy
(EGD) video from case 1. The DCNN accurately presented the covered parts synchronized
with the process of EGD to verify that the entire stomach had been mapped. In the
grid model for the stomach, any transparent area indicated that this part had not
been observed during the EGD. No cancerous lesion was detected.
In [Video 2], which contained cancerous lesions, the DCNN alerted to blind spots in synchrony with the progress of the EGD and automatically indicated suspicious EGC regions with CAMs. All lesions were successfully detected; however, a false-positive error occurred when the mucosa was covered by unwashed foam.
Video 2 Testing of the deep convolution neural network (DCNN) model in an unprocessed esophagogastroduodenoscopy
(EGD) video from case 2. The DCNN indicated suspicious gastric cancer regions and
presented the covered parts synchronized with the process of EGD. Suspicious cancerous
regions were indicated by class activation maps (CAMs) and the color depth of the
CAMs was positively correlated with the confidence of DCNN prediction. In the grid
model for the stomach, any transparent area indicated that this part had not been
observed during the EGD.
To test the speed of the DCNN, 926 independent tests were conducted in a clinical
setting. The total time to output a prediction using all three networks for each frame
was 230 milliseconds (SD 60; range 180 – 350). In the test of identifying EGC, six
experts, eight seniors, and seven novices took 3.29 seconds for each picture (SD 0.42),
3.96 (0.80), and 6.19 (1.92), respectively. In the test of classifying gastric locations
into 10 parts, 10 experts, 16 seniors, and 9 novices required 4.51 seconds per picture
(SD 2.07), 4.52 (0.65), and 4.76 (0.67), respectively; for classification into 26
parts, they took 14.23 seconds per picture (SD 2.41), 19.33 (9.34), and 24.15 (6.93),
respectively. The prediction time of the DCNN in the clinical setting was considerably
shorter than that taken by the endoscopists.
Discussion
Endoscopy plays a pivotal role in the diagnosis of gastric cancer, the third leading
cause of cancer death worldwide [1]
[4]. Unfortunately, endoscopic diagnosis of gastric cancer at an early stage is difficult,
requiring endoscopists first to obtain thorough knowledge and good technique [7]. Training a qualified endoscopist is time-consuming and costly. In
many countries, especially in Western Europe and China, the demand for endoscopists
familiar with the diagnosis of EGC means they are in short supply, which greatly limits
the effectiveness of endoscopy in the diagnosis and prevention of gastric cancer [7]
[8]
[9]
[10].
DCNN is one of the most important deep learning methods for computer vision and image
classification [12]
[13]
[14]. Chen et al. [13] achieved accurate classification of diminutive colorectal polyps based on images
captured during colonoscopy using DCNN, and Byrne et al. [14] achieved real-time differentiation of adenomatous and hyperplastic diminutive colorectal
polyps on colonoscopy videos. The most recent study [30] used a DCNN to detect EGC, with an overall sensitivity of 92.2 % and a PPV of 30.6 % on its dataset. Here, we developed a DCNN system that detects EGC, with a sensitivity of 94 % and a PPV of 91.3 %, and distinguishes gastric locations on a par with expert endoscopists. Furthermore, we compared the competence of the DCNN with that of endoscopists
and used the DCNN in unprocessed EGD videos to proactively track suspicious cancerous
lesions without blind spots.
Observing the whole stomach is a basic prerequisite for the diagnosis of gastric cancer
at an early stage [7]
[16]. In order to avoid blind spots, standardized procedures have been made to map the
entire stomach during gastroscopy. The European Society of Gastrointestinal Endoscopy
(ESGE) published a protocol including 10 images of the stomach in 2016 [15]. Japanese researchers published a minimum required “systematic screening protocol
for the stomach” (SSS) standard including 22 images of the stomach so as not to miss
suspicious cancerous lesions [16]. However, these protocols are often not well followed, and endoscopists may overlook some parts of the stomach because of subjective factors or limited technical skill,
which can lead to the misdiagnosis of EGC [7]
[8]
[10].
In the present study, using 24 549 images from EGDs, we developed a DCNN model that
accurately and stably recognizes each part of the stomach on a par with the expert
level, automatically captures images during endoscopy, and maps these onto a grid
model of the stomach to prompt the operator about blind spots. This real-time assistance
system will improve the quality of EGD and ensure that the whole stomach is observed
during endoscopy, thereby providing an important prerequisite for the detection of
EGC.
White-light imaging is the standard endoscopic examination for the identification
of gastric cancer lesions, although accurate detection of EGC with white-light imaging alone is difficult [31] [32]. It has been reported that the sensitivity of white-light imaging in the diagnosis
of superficial EGC ranges from 33 % to 75 % [33]. Many new technologies have been developed to improve diagnostic abilities for EGC,
including image-enhanced endoscopy and magnifying endoscopy [34]. Considerable time and money have been invested in training endoscopists to become familiar with the characteristics of early cancer lesions under different views [7]. However, current technologies still perform poorly for some inconspicuous lesions, such as type 0-IIb EGC [34], and manual diagnosis depends greatly on the experience and subjective state of the operator performing the endoscopy [8].
This subjective dependence on the operator decreases the accuracy and stability of
EGC diagnosis [8]. To overcome this limitation, DCNN, with its strong learning ability and good reproducibility,
has gained attention as a clinical tool for endoscopists. In the present study, 3170
gastric cancer and 6541 normal control images from EGD examinations were collected
to train a DCNN model with reliable and stable diagnosis ability for EGC. An independent
test was conducted to evaluate the diagnostic ability of DCNN and endoscopists. In
our study, the DCNN achieved an accuracy of 92.5 %, a sensitivity of 94 %, a specificity
of 91 %, a PPV of 91.3 %, and an NPV of 93.8 %, outperforming all levels of endoscopists.
In terms of stability evaluation, the nonexperts achieved moderate interobserver agreement
(kappa 0.42 – 0.49), and the experts achieved substantial interobserver agreement
(kappa 0.80). Because of the subjective interpretation of the EGC characteristics,
human learning curves in the diagnosis of EGC exist, and therefore an objective diagnosis
is necessary [5]
[8]. In our study, we used up-to-date neural network models to develop a DCNN system.
It achieved perfect intraobserver agreement (kappa 1.0), while endoscopists had variable
intraobserver agreement. Our results indicate that this gastric cancer screening system
based on a DCNN has adequate and consistent diagnostic performance, removes some of
the diagnostic subjectivity, and could be a powerful tool to assist endoscopists,
especially nonexperts, in detecting EGC.
The diagnostic ability and stability of the DCNN seem to outperform those of experienced
endoscopists. In addition, the diagnostic time of the DCNN was considerably shorter
than that of the endoscopists. It should be noted that the shorter screening time
and the absence of fatigue with the DCNN may make it possible to provide quick predictions
of EGC following an endoscopic examination. Importantly, the diagnosis of EGC by the
DCNN can be achieved completely automatically and online, which may contribute to
the development of telemedicine, thereby alleviating the problem of inadequate numbers
and experience of doctors in remote regions.
Another strength of this study is that the DCNN is armed with CAMs and a grid model
for the stomach to highlight suspicious cancerous regions and indicate the existence of
potential blind spots. The CAMs are a weighted linear sum of the presence of visual
patterns with different characteristics, by which the discriminative regions of target
classification are highlighted [27]. As soon as we insert the scope into the stomach, we can determine whether and where
EGC is present in the background mucosa using the DCNN with CAMs. In addition, the
observed areas are instantaneously recorded and colored in the grid model of the stomach,
indicating whether blind spots exist during the EGD. Through these two auxiliary tools,
it is possible for the DCNN to proactively track suspicious cancerous lesions without
blind spots, so reducing the pressure and workload on endoscopists during real-time
EGD.
There are some limitations in the present study. First, the detection of EGC was based
on images only in white light, narrow-band imaging (NBI), and blue-laser imaging (BLI)
views. With images under more views, such as chromoendoscopy using indigo carmine
[7], i-scan optical enhancement [34], and even optical coherence tomography [35], it is possible to design a more universal EGC detection system.
Second, in the control group of the gastric cancer dataset, only normal, superficial
gastritis, and mild erosive gastritis mucosa were included. Other benign conditions, such as atrophic gastritis, gastritis verrucosa, and typical benign ulcers, could be added to the control group in future work, which would further improve the practical applicability of the DCNN in detecting EGC.
Third, when the DCNN was applied to unprocessed EGD videos, false-positive errors
occurred when the mucosa was not washed clean. We plan to train the DCNN to recognize
poorly prepared mucosa to avoid these mistakes and to convert such false-positive alerts into a prompt to the operator to clean the mucosa.
Fourth, although the DCNN presented satisfactory results in the detection of EGC and
monitoring of EGD quality in real-time unprocessed videos, its competence was only
quantitatively evaluated in still images, not in videos. We will keep collecting data
to assess its ability in unprocessed videos and to provide accurate effectiveness
data for the DCNN in a real clinical setting in the near future.
In summary, a computer-aided system based on a DCNN provides automated, accurate,
and consistent diagnostic performance for the detection of EGC and the recognition
of gastric locations. The CAMs and grid model for the stomach enable the DCNN to proactively
track suspicious cancerous lesions without blind spots during EGD. The DCNN is a promising
technique in computer-based recognition and is not inferior to experts. It may be
a powerful tool to assist endoscopists in detecting EGC without blind spots. More
research should be conducted and clinical applications should be tested to further
verify and improve this system’s effectiveness.
Appendix e1 Supplementary methods
The methods of avoiding overfitting
To minimize the overfitting risk, we used four methods, including data augmentation
[22], k-fold cross-validation [23], early stopping [24], and synthesis of different models. Images were randomly transformed using methods
including height/width shift, shear, zoom, horizontal flip, and fill mode. Parameters
of the augmentation methods used in the three independent networks are shown in [Table e2]. Through data augmentation, images could be differently presented in every round
of training. The k-fold cross-validation procedure was implemented with k = 5, dividing
the dataset into five subsets and validating each subset individually, with the remaining
subsets used for training ([Fig. e4]). Furthermore, early stopping was used to monitor the validation curve during training and to stop updating the weights when the validation error had not decreased for four consecutive epochs (patience = 4). The early stopping epochs are indicated in [Fig. e5].
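A sketch of this augmentation and early-stopping setup, using standard Keras utilities and the Table e2 parameters for EGC identification, might look as follows (restore_best_weights is an added assumption, not stated above).
```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random transformations applied in every round of training (Table e2,
# EGC identification row).
augmenter = ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",
)

# Stop updating the weights when the validation loss has not decreased
# for four consecutive epochs (patience = 4).
early_stopping = EarlyStopping(monitor="val_loss", patience=4,
                               restore_best_weights=True)
```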
It has been reported that synthesizing different DCNN models [25] and averaging over the optimal K candidates [26] can reduce the variance of the resulting estimates. In the tasks of identifying blurry images and classifying gastric locations, the performance of VGG-16 was much better than that of Resnet-50, whereas in the task of identifying EGC the performance of VGG-16 and Resnet-50 was comparable. Therefore, we used VGG-16 alone for identifying blurry images and classifying gastric locations, and synthesized VGG-16 and Resnet-50 for EGC identification; the final prediction was based on the average confidence for each category estimated by the individual models. Candidate VGG-16 and Resnet-50 models were combined in different proportions and tested on the independent test set described in the Methods section of the main text ([Fig. e6a]). The combination of 3 VGG-16 and 2 Resnet-50 achieved the highest accuracy ([Fig. e6b]) and was therefore chosen for the subsequent experiments.
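The averaging rule for the final ensemble can be illustrated as follows; the function name is hypothetical, and each element of models stands for one trained VGG-16 or Resnet-50 candidate.
```python
# The per-class confidences of the five selected models (3 VGG-16 + 2 Resnet-50)
# are averaged, and the class with the highest mean confidence is reported.
import numpy as np

def ensemble_predict(models, image_batch):
    """Average the softmax outputs of all models over a batch of images."""
    probs = np.mean([m.predict(image_batch, verbose=0) for m in models], axis=0)
    return probs.argmax(axis=1), probs.max(axis=1)  # predicted class, confidence
```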
Confusion matrices and learning curves
The confusion matrices and learning curves of the three independent neural networks are presented to demonstrate the accuracy of the DCNN ([Fig. e5]). In each matrix, element (x, y) represents the number of images of true class (y) that were predicted as class (x). The learning curves show the loss and accuracy of the DCNN on the validation datasets after different numbers of training epochs. The loss value was calculated by binary crossentropy in binary classification (networks 1 and 2), and by categorical crossentropy in multiclass classification (network 3).
The calculation formulas of comparison metrics in the comparison between DCNN and
endoscopists
The comparison metrics are accuracy, sensitivity, specificity, PPV, and NPV, where
accuracy = true predictions / total number of cases, sensitivity = true positive/positive,
specificity = true negative / negative, PPV = true positive / (true positive + false
positive), and NPV = true negative / (true negative + false negative). The “true positive”
is the number of correctly predicted malignancy images, “false positive” is the number
of mistakenly predicted malignancy images, “positive” is the number of images showing
malignancy, “true negative” is the number of correctly predicted non-malignancy images,
“false negative” is the number of mistakenly predicted non-malignancy images, and
“negative” is the number of non-malignancy images shown.
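These formulas translate directly into code; the example below uses counts consistent with the DCNN's reported results on the 200-image test set (94 true positives, 6 false negatives, 91 true negatives, 9 false positives).
```python
# Direct translation of the comparison-metric formulas above, applied to a
# binary confusion matrix (tp/fp/tn/fn) from the EGC test set.
def comparison_metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive / positive
        "specificity": tn / (tn + fp),   # true negative / negative
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

print(comparison_metrics(tp=94, fp=9, tn=91, fn=6))
# -> accuracy 0.925, sensitivity 0.94, specificity 0.91, PPV ~0.913, NPV ~0.938
```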
Description of endoscopists participating in the experiment
Endoscopists were blinded to histologic data and the study design. Experts were staff
members in the Gastroenterology Department of Renmin Hospital, Wuhan University, with
more than 5 years of EGD experience, and their average annual EGD volume was higher
than 1000. Senior endoscopy physicians were staff members in the Gastroenterology
Department of Renmin Hospital, Wuhan University, with EGD experience of more than
1 year and less than 3 years, and their average annual EGD volume was higher than
800. Novices were fellows in the Gastroenterology Department of Renmin Hospital, Wuhan
University, with less than 1 year of EGD experience, and their average annual EGD
volume was 50 – 200.