Computer-aided diagnosis for optical diagnosis of diminutive colorectal polyps including sessile serrated lesions: a real-time comparison with screening endoscopists

Background  We aimed to compare the accuracy of the optical diagnosis of diminutive colorectal polyps, including sessile serrated lesions (SSLs), between a computer-aided diagnosis (CADx) system and endoscopists during real-time colonoscopy. Methods  We developed the POLyp Artificial Recognition (POLAR) system, which was capable of performing real-time characterization of diminutive colorectal polyps. For pretraining, the Microsoft COCO dataset with over 300 000 nonpolyp object images was used. For training, eight hospitals prospectively collected 2637 annotated images from 1339 polyps (i.e. the publicly available online POLAR database). For clinical validation, POLAR was tested during colonoscopy in patients with a positive fecal immunochemical test (FIT) and compared with the performance of 20 endoscopists from eight hospitals. Endoscopists were blinded to the POLAR output. The primary outcome was the comparison of the accuracy of the optical diagnosis of diminutive colorectal polyps between POLAR and endoscopists (neoplastic [adenomas and SSLs] versus non-neoplastic [hyperplastic polyps]). Histopathology served as the reference standard. Results  During clinical validation, 423 diminutive polyps detected in 194 FIT-positive individuals were included for analysis (300 adenomas, 41 SSLs, 82 hyperplastic polyps). POLAR distinguished neoplastic from non-neoplastic lesions with 79% accuracy, 89% sensitivity, and 38% specificity. The endoscopists achieved 83% accuracy, 92% sensitivity, and 44% specificity. The optical diagnosis accuracy of POLAR and the endoscopists was not significantly different (P = 0.10). The proportion of polyps for which POLAR was able to provide an optical diagnosis (i.e. the success rate) was 98%. Conclusions  We developed a CADx system that differentiated neoplastic from non-neoplastic diminutive polyps during endoscopy, with an accuracy comparable to that of screening endoscopists and a near-perfect success rate.


Appendix 1s  Standardized image acquisition protocol for prospective data collection (training and internal validation)

A. Colonoscopy procedure
I. The endoscopist or nurse checks whether the patient has signed informed consent.
II. The endoscopist or nurse starts the study tablet (POLAR tablet) by selecting "start treatment". The tablet automatically generates a study number (Figure 1.1s).
III. The endoscopist or nurse notes the study number and patient characteristics (name, date of birth, and sex) on a separate confidential linking list.
IV. The endoscopist or nurse selects the type of colonoscope and type of video processor on the tablet.
V. If a polyp is detected, the endoscopist takes at least two frontal images of each polyp with NBI without magnification. These endoscopy images are transferred automatically to the storage box and study tablet (POLAR box and tablet). The storage box shows the most recent endoscopy images on the study tablet.
VI. The endoscopist or nurse selects the correct photos of the polyp on the tablet. Only these photos are stored on the storage box (Figure 1.2s).
VII. The endoscopist or nurse also records on the tablet the number of the pathology container in which the lesion is collected. The storage system stores this combination of image, type of colonoscope, type of video processor, pathology container number, and study number.
VIII. The polyp can then be removed at the discretion of the endoscopist.
IX. The endoscopy nurse collects each lesion in a separate pathology container and sends it to the pathologist for histopathological examination.
X. The endoscopist or nurse finishes the session by selecting "end of treatment" on the tablet.
B. Procedure to collect pathology results of participants
After patients are successfully enrolled and images of polyps have been collected, the corresponding pathology results are collected through the national PALGA database (Pathologisch-Anatomisch Landelijk Geautomatiseerd Archief, https://www.palga.nl/). Figure 1.3s shows an overview of the anonymization process.
I. Every participating center receives software from ZorgTTP (https://www.zorgttp.nl/) to pseudonymize the list of included study patients (name, date of birth, and sex), including an administration number.
II. These pseudonymized data are sent to ZorgTTP through a secure link. ZorgTTP is a Trusted Third Party that provides the pseudonymization for PALGA.
III. ZorgTTP performs a second pseudonymization and sends the pseudonymized data, including the administration number, to PALGA.
IV. PALGA links the pseudonyms of the research cohort to the pseudonyms in the PALGA database and selects the requested pathology results.
V. PALGA links the pathology results with the pseudonymized data and sends them back to the participating center.
C. Procedure connecting pathology results and endoscopic images
I. The local study investigator links the pathology results list from PALGA (pathology and patient numbers) with the linking list (study number, patient number), and deletes the patient number after combining the two lists. The resulting pathology-study number list is stored on the secured POLAR laptop.
II. The local study investigator then exports the endoscopic images and study numbers from the POLAR box to the secured POLAR laptop.
III. Finally, the local study investigator merges the pathology-study number list with the endoscopic images and study numbers.
This procedure is performed every 1-2 months at the local study site by the local study investigator (supplier) on a secured laptop, in collaboration with the Amsterdam UMC.
D. Anonymization procedure
I. When the endoscopic images and pathology results have been merged, the data are anonymized with an anonymization tool. This tool removes study numbers and metadata, and changes a random pixel in the corner of the image to a random other pixel. The anonymization tool is run on the secured laptop by the local study investigator in collaboration with the Amsterdam UMC. Afterwards, the linking list is removed by the local investigator.
II. The anonymization procedure is double-checked by the local study investigator (supplier) and recipients.
III. After the anonymization of the data has been checked, the anonymized set of images is delivered on a secured hard drive, together with polyp pathology type, type of colonoscope, and video processor. These anonymized data are transferred to the Amsterdam UMC and ZiuZ.
After all images are collected in the storage system, the CAD system is further trained and tested with these additional data.

Polyp localization
To localize a polyp within an endoscopic image, a convolutional neural network based on the YOLOv4 object detection model was trained. 1 Object detection models are trained to look at an image and search for a subset of object classes. When found, these objects are enclosed in a bounding box and their class is identified. In our case there was only one class of objects to identify, namely "polyp", and we therefore tweaked the YOLOv4 model to always return the bounding box most likely to contain a polyp.
YOLOv4 is a single-shot object detection model. Such models are rather complex neural networks, but they typically consist of three main stages: the backbone, the neck, and the head. The backbone is in most cases a pre-trained convolutional neural network (CNN) that learns features from raw images and yields a class prediction. The CNN used as a backbone in YOLOv4 is CSPDarknet53. It is based on Darknet53, 2 a CNN with successive 3x3 and 1x1 convolutional layers as well as a multitude of skip connections among non-successive layers (DenseNet), in order to mitigate the problem of vanishing gradients and reduce the number of trainable network parameters. 3 Compared with Darknet53, the CSPDarknet53 variant employs cross-stage partial connections that integrate the gradient changes into the feature maps across the whole network, thus reducing the number of calculations needed while maintaining accuracy. 4 The neck of an object detector collects and aggregates feature maps from different stages of the backbone and is composed of several bottom-up and top-down paths. YOLOv4 uses the Path Aggregation Network (PANet) 5 for feature aggregation and adds a spatial pyramid pooling (SPP) 6 block after the CSPDarknet53 backbone to increase its receptive field. The pyramid topology is especially powerful because it allows YOLOv4 to detect objects of different sizes: blocks with smaller spatial resolution detect large objects, whereas blocks with larger resolution specialize in detecting small objects. Since the features from different resolutions are concatenated, the fully aggregated features at the end of the feature extractor are capable of detecting objects of different scales and sizes. Finally, the head of the object detector is the part responsible for making the predictions, that is, assigning bounding boxes and a set of conditional object class probabilities at specific coordinates on the image.
This is done by first splitting the image into a grid of cells and subsequently predicting a fixed number of bounding boxes within each cell, together with the corresponding conditional class probabilities and a confidence score that represents the overlap between the predicted bounding box and any ground truth box. Combining the bounding boxes and their confidence scores with the class probabilities yields the final predictions for object detection, as shown in Fig. 2.1s. 7 The YOLOv4 model we used for our CAD system was initially trained on the MS COCO dataset, which contains more than 300k images from a broad range of object classes (80 in total), such as person, window, car, and chair. 8 Since YOLOv4 learns features that generalize very well, we initially used the pre-trained backbone and neck of the network as-is and trained only the head for the object class we were interested in (domain adaptation). We initialized the weights of the head with the final weights of its pre-trained version and trained further for a number of epochs (or until convergence was reached). In our case, we were only interested in one object type: polyps. At this point the YOLOv4 model was able to detect polyps, but we opted for one more, final training phase, in which we further trained all neural network layers, this time with a very small learning rate in order to fine-tune the model. This procedure of changing the domain of a pre-trained neural network and then fine-tuning it to further improve performance is called transfer learning. 9 The output of the localizer consisted of the bounding box of the area most likely to contain a polyp and a score between 0 and 1 representing the likelihood that a polyp lies within this bounding box. For the final CAD system, scores of ≥0.2 were deemed good enough to accept a localization.
This threshold was chosen empirically based on the cumulative distribution function of the localization confidence scores in dataset 2 (Figure 2.2s), which we obtained using 5-fold cross-validation. Setting the threshold at 0.2 led to an Intersection over Union (IoU) score of 80.7%, a Dice coefficient of 87.8%, and an acceptable data loss of roughly 5.7%. Each polyp in the training datasets was annotated with a bounding box by two non-medical experts, who consulted a medical expert in difficult cases.
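For reference, the IoU and Dice scores reported above can be computed per polyp from the predicted and annotated bounding boxes. A minimal sketch; the (x1, y1, x2, y2) corner format is an assumption for illustration, not necessarily the representation used by POLAR:

```python
def box_iou_dice(box_a, box_b):
    """IoU and Dice coefficient for two axis-aligned (x1, y1, x2, y2) boxes."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union if union else 0.0
    dice = 2 * inter / (area_a + area_b) if (area_a + area_b) else 0.0
    return iou, dice
```

Note that Dice = 2·IoU / (1 + IoU), so a mean IoU of 80.7% and a mean Dice of 87.8% are two views of the same overlap statistic.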

Quality checks
In case of a successful localization, the area defined by the bounding box (sub-image) was evaluated with regard to three parameters: sharpness, contrast, and percentage of overexposed pixels. This was done to ensure that only images of sufficient quality were presented to the classification stage of the pipeline. We defined sharpness as the variance of the Laplacian of the image. 10 The Laplacian operator highlights regions of an image containing rapid intensity changes. We set the threshold for sufficient sharpness to 70 after empirically evaluating a number of blurry images in our dataset and their corresponding sharpness values. Contrast was defined as the standard deviation of the intensity values of an image (root mean square contrast). Contrast is directly related to brightness, and images with insufficient contrast are perceived as too dark by the human visual system. We set the minimum acceptable contrast to 10 in the same empirical manner as before. Finally, we defined as overexposed all pixels with an intensity value >245 (the upper 4% of the dynamic range) across all three color channels. If a sub-image contained more than 25% overexposed pixels, it was considered of insufficient quality. These thresholds were also set by visually evaluating random images from our dataset. Setting the quality check thresholds at these values resulted in the loss of roughly another 6% of the images. Consequently, our processing pipeline reached the classification stage for approximately 88% of the input training images.
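The three quality checks can be sketched as follows, using the thresholds stated above. This is a simplified NumPy/SciPy re-implementation for illustration, not the original POLAR code; the exact Laplacian kernel and grayscale conversion used in POLAR are not specified, so those details are assumptions:

```python
import numpy as np
from scipy.ndimage import laplace

def passes_quality_checks(gray, rgb, sharpness_thr=70.0, contrast_thr=10.0,
                          overexposed_frac=0.25):
    """Return True if a polyp sub-image passes all three quality checks.

    gray: 2-D array of grayscale intensities (0-255).
    rgb:  H x W x 3 array of color channels (0-255).
    """
    sharpness = laplace(gray.astype(float)).var()  # variance of the Laplacian
    contrast = gray.astype(float).std()            # RMS contrast
    # a pixel is overexposed when all three channels exceed 245
    overexposed = np.mean((np.asarray(rgb) > 245).all(axis=-1))
    return (sharpness > sharpness_thr and contrast > contrast_thr
            and overexposed <= overexposed_frac)
```

A blurry or flat sub-image fails the sharpness and contrast checks, and a largely saturated one fails the overexposure check, so only well-exposed, in-focus crops reach the classifier.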
Polyp classification model
In case of a successful localization and sufficient quality, the area defined by the bounding box (sub-image) was evaluated by the classifier. The aim of the classifier in the POLAR system was to automatically differentiate adenomas, SSLs, and hyperplastic polyps, similarly to the WASP classification for optical diagnosis. 11 In our search for the model with the best performance, we compared a number of different classifiers, including local image descriptors such as the scale-invariant feature transform (SIFT) and Local Binary Patterns, 12 as well as popular CNN architectures such as VGG, AlexNet, and DenseNet. 3,13,14 The classifier that consistently outperformed all other classification models in the aforementioned 5-fold cross-validation setting was ultimately based on the local features approach proposed by Tamaki et al., 15 which we adapted and parameterized according to our dataset and classification problem (Figure 2.3s).
The core of our classifier relies on SIFT descriptors 16 extracted from each image using grid sampling (gridSIFT). 17 SIFT is a local feature descriptor invariant to shift, rotation, scale, and intensity changes that essentially encodes the image gradients of various orientations within a neighborhood of pixels.
In the traditional approach, sparse points of interest are detected on an image and SIFT descriptors are extracted only at those points, whereas in our approach the image was split into a regular grid and SIFT descriptors for multiple scales and orientations were extracted from each cell of the grid (Figure 2.4s). 18 Our gridSIFT implementation contained three pre-processing steps. First, we defined our grid to ignore 15% of the pixels at the periphery of the images, based on the empirical observation that the bounding boxes from the localization step tend to position the polyps in their center but always include background pixels. Subsequently, we applied a local contrast enhancement technique called unsharp masking, which allowed us to amplify the high-frequency components of the image. Unsharp masking consists of computing a Gaussian-blurred version of the original image and then creating a new, sharpened image using the blending formula sharpened = original + (original − blurred) × amount. We obtained the best results using a Gaussian filter with σ = 2 and amount = 0.5. Finally, we designated all pixels with values in the upper 4% of the dynamic range in all color channels as overexposed and excluded them from the gridSIFT feature extraction.
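The unsharp masking step follows the blending formula given above; a minimal sketch using SciPy's Gaussian filter with the stated parameters (σ = 2, amount = 0.5):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(image, sigma=2.0, amount=0.5):
    """sharpened = original + (original - blurred) * amount"""
    img = np.asarray(image, dtype=float)
    blurred = gaussian_filter(img, sigma=sigma)  # Gaussian-blurred version
    return img + (img - blurred) * amount
```

A flat region is left unchanged (original and blurred coincide), while intensity transitions are amplified with a slight overshoot on either side of an edge, which is what boosts the high-frequency components.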
We used an initial grid spacing of 11 pixels for images up to 525x525 pixels. This grid spacing was enlarged by 11 pixels every time the image size became at least 66% larger with respect to one (or more) of its dimensions. Since smaller spacing means more features, this ensured that larger images would not generate so many more features that they would bias our training data. Figure 2.5s depicts the locations of the key points within a grid that was automatically created for a random image in our dataset. At every grid point we specified a range of scales (5, 7, 9, and 11 pixels) and orientations (0, 45, 90, 135, and 180 degrees) to extract features from. Figure 2.6s shows an example of the spatial information that key points on the same grid point but with different scales take into account. From every grid point we extracted 20 SIFT descriptors (one for every combination of scale and orientation), which added up to N×20 feature descriptors for an image with N grid cells. In our dataset (dataset 2) we had on average 320 grid points, and thus 6400 SIFT descriptors, per image (4 ≤ N ≤ 1026).
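The spacing rule above can be read as multiplicative steps of 66% starting from the 525-pixel base. A sketch under that reading (the exact step rule used in POLAR may differ; this function is our interpretation for illustration):

```python
import math

def grid_spacing(width, height, base=11, base_size=525, growth=1.66):
    """Grid spacing for gridSIFT sampling: 11 px for images up to 525x525,
    +11 px each time the largest dimension grows by another 66% (assumed
    multiplicative steps relative to the 525-pixel base)."""
    ratio = max(width, height) / base_size
    steps = math.floor(math.log(ratio, growth)) if ratio > 1 else 0
    return base * (1 + steps)
```

Coarser spacing on larger images keeps the number of grid points, and hence descriptors, roughly comparable across image sizes.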
In the next step we used the extracted SIFT descriptors to train a Bag-of-Visual-Words (BoVW) representation of the polyp images. We chose a total vocabulary size of 4500 visual code words and explicitly assigned a specific number of code words to each class of polyps according to their distribution across the examples (17% hyperplastic polyp, 10% sessile serrated lesion, and 73% adenoma). This yielded 765 code words for hyperplastic polyps, 450 code words for sessile serrated lesions, and 3285 code words for adenomas. To create these code words we used the minibatch k-means clustering algorithm. 19 The SIFT descriptors from the images of each class of polyps were clustered separately, with k = 765, 450, and 3285 for hyperplastic polyps, sessile serrated lesions, and adenomas, respectively. The resulting cluster centroids are vectors in the same feature space as the SIFT descriptors and served as the code words of our visual vocabulary. Finally, we mapped every original SIFT descriptor to a code word through the clustering process, thus creating a histogram of code words for every image in the dataset. Ultimately, we used the normalized histogram of code words as the feature vector for our polyp classification model. After extensive experimentation with different classification algorithms, we opted for a Support Vector Machine (SVM) 20 with an exponentiated χ² kernel as the learning algorithm, as suggested by Vempati and colleagues. 21 Prediction scores of ≥0.33 were deemed sufficient for a low-confidence diagnosis and prediction scores of ≥0.50 for a high-confidence diagnosis. These thresholds were chosen because they were found to produce a high-confidence rate of approximately 83% during the preclinical validation; a high-confidence rate of at least 75% was targeted, based on levels achieved in previously published studies. 22,23

After multiple runs of five-fold cross-validation, the classifier obtained an average diagnostic accuracy of 84% when the dataset was divided into neoplastic and non-neoplastic lesions.

The NPVs for neoplastic lesions in the rectosigmoid, based on high-confidence optical diagnosis of diminutive polyps (POLAR system and endoscopists), were compared with histopathology using a two-sided McNemar test. To calculate the NPV for predicting diminutive neoplastic lesions in the rectosigmoid, the high-confidence optical diagnosis of each diminutive polyp (POLAR system and endoscopists) was compared with the histopathological diagnosis.
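Once the per-class code words (cluster centroids) have been obtained, e.g. with minibatch k-means as described above, mapping an image's SIFT descriptors to the normalized code-word histogram can be sketched as follows (NumPy only; the clustering step itself and the 128-dimensional SIFT descriptors are omitted for brevity):

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest code word (Euclidean distance)
    and return the normalized histogram used as the image feature vector.

    descriptors: (n, d) array of local feature descriptors.
    codebook:    (k, d) array of cluster centroids (the visual vocabulary).
    """
    # squared Euclidean distance from every descriptor to every code word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest code word per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

In the POLAR setting the codebook would be the concatenation of the per-class centroid sets, so the histogram dimension equals the total vocabulary size.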
The agreement of the assigned surveillance intervals based on high-confidence optical diagnosis of diminutive colorectal polyps was calculated as the number of correctly predicted surveillance intervals. The POLAR system and the endoscopists were again compared using the two-sided McNemar test. The reference standard for the correct surveillance interval was based on histopathology and the European surveillance guidelines. 24 To examine the impact of optical diagnosis on surveillance interval recommendations, the appropriate recommendation was determined with and without the information from the optical diagnosis (endoscopists and the POLAR system). First, surveillance intervals were determined for each patient by combining the results from high-confidence optical diagnosis of diminutive polyps with the pathology diagnosis of all other polyps. Then, surveillance intervals were assigned based only on the actual pathology results, and this served as the reference standard for calculating the surveillance interval agreement. Patients were excluded from the surveillance interval agreement analysis if no diminutive polyps were detected, if the histopathology outcome was missing, or if only low-confidence predictions were made for diminutive polyps. Patients were also excluded if they were referred for additional treatment of a detected lesion; if they were diagnosed with synchronous CRC, inflammatory bowel disease, or polyposis syndrome; or if the surveillance advice was shortened or altered for other clinical reasons. Patients with an incomplete bowel examination (i.e. no cecal intubation or Boston Bowel Preparation Scale <6) were also excluded.
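The two-sided McNemar test used for these paired comparisons depends only on the discordant pairs, i.e. the cases where exactly one of the two raters (POLAR or the endoscopist) is correct. A minimal sketch of the exact (binomial) form, assuming SciPy; the published analysis may have used a different implementation:

```python
from scipy.stats import binomtest

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pair counts.

    b: pairs where rater 1 is correct and rater 2 is wrong.
    c: pairs where rater 2 is correct and rater 1 is wrong.
    Under the null hypothesis, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(min(b, c), n, 0.5).pvalue
```

Concordant pairs (both correct or both wrong) do not enter the statistic, which is why the test is well suited to comparing two diagnostic strategies on the same polyps.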
The diagnostic accuracy for optical diagnosis of diminutive polyps when endoscopists are assisted by the POLAR system (AI-assisted optical diagnosis) was calculated as follows. If the endoscopist assessed a diminutive polyp with high confidence, the endoscopist's assessment was used as the final optical diagnosis, regardless of the POLAR system's prediction. If the endoscopist assessed a diminutive polyp with low confidence, the POLAR system's prediction was used as the final optical diagnosis only if it was a high-confidence prediction. If the POLAR system also assessed the polyp with low confidence, the endoscopist's assessment was used as the final optical diagnosis.
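The decision rule above can be stated compactly in code; a minimal sketch (the function and argument names are ours, chosen for illustration):

```python
def final_optical_diagnosis(endo_dx, endo_high_conf, polar_dx, polar_high_conf):
    """AI-assisted optical diagnosis: the endoscopist's high-confidence call
    always wins; otherwise POLAR's high-confidence call is used; otherwise
    fall back to the endoscopist's low-confidence call."""
    if endo_high_conf:
        return endo_dx
    if polar_high_conf:
        return polar_dx
    return endo_dx
```

In other words, POLAR's prediction only changes the outcome when the endoscopist is uncertain and POLAR is not.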
Generalized estimating equations modelling using binary logistic regression, adjusted for clustering of polyps and patients per endoscopist, was performed to evaluate factors associated with accurate optical diagnosis by POLAR. The outcome variable of the model was accurate histology prediction of a polyp. Potential predictors included polyp location, size, morphology, the endoscopists' confidence level, the endoscopists' optical diagnosis accuracy, and technical characteristics (e.g. localization confidence score, Intersection over Union score, Dice score, sharpness score, contrast score, overexposure score, and classifier confidence score). For the endoscopists' optical diagnosis accuracy, the endoscopists were split into the ten with the highest accuracy versus the ten with the lowest accuracy. The associations between predictors and accuracy were summarized as odds ratios with 95% confidence intervals (CIs) and corresponding P values. Analyses were performed in the statistical software R.

Traditional serrated adenoma, n (%): 0 (0%)
Hyperplastic polyp, n (%): 82 (19%)

Table 3s
Diagnostic performance of the CADx system, endoscopists, and CADx-assisted diagnosis for differentiating neoplastic from non-neoplastic diminutive lesions a
CADx, computer-aided diagnosis; PPV, positive predictive value; NPV, negative predictive value.
a For the calculation of the diagnostic accuracies (neoplastic vs. non-neoplastic), adenomas and sessile serrated lesions were considered neoplastic polyps, while hyperplastic polyps were considered non-neoplastic. Note that when an SSL was assessed as an adenoma or vice versa, this was also considered a correct diagnosis, because both are categorized as neoplastic lesions. CADx-assisted: only high-confidence predictions (n=117).

Table 4s
Outcomes of multilevel logistic regression analysis to find factors associated with accurate optical diagnosis by POLAR