
DOI: 10.1055/a-2577-3928
Performance of AI Approaches for COVID-19 Diagnosis Using Chest CT Scans: The Impact of Architecture and Dataset
Leistungsfähigkeit von KI-Methoden zur COVID-19-Diagnose mittels Thorax-CT: Der Einfluss von KI-Architektur und Datensätzen
Supported by: Jilin Provincial Key Laboratory of Medical Imaging & Big Data 20200601003JC
Supported by: Radiology and Technology Innovation Center of Jilin Province 20190902016TC
Supported by: China International Medical Foundation, Imaging Research, SKY Z-2014-07-2003-03
Supported by: RACOON (NUM), „NUM 2.0“ FKZ: 01KX2121
- Abstract
- Zusammenfassung
- Abbreviations
- Introduction
- Materials and methods
- Results
- Discussion
- Conclusion
- References
Abstract
Purpose
AI is emerging as a promising tool for diagnosing COVID-19 based on chest CT scans. The aim of this study was to compare AI models for COVID-19 diagnosis. To this end, we: (1) trained three distinct AI models for classifying COVID-19 and non-COVID-19 pneumonia (nCP) using a large, clinically relevant CT dataset, (2) evaluated the models’ performance using an independent test set, and (3) compared the models both algorithmically and experimentally.
Materials and Methods
In this multicenter, multi-vendor study, we collected n=1591 chest CT scans of COVID-19 (n=762) and nCP (n=829) patients from China and Germany. In Germany, the data was collected from three RACOON sites. We trained and validated three COVID-19 AI models with different architectures: COVNet based on a 2D CNN, DeCoVnet based on a 3D CNN, and AD3D-MIL based on a 3D CNN with an attention module. A total of 991 CT scans were used for training the AI models using 5-fold cross-validation, and 600 CT scans from 6 different centers were used for independent testing. The models’ performance was evaluated using accuracy (Acc), sensitivity (Se), and specificity (Sp).
Results
The average validation accuracy of the COVNet, DeCoVnet, and AD3D-MIL models over the 5 folds was 80.9%, 82.0%, and 84.3%, respectively. On the independent test set with n=600 CT scans, COVNet yielded Acc=76.6%, Se=67.8%, Sp=85.7%; DeCoVnet provided Acc=75.1%, Se=61.2%, Sp=89.7%; and AD3D-MIL achieved Acc=73.9%, Se=57.7%, Sp=90.8%.
Conclusion
The classification performance of the evaluated AI models is highly dependent on the training data rather than the architecture itself. Our results demonstrate a high specificity and moderate sensitivity. The AI classification models should not be used unsupervised but could potentially assist radiologists in COVID-19 and nCP identification.
Key Points
- This study compares AI approaches for diagnosing COVID-19 in chest CT scans, which is essential for further optimizing the delivery of healthcare and for pandemic preparedness.
- Our experiments using a multicenter, multi-vendor, diverse dataset show that the training data is the key factor in determining the diagnostic performance.
- The AI models should not be used unsupervised but as a tool to assist radiologists.
Citation Format
- Jaiswal A, Fervers P, Meng F et al. Performance of AI Approaches for COVID-19 Diagnosis Using Chest CT Scans: The Impact of Architecture and Dataset. Rofo 2025; DOI 10.1055/a-2577-3928
Zusammenfassung
Ziel
Aktuell existieren verschiedenste Künstliche Intelligenz (KI)-Modelle zur Detektion und Klassifikation von Pneumonien in Thorax-CTs, aber unabhängige Vergleiche fehlen meist. In dieser Studie haben wir (1) drei verschiedene KI-Modelle zur Klassifizierung von COVID-19- und Nicht-COVID-19-Pneumonien (nCP) anhand eines klinisch relevanten CT-Datensatzes trainiert, (2) die Leistung der Modelle anhand eines unabhängigen Testsatzes bewertet und (3) die Modelle sowohl algorithmisch als auch experimentell verglichen.
Materialien und Methoden
In dieser multizentrischen, retrospektiven Studie haben wir insgesamt 1591 Thorax-CTs von COVID-19- (n=762) und nCP (n=829)-Patienten aus China und Deutschland zusammengestellt; in Deutschland wurden die CT-Daten von 3 RACOON-Standorten eingeschlossen. Es wurden 3 open-source KI-Modelle mit unterschiedlichen Architekturen trainiert und validiert: COVNet basierend auf 2D-CNN, DeCoVnet basierend auf 3D-CNN, und AD3D-MIL basierend auf 3D-CNN mit Attention-Modul. Die Performance der Modelle wurde anhand von Genauigkeit (Acc), Sensitivität (Se) und Spezifität (Sp) bewertet.
Ergebnisse
Die durchschnittliche Validierungsgenauigkeit der Modelle COVNet, DeCoVnet und AD3D-MIL über die 5-Fach-Validierung im Training mit n=991 CTs betrug 80,9%, 82,0% bzw. 84,3%. Auf dem unabhängigen Testsatz mit n=600 CTs lieferte COVNet: Acc=76,6%, Se=67,8%, Sp=85,7%; DeCoVnet: Acc=75,1%, Se=61,2%, Sp=89,7%; und AD3D-MIL: Acc=73,9%, Se=57,7%, Sp=90,8%.
Schlussfolgerung
Die Klassifizierungsleistung der evaluierten KI-Modelle hängt in hohem Maße von den Trainingsdaten und weniger von der Architektur selbst ab. Unsere Ergebnisse zeigen eine hohe Spezifität und eine moderate Sensitivität bei der Differenzierung von COVID-19- und Nicht-COVID-19-Pneumonien. Die KI-Klassifikationsmodelle sollten nicht unkritisch verwendet werden, könnten Radiologen jedoch unterstützen.
Kernaussagen
- Vorliegende Studie vergleicht KI-Ansätze zur bildbasierten Diagnose von COVID-19 in Thorax-CTs, was für die weitere Optimierung der Gesundheitsversorgung und die Vorbereitung auf mögliche künftige Pandemien relevant ist.
- Unsere Experimente mit einem multizentrischen, herstellerübergreifenden Datensatz zeigen, dass die Trainingsdaten der entscheidende Faktor für die diagnostische Leistungsfähigkeit sind.
- KI-Modelle sollten nicht autonom eingesetzt werden, sondern als unterstützendes Werkzeug in die radiologische Befundung integriert werden, um die diagnostische Entscheidungsfindung zu ergänzen, nicht zu ersetzen.
Abbreviations
Introduction
The coronavirus disease 2019 (COVID-19) pandemic is a poignant reminder of how rapidly a global health crisis can emerge and pose a significant challenge to the world's healthcare systems. The potential for other pathogens to cause future pandemics, with the lungs as a primary target, underscores the critical need for comprehensive pandemic preparedness measures. As with COVID-19, fast detection and prompt patient isolation will be crucial for curbing the spread of future pandemics. Chest computed tomography (CT) is a method of choice for imaging COVID-19 and other viral and bacterial pneumonias [1]. Due to a lack of alternative diagnostic methods in the early pandemic, chest CT was frequently performed to diagnose the disease [2]. SARS-CoV-2 reverse transcription polymerase chain reaction (RT-PCR) and antibody tests have since become widely accessible and are considered the most reliable methods for diagnosing COVID-19 [3]. Nevertheless, chest CT still has a potential role in the diagnosis of COVID-19 pneumonia and in determining the disease stage [1]. The Fleischner Society recommends chest CT as a diagnostic tool if RT-PCR resources are limited and testing could delay isolation or crucial treatment [4]. Furthermore, a patient with a suspected false-negative RT-PCR test and at least moderate clinical features qualifies for a chest CT scan [4]. Although the diagnosis of COVID-19 is currently the domain of laboratory testing, chest CT is frequently performed to provide detailed information about the severity and extent of lung involvement [5].
With the aim of supporting radiologists, numerous AI approaches have been developed for the automatic detection of COVID-19 based on CT scans. These algorithms are often based on convolutional neural networks (CNN) with two dimensions (2D) [6] [7] [8] [9] [10] [11] [12] [13] or three dimensions (3D) [14] [15] [16] [17] [18]. 2D approaches learn features from individual slices of a volumetric CT scan [6] [7] [8] [9] [10] [11] [12] [13]; slice-level results are often aggregated to obtain patient-level predictions. 3D approaches, on the other hand, utilize the 3D volume for feature extraction and directly generate patient-level predictions [14] [15] [16] [17] [18]. Different approaches require patient-level [9] [10] [14] [15] [18], slice-level [8], or pixel-level [19] labels for training. Along with conventional CNNs, machine-driven design [20] [21] has also been explored, and a hybrid strategy that combines traditional machine learning with deep learning [22] has been proposed. A summary of various algorithms proposed in the literature is shown in [Table 1].
| COVID-19 diagnosis algorithm | 2D CNN | 3D CNN | Caps-Net | Pixel-level label | Slice-level label | Patient-level label | Machine-driven design | Hybrid |
|---|---|---|---|---|---|---|---|---|
| Xiong et al. (2020) [6], Rahimzadeh et al. (2021) [8], Wang et al. (2021) [12], Wang et al. (2021) [13] | X | | | | X | | | |
| Song et al. (2021) [7], Jin et al. (2020) [9], Li et al. (2020) [10] | X | | | | | X | | |
| Qian et al. (2020) [11] | X | | | | X | X | | |
| Wang et al. (2020) [14], Han et al. (2020) [15], Lee et al. (2021) [16], Javaheri et al. (2021) [17], Wang et al. (2020) [18] | | X | | | | X | | |
| Zhang et al. (2020) [19] | X | | | X | | | | |
| Wu et al. (2021) [23] | X | | | X | | X | | |
| Amyar et al. (2020) [24], Wang et al. (2021) [25], Gao et al. (2021) [26] | X | | | X | | | | |
| Afshar et al. (2022) [27] | | | X | | | X | | |
| Qi et al. (2022) [28] | | X | | | | X | | |
| Gunraj et al. (2020) [20], Gunraj et al. (2022) [21] | X | | | | | | X | |
| Qi et al. (2021) [22], Mei et al. (2020) [29] | X | | | | | X | | X |
| Hou et al. (2021) [30] | X | X | | | X | X | | |
Many algorithms proposed in the literature lack an external validation dataset (e.g., [8] [10] [11]), which is important for assessing the generalization of AI models. An independent comparison of multiple approaches based on a common dataset can play a guiding role for both radiologists and AI developers. To address these issues, in this study, we collected a large and diverse set of CT scans from China and Germany. Chest CT data from three RACOON sites (Cologne, Frankfurt, Heidelberg) were utilized in this study. RACOON is a nationwide RAdiological COOperative Network of 36 university hospitals in Germany. It is supported by the National University Medicine Network (NUM), funded by the German Federal Ministry of Education and Research (BMBF). The unique RACOON infrastructure supports large-scale AI studies and could play a key role in Germany's pandemic preparedness program.
Using datasets from China and Germany, we aimed to assess and compare the performance of three distinct AI approaches (based on 2D and 3D CNNs) for distinguishing between COVID-19 pneumonia and non-COVID-19 pneumonia (nCP). Our overall goal was to assess three publicly available AI tools for COVID-19 diagnosis and to determine whether these tools could potentially be used to support radiologists in clinical decision-making.
Materials and methods
Dataset
For this retrospective IRB-approved study, we collected a multicenter, multi-vendor chest CT dataset consisting of n=1591 chest CT scans of COVID-19 (n=762) and nCP (n=829) patients from China and Germany ([Fig. 1]). For the COVID-19 class, the inclusion criteria were pulmonary infiltration and a positive RT-PCR test within 48 h before the CT examination. For the nCP class, the inclusion criteria were: (1) inflammatory infiltrations on CT scans acquired before the outbreak of COVID-19 (February 2016 to December 2019), (2) an additional negative RT-PCR test after the outbreak of COVID-19 (January 2020). The exclusion criteria were imaging features consistent with lung tumors, tuberculosis, or traumatic and postoperative scarred lesions.


Our dataset is balanced across the two disease classes. The COVID-19 class includes cases at different disease stages (early: 0–3 days, progressive: 4–7 days, peak: 8–14 days, and absorption: ≥15 days), assessed based on CT morphology and the interval between CT scanning and symptom onset [31]. Based on laboratory etiological confirmation, the nCP class includes pneumonia caused by viral and bacterial pathogens. The number of scans and their distribution over the two classes and the subcategories for the training and test datasets are presented in [Table 2]. Detailed patient demographics across the different centers are given in [Table 3]. The entire Chinese dataset, which contains only full-dose CT scans, was acquired on scanners from seven different CT device manufacturers. The dataset from Germany, which contains 53.7% low-dose and 46.3% full-dose CT scans, was acquired on scanners from two CT device manufacturers ([Table 3]). Further information about vendors, protocols, etc. can be found in a recent article [32].
We used n=991 CT scans (n=462 COVID-19 and n=529 nCP) from three different centers in China to train the AI models using a five-fold cross-validation approach. In each fold, a non-overlapping 80% of the CT scans was used for training and the remaining 20% for validation. CT scans from all six centers in China (n=300; centers: Jilin, Wuhan, Ningbo) and Germany (n=300, external dataset; centers: Cologne, Frankfurt, Heidelberg) were used for independent testing to obtain an indication of the generalization of the trained models. This independent test set contained a total of n=600 CT scans with balanced COVID-19 and nCP classes, none of which the algorithms saw during training or validation. The internal test set came from the same sites as the training dataset, whereas the external test set came from different sites. A detailed description of the training and validation process was recently published [32].
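The five-fold split described above can be sketched as follows. This is a minimal illustration with a stand-in label array; scikit-learn's `StratifiedKFold` is an assumed substitute for whatever split procedure was actually used:

```python
# Illustrative 5-fold split of the 991 training scans (462 COVID-19,
# 529 nCP). StratifiedKFold keeps the class ratio similar in each fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([1] * 462 + [0] * 529)   # 1 = COVID-19, 0 = nCP (stand-in)
scan_ids = np.arange(labels.size)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [(train_idx, val_idx) for train_idx, val_idx in skf.split(scan_ids, labels)]
# Each fold: ~80% of scans for training, ~20% for validation.
```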
Automatic COVID-19 diagnosis
For this study, we selected AI algorithms based on criteria such as: (1) different network architectures and training strategies, (2) availability of code and documentation, and (3) the ability to train with only patient-level labels. For more details about our literature search and model selection, see supplement 1. We trained and validated the selected models for diagnosing COVID-19 using chest CT scans: COVNet [10] based on a 2D CNN, DeCoVnet [14] based on a 3D CNN, and AD3D-MIL [15] based on a 3D CNN with an attention module. All three approaches consist of two steps ([Fig. 2]): the first step segments the lung area to avoid the effect of irrelevant regions, and the second step classifies the lung-masked CT scan as COVID-19 or nCP.


Lung segmentation
For lung segmentation, we utilized three different models: Seg-Net [33], U-Net [9] [34], and U-Net(R-231) [34] [35]. The models were trained on different datasets and employ slightly different preprocessing. The Seg-Net [33] model was trained and tested using 44,500 CT slices; preprocessing involved resampling (1×1 mm), rescaling (512×512), windowing [-1000, 500 HU], and normalization (0–1). The 2D U-Net [9] [34] model was trained and tested using 16,223 CT slices; preprocessing included resampling (1×1 mm), windowing [-1200, 700 HU], and normalization (0–1). The U-Net(R-231) [35] model is based on U-Net [34] with batch normalization and was trained on a diverse dataset of 62,224 CT slices; preprocessing included body cropping, rescaling (256×256), windowing [-1024, 600 HU], and normalization (0–1).
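The windowing and normalization steps listed above can be illustrated in a few lines. The window values follow the Seg-Net preprocessing ([-1000, 500] HU, rescaled to 0–1); the function name is our own:

```python
# Clip a CT volume (in Hounsfield units) to an intensity window,
# then linearly rescale the window to [0, 1].
import numpy as np

def window_and_normalize(hu, lo=-1000.0, hi=500.0):
    clipped = np.clip(hu.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)

volume = np.array([[-2000.0, -1000.0], [0.0, 1200.0]])  # toy HU values
out = window_and_normalize(volume)
# Air (<= -1000 HU) maps to 0.0; everything >= 500 HU maps to 1.0.
```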
COVID-19 classification
The literature review (see supplement 1) yielded the following three algorithms for assessment in this study:
1. COVID-19 detection neural network (COVNet)
COVNet is based on ResNet-50 [36], a 50-layer residual network. COVNet takes a lung-masked CT scan as input and provides a patient-level prediction as output. As shown in [Fig. 3](a), ResNet-50 captures slice-level features using 2D convolutions (2D CNN), and CT-level features are obtained by max pooling. Preprocessing steps include resampling (224×224 in-plane), downsampling (by a factor of 5 in the Z-direction), intensity clipping (-1250, 250 HU), and normalization (0–1). During training, data augmentation was applied in the form of random rotation, flipping, and added Gaussian noise. The weights of ResNet-50 were initialized using weights optimized on the ImageNet database [37].
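The 2D-then-aggregate idea can be sketched as a toy: per-slice feature vectors are reduced to one CT-level vector by max pooling over the slice axis. The "backbone" below is a hand-written stand-in for ResNet-50, not the real network:

```python
# Toy COVNet-style aggregation: slice-level features -> CT-level features.
import numpy as np

def fake_backbone(slices):
    """Stand-in slice encoder: one 4-D feature vector per slice."""
    return np.stack([np.array([s.mean(), s.std(), s.max(), s.min()])
                     for s in slices])

scan = np.random.default_rng(0).normal(size=(40, 224, 224))  # 40 slices
slice_features = fake_backbone(scan)       # shape (40, 4)
ct_features = slice_features.max(axis=0)   # max pooling -> shape (4,)
```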


2. 3D deep convolutional neural network to detect COVID-19 (DeCoVnet)
In contrast to COVNet, DeCoVnet is based on a 3D ResNet, which performs 3D convolutions (3D CNN) to learn features. As shown in [Fig. 3](b), DeCoVnet consists of a network stem, residual blocks (ResBlocks), and a progressive classifier. The four ResBlocks include shortcut connections and pass on 3D feature maps. The classifier progressively extracts important features using 3D max pooling and directly yields CT-level class probabilities. The input to DeCoVnet is a lung-masked CT scan. Preprocessing included resampling (224×336), intensity clipping (-1200, 600 HU), and normalization (0–1). During training, data augmentation comprising random affine transformations and color jittering was performed. The weights of the model were initialized using the Kaiming initialization method [38].
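The progressive 3D max pooling in the classifier stage can be illustrated with a pure-numpy stand-in that repeatedly halves each spatial dimension of a 3D feature map (the real model additionally uses learned 3D convolutions, which are omitted here):

```python
# Progressive 3D max pooling: halve each dimension per step via reshape.
import numpy as np

def max_pool_3d(x, k=2):
    d, h, w = (s // k for s in x.shape)
    return x[:d*k, :h*k, :w*k].reshape(d, k, h, k, w, k).max(axis=(1, 3, 5))

feat = np.random.default_rng(1).normal(size=(16, 32, 32))  # toy feature map
while min(feat.shape) > 2:          # pool until the map is nearly collapsed
    feat = max_pool_3d(feat)
```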
3. Attention-based deep 3D multiple instance learning (AD3D-MIL)
AD3D-MIL frames COVID-19 detection as multiple instance learning (MIL), where instances are automatically generated using the convolutional layers of DeCoVnet [14]. Once instances are obtained, attention-based pooling is applied to concentrate on important instances by weighting them. Next, a two-layer fully connected neural network provides the final class probabilities. Compared to the above two methods, AD3D-MIL is thus based on a 3D ResNet with attention pooling ([Fig. 3](c)). AD3D-MIL takes lung-masked CT scans as input. Preprocessing consisted of resampling (256×256), intensity clipping (-1024, 600 HU), and normalization (0–1). Data augmentation included color jittering and random affine transformations. For training, the model was initialized using random weights following Kaiming initialization [38].
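The attention-based pooling step can be sketched in numpy: instance features are combined using softmax attention weights. The parameters `V` and `w` below are random stand-ins for the trained attention layer:

```python
# Attention-based MIL pooling: weight instances, then average them.
import numpy as np

def attention_pool(instances, V, w):
    """instances: (n, d); V: (d, h); w: (h,). Returns a (d,) bag feature."""
    scores = np.tanh(instances @ V) @ w        # one score per instance
    alphas = np.exp(scores - scores.max())
    alphas = alphas / alphas.sum()             # softmax attention weights
    return alphas @ instances, alphas

rng = np.random.default_rng(0)
inst = rng.normal(size=(10, 8))                # 10 instances, 8-D features
bag, alphas = attention_pool(inst, rng.normal(size=(8, 4)), rng.normal(size=4))
```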
Training
For diagnosing COVID-19, we trained, validated, and tested the three models (COVNet, DeCoVnet, and AD3D-MIL) using exactly the same set of images. The models were trained using a five-fold cross-validation approach, and the best model from each fold was selected for inference based on the highest validation accuracy (Acc). Pretrained models were used for lung segmentation: we used the lung masks obtained by Seg-Net [33] for COVNet, by U-Net [9] for DeCoVnet, and by U-Net(R-231) [35] for AD3D-MIL, in accordance with the original [14] [15] or previous publications [32]. Hyperparameters are presented in supplementary Table 2. During inference on the independent test set, the predictions of the five best models were ensembled using majority voting.
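The majority-vote ensembling over the five fold models reduces to a simple count. The predictions below are made up for illustration (each row is one fold model's 0/1 predictions for four scans):

```python
# Majority voting over five fold models; with five voters there are no ties.
import numpy as np

fold_preds = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
])

# Predict COVID-19 (1) whenever at least 3 of the 5 models say 1.
ensemble = (fold_preds.sum(axis=0) >= 3).astype(int)
```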
Statistics
Statistical analysis was performed in Python using the SciPy (Stats) [39] and Scikit-learn (Metrics, Calibration) [40] packages and in R using the pROC [41] package. Figures were plotted using the Matplotlib (Pyplot) package [42]. Statistical hypothesis testing of the non-parametric dichotomous performance data was performed on pairwise 2×2 contingency tables using McNemar’s test. A bootstrapping approach was applied to calculate confidence intervals, and DeLong’s test was used to compare AUCs. Statistical significance was defined as p < .05.
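A hand-rolled sketch of the two paired procedures named above (not the exact SciPy calls used in the study): McNemar's test computed from the discordant cells of a 2×2 table, and a percentile bootstrap confidence interval for accuracy. The counts are made up for illustration:

```python
# McNemar's test (continuity-corrected) and a bootstrap CI for accuracy.
import math
import random

def mcnemar_p(b, c):
    """b, c: discordant counts (model A right/B wrong and vice versa)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi^2 with 1 df: P(X > x) = erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(stat / 2))

def bootstrap_acc_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI; `correct` is a list of 0/1 outcomes."""
    rng = random.Random(seed)
    accs = sorted(
        sum(rng.choices(correct, k=len(correct))) / len(correct)
        for _ in range(n_boot)
    )
    return accs[int(n_boot * alpha / 2)], accs[int(n_boot * (1 - alpha / 2)) - 1]

p = mcnemar_p(25, 35)                        # e.g. 25 vs. 35 discordant pairs
lo, hi = bootstrap_acc_ci([1] * 75 + [0] * 25)
```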
Results
Lung segmentation
For lung segmentation, we used three different CNN models: Seg-Net [33], U-Net [9] [34], and U-Net(R-231) [34] [35], trained on different datasets. Upon visual analysis, we found that the lung masks produced by all three models were of sufficient quality. [Fig. 4] shows exemplary slices from nCP and COVID-19 CT scans along with the lung masks obtained by using the three lung segmentation models.


COVID-19 classification
In [Table 4], we present the performance (validation accuracy) of the best models from the five folds on the corresponding validation data; CV1 to CV5 represent cross-validation folds 1 to 5. The mean accuracies obtained by the COVNet, DeCoVnet, and AD3D-MIL models over the 5 folds are 80.9%, 82.0%, and 84.3%, respectively. In [Fig. 5], we show slices from CT scans of nCP (viral pneumonia, bacterial pneumonia) and COVID-19 patients as well as the corresponding patient-level predictions provided by the three diagnostic models. These scans depict different patterns, including unilobar and bilobar infiltrations, and different disease stages and etiologies. As can be seen from these examples, all three COVID-19 diagnostic models yielded similar predictions.


We quantified the diagnostic performance using accuracy (Acc), sensitivity (Se), and specificity (Sp). [Table 4] presents the performance and the Brier score for the three AI models on the independent test set. COVNet yielded Acc=76.6%, Se=67.8%, Sp=85.7%; DeCoVnet provided Acc=75.1%, Se=61.2%, Sp=89.7%; and AD3D-MIL resulted in Acc=73.9%, Se=57.7%, Sp=90.8%. Each model yielded a moderate sensitivity and a relatively high specificity. The three models achieved similar performance with respect to the independent test set. The difference between the models’ performance was not statistically significant (COVNet vs. DeCoVnet: p=.49; COVNet vs. AD3D-MIL: p=.20; DeCoVnet vs. AD3D-MIL: p=.56). ROC curves, AUCs, and calibration curves for the three models for the test set (including scans from Germany and China) are presented in [Fig. 6]. Comparable AUCs (COVNet: 0.86; DeCoVnet: 0.86; AD3D-MIL: 0.84) and Brier scores (COVNet: 0.16; DeCoVnet: 0.18; AD3D-MIL: 0.20) indicate that the models have comparable discrimination performance and levels of calibration.
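For reference, the reported metrics follow the standard definitions, sketched here on toy labels and predicted probabilities (the values are illustrative, not study data):

```python
# Accuracy, sensitivity, specificity, and Brier score from toy predictions.
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0])              # 1 = COVID-19, 0 = nCP
p_covid = np.array([0.9, 0.4, 0.8, 0.2, 0.3, 0.6])  # predicted probabilities
y_pred = (p_covid >= 0.5).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
acc = (tp + tn) / y_true.size
se = tp / np.sum(y_true == 1)                      # sensitivity
sp = tn / np.sum(y_true == 0)                      # specificity
brier = np.mean((p_covid - y_true) ** 2)           # lower = better calibrated
```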


In addition to the complete test set, performance (accuracy, AUC, ROC and calibration curves) with respect to the test data from Germany (external) and China (internal) is also shown in [Table 4] and [Fig. 6]. Supplementary Table 3 presents the DeLong test for comparing AUCs. The AUCs of the three models are not significantly different for any of the internal and external test sets. AD3D-MIL performs better on the internal test set compared to the external test set. AUCs for COVNet and DeCoVnet do not differ significantly between internal and external test sets.
For a comprehensive comparison, we show datasets and classification results from the original publications of COVNet [10], DeCoVnet [14], and AD3D-MIL [15] in [Table 5]. Additionally, the results obtained from other studies (Hou et al. (2021) [30], Qian et al. (2020) [11], Wang et al. (2021) [12], Wang et al. (2021) [13]) that compared their approaches with COVNet and DeCoVnet are also shown.
| Published study | Dataset | Classes | Perf. | COVNet | DeCoVnet | Theirs |
|---|---|---|---|---|---|---|
| COVNet, Li et al. (2020) [10] | 4352 CT scans, 3322 patients, six centers | COVID-19, CAP, non-pneumonia | AUC | > 0.95 | – | – |
| DeCoVnet, Wang et al. (2020) [14] | 630 CT scans, 540 patients, one center | COVID-19, healthy | Acc (%) | – | 90.1 | – |
| AD3D-MIL, Han et al. (2020) [15] | 460 CT scans, 309 patients, multi-center | COVID-19, common pneumonia + no pneumonia | Acc (%) | – | 96.1 | 97.9 |
| Hou et al. (2021) [30] | 801 scans, 707 patients, inhouse* | COVID-19, H1N1, CAP | Acc (%) | 68.2 | 84.9 | 90.5 |
| Qian et al. (2020) [11] | 734 patients, inhouse | COVID-19, H1N1, CAP, healthy | Acc (%) | 68.8 | 93.8 | 95.2 |
| Wang et al. (2021) [12] | 1164 CT scans, local hospitals | COVID-19, CAP, SPT, healthy | Se (%) | 89.8 | 91.1 | 95.6 |
| Wang et al. (2021) [13] | 640 images, 284 patients | COVID-19, healthy | Acc (%) | 93.8 | 90.3 | 97.1 |

\* Hou et al. (2021) [30] included two additional public datasets with 2D slices in their experiment. We only mention their 3D dataset and the corresponding classification performance because of the focus of our study.
Discussion
Since the beginning of the pandemic, CT imaging has played an important role in the diagnosis, severity assessment, and management of COVID-19. In general, given the potential for AI to improve patient care in radiology by aiding in detection and classification tasks, it is crucial to investigate methods and underlying principles for improving these AI algorithms in order to further optimize the delivery of healthcare. While various classification models have been proposed for diagnosing COVID-19 pneumonia on chest CT, this study fills a gap in the literature by performing an independent comparison of three AI models (COVNet, DeCoVnet, and AD3D-MIL) with different architectures for classifying COVID-19 and nCP, and thereby diagnosing COVID-19. Studies by Garg et al. [43] and Ardakani et al. [44] compared different neural networks. However, their models were trained using individual slices, 3D models were not evaluated, and the datasets were relatively small (collected from a single country; n=210 and n=194 patients, respectively). To the best of our knowledge, this is the first study to perform an independent comparison of these approaches using a multicenter, multi-vendor, multi-country dataset. From a clinical perspective, independent performance and robustness assessments are important for evaluating which AI models can potentially be used to support radiologists in clinical decision-making.
The three AI models compared in this study have architectural differences (COVNet: 2D-CNN, DeCoVnet: 3D-CNN, and AD3D-MIL: 3D-CNN with attention module). COVNet [10] is based on a 2D CNN with ResNet-50 as a backbone to learn slice-level features which are aggregated to obtain global features. Different ResNet variants have been employed in various studies [7] [8] [9] [11] [30]. In contrast to 2D CNN, DeCoVnet [14] utilizes 3D ResBlocks and learns volumetric features using 3D convolutions. Compared to 2D CNN, it exploits multiple slices simultaneously and learns rich features. AD3D-MIL [15] additionally utilizes attention-based pooling that focuses on the most important instances for making a decision.
Using the validation sets, the three models achieved good and comparable performance. The similar mean accuracies and low standard deviations indicate the stability of the models’ performance across the folds ([Table 4]). On the test dataset, our classification results (obtained by an ensemble of the five best models) showed good accuracy (Acc=73.9–76.6%) with high specificity (Sp=85.7–90.8%) and moderate sensitivity (Se=57.7–67.8%). The performance assessment shows that these models are useful in distinguishing between COVID-19 and nCP with good discriminating performance (AUC=0.84–0.86). Moreover, the models performed well on both the internal (AUC=0.87, 0.88) and external (AUC=0.81–0.84) test sets. However, because of the only moderate sensitivity, their unsupervised clinical use is not recommended. Yet, they can potentially be used to assist radiologists: AI assistance in differentiating between COVID-19 and nCP has been found to increase performance in terms of accuracy, diagnostic time, and diagnostic confidence [32]. In addition to the predicted classes, model confidence might play a role in certain clinical applications, which would benefit from recalibration of the models.
Other studies that evaluated COVNet and DeCoVnet on their own independent datasets have reported varying levels of performance ([11] [12] [13] [30]; see [Table 5]). Hou et al. [30] and Qian et al. [11] reported that COVNet yielded accuracies below 70% (68.2% and 68.8%, respectively) on their respective datasets, which is lower than the accuracy we obtained with COVNet (76.6%). In the present study, the evaluated models achieved similar performance, which is analogous to the previous findings of Wang et al. [12], who obtained comparable sensitivities for COVNet (Se=89.8%) and DeCoVnet (Se=91.1%). In another report by Wang et al. [13], COVNet (Acc=93.8%) performed slightly better than DeCoVnet (Acc=90.3%), whereas DeCoVnet outperformed COVNet in the other studies [11] [12] [30]. These findings indicate that whether a 2D or a 3D CNN performs better depends on the underlying dataset.
Compared to the performance we achieved, the first reports of COVNet [10], DeCoVnet [14], and AD3D-MIL [15] as well as the other studies [11] [12] [13] [30] reported >90% performance (Acc >90%, AUC >0.95, or Se >95%) for their proposed models ([Table 5]). One reason could be the inclusion of healthy controls in their datasets, whereas our study focused exclusively on COVID-19 and nCP patients with pulmonary infiltrations; distinguishing healthy controls from COVID-19 patients is a comparatively easy task. Pneumonia in the nCP class and in the COVID-19 class can show similar disease patterns on CT scans [45]. These similarities in imaging patterns make the classification problem challenging even for experienced radiologists. COVID-19 pneumonia can show a large overlap of imaging patterns with non-COVID infective lung disease, which limits the diagnostic performance of CT [46]. Our moderate classification performance is therefore in line with this commonly acknowledged limitation.
Another reason for the difference between our results and those reported in other studies could be dataset diversity. The studies shown in [Table 5] often use a small dataset collected from local sources, i.e., data from one center [11] [14] [30] or from multiple centers in a single country [10] [12] [15], and do not describe the inclusion of different COVID-19 stages [11] [13]. In contrast, our large, diverse, and balanced dataset includes images from different scanners from China and Germany. Moreover, our data cover different disease stages (see examples in [Fig. 4]): the COVID-19 set includes scans at the early, progressive, peak, and absorption stages, and the viral and bacterial pneumonias in the nCP set have mild, moderate, and extensive severities. Although this diversity in the dataset may result in lower performance metrics, it also means that the AI algorithms are better equipped to address the radiologically relevant cases that require accurate differentiation support.
The models trained in this study have the advantage of firstly not including healthy subjects in the training dataset, secondly having a heterogeneous disease stage within the COVID-19 cases, and thirdly requiring differentiation between viral as well as bacterial non-COVID-19 cases. As a result, the models might be more focused on the challenging cases where augmented diagnostic decision-making could offer clinical benefits.
In addition to the architectural differences, the models use slightly different preprocessing and initialization and, in our experiments, employed different lung segmentation models. Despite these differences, the three models achieved comparable performance. As discussed above, these architectures achieved high performance on the different datasets used in their original publications, whereas on our diverse dataset, in which the two classes show similar patterns, their performance is inferior. This main finding of our study indicates that the classification performance depends primarily on the training data rather than on the underlying CNN architecture itself.
This study has certain limitations. The retrospective study design focuses on COVID-19 and excludes other classification options encountered in daily clinical reporting. Moreover, human-machine interaction was not evaluated. One of the model selection criteria for this study was the ability to train with only patient-level labels; although such labels can be easily obtained and enable quick experimentation with AI models, diagnostic performance using slice- or pixel-level labels was not explored. Follow-up studies should also evaluate AI models exploiting clinical parameters along with CT imaging. In the future, an extended dataset with scans from other centers as well as public datasets could be used. Furthermore, understanding the models’ predictions using explainable AI techniques as well as analyzing the models’ confidence will be a focus of our future studies.
Conclusion
In summary, we trained and compared three AI models for diagnosing COVID-19. The models trained on our diverse dataset resulted in comparable performance for the independent test dataset despite fundamental algorithmic differences. The heterogeneity of the training data and the considered classification options determine the diagnostic performance. The only moderate performance of all included models with respect to the independent test set underlines that these models should not be used unsupervised but rather as a tool to assist radiologists.
Conflict of Interest
Rahil Shahzad is an employee of Philips Healthcare; the other authors declare no conflicts of interest.
References
- 1 Inui S, Gonoi W, Kurokawa R. et al. The role of chest imaging in the diagnosis, management, and monitoring of coronavirus disease 2019 (COVID-19). Insights Imaging 2021; 12: 155
- 2 Kwee TC, Kwee RM. Chest CT in COVID-19: What the radiologist needs to know. Radiographics 2020; 40: 1848-1865
- 3 Sethuraman N, Jeremiah SS, Ryo A. Interpreting Diagnostic Tests for SARS-CoV-2. JAMA 2020; 323: 2249-2251
- 4 Rubin GD, Ryerson CJ, Haramati LB. et al. The role of chest imaging in patient management during the COVID-19 pandemic: A multinational consensus statement from the Fleischner Society. Radiology 2020; 296: 172-180
- 5 Fervers P, Fervers F, Jaiswal A. et al. Assessment of COVID-19 lung involvement on computed tomography by deep-learning-, threshold-, and human reader-based approaches—an international, multi-center comparative study. Quant Imaging Med Surg 2022; 12: 5156-5170
- 6 Xiong Z, Wang R, Bai HX. et al. Artificial Intelligence Augmentation of Radiologist Performance in Distinguishing COVID-19 from Pneumonia of Other Origin at Chest CT. Radiology 2020; 296: E156-E165
- 7 Song Y, Zheng S, Li L. et al. Deep Learning Enables Accurate Diagnosis of Novel Coronavirus (COVID-19) with CT Images. IEEE/ACM Trans Comput Biol Bioinform 2021; 18: 2775-2780
- 8 Rahimzadeh M, Attar A, Sakhaei SM. A fully automated deep learning-based network for detecting COVID-19 from a new and large lung CT scan dataset. Biomed Signal Process Control 2021; 68
- 9 Jin C, Chen W, Cao Y. et al. Development and evaluation of an artificial intelligence system for COVID-19 diagnosis. Nat Commun 2020; 11: 5088
- 10 Li L, Qin L, Xu Z. et al. Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy. Radiology 2020; 296: E65-E71
- 11 Qian X, Fu H, Shi W. et al. M3Lung-Sys: A Deep Learning System for Multi-Class Lung Pneumonia Screening from CT Imaging. IEEE J Biomed Health Inform 2020; 24: 3539-3550
- 12 Wang SH, Nayak DR, Guttery DS. et al. COVID-19 classification by CCSHNet with deep fusion using transfer learning and discriminant correlation analysis. Information Fusion 2021; 68: 131-148
- 13 Wang SH, Govindaraj VV, Górriz JM. et al. Covid-19 classification by FGCNet with deep feature fusion from graph convolutional network and convolutional neural network. Information Fusion 2021; 67: 208-229
- 14 Wang X, Deng X, Fu Q. et al. A Weakly-Supervised Framework for COVID-19 Classification and Lesion Localization from Chest CT. IEEE Trans Med Imaging 2020; 39: 2615-2625
- 15 Han Z, Wei B, Hong Y. et al. Accurate Screening of COVID-19 Using Attention-Based Deep 3D Multiple Instance Learning. IEEE Trans Med Imaging 2020; 39: 2584-2594
- 16 Lee EH, Zheng J, Colak E. et al. Deep COVID DeteCT: an international experience on COVID-19 lung detection and prognosis using chest CT. NPJ Digit Med 2021; 4: 11
- 17 Javaheri T, Homayounfar M, Amoozgar Z. et al. CovidCTNet: an open-source deep learning approach to diagnose COVID-19 using a small cohort of CT images. NPJ Digit Med 2021; 4: 29
- 18 Wang S, Zha Y, Li W. et al. A fully automatic deep learning system for COVID-19 diagnostic and prognostic analysis. European Respiratory Journal 2020; 56
- 19 Zhang K, Liu X, Shen J. et al. Clinically Applicable AI System for Accurate Diagnosis, Quantitative Measurements, and Prognosis of COVID-19 Pneumonia Using Computed Tomography. Cell 2020; 181: 1423-1433.e11
- 20 Gunraj H, Wang L, Wong A. COVIDNet-CT: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases From Chest CT Images. Front Med (Lausanne) 2020; 7
- 21 Gunraj H, Sabri A, Koff D. et al. COVID-Net CT-2: Enhanced Deep Neural Networks for Detection of COVID-19 From Chest CT Images Through Bigger, More Diverse Learning. Front Med (Lausanne) 2022; 8
- 22 Qi S, Xu C, Li C. et al. DR-MIL: deep represented multiple instance learning distinguishes COVID-19 from community-acquired pneumonia in CT images. Comput Methods Programs Biomed 2021; 211
- 23 Wu YH, Gao SH, Mei J. et al. JCS: An Explainable COVID-19 Diagnosis System by Joint Classification and Segmentation. IEEE Transactions on Image Processing 2021; 30: 3113-3126
- 24 Amyar A, Modzelewski R, Li H. et al. Multi-task deep learning based CT imaging analysis for COVID-19 pneumonia: Classification and segmentation. Comput Biol Med 2020; 126
- 25 Wang B, Jin S, Yan Q. et al. AI-assisted CT imaging analysis for COVID-19 screening: Building and deploying a medical AI system. Appl Soft Comput 2021; 98
- 26 Gao K, Su J, Jiang Z. et al. Dual-branch combination network (DCN): Towards accurate diagnosis and lesion segmentation of COVID-19 using CT images. Med Image Anal 2021; 67
- 27 Afshar P, Rafiee MJ, Naderkhani F. et al. Human-level COVID-19 diagnosis from low-dose CT scans using a two-stage time-distributed capsule network. Sci Rep 2022; 12: 4827
- 28 Qi Q, Qi S, Wu Y. et al. Fully automatic pipeline of convolutional neural networks and capsule networks to distinguish COVID-19 from community-acquired pneumonia via CT images. Comput Biol Med 2022; 141
- 29 Mei X, Lee HC, Diao Ky. et al. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19. Nat Med 2020; 26: 1224-1228
- 30 Hou J, Xu J, Jiang L. et al. Periphery-aware COVID-19 diagnosis with contrastive representation enhancement. Pattern Recognit 2021; 118
- 31 Jin YH, Cai L, Cheng ZS. et al. A rapid advice guideline for the diagnosis and treatment of 2019 novel coronavirus (2019-nCoV) infected pneumonia (standard version). Mil Med Res 2020; 7: 4
- 32 Meng F, Kottlors J, Shahzad R. et al. AI support for accurate and fast radiological diagnosis of COVID-19: an international multicenter, multivendor CT study. Eur Radiol 2022;
- 33 Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans Pattern Anal Mach Intell 2017; 39: 2481-2495
- 34 Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Lecture Notes in Computer Science. Springer; 2015: 234-241
- 35 Hofmanninger J, Prayer F, Pan J. et al. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur Radiol Exp 2020; 4: 50
- 36 He K, Zhang X, Ren S. et al. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). 2016
- 37 Deng J, Dong W, Socher R. et al. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). 2009
- 38 He K, Zhang X, Ren S. et al. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026–1034). 2015
- 39 Virtanen P, Gommers R, Oliphant TE. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 2020; 17: 261-272
- 40 Pedregosa F, Varoquaux G, Gramfort A. et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 2011; 12: 2825-2830
- 41 Robin X, Turck N, Hainard A. et al. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011; 12: 77
- 42 Hunter JD. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 2007; 9 (03) 90-95
- 43 Garg A, Salehi S, la Rocca M. et al. Efficient and visualizable convolutional neural networks for COVID-19 classification using Chest CT. Expert Syst Appl 2022; 195
- 44 Ardakani AA, Kanafi AR, Acharya UR. et al. Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks. Comput Biol Med 2020; 121
- 45 Bai HX, Hsieh B, Xiong Z. et al. Performance of Radiologists in Differentiating COVID-19 from Non-COVID-19 Viral Pneumonia at Chest CT. Radiology 2020; 296: E46-E54
- 46 Hochhegger B, Zanon M, Altmayer S. et al. COVID-19 mimics on chest CT: a pictorial review and radiologic guide. Br J Radiol 2021; 94
Correspondence
Publication History
Received: 02 August 2024
Accepted after revision: 17 February 2025
Article published online:
29 April 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/).
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany