DOI: 10.1055/a-2772-7798
Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine
Neue KI-Systeme zur Thoraxröntgen-Diagnostik: Qualitätsbewertung der Übereinstimmung mit ärztlichen Diagnosen im klinischen Alltag
Supported by: JF Senge and P Dlotko were supported by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research.
Abstract
Purpose
The rising demand for radiology services calls for innovative solutions to sustain diagnostic quality and efficiency. This study evaluated the diagnostic agreement between two commercially available artificial intelligence (AI) chest X-ray systems and human radiologists during routine clinical practice.
Materials and Methods
We retrospectively analyzed 279 chest X-rays (204 standing, 63 supine, 12 sitting) from a Swiss university hospital. Seven thoracic pathologies – cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema – were assessed. Radiologists’ routine reports were compared against Rayvolve (AZmed) and ChestView (Gleamer), both Paris, France. Python code, provided as an open-access supplement, was used to calculate performance metrics, agreement measures, and effect sizes.
Results
Agreement between radiologists and AI ranged from moderate to almost perfect: Human-AZmed (Gwet’s AC1: 0.47–0.72, moderate to substantial) and Human-Gleamer (Gwet’s AC1: 0.56–0.96, moderate to almost perfect). Balanced accuracies ranged from 0.67 to 0.85 for Human-AZmed and from 0.71 to 0.85 for Human-Gleamer, with peak performance for pleural effusion (0.85 for both systems). Specificity consistently exceeded sensitivity across pathologies (0.70–0.98 vs. 0.45–0.85). Common findings showed strong performance: pleural effusion (MCC 0.70–0.73), cardiomegaly (MCC 0.51), and consolidation (MCC 0.45–0.46). Rare pathologies demonstrated lower agreement: mediastinal mass and nodules (MCC 0.23–0.31). Standing radiographs yielded superior agreement compared with supine studies. The two AI systems showed substantial inter-system agreement for consolidation and pleural effusion (balanced accuracy 0.81–0.84).
Conclusion
Both commercial AI chest X-ray systems demonstrated comparable performance to human radiologists for common thoracic pathologies, with no meaningful differences between platforms. Performance was strongest for standing radiographs but declined for rare findings and supine studies. Position-dependent variability and reduced sensitivity for uncommon pathologies underscore the continued need for human oversight in clinical practice.
Key Points
-
AI systems matched radiologists for common chest X-ray findings.
-
Standing radiographs achieved the highest diagnostic agreement.
-
Rare pathologies showed weaker AI-human agreement.
-
Supine studies reduced diagnostic performance.
-
Human oversight remains essential in clinical practice.
Citation Format
-
Bosbach WA, Schoeni L, Senge JF et al. Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine. Rofo 2025; DOI 10.1055/a-2778-3892
Zusammenfassung
Ziel
Die steigende Nachfrage nach radiologischen Untersuchungen erfordert innovative Lösungen zur Aufrechterhaltung der diagnostischen Qualität und Effizienz. Diese Studie bewertete die diagnostische Übereinstimmung zwischen zwei kommerziell verfügbaren KI-Systemen für Thoraxröntgenaufnahmen und Radiologen im klinischen Alltag.
Materialien und Methoden
Wir analysierten retrospektiv 279 Thoraxröntgenaufnahmen (204 stehend, 63 liegend, 12 sitzend) eines Schweizer Universitätsspitals. Sieben thorakale Pathologien wurden bewertet: Kardiomegalie, Konsolidierung, Mediastinaltumor, Rundherd, Pleuraerguss, Pneumothorax und Lungenödem. Die Routinebefunde der Radiologen wurden mit Rayvolve (AZmed) und ChestView (Gleamer, beide aus Paris, Frankreich) verglichen. Ein Python-Code, als Open-Access-Supplement bereitgestellt, berechnete Leistungsmetriken, Übereinstimmungsmaße und Effektstärkenquantifizierung.
Ergebnisse
Die Übereinstimmung zwischen Radiologen und KI reichte von moderat bis fast perfekt: Mensch-AZmed (Gwet’s AC1: 0,47–0,72, moderat bis substanziell) und Mensch-Gleamer (Gwet’s AC1: 0,56–0,96, moderat bis fast perfekt). Die balancierte Genauigkeit lag zwischen 0,67–0,85 für Mensch-AZmed und 0,71–0,85 für Mensch-Gleamer, mit Höchstleistung bei Pleuraerguss (0,85 beide Systeme). Die Spezifität übertraf durchgehend die Sensitivität bei allen Pathologien (0,70–0,98 vs. 0,45–0,85). Häufige Befunde zeigten starke Leistung: Pleuraerguss (MCC 0,70–0,73), Kardiomegalie (MCC 0,51) und Konsolidierung (MCC 0,45–0,46). Seltene Pathologien demonstrierten geringere Übereinstimmung: Mediastinaltumor und Rundherde (MCC 0,23–0,31). Stehende Röntgenaufnahmen erzielten bessere Übereinstimmung als Aufnahmen in Rückenlage. Die beiden KI-Systeme zeigten substanzielle Übereinstimmung untereinander bei Konsolidierung und Pleuraerguss (balancierte Genauigkeit 0,81–0,84).
Schlussfolgerung
Beide kommerziellen KI-Systeme für Thoraxröntgen zeigten vergleichbare Leistung zu Radiologen bei häufigen thorakalen Pathologien, ohne bedeutsame Unterschiede zwischen den Plattformen. Die Leistung war bei stehenden Aufnahmen am stärksten, nahm jedoch bei seltenen Befunden und Aufnahmen in Rückenlage ab. Lageabhängige Variabilität und reduzierte Sensitivität für seltene Pathologien unterstreichen die anhaltende Notwendigkeit ärztlicher Supervision in der klinischen Praxis.
Kernaussagen
-
KI-Systeme entsprachen Radiologen bei häufigen Thoraxröntgen-Befunden.
-
Stehende Aufnahmen erzielten die höchste diagnostische Übereinstimmung.
-
Seltene Pathologien zeigten schwächere KI-Mensch-Übereinstimmung.
-
Liegende Aufnahmen reduzierten die diagnostische Leistung.
-
Ärztliche Supervision bleibt in der klinischen Praxis unerlässlich.
Keywords
Chest X-ray - Deep Learning - Multi-label Classification - Explainability - Medical Imaging
Introduction
The demand for clinical radiology services is predicted to grow substantially in the future. According to certain scenarios, future demand could potentially outpace available capacities in radiology [1]. Novel artificial intelligence (AI) software solutions might provide valuable support and assist the human radiologist, increasing patient throughput while maintaining or even improving diagnostic quality. Applications of AI are thought to be possible in, for instance, report drafting by large language models (LLM) [2] [3] [4], recommendation of appropriate interventional procedures [5], assessment of intramuscular fat fraction [6], or in the reconstruction of undersampled magnetic resonance imaging (MRI) data [7]. One of the largest fields of potential AI application in radiology is that of pattern recognition, for example, for lesion labeling [8] [9]. The competence to reliably recognize or create patterns is a vital requirement, regardless of whether one is working with text data or image data. Despite promising recent developments, it has been reported that the potential offered by novel AI systems is not unlimited. This appears to be true, for example, for quantification of radiation dose in computed tomography [10] or optimization of acquisition time in MRI reconstruction [11] [12]. The impact of hallucinations on clinical data is important to consider in this context [13].
Chest X-ray, although basic and long established, is of particular importance in clinical radiology due to its low cost, low radiation, and widespread availability. The sheer case volume makes chest X-ray a promising target for automation attempts [14]. There is already research on chest X-ray report automation [15] [16]. Recently, commercial software providers have started to offer chest X-ray analysis tools. One prominent example is Rayvolve for Chest (manufactured by AZmed, Paris, France). This AI tool, designed to detect chest pathologies in X-rays, has been tested before and was found to improve the speed and performance of human radiologists [17]. Studies are also available on the AZmed sister tool for fracture detection [18] [19] [20]. The AZmed system for chest X-ray consists of an ensemble approach that combines five RetinaNet-based object-detection models sharing a common VGG-16 backbone architecture [21] [22]. Another chest X-ray tool now available commercially is ChestView (manufactured by Gleamer, Paris, France). Gleamer has been tested before for fracture detection [23] and for chest X-rays. In the area of chest X-ray, Gleamer has been shown to reduce the time needed by human radiologists to complete study interpretations and to increase the sensitivity of human radiologists [24] [25]. For its procedures, Gleamer relies on a deep convolutional neural network, namely the object-detection framework Detectron2 [26].
In this study, we assess the output of the two chest X-ray AI systems mentioned above. We compare their results – AI to AI – and against diagnoses by human radiologists made during non-blinded clinical practice, reflecting the intended use case of the software providers (see [Table 1] for the list of diagnoses; the assessments follow the non-blinded work published in three studies [17] [24] [25]). To the best of our knowledge, this is the first study to compare both systems while operating in parallel on a shared set of chest X-ray studies.
Materials and Methods
The following section describes the three reporting entities, the data set (n = 279, [Fig. 1], [Table 2]), and statistical evaluation methods. The study’s raw data and the study’s Python code are included as open access supplements (S 1 raw input data, S 2 Python source code).
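As a minimal sketch of how the supplementary raw data can be turned into the pairwise contingency tables used below, the following lines illustrate one possible workflow; the file name and column names are hypothetical placeholders and do not reproduce the layout of supplement S 1 or the code of supplement S 2.

```python
import pandas as pd

# Illustrative only: the file name and column names are assumptions,
# not the actual layout of supplement S 1.
df = pd.read_csv("S1_raw_input_data.csv")

def contingency_table(data: pd.DataFrame, rater_a: str, rater_b: str) -> pd.DataFrame:
    """Pairwise 2x2 contingency table of binary ratings for one diagnosis."""
    return pd.crosstab(data[rater_a], data[rater_b])

# Example: Human vs. AZmed for pleural effusion (hypothetical column names)
print(contingency_table(df, "human_pleural_effusion", "azmed_pleural_effusion"))
```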


Raters
This present study compares three reporting entities for chest X-ray diagnostics:
-
Reports from human radiologists during routine clinical practice,
-
Rayvolve for Chest (AZmed, Paris, France),
-
ChestView (Gleamer, Paris, France).
Human radiologists wrote their reports as part of their everyday clinical routine. Standard procedure was for residents to draft reports, which were then reviewed and finalized by consultants. For this study, all human ratings were extracted directly from the finalized radiology report texts. The two software applications were running in the background and automatically deposited their assessments in the picture archiving and communication system (PACS). Radiologists had been made aware of the software trials and that the software had pre-approved status. Although radiologists could access the AI assessments from Rayvolve and ChestView through the PACS (similar to the study protocol found in the publications referenced [17] [24] [25]), they maintained full responsibility for writing their official clinical reports with all associated legal liabilities. The radiologists maintained their independent diagnostic judgment and remained professionally accountable for their interpretations and conclusions, regardless of whether they chose to consult the AI-generated assessments during their workflow. The study measured Human-AI agreement and disagreement for non-blinded human radiologists. This type of non-blinded measurement has been reported on previously for both AZmed [17] and Gleamer [24] [25], and the setup reflects the intended real-world use case for such AI systems operating alongside routine clinical workflows.
In total, seven pathologies are included in this study: cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema. Human radiologists assessed all seven. AZmed reported four of the pathologies, and Gleamer reported five ([Table 1], supplement S 1). For the diagnoses listed in [Table 1], AZmed reports a probability estimate on a discrete three-point scale: [no, low, high], and Gleamer uses: [no, doubt, yes]. To enable comparability with the variable expression of certainty in human-written reports [27], we transformed all ratings onto a binary scale [0, 1] for this study, where 0 indicates that a diagnosis is negative and 1 indicates that the reported probability for that diagnosis is anything greater than 0.
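A minimal sketch of this binarization, assuming a simple string encoding of the vendor outputs (the actual encoding in supplement S 1 may differ):

```python
# Collapse the vendors' three-point probability scales onto the binary
# scale [0, 1]: any probability level above "no" is counted as positive.
AZMED_MAP = {"no": 0, "low": 1, "high": 1}
GLEAMER_MAP = {"no": 0, "doubt": 1, "yes": 1}

def binarize(rating: str, mapping: dict) -> int:
    """Map a three-point vendor rating onto the binary scale [0, 1]."""
    return mapping[rating.strip().lower()]

print(binarize("low", AZMED_MAP))      # -> 1
print(binarize("doubt", GLEAMER_MAP))  # -> 1
print(binarize("no", GLEAMER_MAP))     # -> 0
```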
Sample set
Chest X-ray images were acquired from patients attending a University Hospital in Switzerland. Standing studies each include a postero-anterior (PA) projection as well as a second, lateral (lat) projection. Supine studies were acquired in antero-posterior projection. Images were acquired for routine assessments (e.g. pneumonia follow-ups, port or pacemaker localization), intensive care unit (ICU) imaging, and accidents and emergencies (A&E) referrals (e.g. for chest pain or chest trauma). Starting on March 10, 2024, 300 consecutive studies were included chronologically; after excluding 21 studies due to partially incomplete human reports (n = 20) and mixed imaging positions (n = 1), 279 studies remained for analysis. The age distribution is provided in [Fig. 1] and [Table 2]. For anonymization, no further patient characteristics were reported.
Statistical evaluation methods
For the comparison between raters, we used Python code to calculate the evaluation (supplement S 2, [Table 3], [Table 4]). To analyze reader results, pairwise 2×2 contingency tables (Human-AZmed, Human-Gleamer, AZmed-Gleamer) were generated for each diagnosis. Without an independent ground truth, the interpretation of the study results had to consider agreement with the human clinical report, which itself is an imperfect reference standard. For agreement analysis, sensitivity, specificity, and human prevalence were calculated with Wilson score 95% confidence intervals using the statsmodels proportion_confint function [28]. Balanced accuracy and the Matthews correlation coefficient (MCC) were computed with the corresponding scikit-learn routines [29] [30], each combined with a 5,000-fold bootstrap confidence interval. For bootstrap samples containing only one outcome class, balanced accuracy was set to 0.5 (chance-level performance) and MCC to 0 (no correlation) to ensure numerical stability and to avoid undefined values.
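The per-diagnosis metrics described above could be computed along the following lines, using the cited statsmodels and scikit-learn routines; this is a condensed sketch with illustrative variable names, and supplement S 2 remains the authoritative implementation.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

rng = np.random.default_rng(42)

def sens_spec_with_wilson(y_human, y_ai, alpha=0.05):
    """Sensitivity and specificity of the AI rating against the human report,
    each with a Wilson score 95% confidence interval."""
    y_human, y_ai = np.asarray(y_human), np.asarray(y_ai)
    tp = int(np.sum((y_human == 1) & (y_ai == 1)))
    tn = int(np.sum((y_human == 0) & (y_ai == 0)))
    pos, neg = int(np.sum(y_human == 1)), int(np.sum(y_human == 0))
    sens = tp / pos, proportion_confint(tp, pos, alpha=alpha, method="wilson")
    spec = tn / neg, proportion_confint(tn, neg, alpha=alpha, method="wilson")
    return sens, spec

def bootstrap_bacc_mcc(y_human, y_ai, n_boot=5000):
    """Balanced accuracy and MCC with 5,000-fold bootstrap 95% CIs; degenerate
    resamples containing only one class fall back to 0.5 and 0, respectively."""
    y_human, y_ai = np.asarray(y_human), np.asarray(y_ai)
    bacc, mcc = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_human), len(y_human))
        h, a = y_human[idx], y_ai[idx]
        if np.unique(h).size < 2:        # single-class bootstrap sample
            bacc.append(0.5)
            mcc.append(0.0)
        else:
            bacc.append(balanced_accuracy_score(h, a))
            mcc.append(matthews_corrcoef(h, a))
    point = balanced_accuracy_score(y_human, y_ai), matthews_corrcoef(y_human, y_ai)
    ci = np.percentile(bacc, [2.5, 97.5]), np.percentile(mcc, [2.5, 97.5])
    return point, ci
```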
Kappa statistic | Strength of agreement
< 0.00 | Poor
0.00–0.20 | Slight
0.21–0.40 | Fair
0.41–0.60 | Moderate
0.61–0.80 | Substantial
0.81–1.00 | Almost Perfect
1.00 | Perfect
The observed agreement per diagnosis was calculated as the ratio of matching ratings on the main diagonal of the 2×2 contingency table relative to all cases. Wilson score 95% confidence intervals were computed using the statsmodels proportion_confint function [28]. Interrater reliability (IRR) was tested by application of the interrater reliability Chance-corrected Agreement Coefficients (irrCAC) package [32]. irrCAC allows the extraction of IRR variables such as Gwet’s AC1, which is known to be advantageous for imbalanced data sets [33]. In addition, irrCAC can provide the corresponding p-value, allowing researchers to test the Null Hypothesis (H0) that agreement does not exceed what would be expected purely by chance. Gwet’s AC1 is defined for the interval [–1, 1]. Landis and Koch have defined a table for its interpretation, see [Table 3] [31].
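For a binary two-rater comparison, Gwet’s AC1 can also be computed directly from the 2×2 contingency table; the following stand-alone sketch mirrors what the irrCAC package reports (without its p-value machinery) and is provided for illustration only.

```python
import numpy as np

def gwet_ac1(table):
    """Gwet's AC1 for two raters and two categories, computed from a 2x2
    contingency table [[both_negative, r1_neg_r2_pos],
                       [r1_pos_r2_neg, both_positive]]."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    pa = np.trace(t) / n                              # observed agreement
    # prevalence of the positive category, averaged over both raters
    pi_pos = (t[1, :].sum() + t[:, 1].sum()) / (2.0 * n)
    # chance agreement under Gwet's model for two categories
    pe = 2.0 * pi_pos * (1.0 - pi_pos)
    return (pa - pe) / (1.0 - pe)

# Example with hypothetical counts (not study data)
print(round(gwet_ac1([[180, 20], [15, 64]]), 2))
```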
Systematic disagreement between paired raters was tested using McNemar’s exact test [34]. Since statistical significance testing can overemphasize clinical relevance, as p-values do not reflect the magnitude of an effect [35] [36] [37], we added measures of clinical effect size together with their 95% confidence intervals. Specifically, we calculated McNemar odds ratios (ORs), conditional ORs, risk differences (RD), likelihood ratios for positive (LR+) and negative (LR−) test results, and the number needed to diagnose (NND). Paired agreement between raters was assessed using McNemar odds ratios, with confidence intervals calculated from the standard error of the log odds ratio. Marginal odds ratios were also computed, applying a continuity correction of 0.5 to all cells when any cell contained zero counts, and confidence intervals were derived on the logarithmic scale to quantify the association between raters’ classifications. The RD and its confidence interval were calculated from the discordant pairs of the contingency table using the standard error formula for paired proportions. LR+ and LR- with confidence intervals were calculated from the 2×2 contingency table, applying log-transformation for interval estimation. NND, defined as the reciprocal of the proportion of diagnostic disagreements, was calculated with its confidence interval derived from the Wilson score interval for the disagreement proportion, i.e. patients per misdiagnosis [28].
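A compact sketch of these paired-disagreement statistics, assuming a 2×2 table with rows as the human rating and columns as the AI rating; the exact continuity corrections and interval formulas of supplement S 2 are simplified here.

```python
from statsmodels.stats.contingency_tables import mcnemar

def paired_effect_sizes(table):
    """McNemar exact p-value and simple effect sizes from a paired 2x2 table
    [[a, b], [c, d]]: a = both positive, b = human+/AI-, c = human-/AI+,
    d = both negative."""
    (a, b), (c, d) = table
    n = a + b + c + d
    p_value = mcnemar([[a, b], [c, d]], exact=True).pvalue
    # McNemar odds ratio from discordant pairs, 0.5 correction if a cell is zero
    bb, cc = (b + 0.5, c + 0.5) if 0 in (b, c) else (b, c)
    mcnemar_or = bb / cc
    rd = (c - b) / n            # AI positive-call rate minus human positive-call rate
    sens = a / (a + b)          # AI sensitivity against the human reference
    spec = d / (c + d)          # AI specificity against the human reference
    lr_pos = sens / (1 - spec)  # likelihood ratio for a positive AI result
    lr_neg = (1 - sens) / spec  # likelihood ratio for a negative AI result
    nnd = n / (b + c)           # patients per disagreement (NND)
    return dict(p=p_value, OR=mcnemar_or, RD=rd, LRp=lr_pos, LRn=lr_neg, NND=nnd)

# Example with hypothetical counts (not study data)
print(paired_effect_sizes([[90, 12], [5, 172]]))
```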
P-values were reported exactly to two decimal places or, if very small, as thresholds (<alpha, <0.01, <0.001, <0.0001, <0.00001) to indicate increasing levels of statistical significance. Alpha was set to 0.05 for the present study. The LLMs GPT-5 [38], Claude Opus 4 [39], and DeepSeek-V3.1 [40] were used for Python code debugging and manuscript writing.
Results
In the following section, we present the sample set itself and the results obtained by pairs of raters, including both Human-AI and AI-AI comparisons.
Sample set
Each of the three raters assessed all of the 279 patients included in the present study, which allowed for paired testing. Of the 279 studies included, 204 were acquired in standing position, 63 in supine position, and 12 in sitting position. The median patient age was 66 years (interquartile range: 51–76 years, [Table 2]). The age distribution showed a peak in the 7th decade of life ([Fig. 2]), which is typical of the patient population at a university hospital.


Human-AI paired ratings, assessed overall
The prevalence of pathological findings as determined by human readers varied considerably. Common findings (>20%) included cardiomegaly (40%), pleural effusion (40%), and consolidation (35%). Pulmonary oedema (22%) occurred with intermediate frequency, while pneumothorax (5%), mediastinal mass (3%), and pulmonary nodule (3%) were rare.
Sensitivity varied across diagnoses and AI systems. In the Human-AZmed comparison, pleural effusion reached the highest sensitivity (0.85), followed by cardiomegaly (0.77) and consolidation (0.76). Pulmonary oedema showed lower sensitivity (0.45). In the Human-Gleamer comparison, sensitivity was 0.73 for pleural effusion, 0.71 for pulmonary nodule, 0.63 for consolidation, 0.57 for pneumothorax, and 0.50 for mediastinal mass. Specificity was generally higher than sensitivity. Human-AZmed specificity ranged from 0.70 for consolidation to 0.88 for pulmonary oedema. Human-Gleamer specificity was consistently high, with values from 0.83 for consolidation up to 0.98 for pneumothorax; mediastinal mass and pulmonary nodule both reached 0.91. Balanced accuracy reflected these findings. Human-AZmed values were 0.85 for pleural effusion, 0.76 for cardiomegaly, 0.73 for consolidation, and 0.67 for pulmonary oedema. Human-Gleamer values were 0.85 for pleural effusion, 0.81 for pulmonary nodule, 0.78 for pneumothorax, 0.73 for consolidation, and 0.71 for mediastinal mass.
Agreement measures showed pleural effusion as the strongest category, with MCC up to 0.70 (Human-AZmed) and 0.73 (Human-Gleamer). In contrast, rare pathologies such as mediastinal mass demonstrated substantially lower agreement. Overall agreement ranged from 0.72 to 0.86 for Human-AZmed, and 0.76 to 0.96 for Human-Gleamer. Gwet’s AC1 coefficients indicated moderate to substantial agreement for Human-AZmed (0.47–0.72) and moderate to almost perfect agreement for Human-Gleamer (0.56–0.96); all Gwet’s AC1 measures were statistically significant, i.e. observed agreement exceeded what would have been expected purely by chance.
McNemar’s test indicated significant differences for consolidation in the Human-AZmed comparison (p < 0.001); for mediastinal mass (p < 0.001), pulmonary nodule (p < 0.00001), and pleural effusion (p < 0.0001) in the Human-Gleamer comparison. No significant differences were observed for cardiomegaly, pleural effusion (Human-AZmed), pulmonary oedema, consolidation (Human-Gleamer), or pneumothorax.
The McNemar OR took values both greater than and less than 1 for Human-AZmed and Human-Gleamer, depending on the diagnosis. The strongest likelihood for humans to miss or undercall a diagnosis was seen for mediastinal masses and nodules compared with Gleamer (McNemar OR 0.17 and 0.08). Marginal odds ratios demonstrated very strong positive associations, particularly for pleural effusion and pneumothorax (both Human-Gleamer), i.e. having a positive human label was positively associated with being predicted positive by the AI systems.
Compared with humans, AZmed tended to assign fewer positive labels for cardiomegaly (RD -0.06, 95% CI -0.11 to 0.00) and consolidation (RD -0.11, 95% CI -0.17 to -0.05), while the differences were minimal for pleural effusion (RD -0.02, 95% CI -0.07 to 0.02) and pulmonary oedema (RD 0.03, 95% CI -0.03 to 0.08). Analysis of RDs showed that Gleamer had a higher positive call rate for pleural effusion (RD 0.09, 95% CI 0.05 to 0.13), whereas humans more frequently classified mediastinal masses (RD -0.07, 95% CI -0.11 to -0.04) and nodules (RD -0.08, 95% CI -0.12 to -0.05) as negative. Likelihood ratios further supported these findings; positive Gleamer results were strongly corroborated, particularly for pleural effusion (LR+ 18.25, LR− 0.28) and pneumothorax (LR+ 28.50, LR− 0.44), with other variables showing moderate support (LR+ 3.7–7.9, LR− 0.32–0.55). AZmed demonstrated moderate support for positive findings across all variables (LR+ 2.5–6.1, LR− 0.17–0.62), most notably for pleural effusion (LR+ 6.07, LR− 0.17). Overall, these results indicate that both AI systems reliably identify key pathologies, with Gleamer showing particularly strong performance for pleural effusion and pneumothorax.
The NND was lowest for common pathologies, including cardiomegaly (NND 5, 95% CI 4–6) and consolidation (AZmed 4, 95% CI 4–5; Gleamer 5, 95% CI 4–6); NND was higher for less frequent findings such as mediastinal masses (10, 95% CI 8–15) and pleural effusion (AZmed 7, 95% CI 6–10; Gleamer 8, 95% CI 6–11). Pneumothorax had the highest NND (28, 95% CI 16–52), reflecting the rarity of this finding despite a generally high level of agreement between AI and human readers.
Human-AI paired ratings, split by positions
In standing radiographs (n=204), human prevalence was lower than overall for cardiomegaly (26% vs 40%), consolidation (30% vs 35%), and pleural effusion (34% vs 40%). Compared with overall data, AZmed showed higher sensitivity for cardiomegaly (0.91 vs 0.77) and pleural effusion (0.91 vs 0.85), lower sensitivity for pulmonary oedema (0.32 vs 0.45), and balanced accuracy reflected these trends. Likelihood ratios supported strong diagnostic value for pleural effusion (Human-AZmed, LR+ 7.00, LR− 0.10). Gleamer showed lower sensitivity for pleural effusion (0.77) but higher specificity (0.97) compared to AZmed, with strong likelihood ratios (LR+ 25.67, LR− 0.24). NND was higher for Gleamer (5–26 vs 4–9 AZmed). Overall, standing radiographs highlighted increased AZmed sensitivity for common findings and higher Gleamer specificity for rarer pathologies.
In supine radiographs (n=63), human prevalence was higher than overall for cardiomegaly (81% vs 40%) and moderate for consolidation (49% vs 35%) and pleural effusion (56% vs 40%). AZmed sensitivity was higher for consolidation (0.94 vs 0.76) but lower for cardiomegaly (0.59 vs 0.77), while pleural effusion and pulmonary oedema were moderate (0.71 and 0.62). Gleamer showed moderate sensitivity for consolidation (0.71) and pleural effusion (0.63) with high specificity (0.62–1.00). Balanced accuracy was slightly lower than overall, and McNemar ORs did not exhibit a clear trend compared to overall data. NND ranged from 3–32, reflecting limited incremental diagnostic gain due to higher baseline prevalence.
In sitting radiographs (n=12), human prevalence was intermediate for cardiomegaly and pleural effusion (58%) and lower for consolidation (42%) and pulmonary oedema (8%). AZmed sensitivity was high for cardiomegaly and pulmonary oedema (both 1.00) and pleural effusion (0.86), moderate for consolidation (0.80). Gleamer showed, compared to the overall data, high sensitivity for consolidation (0.80) and pleural effusion (0.86) and high specificity (0.75–1.00).
AI-AI comparison
Across all positions, AZmed and Gleamer showed substantial agreement for consolidation and pleural effusion (balanced accuracy 0.81–0.84, MCC 0.65–0.73). For standing scans (n=204), agreement was strong for consolidation (0.85) and pleural effusion (0.87), with McNemar ORs indicating more AZmed-positive than Gleamer-positive detections (consolidation 9.0, pleural effusion 25.0). For scans in supine position (n=63), agreement remained high for consolidation (0.73) and pleural effusion (0.81), with conditional ORs suggesting raters’ classifications to be positively associated (11.1–23.3). For sitting patients (n=12), both systems showed high to perfect agreement for consolidation and pleural effusion (0.75–1.00). Overall, AZmed and Gleamer exhibited strong agreement with regard to consolidation and pleural effusion across positions.
Discussion
In this study, we performed a direct, real-world clinical comparison of two commercially available AI-based chest X-ray tools – Rayvolve (AZmed) and ChestView (Gleamer) – against human radiologists using a cohort of 279 studies and seven key thoracic pathologies. Our findings provide insights into the performance, limitations, and potential clinical integration of AI systems in radiology and are contextualized by the growing body of evidence from similar commercial systems. This evaluation reflects the intended real-world deployment of such AI tools as decision-support systems running alongside routine radiological practice. While non-blinded, our study adds relevant insights about the performance of AI systems, similar to previously non-blinded published work in three studies [17] [24] [25].
Interpretation of study findings
Both AI systems demonstrated strong overall concordance when compared with human radiologists, for example, with strong balanced accuracy, agreement, and IRR (Gwet’s AC1 up to 0.96, pneumothorax). These results are in line with prior work showing that modern AI algorithms can achieve clinical agreement comparable to trained radiologists for common chest X-ray findings [41] [42]. The high level of agreement for common pathologies like pleural effusion and consolidation supports the potential utility of AI for triage and second-read scenarios, reinforcing its role as a complementary tool. Our findings are strongly supported by previous studies on these specific systems. Bennani et al. (2023) demonstrated that the Gleamer AI (ChestView) improved radiologists’ sensitivity across all expertise levels, with absolute increases of up to 26% for pneumothorax and 14% for consolidation, while also reducing reading times by 31% [24]. In a separate follow-up study by Selvam et al. (2025) on emergency chest X-ray, the Gleamer AI system improved sensitivity for consolidation, pleural effusion, and nodules [25]. Similarly, a multi-reader, multi-case study by Bettinger et al. (2024) on the AZmed system (Rayvolve) reported that AI assistance led to a significant 16% increase in the area under the curve (AUC), an 11% boost in sensitivity, and a 36% reduction in interpretation time [17]. The current study expands on this evidence by providing a direct head-to-head comparison of both commercially implemented AI systems operating in parallel within a routine clinical workflow.
Across most pathologies and in the absence of an independent ground truth, both AI systems in our study exhibited higher specificity than sensitivity when compared with human readers. This conservative detection pattern, which prioritizes reducing false positives, is clinically valuable in high-pressure settings such as emergency departments or ICUs, where overcalling may lead to unnecessary and invasive follow-up studies [43].
A key finding of our study was that patient positioning substantially influenced AI-Human agreement. Supine radiographs tended to show the greatest discrepancies. This aligns with prior literature indicating that supine imaging – common in ICU or trauma settings – poses inherent challenges due to altered anatomical projections, magnification, and overlapping structures [14] [44]. These findings highlight the critical importance of taking patient positioning into account when implementing AI in clinical workflows, and they suggest a potential value for position-specific algorithm training to improve generalizability.
The overall data set contained a strong imbalance towards negative assessments, which is a recurring attribute in medical imaging [33]. This imbalance was particularly pronounced for rare findings such as mediastinal mass (human prevalence 3%), nodule (3%), and pneumothorax (5%). Conversely, higher positive rates were observed for cardiomegaly, consolidation, and pleural effusion, especially in supine studies. This is likely because supine positioning naturally increases the width of the heart’s silhouette, making the distinction between physiological and pathological enlargement more challenging and subjective.
Clinical implications
Overall, our findings are consistent with the broader literature [17] [24] [25] and they suggest that those commercially available AI systems can reliably support radiologists in routine chest X-ray interpretation. The proven benefits in clinical agreement and efficiency gains support their use for common pathologies and standard projections. However, our results also clearly show that AI performance is not infallible; it varies with patient positioning and pathology prevalence. This highlights areas where human oversight remains essential, particularly for complex cases, rare findings, and non-standard projections [14].
Limitations
Several limitations of this study should be acknowledged. First, the data set was limited to a single university hospital, which may reduce generalizability to other institutions or patient populations. Second, the number of supine and in particular sitting radiographs was relatively small, limiting the statistical power to assess AI-Human agreement in these positions robustly. Third, rare pathologies were underrepresented, which is a common challenge in AI imaging studies but can nevertheless lead to less reliable performance metrics for these conditions. Fourth, both AI systems were evaluated in a “black box” manner without insight into their specific decision-making processes, which can limit interpretability and clinical trust. Fifth, gender distribution data were not captured due to institutional privacy protocols, limiting demographic characterization of our cohort.
Finally, while radiologists maintained full legal responsibility, their real-time access to AI outputs via PACS could have subtly influenced their reporting behavior, potentially introducing bias into the comparison. No ground truth was available, for example, through additional imaging or biopsy. Instead, human labels served as reference for the AI evaluation. This might, as in previous studies [17] [24] [25], introduce mutual reinforcement; however, concordance or agreement must not be misinterpreted as accuracy.
Future studies should aim for multicenter designs, larger data sets enriched for rare conditions, and prospective assessment with AI feedback blinded to the human readers, in order to evaluate true performance and integration more rigorously.
Conclusions
This study demonstrates that commercially available AI chest X-ray tools achieve a high level of agreement with human radiologists in real-world practice, especially for common pathologies and standing-position radiographs. Specificity tends to exceed sensitivity, emphasizing conservative detection strategies. Patient positioning and low-prevalence pathologies remain key challenges, underscoring the importance of careful implementation and continued human oversight in clinical workflows.
Previous work has already revealed that either of the two AI systems can be a valuable support for clinical radiologists, increasing performance and speed [17] [24] [25]. Our findings reinforce this conclusion. However, the observed discrepancies with human diagnoses highlight the continued need for human oversight in clinical decision-making. In this context, running multiple AI systems in parallel and considering their consensus could further enhance diagnostic reliability. A more in-depth understanding of raters’ diagnostic processes could also be attained by integrating not only diagnoses but also model reasoning into the statistical analyses, for example, by using class activation maps. Two raters might give the identical diagnosis for two different lesions, or two raters might provide a different diagnosis when labeling the identical lesion.
Once these AI systems begin to be used widely in clinical routine, they will start to fulfill the prediction made at the Dartmouth AI workshop [45]. However, a dramatic deskilling of radiologists due to AI applications [46] seems to be a concern for the distant future rather than the immediate one.
Supplements
S 1: collected raw data (n = 279) with diagnoses per rater (.xlsx, as well as .csv)
S 2: statistical evaluation code (.py)
The study's raw data and Python code are deposited permanently on figshare under DOI: 10.6084/m9.figshare.28692659.
Ethics declaration
The study received an ethics waiver (Req-2025-00216) from the Cantonal Ethics Committee Bern (Kantonale Ethikkommission für die Forschung, Gesundheits-, Sozial- und Integrationsdirektion), dated 17 February 2025. All experiments were conducted in accordance with relevant guidelines and regulations, including the Declaration of Helsinki. Informed consent was obtained from all participants and/or their legal guardians prior to study participation.
Conflict of Interest
The authors declare that they have no conflict of interest.
References
- 1 Tang A, Tam R, Cadrin-Chênevert A. et al. Canadian Association of Radiologists White Paper on Artificial Intelligence in Radiology. Can Assoc Radiol J 2018; 69: 120-35
- 2 Bosbach WA, Senge JF, Nemeth B. et al. Ability of ChatGPT to generate competent radiology reports for distal radius fracture by use of RSNA template items and integrated AO classifier. Curr Probl Diagn Radiol 2023; 53: 102-10
- 3 Senge JF, Mc Murray MT, Haupt F. et al. ChatGPT may free time needed by the interventional radiologist for administration/documentation: A study on the RSNA PICC line reporting template. Swiss J Radiol Nucl Med 2024; 7: 1-14
- 4 Bosbach WA, Clement C, Strunz F. et al. Automation of 99mTc Mercaptoacetyltriglycine (MAG3) Report Writing Using a Vision Language Model. EJNMMI Res 2025; 15: 142
- 5 Barat M, Soyer P, Dohan A. Appropriateness of Recommendations Provided by ChatGPT to Interventional Radiologists. Can Assoc Radiol J 2023; 1-6
- 6 Ramedani S, Ramedani M, Tengg-Kobligk Von H. et al. A Deep Learning-based Fully Automated Approach for Body Composition Analysis in 3D Whole Body Dixon MRI. 2023 IEEE 19th Int Conf Intell Comput Commun Process 2023; 287-292
- 7 Hammernik K, Klatzer T, Kobler E. et al. Learning a variational network for reconstruction of accelerated MRI data. Magn Reson Med 2018; 79: 3055-71
- 8 Bosbach WA, Schoeni L, Beisbart C. et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 in Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025;
- 9 Bosbach WA, Schoeni L, Beisbart C. et al. Open access supplement to the manuscript: Bosbach WA, Schoeni L, Beisbart C et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 in Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025; accepted for publicat. Figshare 2025;
- 10 Garni SN, Mertineit N, Nöldge G. et al. Regulatory Needs for Radiation Protection Devices based upon Artificial Intelligence – State task or leave unregulated?. Swiss J Radiol Nucl Med 2024; 5: 5
- 11 Bosbach WA, Merdes KC, Jung B. et al. Deep learning reconstruction of accelerated MRI: False positive cartilage delamination inserted in MRI arthrography under traction. Top Magn Reson Imaging 2024; 33: e0313
- 12 Bosbach WA, Merdes KC, Jung B. et al. Open access supplement to the publication: Bosbach WA et al. (2024). Deep learning reconstruction of accelerated MRI: False positive cartilage delamination inserted in MRI arthrography under traction. Topics in MRI. Figshare 2024;
- 13 Granstedt J, Kc P, Deshpande R. et al. Hallucinations in medical devices. ArXiv 2025;
- 14 De Lacey G, Morley S, Berman L. The Chest X-Ray – A Survival Guide. Cambridge (UK): 2008
- 15 Babar Z, van Laarhoven T, Zanzotto FM. et al. Evaluating diagnostic content of AI-generated radiology reports of chest X-rays. Artif Intell Med 2021; 116: 102075
- 16 Yu F, Endo M, Krishnan R. et al. Evaluating Progress in Automatic Chest X-Ray Radiology Report Generation. MedRxiv 2022;
- 17 Bettinger H, Lenczner G, Guigui J. et al. Evaluation of the Performance of an Artificial Intelligence (AI) Algorithm in Detecting Thoracic Pathologies on Chest Radiographs. Diagnostics 2024; 14
- 18 Gasmi I, Calinghen A, Parienti JJ. et al. Comparison of diagnostic performance of a deep learning algorithm, emergency physicians, junior radiologists and senior radiologists in the detection of appendicular fractures in children. Pediatr Radiol 2023; 53: 1675-84
- 19 Dupuis M, Delbos L, Veil R. et al. External validation of a commercially available deep learning algorithm for fracture detection in children. Diagn Interv Imaging 2022; 103: 151-159
- 20 Fu T, Viswanathan V, Attia A. et al. Assessing the Potential of a Deep Learning Tool to Improve Fracture Detection by Radiologists and Emergency Physicians on Extremity Radiographs. Acad Radiol 2023; 1-11
- 21 Lin TY, Goyal P, Girshick R. et al. Focal Loss for Dense Object Detection. IEEE Trans Pattern Anal Mach Intell 2020; 42: 318-27
- 22 Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 3rd Int Conf Learn Represent ICLR 2015 – Conf Track Proc 2015: 1-14
- 23 Regnard NE, Lanseur B, Ventre J. et al. Assessment of performances of a deep learning algorithm for the detection of limbs and pelvic fractures, dislocations, focal bone lesions, and elbow effusions on trauma X-rays. Eur J Radiol 2022; 154
- 24 Bennani S, Regnard NE, Ventre J. et al. Using AI to Improve Radiologist Performance in Detection of Abnormalities on Chest Radiographs. Radiology 2023; 309
- 25 Selvam S, Peyrony O, Elezi A. et al. Efficacy of a deep learning-based software for chest X-ray analysis in an emergency department. Diagn Interv Imaging 2025; 106: 299-311
- 26 Wu Y, Kirillov A, Massa F. et al. Detectron2 2019. Accessed September 14, 2025 at: https://github.com/facebookresearch/detectron2
- 27 Panicek DM, Hricak H. How sure are you, doctor? A standardized lexicon to describe the radiologists level of certainty. Am J Roentgenol 2016; 207: 2-3
- 28 statsmodels.stats.proportion.proportion_confint. Statsmodels 0150 (+710) 2025. Accessed September 08, 2025 at: https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportion_confint.html
- 29 balanced_accuracy_score. Scikit-Learn 172 Doc 2025. Accessed September 10, 2025 at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html
- 30 matthews_corrcoef. Scikit-Learn 172 Doc 2025. Accessed September 10, 2025 at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html
- 31 Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977; 33: 159-74
- 32 Gwet K, Fergadis A. irrCAC – Chance-corrected Agreement Coefficients 2023. Accessed September 3, 2025 at: https://irrcac.readthedocs.io/en/latest/usage/usage_raw_data.html
- 33 Wongpakaran N, Wongpakaran T, Wedding D. et al. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples Nahathai. BMC Med Res Methodol 2013; 13: 1-7
- 34 statsmodels.stats.contingency_tables.mcnemar. Statsmodels 0150 (+638) 2025. Accessed March 30, 2025 at: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html
- 35 Bartolucci AA, Tendera M, Howard G. Meta-analysis of multiple primary prevention trials of cardiovascular events using aspirin. Am J Cardiol 2011; 107: 1796-801
- 36 Cohen J. The earth is round (p<.05). Am Psychol 1994; 49: 997-1003
- 37 Sullivan GM, Feinn R. Using Effect Size—or Why the P Value Is Not Enough. J Grad Med Educ 2012; 4: 279-82
- 38 OpenAI Inc. GPT-5 2025. Accessed August 30, 2025 at: https://chatgpt.com/overview
- 39 Anthropic PBC. Claude Sonnet 4 [Large language model] 2025. Accessed August 05, 2025 at: https://www.anthropic.com
- 40 Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co. L. DeepSeek V3.1 2025. Accessed September 10, 2025 at: https://www.deepseek.com
- 41 Rajpurkar P, Irvin J, Zhu K. et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. ArXiv 2017;
- 42 Irvin J, Rajpurkar P, Ko M. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. 33rd AAAI Conf Artif Intell 2019: 590-597
- 43 Feng Y, Teh HS, Cai Y. Deep Learning for Chest Radiology: A Review. Curr Radiol Rep 2019; 7: 1-9
- 44 Gefter WB, Post BA, Hatabu H. Commonly Missed Findings on Chest Radiographs: Causes and Consequences. Chest 2023; 163: 650-61
- 45 McCarthy J, Minsky ML, Rochester N. et al. A Proposal For The Dartmouth Summer Research Project On Artificial Intelligence 1955: 1–13. Accessed October 30, 2021 at: http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf
- 46 Duran LDD. Deskilling of medical professionals: An unintended consequence of AI implementation?. G Di Filos 2021; 2
Correspondence
Publication History
Received: 09 April 2025
Accepted after revision: 11 December 2025
Article published online: 20 January 2026
© 2026. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany