Rofo
DOI: 10.1055/a-2772-7798
Technical Innovations

Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine

Neue KI-Systeme zur Thoraxröntgen-Diagnostik: Qualitätsbewertung der Übereinstimmung mit ärztlichen Diagnosen im klinischen Alltag

Authors

  • Wolfram A. Bosbach

    1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
    2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
  • Luca Schoeni

    1   Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
  • Jan Felix Senge

    3   Department of Mathematics and Computer Science, University of Bremen, Bremen, Germany
    4   Dioscuri Centre in Topological Data Analysis, Polish Academy of Sciences, Warsaw, Poland
  • Milena Mitrakovic

    2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
  • Marc-André Weber

    5   Institute of Diagnostic and Interventional Radiology, Pediatric Radiology and Neuroradiology, University Medical Center Rostock, Rostock, Germany
  • Pawel Dlotko

    4   Dioscuri Centre in Topological Data Analysis, Polish Academy of Sciences, Warsaw, Poland
  • Keivan Daneshvar

    2   Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland

Supported by: JF Senge and P Dlotko were supported by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research.
 

Abstract

Purpose

The rising demand for radiology services calls for innovative solutions to sustain diagnostic quality and efficiency. This study evaluated the diagnostic agreement between two commercially available artificial intelligence (AI) chest X-ray systems and human radiologists during routine clinical practice.

Materials and Methods

We retrospectively analyzed 279 chest X-rays (204 standing, 63 supine, 12 sitting) from a Swiss university hospital. Seven thoracic pathologies – cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema – were assessed. Radiologists’ routine reports were compared against Rayvolve (AZmed) and ChestView (Gleamer), both Paris, France. Python code, provided as an open-access supplement, was used to calculate performance metrics, agreement measures, and effect sizes.

Results

Agreement between radiologists and AI ranged from moderate to almost perfect: Human-AZmed (Gwet’s AC1: 0.47–0.72, moderate to substantial) and Human-Gleamer (Gwet’s AC1: 0.56–0.96, moderate to almost perfect). Balanced accuracies ranged from 0.67–0.85 for Human-AZmed and 0.71–0.85 for Human-Gleamer, with peak performance for pleural effusion (0.85 for both systems). Specificity consistently exceeded sensitivity across pathologies (0.70–0.98 vs. 0.45–0.85). Common findings showed strong performance: pleural effusion (MCC 0.70–0.73), cardiomegaly (MCC 0.51), and consolidation (MCC 0.45–0.46). Rare pathologies demonstrated lower agreement: mediastinal mass and nodules (MCC 0.23–0.31). Standing radiographs yielded superior agreement compared to supine studies. The two AI systems showed substantial inter-system agreement for consolidation and pleural effusion (balanced accuracy 0.81–0.84).

Conclusion

Both commercial AI chest X-ray systems demonstrated comparable performance to human radiologists for common thoracic pathologies, with no meaningful differences between platforms. Performance was strongest for standing radiographs but declined for rare findings and supine studies. Position-dependent variability and reduced sensitivity for uncommon pathologies underscore the continued need for human oversight in clinical practice.

Key Points

  • AI systems matched radiologists for common chest X-ray findings.

  • Standing radiographs achieved the highest diagnostic agreement.

  • Rare pathologies showed weaker AI-human agreement.

  • Supine studies reduced diagnostic performance.

  • Human oversight remains essential in clinical practice.

Citation Format

  • Bosbach WA, Schoeni L, Senge JF et al. Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine. Rofo 2025; DOI 10.1055/a-2778-3892


Zusammenfassung

Ziel

Die steigende Nachfrage nach radiologischen Untersuchungen erfordert innovative Lösungen zur Aufrechterhaltung der diagnostischen Qualität und Effizienz. Diese Studie bewertete die diagnostische Übereinstimmung zwischen zwei kommerziell verfügbaren KI-Systemen für Thoraxröntgenaufnahmen und Radiologen im klinischen Alltag.

Materialien und Methoden

Wir analysierten retrospektiv 279 Thoraxröntgenaufnahmen (204 stehend, 63 liegend, 12 sitzend) eines Schweizer Universitätsspitals. Sieben thorakale Pathologien wurden bewertet: Kardiomegalie, Konsolidierung, Mediastinaltumor, Rundherd, Pleuraerguss, Pneumothorax und Lungenödem. Die Routinebefunde der Radiologen wurden mit Rayvolve (AZmed) und ChestView (Gleamer, beide aus Paris, Frankreich) verglichen. Ein Python-Code, als Open-Access-Supplement bereitgestellt, berechnete Leistungsmetriken, Übereinstimmungsmaße und Effektstärkenquantifizierung.

Ergebnisse

Die Übereinstimmung zwischen Radiologen und KI reichte von moderat bis fast perfekt: Mensch-AZmed (Gwet’s AC1: 0,47–0,72, moderat bis substanziell) und Mensch-Gleamer (Gwet’s AC1: 0,56–0,96, moderat bis fast perfekt). Die balancierte Genauigkeit lag zwischen 0,67–0,85 für Mensch-AZmed und 0,71–0,85 für Mensch-Gleamer, mit Höchstleistung bei Pleuraerguss (0,85 beide Systeme). Die Spezifität übertraf durchgehend die Sensitivität bei allen Pathologien (0,70–0,98 vs. 0,45–0,85). Häufige Befunde zeigten starke Leistung: Pleuraerguss (MCC 0,70–0,73), Kardiomegalie (MCC 0,51) und Konsolidierung (MCC 0,45–0,46). Seltene Pathologien demonstrierten geringere Übereinstimmung: Mediastinaltumor und Rundherde (MCC 0,23–0,31). Stehende Röntgenaufnahmen erzielten bessere Übereinstimmung als Aufnahmen in Rückenlage. Die beiden KI-Systeme zeigten substanzielle Übereinstimmung untereinander bei Konsolidierung und Pleuraerguss (balancierte Genauigkeit 0,81–0,84).

Schlussfolgerung

Beide kommerziellen KI-Systeme für Thoraxröntgen zeigten vergleichbare Leistung zu Radiologen bei häufigen thorakalen Pathologien, ohne bedeutsame Unterschiede zwischen den Plattformen. Die Leistung war bei stehenden Aufnahmen am stärksten, nahm jedoch bei seltenen Befunden und Aufnahmen in Rückenlage ab. Lageabhängige Variabilität und reduzierte Sensitivität für seltene Pathologien unterstreichen die anhaltende Notwendigkeit ärztlicher Supervision in der klinischen Praxis.

Kernaussagen

  • KI-Systeme entsprachen Radiologen bei häufigen Thoraxröntgen-Befunden.

  • Stehende Aufnahmen erzielten die höchste diagnostische Übereinstimmung.

  • Seltene Pathologien zeigten schwächere KI-Mensch-Übereinstimmung.

  • Liegende Aufnahmen reduzierten die diagnostische Leistung.

  • Ärztliche Supervision bleibt in der klinischen Praxis unerlässlich.


Introduction

The demand for clinical radiology services is predicted to grow substantially in the future. According to certain scenarios, future demand could potentially outpace available capacities in radiology [1]. Novel artificial intelligence (AI) software solutions might provide valuable support and assist the human radiologist, increasing patient throughput while maintaining or even improving diagnostic quality. Applications of AI are thought to be possible in, for instance, report drafting by large language models (LLM) [2] [3] [4], recommendation of appropriate interventional procedures [5], assessment of intramuscular fat fraction [6], or in the reconstruction of undersampled magnetic resonance imaging (MRI) data [7]. One of the largest fields of potential AI application in radiology is that of pattern recognition, for example, for lesion labeling [8] [9]. The competence to reliably recognize or create patterns is a vital requirement, regardless of whether working with text data or image data. Despite promising recent developments, it has been reported that the potential offered by novel AI systems is not unlimited. This appears to be true, for example, for quantification of radiation dose in computed tomography [10] or optimization of acquisition time in MRI reconstruction [11] [12]. The impact of hallucinations on clinical data is important to consider in this context [13].

Chest X-ray, although basic and long established, is of particular importance in clinical radiology due to its low cost, low radiation dose, and widespread availability. The sheer volume of examinations makes chest X-ray a promising target for automation attempts [14], and research on chest X-ray report automation already exists [15] [16]. Recently, commercial software providers have started to offer chest X-ray analysis tools. One prominent example is Rayvolve for Chest (manufactured by AZmed, Paris, France). This AI tool, designed to detect chest pathologies in X-rays, has been tested before and was found to improve the speed and performance of human radiologists [17]. Studies are also available on the AZmed sister tool for fracture detection [18] [19] [20]. The AZmed system for chest X-ray uses an ensemble approach that combines five RetinaNet-based object-detection models sharing a common VGG-16 backbone architecture [21] [22]. Another commercially available chest X-ray tool is ChestView (manufactured by Gleamer, Paris, France). Gleamer has previously been tested for fracture detection [23] and on chest X-rays, where it has been shown to reduce the time human radiologists need to complete study interpretations and to increase their sensitivity [24] [25]. Gleamer relies on a deep convolutional neural network built on the object-detection framework Detectron2 [26].

In this study, we assess the output of the two chest X-ray AI systems mentioned above. We compare their results – AI to AI – and against diagnoses made by human radiologists during non-blinded clinical practice, reflecting the intended use case of the software providers (see [Table 1] for the list of diagnoses; the assessment follows the non-blinded design of three previously published studies [17] [24] [25]). To the best of our knowledge, this is the first study to compare both systems operating in parallel on a shared set of chest X-ray studies.

Table 1 Pathologies in alphabetical order reported by the human radiologist, AZmed, and Gleamer.

Pathology        | Human radiologist | AZmed        | Gleamer
Cardiomegaly     | reported          | reported     | not reported
Consolidation    | reported          | reported     | reported
Mediastinal Mass | reported          | not reported | reported
Nodule           | reported          | not reported | reported
Pleural Effusion | reported          | reported     | reported
Pneumothorax     | reported          | not reported | reported
Pulmonary Oedema | reported          | reported     | not reported


Materials and Methods

The following section describes the three reporting entities, the data set (n = 279, [Fig. 1], [Table 2]), and statistical evaluation methods. The study’s raw data and the study’s Python code are included as open access supplements (S 1 raw input data, S 2 Python source code).

Fig. 1 Study sample set, n = 279 studies included.

Table 2 Overview of included sample set.

Number of studies included               279
  • Standing studies (PA & lat)          204
  • Supine studies                       63
  • Sitting studies                      12
Patient age [yrs], median (Q1–Q3)        66 (51–76)

Raters

This present study compares three reporting entities for chest X-ray diagnostics:

  1. Reports from human radiologists during routine clinical practice,

  2. Rayvolve for Chest (AZmed, Paris, France),

  3. ChestView (Gleamer, Paris, France).

Human radiologists wrote their reports as part of their everyday clinical routine. Standard procedure was for residents to draft reports, which were then reviewed and finalized by consultants. For this study, all human ratings were extracted directly from the finalized radiology report texts. The two software applications ran in the background and automatically deposited their assessments in the picture archiving and communication system (PACS). Radiologists had been made aware of the software trials and of the software’s pre-approved status. Although radiologists could access the AI assessments from Rayvolve and ChestView through the PACS (similar to the study protocols in the referenced publications [17] [24] [25]), they maintained full responsibility for writing their official clinical reports, with all associated legal liabilities. The radiologists maintained their independent diagnostic judgment and remained professionally accountable for their interpretations and conclusions, regardless of whether they chose to consult the AI-generated assessments during their workflow. The study measured Human-AI agreement and disagreement for non-blinded human radiologists. This type of non-blinded measurement has been reported previously for both AZmed [17] and Gleamer [24] [25], and the setup reflects the intended real-world use case for such AI systems operating alongside routine clinical workflows.

In total, seven pathologies are included in this study: cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema. Human radiologists assessed all seven. AZmed reported four of the pathologies, and Gleamer reported five ([Table 1], supplement S 1). For the diagnoses listed in [Table 1], AZmed reports a probability estimate on a discrete three-point scale [no, low, high], and Gleamer uses [no, doubt, yes]. To enable comparability with the variable expression of certainty in human-written reports [27], we transformed all ratings into a binary scale [0, 1] for this study, where 0 means that a diagnosis was rated negative and 1 means that any probability above zero (low/high or doubt/yes) was reported.
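A minimal sketch of this binarization step is shown below; the column names and data frame layout are illustrative and do not necessarily match the supplementary data (S 1).

```python
# Minimal sketch of mapping the AI systems' three-point scales to the binary
# scale [0, 1]; column names are illustrative, not taken from supplement S 1.
import pandas as pd

AZMED_SCALE = {"no": 0, "low": 1, "high": 1}      # AZmed: [no, low, high]
GLEAMER_SCALE = {"no": 0, "doubt": 1, "yes": 1}   # Gleamer: [no, doubt, yes]

def binarize(ratings: pd.Series, mapping: dict) -> pd.Series:
    """Return 0 for a negative rating and 1 for any probability above 'no'."""
    return ratings.str.lower().map(mapping)

# Hypothetical per-study ratings for one pathology
df = pd.DataFrame({"azmed_consolidation": ["no", "low", "high"],
                   "gleamer_consolidation": ["no", "doubt", "yes"]})
df["azmed_bin"] = binarize(df["azmed_consolidation"], AZMED_SCALE)
df["gleamer_bin"] = binarize(df["gleamer_consolidation"], GLEAMER_SCALE)
print(df)
```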


Sample set

Chest X-ray images were acquired from patients attending a Swiss university hospital. Standing studies each comprise a postero-anterior (PA) projection and a second, lateral (lat) projection. Supine studies were acquired as antero-posterior projections. Images were acquired for routine assessments (e.g. pneumonia follow-up, port or pacemaker localization), intensive care unit (ICU) imaging, and accident and emergency (A&E) referrals (e.g. chest pain or chest trauma). Starting on March 10, 2024, 300 consecutive studies were included chronologically; after excluding 21 studies due to partially incomplete human reports (n = 20) or mixed imaging positions (n = 1), 279 studies remained for analysis. The age distribution is provided in [Fig. 1] and [Table 2]. For anonymization, no further patient characteristics were reported.


Statistical evaluation methods

For the comparison between raters, Python code was used to calculate the evaluation (supplement S 2, [Table 3], [Table 4]). To analyze reader results, pairwise 2×2 contingency tables (Human-AZmed, Human-Gleamer, AZmed-Gleamer) were generated for each diagnosis. Without an independent ground truth, the interpretation of the study results had to consider agreement with the human clinical report, which is itself an imperfect reference standard. For the agreement analysis, sensitivity, specificity, and human prevalence were calculated with Wilson score 95% confidence intervals using the statsmodels confint routine [28]. Balanced accuracy and the Matthews correlation coefficient (MCC) were added through the corresponding scikit-learn routines [29] [30], both combined with a 5,000-fold bootstrap confidence interval. For bootstrap samples containing only one outcome class, balanced accuracy was set to 0.5 (chance-level performance) and MCC to 0 (no correlation) to ensure numerical stability and to avoid undefined values.
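The following sketch illustrates these calculations with toy labels; it is not the supplement code (S 2) itself, but uses the named statsmodels and scikit-learn routines in the way described above.

```python
# Sketch of the per-diagnosis agreement metrics: Wilson CIs for sensitivity and
# specificity, plus bootstrapped balanced accuracy and MCC with the degenerate-
# class fallback described in the text (balanced accuracy 0.5, MCC 0).
import numpy as np
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef
from statsmodels.stats.proportion import proportion_confint

def wilson_ci(successes: int, total: int, alpha: float = 0.05):
    """Wilson score 95% confidence interval for a proportion."""
    return proportion_confint(successes, total, alpha=alpha, method="wilson")

def bootstrap_ci(human, ai, metric, n_boot=5000, seed=0):
    """Percentile bootstrap CI; single-class resamples fall back to chance level."""
    rng = np.random.default_rng(seed)
    n, stats = len(human), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        h, a = human[idx], ai[idx]
        if len(np.unique(h)) < 2:  # only one outcome class drawn in this resample
            stats.append(0.5 if metric is balanced_accuracy_score else 0.0)
        else:
            stats.append(metric(h, a))
    return np.percentile(stats, [2.5, 97.5])

human = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # toy binary labels from the human report
ai = np.array([1, 0, 0, 1, 0, 1, 1, 0])     # toy binary labels from an AI system
tp = int(((human == 1) & (ai == 1)).sum())
tn = int(((human == 0) & (ai == 0)).sum())
print("sensitivity CI:", wilson_ci(tp, int((human == 1).sum())))
print("specificity CI:", wilson_ci(tn, int((human == 0).sum())))
print("balanced acc:", balanced_accuracy_score(human, ai),
      bootstrap_ci(human, ai, balanced_accuracy_score))
print("MCC:", matthews_corrcoef(human, ai), bootstrap_ci(human, ai, matthews_corrcoef))
```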

Table 3 Interpretation of strength of agreement for kappa statistics used in the present study, adapted from [31].

Kappa statistic | Strength of agreement
< 0.00          | Poor
0.00–0.20       | Slight
0.21–0.40       | Fair
0.41–0.60       | Moderate
0.61–0.80       | Substantial
0.81–1.00       | Almost Perfect
1.00            | Perfect

Table 4 Performance of AI systems compared to human results.

Column key: H-AZ = Human-AZmed, H-GL = Human-Gleamer, AZ-GL = AZmed-Gleamer. Confusion matrices are given as a/b/c/d, where a = both raters positive, b = first rater positive and second rater negative, c = first rater negative and second rater positive, d = both raters negative (first rater: Human for H-AZ and H-GL, AZmed for AZ-GL; second rater: the AI system named second).

Position: all (n = 279 per variable)

Metric | H-AZ Cardiomegaly | H-AZ Consolidation | H-AZ Pleural Effusion | H-AZ Pulmonary Oedema | H-GL Consolidation | H-GL Mediastinal Mass | H-GL Nodule | H-GL Pleural Effusion | H-GL Pneumothorax | AZ-GL Consolidation | AZ-GL Pleural Effusion
confusion matrix (a/b/c/d) | 86/26/42/125 | 74/23/54/128 | 94/17/23/145 | 28/34/26/191 | 61/36/31/151 | 4/4/24/247 | 5/2/25/247 | 81/30/6/162 | 8/6/4/261 | 85/43/7/144 | 83/34/4/158
human prevalence | 0.40 (0.35–0.46) | 0.35 (0.29–0.41) | 0.40 (0.34–0.46) | 0.22 (0.18–0.27) | 0.35 (0.29–0.41) | 0.03 (0.01–0.06) | 0.03 (0.01–0.05) | 0.40 (0.34–0.46) | 0.05 (0.03–0.08) | 0.46 (0.40–0.52) | 0.42 (0.36–0.48)
sensitivity | 0.77 (0.68–0.84) | 0.76 (0.67–0.84) | 0.85 (0.77–0.90) | 0.45 (0.33–0.57) | 0.63 (0.53–0.72) | 0.50 (0.22–0.78) | 0.71 (0.36–0.92) | 0.73 (0.64–0.80) | 0.57 (0.33–0.79) | 0.66 (0.58–0.74) | 0.71 (0.62–0.78)
specificity | 0.75 (0.68–0.81) | 0.70 (0.63–0.76) | 0.86 (0.80–0.91) | 0.88 (0.83–0.92) | 0.83 (0.77–0.88) | 0.91 (0.87–0.94) | 0.91 (0.87–0.94) | 0.96 (0.92–0.98) | 0.98 (0.96–0.99) | 0.95 (0.91–0.98) | 0.98 (0.94–0.99)
balanced acc | 0.76 (0.70–0.81) | 0.73 (0.68–0.79) | 0.85 (0.81–0.90) | 0.67 (0.60–0.73) | 0.73 (0.67–0.78) | 0.71 (0.52–0.90) | 0.81 (0.62–0.96) | 0.85 (0.80–0.89) | 0.78 (0.64–0.91) | 0.81 (0.76–0.85) | 0.84 (0.80–0.89)
MCC | 0.51 (0.41–0.61) | 0.45 (0.34–0.54) | 0.70 (0.62–0.78) | 0.35 (0.22–0.48) | 0.46 (0.36–0.57) | 0.23 (0.03–0.42) | 0.31 (0.10–0.49) | 0.73 (0.65–0.81) | 0.60 (0.34–0.81) | 0.65 (0.57–0.73) | 0.73 (0.65–0.80)
agreement | 0.76 (0.70–0.80) | 0.72 (0.67–0.77) | 0.86 (0.81–0.89) | 0.78 (0.73–0.83) | 0.76 (0.71–0.81) | 0.90 (0.86–0.93) | 0.90 (0.86–0.93) | 0.87 (0.83–0.91) | 0.96 (0.94–0.98) | 0.82 (0.77–0.86) | 0.86 (0.82–0.90)
Gwet’s AC1 | 0.52 | 0.47 | 0.72 | 0.68 | 0.56 | 0.89 | 0.89 | 0.76 | 0.96 | 0.66 | 0.75
interpretation (AC1) | moderate | moderate | substantial | substantial | moderate | almost perfect | almost perfect | substantial | almost perfect | substantial | substantial
95% CI (AC1) | 0.42–0.62 | 0.36–0.57 | 0.64–0.80 | 0.59–0.76 | 0.47–0.66 | 0.84–0.93 | 0.85–0.93 | 0.69–0.84 | 0.94–0.99 | 0.57–0.75 | 0.67–0.82
interpretation (95% CI) | moderate-substantial | fair-moderate | substantial-almost perfect | moderate-substantial | moderate-substantial | almost perfect | almost perfect | substantial-almost perfect | almost perfect | moderate-substantial | substantial-almost perfect
p-value | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001
McNemar p-value | 0.07 | <0.001 | 0.43 | 0.37 | 0.63 | <0.001 | <0.00001 | <0.0001 | 0.75 | <0.00001 | <0.00001
McNemar OR | 0.62 (0.38–1.01) | 0.43 (0.26–0.69) | 0.74 (0.39–1.38) | 1.31 (0.78–2.18) | 1.16 (0.72–1.88) | 0.17 (0.06–0.48) | 0.08 (0.02–0.34) | 5.00 (2.08–12.01) | 1.50 (0.42–5.32) | 6.14 (2.76–13.66) | 8.50 (3.02–23.95)
conditional OR | 9.84 (5.62–17.25) | 7.63 (4.33–13.43) | 34.86 (17.69–68.71) | 6.05 (3.17–11.55) | 8.25 (4.69–14.52) | 10.29 (2.42–43.78) | 24.70 (4.55–133.95) | 72.90 (29.16–182.24) | 87.00 (20.46–370.01) | 40.66 (17.51–94.44) | 96.43 (33.09–281.00)
RD | –0.06 (–0.11–0.00) | –0.11 (–0.17––0.05) | –0.02 (–0.07–0.02) | 0.03 (–0.03–0.08) | 0.02 (–0.04–0.08) | –0.07 (–0.11––0.04) | –0.08 (–0.12––0.05) | 0.09 (0.05–0.13) | 0.01 (–0.02–0.03) | 0.13 (0.08–0.18) | 0.11 (0.07–0.15)
LR+ | 3.08 (2.33–4.08) | 2.53 (1.97–3.25) | 6.07 (4.12–8.95) | 3.75 (2.38–5.90) | 3.71 (2.60–5.29) | 5.56 (2.52–12.26) | 7.89 (4.33–14.36) | 18.25 (8.25–40.37) | 28.50 (9.74–83.35) | 13.20 (6.34–27.50) | 35.50 (13.39–94.09)
LR– | 0.31 (0.22–0.43) | 0.34 (0.24–0.50) | 0.17 (0.11–0.27) | 0.62 (0.50–0.79) | 0.45 (0.34–0.58) | 0.55 (0.27–1.10) | 0.32 (0.10–1.03) | 0.28 (0.21–0.38) | 0.44 (0.24–0.80) | 0.36 (0.28–0.46) | 0.30 (0.22–0.39)
NND | 5 (4–6) | 4 (4–5) | 7 (6–10) | 5 (4–6) | 5 (4–6) | 10 (8–15) | 11 (8–15) | 8 (6–11) | 28 (16–52) | 6 (5–8) | 8 (6–10)

Position: standing (n = 204 per variable)

Metric | H-AZ Cardiomegaly | H-AZ Consolidation | H-AZ Pleural Effusion | H-AZ Pulmonary Oedema | H-GL Consolidation | H-GL Mediastinal Mass | H-GL Nodule | H-GL Pleural Effusion | H-GL Pneumothorax | AZ-GL Consolidation | AZ-GL Pleural Effusion
confusion matrix (a/b/c/d) | 49/5/36/114 | 41/20/37/106 | 63/6/18/117 | 12/25/12/155 | 35/26/19/124 | 4/4/12/184 | 3/2/22/177 | 53/16/4/131 | 6/5/3/190 | 51/27/3/123 | 56/25/1/122
human prevalence | 0.26 (0.21–0.33) | 0.30 (0.24–0.37) | 0.34 (0.28–0.41) | 0.18 (0.13–0.24) | 0.30 (0.24–0.37) | 0.04 (0.02–0.08) | 0.02 (0.01–0.06) | 0.34 (0.28–0.41) | 0.05 (0.03–0.09) | 0.38 (0.32–0.45) | 0.40 (0.33–0.47)
sensitivity | 0.91 (0.80–0.96) | 0.67 (0.55–0.78) | 0.91 (0.82–0.96) | 0.32 (0.20–0.49) | 0.57 (0.45–0.69) | 0.50 (0.22–0.78) | 0.60 (0.23–0.88) | 0.77 (0.66–0.85) | 0.55 (0.28–0.79) | 0.65 (0.54–0.75) | 0.69 (0.58–0.78)
specificity | 0.76 (0.69–0.82) | 0.74 (0.66–0.81) | 0.87 (0.80–0.91) | 0.93 (0.88–0.96) | 0.87 (0.80–0.91) | 0.94 (0.90–0.96) | 0.89 (0.84–0.93) | 0.97 (0.93–0.99) | 0.98 (0.96–0.99) | 0.98 (0.93–0.99) | 0.99 (0.96–1.00)
balanced acc | 0.83 (0.78–0.88) | 0.71 (0.64–0.77) | 0.89 (0.84–0.93) | 0.63 (0.55–0.71) | 0.72 (0.65–0.79) | 0.72 (0.54–0.90) | 0.74 (0.45–0.96) | 0.87 (0.81–0.92) | 0.76 (0.61–0.92) | 0.82 (0.76–0.87) | 0.84 (0.79–0.89)
MCC | 0.60 (0.49–0.69) | 0.39 (0.26–0.52) | 0.75 (0.66–0.84) | 0.30 (0.12–0.47) | 0.46 (0.32–0.59) | 0.32 (0.05–0.57) | 0.23 (–0.03–0.43) | 0.78 (0.68–0.86) | 0.58 (0.27–0.83) | 0.69 (0.60–0.79) | 0.75 (0.66–0.82)
agreement | 0.80 (0.74–0.85) | 0.72 (0.66–0.78) | 0.88 (0.83–0.92) | 0.82 (0.76–0.87) | 0.78 (0.72–0.83) | 0.92 (0.88–0.95) | 0.88 (0.83–0.92) | 0.90 (0.85–0.94) | 0.96 (0.92–0.98) | 0.85 (0.80–0.90) | 0.87 (0.82–0.91)
Gwet’s AC1 | 0.64 | 0.49 | 0.78 | 0.76 | 0.63 | 0.91 | 0.86 | 0.83 | 0.96 | 0.74 | 0.77
interpretation (AC1) | substantial | moderate | substantial | substantial | substantial | almost perfect | almost perfect | almost perfect | almost perfect | substantial | substantial
95% CI (AC1) | 0.53–0.74 | 0.37–0.62 | 0.69–0.87 | 0.67–0.84 | 0.52–0.74 | 0.87–0.96 | 0.81–0.92 | 0.75–0.90 | 0.93–0.99 | 0.65–0.83 | 0.68–0.86
interpretation (95% CI) | moderate-substantial | fair-substantial | substantial-almost perfect | substantial-almost perfect | moderate-substantial | almost perfect | almost perfect | substantial-almost perfect | almost perfect | substantial-almost perfect | substantial-almost perfect
p-value | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.00001
McNemar p-value | <0.00001 | <0.05 | <0.05 | <0.05 | 0.37 | 0.08 | <0.0001 | <0.05 | 0.73 | <0.00001 | <0.00001
McNemar OR | 0.14 (0.05–0.35) | 0.54 (0.31–0.93) | 0.33 (0.13–0.84) | 2.08 (1.05–4.15) | 1.37 (0.76–2.47) | 0.33 (0.11–1.03) | 0.09 (0.02–0.39) | 4.00 (1.34–11.96) | 1.67 (0.40–6.97) | 9.00 (2.73–29.67) | 25.00 (3.39–184.51)
conditional OR | 31.03 (11.49–83.81) | 5.87 (3.06–11.28) | 68.25 (25.78–180.65) | 6.20 (2.51–15.32) | 8.79 (4.36–17.70) | 15.33 (3.41–68.99) | 12.07 (1.91–76.24) | 108.48 (34.65–339.61) | 76.00 (14.65–394.15) | 77.44 (22.49–266.73) | 273.28 (36.12–2067.71)
RD | –0.15 (–0.21––0.09) | –0.08 (–0.15––0.01) | –0.06 (–0.11––0.01) | 0.06 (0.01–0.12) | 0.03 (–0.03–0.10) | –0.04 (–0.08––0.00) | –0.10 (–0.14––0.05) | 0.06 (0.02–0.10) | 0.01 (–0.02–0.04) | 0.12 (0.07–0.17) | 0.12 (0.07–0.16)
LR+ | 3.79 (2.82–5.10) | 2.58 (1.86–3.58) | 7.00 (4.53–10.83) | 4.57 (2.23–9.36) | 4.38 (2.74–7.02) | 8.33 (3.44–20.16) | 5.45 (2.41–12.35) | 25.67 (9.69–67.98) | 27.50 (7.91–95.57) | 32.50 (10.50–100.57) | 69.00 (9.74–488.60)
LR– | 0.12 (0.05–0.27) | 0.45 (0.31–0.65) | 0.10 (0.05–0.22) | 0.73 (0.58–0.92) | 0.49 (0.37–0.67) | 0.53 (0.27–1.06) | 0.45 (0.15–1.32) | 0.24 (0.15–0.36) | 0.46 (0.24–0.88) | 0.36 (0.26–0.49) | 0.31 (0.23–0.43)
NND | 5 (4–7) | 4 (3–5) | 9 (6–13) | 6 (5–8) | 5 (4–6) | 13 (9–21) | 9 (6–13) | 11 (7–16) | 26 (14–50) | 7 (5–10) | 8 (6–12)

Position: supine (n = 63 per variable)

Metric | H-AZ Cardiomegaly | H-AZ Consolidation | H-AZ Pleural Effusion | H-AZ Pulmonary Oedema | H-GL Consolidation | H-GL Mediastinal Mass | H-GL Nodule | H-GL Pleural Effusion | H-GL Pneumothorax | AZ-GL Consolidation | AZ-GL Pleural Effusion
confusion matrix (a/b/c/d) | 30/21/4/8 | 29/2/16/16 | 25/10/5/23 | 15/9/13/26 | 22/9/12/20 | 0/0/9/54 | 1/0/2/60 | 22/13/2/26 | 2/1/0/60 | 31/14/3/15 | 21/9/3/30
human prevalence | 0.81 (0.70–0.89) | 0.49 (0.37–0.61) | 0.56 (0.43–0.67) | 0.38 (0.27–0.50) | 0.49 (0.37–0.61) | 0.00 (0.00–0.06) | 0.02 (0.00–0.08) | 0.56 (0.43–0.67) | 0.05 (0.02–0.13) | 0.71 (0.59–0.81) | 0.48 (0.36–0.60)
sensitivity | 0.59 (0.45–0.71) | 0.94 (0.79–0.98) | 0.71 (0.55–0.84) | 0.62 (0.43–0.79) | 0.71 (0.53–0.84) | – (---) | 1.00 (0.21–1.00) | 0.63 (0.46–0.77) | 0.67 (0.21–0.94) | 0.69 (0.54–0.80) | 0.70 (0.52–0.83)
specificity | 0.67 (0.39–0.86) | 0.50 (0.34–0.66) | 0.82 (0.64–0.92) | 0.67 (0.51–0.79) | 0.62 (0.45–0.77) | 0.86 (0.75–0.92) | 0.97 (0.89–0.99) | 0.93 (0.77–0.98) | 1.00 (0.94–1.00) | 0.83 (0.61–0.94) | 0.91 (0.76–0.97)
balanced acc | 0.63 (0.47–0.77) | 0.72 (0.62–0.82) | 0.77 (0.66–0.87) | 0.65 (0.52–0.77) | 0.67 (0.55–0.78) | 0.86 (0.50–0.50) | 0.98 (0.50–1.00) | 0.78 (0.68–0.87) | 0.83 (0.50–1.00) | 0.76 (0.65–0.86) | 0.80 (0.71–0.90)
MCC | 0.20 (–0.05–0.44) | 0.48 (0.27–0.66) | 0.53 (0.32–0.73) | 0.29 (0.04–0.52) | 0.34 (0.10–0.56) | 0.00 (0.00–0.00) | 0.57 (0.00–1.00) | 0.57 (0.38–0.75) | 0.81 (0.00–1.00) | 0.47 (0.27–0.67) | 0.63 (0.43–0.81)
agreement | 0.60 (0.48–0.71) | 0.71 (0.59–0.81) | 0.76 (0.64–0.85) | 0.65 (0.53–0.76) | 0.67 (0.54–0.77) | 0.86 (0.75–0.92) | 0.97 (0.89–0.99) | 0.76 (0.64–0.85) | 0.98 (0.92–1.00) | 0.73 (0.61–0.82) | 0.81 (0.70–0.89)
Gwet’s AC1 | 0.29 | 0.45 | 0.52 | 0.32 | 0.33 | 0.84 | 0.97 | 0.53 | 0.98 | 0.49 | 0.63
interpretation (AC1) | fair | moderate | moderate | fair | fair | almost perfect | almost perfect | moderate | almost perfect | moderate | substantial
95% CI (AC1) | 0.03–0.56 | 0.22–0.68 | 0.31–0.74 | 0.08–0.57 | 0.09–0.57 | 0.72–0.95 | 0.92–1.00 | 0.31–0.74 | 0.95–1.00 | 0.27–0.72 | 0.43–0.83
interpretation (95% CI) | slight-moderate | fair-substantial | fair-substantial | slight-moderate | slight-moderate | substantial-almost perfect | almost perfect-perfect | fair-substantial | almost perfect-perfect | fair-substantial | moderate-almost perfect
p-value | <0.05 | <0.001 | <0.00001 | <0.05 | <0.01 | <0.00001 | <0.00001 | <0.00001 | <0.00001 | <0.0001 | <0.00001
McNemar p-value | <0.001 | <0.01 | 0.30 | 0.52 | 0.66 | <0.01 | 0.50 | <0.01 | 1.00 | <0.05 | 0.15
McNemar OR | 5.25 (1.80–15.29) | 0.12 (0.03–0.54) | 2.00 (0.68–5.85) | 0.69 (0.30–1.62) | 0.75 (0.32–1.78) | 0.00 (---) | 0.00 (---) | 6.50 (1.47–28.80) | – (---) | 4.67 (1.34–16.24) | 3.00 (0.81–11.08)
conditional OR | 2.86 (0.76–10.73) | 14.50 (2.95–71.22) | 11.50 (3.42–38.71) | 3.33 (1.15–9.63) | 4.07 (1.42–11.70) | 5.74 (0.11–307.05) | 72.60 (2.32–2267.73) | 22.00 (4.47–108.24) | 201.67 (6.46–6299.25) | 11.07 (2.75–44.50) | 23.33 (5.64–96.60)
RD | 0.27 (0.13–0.41) | –0.22 (–0.34––0.10) | 0.08 (–0.04–0.20) | –0.06 (–0.21–0.08) | –0.05 (–0.19–0.09) | –0.14 (–0.23––0.06) | –0.03 (–0.08–0.01) | 0.17 (0.06–0.29) | 0.02 (–0.01–0.05) | 0.17 (0.05–0.30) | 0.10 (–0.01–0.20)
LR+ | 1.79 (0.78–4.11) | 1.88 (1.31–2.69) | 3.94 (1.73–8.97) | 1.88 (1.09–3.23) | 1.87 (1.13–3.08) | – (---) | – (---) | 9.00 (2.31–35.05) | – (---) | 4.06 (1.42–11.62) | 7.78 (2.58–23.46)
LR– | 0.61 (0.36–1.03) | 0.12 (0.03–0.48) | 0.35 (0.20–0.61) | 0.57 (0.32–1.00) | 0.47 (0.25–0.86) | – (---) | – (---) | 0.40 (0.26–0.62) | – (---) | 0.37 (0.23–0.60) | 0.33 (0.19–0.58)
NND | 3 (2–4) | 4 (3–6) | 5 (3–7) | 3 (3–5) | 3 (3–5) | 7 (4–12) | 21 (8–62) | 5 (3–7) | 32 (10–115) | 4 (3–6) | 6 (4–9)

Position: sitting (n = 12 per variable)

Metric | H-AZ Cardiomegaly | H-AZ Consolidation | H-AZ Pleural Effusion | H-AZ Pulmonary Oedema | H-GL Consolidation | H-GL Mediastinal Mass | H-GL Nodule | H-GL Pleural Effusion | H-GL Pneumothorax | AZ-GL Consolidation | AZ-GL Pleural Effusion
confusion matrix (a/b/c/d) | 7/0/2/3 | 4/1/1/6 | 6/1/0/5 | 1/0/1/10 | 4/1/0/7 | 0/0/3/9 | 1/0/1/10 | 6/1/0/5 | 0/0/1/11 | 3/2/1/6 | 6/0/0/6
human prevalence | 0.58 (0.32–0.81) | 0.42 (0.19–0.68) | 0.58 (0.32–0.81) | 0.08 (0.01–0.35) | 0.42 (0.19–0.68) | 0.00 (0.00–0.24) | 0.08 (0.01–0.35) | 0.58 (0.32–0.81) | 0.00 (0.00–0.24) | 0.42 (0.19–0.68) | 0.50 (0.25–0.75)
sensitivity | 1.00 (0.65–1.00) | 0.80 (0.38–0.96) | 0.86 (0.49–0.97) | 1.00 (0.21–1.00) | 0.80 (0.38–0.96) | – (---) | 1.00 (0.21–1.00) | 0.86 (0.49–0.97) | – (---) | 0.60 (0.23–0.88) | 1.00 (0.61–1.00)
specificity | 0.60 (0.23–0.88) | 0.86 (0.49–0.97) | 1.00 (0.57–1.00) | 0.91 (0.62–0.98) | 1.00 (0.65–1.00) | 0.75 (0.47–0.91) | 0.91 (0.62–0.98) | 1.00 (0.57–1.00) | 0.92 (0.65–0.99) | 0.86 (0.49–0.97) | 1.00 (0.61–1.00)
balanced acc | 0.80 (0.50–1.00) | 0.83 (0.56–1.00) | 0.93 (0.75–1.00) | 0.95 (0.50–1.00) | 0.90 (0.67–1.00) | 0.75 (0.50–0.50) | 0.95 (0.50–1.00) | 0.93 (0.78–1.00) | 0.92 (0.50–0.50) | 0.73 (0.44–1.00) | 1.00 (1.00–1.00)
MCC | 0.68 (0.00–1.00) | 0.66 (0.12–1.00) | 0.85 (0.52–1.00) | 0.67 (0.00–1.00) | 0.84 (0.52–1.00) | 0.00 (0.00–0.00) | 0.67 (0.00–1.00) | 0.85 (0.53–1.00) | 0.00 (0.00–0.00) | 0.48 (–0.13–1.00) | 1.00 (1.00–1.00)
agreement | 0.83 (0.55–0.95) | 0.83 (0.55–0.95) | 0.92 (0.65–0.99) | 0.92 (0.65–0.99) | 0.92 (0.65–0.99) | 0.75 (0.47–0.91) | 0.92 (0.65–0.99) | 0.92 (0.65–0.99) | 0.92 (0.65–0.99) | 0.75 (0.47–0.91) | 1.00 (0.76–1.00)

The agreement observed per diagnosis was calculated as the ratio of matching ratings on the main diagonal of the 2×2 contingency table relative to all cases. Wilson score 95% confidence intervals were computed using the statsmodels confint routine [28]. Interrater reliability (IRR) was tested by applying the interrater reliability Chance-corrected Agreement Coefficients (irrCAC) package [32]. irrCAC allows the extraction of IRR variables such as Gwet’s AC1, which is known to be advantageous for imbalanced data sets [33]. In addition, irrCAC provides the corresponding p-value, allowing the null hypothesis (H0) of no agreement beyond what would be expected purely by chance to be tested. Gwet’s AC1 is defined on the interval [–1, 1]. Landis and Koch have defined a table for its interpretation, see [Table 3] [31].
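For the two-rater, two-category setting used here, Gwet’s AC1 has a simple closed form; the sketch below illustrates the definition directly from a 2×2 contingency table (the study itself relied on the irrCAC package [32] rather than this hand-rolled version).

```python
# Illustrative computation of Gwet's AC1 for two raters and two categories,
# from its standard definition; the study used the irrCAC package [32] instead.
def gwets_ac1(a: int, b: int, c: int, d: int) -> float:
    """a, b, c, d: paired counts for (+,+), (+,-), (-,+), (-,-) ratings."""
    n = a + b + c + d
    p_o = (a + d) / n                    # observed agreement
    pi = ((a + b) + (a + c)) / (2 * n)   # mean positive-rating proportion of both raters
    p_e = 2 * pi * (1 - pi)              # chance agreement for two categories
    return (p_o - p_e) / (1 - p_e)

# Human-AZmed cardiomegaly counts from Table 4 (all positions): AC1 is about 0.52
print(round(gwets_ac1(86, 26, 42, 125), 2))
```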

Systematic disagreement between paired raters was tested using McNemar’s exact test [34]. Since statistical significance testing can overemphasize clinical relevance, as p-values do not reflect the magnitude of an effect [35] [36] [37], we added measures of clinical effect size together with their 95% confidence intervals. Specifically, we calculated McNemar odds ratios (ORs), conditional ORs, risk differences (RD), likelihood ratios for positive (LR+) and negative (LR−) test results, and the number needed to diagnose (NND). Paired agreement between raters was assessed using McNemar odds ratios, with confidence intervals calculated from the standard error of the log odds ratio. Marginal odds ratios were also computed, applying a continuity correction of 0.5 to all cells when any cell contained zero counts, and confidence intervals were derived on the logarithmic scale to quantify the association between raters’ classifications. The RD and its confidence interval were calculated from the discordant pairs of the contingency table using the standard error formula for paired proportions. LR+ and LR- with confidence intervals were calculated from the 2×2 contingency table, applying log-transformation for interval estimation. NND, defined as the reciprocal of the proportion of diagnostic disagreements, was calculated with its confidence interval derived from the Wilson score interval for the disagreement proportion, i.e. patients per misdiagnosis [28].
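A hedged sketch of these paired effect-size measures, computed from a single 2×2 table, is given below; the confidence-interval formulas are omitted for brevity, and the exact conventions of the supplement code (S 2) may differ.

```python
# Point estimates of the paired effect sizes described above, from a 2x2 table.
# Degenerate tables (e.g. no discordant pairs) are not fully handled in this sketch.
import math
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_effect_sizes(a, b, c, d):
    """a: both raters positive, b: reference+/comparator-, c: reference-/comparator+,
    d: both raters negative."""
    n = a + b + c + d
    p_mcnemar = mcnemar(np.array([[a, b], [c, d]]), exact=True).pvalue
    mcnemar_or = b / c if c else float("inf")             # discordant-pair odds ratio
    # association (cross-product) OR, 0.5 continuity correction if any cell is zero
    a_, b_, c_, d_ = ((a, b, c, d) if 0 not in (a, b, c, d)
                      else tuple(x + 0.5 for x in (a, b, c, d)))
    assoc_or = (a_ * d_) / (b_ * c_)
    rd = (b - c) / n                                       # paired risk difference
    sens, spec = a / (a + b), d / (c + d)
    lr_pos, lr_neg = sens / (1 - spec), (1 - sens) / spec
    nnd = math.ceil(n / (b + c))                           # patients per disagreement
    return {"McNemar p": p_mcnemar, "McNemar OR": mcnemar_or, "association OR": assoc_or,
            "RD": rd, "LR+": lr_pos, "LR-": lr_neg, "NND": nnd}

# Human-AZmed cardiomegaly counts from Table 4 (all positions)
print(paired_effect_sizes(86, 26, 42, 125))
```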

P-values were reported exactly to two decimal places or, if very small, as thresholds (<alpha, <0.01, <0.001, <0.0001, <0.00001) to indicate increasing levels of statistical significance. Alpha was set to 0.05 for the present study. The LLMs GPT-5 [38], Claude Opus 4 [39], and DeepSeek-V3.1 [40] were used for Python code debugging and manuscript writing.
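As a small illustration of this reporting convention (a convenience helper, not part of the published supplement):

```python
# Report a p-value exactly to two decimals, or as the smallest applicable threshold.
def format_p(p: float, alpha: float = 0.05) -> str:
    thresholds = [(0.00001, "<0.00001"), (0.0001, "<0.0001"),
                  (0.001, "<0.001"), (0.01, "<0.01"), (alpha, f"<{alpha}")]
    for cut, label in thresholds:
        if p < cut:
            return label
    return f"{p:.2f}"

print(format_p(0.0000004), format_p(0.03), format_p(0.43))  # <0.00001 <0.05 0.43
```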



Results

In the following section, we discuss the sample set itself and the results obtained by pairs of raters, including both Human-AI and AI-AI comparisons.

Sample set

Each of the three raters assessed all 279 patients included in the present study, which allowed for paired testing. Of the 279 studies, 204 were acquired in standing position, 63 in supine position, and 12 in sitting position. The median patient age was 66 years (interquartile range: 51–76 years, [Table 2]). The age distribution showed a peak in the 7th decade of life ([Fig. 2]), which is typical of the patient population at a university hospital.

Fig. 2 Age distribution of patients included in the study.

Human-AI paired ratings, assessed overall

The prevalence of pathological findings as determined by human readers varied considerably. Common findings (>20%) included cardiomegaly (40%), pleural effusion (40%), and consolidation (35%). Pulmonary oedema (22%) occurred with intermediate frequency, while pneumothorax (5%), mediastinal mass (3%), and pulmonary nodule (3%) were rare.

Sensitivity varied across diagnoses and AI systems. In the Human-AZmed comparison, pleural effusion reached the highest sensitivity (0.85), followed by cardiomegaly (0.77) and consolidation (0.76). Pulmonary oedema showed lower sensitivity (0.45). In the Human-Gleamer comparison, sensitivity was 0.73 for pleural effusion, 0.71 for pulmonary nodule, 0.63 for consolidation, 0.57 for pneumothorax, and 0.50 for mediastinal mass. Specificity was generally higher than sensitivity. Human-AZmed specificity ranged from 0.70 for consolidation to 0.88 for pulmonary oedema. Human-Gleamer specificity was consistently high, with values from 0.83 for consolidation up to 0.98 for pneumothorax; mediastinal mass and pulmonary nodule both reached 0.91. Balanced accuracy reflected these findings. Human-AZmed values were 0.85 for pleural effusion, 0.76 for cardiomegaly, 0.73 for consolidation, and 0.67 for pulmonary oedema. Human-Gleamer values were 0.85 for pleural effusion, 0.81 for pulmonary nodule, 0.78 for pneumothorax, 0.73 for consolidation, and 0.71 for mediastinal mass.

Agreement measures showed pleural effusion as the strongest category, with MCC up to 0.70 (Human-AZmed) and 0.73 (Human-Gleamer). In contrast, rare pathologies such as mediastinal mass demonstrated substantially lower agreement. Overall agreement ranged from 0.72 to 0.86 for Human-AZmed, and 0.76 to 0.96 for Human-Gleamer. Gwet’s AC1 coefficients indicated moderate to substantial agreement for Human-AZmed (0.47–0.72) and moderate to almost perfect agreement for Human-Gleamer (0.56–0.96); all Gwet’s AC1 measures were statistically significant, i.e. observed agreement exceeded what would have been expected purely by chance.

McNemar’s test indicated significant differences for consolidation in the Human-AZmed comparison (p < 0.001); for mediastinal mass (p < 0.001), pulmonary nodule (p < 0.00001), and pleural effusion (p < 0.0001) in the Human-Gleamer comparison. No significant differences were observed for cardiomegaly, pleural effusion (Human-AZmed), pulmonary oedema, consolidation (Human-Gleamer), or pneumothorax.

McNemar ORs took values both greater than and less than 1 for Human-AZmed and Human-Gleamer, depending on the diagnosis. The strongest likelihood for humans to miss or under-call a diagnosis was seen for mediastinal masses and nodules compared with Gleamer (McNemar OR 0.17 and 0.08). Marginal odds ratios demonstrated very strong positive associations, particularly for pleural effusion and pneumothorax (both Human-Gleamer), i.e. a positive human label was positively associated with a positive prediction by the AI systems.

Compared with humans, AZmed tended to assign fewer positive labels for cardiomegaly (RD -0.06, 95% CI -0.11–0.00) and consolidation (RD -0.11, 95% CI -0.17–-0.05), while the differences were minimal for pleural effusion (RD -0.02, 95% CI -0.07–0.02) and pulmonary oedema (RD 0.03, 95% CI -0.03–0.08). Analysis of RDs showed that Gleamer had a higher positive call rate for pleural effusion (RD 0.09, 95% CI 0.05–0.13), whereas humans more frequently classified mediastinal masses (RD -0.07, 95% CI -0.11–-0.04) and nodules (RD -0.08, 95% CI -0.12–-0.05) as negative. Likelihood ratios further supported these findings; positive Gleamer results were strongly corroborated, particularly for pleural effusion (LR+ 18.25, LR− 0.28) and pneumothorax (LR+ 28.50, LR− 0.44), with other variables showing moderate support (LR+ 3.7–7.9, LR− 0.32–0.55). AZmed demonstrated moderate support for positive findings across all variables (LR+ 2.5–6.1, LR− 0.17–0.62), most notably for pleural effusion (LR+ 6.07, LR− 0.17). Overall, these results indicate that both AI systems reliably identify key pathologies, with Gleamer showing particularly strong performance for pleural effusion and pneumothorax.

The NND was lowest for common pathologies, including cardiomegaly (NND 5, 95% CI 4–6) and consolidation (AZmed 4, 95% CI 4–5; Gleamer 5, 95% CI 4–6). NND was higher for less frequent findings such as mediastinal masses (10, 95% CI 8–15), as well as for pleural effusion (AZmed 7, 95% CI 6–10; Gleamer 8, 95% CI 6–11). Pneumothorax had the highest NND (28, 95% CI 16–52), reflecting the rarity of this finding despite a generally high level of agreement between AI and human readers.


Human-AI paired ratings, split by positions

In standing radiographs (n=204), human prevalence was lower than overall for cardiomegaly (26% vs 40%), consolidation (30% vs 35%), and pleural effusion (34% vs 40%). Compared with the overall data, AZmed showed higher sensitivity for cardiomegaly (0.91 vs 0.77) and pleural effusion (0.91 vs 0.85) and lower sensitivity for pulmonary oedema (0.32 vs 0.45); balanced accuracy reflected these trends. Likelihood ratios supported strong diagnostic value for pleural effusion (Human-AZmed, LR+ 7.00, LR− 0.10). Gleamer showed lower sensitivity for pleural effusion (0.77) but higher specificity (0.97) compared to AZmed, with strong likelihood ratios (LR+ 25.67, LR− 0.24). NND was higher for Gleamer (5–26 vs 4–9 for AZmed). Overall, standing radiographs highlighted increased AZmed sensitivity for common findings and higher Gleamer specificity for rarer pathologies.

In supine radiographs (n=63), human prevalence was higher than overall for cardiomegaly (81% vs 40%) and moderate for consolidation (49% vs 35%) and pleural effusion (56% vs 40%). AZmed sensitivity was higher for consolidation (0.94 vs 0.76) but lower for cardiomegaly (0.59 vs 0.77), while pleural effusion and pulmonary oedema were moderate (0.71 and 0.62). Gleamer showed moderate sensitivity for consolidation (0.71) and pleural effusion (0.63) with high specificity (0.62–1.00). Balanced accuracy was slightly lower than overall, and McNemar ORs did not exhibit a clear trend compared to overall data. NND ranged from 3–32, reflecting limited incremental diagnostic gain due to higher baseline prevalence.

In sitting radiographs (n=12), human prevalence was intermediate for cardiomegaly and pleural effusion (58%) and lower for consolidation (42%) and pulmonary oedema (8%). AZmed sensitivity was high for cardiomegaly and pulmonary oedema (both 1.00) and pleural effusion (0.86), moderate for consolidation (0.80). Gleamer showed, compared to the overall data, high sensitivity for consolidation (0.80) and pleural effusion (0.86) and high specificity (0.75–1.00).


AI-AI comparison

Across all positions, AZmed and Gleamer showed substantial agreement for consolidation and pleural effusion (balanced accuracy 0.81–0.84, MCC 0.65–0.73). For standing scans (n=204), agreement was strong for consolidation (0.85) and pleural effusion (0.87), with McNemar ORs indicating more AZmed-positive than Gleamer-positive detections (consolidation 9.0, pleural effusion 25.0). For scans in supine position (n=63), agreement remained high for consolidation (0.73) and pleural effusion (0.81), with conditional ORs suggesting raters’ classifications to be positively associated (11.1–23.3). For sitting patients (n=12), both systems showed high to perfect agreement for consolidation and pleural effusion (0.75–1.00). Overall, AZmed and Gleamer exhibited strong agreement with regard to consolidation and pleural effusion across positions.



Discussion

In this study, we performed a direct, real-world clinical comparison of two commercially available AI-based chest X-ray tools – Rayvolve (AZmed) and ChestView (Gleamer) – against human radiologists using a cohort of 279 studies and seven key thoracic pathologies. Our findings provide insights into the performance, limitations, and potential clinical integration of AI systems in radiology and are contextualized by the growing body of evidence from similar commercial systems. This evaluation reflects the intended real-world deployment of such AI tools as decision-support systems running alongside routine radiological practice. While non-blinded, our study adds relevant insights about the performance of AI systems, similar to previously non-blinded published work in three studies [17] [24] [25].

Interpretation of study findings

Both AI systems demonstrated strong overall concordance when compared with human radiologists, for example, with strong balanced accuracy, agreement, and IRR (Gwet’s AC1 up to 0.96, pneumothorax). These results are in line with prior work showing that modern AI algorithms can achieve clinical agreement comparable to trained radiologists for common chest X-ray findings [41] [42]. The high level of agreement for common pathologies like pleural effusion and consolidation supports the potential utility of AI for triage and second-read scenarios, reinforcing its role as a complementary tool. Our findings are strongly supported by previous studies on these specific systems. Bennani et al. (2023) demonstrated that the Gleamer AI (ChestView) improved radiologists’ sensitivity across all expertise levels, with absolute increases of up to 26% for pneumothorax and 14% for consolidation, while also reducing reading times by 31% [24]. In a separate follow-up study by Selvam et al. (2025) on emergency chest X-ray, the Gleamer AI system improved sensitivity for consolidation, pleural effusion, and nodules [25]. Similarly, a multi-reader, multi-case study by Bettinger et al. (2024) on the AZmed system (Rayvolve) reported that AI assistance led to a significant 16% increase in the area under the curve (AUC), an 11% boost in sensitivity, and a 36% reduction in interpretation time [17]. The current study expands on this evidence by providing a direct head-to-head comparison of both commercially implemented AI systems operating in parallel within a routine clinical workflow.

Across most pathologies, both AI systems in our study exhibited higher specificity than sensitivity relative to the human readers, in the absence of an independent ground truth. This conservative detection pattern, which prioritizes reducing false positives, is clinically valuable in high-pressure settings such as emergency departments or ICUs, where overcalling may lead to unnecessary and invasive follow-up studies [43].

A key finding of our study was that patient positioning substantially influenced AI-Human agreement. Supine radiographs tended to show the greatest discrepancies. This aligns with prior literature indicating that supine imaging – common in ICU or trauma settings – poses inherent challenges due to altered anatomical projections, magnification, and overlapping structures [14] [44]. These findings highlight the critical importance of taking patient positioning into account when implementing AI in clinical workflows, and they suggest a potential value for position-specific algorithm training to improve generalizability.

The overall data set contained a strong imbalance towards negative assessments, which is a recurring attribute in medical imaging [33]. This imbalance was particularly pronounced for rare findings such as mediastinal mass (human prevalence 3%), nodule (3%), and pneumothorax (5%). Conversely, higher positive rates were observed for cardiomegaly, consolidation, and pleural effusion, especially in supine studies. This is likely because supine positioning naturally increases the width of the heart’s silhouette, making the distinction between physiological and pathological enlargement more challenging and subjective.


Clinical implications

Overall, our findings are consistent with the broader literature [17] [24] [25] and they suggest that those commercially available AI systems can reliably support radiologists in routine chest X-ray interpretation. The proven benefits in clinical agreement and efficiency gains support their use for common pathologies and standard projections. However, our results also clearly show that AI performance is not infallible; it varies with patient positioning and pathology prevalence. This highlights areas where human oversight remains essential, particularly for complex cases, rare findings, and non-standard projections [14].


Limitations

Several limitations of this study should be acknowledged. First, the data set was limited to a single university hospital, which may reduce generalizability to other institutions or patient populations. Second, the number of supine and in particular sitting radiographs was relatively small, limiting the statistical power to assess AI-Human agreement in these positions robustly. Third, rare pathologies were underrepresented, which is a common challenge in AI imaging studies but can nevertheless lead to less reliable performance metrics for these conditions. Fourth, both AI systems were evaluated in a “black box” manner without insight into their specific decision-making processes, which can limit interpretability and clinical trust. Fifth, gender distribution data were not captured due to institutional privacy protocols, limiting demographic characterization of our cohort.

Finally, while radiologists maintained full legal responsibility, their real-time access to AI outputs via PACS could have subtly influenced their reporting behavior, potentially introducing bias into the comparison. No ground truth was available, for example, through additional imaging or biopsy. Instead, human labels served as reference for the AI evaluation. This might, as in previous studies [17] [24] [25], introduce mutual reinforcement; however, concordance or agreement must not be misinterpreted as accuracy.

Future studies should aim for multicenter designs, larger data sets enriched for rare conditions, and prospective assessment with AI feedback blinded to the human readers, in order to evaluate true performance and integration more rigorously.


Conclusions

This study demonstrates that commercially available AI chest X-ray tools achieve a high level of agreement with human radiologists in real-world practice, especially for common pathologies and standing-position radiographs. Specificity tends to exceed sensitivity, emphasizing conservative detection strategies. Patient positioning and low-prevalence pathologies remain key challenges, underscoring the importance of careful implementation and continued human oversight in clinical workflows.

Previous work has already shown that either of the two AI systems can be a valuable support for clinical radiologists, increasing performance and speed [17] [24] [25], and our findings reinforce this conclusion. However, the observed discrepancies with human diagnoses highlight the continued need for human oversight in clinical decision-making. In this context, running multiple AI systems in parallel and considering their consensus could further enhance diagnostic reliability. A deeper understanding of raters’ diagnostic processes could also be attained by integrating not only the diagnoses but also the model reasoning, for example class activation maps, into the statistical analyses: two raters might give the identical diagnosis for two different lesions, or two raters might provide different diagnoses when labeling the identical lesion.

Once these AI systems begin to be used widely in clinical routine, they will start to fulfill the prediction made at the Dartmouth AI workshop [45]. However, a dramatic deskilling of radiologists due to AI applications [46] seems to be a concern for the distant future rather than the immediate one.



Supplements

S 1: collected raw data (n = 279) with diagnoses per rater (.xlsx, as well as .csv)

S 2: statistical evaluation code (.py)

The study's raw data and Python code are deposited permanently on figshare under DOI: 10.6084/m9.figshare.28692659.


Ethics declaration

The study received an ethics waiver (Req-2025-00216) from the Cantonal Ethics Committee Bern (Kantonale Ethikkommission für die Forschung, Gesundheits-, Sozial- und Integrationsdirektion), dated 17 February 2025. All experiments were conducted in accordance with relevant guidelines and regulations, including the Declaration of Helsinki. Informed consent was obtained from all participants and/or their legal guardians prior to study participation.



Conflict of Interest

The authors declare that they have no conflict of interest.


Correspondence

PD Dr. Dr. med. Wolfram A. Bosbach
Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern
Bern
Switzerland   

Publication History

Received: 09 April 2025

Accepted after revision: 11 December 2025

Article published online:
20 January 2026

© 2026. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

