DOI: 10.1055/a-2772-7798
Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine
Neue KI-Systeme zur Thoraxröntgen-Diagnostik: Qualitätsbewertung der Übereinstimmung mit ärztlichen Diagnosen im klinischen Alltag
Supported by: JF Senge and P Dlotko were supported by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research.
Abstract
Purpose
The rising demand for radiology services calls for innovative solutions to sustain diagnostic quality and efficiency. This study evaluated the diagnostic agreement between two commercially available artificial intelligence (AI) chest X-ray systems and human radiologists during routine clinical practice.
Materials and Methods
We retrospectively analyzed 279 chest X-rays (204 standing, 63 supine, 12 sitting) from a Swiss university hospital. Seven thoracic pathologies – cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema – were assessed. Radiologists’ routine reports were compared against Rayvolve (AZmed) and ChestView (Gleamer), both Paris, France. Python code, provided as an open-access supplement, was used to calculate performance metrics, agreement measures, and effect sizes.
Results
Agreement between radiologists and AI ranged from moderate to almost perfect: Human-AZmed (Gwet’s AC1: 0.47–0.72, moderate to substantial) and Human-Gleamer (Gwet’s AC1: 0.56–0.96, moderate to almost perfect). Balanced accuracies ranged from 0.67 to 0.85 for Human-AZmed and from 0.71 to 0.85 for Human-Gleamer, with peak performance for pleural effusion (0.85 for both systems). Specificity consistently exceeded sensitivity across pathologies (0.70–0.98 vs. 0.45–0.85). Common findings showed strong performance: pleural effusion (MCC 0.70–0.73), cardiomegaly (MCC 0.51), and consolidation (MCC 0.45–0.46). Rare pathologies demonstrated lower agreement: mediastinal mass and nodules (MCC 0.23–0.31). Standing radiographs yielded superior agreement compared with supine studies. The two AI systems showed substantial inter-system agreement for consolidation and pleural effusion (balanced accuracy 0.81–0.84).
Conclusion
Both commercial AI chest X-ray systems demonstrated comparable performance to human radiologists for common thoracic pathologies, with no meaningful differences between platforms. Performance was strongest for standing radiographs but declined for rare findings and supine studies. Position-dependent variability and reduced sensitivity for uncommon pathologies underscore the continued need for human oversight in clinical practice.
Key Points
-
AI systems matched radiologists for common chest X-ray findings.
-
Standing radiographs achieved the highest diagnostic agreement.
-
Rare pathologies showed weaker AI-human agreement.
-
Supine studies reduced diagnostic performance.
-
Human oversight remains essential in clinical practice.
Citation Format
-
Bosbach WA, Schoeni L, Senge JF et al. Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine. Rofo 2025; DOI 10.1055/a-2778-3892
Zusammenfassung
Ziel
Die steigende Nachfrage nach radiologischen Untersuchungen erfordert innovative Lösungen zur Aufrechterhaltung der diagnostischen Qualität und Effizienz. Diese Studie bewertete die diagnostische Übereinstimmung zwischen zwei kommerziell verfügbaren KI-Systemen für Thoraxröntgenaufnahmen und Radiologen im klinischen Alltag.
Materialien und Methoden
Wir analysierten retrospektiv 279 Thoraxröntgenaufnahmen (204 stehend, 63 liegend, 12 sitzend) eines Schweizer Universitätsspitals. Sieben thorakale Pathologien wurden bewertet: Kardiomegalie, Konsolidierung, Mediastinaltumor, Rundherd, Pleuraerguss, Pneumothorax und Lungenödem. Die Routinebefunde der Radiologen wurden mit Rayvolve (AZmed) und ChestView (Gleamer, beide aus Paris, Frankreich) verglichen. Ein Python-Code, als Open-Access-Supplement bereitgestellt, berechnete Leistungsmetriken, Übereinstimmungsmaße und Effektstärkenquantifizierung.
Ergebnisse
Die Übereinstimmung zwischen Radiologen und KI reichte von moderat bis fast perfekt: Mensch-AZmed (Gwet’s AC1: 0,47–0,72, moderat bis substanziell) und Mensch-Gleamer (Gwet’s AC1: 0,56–0,96, moderat bis fast perfekt). Die balancierte Genauigkeit lag zwischen 0,67–0,85 für Mensch-AZmed und 0,71–0,85 für Mensch-Gleamer, mit Höchstleistung bei Pleuraerguss (0,85 beide Systeme). Die Spezifität übertraf durchgehend die Sensitivität bei allen Pathologien (0,70–0,98 vs. 0,45–0,85). Häufige Befunde zeigten starke Leistung: Pleuraerguss (MCC 0,70–0,73), Kardiomegalie (MCC 0,51) und Konsolidierung (MCC 0,45–0,46). Seltene Pathologien demonstrierten geringere Übereinstimmung: Mediastinaltumor und Rundherde (MCC 0,23–0,31). Stehende Röntgenaufnahmen erzielten bessere Übereinstimmung als Aufnahmen in Rückenlage. Die beiden KI-Systeme zeigten substanzielle Übereinstimmung untereinander bei Konsolidierung und Pleuraerguss (balancierte Genauigkeit 0,81–0,84).
Schlussfolgerung
Beide kommerziellen KI-Systeme für Thoraxröntgen zeigten vergleichbare Leistung zu Radiologen bei häufigen thorakalen Pathologien, ohne bedeutsame Unterschiede zwischen den Plattformen. Die Leistung war bei stehenden Aufnahmen am stärksten, nahm jedoch bei seltenen Befunden und Aufnahmen in Rückenlage ab. Lageabhängige Variabilität und reduzierte Sensitivität für seltene Pathologien unterstreichen die anhaltende Notwendigkeit ärztlicher Supervision in der klinischen Praxis.
Kernaussagen
-
KI-Systeme entsprachen Radiologen bei häufigen Thoraxröntgen-Befunden.
-
Stehende Aufnahmen erzielten die höchste diagnostische Übereinstimmung.
-
Seltene Pathologien zeigten schwächere KI-Mensch-Übereinstimmung.
-
Liegende Aufnahmen reduzierten die diagnostische Leistung.
-
Ärztliche Supervision bleibt in der klinischen Praxis unerlässlich.
Keywords
Chest X-ray - Deep Learning - Multi-label Classification - Explainability - Medical Imaging
Introduction
The demand for clinical radiology services is predicted to grow substantially in the future. According to certain scenarios, future demand could potentially outpace available capacities in radiology [1]. Novel artificial intelligence (AI) software solutions might provide valuable support and assist the human radiologist, increasing patient throughput while maintaining or even improving diagnostic quality. Applications of AI are thought to be possible in, for instance, report drafting by large language models (LLM) [2] [3] [4], recommendation of appropriate interventional procedures [5], assessment of intramuscular fat fraction [6], or in the reconstruction of undersampled magnetic resonance imaging (MRI) data [7]. One of the largest fields of potential AI application in radiology is that of pattern recognition, for example, for lesion labeling [8] [9]. The competence to reliably recognize or create patterns is a vital requirement, regardless of whether one is working with text data or image data. Despite promising recent developments, it has been reported that the potential offered by novel AI systems is not unlimited. This appears to be true, for example, for quantification of radiation dose in computed tomography [10] or optimization of acquisition time in MRI reconstruction [11] [12]. The impact of hallucinations on clinical data is important to consider in this context [13].
Chest X-ray, although basic and long established, is of particular importance in clinical radiology due to its low cost, low radiation, and widespread availability. The sheer case volume makes chest X-ray a promising target for automation attempts [14]. There is already research on chest X-ray report automation [15] [16]. Recently, commercial software providers have started to offer chest X-ray analysis tools. One prominent example is Rayvolve for Chest (manufactured by AZmed, Paris, France). This AI tool, designed to detect chest pathologies in X-rays, has been tested before and was found to improve the speed and performance of human radiologists [17]. Studies are also available on the AZmed sister tool for fracture detection [18] [19] [20]. The AZmed system for chest X-ray consists of an ensemble approach that combines five RetinaNet-based object-detection models sharing a common VGG-16 backbone architecture [21] [22]. Another chest X-ray tool now available commercially is ChestView (manufactured by Gleamer, Paris, France). Gleamer has been tested before for fracture detection [23] and for chest X-rays. In the area of chest X-ray, Gleamer has been shown to reduce the time needed by human radiologists to complete study interpretations and to increase the sensitivity of human radiologists [24] [25]. For its procedures, Gleamer relies on a deep convolutional neural network, namely the object-detection framework Detectron2 [26].
In this study, we assess the output of the two chest X-ray AI systems mentioned above. We compare their results – AI to AI – and against diagnoses by human radiologists made during non-blinded clinical practice, reflecting the intended use case of the software providers (see [Table 1] for the list of diagnoses; the assessments follow the non-blinded work published in three studies [17] [24] [25]). To the best of our knowledge, this is the first study to compare both systems while operating in parallel on a shared set of chest X-ray studies.
Materials and Methods
The following section describes the three reporting entities, the data set (n = 279, [Fig. 1], [Table 2]), and statistical evaluation methods. The study’s raw data and the study’s Python code are included as open access supplements (S 1 raw input data, S 2 Python source code).
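As a minimal sketch of how the supplementary raw data can be turned into the pairwise contingency tables used below, the following lines illustrate one possible workflow; the file name and column names are hypothetical placeholders and do not reproduce the layout of supplement S 1 or the code of supplement S 2.

```python
import pandas as pd

# Illustrative only: the file name and column names are assumptions,
# not the actual layout of supplement S 1.
df = pd.read_csv("S1_raw_input_data.csv")

def contingency_table(data: pd.DataFrame, rater_a: str, rater_b: str) -> pd.DataFrame:
    """Pairwise 2x2 contingency table of binary ratings for one diagnosis."""
    return pd.crosstab(data[rater_a], data[rater_b])

# Example: Human vs. AZmed for pleural effusion (hypothetical column names)
print(contingency_table(df, "human_pleural_effusion", "azmed_pleural_effusion"))
```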


Raters
This present study compares three reporting entities for chest X-ray diagnostics:
-
Reports from human radiologists during routine clinical practice,
-
Rayvolve for Chest (AZmed, Paris, France),
-
ChestView (Gleamer, Paris, France).
Human radiologists wrote their reports as part of their everyday clinical routine. Standard procedure was for residents to draft reports, which were then reviewed and finalized by consultants. For this study, all human ratings were extracted directly from the finalized radiology report texts. The two software applications were running in the background and automatically deposited their assessments in the picture archiving and communication system (PACS). Radiologists had been made aware of the software trials and that the software had pre-approved status. Although radiologists could access the AI assessments from Rayvolve and ChestView through the PACS (similar to the study protocol found in the publications referenced [17] [24] [25]), they maintained full responsibility for writing their official clinical reports with all associated legal liabilities. The radiologists maintained their independent diagnostic judgment and remained professionally accountable for their interpretations and conclusions, regardless of whether they chose to consult the AI-generated assessments during their workflow. The study measured Human-AI agreement and disagreement for non-blinded human radiologists. This type of non-blinded measurement has been reported on previously for both AZmed [17] and Gleamer [24] [25], and the setup reflects the intended real-world use case for such AI systems operating alongside routine clinical workflows.
In total, seven pathologies are included in this study: cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema. Human radiologists assessed all seven. AZmed reported four of the pathologies, and Gleamer reported five ([Table 1], supplement S 1). For the diagnoses listed in [Table 1], AZmed reports a probability estimate on a discrete three-point scale: [no, low, high], and Gleamer uses: [no, doubt, yes]. To enable comparability with the variable expression of certainty in human-written reports [27], we transformed all ratings onto a binary scale [0, 1] for this study, where 0 indicates that a diagnosis is negative and 1 indicates that the reported probability for that diagnosis is anything greater than 0.
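A minimal sketch of this binarization, assuming a simple string encoding of the vendor outputs (the actual encoding in supplement S 1 may differ):

```python
# Collapse the vendors' three-point probability scales onto the binary
# scale [0, 1]: any probability level above "no" is counted as positive.
AZMED_MAP = {"no": 0, "low": 1, "high": 1}
GLEAMER_MAP = {"no": 0, "doubt": 1, "yes": 1}

def binarize(rating: str, mapping: dict) -> int:
    """Map a three-point vendor rating onto the binary scale [0, 1]."""
    return mapping[rating.strip().lower()]

print(binarize("low", AZMED_MAP))      # -> 1
print(binarize("doubt", GLEAMER_MAP))  # -> 1
print(binarize("no", GLEAMER_MAP))     # -> 0
```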
Sample set
Chest X-ray images were acquired from patients attending a University Hospital in Switzerland. Standing studies each include a postero-anterior (PA) projection as well as a second, lateral (lat) projection. Supine studies were acquired in antero-posterior projection. Images were acquired for routine assessments (e.g. pneumonia follow-ups, port or pacemaker localization), intensive care unit (ICU) imaging, and accidents and emergencies (A&E) referrals (e.g. for chest pain or chest trauma). Starting on March 10, 2024, 300 consecutive studies were included chronologically; after excluding 21 studies due to partially incomplete human reports (n = 20) and mixed imaging positions (n = 1), 279 studies remained for analysis. The age distribution is provided in [Fig. 1] and [Table 2]. For anonymization, no further patient characteristics were reported.
Statistical evaluation methods
For the comparison between raters, we used Python code to calculate the evaluation (supplement S 2, [Table 3], [Table 4]). To analyze reader results, pairwise 2×2 contingency tables (Human-AZmed, Human-Gleamer, AZmed-Gleamer) were generated for each diagnosis. Without an independent ground truth, the interpretation of the study results had to consider agreement with the human clinical report, which itself is an imperfect reference standard. For agreement analysis, sensitivity, specificity, and human prevalence were calculated with Wilson score 95% confidence intervals using the statsmodels proportion_confint function [28]. Balanced accuracy and the Matthews correlation coefficient (MCC) were computed with the corresponding scikit-learn routines [29] [30], each combined with a 5,000-fold bootstrap confidence interval. For bootstrap samples containing only one outcome class, balanced accuracy was set to 0.5 (chance-level performance) and MCC to 0 (no correlation) to ensure numerical stability and to avoid undefined values.
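The per-diagnosis metrics described above could be computed along the following lines, using the cited statsmodels and scikit-learn routines; this is a condensed sketch with illustrative variable names, and supplement S 2 remains the authoritative implementation.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

rng = np.random.default_rng(42)

def sens_spec_with_wilson(y_human, y_ai, alpha=0.05):
    """Sensitivity and specificity of the AI rating against the human report,
    each with a Wilson score 95% confidence interval."""
    y_human, y_ai = np.asarray(y_human), np.asarray(y_ai)
    tp = int(np.sum((y_human == 1) & (y_ai == 1)))
    tn = int(np.sum((y_human == 0) & (y_ai == 0)))
    pos, neg = int(np.sum(y_human == 1)), int(np.sum(y_human == 0))
    sens = tp / pos, proportion_confint(tp, pos, alpha=alpha, method="wilson")
    spec = tn / neg, proportion_confint(tn, neg, alpha=alpha, method="wilson")
    return sens, spec

def bootstrap_bacc_mcc(y_human, y_ai, n_boot=5000):
    """Balanced accuracy and MCC with 5,000-fold bootstrap 95% CIs; degenerate
    resamples containing only one class fall back to 0.5 and 0, respectively."""
    y_human, y_ai = np.asarray(y_human), np.asarray(y_ai)
    bacc, mcc = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_human), len(y_human))
        h, a = y_human[idx], y_ai[idx]
        if np.unique(h).size < 2:        # single-class bootstrap sample
            bacc.append(0.5)
            mcc.append(0.0)
        else:
            bacc.append(balanced_accuracy_score(h, a))
            mcc.append(matthews_corrcoef(h, a))
    point = balanced_accuracy_score(y_human, y_ai), matthews_corrcoef(y_human, y_ai)
    ci = np.percentile(bacc, [2.5, 97.5]), np.percentile(mcc, [2.5, 97.5])
    return point, ci
```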
Kappa statistic | Strength of agreement
< 0.00 | Poor
0.00–0.20 | Slight
0.21–0.40 | Fair
0.41–0.60 | Moderate
0.61–0.80 | Substantial
0.81–1.00 | Almost Perfect
1.00 | Perfect
The observed agreement per diagnosis was calculated as the ratio of matching ratings on the main diagonal of the 2×2 contingency table relative to all cases. Wilson score 95% confidence intervals were computed using the statsmodels proportion_confint function [28]. Interrater reliability (IRR) was tested by application of the interrater reliability Chance-corrected Agreement Coefficients (irrCAC) package [32]. irrCAC allows the extraction of IRR variables such as Gwet’s AC1, which is known to be advantageous for imbalanced data sets [33]. In addition, irrCAC can provide the corresponding p-value, allowing researchers to test the Null Hypothesis (H0) that agreement does not exceed what would be expected purely by chance. Gwet’s AC1 is defined for the interval [–1, 1]. Landis and Koch have defined a table for its interpretation, see [Table 3] [31].
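For a binary two-rater comparison, Gwet’s AC1 can also be computed directly from the 2×2 contingency table; the following stand-alone sketch mirrors what the irrCAC package reports (without its p-value machinery) and is provided for illustration only.

```python
import numpy as np

def gwet_ac1(table):
    """Gwet's AC1 for two raters and two categories, computed from a 2x2
    contingency table [[both_negative, r1_neg_r2_pos],
                       [r1_pos_r2_neg, both_positive]]."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    pa = np.trace(t) / n                              # observed agreement
    # prevalence of the positive category, averaged over both raters
    pi_pos = (t[1, :].sum() + t[:, 1].sum()) / (2.0 * n)
    # chance agreement under Gwet's model for two categories
    pe = 2.0 * pi_pos * (1.0 - pi_pos)
    return (pa - pe) / (1.0 - pe)

# Example with hypothetical counts (not study data)
print(round(gwet_ac1([[180, 20], [15, 64]]), 2))
```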
Systematic disagreement between paired raters was tested using McNemar’s exact test [34]. Since statistical significance testing can overemphasize clinical relevance, as p-values do not reflect the magnitude of an effect [35] [36] [37], we added measures of clinical effect size together with their 95% confidence intervals. Specifically, we calculated McNemar odds ratios (ORs), conditional ORs, risk differences (RD), likelihood ratios for positive (LR+) and negative (LR−) test results, and the number needed to diagnose (NND). Paired agreement between raters was assessed using McNemar odds ratios, with confidence intervals calculated from the standard error of the log odds ratio. Marginal odds ratios were also computed, applying a continuity correction of 0.5 to all cells when any cell contained zero counts, and confidence intervals were derived on the logarithmic scale to quantify the association between raters’ classifications. The RD and its confidence interval were calculated from the discordant pairs of the contingency table using the standard error formula for paired proportions. LR+ and LR- with confidence intervals were calculated from the 2×2 contingency table, applying log-transformation for interval estimation. NND, defined as the reciprocal of the proportion of diagnostic disagreements, was calculated with its confidence interval derived from the Wilson score interval for the disagreement proportion, i.e. patients per misdiagnosis [28].
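A compact sketch of these paired-disagreement statistics, assuming a 2×2 table with rows as the human rating and columns as the AI rating; the exact continuity corrections and interval formulas of supplement S 2 are simplified here.

```python
from statsmodels.stats.contingency_tables import mcnemar

def paired_effect_sizes(table):
    """McNemar exact p-value and simple effect sizes from a paired 2x2 table
    [[a, b], [c, d]]: a = both positive, b = human+/AI-, c = human-/AI+,
    d = both negative."""
    (a, b), (c, d) = table
    n = a + b + c + d
    p_value = mcnemar([[a, b], [c, d]], exact=True).pvalue
    # McNemar odds ratio from discordant pairs, 0.5 correction if a cell is zero
    bb, cc = (b + 0.5, c + 0.5) if 0 in (b, c) else (b, c)
    mcnemar_or = bb / cc
    rd = (c - b) / n            # AI positive-call rate minus human positive-call rate
    sens = a / (a + b)          # AI sensitivity against the human reference
    spec = d / (c + d)          # AI specificity against the human reference
    lr_pos = sens / (1 - spec)  # likelihood ratio for a positive AI result
    lr_neg = (1 - sens) / spec  # likelihood ratio for a negative AI result
    nnd = n / (b + c)           # patients per disagreement (NND)
    return dict(p=p_value, OR=mcnemar_or, RD=rd, LRp=lr_pos, LRn=lr_neg, NND=nnd)

# Example with hypothetical counts (not study data)
print(paired_effect_sizes([[90, 12], [5, 172]]))
```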
P-values were reported exactly to two decimal places or, if very small, as thresholds (<alpha, <0.01, <0.001, <0.0001, <0.00001) to indicate increasing levels of statistical significance. Alpha was set to 0.05 for the present study. The LLMs GPT-5 [38], Claude Opus 4 [39], and DeepSeek-V3.1 [40] were used for Python code debugging and manuscript writing.
Results
In the following section, we present the sample set itself and the results obtained by pairs of raters, including both Human-AI and AI-AI comparisons.
Sample set
Each of the three raters assessed all of the 279 patients included in the present study, which allowed for paired testing. Of the 279 studies included, 204 were acquired in standing position, 63 in supine position, and 12 in sitting position. The median patient age was 66 years (interquartile range: 51–76 years, [Table 2]). The age distribution showed a peak in the 7th decade of life ([Fig. 2]), which is typical of the patient population at a university hospital.


Human-AI paired ratings, assessed overall
The prevalence of pathological findings as determined by human readers varied considerably. Common findings (>20%) included cardiomegaly (40%), pleural effusion (40%), and consolidation (35%). Pulmonary oedema (22%) occurred with intermediate frequency, while pneumothorax (5%), mediastinal mass (3%), and pulmonary nodule (3%) were rare.
Sensitivity varied across diagnoses and AI systems. In the Human-AZmed comparison, pleural effusion reached the highest sensitivity (0.85), followed by cardiomegaly (0.77) and consolidation (0.76). Pulmonary oedema showed lower sensitivity (0.45). In the Human-Gleamer comparison, sensitivity was 0.73 for pleural effusion, 0.71 for pulmonary nodule, 0.63 for consolidation, 0.57 for pneumothorax, and 0.50 for mediastinal mass. Specificity was generally higher than sensitivity. Human-AZmed specificity ranged from 0.70 for consolidation to 0.88 for pulmonary oedema. Human-Gleamer specificity was consistently high, with values from 0.83 for consolidation up to 0.98 for pneumothorax; mediastinal mass and pulmonary nodule both reached 0.91. Balanced accuracy reflected these findings. Human-AZmed values were 0.85 for pleural effusion, 0.76 for cardiomegaly, 0.73 for consolidation, and 0.67 for pulmonary oedema. Human-Gleamer values were 0.85 for pleural effusion, 0.81 for pulmonary nodule, 0.78 for pneumothorax, 0.73 for consolidation, and 0.71 for mediastinal mass.
Agreement measures showed pleural effusion as the strongest category, with MCC up to 0.70 (Human-AZmed) and 0.73 (Human-Gleamer). In contrast, rare pathologies such as mediastinal mass demonstrated substantially lower agreement. Overall agreement ranged from 0.72 to 0.86 for Human-AZmed, and 0.76 to 0.96 for Human-Gleamer. Gwet’s AC1 coefficients indicated moderate to substantial agreement for Human-AZmed (0.47–0.72) and moderate to almost perfect agreement for Human-Gleamer (0.56–0.96); all Gwet’s AC1 measures were statistically significant, i.e. observed agreement exceeded what would have been expected purely by chance.
McNemar’s test indicated significant differences for consolidation in the Human-AZmed comparison (p < 0.001); for mediastinal mass (p < 0.001), pulmonary nodule (p < 0.00001), and pleural effusion (p < 0.0001) in the Human-Gleamer comparison. No significant differences were observed for cardiomegaly, pleural effusion (Human-AZmed), pulmonary oedema, consolidation (Human-Gleamer), or pneumothorax.
The McNemar OR took values both greater than and less than 1 for Human-AZmed and Human-Gleamer, depending on the diagnosis. The strongest likelihood for humans to miss or undercall a diagnosis was seen for mediastinal masses and nodules compared with Gleamer (McNemar OR 0.17 and 0.08). Marginal odds ratios demonstrated very strong positive associations, particularly for pleural effusion and pneumothorax (both Human-Gleamer), i.e. having a positive human label was positively associated with being predicted positive by the AI systems.
Compared with humans, AZmed tended to assign fewer positive labels for cardiomegaly (RD -0.06, 95% CI -0.11 to 0.00) and consolidation (RD -0.11, 95% CI -0.17 to -0.05), while the differences were minimal for pleural effusion (RD -0.02, 95% CI -0.07 to 0.02) and pulmonary oedema (RD 0.03, 95% CI -0.03 to 0.08). Analysis of RDs showed that Gleamer had a higher positive call rate for pleural effusion (RD 0.09, 95% CI 0.05 to 0.13), whereas humans more frequently classified mediastinal masses (RD -0.07, 95% CI -0.11 to -0.04) and nodules (RD -0.08, 95% CI -0.12 to -0.05) as negative. Likelihood ratios further supported these findings; positive Gleamer results were strongly corroborated, particularly for pleural effusion (LR+ 18.25, LR− 0.28) and pneumothorax (LR+ 28.50, LR− 0.44), with other variables showing moderate support (LR+ 3.7–7.9, LR− 0.32–0.55). AZmed demonstrated moderate support for positive findings across all variables (LR+ 2.5–6.1, LR− 0.17–0.62), most notably for pleural effusion (LR+ 6.07, LR− 0.17). Overall, these results indicate that both AI systems reliably identify key pathologies, with Gleamer showing particularly strong performance for pleural effusion and pneumothorax.
The NND was lowest for common pathologies, including cardiomegaly (NND 5, 95% CI 4–6) and consolidation (AZmed 4, 95% CI 4–5; Gleamer 5, 95% CI 4–6); NND was higher for less frequent findings such as mediastinal masses (10, 95% CI 8–15) and pleural effusion (AZmed 7, 95% CI 6–10; Gleamer 8, 95% CI 6–11). Pneumothorax had the highest NND (28, 95% CI 16–52), reflecting the rarity of this finding despite a generally high level of agreement between AI and human readers.
Human-AI paired ratings, split by positions
In standing radiographs (n=204), human prevalence was lower than overall for cardiomegaly (26% vs 40%), consolidation (30% vs 35%), and pleural effusion (34% vs 40%). Compared with overall data, AZmed showed higher sensitivity for cardiomegaly (0.91 vs 0.77) and pleural effusion (0.91 vs 0.85), lower sensitivity for pulmonary oedema (0.32 vs 0.45), and balanced accuracy reflected these trends. Likelihood ratios supported strong diagnostic value for pleural effusion (Human-AZmed, LR+ 7.00, LR− 0.10). Gleamer showed lower sensitivity for pleural effusion (0.77) but higher specificity (0.97) compared to AZmed, with strong likelihood ratios (LR+ 25.67, LR− 0.24). NND was higher for Gleamer (5–26 vs 4–9 AZmed). Overall, standing radiographs highlighted increased AZmed sensitivity for common findings and higher Gleamer specificity for rarer pathologies.
In supine radiographs (n=63), human prevalence was higher than overall for cardiomegaly (81% vs 40%) and moderate for consolidation (49% vs 35%) and pleural effusion (56% vs 40%). AZmed sensitivity was higher for consolidation (0.94 vs 0.76) but lower for cardiomegaly (0.59 vs 0.77), while pleural effusion and pulmonary oedema were moderate (0.71 and 0.62). Gleamer showed moderate sensitivity for consolidation (0.71) and pleural effusion (0.63) with high specificity (0.62–1.00). Balanced accuracy was slightly lower than overall, and McNemar ORs did not exhibit a clear trend compared to overall data. NND ranged from 3–32, reflecting limited incremental diagnostic gain due to higher baseline prevalence.
In sitting radiographs (n=12), human prevalence was intermediate for cardiomegaly and pleural effusion (58%) and lower for consolidation (42%) and pulmonary oedema (8%). AZmed sensitivity was high for cardiomegaly and pulmonary oedema (both 1.00) and pleural effusion (0.86), moderate for consolidation (0.80). Gleamer showed, compared to the overall data, high sensitivity for consolidation (0.80) and pleural effusion (0.86) and high specificity (0.75–1.00).
AI-AI comparison
Across all positions, AZmed and Gleamer showed substantial agreement for consolidation and pleural effusion (balanced accuracy 0.81–0.84, MCC 0.65–0.73). For standing scans (n=204), agreement was strong for consolidation (0.85) and pleural effusion (0.87), with McNemar ORs indicating more AZmed-positive than Gleamer-positive detections (consolidation 9.0, pleural effusion 25.0). For scans in supine position (n=63), agreement remained high for consolidation (0.73) and pleural effusion (0.81), with conditional ORs suggesting raters’ classifications to be positively associated (11.1–23.3). For sitting patients (n=12), both systems showed high to perfect agreement for consolidation and pleural effusion (0.75–1.00). Overall, AZmed and Gleamer exhibited strong agreement with regard to consolidation and pleural effusion across positions.
Discussion
In this study, we performed a direct, real-world clinical comparison of two commercially available AI-based chest X-ray tools – Rayvolve (AZmed) and ChestView (Gleamer) – against human radiologists using a cohort of 279 studies and seven key thoracic pathologies. Our findings provide insights into the performance, limitations, and potential clinical integration of AI systems in radiology and are contextualized by the growing body of evidence from similar commercial systems. This evaluation reflects the intended real-world deployment of such AI tools as decision-support systems running alongside routine radiological practice. While non-blinded, our study adds relevant insights about the performance of AI systems, similar to previously non-blinded published work in three studies [17] [24] [25].
Interpretation of study findings
Both AI systems demonstrated strong overall concordance when compared with human radiologists, for example, with strong balanced accuracy, agreement, and IRR (Gwet’s AC1 up to 0.96, pneumothorax). These results are in line with prior work showing that modern AI algorithms can achieve clinical agreement comparable to trained radiologists for common chest X-ray findings [41] [42]. The high level of agreement for common pathologies like pleural effusion and consolidation supports the potential utility of AI for triage and second-read scenarios, reinforcing its role as a complementary tool. Our findings are strongly supported by previous studies on these specific systems. Bennani et al. (2023) demonstrated that the Gleamer AI (ChestView) improved radiologists’ sensitivity across all expertise levels, with absolute increases of up to 26% for pneumothorax and 14% for consolidation, while also reducing reading times by 31% [24]. In a separate follow-up study by Selvam et al. (2025) on emergency chest X-ray, the Gleamer AI system improved sensitivity for consolidation, pleural effusion, and nodules [25]. Similarly, a multi-reader, multi-case study by Bettinger et al. (2024) on the AZmed system (Rayvolve) reported that AI assistance led to a significant 16% increase in the area under the curve (AUC), an 11% boost in sensitivity, and a 36% reduction in interpretation time [17]. The current study expands on this evidence by providing a direct head-to-head comparison of both commercially implemented AI systems operating in parallel within a routine clinical workflow.
Across most pathologies and in the absence of an independent ground truth, both AI systems in our study exhibited higher specificity than sensitivity when compared with human readers. This conservative detection pattern, which prioritizes reducing false positives, is clinically valuable in high-pressure settings such as emergency departments or ICUs, where overcalling may lead to unnecessary and invasive follow-up studies [43].
A key finding of our study was that patient positioning substantially influenced AI-Human agreement. Supine radiographs tended to show the greatest discrepancies. This aligns with prior literature indicating that supine imaging – common in ICU or trauma settings – poses inherent challenges due to altered anatomical projections, magnification, and overlapping structures [14] [44]. These findings highlight the critical importance of taking patient positioning into account when implementing AI in clinical workflows, and they suggest a potential value for position-specific algorithm training to improve generalizability.
The overall data set contained a strong imbalance towards negative assessments, which is a recurring attribute in medical imaging [33]. This imbalance was particularly pronounced for rare findings such as mediastinal mass (human prevalence 3%), nodule (3%), and pneumothorax (5%). Conversely, higher positive rates were observed for cardiomegaly, consolidation, and pleural effusion, especially in supine studies. This is likely because supine positioning naturally increases the width of the heart’s silhouette, making the distinction between physiological and pathological enlargement more challenging and subjective.
Clinical implications
Overall, our findings are consistent with the broader literature [17] [24] [25] and they suggest that those commercially available AI systems can reliably support radiologists in routine chest X-ray interpretation. The proven benefits in clinical agreement and efficiency gains support their use for common pathologies and standard projections. However, our results also clearly show that AI performance is not infallible; it varies with patient positioning and pathology prevalence. This highlights areas where human oversight remains essential, particularly for complex cases, rare findings, and non-standard projections [14].
Limitations
Several limitations of this study should be acknowledged. First, the data set was limited to a single university hospital, which may reduce generalizability to other institutions or patient populations. Second, the number of supine and in particular sitting radiographs was relatively small, limiting the statistical power to assess AI-Human agreement in these positions robustly. Third, rare pathologies were underrepresented, which is a common challenge in AI imaging studies but can nevertheless lead to less reliable performance metrics for these conditions. Fourth, both AI systems were evaluated in a “black box” manner without insight into their specific decision-making processes, which can limit interpretability and clinical trust. Fifth, gender distribution data were not captured due to institutional privacy protocols, limiting demographic characterization of our cohort.
Finally, while radiologists maintained full legal responsibility, their real-time access to AI outputs via PACS could have subtly influenced their reporting behavior, potentially introducing bias into the comparison. No ground truth was available, for example, through additional imaging or biopsy. Instead, human labels served as reference for the AI evaluation. This might, as in previous studies [17] [24] [25], introduce mutual reinforcement; however, concordance or agreement must not be misinterpreted as accuracy.
Future studies should aim for multicenter designs, larger data sets enriched for rare conditions, and prospective assessment with AI feedback blinded to the human readers, in order to evaluate true performance and integration more rigorously.
Conclusions
This study demonstrates that commercially available AI chest X-ray tools achieve a high level of agreement with human radiologists in real-world practice, especially for common pathologies and standing-position radiographs. Specificity tends to exceed sensitivity, emphasizing conservative detection strategies. Patient positioning and low-prevalence pathologies remain key challenges, underscoring the importance of careful implementation and continued human oversight in clinical workflows.
Previous work has already revealed that either of the two AI systems can be a valuable support for clinical radiologists, increasing performance and speed [17] [24] [25]. Our findings reinforce this conclusion. However, the observed discrepancies with human diagnoses highlight the continued need for human oversight in clinical decision-making. In this context, running multiple AI systems in parallel and considering their consensus could further enhance diagnostic reliability. A more in-depth understanding of raters’ diagnostic processes could also be attained by integrating not only diagnoses but also model reasoning into the statistical analyses, for example, by using class activation maps. Two raters might give the identical diagnosis for two different lesions, or two raters might provide a different diagnosis when labeling the identical lesion.
Once these AI systems begin to be used widely in clinical routine, they will start to fulfill the prediction made at the Dartmouth AI workshop [45]. However, a dramatic deskilling of radiologists due to AI applications [46] seems to be a concern for the distant future rather than the immediate one.
Supplements
S 1: collected raw data (n = 279) with diagnoses per rater (.xlsx, as well as .csv)
S 2: statistical evaluation code (.py)
The study's raw data and Python code are deposited permanently on figshare under DOI: 10.6084/m9.figshare.28692659.
Ethics declaration
The study received an ethics waiver (Req-2025-00216) from the Cantonal Ethics Committee Bern (Kantonale Ethikkommission für die Forschung, Gesundheits-, Sozial- und Integrationsdirektion), dated 17 February 2025. All experiments were conducted in accordance with relevant guidelines and regulations, including the Declaration of Helsinki. Informed consent was obtained from all participants and/or their legal guardians prior to study participation.
Conflict of Interest
The authors declare that they have no conflict of interest.
References
- 1 Tang A, Tam R, Cadrin-Chênevert A. et al. Canadian Association of Radiologists White Paper on Artificial Intelligence in Radiology. Can Assoc Radiol J 2018; 69: 120-35
- 2 Bosbach WA, Senge JF, Nemeth B. et al. Ability of ChatGPT to generate competent radiology reports for distal radius fracture by use of RSNA template items and integrated AO classifier. Curr Probl Diagn Radiol 2023; 53: 102-10
- 3 Senge JF, Mc Murray MT, Haupt F. et al. ChatGPT may free time needed by the interventional radiologist for administration/documentation: A study on the RSNA PICC line reporting template. Swiss J Radiol Nucl Med 2024; 7: 1-14
- 4 Bosbach WA, Clement C, Strunz F. et al. Automation of 99mTc Mercaptoacetyltriglycine (MAG3) Report Writing Using a Vision Language Model. EJNMMI Res 2025; 15: 142
- 5 Barat M, Soyer P, Dohan A. Appropriateness of Recommendations Provided by ChatGPT to Interventional Radiologists. Can Assoc Radiol J 2023; 1-6
- 6 Ramedani S, Ramedani M, Tengg-Kobligk Von H. et al. A Deep Learning-based Fully Automated Approach for Body Composition Analysis in 3D Whole Body Dixon MRI. 2023 IEEE 19th Int Conf Intell Comput Commun Process 2023; 287-292
- 7 Hammernik K, Klatzer T, Kobler E. et al. Learning a variational network for reconstruction of accelerated MRI data. Magn Reson Med 2018; 79: 3055-71
- 8 Bosbach WA, Schoeni L, Beisbart C. et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 in Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025;
- 9 Bosbach WA, Schoeni L, Beisbart C. et al. Open access supplement to the manuscript: Bosbach WA, Schoeni L, Beisbart C et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 in Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025; accepted for publicat. Figshare 2025;
- 10 Garni SN, Mertineit N, Nöldge G. et al. Regulatory Needs for Radiation Protection Devices based upon Artificial Intelligence – State task or leave unregulated?. Swiss J Radiol Nucl Med 2024; 5: 5
- 11 Bosbach WA, Merdes KC, Jung B. et al. Deep learning reconstruction of accelerated MRI: False positive cartilage delamination inserted in MRI arthrography under traction. Top Magn Reson Imaging 2024; 33: e0313
- 12 Bosbach WA, Merdes KC, Jung B. et al. Open access supplement to the publication: Bosbach WA et al. (2024). Deep learning reconstruction of accelerated MRI: False positive cartilage delamination inserted in MRI arthrography under traction. Topics in MRI. Figshare 2024;
- 13 Granstedt J, Kc P, Deshpande R. et al. Hallucinations in medical devices. ArXiv 2025;
- 14 De Lacey G, Morley S, Berman L. The Chest X-Ray – A Survival Guide. Cambridge (UK): 2008
- 15 Babar Z, van Laarhoven T, Zanzotto FM. et al. Evaluating diagnostic content of AI-generated radiology reports of chest X-rays. Artif Intell Med 2021; 116: 102075
- 16 Yu F, Endo M, Krishnan R. et al. Evaluating Progress in Automatic Chest X-Ray Radiology Report Generation. MedRxiv 2022;
- 17 Bettinger H, Lenczner G, Guigui J. et al. Evaluation of the Performance of an Artificial Intelligence (AI) Algorithm in Detecting Thoracic Pathologies on Chest Radiographs. Diagnostics 2024; 14
- 18 Gasmi I, Calinghen A, Parienti JJ. et al. Comparison of diagnostic performance of a deep learning algorithm, emergency physicians, junior radiologists and senior radiologists in the detection of appendicular fractures in children. Pediatr Radiol 2023; 53: 1675-84
- 19 Dupuis M, Delbos L, Veil R. et al. External validation of a commercially available deep learning algorithm for fracture detection in children. Diagn Interv Imaging 2022; 103: 151-159
- 20 Fu T, Viswanathan V, Attia A. et al. Assessing the Potential of a Deep Learning Tool to Improve Fracture Detection by Radiologists and Emergency Physicians on Extremity Radiographs. Acad Radiol 2023; 1-11
- 21 Lin TY, Goyal P, Girshick R. et al. Focal Loss for Dense Object Detection. IEEE Trans Pattern Anal Mach Intell 2020; 42: 318-27
- 22 Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 3rd Int Conf Learn Represent ICLR 2015 – Conf Track Proc 2015: 1-14
- 23 Regnard NE, Lanseur B, Ventre J. et al. Assessment of performances of a deep learning algorithm for the detection of limbs and pelvic fractures, dislocations, focal bone lesions, and elbow effusions on trauma X-rays. Eur J Radiol 2022; 154
- 24 Bennani S, Regnard NE, Ventre J. et al. Using AI to Improve Radiologist Performance in Detection of Abnormalities on Chest Radiographs. Radiology 2023; 309
- 25 Selvam S, Peyrony O, Elezi A. et al. Efficacy of a deep learning-based software for chest X-ray analysis in an emergency department. Diagn Interv Imaging 2025; 106: 299-311
- 26 Wu Y, Kirillov A, Massa F. et al. Detectron2 2019. Accessed September 14, 2025 at: https://github.com/facebookresearch/detectron2
- 27 Panicek DM, Hricak H. How sure are you, doctor? A standardized lexicon to describe the radiologists level of certainty. Am J Roentgenol 2016; 207: 2-3
- 28 statsmodels.stats.proportion.proportion_confint. Statsmodels 0150 (+710) 2025. Accessed September 08, 2025 at: https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportion_confint.html
- 29 balanced_accuracy_score. Scikit-Learn 172 Doc 2025. Accessed September 10, 2025 at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html
- 30 matthews_corrcoef. Scikit-Learn 172 Doc 2025. Accessed September 10, 2025 at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html
- 31 Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977; 33: 159-74
- 32 Gwet K, Fergadis A. irrCAC – Chance-corrected Agreement Coefficients 2023. Accessed September 3, 2025 at: https://irrcac.readthedocs.io/en/latest/usage/usage_raw_data.html
- 33 Wongpakaran N, Wongpakaran T, Wedding D. et al. A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples Nahathai. BMC Med Res Methodol 2013; 13: 1-7
- 34 statsmodels.stats.contingency_tables.mcnemar. Statsmodels 0150 (+638) 2025. Accessed March 30, 2025 at: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html
- 35 Bartolucci AA, Tendera M, Howard G. Meta-analysis of multiple primary prevention trials of cardiovascular events using aspirin. Am J Cardiol 2011; 107: 1796-801
- 36 Cohen J. The earth is round (p<.05). Am Psychol 1994; 49: 997-1003
- 37 Sullivan GM, Feinn R. Using Effect Size—or Why the P Value Is Not Enough. J Grad Med Educ 2012; 4: 279-82
- 38 OpenAI Inc. GPT-5 2025. Accessed August 30, 2025 at: https://chatgpt.com/overview
- 39 Anthropic PBC. Claude Sonnet 4 [Large language model] 2025. Accessed August 05, 2025 at: https://www.anthropic.com
- 40 Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co. L. DeepSeek V3.1 2025. Accessed September 10, 2025 at: https://www.deepseek.com
- 41 Rajpurkar P, Irvin J, Zhu K. et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. ArXiv 2017;
- 42 Irvin J, Rajpurkar P, Ko M. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. 33rd AAAI Conf Artif Intell 2019: 590-597
- 43 Feng Y, Teh HS, Cai Y. Deep Learning for Chest Radiology: A Review. Curr Radiol Rep 2019; 7: 1-9
- 44 Gefter WB, Post BA, Hatabu H. Commonly Missed Findings on Chest Radiographs: Causes and Consequences. Chest 2023; 163: 650-61
- 45 McCarthy J, Minsky ML, Rochester N. et al. A Proposal For The Dartmouth Summer Research Project On Artificial Intelligence 1955: 1–13. Accessed October 30, 2021 at: http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf
- 46 Duran LDD. Deskilling of medical professionals: An unintended consequence of AI implementation?. G Di Filos 2021; 2
Correspondence
Publication History
Received: 09 April 2025
Accepted after revision: 11 December 2025
Article published online: 20 January 2026
© 2026. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany