Keywords: AI, medical, education
Introduction
The field of medical education is continually evolving, with advancements in technology reshaping the way medical students are trained and assessed. One such technological innovation that has garnered significant attention in recent years is the integration of large language models (LLMs) [1]. A significant advantage of LLMs such as ChatGPT is their ability to provide explanations for solutions, thereby making it easier for students to understand exam architecture. Learning content can be tailored to the userʼs knowledge level, and the chat function allows interactive learning.
LLMs such as ChatGPT are based on deep neural networks and have been trained on vast datasets. These models have profound text analysis and generation capabilities, making them exceptionally promising tools for both medical practice and education [2] [3].
As an indispensable component of medical practice, radiology necessitates a profound comprehension of intricate imaging studies and their clinical implications. Medical students, on their journey toward becoming proficient healthcare professionals, undergo rigorous training and examinations to help them gain the requisite skills and knowledge. Although the field of artificial intelligence in diagnostic radiology has primarily centered on image analysis, there has been growing enthusiasm surrounding the potential applications of LLMs, including ChatGPT, within radiology [4] [5] [6]. These applications encompass a wide spectrum, including radiology education, assistance in differential diagnoses, computer-aided diagnosis, and disease classification [5] [6] [7]. If these LLMs can demonstrate accuracy and reliability, they have the potential to serve as invaluable resources for learners, enabling rapid responses to inquiries and simplification of intricate concepts. ChatGPT has already been investigated with respect to streamlining radiology reports and facilitating clinical decision-making [8] [9]. Furthermore, LLMs have performed commendably in a diverse array of professional examinations, even without specialized domain pretraining [10]. In medicine, they have shown convincing results on medical examinations [11] [12] [13].
The aim of this study was to explore and evaluate the performance of LLMs in radiology
examinations for medical students in order to provide insight into the present capabilities
and implications of LLMs.
Methods
This exploratory prospective study was carried out from August to October 2023. We
obtained informed consent from the head of the institute to utilize the instituteʼs
own radiology examination questions for medical students.
Multiple-Choice Question Selection and Classification
A total of 200 multiple-choice questions, each featuring four incorrect answers and one correct answer, were identified using the database of our radiology institute. These questions were originally designed for use in the radiology examination for medical students at our hospital. The exclusion criteria comprised questions containing images (n = 40) and questions with multiple correct answers (n = 9). After this selection process, 151 questions remained. The questions were then either prompted through OpenAIʼs API for GPT-3.5 and GPT-4 or manually pasted into the user interface (UI) of Perplexity AI (GPT-3.5 + Bing). To avoid the influence of previous responses on the modelʼs output, a new ChatGPT session was initiated for each query. All questions were asked in three separate ChatGPT sessions, and the average performance was calculated.
A simple prompt for the question was used in the following form:
Question: {question text} A: {answer A} B: {answer B} C: {answer C} D: {answer D}
E: {answer E}
For the initial prompt we used:
You are an expert radiologist. Answer the following multiple-choice question in the
form: <Single letter (answer)> <Text explaining the reason>
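For illustration, the following is a minimal sketch of how such a query could be issued through the OpenAI API. It assumes the openai Python package (v1 client) and an API key in the environment; the model name, helper function, and question dictionary are illustrative and not the exact script used in this study.

```python
# Minimal sketch of querying the OpenAI API with the prompts described above.
# Assumptions (not the study's exact script): the `openai` Python package
# (v1 client), an API key in the OPENAI_API_KEY environment variable, and an
# illustrative question dictionary; model names are examples only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an expert radiologist. Answer the following multiple-choice question "
    "in the form: <Single letter (answer)> <Text explaining the reason>"
)

def ask_question(question: dict, model: str = "gpt-4") -> str:
    """Send one multiple-choice question as a fresh, stateless request."""
    user_prompt = (
        f"Question: {question['text']} "
        f"A: {question['A']} B: {question['B']} C: {question['C']} "
        f"D: {question['D']} E: {question['E']}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

# Each question would be asked in three independent calls (no shared chat history),
# mirroring the three separate sessions over which performance was averaged, e.g.:
# answers = [ask_question(question) for _ in range(3)]
```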
The outputs were restructured and combined for statistical analysis. A passing score was considered to be 60% or above. Additionally, the questions were categorized by type as either lower-order or higher-order thinking questions, along with their subject matter, as detailed in [Table 1]. Lower-order thinking encompasses tasks related to remembering and basic understanding, while higher-order thinking involves the application, analysis, and evaluation of concepts. Both the higher-order and the lower-order thinking questions were further subclassified by type (description of imaging findings, clinical management, comprehension, knowledge). Each question underwent independent classification by two radiologists. A flowchart of the study design is displayed in [Fig. 1] [14].
Table 1 Performance of LLMs and medical students stratified by question type and topic (N = number of correctly answered questions; % = accuracy; student values are averages across the cohort).

| Category | Total | Students N | Students % | GPT-3.5 N | GPT-3.5 % | GPT-3.5 + Bing N | GPT-3.5 + Bing % | GPT-4 N | GPT-4 % |
|---|---|---|---|---|---|---|---|---|---|
| All questions | 151 | 115.3 | 76.3 | 103 | 68.2 | 108 | 71.5 | 134 | 88.7 |
| Bone | 26 | 19.4 | 74.4 | 17 | 65.4 | 18 | 69.2 | 23 | 88.5 |
| Breast | 5 | 3.7 | 73.2 | 2 | 40 | 2 | 40 | 4 | 80 |
| Cardiovascular | 13 | 9.9 | 76.9 | 11 | 84.6 | 10 | 76.9 | 12 | 92.3 |
| Chest | 19 | 13.6 | 71.7 | 14 | 73.7 | 14 | 73.7 | 17 | 89.5 |
| Gastrointestinal | 16 | 12.8 | 80 | 11 | 68.8 | 12 | 75 | 15 | 93.8 |
| Genitourinary | 6 | 4.2 | 70 | 4 | 66.7 | 5 | 83.3 | 5 | 83.3 |
| Head and neck | 39 | 31.1 | 79.8 | 28 | 71.8 | 29 | 74.4 | 34 | 87.2 |
| Physics | 11 | 8.5 | 77.4 | 7 | 63.6 | 7 | 63.6 | 10 | 90.9 |
| Systemic | 16 | 12 | 75.1 | 9 | 56.3 | 11 | 68.8 | 14 | 87.5 |
| Clinical management | 37 | 29.1 | 78.7 | 26 | 70.3 | 27 | 72.9 | 33 | 89.2 |
| Description of imaging findings | 27 | 20.2 | 74.8 | 16 | 59.3 | 21 | 77.8 | 23 | 85.2 |
| Diagnosis | 23 | 17.2 | 74.7 | 16 | 69.6 | 14 | 60.8 | 22 | 95.7 |
| Comprehension | 28 | 21.3 | 76.3 | 21 | 75 | 23 | 82.1 | 25 | 89.3 |
| Knowledge | 36 | 27.5 | 76.3 | 24 | 66.7 | 23 | 63.9 | 31 | 86.1 |
| Higher-order | 87 | 66.5 | 76.4 | 58 | 66.7 | 62 | 71.3 | 78 | 89.7 |
| Lower-order | 64 | 48.8 | 76.3 | 45 | 70.3 | 46 | 71.9 | 56 | 87.5 |
Fig. 1 Flowchart of the study design. From our initial 200 exam questions, 151 remained after excluding questions with images and questions with more than one correct answer. The questions were then prompted either through OpenAIʼs API for GPT-3.5 and GPT-4 or manually pasted into the UI of Perplexity AI (GPT-3.5 + Bing). The outputs were restructured and combined for statistical analysis. Abbreviations: MC: multiple choice; API: application programming interface; UI: user interface.
Large language models (LLMs)
ChatGPT (ChatGPT August 3, 2023 version, OpenAI) and Perplexity AI were used in this
study. There are two versions of ChatGPT: ChatGPT, which is based on GPT-3.5, and
ChatGPT Plus, which utilizes the more advanced GPT-4. In this study we used the two
underlying LLMs directly via the OpenAI API. No specialized radiology-specific pretraining
was conducted for either of these models. It is important to highlight that GPT-3.5 and GPT-4, being self-contained LLMs, lack the capability to access the internet or external databases for information retrieval. In contrast, Perplexity AI (GPT-3.5 + Bing) has the capacity to search the internet.
Medical students
The study included a cohort of 621 medical students who were in their first clinical
semester, typically corresponding to their third year in medical school.
Prior to entering the clinical phase, the students completed two years of preclinical
education, which included foundational courses in anatomy, physiology, biochemistry,
pathology, and basic medical sciences. At the time of the study, the students had
completed an introductory course in radiology. However, their exposure to advanced
radiological topics was limited compared to more senior students and residents.
Statistical analysis
Statistical analysis was performed using Python (version 3.11). The McNemar test was used to determine the statistical significance of differences in the performance of the LLMs. This was also done for subgroups by question type and topic. For overall model performance, we utilized the widely used accuracy score.
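As a minimal sketch of this paired comparison, assuming numpy and statsmodels are available and using placeholder data rather than the study results, the McNemar test can be computed from a 2×2 table of concordant and discordant answers:

```python
# Sketch of the paired McNemar comparison between two models' per-question results.
# Assumptions: numpy and statsmodels are installed; the boolean arrays below are
# placeholder data, not the study's actual per-question results.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

correct_gpt35 = np.array([True, False, True, True, False, True])  # placeholder
correct_gpt4 = np.array([True, True, True, True, False, True])    # placeholder

# Paired 2x2 table: rows = GPT-3.5 correct/incorrect, columns = GPT-4 correct/incorrect
table = np.array([
    [np.sum(correct_gpt35 & correct_gpt4), np.sum(correct_gpt35 & ~correct_gpt4)],
    [np.sum(~correct_gpt35 & correct_gpt4), np.sum(~correct_gpt35 & ~correct_gpt4)],
])

result = mcnemar(table, exact=True)      # exact binomial test on the discordant pairs
accuracy_gpt4 = correct_gpt4.mean()      # accuracy score used for overall performance
print(f"McNemar p-value: {result.pvalue:.3f}, accuracy: {accuracy_gpt4:.1%}")
```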
To quantify the comparative performance of the LLMs and the medical students, we performed
an odds ratio analysis. For each comparison, we set up 2×2 contingency tables that
summarize the number of correct and incorrect answers for the two groups being compared.
Thereafter, we calculated p-values using Fisherʼs exact test. A p-value of less than 0.05 was considered statistically significant. No correction for guessing was performed, since the passing score of our exam already accounts for guessing.
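Correspondingly, a hedged sketch of one odds-ratio comparison via Fisherʼs exact test, assuming scipy is available and using placeholder counts rather than the actual study data:

```python
# Sketch of one odds-ratio comparison via Fisher's exact test.
# Assumptions: scipy is installed; the counts below are placeholders, not study data.
from scipy.stats import fisher_exact

n_questions = 151                            # total number of questions (from the Methods)
correct_group_a, correct_group_b = 130, 100  # placeholder correct-answer counts

# 2x2 contingency table: rows = group, columns = (correct, incorrect)
contingency = [
    [correct_group_a, n_questions - correct_group_a],
    [correct_group_b, n_questions - correct_group_b],
]

odds_ratio, p_value = fisher_exact(contingency)
print(f"Odds ratio: {odds_ratio:.2f}, p-value: {p_value:.4f}")
```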
Results
Overall performance
The overall accuracy of GPT-3.5 for all 151 questions was 67.6%. In contrast, GPT-4 achieved significantly higher accuracy than GPT-3.5, with an overall accuracy of 88.1% (p < 0.001). No significant differences were observed between GPT-3.5 + Bing and GPT-3.5 (p = 0.44). In comparison, the overall accuracy of the medical students was 76%. All LLMs would have passed the radiology exam for medical students at our university. [Table 1] shows the overall performance of the LLMs as well as the performance stratified by question type and topic, and [Fig. 2] shows a question that was answered correctly by all LLMs.
Fig. 2 GPT-3.5, GPT-4, and Perplexity AI responses to one of the questions. All picked the correct answer (option B). A: GPT-3.5; B: Perplexity AI; C: GPT-4.
Performance by topic
Among the subgroups, GPT-4 exhibited the highest performance in the gastrointestinal category, correctly answering 15 out of 16 questions and thus achieving an accuracy of 93.75%. Compared with GPT-3.5 and Perplexity AI, GPT-4 demonstrated significantly superior performance with regard to answering questions related to bone diseases (p = 0.03). However, subgroup analysis revealed no noteworthy variations in performance across the remaining subspecialty groups.
Questions answered incorrectly by all models
A total of seven questions were answered incorrectly by all models (Table S1). Among these, two questions pertained to the use of contrast agents in patients with renal insufficiency, while another related to MRI angiography in patients with a pacemaker.
The remaining questions that stumped all models demanded a nuanced understanding of specific details or specialized knowledge. For instance, one question pertained to renal scintigraphy, where the correct response hinged on the knowledge that Tc-99m-MAG3 is primarily secreted by the proximal renal tubules and, therefore, cannot be used to estimate the glomerular filtration rate. [Fig. 3] illustrates a question that was answered incorrectly by all LLMs.
Fig. 3 Response to a question answered incorrectly. Please be mindful that large language models (LLMs) frequently use assertive language in their responses, even when those responses are incorrect. Abbreviations: LLM: large language model.
Performance by question type
GPT-4 demonstrated significantly superior performance on both lower-order and higher-order questions when compared to GPT-3.5 and Perplexity AI (p = 0.01 and p < 0.001, respectively). GPT-4 achieved the best performance across all topics and categories compared to the medical students, GPT-3.5, and Perplexity AI ([Fig. 4]).
Fig. 4 Performance comparison across medical topics: medical students vs. GPT models.
Within the subgroups, GPT-4 exhibited its highest performance when responding to higher-order questions related to diagnosis. It provided correct answers for 22 out of 23 questions in this category, achieving an accuracy of 95.65%.
In contrast, GPT-3.5 and Perplexity AI exhibited their highest performance in the lower-order subgroup comprehension, with accuracies of 75.00% and 82.41%, respectively ([Table 1]). Perplexity AI demonstrated the weakest performance in the higher-order category diagnosis (60.9%) and in the lower-order category knowledge (63.9%), while GPT-3.5 had the weakest performance in the higher-order category description of imaging findings (59.3%) and the lower-order category knowledge (66.7%). The average medical student achieved similar performance on lower-order questions (76.27%) and higher-order questions (76.39%). The performance of the average student was relatively stable across all subgroups. The average student achieved the highest performance on questions related to clinical management with an accuracy of 78.7% and the lowest performance on diagnosis with an accuracy of 74.7% ([Table 1], [Fig. 5], [Fig. 6]).
Fig. 5 Performance comparison in higher- and lower-order tasks: medical students vs. GPT
models.
Fig. 6 Performance heatmap across medical topics and cognitive functions.
Odds ratio analysis
The odds ratio analysis confirmed that the overall performance of GPT-4 was significantly superior to that of GPT-3.5, Perplexity AI, and the medical students. The improved performance was particularly notable for higher-order questions, where GPT-4 showed the greatest improvement over the other GPT models and the students. For example, GPT-4 was 4.3 times more likely than GPT-3.5 to correctly answer higher-order thinking questions (p < 0.001). For lower-order thinking questions, while GPT-4 still performed better, the difference compared to the medical students was not statistically significant ([Table 2]).
Table 2 Odds ratio analysis of the performance of LLMs and medical students.

| Comparison | Odds ratio | p-value |
|---|---|---|
| All questions | | |
| GPT-4 vs. GPT-3.5 | 3.4 | 0.00002 |
| GPT-4 vs. medical students | 2.2 | 0.006 |
| GPT-4 vs. Perplexity AI | 2.9 | 0.0003 |
| Higher-order | | |
| GPT-4 vs. GPT-3.5 | 4.3 | 0.0004 |
| GPT-4 vs. medical students | 2.7 | 0.03 |
| GPT-4 vs. Perplexity AI | 3.5 | 0.004 |
| Lower-order | | |
| GPT-4 vs. GPT-3.5 | 2.9 | 0.03 |
| GPT-4 vs. Perplexity AI | 2.7 | 0.047 |
| GPT-4 vs. medical students | 2.2 | 0.11 |
Discussion
The integration of LLMs into various domains has increased remarkably in recent years,
with applications ranging from natural language processing to medical diagnostics.
In the field of medical education, LLMs have shown immense potential to assist and
enhance the learning experience for students, particularly in radiology – a discipline
that demands profound understanding of complex medical concepts and terminology.
The present study provides several key findings for understanding how advancements in LLM technology can impact medical education. First, in this exploratory prospective study, all LLMs would have passed the exam. Second, GPT-4 exhibited significantly better performance than GPT-3.5, Perplexity AI, and the medical students, answering 88% of the questions correctly. Third, GPT-4 maintained the best performance across all topics and categories compared to the medical students, GPT-3.5, and Perplexity AI. Fourth, the performance improvement was particularly pronounced for higher-order questions, where GPT-4 demonstrated the most significant improvement over the other GPT models and the students. Fifth, GPT-4 demonstrated the highest performance in the gastrointestinal category with an accuracy of 93.75%. The prevalence of gastrointestinal content in training datasets may have contributed to the modelʼs enhanced performance in this domain.
Despite the ability of Perplexity AI to search the internet, it demonstrated the weakest
performance with regard to knowledge. Internet searches can yield information from
a wide range of sources, including those that are not peer-reviewed or scientifically
accurate. Without a sophisticated mechanism to filter and prioritize high-quality,
reliable sources, the model might incorporate inaccurate or outdated information.
GPT-4ʼs superior performance may be attributed to advanced model enhancements, including a deeper architecture and more extensive training.
ChatGPT has demonstrated good performance in a wide range of professional examinations, including those in the medical field, even without specialized domain pretraining [10] [11] [12] [13]. For instance, it was applied to the USMLE, where ChatGPT achieved accuracy rates exceeding 50% across all examinations and surpassing 60% in certain analyses [11].
Despite the absence of radiology-specific training, ChatGPT performed commendably.
When new LLMs with radiology-specific pretraining and the ability to process images
become publicly available, it will be interesting to see what results can be achieved.
As LLM technology continues to advance, radiologists will need to gain a comprehensive understanding of the performance and reliability of these models and of their evolving role in radiology. The development of applications built on LLMs holds promise for
further enhancing radiological practice and education, ultimately benefiting both
current and future healthcare professionals. However, ChatGPT is designed to discern
patterns and associations among words within its training data. Consequently, we anticipate
limitations in cases requiring understanding of the context of specialized technical
language or specific details and specialized knowledge, such as radiological terminology
used in imaging descriptions, calculations, and classification systems.
Furthermore, ChatGPT consistently employs confident language in its responses, even
when those responses are incorrect. This tendency is a well-documented limitation
of LLMs [15]. Even when the most probable available option is incorrect, ChatGPT tends to generate responses that sound convincingly human-like. Interestingly, increased human likeness in chatbots is associated with a higher level of trust [16]. Consequently, ChatGPTʼs inclination to produce plausible yet erroneous responses presents a significant concern when it serves as the sole source of information [17]. This concern is particularly critical with regard to individuals who may lack the
expertise to discern inaccuracies in its assertions, notably novices. As a result,
this behavior currently restricts the practicality of employing ChatGPT in medical
education.
To prevent a future where LLMs influence the outcome of medical and radiological exams,
several measures can be taken. These include designing exam questions that necessitate
critical thinking and the application of knowledge rather than mere recall, integrating
practical components or simulations that cannot be easily answered by LLMs, ensuring
robust exam proctoring and monitoring procedures to detect any suspicious behavior,
and continually updating exam formats and content to stay ahead of potential cheating
methods involving LLMs. Additionally, emphasizing the importance of genuine learning
and skill acquisition can help maintain the integrity of medical exams amidst technological
advancements.
Furthermore, we identified inconsistencies in ChatGPTʼs responses. In a subsequent evaluation, GPT-3.5 yielded different answers for five questions and GPT-4 for six questions, but there were no significant differences in accuracy between the two models. These inconsistencies can be partially mitigated by adjusting parameters
such as temperature, top-k, and top-p settings. Temperature controls the randomness
of the modelʼs responses; a lower temperature makes the output more focused and deterministic,
while a higher temperature increases variability. Top-k limits the model to considering
only the top k most likely next words, thus reducing the chance of less probable words
being selected. Top-p adjusts the probability mass, allowing the model to consider
the smallest possible set of words whose cumulative probability exceeds a certain
threshold p, thereby balancing diversity and coherence.
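As a hypothetical illustration, such sampling parameters can be set when calling a model programmatically. The sketch below assumes the openai Python package (v1 client), an API key in the environment, and an illustrative prompt and model name; note that the OpenAI chat endpoint exposes temperature and top_p, whereas a top-k parameter is offered by some other LLM APIs but not this one.

```python
# Sketch of setting sampling parameters when querying a model programmatically.
# Assumptions: the `openai` Python package (v1 client) with an API key in the
# environment; the prompt and model name are illustrative. The OpenAI chat
# endpoint exposes `temperature` and `top_p`; top-k is not available here,
# although some other LLM APIs provide it.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user", "content": "Question: ..."}],
    temperature=0.0,  # lower temperature -> more focused, near-deterministic output
    top_p=1.0,        # keep the full probability mass (no nucleus truncation)
)
print(response.choices[0].message.content)
```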
However, these adjustments cannot be made directly through the web interface; they can be made, for instance, via the API or in the OpenAI playground. Without a nuanced understanding of the influence of these parameters, there is a risk of overestimating or underestimating LLM capabilities, potentially leading to misleading conclusions about their effectiveness
in educational settings. Moreover, the variability introduced by different parameter
settings may result in significant fluctuations in LLM performance, thus challenging
the generalizability of findings to real-world applications. Future research should
prioritize comprehensive analyses of the impact of LLM settings on responses to radiology
exam questions to ensure accurate assessments and to optimize LLM configurations for
educational use in specialized fields.
Furthermore, it is essential to acknowledge certain limitations. First, we excluded
questions containing images, which are typically integral to a radiology examination,
due to ChatGPTʼs inability to process visual content at the time of this study. To
thoroughly assess the performance of the LLMs presented in a real-world scenario,
including all question types, further studies are necessary.
Second, the pass/fail threshold we applied is an approximation, as a passing score of 60% or above is normally standard for all written components, including those featuring image-based questions. Furthermore, the relatively small number of questions in each subgroup within this exploratory study limited the statistical power available for subgroup analyses.
In conclusion, our study underscores the potential of LLMs like ChatGPT as a new and readily accessible knowledge source for medical students. Even without radiology-specific pretraining, ChatGPT demonstrated remarkable performance, achieving a passing grade on a radiology examination for medical students that did not include images. The model excelled at higher-order as well as lower-order thinking questions. It is crucial for radiologists to be aware of ChatGPTʼs limitations, including its tendency to confidently generate inaccurate responses. Presently, it cannot be solely relied upon for clinical practice or educational purposes. Nevertheless, ChatGPT offers medical students a valuable tool to supplement their learning and understanding of radiology concepts.
Declarations
We disclose that the manuscript was proofread by ChatGPT. All sections proofread by
ChatGPT were meticulously reviewed. Additionally, we adhered to data protection regulations,
ensuring that only anonymized data was uploaded.
Statistical analysis was performed using Python (version 3.11). ChatGPT was utilized to understand and debug the Python code and to adjust the graphics ([Fig. 4], [Fig. 5], [Fig. 6]); the diagrams themselves were created using Python code.
Informed Consent: Not applicable.
Data availability statement: All data and information used are included in this manuscript.