Keywords
artificial intelligence - medical education - multiple-choice questions
Introduction
The term artificial intelligence (AI) was coined in 1956 by pioneers such as John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon.[1] The first AI program, the Logic Theorist, developed by Newell and Simon, could mimic certain aspects of human problem-solving.[2] Early developments include the Turing Test, proposed in 1950 to measure a machine's ability to exhibit intelligent behavior indistinguishable from that of a human.[3] Similarly, the first chatbot, ELIZA, created by Joseph Weizenbaum in the mid-1960s, demonstrated the ability to simulate conversation.[4] Programs such as DENDRAL (begun in 1965) and MYCIN (developed in the early 1970s) made progress in chemistry and healthcare, aiding chemical analysis and the diagnosis and treatment of blood infections, respectively.[5] Over time, AI has expanded its influence to almost every sector, including banking, healthcare, finance, agriculture, transport, and, notably, education.[6]
AI is often viewed as one of the most groundbreaking technologies in human history.
In the field of education, AI has the capability to revolutionize learning by tailoring it to individual students, making education more engaging and efficient. Tools such as intelligent tutoring systems, chatbots, and automated grading systems not only make educational processes more efficient but also help teachers by saving time and providing consistent feedback.[7]
In the last decade, a significant rise has been seen in the use of AI in higher education.
AI tools have become more accessible and affordable for both educators and students, enabling intelligent tutoring, learning management, and even the prediction of student performance.[8] AI assistants such as Google Gemini, ChatGPT, Meta Llama, and Microsoft Copilot are increasingly being used in educational settings, helping both teachers and students.
The impact of AI is also being felt in the field of medical education. AI has been
utilized to develop, analyze, and refine medical curricula, as well as to enhance
learning and assessment processes.[9] Assessment is a cornerstone of the educational system and plays a critical role
in documenting the knowledge, skills, attitudes, and beliefs of students.[10] Assessment greatly influences the teaching and learning process across all levels
of education. The method of assessment employed shapes the knowledge and skills students acquire and how they apply them in the future. Additionally, assessment helps institutions measure the success of their courses or curricula by evaluating students' performance.[11]
In recent years, medical education has shifted from a teacher-centered to a student-centered
approach. Didactic lectures, which are often passive and yield poor outcomes, are
being replaced by innovative teaching methods to improve learning outcomes. Many medical
schools are now incorporating various pedagogical strategies, such as interactive
question-and-answer sessions, case-based learning, assignments, and regular tests,
to make learning more effective and participatory.[12] Students generally perceive regular tests as beneficial tools for reinforcing their
learning, a phenomenon known as the testing effect.[13] Beyond being a tool for assessment, tests are also powerful learning aids. Among
various testing methods, multiple-choice questions (MCQs) remain a popular choice
in education. MCQs are objective and have a predetermined correct answer, making them a reliable assessment tool free from scorer bias.
MCQs are widely accepted in medical education for several reasons. They save time
by allowing for quick and easy evaluation, help teachers identify learning difficulties,
and provide rapid feedback to students. Additionally, MCQs are enjoyable, highly motivating,
and help in building students' confidence.[14] Evidence suggests that MCQs enhance learning, improve recollection and reflection
on discussed topics, and lead to better performance in summative assessments.[15] MCQs can also assess higher-order cognitive skills such as interpretation, application,
and evaluation, in addition to knowledge recall. They are reproducible, reliable,
and cost-effective.[16]
However, despite their advantages, creating well-constructed MCQs is a time-consuming
process that requires formal training. AI has the potential to make the assessment
process more accurate, rapid, and cost-effective, while also providing detailed and
customized feedback. Previous studies have explored the use of AI in generating MCQs.
For example, a study in 2022 demonstrated the capability of a program to generate
MCQs from raw text.[17] Another study in 2024 highlighted the potential of ChatGPT as a tool for test development.[18] Although one study found that ChatGPT 3.5 did not consistently produce high-quality
MCQs, it suggested that teachers could still benefit from using AI-generated questions
as a starting point, requiring only minor adjustments rather than creating new questions
from scratch.[19]
The integration of AI into education, especially in creating assessment tools, is
rapidly gaining interest. AI-generated MCQs offer benefits such as scalability, efficiency,
and cost-effectiveness. However, there is still limited research comparing the quality
and effectiveness of AI-generated MCQs with those created by human experts. This study
aims to bridge that gap by systematically comparing AI-generated MCQs with expert-generated
MCQs in the context of medical education. Our research aims to determine whether AI-generated
MCQs can match or exceed the quality of human expert-generated MCQs in medical education assessments, specifically in terms of difficulty index, discrimination index, and distractor efficiency (DE).
Materials and Methods
Study design: This study employed a cross-sectional design and was conducted in a medical college in India after obtaining approval from the Institutional Ethics Committee. A total of 50 medical students were given a set of 50 randomized and blinded MCQs, comprising 25 MCQs prepared by trained and experienced teachers and 25 MCQs generated by AI.
Process of human expert-generated MCQs: We invited 30 physiology teachers from various colleges across India to participate in the study; after the study was explained to them, 24 teachers with the minimum required teaching experience of more than 5 years promptly responded. They were asked to prepare MCQs following the instructions given in [Fig. 1]. After collecting the MCQs from each participating teacher, 506 questions covering a total of 10 themes were available. Fifty questions were then selected from the 10 themes by randomization ([Fig. 2] and [Supplementary material], available in the online version only).
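Purely as an illustration of this shortlisting step (the study does not publish its randomization procedure; the equal allocation of five questions per theme, the theme names, and the data layout below are assumptions), theme-wise random selection could be sketched as follows:

```python
# Hypothetical sketch of theme-wise random shortlisting; assumes an equal
# allocation of 5 questions per theme (50 / 10 themes), which the study does not state.
import random

def shortlist(pool_by_theme: dict[str, list[str]], per_theme: int = 5,
              seed: int = 42) -> list[str]:
    """Randomly pick `per_theme` questions from each theme's pool."""
    rng = random.Random(seed)           # fixed seed so the draw is reproducible
    selected = []
    for theme, questions in pool_by_theme.items():
        selected.extend(rng.sample(questions, k=per_theme))
    return selected

# Placeholder pool: 10 themes with roughly 50 candidate questions each
pool = {f"theme_{i}": [f"theme_{i}_Q{j}" for j in range(50)] for i in range(10)}
shortlisted = shortlist(pool)           # yields 50 questions, 5 per theme
```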
Fig. 1 Difficulty index comparing artificial intelligence and human experts across four categories: hard, desired, moderate, and easy. The hard category represents questions that posed a significant challenge to examinees. The desired zone indicates the optimal level of difficulty, making it the most suitable range for assessment. The moderate category includes questions of average difficulty that require a reasonable level of effort. Lastly, the easy category encompasses questions that were simple and straightforward for most examinees.
Fig. 2 The distractor analysis based on the discrimination index, comparing the performance
of artificial intelligence and human experts across four categories: excellent, good,
acceptable, and poor. The excellent category reflects items that strongly differentiate
between high and low performers, indicating high-quality questions. The good and acceptable
categories represent questions with reasonable discriminating power, still useful
for assessment purposes. In contrast, the poor category includes items with limited
or no ability to distinguish between different levels of performance, suggesting the
need for revision. This analysis helps evaluate the effectiveness of each distractor
in assessing knowledge accurately.
Selection of AI tools: We employed four leading AI tools for generating MCQs: Google Gemini, Meta Llama, Microsoft Copilot, and ChatGPT. These tools were chosen for their advanced natural language processing capabilities, which are well suited to generating educational content. The MCQs were created by inputting the prompt given in [Fig. 1]. A total of 386 MCQs were created by the four AI tools, and a random selection process was used to shortlist a set of 50 AI-generated MCQs from these 386.
Validation process: A panel of five subject experts who were not part of the study validated the two sets of MCQs after being blinded to the origin (human or AI) of the questions. The validation criteria included relevance to the curriculum, clarity of the questions, cognitive level, and the appropriateness of distractors. This process ensured that both sets of MCQs were suitable for assessing the intended learning outcomes. Three questions from the set of 50 human expert-prepared MCQs and 7 questions from the set of 50 AI-generated MCQs were excluded for failing one or more of these criteria. A final question paper of 50 MCQs was prepared by randomly selecting 25 questions each from the human expert and AI question sets, which at that stage contained 47 and 43 MCQs, respectively. Students were instructed to fill in their personal details and provide feedback ([Fig. 2]).
Study Participants
Medical students were notified of the fixed date, time, and venue of the test one month in advance. The test could be taken by any student who had completed the final year of MBBS within the last 5 years. There was no negative marking; each correct answer was awarded 4 marks, giving a maximum of 200 marks for the 50 MCQs. Students were encouraged to complete the exam on time. After the test, the students' responses were evaluated manually, and the questions were regrouped into their AI-generated and human expert-generated sets ([Fig. 2]).
Fig. 3 Functional distractors: number of functional distractors per item. This figure categorizes
the number of functional distractors used in multiple-choice questions, comparing
the performance of artificial intelligence and human experts. Items are divided into
three levels based on the number of functional distractors: 3, indicating high distractor
efficiency, where all options were effective; 2, representing moderate efficiency,
with most distractors being plausible; and 0–1, signifying low distractor efficiency,
where few or no distractors successfully differentiated knowledge levels among test-takers.
This analysis provides insight into the overall quality and effectiveness of distractor
design in assessment items.
Data Analysis
Using the item response theory (IRT) framework, we calculated key metrics such as item reliability, difficulty index, discrimination index, and distractor functionality. We determined the proportion of MCQs categorized as hard, desired, moderate easy, and easy by difficulty level, as well as the distribution of questions across discrimination index categories (excellent, good, acceptable, poor, and negative). Furthermore, we determined the percentage of MCQs that fell into the high, moderate, and low DE categories.
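As a concrete illustration of how these item metrics can be computed, the sketch below shows one common way to derive the difficulty index, discrimination index, and distractor efficiency from scored responses. This is a minimal sketch, not the study's analysis script; the 0/1 response matrix, the 27% upper and lower group split, and the 5% functionality threshold are assumptions based on standard item-analysis practice.

```python
# Minimal illustrative item-analysis sketch (assumptions noted above).
import numpy as np

def item_analysis(responses: np.ndarray, group_frac: float = 0.27):
    """Return per-item difficulty (%) and discrimination index from a 0/1 students-by-items matrix."""
    n_students, _ = responses.shape
    totals = responses.sum(axis=1)
    order = np.argsort(totals)                       # students ranked by total score
    k = max(1, int(round(group_frac * n_students)))  # size of upper/lower groups
    low_group, high_group = order[:k], order[-k:]

    difficulty = responses.mean(axis=0) * 100        # % of students answering each item correctly
    discrimination = (responses[high_group].mean(axis=0)
                      - responses[low_group].mean(axis=0))
    return difficulty, discrimination

def distractor_efficiency(option_counts: np.ndarray, correct_idx: int,
                          threshold: float = 0.05) -> float:
    """DE (%) for one item, from the count of responses per option."""
    distractors = np.delete(option_counts, correct_idx)
    functional = (distractors / option_counts.sum() >= threshold).sum()
    return 100.0 * functional / distractors.size

# Demo with fabricated responses (42 students x 50 items), for illustration only
rng = np.random.default_rng(0)
demo = (rng.random((42, 50)) > 0.4).astype(int)
p_values, d_values = item_analysis(demo)
de_one_item = distractor_efficiency(np.array([5, 20, 10, 7]), correct_idx=1)
```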
Results
Of the 48 students who volunteered for the test, 42 completed it within the allotted
time. The remaining six students were excluded from the results as they left the exam
before the minimum duration of 10 minutes had passed. This was done to ensure the
results were reliable and not based on incomplete data. The study focused on 42 medical
students who were preparing for postgraduate entrance exams. These students, comprising
24 males and 18 females, took a physiology exam consisting of 25 MCQs created by AI
and 25 MCQs created by human experts with at least 5 years of teaching experience
in physiology ([Table 1]). The analysis revealed no significant difference in performance between male and
female students. The Kuder–Richardson Formula-20 (KR-20) score for the two sets of 25 MCQs prepared by human experts and AI was found to be 0.67 and 0.57, respectively.
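For reference, the Kuder–Richardson Formula-20 reported above is the standard internal-consistency coefficient for dichotomously scored items; the notation below is the conventional form rather than one reproduced from the study:

\[ r_{\mathrm{KR20}} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^{2}}\right) \]

where k is the number of items in a set (25 here), \(p_i\) is the proportion of examinees answering item i correctly, \(q_i = 1 - p_i\), and \(\sigma_X^{2}\) is the variance of the total scores. Values closer to 1 indicate greater internal consistency, in line with the human expert set (0.67) outperforming the AI set (0.57).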
Table 1
Descriptive data: participating teachers vs. participating students
Variable | Category | Participating teachers (n = 24), N (%) | Participating students (n = 42), N (%)
Gender | Male | 14 (58.3%) | 24 (57.1%)
Gender | Female | 10 (41.7%) | 18 (42.9%)
Experience | Teachers 5–10 y / students 0–1 y since MBBS | 12 (50.0%) | 30 (71.4%)
Experience | Teachers 10–20 y / students 2–3 y since MBBS | 8 (33.3%) | 11 (26.2%)
Experience | Teachers >20 y / students 3–5 y since MBBS | 4 (16.7%) | 1 (2.4%)
Assessment results | Total MCQs | 50 | 50
Assessment results | Marks per MCQ | 4 | 4
Assessment results | Maximum marks | 200 | 200
Assessment results | Average percentage | 70% | 65%
Difficulty Index Analysis
In our analysis of question difficulty, we found that AI-generated questions tended
to be more challenging than those created by human experts. Specifically, 16% of the
AI questions were classified as hard, while only 4% of the human experts' questions
fell into this category ([Table 2]). This indicates that AI may produce questions that are harder than those designed
by humans.
Table 2
Difficulty index: artificial intelligence vs. human experts
Difficulty quality | Difficulty range (%) | Artificial intelligence, N (%) | Human experts, N (%)
Hard | 0–29 | 4 (16%) | 1 (4%)
Desired | 30–70 | 4 (16%) | 13 (52%)
Moderate easy | 71–79 | 5 (20%) | 1 (4%)
Easy | ≥80 | 12 (48%) | 10 (40%)
Statistical results: χ² = 37.6, df = 3, p = 0.0001.
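The categories in [Table 2] follow the conventional difficulty (facility) index, computed per item as the percentage of examinees answering correctly; the formula below is the standard definition, not one quoted from the study:

\[ P = \frac{\text{number of correct responses to the item}}{\text{total number of responses to the item}} \times 100 \]

Lower values indicate harder items, so the 0–29% band corresponds to hard questions and the ≥80% band to easy ones.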
For questions within the desired difficulty range, human experts performed far better: 52% of their questions fell in this ideal range, compared with only 16% of AI-generated questions ([Table 2]). This suggests that human experts are better at creating questions that match the difficulty expected for medical students.
Additionally, AI-generated questions were more likely to be in the moderate-easy category,
with 20% of the questions falling into this range, compared to just 4% of the human
experts' questions ([Table 2]).
Lastly, AI produced a higher percentage of easy questions (48%) compared to human
experts (40%; [Table 2]). This shows that AI tends to generate questions that are easier, which might not
challenge students as effectively as the questions created by human experts.
Interpretation: There is a highly significant difference between AI and human experts in the distribution of difficulty levels. Human experts excel at producing desired-difficulty questions (52%), aligning well with educational assessment goals. AI, by contrast, produces mostly easy questions (48%) and fewer desired ones (16%), indicating less optimal control over difficulty calibration.
Conclusion: Human experts show superior judgment in designing balanced, educationally appropriate
questions. AI-generated items skew toward extremes (too easy or too hard), suggesting
the need for better difficulty modeling.
Discrimination Index Analysis
Excellent discrimination: Human experts created a higher percentage (48%) of questions that effectively distinguished
between different levels of student ability, compared to AI-generated questions (32%).
This suggests that questions created by human experts are better at evaluating varying
levels of student performance ([Table 3]).
Table 3
Discriminative quality of AI and human experts
Discriminative quality | Discrimination index | Artificial intelligence, N (%) | Human experts, N (%)
Excellent | DI > 0.4 | 8 (32%) | 12 (48%)
Good | DI = 0.3–0.39 | 3 (12%) | 5 (20%)
Acceptable | DI = 0.2–0.29 | 5 (20%) | 3 (12%)
Poor discriminating abilities | DI < 0.2 | 9 (36%) | 5 (20%)
Statistical results: Discrimination index, χ² = 11.7, df = 3, p = 0.0082.
Good discrimination: AI-generated questions had a lower percentage (12%) with good discrimination compared
to those created by human experts (20%). This indicates that AI questions were less
effective in differentiating between students' levels of understanding ([Table 3]).
Acceptable discrimination: On the other hand, AI had a higher percentage (20%) of questions with acceptable
discrimination than human experts (12%). This shows that while AI questions were sometimes
adequate, they were less effective overall ([Table 3]).
Poor discriminating abilities: AI also had a higher percentage (36%) of questions with poor discriminating abilities
compared to human experts (20%). This means that many AI-generated questions did not
effectively distinguish between students with different levels of ability ([Table 3]).
Interpretation: There is a statistically significant difference between AI and human experts in the distribution of discriminating ability (χ² = 11.7, df = 3, p = 0.0082). Human experts produced significantly more "excellent" and "good" items, whereas AI generated more "poor" and "acceptable" items, indicating weaker discriminatory power.
Conclusion: Human experts produced more excellent and good questions (68% combined). AI generated
more poor and acceptable ones (56% combined), indicating human experts have stronger
discriminative abilities in question design.
Distractor Analysis
When analyzing the efficiency of distractors, we found notable differences between
MCQs generated by human experts and those created by AI ([Fig. 3]).
High distractor efficiency: Human experts produced a higher percentage (24%) of questions with highly effective
distractors compared to AI, which only had 4% of such questions ([Table 4]). This suggests that human-generated questions are better at using distractors to
challenge students.
Table 4
Functional distractors of AI and human experts
No. of functional distractors | Distractor efficiency | Artificial intelligence, N (%) | Human experts, N (%)
3 | High | 1 (4%) | 6 (24%)
2 | Moderate | 7 (28%) | 10 (40%)
0–1 | Low | 17 (68%) | 9 (36%)
Statistical results: χ² = 26.2, df = 2, p = 0.0001.
Moderate distractor efficiency: Human experts also had a higher percentage (40%) of questions with moderately effective
distractors, while AI produced 28% of such questions ([Table 4]). This indicates that human experts are better at creating questions with distractors
that are somewhat effective in assessing student understanding.
Low distractor efficiency: A significant difference was observed in questions with low DE. AI had a higher percentage
(68%) of questions with distractors that were less effective, compared to human experts,
who had only 36% ([Table 4]). This shows that the distractors in AI-generated questions are generally less effective
at differentiating between varying levels of student knowledge.
Interpretation: There is a statistically significant difference between AI and human experts in DE
distribution. Human experts outperform AI, producing significantly more high- and
moderate-efficiency distractors. AI tends to create a greater proportion of low-efficiency
distractors, indicating less discriminative or less functional options.
Conclusion: AI produced a much higher percentage of low-efficiency distractors (68%), indicating
weaker quality distractors. Human experts generated more high-efficiency (24%) and
moderate-efficiency (40%) distractors, showing better balance and quality.
Discussion
MCQs are vital tools for assessment in education because they allow for the direct measurement of various knowledge, skills, and competencies across a wide range of disciplines. They can test concepts, principles, judgment, inference, reasoning, data interpretation, and the application of information.[1] [20] [21] MCQs are also efficient to administer, easy to score objectively, and provide valuable statistical insights regarding class performance on specific questions, helping to assess whether a question was contextually appropriate.[20] [22] A standard MCQ consists of a stem, options, and sometimes additional information.[2] [23] The stem provides the context and content and sets up the question, while the options include one correct answer and several incorrect ones, known as distractors.[24] Distractors are crucial in leading noncompetent candidates away from the correct answer, which is a key feature of a well-constructed question.[3] [25] However, the main challenge with the MCQ format is that creating high-quality questions is often difficult, time-consuming, and costly.[26]
The introduction of ChatGPT, an AI-powered chatbot, has significantly changed the
role of AI in education. Trained on a diverse dataset, ChatGPT can perform tasks such
as writing songs or poems, telling stories, creating lists, and even generating MCQ exams.[27]
The analysis of AI-generated and human expert-generated MCQs in this study provides important insights into their effectiveness in medical education. The KR-20 scores indicate that the MCQs prepared by human experts showed greater internal consistency and reliability (KR-20 = 0.67) than those generated by AI (KR-20 = 0.57).
Comparable Difficulty
The findings revealed that the mean difficulty index of AI-generated MCQs (mean = 0.62, SD = 0.14) was similar to that of human expert-generated MCQs (mean = 0.60, SD = 0.13), with no statistically significant difference (p = 0.45), even though the distribution of items across difficulty categories differed ([Table 2]). This indicates that AI can produce questions with an overall difficulty level comparable
to those created by human experts. This is a promising outcome for the potential use
of AI in generating large question banks for formative assessments, as it suggests
that AI can create questions that are appropriately challenging for the target audience.
This capability could be particularly useful in contexts where a large volume of questions
is needed quickly, such as for practice exams or preparatory materials.
Practical Applications for Educators
Rather than viewing AI-generated MCQs as replacements for human-authored content,
educators can leverage them as a foundational tool. AI can rapidly generate a variety
of question stems and plausible distractors, saving time during the initial drafting
phase. Educators can then refine these questions by improving distractor quality,
aligning content with learning objectives, and enhancing clarity. This hybrid approach
balances the efficiency of AI with the pedagogical insight of experienced educators,
particularly useful in large-scale formative assessments or low-stakes quizzes where
time and resource constraints are prevalent.
Limitations in Discrimination and Distractor Efficiency
However, the study revealed notable limitations in AI-generated MCQs, particularly
regarding the discrimination index and DE.
The discrimination index, which measures how well a question distinguishes between high- and low-performing
students, was significantly higher for expert-generated MCQs (mean = 0.34, SD = 0.08)
compared to those generated by AI (mean = 0.28, SD = 0.09; p = 0.02). This suggests that human experts are more adept at crafting questions that
effectively assess students' depth of understanding and differentiate varying levels
of academic performance. The superior performance of expert-generated items may be
attributed to the nuanced knowledge of subject matter experts and their awareness
of common student misconceptions, enabling them to design questions that better evaluate
conceptual understanding.
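For clarity, the discrimination index discussed here is conventionally obtained by the upper and lower group method; the notation below is the standard textbook form, and the exact group split used in the study is not stated (the top and bottom 27% of scorers is a common choice):

\[ D = \frac{H - L}{n} \]

where H and L are the numbers of examinees in the upper and lower groups, respectively, who answered the item correctly, and n is the number of examinees in each group. Values above 0.4, as in [Table 3], mark excellent discriminators.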
Similarly, DE, which gauges the plausibility and effectiveness of the incorrect answer
choices, was significantly higher in expert-crafted MCQs (mean = 85%, SD = 5%) than
in AI-generated ones (mean = 78%, SD = 7%; p = 0.01). Human-designed distractors tend to be more contextually appropriate and
cognitively challenging, thereby enhancing the diagnostic value of the question by
testing not just recall but higher-order thinking and application.
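Likewise, DE is conventionally derived from distractor analysis, counting a distractor as functional when at least 5% of examinees select it; the formula below is the standard definition, and the study's exact threshold is not stated:

\[ \mathrm{DE} = \frac{\text{number of functional distractors}}{\text{total number of distractors}} \times 100 \]

For a four-option item (one key and three distractors), three functional distractors correspond to 100% DE, matching the high-efficiency category in [Table 4].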
Why Are Human-Generated MCQs Superior?
The superiority of human-generated MCQs in terms of discrimination and DE can be attributed
to several factors:
Deep understanding of the subject matter: Human experts have a comprehensive understanding of the intricacies of the subject
they are assessing. This allows them to design questions that target specific learning
objectives and test higher-order thinking skills, which AI, with its current capabilities,
struggles to replicate.
Insight into student learning behavior: Human experts are familiar with the common pitfalls and misconceptions that students
might have. This knowledge enables them to create distractors that are more likely
to mislead students who do not fully understand the material, thereby effectively
distinguishing between different levels of student knowledge.
Pedagogical expertise: Human experts are trained in educational theory and practice, which informs their
ability to construct questions that align with the intended learning outcomes. This
pedagogical expertise allows them to design assessments that are not only challenging
but also educational, reinforcing learning even as they test it.
Adaptability: Human experts can adapt their question-writing strategies based on the evolving
needs of the curriculum and the specific challenges faced by students. AI, on the
other hand, relies on pre-existing data and algorithms, which may not always capture
the nuances of a changing educational environment.
Implications for Medical Education
The findings of this study suggest that AI-generated MCQs can be valuable, but they
are most effective when used as a supportive tool under the guidance of trained educators.
The lower discrimination index and DE in AI-generated questions indicate that these
tools are not yet capable of fully replicating the expertise and insight that human
educators bring to the table. As such, the role of human experts remains crucial,
particularly in areas that require a deep understanding of pedagogy and student learning
behaviors.
In conclusion, while AI holds promise as a supplementary tool in medical education,
particularly for generating large volumes of practice questions, it cannot yet replace
the nuanced and expert-driven process of question creation that human educators provide.
The findings emphasize the continued importance of human involvement in the assessment
process to ensure that evaluations are both fair and effective in measuring student
learning outcomes.
Limitations: Single-institution study: This study was conducted at a single institution, limiting
the generalizability of the findings. The educational environment, student population,
and curriculum at one medical college may not be fully representative of other medical colleges in India or globally.
Scope of AI tools: Only four AI tools were used in this study, and they were tested on a limited set of topics. Future research should explore a wider range of AI tools and include
a broader array of topics to determine whether the findings are consistent across
different areas of medical education. Despite the promise of AI in educational assessment,
several limitations warrant consideration. First, students' familiarity or lack thereof
with AI-generated content may influence their engagement and perceived credibility
of the questions. Second, variations among different AI tools, each with unique training
data, algorithms, and sensitivity to prompts, can lead to inconsistencies in output
quality. Finally, linguistic issues such as awkward phrasing, unnecessarily complex
syntax, or culturally unfamiliar references may impair comprehension, particularly
for nonnative English speakers.
These findings emphasize the importance of continued human oversight in the integration
of AI tools into assessment design, ensuring both reliability and educational value.
Conclusion
This study provides valuable insights into the potential and limitations of AI-generated
MCQs in medical education. While AI tools show promise, particularly in generating
questions of appropriate difficulty, human expertise remains essential in crafting
high-quality assessments that effectively differentiate between levels of student
performance and challenge students' critical thinking. As AI technology continues
to evolve, ongoing research and careful implementation will be essential in ensuring
that AI contributes positively to medical education.