
DOI: 10.1055/s-0045-1809979
Comparing ChatGPT and Dental Students' Performance in an Introduction to Dental Anatomy Examination: Comment
We would like to comment on “Comparing ChatGPT and Dental Students' Performance in an Introduction to Dental Anatomy Examination: A Cross-Sectional Study.”[1] The purpose of that study was to compare ChatGPT's knowledge and interpretive ability with those of undergraduate dental students using a multiple-choice dental anatomy examination. While the study provided valuable insights, relying solely on frequency and percentage distributions might not fully capture the nuances of the data. The cross-sectional analytical design is well suited to comparing performance at a single point in time, but it cannot capture long-term changes or development in either student learning or ChatGPT's capabilities. Furthermore, a single average score may not reflect other important skills, such as clinical reasoning or the application of knowledge in real-world scenarios. The researchers analyzed the data using SPSS and Microsoft Excel to determine the percentage and frequency of accurate answers.
However, relying on frequency and percentage distributions alone may not provide a thorough assessment of ChatGPT's performance. Furthermore, the Shapiro–Wilk test is most appropriate for relatively small samples, and a p-value of 0.001 already indicates that the data deviate from normality; when both tests lead to the same conclusion, additionally applying the Kolmogorov–Smirnov test may be redundant.[2] Reporting more rigorous statistics, such as testing for significant differences between groups or using Cohen's kappa to assess agreement among the experts scoring the explanations, would strengthen the credibility of the findings. In artificial intelligence (AI) studies, methods such as kappa have been applied to measure the consistency of AI outputs, for example in content generation and model evaluation, indicating their potential for increasing reliability in comparative research.[3]
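As a minimal sketch of the two analyses suggested above, the following Python snippet checks normality with the Shapiro–Wilk test and quantifies agreement between two expert raters with Cohen's kappa. The score and rating arrays are illustrative placeholders, not data from the original study.

```python
# Minimal sketch; all data below are hypothetical, not from the original study.
from scipy.stats import shapiro
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-student exam scores (percent correct).
scores = [74, 68, 81, 59, 77, 72, 85, 66, 70, 79]

# Shapiro-Wilk: p < 0.05 suggests the scores deviate from normality,
# which would argue for non-parametric group comparisons.
stat, p_value = shapiro(scores)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p_value:.3f}")

# Hypothetical ratings of ChatGPT's explanations by two experts
# (0 = inadequate, 1 = partially adequate, 2 = adequate).
rater_a = [2, 1, 2, 0, 1, 2, 2, 1]
rater_b = [2, 1, 1, 0, 1, 2, 2, 2]

# Cohen's kappa: chance-corrected inter-rater agreement.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```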
In terms of outcomes, the students outperformed ChatGPT, scoring 74.28% on average compared with ChatGPT's 60%. Although ChatGPT answered correctly at a level that meets the minimum criteria, its accuracy and reliability remained low, suggesting that AI language models may not yet analyze data in depth or interpret specific health science contexts as well as humans. A question to consider in future research is: what are ChatGPT's constraints in interpreting different types of questions?
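If the number of examination items were reported, the difference between the two accuracy rates could be tested formally. The sketch below applies a two-proportion z-test assuming a hypothetical 35-item exam, under which 74.28% and 60% correspond to roughly 26 and 21 correct answers; the item count is an assumption for illustration, not a figure from the study.

```python
# Illustrative two-proportion z-test; the 35-item exam length is assumed, not reported.
from statsmodels.stats.proportion import proportions_ztest

n_items = 35                 # hypothetical number of exam questions
correct = [26, 21]           # ~74.28% (students' average) vs. 60% (ChatGPT)
totals = [n_items, n_items]

# Two-sided test of whether the two proportions of correct answers differ.
z_stat, p_value = proportions_ztest(count=correct, nobs=totals)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```

Treating the cohort's mean score as a single proportion is, of course, a simplification; with per-student data, comparing the distribution of student scores against ChatGPT's score would be more appropriate.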
To add originality and academic value, future research should compare ChatGPT with other AI models trained specifically for medicine or dentistry, such as Med-PaLM or BioGPT.[4] A recent study by Wu et al demonstrated that when ChatGPT was integrated with the Knowledge and Few-shot Enhancement In-context Learning framework, its performance improved significantly, with ChatGPT-4 achieving the highest score and outperforming the average human score.[5] This highlights the potential of AI models, particularly when tailored to a specific domain, to substantially improve examination performance, and demonstrates the value of integrating additional frameworks into the evaluation.
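The few-shot component of such frameworks can be illustrated without any model-specific API: the idea is to prepend a handful of worked question-answer exemplars to the target question before it is sent to the model. The sketch below assembles such a prompt; the exemplar questions and the prompt structure are hypothetical and do not reproduce the framework of Wu et al.

```python
# Hypothetical few-shot prompt assembly; exemplars and format are illustrative only.

# Worked exemplars the model sees before the real question.
exemplars = [
    ("Which permanent tooth typically erupts first?",
     "The mandibular first molar."),
    ("How many cusps does a maxillary first molar typically have?",
     "Four, with a cusp of Carabelli present in some cases."),
]

def build_few_shot_prompt(question: str) -> str:
    """Prepend exemplar Q&A pairs to the target question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    header = ("Answer the dental anatomy question, "
              "following the style of the examples.\n\n")
    return f"{header}{shots}\n\nQ: {question}\nA:"

print(build_few_shot_prompt(
    "Which primary tooth is typically the last to exfoliate?"))
```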
Furthermore, longitudinal research tracking the evolution of ChatGPT's abilities through continuous feedback may reveal its potential as an even more dynamic learning tool. In the future, the integration of ChatGPT into e-learning systems under expert supervision should be investigated to enhance the quality and safety of its application in health education.[6] The integration of AI models into educational settings has already been explored. For example, a recent bibliometric analysis of AI in dental education highlighted growing interest in applying AI, particularly large language models and chatbots, to transform the field.[7] The study by Iniesta and Pérez-Higueras identified a significant rise in publications on the topic, with key themes including clinical decision support systems and the use of AI to enhance dental education.[7] These findings emphasize the increasing recognition of AI's potential to improve educational outcomes in health-related fields, further supporting the value of integrating AI models like ChatGPT under expert supervision.
Conflict of Interest
None declared.
Declaration of GenAI Use
During the writing process of this article, the authors used QuillBot for language editing and checking. The authors have reviewed and edited the final text and take full responsibility for the content of the article.
Authors' Contributions
H.D.: 50% ideas, writing, analyzing, and approval.
V.W.: 50% ideas, supervision, and approval.
References
- 1 Ullah R, Shaikh MS, Shahani N, Lone MA, Fareed MA, Zafar MS. Comparing ChatGPT and dental students' performance in an introduction to dental anatomy examination: a cross-sectional study. Eur J Dent 2025; (e-pub ahead of print).
- 2 Ghasemi A, Zahediasl S. Normality tests for statistical analysis: a guide for non-statisticians. Int J Endocrinol Metab 2012; 10 (02) 486-489
- 3 Anisuzzaman DM, Malins JG, Friedman PA, Attia ZI. Fine-tuning large language models for specialized use cases. Mayo Clin Proc Digit Health 2024; 3 (01) 100184
- 4 Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature 2023; 620 (7972) 172-180
- 5 Wu J, Wu X, Qiu Z, et al. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc 2024; 31 (09) 2054-2064
- 6 Chan KS, Zary N. Applications and challenges of implementing artificial intelligence in medical education: integrative review. JMIR Med Educ 2019; 5 (01) e13930
- 7 Iniesta M, Pérez-Higueras JJ. Global trends in the use of artificial intelligence in dental education: a bibliometric analysis. Eur J Dent Educ 2025; (e-pub ahead of print).
Publication History
Article published online: 07 July 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)