DOI: 10.1055/s-0045-1812064
Comparative Benchmark of Seven Large Language Models for Traumatic Dental Injury Knowledge
Authors
Funding T.P. was supported by the Health Systems Research Institute (68–032, 68–059), the Faculty of Dentistry (DRF69_005), and the Thailand Science Research and Innovation Fund Chulalongkorn University (HEA_FF_68_008_3200_001).

Abstract
Objectives
Traumatic dental injuries (TDIs) are complex clinical conditions that require timely and accurate decision-making. With the rise of large language models (LLMs), there is growing interest in their potential to support dental management. This study evaluated the accuracy and consistency of DeepSeek R1's responses across all categories of TDIs and benchmarked its performance against other common LLMs.
Materials and Methods
DeepSeek R1 and six other LLMs (ChatGPT-4o mini, ChatGPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Flash, and Gemini 1.5 Advanced) were assessed with a specific prompt on a validated 125-item question set. The set covered five subtopics of 25 items each: general introduction, fractures, luxations, avulsions of permanent teeth, and TDIs in the primary dentition. Each model answered every item five times.
Statistical Analysis
Accuracy was calculated as the percentage of correct responses, while consistency was measured using Fleiss' kappa analysis. Kruskal–Wallis H and Dunn's post-hoc test were applied for comparisons of three or more independent groups.
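The two measures named above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names and the toy data are hypothetical, and it assumes that the five repetitions of each item play the role of the "raters" in Fleiss' kappa, which the abstract does not spell out.

```python
from collections import Counter


def accuracy_percent(responses, answer_key):
    """Percentage of responses that match the answer key."""
    correct = sum(r == k for r, k in zip(responses, answer_key))
    return 100.0 * correct / len(answer_key)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical ratings.

    Assumption: each inner list holds the repeated answers a model gave
    to one question, so the repetitions act as the raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: mean over items of the share of agreeing rating pairs.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 4 questions, 5 repeated answers each.
toy_ratings = [
    ["A", "A", "A", "A", "A"],
    ["B", "B", "B", "B", "B"],
    ["A", "A", "A", "B", "B"],
    ["C", "C", "C", "C", "C"],
]
toy_key = ["A", "B", "B", "C"]
first_run = [item[0] for item in toy_ratings]  # answers from the first repetition

print(accuracy_percent(first_run, toy_key))
print(fleiss_kappa(toy_ratings))
```

In this toy example only the third question receives mixed answers, so kappa falls below 1 even though most items are answered identically on every repetition.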
Results
DeepSeek R1 achieved the highest overall accuracy (86.4% ± 2.5%), significantly higher than that of ChatGPT-4o mini (74.7% ± 0.9%), Claude 3 Opus (75.2% ± 1.0%), and Gemini 1.5 Flash (73.85% ± 2.3%) (p < 0.0001), although its responses were the least consistent (κ = 0.694). Across all models, accuracy was notably lower for luxation injury questions (68.3% ± 3.2%).
Conclusions
LLMs achieved moderate to high accuracy, yet this was tempered by varying degrees of inconsistency, particularly in the top-performing DeepSeek model. Difficulty with complex scenarios such as luxation highlights current limitations in the diagnostic reasoning of artificial intelligence (AI). AI should be viewed as a valuable educational and clinical adjunct in dentistry for knowledge acquisition and analysis, not a replacement for clinical expertise.
Keywords
traumatic dental injuries - artificial intelligence - large language models - chatbots - DeepSeek - health care
Data Availability
Data available on request from the authors.
Author Contributions
K.T.: conceptualization, methodology, investigation, data curation, formal analysis, visualization, writing – original draft. S.K.: conceptualization, methodology, validation, writing – original draft (discussion), writing – review and editing. S.P.: data curation. Z.K.: formal analysis, writing – review and editing. T.P.: conceptualization, writing – review and editing.
Publication History
Article published online:
22 October 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Thieme Medical and Scientific Publishers Pvt. Ltd.
A-12, 2nd Floor, Sector 2, Noida-201301 UP, India
