
DOI: 10.1055/s-0045-1812064
Comparative Benchmark of Seven Large Language Models for Traumatic Dental Injury Knowledge
Authors
Funding: T.P. was supported by the Health Systems Research Institute (68–032, 68–059), the Faculty of Dentistry (DRF69_005), and the Thailand Science Research and Innovation Fund, Chulalongkorn University (HEA_FF_68_008_3200_001).
Abstract
Objectives
Traumatic dental injuries (TDIs) are complex clinical conditions that require timely and accurate decision-making. With the rise of large language models (LLMs), there is growing interest in their potential to support dental management. This study evaluated the accuracy and consistency of DeepSeek R1's responses across all categories of TDIs and benchmarked its performance against other common LLMs.
Materials and Methods
DeepSeek R1 and six other LLMs (ChatGPT-4o mini, ChatGPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Flash, and Gemini 1.5 Advanced) were assessed using a validated question set (125 items) covering five subtopics: general introduction, fractures, luxations, avulsions of permanent teeth, and TDIs in the primary dentition (25 items per group), with a specific prompt. Each model was tested with five repetitions for all items.
Statistical Analysis
Accuracy was calculated as the percentage of correct responses, while consistency was measured using Fleiss' kappa analysis. Kruskal–Wallis H and Dunn's post-hoc test were applied for comparisons of three or more independent groups.
Results
DeepSeek R1 achieved the highest overall score (86.4% ± 2.5%), statistically higher than those of ChatGPT-4o mini (74.7% ± 0.9%), Claude 3 Opus (75.2% ± 1.0%), and Gemini 1.5 Flash (73.8% ± 2.3%) (p < 0.0001), despite producing the most inconsistent responses (κ = 0.694). Across all models, accuracy was notably lower for luxation injury questions (68.3% ± 3.2%).
Conclusions
LLMs achieved moderate to high accuracy, yet this was tempered by varying degrees of inconsistency, particularly in the top-performing DeepSeek model. Difficulty with complex scenarios like luxation highlights current limitations in artificial intelligence (AI)'s diagnostic reasoning. AI should be viewed as a valuable dental educational and clinical adjunctive tool for knowledge acquisition and analysis, not a replacement for clinical expertise.
Keywords
traumatic dental injuries - artificial intelligence - large language models - chatbots - DeepSeek - health care
Introduction
Artificial intelligence (AI) has been increasingly integrated across various dental specialties.[1] In endodontics, applications of AI in diagnosis, treatment planning, management, and surgical procedures have been introduced and explored by researchers and clinicians.[2] These include detection of periapical radiolucencies,[3] definition of root and root canal configurations,[4] localization of the apical foramen, and working length determination.[5] More recently, large language models (LLMs), commonly known as chatbots, such as ChatGPT, Gemini, and Claude, have attracted widespread attention, and their adoption among health care personnel has grown. Recent surveys indicated that roughly two-thirds of practicing physicians in the United States and one-third of academic dentists worldwide reported using AI chatbots such as ChatGPT.[6] [7] These surveys also emphasized their potential in academic settings, including knowledge acquisition, patient education, research activities, and, interestingly, clinical decision-making,[6] demonstrating their potential to assist in solving clinical challenges in the endodontic field.
DeepSeek, a newly launched open-source LLM currently available as DeepSeek R1, may play an important role in health care settings, as its performance has been reported to be comparable or even superior to that of proprietary LLMs.[8] Unlike other models, it incorporates a distinct feature, "chain-of-thought" (CoT) reasoning, enabling the model to construct step-by-step logical reasoning before reaching a final decision.[9] By mimicking the process of differential diagnosis typically performed by a clinician, CoT may offer responses with human-like and well-supported reasoning, particularly in complex clinical scenarios. As another fundamental feature relevant to clinical use, DeepSeek is sometimes referred to as a "white box" because of its transparency, allowing users to inspect its underlying reasoning mechanisms and modify existing algorithms to optimize its capability for specific tasks.[9]
Notably, a previous study demonstrated that DeepSeek-3 achieved higher accuracy than ChatGPT-4o in generating differential diagnoses for oral lesions. Even without access to clinical and radiographic inputs, DeepSeek-3 produced differential diagnoses comparable to those of an oral medicine expert, outperforming ChatGPT-4o.[10] These findings suggest DeepSeek's considerable potential for application in the field of dentistry.
Despite well-established guidelines, traumatic dental injury (TDI) remains one of the dental topics in which many practitioners perceive their knowledge as inadequate.[11] [12] According to Ozden et al, the accuracy of Google Bard and ChatGPT was 64 and 51%, respectively.[13] Another study, which included additional LLMs and their respective versions, reported accuracies ranging from 46.7 to 76.7%, with ChatGPT-3.5 achieving the highest score.[14] Portilla et al subsequently assessed the accuracy and consistency of Gemini, reporting 80.8% accuracy with excellent consistency (0.96).[15] However, a recent study evaluating ChatGPT-3.5's responses in the field of pediatric dentistry found that questions related to dental trauma yielded the lowest score,[16] highlighting the complex nature of dental trauma management that may challenge the capability of LLMs to provide high-quality responses.
While recent studies have initiated the evaluation of LLMs in the context of TDIs, a comprehensive assessment of DeepSeek R1's diagnostic accuracy across all TDI classifications remains a critical gap. Establishing category-specific benchmarks is essential for advancing AI performance in dentistry. Therefore, this study aims to evaluate the diagnostic accuracy and response consistency of the DeepSeek R1 LLM across the full spectrum of TDI categories. Its performance will be benchmarked against other commonly used LLMs, and we hypothesize that all models will demonstrate significant variability in their accuracy and consistency for specific injury types.
Materials and Methods
Following a previously conducted protocol,[13] dichotomous questions regarding TDIs and their answer keys were generated based on the 2020 International Association of Dental Traumatology (IADT) guidelines for the management of TDIs.[17] [18] [19] [20] The question set covered five subtopics of TDIs according to the guidelines: general introduction, fractures, luxations, avulsions of permanent teeth, and TDIs in the primary dentition, with 25 items per group; some questions were adapted from Ozden et al.[13] The question set was thoroughly reviewed and validated by an endodontist with 11 years of experience (S.K.). The authors followed the checklist for AI in dental research by Schwendicke et al (available in [Supplementary Table S1]).[21]
The validation confirmed that the questions and answer keys were accurate, logical, and evidence-based, following the 2020 IADT guidelines for the management of TDIs. We also confirmed that the questions fully covered all relevant TDI core knowledge, including clinical and radiographic assessment and findings, general management and patient instructions, endodontic treatment and consideration, follow-up regimens, favorable outcomes, and unfavorable outcomes, in accordance with the IADT guidelines. All questions can be accessed in [Supplementary Table S2].
Seven LLMs were evaluated: ChatGPT-4o mini and ChatGPT-4o (OpenAI, San Francisco, California, United States), Claude 3.5 Sonnet and Claude 3 Opus (Anthropic, San Francisco, California, United States), Gemini 1.5 Flash and Gemini 1.5 Advanced (Google LLC, Mountain View, California, United States), and DeepSeek R1 (DeepSeek, Hangzhou, China). Six LLMs were evaluated between August and December 2024, while DeepSeek R1 was assessed separately from January to March 2025 following its release. Each LLM was prompted with: "Please answer the following Yes/No question as a dentist…," followed by a question listed in [Supplementary Table S2]. An example of a full prompt, based on an item from the general introduction section, is: "Please answer the following Yes/No question as a dentist, Enamel infraction is the only TDI that requires no follow-up." All responses were recorded in Microsoft Excel Version 16.95 (Microsoft, Redmond, Washington, United States) and subsequently scored by the research team (K.T. and S.P.). The dichotomous answers, along with the accompanying explanations, were again reviewed and verified by an experienced endodontist (S.K.). One mark was given for each correct dichotomous response, and no marks were given for incorrect dichotomous responses. Five repeated assessments were performed for each LLM using its default platform settings on Chrome for Windows Version 129.0.6628.3 (Google LLC, Mountain View, California, United States). To minimize bias, the chat history was manually cleared between questions, and separate accounts were used for each LLM. Additionally, the "Memory" feature was disabled in the personalization settings for the ChatGPT-4o and ChatGPT-4o mini assessments.
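For readers wishing to automate a comparable testing protocol, the sketch below illustrates the repeated, fresh-context prompting described above in Python. The `ask_model` callable, the `run_protocol` helper, and the stub reply are illustrative assumptions; the study itself queried each model manually through its default web interface.

```python
# Hypothetical automation of the questioning protocol described above.
# `ask_model` stands in for whichever chat client/API is available; it is an
# assumption for illustration, not part of the study, which used web interfaces.
from typing import Callable, Dict, List

PROMPT_TEMPLATE = "Please answer the following Yes/No question as a dentist, {question}"

def run_protocol(ask_model: Callable[[str], str],
                 questions: List[str],
                 repetitions: int = 5) -> List[Dict]:
    """Send each question with the standardized role prompt in a fresh context
    and record the raw responses for later scoring against the answer key."""
    records = []
    for question in questions:
        prompt = PROMPT_TEMPLATE.format(question=question)
        for rep in range(repetitions):
            # One isolated call per question and repetition mirrors clearing the
            # chat history so earlier answers cannot influence later ones.
            reply = ask_model(prompt)
            records.append({"question": question,
                            "repetition": rep + 1,
                            "response": reply})
    return records

if __name__ == "__main__":
    demo_questions = ["Enamel infraction is the only TDI that requires no follow-up."]
    # Stub client that always replies "No"; replace with a real chat client.
    print(run_protocol(lambda prompt: "No", demo_questions, repetitions=2))
```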
To compare accuracy among LLMs, the percentage of correct responses (%accuracy) was calculated overall and for each TDI subtopic. Fleiss' kappa (κ) was used to evaluate agreement of the repeated binary "Yes"/"No" responses from each LLM according to an established scale,[22] ranging from poor (κ < 0.2) to very good (κ = 0.80–1.00); higher values indicate more consistent, and hence more reliable, model responses. Descriptive statistics and Fleiss' kappa were obtained with IBM SPSS version 29.0.2.0 (IBM Corporation, Armonk, New York, United States), while the remaining analyses were performed in GraphPad Prism 10 (GraphPad Software Inc., San Diego, California, United States). Prior to analysis, the Shapiro–Wilk test was used to assess normality. One-way analysis of variance (ANOVA) was used for comparisons of normally distributed data, whereas the Kruskal–Wallis H test with Dunn's post-hoc test was applied for comparisons of three or more independent groups with skewed data. Effect size (η²) was calculated following a previously described method.[23] A p-value less than 0.05 was considered statistically significant.
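Although the analyses in this study were performed in SPSS and GraphPad Prism, the reported metrics can be illustrated with open-source tooling. The following Python sketch, using hypothetical placeholder data rather than the study's responses, shows the percentage-accuracy calculation, Fleiss' kappa on repeated binary answers, and the Kruskal–Wallis H test with the eta-squared estimate of Tomczak and Tomczak.

```python
# Illustrative re-implementation of the reported metrics (not the authors' code).
import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# Hypothetical placeholder data for one LLM: 125 items x 5 repetitions.
correct = rng.integers(0, 2, size=(125, 5))   # 1 = response matches the answer key
answers = rng.integers(0, 2, size=(125, 5))   # 1 = "Yes", 0 = "No" raw responses

# Accuracy: percentage of correct responses across all items and repetitions.
accuracy = correct.mean() * 100

# Fleiss' kappa: agreement of the five repeated binary answers per item.
counts, _ = aggregate_raters(answers)          # item x category count table
kappa = fleiss_kappa(counts)

# Kruskal-Wallis H across models (one per-repetition accuracy vector per model),
# with eta-squared computed as (H - k + 1) / (n - k) per Tomczak & Tomczak.
model_scores = [rng.normal(loc, 2, size=5) for loc in (74, 78, 86)]  # placeholder
H, p = kruskal(*model_scores)
k, n = len(model_scores), sum(len(s) for s in model_scores)
eta_squared = (H - k + 1) / (n - k)

print(f"accuracy={accuracy:.1f}%, kappa={kappa:.3f}, "
      f"H={H:.2f}, p={p:.4f}, eta2={eta_squared:.2f}")
```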
Results
The overall scores showed that DeepSeek R1 achieved the highest accuracy among the tested LLMs (86.4% ± 2.5%, 95% confidence interval [CI]: 83.3–89.5), followed by ChatGPT-4o (78.6% ± 1.0%, 95% CI: 77.3–79.9) and Gemini 1.5 Advanced (77.9% ± 1.7%, 95% CI: 75.9–80.0). The accuracy of DeepSeek R1 was statistically higher than those of ChatGPT-4o mini and Claude 3 Opus, with an even stronger difference when compared with Gemini 1.5 Flash, the basic model of the Gemini series. Although all LLMs achieved mean accuracies above 70% for the total question set, Gemini 1.5 Flash showed the lowest accuracy, with a relatively high standard deviation (73.8% ± 2.3%, 95% CI: 70.9–76.6; [Fig. 1]).


To compare the performance of the LLMs in more detail, we compared the accuracy of each model on the general introduction and on each specific type of TDI according to the 2020 IADT guidelines for the management of TDIs ([Table 1], [Fig. 2]). Notably, DeepSeek R1 showed the highest accuracy on all topics except the general introduction, with varying degrees of statistical significance.
Abbreviations: CI, confidence interval; Claude Opus, Claude 3 Opus; Claude Sonnet, Claude 3.5 Sonnet; Gemini 1.5 Adv, Gemini 1.5 Advanced; Gemini 1.5 F, Gemini 1.5 Flash; GPT-4o mini, ChatGPT-4o mini; GPT-4o, ChatGPT-4o; LL, lower limit; SD, standard deviation; TDI, traumatic dental injury; UL, upper limit.
Note: Statistical significance was evaluated using Kruskal–Wallis tests. The following p-values were obtained for each category: general introduction: p = 0.0003, η² = 0.68; fractures: p = 0.0053, η² = 0.44; luxations: p < 0.0001, η² = 0.81; avulsion: p = 0.0003, η² = 0.68; primary dentition: p < 0.0001, η² = 0.81; total: p < 0.0001, η² = 0.81. p < 0.05 indicates a statistically significant difference in LLM accuracy across the models within that category.


Considering the accuracy of responses on the general introduction, the three highest accuracies were obtained by ChatGPT-4o (92.0% ± 0.0%, 95% CI: 92.0–92.0), DeepSeek R1 (91.2% ± 5.2%, 95% CI: 84.7–97.7), and Gemini 1.5 Advanced (89.6% ± 2.2%, 95% CI: 86.9–92.3), which were significantly greater than that of Claude 3.5 Sonnet (72.0% ± 6.3%, 95% CI: 64.2–79.9).
Regarding fracture injuries, most LLMs showed comparable accuracies, except ChatGPT-4o mini (84.0% ± 4.0%, 95% CI: 79.0–88.9) and Gemini 1.5 Flash (82.4% ± 2.2%, 95% CI: 79.7–85.1), which obtained significantly lower accuracies than DeepSeek R1 (92.0% ± 4.0%, 95% CI: 87.0–97.0).
The higher accuracy of DeepSeek R1 compared with other LLMs was more pronounced statistically when considering questions regarding luxation and avulsion injuries, as shown in [Fig. 2(C, D)]. Lastly, regarding TDIs in primary dentition, Claude 3.5 Sonnet outperformed other models, with statistical significance when compared with Gemini 1.5 Flash and Gemini 1.5 Advanced, as well as DeepSeek R1.
Considering the accuracy of responses across each category of TDIs, the fractures and general introduction categories received the highest scores, both significantly higher than those for luxations, avulsion, and injuries in the primary dentition. Among all categories, the luxations category had the lowest accuracy score, which was significantly lower than the scores for fractures, general introduction, and injuries in the primary dentition ([Supplementary Table S3]).
All models showed good to very good agreement (κ = 0.694–0.924) according to the Fleiss' kappa analysis ([Table 2]). ChatGPT-4o mini delivered the most consistent responses (κ = 0.924), followed by ChatGPT-4o (0.886), Claude 3 Opus (0.871), and Gemini 1.5 Advanced (0.847), all corresponding to very good agreement. The lowest consistency was observed for DeepSeek R1 (0.694), followed by Gemini 1.5 Flash (0.698) and Claude 3.5 Sonnet (0.773). Notably, despite its best overall accuracy, DeepSeek R1 exhibited the lowest consistency.
Abbreviation: LLM, large language model.
Note: Consistency of seven LLMs in responding to traumatic dental injury questions. ChatGPT-4o mini demonstrated the highest consistency (κ = 0.924), followed by ChatGPT-4o (κ = 0.886) and Claude 3 Opus (κ = 0.871). The lowest consistency was observed in DeepSeek R1 (κ = 0.694) and Gemini 1.5 Flash (κ = 0.698). Higher Fleiss' kappa values indicate greater agreement among repeated responses, reflecting more reliable model performance.
Discussion
The overall accuracy in this study ranged from 73.8 to 86.4%, with DeepSeek R1 achieving a significantly higher overall accuracy than several of the other evaluated models and Gemini 1.5 Flash demonstrating the lowest accuracy. Therefore, the null hypothesis of no difference in accuracy among LLMs was rejected. Notably, although DeepSeek R1 recorded the highest accuracy, its consistency was rated "good" rather than "very good." The accuracy recorded in this study is higher than that reported in a previous study on the same topic using ChatGPT-3.5 and Gemini, where the correct response rate was only 57.5%.[13] This discrepancy may be attributed to differences in prompt design: the earlier study did not instruct the AI to respond as a dentist, whereas the present study employed a targeted prompt. Another study utilizing a similar prompt format reported a comparable accuracy of 80.8% with the Gemini model.[15]
The comparative performance of specific LLMs in answering dental trauma questions remains inconclusive. In our study, DeepSeek R1 achieved the highest accuracy (86.4% ± 2.5%), followed by ChatGPT-4o (78.6% ± 1.0%) and Gemini 1.5 Advanced (77.9% ± 1.7%). Similarly, Mustuloğlu and Deniz reported that ChatGPT-4.0 (95.6%) outperformed Gemini (78.3%) in responding to questions on the emergency management of avulsion injuries.[24] In contrast, Ozden et al found that Gemini achieved a higher correct-answer rate (64.0%) than ChatGPT-3.5 (51.0%),[13] and Tokgöz Kaplan et al reported that Gemini 1.5 Pro scored significantly higher for dental avulsion knowledge (4.2 ± 0.9) than ChatGPT-3.5 (3.5 ± 1.4).[25] Collectively, the current evidence remains insufficient to confirm the superiority of any single LLM for this specific task; however, their potential as adjunct clinical or educational tools is evident.
The consistency of LLMs is highly variable across models, tasks, and assessment methods. For instance, in dental trauma and endodontic knowledge, ChatGPT (version 3.5) demonstrated agreement values ranging from as low as 0.266[13] to as high as 0.987.[14] Similarly, while Gemini has previously achieved excellent agreement in dental trauma tasks,[14] [15] [26] our study observed lower consistency in the Gemini 1.5 Flash model. DeepSeek R1, which showed the lowest consistency in our evaluation (κ = 0.694), was also previously reported with moderate agreement (r = 0.615) in pediatric dentistry tasks.[27] These discrepancies reflect the dynamic and evolving nature of LLM performance, as model updates are incessantly released. Importantly, the variability in consistency underscores the need for caution in relying on LLM-generated responses, particularly in high-stakes clinical scenarios, as it carries a risk of critical errors.
When comparing AI models from the same provider, the paid ChatGPT-4o demonstrated higher accuracy than the free ChatGPT-4o mini, and the paid Gemini 1.5 Advanced outperformed the free Gemini 1.5 Flash in both accuracy and consistency; however, these trends did not reach statistical significance. Moreover, all paid AI models achieved a level of consistency classified as "very good." In comparison with previous research, the reliability of the Gemini 1.5 Flash model in the present study was rated as good (κ = 0.698), lower than the excellent reliability (r = 0.952) reported in an earlier study.[14]
Focusing on different TDI categories, the lowest accuracy was observed in the luxation category, reflecting the greater complexity involved in diagnosing and managing such injuries. This limitation illustrates a broader concern regarding the current capabilities of LLMs in complex clinical scenarios. Despite their ability to rapidly arrange, integrate, and process vast amounts of information, thereby compensating for the limitations of human memory,[28] [29] LLMs still lack the critical thinking and deep conceptual understanding required for health care decision-making.[30] This underscores the essential role of experienced clinicians, whose diagnostic reasoning and nuanced understanding remain irreplaceable in complex cases. While one study reported that LLMs performed comparably to, or even better than, clinicians in a written periodontology examination,[31] another study suggested contrasting results,[32] reflecting the variability in LLM performance across different contexts. These findings highlight the complementary nature of LLMs and clinicians, where each may help offset the other's weaknesses.
Our study reinforces the findings of previous research,[13] [14] [15] highlighting the imperfect accuracy and consistency of LLMs in TDI knowledge. Consequently, their application in clinical decision-making should be limited to experienced dental personnel who are capable of critically interpreting AI-generated responses. Knowledgeable users combined with appropriate AI use strategies are essential to maximize AI capabilities in real-world settings. Reliable and updated data sources, such as clinical guidelines, scientific papers, or official statements, selectively provided by users are a prerequisite to obtain high-quality outputs.[33] Furthermore, prompts should instruct the AI to adhere to the provided validated information, thereby minimizing the risk of hallucinations or algorithmic bias caused by the retrieval of inaccurate or misleading content.[28] [34]
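As a minimal illustration of this grounding strategy, the sketch below assembles a prompt that supplies a guideline excerpt and instructs the model to answer only from it. The template wording, the `build_grounded_prompt` helper, and the excerpt text are illustrative assumptions, not prompts validated in this study.

```python
# Illustrative guideline-grounded prompt construction (not a validated prompt).
GROUNDED_TEMPLATE = (
    "You are a dentist. Answer the Yes/No question using ONLY the guideline "
    "excerpt below. If the excerpt does not contain the answer, reply 'Uncertain'.\n\n"
    "Guideline excerpt:\n{excerpt}\n\nQuestion: {question}"
)

def build_grounded_prompt(excerpt: str, question: str) -> str:
    """Combine a validated guideline excerpt with a question so the model is
    instructed to stay within the provided source material."""
    return GROUNDED_TEMPLATE.format(excerpt=excerpt.strip(), question=question.strip())

# Example usage with an illustrative (hypothetical) excerpt and question.
print(build_grounded_prompt(
    "Avulsed permanent teeth should be replanted as soon as possible at the site of injury.",
    "Should an avulsed permanent tooth be replanted at the site of injury when possible?",
))
```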
The superior accuracy of DeepSeek R1 observed in our study is consistent with a previous report in which DeepSeek achieved a strikingly high diagnostic accuracy of 91.6% for oral lesions, higher than both ChatGPT-4o and the journal's readers.[35] This is also supported by another study showing acceptable accuracy in diagnosing various oral pathology problems.[36] However, a common limitation of previous studies is that they neither incorporated effective prompting strategies nor declared the prompts used in their experiments. Among the many prompting strategies for LLMs, from zero-shot and CoT to role-playing, role assignment has been demonstrated to be an effective method for enhancing the reasoning capabilities of ChatGPT.[37]
Although our study utilized a text-based approach to assess LLMs' capability, some LLMs allow incorporation of images into the prompt in addition to a textual description. Adding useful information, such as a clinical photograph or a radiographic image, into the prompt could also potentially enhance the accuracy of ChatGPT for the diagnosis of oral lesions.[10] [35] The synergistic effects of such multimodal prompts on diagnostic accuracy in endodontics and dental trauma have yet to be systematically investigated.
A major strength of this study is the inclusion of a wide range of AI models, encompassing both free and paid versions. The questionnaire covered all categories outlined in the IADT guidelines, including general introduction, fractures and luxations, avulsion of permanent teeth, and injuries in the primary dentition.[17] [18] [19] [20] Questions were based on the IADT guidelines and were presented using a standardized prompt. This approach was intended to elevate the quality of responses to a professional dental standard.
Closed-ended questions in a dichotomous format were utilized to ensure precise accuracy evaluation in this study, in accordance with previous studies.[13] [15] [24] Earlier research employed open-ended questions assessed using the Global Quality Scale (GQS), followed by evidence-based discussion until consensus was reached to determine accuracy.[38] [39] Despite its simplicity and comprehensiveness, the GQS is a subjective tool that depends on the evaluators. In the context of accuracy assessment, closed-ended questions eliminate these drawbacks and provide a reliable, consistent, and objective assessment across different LLMs.
A key limitation of LLMs that should be considered when interpreting this study is the discrepancy between the dichotomous answers and the accompanying explanations. This inconsistency may be attributed to two types of error. First, inaccuracies within the explanations themselves, despite a seemingly correct dichotomous answer, may reflect factual errors commonly referred to as "hallucinations" in AI-generated content. For example, DeepSeek R1 was asked a question regarding root fractures: "The apical part of a fractured root usually undergoes necrosis, leading to the need for root canal treatment." Despite the correct "No" answer, careful reading of its explanation revealed a plausible-looking but misleading rationale, stating that "…necrosis typically involves the entire pulp rather than just the apical part…," contradicting the fact that it is the coronal, not the apical, part of a fractured root that usually undergoes necrosis[20] ([Supplementary Fig. S1A]). This type of inaccuracy was recorded in 15 of 4,375 total responses (0.3%). It was not only more frequent than the second type of discrepancy, which results from complex prompts, but, crucially, it could also misguide clinical judgment and lead to harmful consequences.[30] As AI models can learn from user input, hallucinations could be reduced by providing established standard guidelines as input and prompting the AI to respond based specifically on that information. This approach is particularly applicable in areas where well-defined clinical guidelines exist, such as the management of dental trauma and infective endocarditis.[40] Second, complex prompts containing multiple sentences may lead the AI to address only part of the question while neglecting the rest, resulting in an inaccurate dichotomous answer despite a fully correct underlying explanation ([Supplementary Fig. S1B]). Therefore, it is important to interpret both the dichotomous answer and the explanation together, and using more specific and straightforward questions may help reduce such discrepancies.
Future studies should aim to compare the accuracy and consistency of AI models with clinicians of varying levels of expertise, clarifying how LLMs compare with the current standard of practice, as well as exploring potential synergies when clinicians use AI as an adjunct tool compared with unaided clinical judgment. Investigations of different prompting strategies are also strongly encouraged to inform effective instructions for LLM use in the field of dental trauma. With the advancing potential of LLMs such as DeepSeek in medical fields, longitudinal studies assessing the real-world accuracy, consistency, and practicality of this AI over time would provide robust evidence for its full integration into daily practice.
Conclusion
Our findings reveal a critical dichotomy in the current state of LLMs for dental trauma diagnostics. While models achieve moderate to high accuracy (73.8–86.4%), their utility is undermined by significant response inconsistency, a limitation particularly evident in their difficulty with complex scenarios like luxations, which points to deficits in nuanced diagnostic reasoning. Consequently, their immediate application is not as autonomous diagnostic tools, but rather as powerful educational aids for knowledge acquisition and analysis. Ultimately, this research provides foundational pilot data that frame current LLMs as a premature but promising technology, establishing a crucial benchmark for the future development of reliable, dentist-in-the-loop decision-support systems.
Conflict of Interest Statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
Acknowledgment
The authors would like to express sincere gratitude to Professor Lakshman Samaranayake for his invaluable guidance and constructive feedback in finalizing the manuscript. T.P. was supported by Chulalongkorn University Office of International Affairs and Global Network Scholarship for International Research Collaboration.
Data Availability
Data available on request from the authors.
Author Contributions
K.T.: conceptualization, methodology, investigation, data curation, formal analysis, visualization, writing—original draft. S.K.: conceptualization, methodology, validation, writing—original draft (discussion), writing—review and editing. S.P.: data curation. Z.K.: formal analysis, writing—review and editing. T.P.: conceptualization, writing—review and editing.
References
- 1 Shan T, Tay FR, Gu L. Application of artificial intelligence in dentistry. J Dent Res 2021; 100 (03) 232-244
- 2 Samaranayake L, Tuygunov N, Schwendicke F. et al. The transformative role of artificial intelligence in dentistry: a comprehensive overview. part 1: fundamentals of AI, and its contemporary applications in dentistry. Int Dent J 2025; 75 (02) 383-396
- 3 Ekert T, Krois J, Meinhold L. et al. Deep learning for the radiographic detection of apical lesions. J Endod 2019; 45 (07) 917-922.e5
- 4 Hiraiwa T, Ariji Y, Fukuda M. et al. A deep-learning artificial intelligence system for assessment of root morphology of the mandibular first molar on panoramic radiography. Dentomaxillofac Radiol 2019; 48 (03) 20180218
- 5 Latke V, Narawade V. Measuring endodontic working length using artificial intelligence. Front Health Informat 2024;13(2)
- 6 Uribe SE, Maldupa I, Kavadella A. et al. Artificial intelligence chatbots and large language models in dental education: worldwide survey of educators. Eur J Dent Educ 2024; 28 (04) 865-876
- 7 Henry TA. 2 in 3 physicians are using health AI—up 78% from 2023. American Medical Association. Accessed May 14, 2025 at: https://www.ama-assn.org/practice-management/digital-health/2-3-physicians-are-using-health-ai-78-2023
- 8 Egger J, De Paiva LF, Luijten G. et al. Is DeepSeek-R1 a game changer in healthcare?-A seed review. Authorea Preprints. 2025
- 9 Gibney E. China's cheap, open AI model DeepSeek thrills scientists. Nature 2025; 638 (8049) 13-14
- 10 Hassanein FEA, El Barbary A, Hussein RR. et al. Diagnostic performance of ChatGPT-4o and DeepSeek-3 differential diagnosis of complex oral lesions: a multimodal imaging and case difficulty analysis. Oral Dis 2025; (e-pub ahead of print)
- 11 Akhlaghi N, Nourbakhsh N, Khademi A, Karimi L. General dental practitioners' knowledge about the emergency management of dental trauma. Iran Endod J 2014; 9 (04) 251-256
- 12 Krastl G, Filippi A, Weiger R. German general dentists' knowledge of dental trauma. Dent Traumatol 2009; 25 (01) 88-91
- 13 Ozden I, Gokyar M, Ozden ME, Sazak Ovecoglu H. Assessment of artificial intelligence applications in responding to dental trauma. Dent Traumatol 2024; 40 (06) 722-729
- 14 Kuru HE, Aşık A, Demir DM. Can artificial intelligence language models effectively address dental trauma questions?. Dent Traumatol 2025; 41 (05) 567-580
- 15 Portilla ND, Garcia-Font M, Nagendrababu V, Abbott PV, Sanchez JAG, Abella F. Accuracy and consistency of gemini responses regarding the management of traumatized permanent teeth. Dent Traumatol 2025; 41 (02) 171-177
- 16 Bayraktar Nahir C. Can ChatGPT be guide in pediatric dentistry?. BMC Oral Health 2025; 25 (01) 9
- 17 Levin L, Day PF, Hicks L. et al. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: general introduction. Dent Traumatol 2020; 36 (04) 309-313
- 18 Fouad AF, Abbott PV, Tsilingaridis G. et al. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 2. Avulsion of permanent teeth. Dent Traumatol 2020; 36 (04) 331-342
- 19 Day PF, Flores MT, O'Connell AC. et al. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 3. Injuries in the primary dentition. Dent Traumatol 2020; 36 (04) 343-359
- 20 Bourguignon C, Cohenca N, Lauridsen E. et al. International Association of Dental Traumatology guidelines for the management of traumatic dental injuries: 1. Fractures and luxations. Dent Traumatol 2020; 36 (04) 314-330
- 21 Schwendicke F, Singh T, Lee J-H. et al. Artificial intelligence in dental research: checklist for authors, reviewers, readers. J Dent 2021; 107: 103610
- 22 Altman DG. Practical Statistics for Medical Research. New York, NY: Chapman and Hall/CRC; 1990
- 23 Tomczak M, Tomczak E. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences 2024; 21 (01) 19-25
- 24 Mustuloğlu Ş, Deniz BP. Evaluation of Chatbots in the emergency management of avulsion injuries. Dent Traumatol 2025; 41 (04) 437-444
- 25 Tokgöz Kaplan T, Cankar M. Evidence-based potential of generative artificial intelligence large language models on dental avulsion: ChatGPT versus Gemini. Dent Traumatol 2025; 41 (02) 178-186
- 26 Shirani M. Comparing the performance of ChatGPT 4o, DeepSeek R1, and Gemini 2 Pro in answering fixed prosthodontics questions over time. J Prosthet Dent 2025 (in press)
- 27 Mukhopadhyay A, Mukhopadhyay S, Biswas R. Evaluation of large language models in pediatric dentistry: a Bloom's taxonomy-based analysis. Folia Med (Plovdiv) 2025; 67 (04) e154338
- 28 Temsah A, Alhasan K, Altamimi I. et al. DeepSeek in healthcare: revealing opportunities and steering challenges of a new open-source artificial intelligence frontier. Cureus 2025; 17 (02) e79221
- 29 Liang W, Chen P, Zou X. et al. DeepSeek: the “Watson” to doctors-from assistance to collaboration. J Thorac Dis 2025; 17 (02) 1103-1105
- 30 Mohammad-Rahimi H, Ourang SA, Pourhoseingholi MA, Dianat O, Dummer PMH, Nosrat A. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endod J 2024; 57 (03) 305-314
- 31 Ramlogan S, Raman V, Ramlogan S. A pilot study of the performance of Chat GPT and other large language models on a written final year periodontology exam. BMC Med Educ 2025; 25 (01) 727
- 32 Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: a pilot study. J Dent 2024; 144: 104938
- 33 Kayaalp ME, Prill R, Sezgin EA, Cong T, Królikowska A, Hirschmann MT. DeepSeek versus ChatGPT: Multimodal artificial intelligence revolutionizing scientific discovery. From language editing to autonomous content generation-Redefining innovation in research and practice. Knee Surg Sports Traumatol Arthrosc 2025; 33 (05) 1553-1556
- 34 Mohammad-Rahimi H, Setzer FC, Aminoshariae A, Dummer PMH, Duncan HF, Nosrat A. Artificial intelligence chatbots in endodontic education-concepts and potential applications. Int Endod J 2025
- 35 Diniz-Freitas M, Diz-Dios P. DeepSeek: another step forward in the diagnosis of oral lesions. J Dent Sci 2025; 20 (03) 1904-1907
- 36 Kaygisiz ÖF, Teke MT. Can deepseek and ChatGPT be used in the diagnosis of oral pathologies?. BMC Oral Health 2025; 25 (01) 638
- 37 Kong A, Zhao S, Chen H. et al. Better zero-shot reasoning with role-play prompting. arXiv preprint arXiv:2308.07702, 2023. Doi: 10.48550/arXiv.2308.07702
- 38 Guven Y, Ozdemir OT, Kavan MY. Performance of artificial intelligence chatbots in responding to patient queries related to traumatic dental injuries: a comparative study. Dent Traumatol 2025; 41 (03) 338-347
- 39 Johnson AJ, Singh TK, Gupta A. et al. Evaluation of validity and reliability of AI Chatbots as public sources of information on dental trauma. Dent Traumatol 2025; 41 (02) 187-193
- 40 Rewthamrongsris P, Burapacheep J, Trachoo V, Porntaveetus T. Accuracy of large language models for infective endocarditis prophylaxis in dental procedures. Int Dent J 2025; 75 (01) 206-212
Publication History
Article published online:
22 October 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Thieme Medical and Scientific Publishers Pvt. Ltd.
A-12, 2nd Floor, Sector 2, Noida-201301 UP, India