Open Access
CC BY 4.0 · European Journal of General Dentistry
DOI: 10.1055/s-0045-1809617
Original Article

Assessing the Reliability of ChatGPT and Gemini in Identifying Relevant Orthodontic Literature

Saeed N. Asiri
Department of Pediatric Dentistry, College of Dentistry, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia

Funding: The author appreciates the support and funding provided by Prince Sattam Bin Abdulaziz University for the entire research project.
 

Abstract

Objectives

Artificial intelligence (AI)-based solutions offer potential remedies to the issues encountered in conventional reference identification methods. However, the effectiveness of these AI models in assisting orthodontic experts in discovering relevant material is unknown. The purpose of this study was to assess the validity of ChatGPT and Google Gemini in delivering references for orthodontic literature studies.

Materials and Methods

This study utilized ChatGPT models (3.5 and 4) and Gemini to search for topics in orthodontics and specific subdomains. To verify the existence and precision of the cited references, several reputable sources were employed, including PubMed, Google Scholar, and Web of Science.

Statistical Analysis

Descriptive statistics were employed to present the data numerically and as percentages, focusing on three aspects: completeness, accuracy, and fabrication. Reliability analysis was conducted using Cronbach's α, and the results were visually presented as a correlation heat map.

Results

Out of all references, only 15.76% were correct, whereas 71.92% were fake or fabricated references and 12.32% were inaccurate references. Gemini had the highest proportion of correct references (36.36%), followed by GPT 3.5 (15.76%) and GPT 4 (0.95%); this difference was statistically significant (p < 0.01). The reliability score of 0.418 indicates low-to-moderate consistency in the accuracy of the references.

Conclusion

While Gemini showed better performance than the GPT models, significant limitations remain in all three models in reference generation. These findings advocate for the balanced and cautious use of AI tools in academic research related to orthodontics, emphasizing human validation of the references and training of dental professionals and researchers in the efficient use of AI tools.


Introduction

Scientific referencing is one of the most important aspects of academic discourse because citing and referring to other authors' work is a rhetorical strategy for demonstrating knowledge and familiarity with the subject matter, persuading and gaining acceptance of one's academic arguments by linking one's findings with the scientific community, constructing one's identity as scientific author, and supporting, verifying, or justifying scientific arguments.[1] [2] [3] Referencing is crucial for sustaining research integrity since it prevents plagiarism, increases the validity of evidence, and helps researchers to trace the source of cited information for further inquiry into the work. Thus, proper referencing is needed not only for scholarly rigor but also for adhering to professional and academic ethical norms.[4] Nonetheless, referencing is frequently thought of as a laborious and time-consuming task.[5] The researcher must carefully find the study articles, arrange them, and format them according to journal requirements.[4] The necessity to guarantee the accuracy and consistency of referencing material, including author, journal, issue, and volume details, among other things, makes this task more complicated.[6]

Identifying references for scientific research literature reviews using current approaches can be challenging and inefficient across many scientific fields, particularly those that are rapidly expanding, such as orthodontics. Orthodontics encompasses a wide range of related subjects, including biomechanics and craniofacial development, as well as modern advances in the field such as clear aligners and digital orthodontics. This breadth, combined with the continuous inflow of novel research and innovation, requires an approach to literature identification that is both comprehensive and up-to-date.[7] However, researchers typically conduct manual searches in databases such as PubMed or Scopus, utilizing keywords and filters to identify relevant articles.[5] [8] This approach often necessitates multiple searches and the evaluation of numerous articles, which can be overwhelming and may result in overlooking crucial sources.[7] Furthermore, the availability and accessibility of literature differ across various databases and journals. Researchers may encounter restricted access, including paywalls that require subscriptions for full-text article access.[9] While institutional access, open access platforms, and article purchases mitigate such restrictions, their constraints may still limit the capacity to rapidly assess artificial intelligence (AI)-generated references. This constraint can result in incomplete literature reviews and potential bias in reference selection.[10]

AI-based solutions, such as the Chat Generative Pre-trained Transformer (ChatGPT), offer potential remedies to the issues encountered in conventional reference identification methods, potentially improving both efficiency and accuracy.[11] [12] [13] Developed by OpenAI, ChatGPT is an advanced natural language processing model that utilizes AI to comprehend and generate human-like responses through text or voice interactions after a prompt request.[14] Its applications extend beyond basic language tasks, demonstrating value in initial idea generation, data analysis, literature reviews, coding assistance, organizing scientific content, and manuscript drafting, thus providing researchers with a time-efficient tool to streamline their work.[13] [14] [15] Similarly, Google developed Gemini, a multimodal large language model (LLM) and AI-powered assistant, to provide AI-based writing assistance.[16] This includes its use in scientific writing for a variety of functions, such as brainstorming and developing potential research questions, research drafting, paraphrasing and summarizing, generating conclusions, abstracts, and keywords, and providing other automated feedback to user queries.[17] However, the use of these generative AI models in a research context is not without limitations. One major limitation is overreliance, which may reduce critical thinking and writing skills, particularly among early-career researchers. Another is the fabrication of information, particularly when summarizing complex content or generating references, which can compromise scholarly work. Ethical concerns regarding plagiarism, authorship, and transparency in the conduct of research also arise. ChatGPT 3.5, ChatGPT-4, and Gemini have been evaluated individually or in comparison with each other in a few studies to determine their accuracy in providing valid references.[17] [18] [19] [20] [21] [22] However, some studies have found the models to be untrustworthy due to accuracy issues in referencing.[17] [18] [19] [20] One comparative study found that Gemini generated significantly more accurate responses than ChatGPT in the medical and dental research fields,[21] [22] while another found the opposite.[23] These findings point to a potential accuracy hierarchy among AI chatbots. Moreover, evidence on the effectiveness of these AI models in assisting orthodontic experts in discovering relevant materials is limited. A recent study additionally examined the accuracy of ChatGPT's answers to clinical questions and cases on interceptive orthodontics. While the results showed that the AI's replies were very precise and comprehensive, as well as capable of resolving difficult clinical cases, they were not entirely correct.[24] This points toward the critical need to assess AI tools before integrating them into critical aspects of academic or clinical work. Therefore, the purpose of this study was to assess the validity of ChatGPT and Google Gemini in delivering references for orthodontic literature studies. Researchers and dental clinicians can acquire insights into the possible advantages and limitations of AI models by evaluating their performance, which can help AI researchers to improve AI-assisted tools for literature review processes.


Materials and Methods

Search Strategy and Criteria

This study utilized ChatGPT and Gemini to carry out the search. Specifically, the Generative Pre-trained Transformer models (GPT-3.5 and GPT-4) within ChatGPT were employed to search for topics in orthodontics and specific subdomains. The selected orthodontic areas searched in the models included (1) general orthodontics, (2) malocclusion classification, (3) treatment modalities, (4) orthodontic biomechanics, (5) clear aligner therapy, (6) fixed appliances, (7) surgical orthodontics, (8) retention protocols, (9) interdisciplinary treatments, (10) orthodontic outcomes, and (11) patient-centered care.

The models were instructed to "Please provide the references in Vancouver style and their links in recent literature on...(name of the topic)." For each reference generated, six essential elements were recorded: (1) authors, (2) reference titles, (3) journal names, (4) publication years, (5) digital object identifiers (DOIs), and (6) reference links.
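For illustration, each generated reference and its six elements could be captured in a simple structured record; the sketch below is a hypothetical representation, not the study's actual data-collection template.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative record for one AI-generated reference; field names are
# assumptions based on the six elements described above.
@dataclass
class GeneratedReference:
    model: str             # "ChatGPT 3.5", "ChatGPT 4", or "Gemini"
    subdomain: str         # one of the 11 orthodontic subdomains
    authors: str
    title: str
    journal: str
    year: Optional[int]
    doi: Optional[str]
    link: Optional[str]
```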


Reference Validation and Data Extraction

To verify the existence and precision of the cited references, the provided references were searched in four reputable scientific databases: Google Scholar, PubMed, Scopus, and Web of Science. Each reference was first searched for using the provided DOI in PubMed, a widely respected biomedical literature database. If found in PubMed, the reference was considered existing and genuine. When PubMed searches were unsuccessful or when DOIs were incomplete or missing, the Scopus and Web of Science databases, which are large abstract and citation databases covering articles, books, and conference proceedings across various disciplines, were used as supplementary resources to locate the references. If a reference was not found in any of these three databases, the Google Scholar search engine was used for additional cross-referencing. Google Scholar indexes scholarly articles from various fields, encompassing both medical and nonmedical literature. Searches were conducted using reference titles, authors' names, and other pertinent information to locate and confirm the validity of the references. Because this study focused on evaluating bibliographic accuracy, full-text access was not needed to check the references. However, when further confirmation was required, institutional access was also utilized. A reference was deemed authentic if all six components (authors' names, reference titles, journal names, publication years, DOIs, and reference links) were accurate. The information for these six components was extracted from the references into a Microsoft Excel spreadsheet for the validation process. Two authors independently verified the references to ensure the correctness and accuracy of the validation process. This comprehensive approach aimed to thoroughly evaluate the authenticity and accuracy of ChatGPT- and Gemini-generated references within an academic context.
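The database cascade described above can be summarized as a simple decision sequence. The sketch below is a minimal illustration; the three lookup functions are hypothetical stand-ins for the manual searches actually performed in the study.

```python
# Minimal sketch of the validation cascade; the lookup stubs below are
# placeholders for the manual database searches, not automated queries.
def found_in_pubmed(ref) -> bool:
    return False  # placeholder for a manual PubMed search by DOI

def found_in_scopus_or_wos(ref) -> bool:
    return False  # placeholder for Scopus / Web of Science searches

def found_in_google_scholar(ref) -> bool:
    return False  # placeholder for a Google Scholar title/author search

def locate_reference(ref) -> str:
    """Report which source, if any, confirms that the reference exists."""
    if getattr(ref, "doi", None) and found_in_pubmed(ref):
        return "PubMed"
    if found_in_scopus_or_wos(ref):
        return "Scopus/Web of Science"
    if found_in_google_scholar(ref):
        return "Google Scholar"
    return "not found"
```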


Research Outcomes

The primary objective of this investigation was to assess the validity of ChatGPT- and Gemini-generated references within an academic context. Validity encompassed the authenticity of references, necessitating accurate authors' names, article titles, journal names, publication years, and DOIs/links. References were classified as "fabricated or nonexistent" if the reference did not exist or all citation elements were incorrect. Existing references with one or more inaccurate components were categorized as "incorrect/inaccurate." References with a complete and accurate set of all six elements were designated as "correct/accurate." In addition to evaluating accuracy, the study examined the frequency of incorrect components within each reference and across the various orthodontic subdomains.
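A minimal sketch of this classification rule, assuming each of the six elements has already been marked as correct or not during manual verification and the existence of the reference has been established:

```python
ELEMENTS = ("authors", "title", "journal", "year", "doi", "link")

def classify_reference(exists: bool, element_correct: dict) -> str:
    """Apply the outcome definitions above to one verified reference.

    `exists` records whether the reference was located in any database;
    `element_correct` maps each of the six elements to True/False.
    """
    n_correct = sum(bool(element_correct.get(e)) for e in ELEMENTS)
    if not exists or n_correct == 0:
        return "fabricated/nonexistent"
    if n_correct == len(ELEMENTS):
        return "correct/accurate"
    return "incorrect/inaccurate"

# Example: an existing paper cited with the wrong publication year.
print(classify_reference(True, {e: (e != "year") for e in ELEMENTS}))
# -> incorrect/inaccurate
```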


Data Analysis

Descriptive statistics were employed to present the data numerically and as percentages. All statistical analyses were conducted using Jamovi Version 2.6.19 (The Jamovi Project, Australia), facilitating effective data summarization and analysis. Descriptive statistics were used to evaluate the references generated by ChatGPT Versions 3.5 and 4 and Gemini, focusing on three key aspects: completeness, accuracy, and fabrication. Additionally, these three aspects were assessed separately for each of the 11 orthodontic subdomains to determine whether certain topics influenced accuracy. Bar graphs were used to show the proportion of accurate references in each subdomain. A reliability score was calculated using Cronbach's α to measure the internal consistency of the models in generating accurate references. A Cronbach's α value of < 0.5 was considered unacceptable, whereas a value above 0.7 was considered acceptable. The results are presented visually as a correlation heat map to illustrate the relationship in accuracy among the three models.
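As an illustration of how such an analysis could be reproduced, the sketch below computes Cronbach's α from a binary accuracy matrix (queries × models) and draws a correlation heat map. The matrix shown is hypothetical and for demonstration only; it is not the study data, and the study itself used Jamovi rather than Python.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a cases x items matrix
    (here: queries x models, 1 = correct reference, 0 = otherwise)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical accuracy matrix: rows are queries, columns are models.
scores = pd.DataFrame({
    "GPT-3.5": [1, 0, 0, 1, 0],
    "GPT-4":   [1, 0, 0, 0, 0],
    "Gemini":  [1, 1, 0, 1, 0],
})

print(f"Cronbach's alpha: {cronbach_alpha(scores):.3f}")
sns.heatmap(scores.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map of model accuracy")
plt.show()
```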



Results

A total of 203 references were generated by the AI models, with ChatGPT 4 contributing 52% (n = 105) of the references, while Gemini generated 27% (n = 55) and ChatGPT 3.5 generated 21% (n = 43) ([Fig. 1]).

Fig. 1 References generated by the artificial intelligence (AI) models.

Out of all references generated by the AI models, only 15.76% were correct, whereas 71.92% were fake or fabricated and 12.32% were inaccurate. Gemini had the highest proportion of correct references (36.36%), followed by GPT 3.5 (15.76%) and GPT 4 (0.95%); this difference was statistically significant (p < 0.01). On the other hand, ChatGPT 4 produced the highest proportion of fabricated references (99.05%), followed by GPT 3.5 (53.49%) and Gemini (34.55%) (p < 0.01). In terms of inaccurate references, Gemini accounted for the highest proportion (29.09%), followed by ChatGPT 3.5 (20.93%) (p = 0.01).
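As an illustrative reanalysis, the counts reported in [Table 1] can be compared across models with a chi-square test of independence; the sketch below assumes this test, which is a common choice for such proportions but is not explicitly named in the methods.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from Table 1 (rows: correct, fabricated, inaccurate;
# columns: ChatGPT 3.5, ChatGPT 4, Gemini).
counts = np.array([
    [11,   1, 20],   # correct references
    [23, 104, 19],   # fake/fabricated references
    [ 9,   0, 16],   # inaccurate/incomplete references
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```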

The reliability analysis across the models yielded a reliability score of 0.418, indicating low-to-moderate consistency in the accuracy of the references ([Table 1]). The correlation heat map shows a weak correlation between GPT 3.5 and Gemini, while GPT 4 has negligible correlation with both models, indicating distinct output across the models ([Fig. 2]).

Table 1

Proportion of correct/fake/inaccurate responses generated in each model, n (%)

| Response category | ChatGPT 3.5 | ChatGPT 4 | Gemini | Total references | p-value | Reliability score |
|---|---|---|---|---|---|---|
| Correct references | 11 (23.58) | 1 (0.95) | 20 (36.36) | 32 (15.76) | < 0.01 | 0.418 |
| Fake/fabricated references | 23 (53.49) | 104 (99.05) | 19 (34.55) | 146 (71.92) | < 0.01 | |
| Inaccurate/incomplete references | 9 (20.93) | 0 (0) | 16 (29.09) | 25 (12.32) | 0.01 | |

Fig. 2 Correlation heat map.

In terms of the content of the references generated by the LLMs, DOIs or links were fabricated by ChatGPT 3.5 and 4, whereas Gemini provided URLs of other papers, mentioned "invalid URL," or did not provide any link in some instances. In some cases, the models also fabricated the author names, journal, and year. Examples of such instances are presented in [Table 2].

Table 2

Examples of fake/accurate/incomplete references in the AI models

Query/Prompt: "Please provide the references in Vancouver style and their links in recent literature on malocclusion classification"

GPT-3.5:
- Macari S, et al. Orthopedic devices for skeletal Class III malocclusion treatment in growing patients: A systematic review. J Clin Med. 2024;13(23):7141. https://www.mdpi.com/2077-0383/13/23/7141 (incorrect author name)
- Lo Giudice A, et al. Three-stage non-surgical treatment of skeletal Class III malocclusion in adolescents: A long-term follow-up report. J Clin Med. 2024;13(15):5637.
- Guo Y, et al. Classification and characterization of facial asymmetry in skeletal malocclusions using CBCT. Am J Orthod Dentofacial Orthop. 2024;145(5):589–598. https://www.mdpi.com/2076-3417/14/17/8095
- Pavoni C, et al. Genetic predisposition in skeletal Class III malocclusions: A systematic review. Eur J Orthod. 2023;45(3):239–250. https://www.mdpi.com/2077-0383/13/23/7141
- Al-Bitar Z, et al. Impact of malocclusion classification on oral health-related quality of life. BMC Oral Health. 2024;24:156. https://www.mdpi.com/2077-0383/13/23/7141

GPT-4:
- Baccetti T, Franchi L, McNamara JA Jr. The effect of Class II malocclusion on craniofacial growth in the early mixed dentition. Am J Orthod Dentofacial Orthop. 2023;163(3):413–419. 10.1016/j.ajodo.2022.07.015
- Ghoneima A, El-Bialy T. A novel classification system for malocclusion based on three-dimensional cone-beam computed tomography. J Orthod Sci. 2022;11(4):130–136. 10.4103/jos.JOS_88_22.
- Al-Moghrabi D, Muntaser I, Al-Dosary M. Comparison of Angle's classification and the IOTN (Index of Orthodontic Treatment Need) for malocclusion diagnosis. J Dent Sci. 2022;17(5):287–293. 10.1016/j.jdsci.2022.07.001.
- Kokich VG, Kokich VO. Malocclusion classification: Revisited and clarified. Orthodontics (Chic.). 2023;54(1):28–35. 10.1016/j.jdsci.2022.07.001
- Zaib M, Huda N, Irfan M. Evaluation of malocclusion using the new AO classification system and its correlation with cephalometric parameters. J Clin Orthod. 2023;57(10):610–617. 10.2319/021922-105.1

Gemini:
- Kharbanda S, Singh R, Singh P, et al. Malocclusion and occlusal traits among dental and nursing students of Seven North-East states of India. J Indian Soc Pedodontic Prev Dent. 2023;41(4):306–311. https://pubmed.ncbi.nlm.nih.gov/37715698/
- Adel A, Allareddy V, Al-Hezaimi K, et al. Future of Orthodontics—A Systematic Review and Meta-Analysis on the Emerging Trends in This Field. J Dent Res. 2021;100(12):1449–1461. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9861462
- Graber LW, Vanarsdall RL, Vig KGO. Orthodontics: Current Principles and Techniques. 5th ed. St. Louis, MO: Mosby; 2015. (year incorrect)
- Burstone CJ, Burstone MT. Essentials of Orthodontics. 5th ed. St. Louis, MO: Mosby; 2018.
- American Association of Orthodontists. Clinical Practice Guidelines. [invalid URL removed]

Abbreviations: AI, artificial intelligence; DOI, digital object identifier.

Note: In the original table, sky blue indicates nonexistent DOIs/links and violet indicates fabricated author names/titles/journals.


ChatGPT 3.5 showed moderate inaccuracies ranging from 33.33 to 75%, with the highest proportion of incomplete references in the field of retention protocols (75%), followed by treatment modalities (50%) and malocclusion classification (40%). ChatGPT 3.5 also showed a varied pattern of fabricated references, ranging from 0 to 100%: some fields, such as orthodontic biomechanics and clear aligner therapy, showed 100% fabrication, whereas treatment modalities showed only a 20% fabrication rate, and interdisciplinary treatment and patient-centered care showed 0%. The model provided accurate references in only two fields, treatment modalities (28%) and general orthodontics (80%) ([Supplementary Fig. S1], available online only).

ChatGPT 4 produced fabricated references across all fields of orthodontic literature, ranging from 87.5% (malocclusion classification) to 100% (all remaining fields) ([Supplementary Fig. S2], available online only).

Gemini demonstrated better performance compared with the other AI models. Fabricated references were below 50% in the majority of fields, except patient-centered care (100%). Incomplete references across the different orthodontic fields ranged from 50 to 60%, with the highest proportion in general orthodontics (60%), followed by malocclusion classification (57.14%). Gemini showed 100% correct references in the fields of orthodontic outcomes, retention protocols, clear aligner therapy, and orthodontic biomechanics ([Supplementary Fig. S3], available online only).


Discussion

There has been an explosion of research into the accuracy and reliability of AI responses in the scientific arena. This indicates that the usage of AI is rapidly increasing in a variety of fields, including dental education and research. Given its frequent use, it is vital to investigate the accuracy of the information and sources generated by these models. For that reason, this study sought to assess the reliability and correctness of references generated by various AI models (ChatGPT-3.5, GPT-4, and Gemini) on specific orthodontic topics. Our findings revealed that the vast majority of the references generated by the models were fabricated or inaccurate, with weak reliability across the three models. Such findings are concerning considering that trustworthiness is a key component of the dissemination of scientific information.[25] [26]

The data show that the three AI models vary in performance. Gemini performed better in producing accurate references (36.36%) than GPT 3.5 (15.76%) and GPT 4 (0.95%). These findings are consistent with Omar et al,[21] who compared references generated by Gemini with those generated by ChatGPT-4 in the medical field and discovered that Gemini surpassed GPT-4 in reference precision, with an accuracy of 68% versus 49.2% for GPT-4.[20] Pirkle et al found no significant differences in the performance of ChatGPT and Gemini in terms of generated citations, with both producing references with errors in publication year, journal, and fabricated authors.[27] These differences could be attributed to differences in the architectures of ChatGPT and the Gemini model. A fundamental element of Gemini is retrieval-augmented generation, which combines information retrieval from the Google platform with text generation to produce more accurate and contextually relevant results. ChatGPT, on the other hand, is based on reinforcement learning with human feedback, which improves answers based on users' instructions. This may limit its capacity to provide accurate references due to a lack of integrated retrieval mechanisms, making it better suited for general-purpose adaptation.[28]
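To make this architectural contrast concrete, the following toy sketch illustrates the retrieval-augmented idea: citations are drawn only from documents actually retrieved from an index, rather than generated free-form. It is a conceptual illustration, not Gemini's actual pipeline, and the corpus entries and identifiers are invented placeholders.

```python
# Conceptual sketch of retrieval-augmented citation (not Gemini's real
# implementation); CORPUS and its DOIs are invented placeholders.
CORPUS = {
    "doi:10.xxxx/example1": "Clear aligner therapy outcomes in adults",
    "doi:10.xxxx/example2": "Classification of malocclusion in mixed dentition",
}

def retrieve(query: str, k: int = 2):
    """Toy keyword retrieval over an indexed corpus."""
    hits = [(doc_id, text) for doc_id, text in CORPUS.items()
            if any(word in text.lower() for word in query.lower().split())]
    return hits[:k]

def answer_with_citations(query: str) -> str:
    hits = retrieve(query)
    # The generator is constrained to cite only retrieved items, so every
    # citation corresponds to a document that exists in the index.
    citations = ", ".join(doc_id for doc_id, _ in hits)
    return f"[model-generated answer grounded in: {citations}]"

print(answer_with_citations("malocclusion classification"))
```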

Another significant finding of the study is that ChatGPT 4 consistently created fabricated references (99.05%) across all orthodontic topics, whereas ChatGPT 3.5 produced 53.49% fabricated and 20.93% incomplete references. Bhattacharyya et al[29] reported that ChatGPT 3.5 generated 47% fake and 46% inaccurate references. This high proportion of fake references in the ChatGPT models suggests an intrinsic problem with the reference generation algorithm or with how the model handles reference data from the large data set on which it is trained.[30] [31] [32] Interestingly, these findings contradict earlier studies undertaken in other domains. A study comparing ChatGPT 3.5 and 4 in producing otolaryngology-related references found that ChatGPT-4.0 outperformed GPT 3.5.[33] Similarly, three studies[32] [34] [35] found that ChatGPT 4 performed better than its previous version (GPT 3.5). This disparity in findings suggests that variations in the accuracy and reliability of AI-generated references may be attributable to differences in the subject matter studied, emphasizing the importance of domain-specific evaluation of AI tools. However, the difference in reliability and accuracy between the two versions was minimal in this study, with fabricated citations persisting in both. Moreover, the consistency in performance across the two versions suggests that the improvements made in ChatGPT 4 might be more beneficial in general use cases rather than in highly specialized fields where the knowledge base is critical.

While Gemini performed better overall in generating accurate references (36.36%), it still created some incomplete (29.09%) and fake (34.55%) references, notably for topics on malocclusion classification, general orthodontics, treatment modalities, patient-centered care, and so on. In addition, the model's references consisted primarily of books or clinical practice guidelines that were repeated across multiple queries. Similarly, Chelli et al[36] observed that Bard, the previous version of Gemini, appeared to take a try-and-repeat approach, producing many versions of publications. This suggests that Gemini depends mainly on a smaller number of reputable sources rather than providing diverse or topic-specific references. While this strategy may improve accuracy in particular contexts, it also indicates a lack of adaptability and depth in generating nuanced references for specific research needs. Another positive aspect of Gemini that sets it apart from the ChatGPT models is that it displays "Gemini can make mistakes, so double-check it" as a footnote in its interface. Such a cautionary statement can enhance users' trust by promoting transparency and encouraging critical evaluation of AI-generated responses, thus fostering more informed and responsible use of AI in scientific writing and referencing.

Overall, the findings showed that while AI models can be effective tools for assisting dental academicians and researchers in summarizing, paraphrasing, and exploring orthodontic-related content, the citations generated by these models are not fundamentally reliable and thus require human validation, particularly for author details, year of publication, title, and DOI. LLMs have been found to produce fictitious information, a phenomenon known as AI hallucination.[37] This includes instances in which the models produce citations with incorrect DOIs, journal names, or author details.[38] [39] Users must double-check the authenticity of the DOI even when the models provide one, as it frequently diverges from the correct DOI, which can result in referenced sources that are unavailable or nonexistent. Therefore, user verification is essential when using AI models to maintain the integrity and dependability of scientific communication, particularly in academic and research settings.[36] [40]
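One practical way to automate part of such a check is to resolve each DOI against the public Crossref REST API, as sketched below. This is offered as an assumption-laden illustration: the present study verified references manually against PubMed, Scopus, Web of Science, and Google Scholar, and a DOI absent from Crossref may still be registered elsewhere (e.g., DataCite), so an automated lookup complements rather than replaces manual validation.

```python
from typing import Optional
import requests

CROSSREF = "https://api.crossref.org/works/"

def doi_registered(doi: str) -> bool:
    """Return True if the DOI is registered with Crossref (HTTP 200)."""
    resp = requests.get(CROSSREF + doi, timeout=10)
    return resp.status_code == 200

def registered_title(doi: str) -> Optional[str]:
    """Fetch the title Crossref holds for a DOI so it can be compared
    with the title the AI model reported; None if the DOI is unknown."""
    resp = requests.get(CROSSREF + doi, timeout=10)
    if resp.status_code != 200:
        return None
    titles = resp.json().get("message", {}).get("title", [])
    return titles[0] if titles else None

# Example with one DOI produced by a model in Table 2:
print(doi_registered("10.1016/j.ajodo.2022.07.015"))
```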

These findings have implications for both research and practice. From a research perspective, more comparative studies are needed across various disciplines to evaluate the performance of these models in other fields. AI researchers must investigate the underlying mechanisms behind the fabrication and inaccuracy issues in these models, develop algorithms to minimize hallucination, and devise methods to improve referencing, such as an integrated research library.[5] [36] Moreover, AI researchers must collaborate with domain experts to refine AI models for orthodontics-related literature. From a practice perspective, research scholars must use AI-generated references with caution. Clear guidelines and policies should be developed for the responsible use of LLMs in scientific and academic workflows, emphasizing human oversight to mitigate the risk of inaccuracy and misuse while leveraging their potential for enhancing the efficiency and quality of research work.[25] It is imperative to validate these citations by verifying and cross-checking the details. Dental education curricula must incorporate training on the efficient use of AI tools, making students aware of their strengths and limitations.[41] Editors of scientific journals must also cross-check references to ensure that fabricated references are not cited in scholarly literature, particularly in preprint versions. Advances in AI models require parallel improvements in fabrication detection capabilities to address these emerging challenges effectively.[30]

This study has a few limitations. For example, only a single prompt was used for each query; employing alternative or more specific prompts could have resulted in different outcomes, underscoring the relevance of prompt variation when evaluating AI performance.[29] In addition, all data mining occurred at a single point in time and involved only three AI models. Given the speed of development, refinement, and upgrades in AI models based on the incorporation of user feedback, our findings reflect a snapshot of current abilities, emphasizing the need for continuing research to track the evolution of AI performance in generating references. Future studies should use multiple and diverse prompts to assess any change in the accuracy of AI models. Future studies may also assess the capabilities of newly emerging AI chatbots within specific subfields of orthodontics to better understand AI performance depending on the depth of the topic.[42]


Conclusion

The findings lead to the conclusion that references generated by LLMs are not trustworthy. While Gemini showed better performance than the GPT models, significant limitations remain in all three models in reference generation in the field of orthodontics. AI researchers must investigate the reasons behind fabricated references and develop methods to improve the accuracy of generated references. Additionally, these findings advocate for the balanced and cautious use of AI tools, not only in academic research related to orthodontics but also in their general applications, emphasizing human validation of AI responses and training of dental professionals and researchers in the efficient use of AI tools, thus prioritizing accuracy and scientific credibility.



Conflict of Interest

None declared.

Supplementary Material


Address for correspondence

Saeed N. Asiri, BDS, MSD, M.Ed, PhD
Department of Pediatric Dentistry, College of Dentistry
Prince Sattam Bin Abdulaziz University, Al-Kharj 11942
Saudi Arabia   

Publication History

Article published online:
08 August 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)

Thieme Medical and Scientific Publishers Pvt. Ltd.
A-12, 2nd Floor, Sector 2, Noida-201301 UP, India

