Assessing the Reliability of ChatGPT and Gemini in Identifying Relevant Orthodontic Literature

Saeed N. Asiri

doi:10.1055/s-0045-1809617

CC BY 4.0 · European Journal of General Dentistry
Original Article

¹Department of Pediatric Dentistry, College of Dentistry, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia

Keywords

artificial intelligence - chatbot - literature - orthodontics - references

Introduction

Scientific referencing is one of the most important aspects of academic discourse. Citing and referring to other authors' work is a rhetorical strategy for demonstrating knowledge of and familiarity with the subject matter, persuading readers and gaining acceptance of one's academic arguments by linking one's findings with those of the scientific community, constructing one's identity as a scientific author, and supporting, verifying, or justifying scientific arguments.[1] [2] [3] Referencing is crucial for sustaining research integrity because it prevents plagiarism, increases the validity of evidence, and helps researchers trace the source of referenced information for further inquiry. Thus, proper referencing is needed not only for scholarly rigor but also for adherence to professional and academic ethical norms.[4] Nonetheless, referencing is frequently regarded as a laborious and time-consuming task.[5] The researcher must carefully find the relevant articles, organize them, and format them according to journal requirements.[4] The need to guarantee the accuracy and consistency of reference details, including author, journal, issue, and volume, makes this task more complicated.[6]

Identifying references for literature reviews with current approaches can be challenging and inefficient in many scientific fields, particularly rapidly expanding ones such as orthodontics. Orthodontics encompasses a wide range of related subjects, including biomechanics and craniofacial development, as well as modern advances such as clear aligners and digital orthodontics. This breadth, combined with the continuous inflow of novel research and innovation, calls for an approach to literature identification that is both comprehensive and up-to-date.[7] However, researchers typically conduct manual searches in databases such as PubMed or Scopus, using keywords and filters to identify relevant articles.[5] [8] This approach often necessitates multiple searches and the evaluation of numerous articles, which can be overwhelming and may result in crucial sources being overlooked.[7] Furthermore, the availability and accessibility of literature differ across databases and journals. Researchers may encounter restricted access, including paywalls that require subscriptions for full-text article access.[9] While institutional access, open access platforms, and article purchases mitigate such restrictions, the remaining constraints may still limit the capacity to rapidly assess artificial intelligence (AI)-generated references. These constraints can result in incomplete literature reviews and potential bias in reference selection.[10]

AI-based solutions, such as the Chat Generative Pre-trained Transformer (ChatGPT), offer potential remedies to the issues encountered in conventional reference identification methods, potentially improving both efficiency and accuracy.[11] [12] [13] Developed by OpenAI, ChatGPT is an advanced natural language processing model that uses AI to comprehend prompts and generate human-like responses through text or voice interactions.[14] Its applications extend beyond basic language tasks, demonstrating value in initial idea generation, data analysis, literature reviews, coding assistance, organizing scientific content, and manuscript drafting, thus providing researchers with a time-efficient tool to streamline their work.[13] [14] [15] Similarly, Google developed Gemini, a multimodal large language model (LLM) and AI-powered assistant, to provide AI-based writing assistance.[16] Its uses in scientific writing include idea generation, brainstorming and developing research questions, drafting, paraphrasing, summarizing, generating conclusions, abstracts, and keywords, and providing other automated feedback to user queries.[17] However, the use of these generative AI models in a research context is not without limitations. One major limitation is overreliance, which may reduce critical thinking and writing skills, particularly among early career researchers. Additionally, the models may fabricate information, particularly when summarizing complex content or generating references, which compromises scholarly work. Ethical concerns regarding plagiarism, authorship, and transparency in the conduct of research also arise.

ChatGPT 3.5, ChatGPT-4, and Gemini have been evaluated individually or in comparison with each other in a few studies to determine their accuracy in providing valid references.[17] [18] [19] [20] [21] [22] Some studies have found the models untrustworthy because of accuracy issues in referencing.[17] [18] [19] [20] Comparative work found that Gemini generated significantly more accurate responses than ChatGPT in the medical and dental research fields,[21] [22] whereas another study found the opposite.[23] These findings point to a potential accuracy hierarchy among AI chatbots. Moreover, evidence regarding the effectiveness of these AI models in assisting orthodontic experts in discovering relevant materials is limited. A recent study examined the accuracy of ChatGPT's answers to clinical questions and cases on interceptive orthodontics. While the results showed that the AI's replies were very precise and comprehensive, as well as capable of resolving difficult clinical cases, they were not entirely correct.[24] This points toward the critical need to assess AI tools before integrating them into critical aspects of academic or clinical work. Therefore, the purpose of this study was to assess the validity of ChatGPT and Google Gemini in delivering references for orthodontic literature studies. By evaluating the performance of these models, researchers and dental clinicians can gain insight into their possible advantages and limitations, which can in turn help AI researchers improve AI-assisted tools for literature review processes.

Materials and Methods

Search Strategy and Criteria

This study used ChatGPT and Gemini to carry out the search. Specifically, the Generative Pre-trained Transformer models (GPT-3.5 and GPT-4) within ChatGPT, together with Gemini, were used to search for topics in orthodontics and specific subdomains. The orthodontic areas searched in the models were (1) general orthodontics, (2) malocclusion classification, (3) treatment modalities, (4) orthodontic biomechanics, (5) clear aligner therapy, (6) fixed appliances, (7) surgical orthodontics, (8) retention protocols, (9) interdisciplinary treatments, (10) orthodontic outcomes, and (11) patient-centered care.

The models were given the instruction “Please provide the references in Vancouver style and their links in recent literature on...(name of the topic).” For each reference returned, six essential elements were recorded: (1) authors, (2) reference titles, (3) journal names, (4) publication years, (5) digital object identifiers (DOIs), and (6) reference links.
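The querying step described above can be sketched as follows. This is a minimal illustration, assuming the same instruction text was reused verbatim for each of the 11 topics; the names `TOPICS`, `PROMPT_TEMPLATE`, and `build_prompts` are hypothetical and do not come from the study, and no chatbot API is called.

```python
# Topic list copied from the Search Strategy section of this article.
TOPICS = [
    "general orthodontics",
    "malocclusion classification",
    "treatment modalities",
    "orthodontic biomechanics",
    "clear aligner therapy",
    "fixed appliances",
    "surgical orthodontics",
    "retention protocols",
    "interdisciplinary treatments",
    "orthodontic outcomes",
    "patient-centered care",
]

# Instruction wording as quoted in the article; {topic} is the only variable part.
PROMPT_TEMPLATE = (
    "Please provide the references in Vancouver style and their links "
    "in recent literature on {topic}"
)


def build_prompts(topics=TOPICS):
    """Return one prompt per topic (hypothetical helper, for illustration only)."""
    return {topic: PROMPT_TEMPLATE.format(topic=topic) for topic in topics}


if __name__ == "__main__":
    for topic, prompt in build_prompts().items():
        print(f"{topic}: {prompt}")
```

Keeping the wording fixed across topics means that differences between subdomains cannot be attributed to differences in the prompt.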

Reference Validation and Data Extraction

To verify the existence and precision of the cited references, the provided references were searched in four reputable scholarly sources: PubMed, Scopus, Web of Science, and Google Scholar. Each reference was first searched in PubMed, a widely respected biomedical literature database, using the provided DOI. If found in PubMed, the reference was considered existing and genuine. When PubMed searches were unsuccessful or when DOIs were incomplete or missing, Scopus and Web of Science, large abstract and citation databases covering articles, books, and conference proceedings across various disciplines, were used as supplementary resources to locate references. If a reference was not found in any of these three databases, the Google Scholar search engine was used for additional cross-referencing. Google Scholar indexes scholarly articles from various fields, encompassing both medical and nonmedical literature. Searches were conducted using reference titles, authors' names, and other pertinent information to locate and confirm the validity of the references. Because this study focused on evaluating bibliographic accuracy, access to the full text was not needed to check the references. However, when further confirmation was required, institutional access was used. A reference was deemed authentic if all six components (authors' names, reference titles, journal names, publication years, DOIs, and reference links) were accurate. The information for these six components was extracted from the references into a Microsoft spreadsheet for validation. Two authors independently verified the references to ensure the correctness and accuracy of the validation process. This comprehensive approach aimed to thoroughly evaluate the authenticity and accuracy of ChatGPT- and Gemini-generated references within an academic context.
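The search order described above (PubMed first, then Scopus and Web of Science, then Google Scholar) can be expressed as a simple fallback loop. The sketch below is an assumption-laden illustration: the study's searches were performed manually, and `locate_reference` and the lookup callables are hypothetical stand-ins rather than real database clients.

```python
from typing import Callable, Optional, Sequence, Tuple

# A lookup takes a reference record and reports whether the source contains it.
# In the study this check was a manual search; here it is a placeholder callable.
Lookup = Callable[[dict], bool]


def locate_reference(
    reference: dict, sources: Sequence[Tuple[str, Lookup]]
) -> Optional[str]:
    """Return the name of the first source in which the reference is found.

    Sources are tried in the order given, mirroring the cascade in the text.
    None means the reference was not located in any source.
    """
    for name, lookup in sources:
        if lookup(reference):
            return name
    return None


if __name__ == "__main__":
    # Dummy lookups for demonstration only; they do not query any database.
    demo_sources = [
        ("PubMed", lambda ref: bool(ref.get("in_pubmed"))),
        ("Scopus", lambda ref: bool(ref.get("in_scopus"))),
        ("Web of Science", lambda ref: bool(ref.get("in_wos"))),
        ("Google Scholar", lambda ref: bool(ref.get("in_scholar"))),
    ]
    print(locate_reference({"in_scopus": True}, demo_sources))
    print(locate_reference({}, demo_sources))
```

A reference for which the function returns `None` would be a candidate for the "fabricated or nonexistent" category, subject to the manual checks described in the text.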

Research Outcomes

The primary objective of this investigation was to assess the validity of ChatGPT- and Gemini-generated references within an academic context. Validity encompassed the authenticity of references, which required accurate authors' names, article titles, journal names, publication years, and DOIs/links. References were classified as “fabricated or nonexistent” if all citation elements were incorrect or the reference could not be found. Existing references with one or more inaccurate components were categorized as “incorrect/inaccurate.” References in which all six elements were present and accurate were designated as “correct/accurate.” In addition to evaluating accuracy, the study examined the frequency of incorrect components within each reference and across the orthodontic subdomains.
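The three-way classification above can be written as a small decision rule. The following is a sketch of one reading of the stated criteria, not the procedure actually used in the study; the function name, the `exists` flag, and the element keys are hypothetical.

```python
# The six elements checked for each reference, as listed in the Methods.
ELEMENTS = ("authors", "title", "journal", "year", "doi", "link")


def classify_reference(exists: bool, element_ok: dict) -> str:
    """Classify one reference from per-element verification results.

    exists     -- True if the reference was located in any database (assumption:
                  this is how "nonexistent" was determined).
    element_ok -- maps each name in ELEMENTS to True when that element was
                  verified as accurate; missing keys count as inaccurate.
    """
    checks = [bool(element_ok.get(name, False)) for name in ELEMENTS]
    if not exists or not any(checks):
        return "fabricated or nonexistent"
    if all(checks):
        return "correct/accurate"
    return "incorrect/inaccurate"


if __name__ == "__main__":
    all_good = {name: True for name in ELEMENTS}
    wrong_year = dict(all_good, year=False)
    print(classify_reference(True, all_good))
    print(classify_reference(True, wrong_year))
    print(classify_reference(False, {}))
```

Treating a missing element as inaccurate is a design choice made here for simplicity; the article groups incomplete and inaccurate references together in its tables.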

Data Analysis

Descriptive statistics were used to present the data as counts and percentages. All statistical analyses were conducted using Jamovi Version 2.6.19 (The Jamovi Project, Australia). Descriptive statistics were used to evaluate the references generated by ChatGPT Versions 3.5 and 4 and by Gemini, focusing on three key aspects: completeness, accuracy, and fabrication. These three aspects were also assessed separately for each model across the 11 subdomains of orthodontics to determine whether certain topics influenced accuracy. Bar graphs were used to show the proportion of accurate references in each subdomain. A reliability score was calculated using Cronbach's α to measure the internal consistency of the models in generating accurate references. A Cronbach's α value below 0.5 was considered unacceptable, whereas a value above 0.7 was considered acceptable. The relationship in accuracy between the three models was presented visually as a correlation heat map.
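For readers unfamiliar with Cronbach's α, the standard formula is α = k/(k − 1) × (1 − Σ item variances / variance of the summed scores), where k is the number of items. The sketch below implements that textbook formula in plain Python. It assumes each model is treated as one "item" and that scores are paired across the same observations (for example, per-subdomain accuracy); the article does not state how the input matrix was built in Jamovi, so this layout and the sample numbers are illustrative assumptions only.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items scored on the same n observations.

    items -- list of k equal-length lists; items[i][j] is the score of
             item i (here: one AI model) on observation j.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(items[0])
    if n < 2 or any(len(scores) != n for scores in items):
        raise ValueError("items need equal length and at least two observations")

    # Total score per observation, summed across items.
    totals = [sum(column) for column in zip(*items)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Made-up scores for three items on four observations, for illustration only.
    demo = [[0.8, 0.3, 0.0, 0.5], [0.1, 0.0, 0.0, 0.1], [0.4, 0.5, 1.0, 0.2]]
    print(round(cronbach_alpha(demo), 3))
```

Under the thresholds stated in the text, a value below 0.5 would be read as unacceptable and a value above 0.7 as acceptable.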

Results

A total of 203 references were generated by the AI models: ChatGPT 4 contributed 52% (n = 105), Gemini 27% (n = 55), and ChatGPT 3.5 21% (n = 43) of the references ([Fig. 1]).

Fig. 1 References generated by the artificial intelligence (AI) models.

Of all references generated by the AI models, only 15.76% were correct, whereas 71.92% were fabricated and 12.32% were inaccurate. Gemini had the highest proportion of correct references (36.36%), followed by ChatGPT 3.5 (25.58%; 11 of 43) and ChatGPT 4 (0.95%), a significant difference (p < 0.01). On the other hand, ChatGPT 4 produced the highest proportion of fabricated references (99.05%), followed by ChatGPT 3.5 (53.49%) and Gemini (34.55%) (p < 0.01). In terms of inaccurate references, Gemini accounted for the highest proportion (29.09%), followed by ChatGPT 3.5 (20.93%), whereas ChatGPT 4 produced none (p = 0.01).

Reliability across the models yielded a Cronbach's α of 0.418, indicating low-to-moderate consistency in the accuracy of the references ([Table 1]). The correlation heat map shows a weak correlation between ChatGPT 3.5 and Gemini, whereas ChatGPT 4 has a negligible correlation with both models, indicating distinct output across the models ([Fig. 2]).

Table 1
Proportion of correct, fabricated, and inaccurate references generated by each model, n (%)
Response categories	ChatGPT 3.5 (n = 43)	ChatGPT 4 (n = 105)	Gemini (n = 55)	Total references (n = 203)	p-Value	Reliability score (Cronbach's α)
Correct references	11 (25.58)	1 (0.95)	20 (36.36)	32 (15.76)	< 0.01	0.418
Fake/fabricated references	23 (53.49)	104 (99.05)	19 (34.55)	146 (71.92)	< 0.01
Inaccurate/incomplete references	9 (20.93)	0 (0)	16 (29.09)	25 (12.32)	0.01
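
The percentages in Table 1 follow directly from the reported counts, and the arithmetic can be reproduced with a few lines of Python. This is a verification sketch using only the counts printed in the table; the names `COUNTS`, `column_percentages`, and `overall_percentages` are hypothetical. Under these counts, 11 of 43 ChatGPT 3.5 references works out to 25.58%.

```python
# Counts per model and category, copied from Table 1 of this article.
COUNTS = {
    "ChatGPT 3.5": {"correct": 11, "fabricated": 23, "inaccurate": 9},
    "ChatGPT 4": {"correct": 1, "fabricated": 104, "inaccurate": 0},
    "Gemini": {"correct": 20, "fabricated": 19, "inaccurate": 16},
}


def column_percentages(counts):
    """Percentage of each category within each model's own references."""
    result = {}
    for model, cells in counts.items():
        total = sum(cells.values())
        result[model] = {
            category: round(100 * n / total, 2) for category, n in cells.items()
        }
    return result


def overall_percentages(counts):
    """Percentage of each category across all models combined."""
    grand_total = sum(sum(cells.values()) for cells in counts.values())
    categories = next(iter(counts.values())).keys()
    return {
        category: round(
            100 * sum(cells[category] for cells in counts.values()) / grand_total, 2
        )
        for category in categories
    }


if __name__ == "__main__":
    for model, cells in column_percentages(COUNTS).items():
        print(model, cells)
    print("All models", overall_percentages(COUNTS))
```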

Fig. 2 Correlation heat map.

In terms of the content of the references generated by the LLMs, DOIs or links were fabricated by ChatGPT 3.5 and 4, whereas Gemini in some instances provided the URL of a different paper, stated “invalid URL,” or did not provide any link. In some cases, the models also fabricated the author names, journal, or year. Examples of such instances are presented in [Table 2].

Table 2
Examples of fabricated, accurate, and incomplete references generated by the AI models
Query/Prompt	ChatGPT 3.5	ChatGPT 4	Gemini
Please provide the references in Vancouver style and their links in recent literature on malocclusion classification	Macari S, et al Orthopedic devices for skeletal Class III malocclusion treatment in growing patients: A systematic review. J Clin Med. 2024;13(23):7141. https://www.mdpi.com/2077-0383/13/23/7141 (incorrect author name)	Baccetti T, Franchi L, McNamara JA Jr. The effect of Class II malocclusion on craniofacial growth in the early mixed dentition. Am J Orthod Dentofacial Orthop. 2023;163(3):413–419. 10.1016/j.ajodo.2022.07.015	Kharbanda S, Singh R, Singh P, et al. Malocclusion and occlusal traits among dental and nursing students of Seven North-East states of India. J Indian Soc Pedodontic Prev Dent. 2023;41(4):306–311. https://pubmed.ncbi.nlm.nih.gov/37715698/
	Lo Giudice A, et al. Three-stage non-surgical treatment of skeletal Class III malocclusion in adolescents: A long-term follow-up report. J Clin Med. 2024;13(15):5637.	Ghoneima A, El-Bialy T. A novel classification system for malocclusion based on three-dimensional cone-beam computed tomography. J Orthod Sci. 2022;11(4):130–136. 10.4103/jos.JOS_88_22.	Adel A, Allareddy V, Al-Hezaimi K, et al. Future of Orthodontics—A Systematic Review and Meta-Analysis on the Emerging Trends in This Field. J Dent Res. 2021;100(12):1449–1461. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9861462
	Guo Y, et al. Classification and characterization of facial asymmetry in skeletal malocclusions using CBCT. Am J Orthod Dentofacial Orthop. 2024;145(5):589–598. https://www.mdpi.com/2076-3417/14/17/8095	Al-Moghrabi D, Muntaser I, Al-Dosary M. Comparison of Angle's classification and the IOTN (Index of Orthodontic Treatment Need) for malocclusion diagnosis. J Dent Sci. 2022;17(5):287–293. 10.1016/j.jdsci.2022.07.001.	Graber LW, Vanarsdall RL, Vig KGO. Orthodontics: Current Principles and Techniques. 5th ed. St. Louis, MO: Mosby; 2015. (year incorrect)
	Pavoni C, et al. Genetic predisposition in skeletal Class III malocclusions: A systematic review. Eur J Orthod. 2023;45(3):239–250. https://www.mdpi.com/2077-0383/13/23/7141	Kokich VG, Kokich VO. Malocclusion classification: Revisited and clarified. Orthodontics (Chic.). 2023;54(1):28–35. 10.1016/j.jdsci.2022.07.001	Burstone CJ, Burstone MT. Essentials of Orthodontics. 5th ed. St. Louis, MO: Mosby; 2018.
	Al-Bitar Z, et al. Impact of malocclusion classification on oral health-related quality of life. BMC Oral Health. 2024;24:156. https://www.mdpi.com/2077-0383/13/23/7141	Zaib M, Huda N, Irfan M. Evaluation of malocclusion using the new AO classification system and its correlation with cephalometric parameters. J Clin Orthod. 2023;57(10):610–617. 10.2319/021922-105.1	American Association of Orthodontists. Clinical Practice Guidelines. [invalid URL removed]

Abbreviations: AI, artificial intelligence; DOI, digital object identifier.

Note: Sky blue color indicates nonexistent DOI/links; Violet color indicates fabricated author name/title/journal.

For ChatGPT 3.5, the proportion of incomplete references ranged from 33.33 to 75% across fields and was highest for retention protocols (75%), followed by treatment modalities (50%) and malocclusion classification (40%). Fabricated references showed a varied pattern, ranging from 0 to 100%: orthodontic biomechanics and clear aligner therapy showed 100% fabrication, whereas treatment modalities showed only a 20% fabrication rate, and interdisciplinary treatment and patient-centered care showed 0%. The model provided accurate references in only two fields, treatment modalities and general orthodontics, with 28 and 80% being accurate, respectively ([Supplementary Fig. S1], available in the online version only).

ChatGPT 4 produced fabricated references across all fields of orthodontic literature, ranging from 87.5% (malocclusion classification) to 100% (all remaining fields) ([Supplementary Fig. S2], available in the online version only).

Gemini performed better than the other AI models. Fabricated references were below 50% in the majority of fields, with the exception of patient-centered care (100%). Incomplete references ranged from 50 to 60% across orthodontic fields, and were highest in general orthodontics (60%), followed by malocclusion classification (57.14%). Gemini produced 100% correct references in the fields of orthodontic outcomes, retention protocols, clear aligner therapy, and orthodontic biomechanics ([Supplementary Fig. S3], available in the online version only).

Discussion

There has been an explosion of research into the accuracy and reliability of AI responses in the scientific arena. This reflects the rapidly increasing use of AI in a variety of fields, including dental education and research. Given this frequent use, it is vital to investigate the accuracy of the information and sources generated by these models. For that reason, this study sought to assess the reliability and correctness of references generated by three AI models (ChatGPT-3.5, ChatGPT-4, and Gemini) on specific orthodontic topics. Our findings revealed that the vast majority of the references generated by the models were inaccurate, with weak reliability across the three models. Such findings are concerning, considering that trustworthiness is a key component of the dissemination of scientific information.[25] [26]

The data show that the three AI models vary in performance. Gemini produced a higher proportion of accurate references (36.36%) than ChatGPT 3.5 (25.58%) and ChatGPT 4 (0.95%). These findings are consistent with Omar et al.[21] That study compared references generated by Gemini with ChatGPT-4-generated references in the medical field and found that Gemini surpassed GPT-4 in reference precision, with an accuracy of 68% versus 49.2% for GPT-4.[20] In contrast, Pirkle et al found no significant differences in the performance of ChatGPT and Gemini in terms of generated citations, with both producing references containing errors in publication year and journal, as well as fake authors.[27] These variations could be attributed to differences in the architectures of ChatGPT and Gemini. A fundamental element of Gemini is reported to be retrieval-augmented generation, which combines information retrieval from the Google platform with text generation to produce more accurate and contextually relevant results. ChatGPT, on the other hand, is based on reinforcement learning with human feedback, which refines answers according to users' instructions. The lack of an integrated retrieval mechanism may limit its capacity to provide accurate references, making it better suited for general-purpose adaptation.[28]

Another significant finding of the study is that ChatGPT 4 consistently created fabricated references (99.05%) across all orthodontic topics, whereas ChatGPT 3.5 produced 53.49% fabricated and 20.93% incomplete references. Similarly, Bhattacharyya et al[29] reported that 47% of the references generated by ChatGPT 3.5 were fabricated and 46% were inaccurate. The high number of fake responses from the ChatGPT models suggests an intrinsic problem with its reference generation algorithm or with how the model handles reference data from the large data set on which it is trained.[30] [31] [32] Interestingly, these findings contradict earlier studies undertaken in many domains. One study compared ChatGPT 3.5 and 4 in producing otolaryngology-related references and found that ChatGPT 4 outperformed ChatGPT 3.5.[33] Similarly, three studies[32] [34] [35] found that ChatGPT 4 performed better than its previous version (ChatGPT 3.5). This disparity in findings suggests that variations in the accuracy and reliability of AI-generated references may be attributable to differences in the subject matter of study, emphasizing the importance of domain-specific evaluation of AI tools. However, the difference in reliability and accuracy between both versions was minimal in this study, with fabricated citations persisting in both versions. Moreover, the consistency in performance across the two versions suggests that the improvements made in ChatGPT 4 might be more beneficial in general use cases than in highly specialized fields where the knowledge base is critical.

While Gemini performed better overall in generating accurate references (36.36%), it still created some incomplete (29.09%) and fake references (34.55%), notably for topics such as malocclusion classification, general orthodontics, treatment modalities, and patient-centered care. In addition, the model's references consisted primarily of books or clinical practice guidelines that were repeated across multiple queries. Similarly, Chelli et al[36] observed that Bard, the previous version of Gemini, appeared to take a try-and-repeat approach, producing many versions of the same publications. This suggests that Gemini depends mainly on a small number of reputable sources rather than providing diverse or topic-specific references. While this strategy may improve accuracy in particular contexts, it also shows a lack of adaptability and depth in generating nuanced references for specific research needs. Another positive aspect of Gemini that sets it apart from the ChatGPT models is that it displayed “Gemini can make mistakes, so double-check it” as a footnote in the interface. Such a cautionary statement can enhance users' trust by promoting transparency and encouraging critical evaluation of AI-generated responses, thus fostering a more informed and responsible use of AI in scientific writing and referencing.

Overall, the findings showed that AI models can be an effective tool for assisting dental academicians and researchers in summarizing, paraphrasing, and exploring orthodontic-related content. However, it is critical to emphasize that the citations generated by these AI models are not fundamentally reliable and thus require human validation, particularly for author details, year of publication, title, and DOI. LLMs have been found to produce fictitious information, a phenomenon known as AI hallucination.[37] This includes instances in which the models produce citations with incorrect DOIs, journal names, or author details.[38] [39] Users must double-check the authenticity of the DOI even when the models provide one, as it frequently diverges from the correct DOI, which could result in referenced sources that are unavailable or nonexistent. Therefore, user verification is essential to maintain the scientific integrity and dependability of scientific communication, particularly when AI tools are used in academic and research settings.[36] [40]
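
As a first-pass aid to the DOI checking recommended above, a DOI string can be screened for well-formedness before it is looked up manually. The sketch below is a minimal illustration, not part of the study: the helper names are hypothetical, the pattern is the commonly used form for modern DOIs (a `10.` prefix with a 4 to 9 digit registrant code), and stripping a trailing period is a convenience assumption for DOIs copied from reference lists. A well-formed DOI can still be nonexistent or point to a different article, so the constructed `https://doi.org/` link must still be opened and compared with the cited title and authors.

```python
import re

# Commonly used pattern for modern DOIs; it checks syntax only, not existence.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip whitespace, a leading resolver URL or 'doi:' label, and a trailing period."""
    doi = raw.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip().rstrip(".")


def looks_like_doi(raw: str) -> bool:
    """True if the string is syntactically plausible as a DOI."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))


def resolver_url(raw: str) -> str:
    """Build the doi.org link a human checker would open to confirm the target."""
    doi = normalize_doi(raw)
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a well-formed DOI: {raw!r}")
    return "https://doi.org/" + doi


if __name__ == "__main__":
    # The DOI of this article, as printed in its header.
    print(resolver_url("doi:10.1055/s-0045-1809617"))
    print(looks_like_doi("[invalid URL removed]"))
```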

These findings have implications for research and practice. From a research perspective, more comparative studies are needed to evaluate the performance of these models in other disciplines. AI researchers must investigate the mechanisms underlying fabrication and inaccuracy in the models, develop algorithms to minimize hallucination, and devise methods to improve references, such as an integrated research library.[5] [36] Moreover, AI researchers must collaborate with domain experts to refine AI models for orthodontics-related literature. From a practice perspective, research scholars must use AI-generated references with caution. Clear guidelines and policies should be developed for the responsible use of LLMs in scientific or academic workflows, emphasizing human oversight to mitigate the risk of inaccuracy and misuse while leveraging their potential to enhance the efficiency and quality of research work.[25] It is imperative to validate these citations by verifying and cross-checking the details. Dental education curricula must incorporate training on the efficient use of AI tools, making students aware of their strengths and limitations.[41] Editors of scientific journals must also cross-check references to ensure that fabricated references are not cited in scholarly literature, particularly in preprint versions. Advances in AI models require parallel improvement in fabrication detection capabilities to address the emerging challenges effectively.[30]

This study has a few limitations. First, only a single prompt was used for each query; alternative or more specific prompts could have produced different outcomes, underscoring the relevance of prompt variation when evaluating AI performance.[29] In addition, all data were collected at one specific point in time from only three AI models. Given the speed of development, refinement, and upgrading of AI models based on users' feedback, our findings reflect a snapshot of current abilities and emphasize the need for continuing research to follow the growth of AI performance in generating references. Future studies should use multiple and diverse prompts to assess any change in the accuracy of AI models. Future studies may also assess the capabilities of newly emerging AI chatbots within various subfields of orthodontics to better understand how AI performance depends on the depth of the topics.[42]

Conclusion

The findings indicate that references generated by LLMs are not trustworthy. While Gemini performed better than the GPT models, significant limitations remain in all three models in generating references in the field of orthodontics. AI researchers must investigate the reasons behind the fabricated references and develop methods to improve reference accuracy. Additionally, these findings advocate for balanced and cautious use of AI tools, not only in academic research related to orthodontics but also in their general applications, emphasizing human validation of AI responses and training of dental professionals and researchers in the efficient use of AI tools, thus prioritizing accuracy and scientific credibility.

References

1 Khamkhien A. The art of referencing: patterns of citation and authorial stance in academic texts written by Thai students and professional writers. J Engl Acad Purposes 2025; 74: 101470
2 Jomaa NJ, Bidin SJ. Perspectives of EFL doctoral students on challenges of citations in academic writing. Malaysian J Learning Instruct 2017; 14 (02) 177-209
3 Mehta V, Thomas V, Mathur A. AI-dependency in scientific writing. Oral Oncol Rep 2024; 10 (03) 100269
4 Divecha CA, Tullu MS, Karande S. The art of referencing: well begun is half done!. J Postgrad Med 2023; 69 (01) 1-6
5 Suppadungsuk S, Thongprayoon C, Krisanapan P. et al. Examining the validity of ChatGPT in identifying relevant nephrology literature: findings and implications. J Clin Med 2023; 12 (17) 5550
6 Azizah NN, Maryanti R, Nandiyanto ABD. How to search and manage references with a specific referencing style using Google Scholar: from step-by-step processing for users to the practical examples in the referencing education. Indonesian J Multidiciplinary Res 2021; 1 (02) 267-294
7 Tomášik J, Zsoldos M, Oravcová Ľ. et al. AI and face-driven orthodontics: a scoping review of digital advances in diagnosis and treatment planning. AI 2024; 5 (01) 158-176
8 Martin S, Hussain Z, Boyle JG. A beginner's guide to the literature search in medical education. Scott Med J 2017; 62 (02) 58-62
9 Boudry C, Alvarez-Muñoz P, Arencibia-Jorge R. et al. Worldwide inequality in access to full text scientific articles: the example of ophthalmology. PeerJ 2019; 7: e7850
10 King S, Davidson K, Chitiyo A, Apple D. Evaluating article search and selection procedures in special education literature reviews. Remedial Spec Educ 2020; 41 (01) 3-17
11 Menon D, Shilpa K. “Chatting with ChatGPT”: analyzing the factors influencing users' intention to use the Open AI's ChatGPT using the UTAUT model. Heliyon 2023; 9 (11) e20962
12 Alyasiri OM, Salman AM, Akhtom D, Salisu S. ChatGPT revisited: using ChatGPT-4 for finding references and editing language in medical scientific articles. J Stomatol Oral Maxillofac Surg 2024; 125 (5S2, Supplement 2): 101842
13 Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 2023; 15 (02) e35179
14 Biswas SS. ChatGPT for research and publication: a step-by-step guide. J Pediatr Pharmacol Ther 2023; 28 (06) 576-584
15 Flaherty HB, Yurch J. Beyond plagiarism: ChatGPT as the vanguard of technological revolution in research and citation. Res Soc Work Pract 2024; 34 (05) 483-486
16 Imran M, Almusharraf N. Google Gemini as a next generation AI educational tool: a review of emerging educational technology. Smart Learning Environments. 2024; 11 (01) 22
17 Barrot J. Leveraging Google Gemini as a research writing tool in higher education. Technol Knowled Learning 2024; 30 (01) 1-8
18 Giray L. ChatGPT references unveiled: distinguishing the reliable from the fake. Internet Ref Serv Q 2024; 28 (01) 9-18
19 Wagner MW, Ertl-Wagner BB. Accuracy of information and references using ChatGPT-3 for retrieval of clinical radiological information. Can Assoc Radiol J 2024; 75 (01) 69-73
20 Ramos-Gomez F, Marcus M, Maida CA. et al. Using a machine learning algorithm to predict the likelihood of presence of dental caries among children aged 2 to 7. Dent J 2021; 9 (12) 141
21 Omar M, Nassar S, Hijazi K, Glicksberg BS, Nadkarni GN, Klang E. Generating credible referenced medical research: a comparative study of openAI's GPT-4 and Google's Gemini. Comput Biol Med 2025; 185: 109545
22 Aziz AAA, Abdelrahman HH, Hassan MG. The use of ChatGPT and Google Gemini in responding to orthognathic surgery-related questions: a comparative study. J World Fed Orthod 2025; 14 (01) 20-26
23 Labrague LJ. Utilizing artificial intelligence-based tools for addressing clinical queries: ChatGPT versus Google Gemini. J Nurs Educ 2024; 63 (08) 556-559
24 Hatia A, Doldo T, Parrini S. et al. Accuracy and completeness of ChatGPT-generated information on interceptive orthodontics: a multicenter collaborative study. J Clin Med 2024; 13 (03) 735
25 Thurzo A, Strunga M, Urban R, Surovková J, Afrashtehfar KI. Impact of artificial intelligence on dental education: a review and guide for curriculum update. Educ Sci (Basel) 2023; 13 (02) 150
26 Gravel J, D'Amours-Gravel M, Osmanlliu E. Learning to fake it: limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clin Proc Digit Health 2023; 1 (03) 226-234
27 Pirkle S, Yang J, Blumberg TJ. Do ChatGPT and Gemini provide appropriate recommendations for pediatric orthopaedic conditions?. J Pediatr Orthop 2025; 45 (01) e66-e71
28 Rane N, Choudhary S, Rane J. Gemini versus ChatGPT: applications, performance, architecture, capabilities, and implementation. J Appl Artif Intell. 2024; 5 (01) 69-93
29 Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High rates of fabricated and inaccurate references in ChatGPT-generated medical content. Cureus 2023; 15 (05) e39238
30 Walters WH, Wilder EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 2023; 13 (01) 14045
31 Day T. A preliminary investigation of fake peer-reviewed citations and references generated by ChatGPT. Prof Geogr 2023; 75 (06) 1024-1027
32 Chen A, Chen DO. Accuracy of chatbots in citing journal articles. JAMA Netw Open 2023; 6 (08) e2327647
33 Lechien JR, Briganti G, Vaira LA. Accuracy of ChatGPT-3.5 and -4 in providing scientific references in otolaryngology-head and neck surgery. Eur Arch Otorhinolaryngol 2024; 281 (04) 2159-2165
34 Buchanan J, Hill S, Shapoval O. ChatGPT hallucinates non-existent citations: evidence from economics. Am Econ 2024; 69 (01) 80-87
35 Frosolini A, Franz L, Benedetti S. et al. Assessing the accuracy of ChatGPT references in head and neck and ENT disciplines. Eur Arch Otorhinolaryngol 2023; 280 (11) 5129-5133
36 Chelli M, Descamps J, Lavoué V. et al. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. J Med Internet Res 2024; 26 (01) e53164
37 Jamaluddin J, Gaffar NA, Din NSS. Hallucination: a key challenge to artificial intelligence-generated writing. Malays Fam Physician 2023; 18: 68
38 Mugaanyi J, Cai L, Cheng S, Lu C, Huang J. Evaluation of large language model performance and reliability for citations and references in scholarly writing: cross-disciplinary study. J Med Internet Res 2024; 26 (01) e52935
39 Haman M, Školník M. Using ChatGPT to conduct a literature review. Account Res 2024; 31 (08) 1244-1246
40 Rahman M, Terano HJR, Rahman N, Salamzadeh A, Rahaman S. ChatGPT and academic research: a review and recommendations based on practical examples. J Educ, Mngt, and Dev Studies. 2023; 3 (01) 1-12
41 Snigdha NT, Batul R, Karobari MI. et al. Assessing the performance of ChatGPT 3.5 and ChatGPT 4 in operative dentistry and endodontics: an exploratory study. In: Tatu AL, ed. Human Behavior and Emerging Technologies. Toronto, Canada: JMIR Publications; 2024;2024(1):1119816
42 Aljamaan F, Temsah MH, Altamimi I. et al. Reference hallucination score for medical artificial intelligence chatbots: development and usability study. JMIR Med Inform 2024; 12: e54345
