Keywords
artificial intelligence - chatbot - literature - orthodontics - references
Introduction
Scientific referencing is one of the most important aspects of academic discourse.
Citing other authors' work is a rhetorical strategy for demonstrating knowledge of and
familiarity with the subject matter, persuading readers and gaining acceptance of one's
academic arguments by linking one's findings with those of the scientific community,
constructing one's identity as a scientific author, and supporting, verifying, or justifying
scientific arguments.[1] [2] [3] Referencing is crucial for sustaining research integrity since
it prevents plagiarism, increases the validity of evidence, and helps researchers trace the
source of referenced information for further inquiry. Thus, proper referencing is needed not
only for scholarly rigor but also for adherence to professional and academic ethical norms.[4]
Nonetheless, referencing is frequently regarded as a laborious and time-consuming task.[5]
The researcher must carefully find the relevant articles, arrange them, and format them
according to journal requirements.[4] The need to guarantee the accuracy and consistency of
reference details, including author, journal, issue, and volume, makes this task more
complicated.[6]
Identifying references for literature reviews using current approaches can be challenging
and inefficient across many scientific fields, particularly those that are rapidly expanding,
such as orthodontics. Orthodontics encompasses a wide range of related subjects, including
biomechanics and craniofacial development, as well as modern advances such as clear aligners
and digital orthodontics. This breadth, combined with the continuous inflow of novel research
and innovation, calls for an approach to literature identification that is both comprehensive
and up-to-date.[7] However, researchers typically conduct manual searches in databases such as
PubMed or Scopus, using keywords and filters to identify relevant articles.[5] [8] This approach
often necessitates multiple searches and the evaluation of numerous articles, which can be
overwhelming and may result in crucial sources being overlooked.[7] Furthermore, the
availability and accessibility of literature differ across databases and journals. Researchers
may encounter restricted access, including paywalls that require subscriptions for full-text
access.[9] While institutional access, open access platforms, and article purchases mitigate
such restrictions, the remaining constraints may still impair the capacity to rapidly assess
artificial intelligence (AI)-generated references. These constraints can result in incomplete
literature reviews and potential bias in reference selection.[10]
AI-based solutions, such as the Chat Generative Pre-trained Transformer (ChatGPT),
offer potential remedies to the issues encountered in conventional reference identification
methods, potentially improving both efficiency and accuracy.[11] [12] [13] Developed by OpenAI,
ChatGPT is an advanced natural language processing model that
utilizes AI to comprehend and generate human-like responses through text or voice
interactions after a prompt request.[14] Its applications extend beyond basic language tasks, demonstrating value in initial
idea generation, data analysis, literature reviews, coding assistance, organizing
scientific content, and manuscript drafting, thus providing researchers with a time-efficient
tool to streamline their work.[13] [14] [15] Similarly, Google developed Gemini, a multimodal
large language model (LLM) and AI-powered assistant, to provide AI-based writing assistance.[16]
Its uses in scientific writing include brainstorming and developing research questions,
drafting, paraphrasing, summarizing, generating conclusions, abstracts, and keywords, and
providing other automated feedback on user queries.[17] However, the use of these generative AI
models in a research context is not without limitations. One major limitation is overreliance,
which may reduce critical thinking and writing skills, particularly among early-career
researchers. Additionally, the models may fabricate information, particularly when summarizing
complex content or generating references, which compromises scholarly work. Ethical concerns
regarding plagiarism, authorship, and transparency in the conduct of research also arise.
ChatGPT 3.5, ChatGPT-4, and Gemini have been evaluated individually or in comparison with each
other in a few studies to determine their accuracy in providing valid references.[17] [18] [19]
[20] [21] [22] However, some studies have found that the models are untrustworthy due to accuracy
issues in referencing.[17] [18] [19] [20] One comparative study found that Gemini generated
significantly more accurate responses than ChatGPT in the medical and dental research
fields,[21] [22] while another found the opposite.[23] These findings point to a potential
accuracy hierarchy among AI chatbots. Moreover,
the effectiveness of these AI models in assisting orthodontic experts in discovering
relevant materials is limited. A recent study additionally examined the accuracy of
ChatGPT's answers to clinical questions and cases on interceptive orthodontics. While
the results showed that the AI's replies were very precise and comprehensive, as well
as capable of resolving difficult clinical cases, they were not entirely correct.[24] This
points to the critical need to assess AI tools before integrating them into
critical aspects of academic or clinical work. Therefore, the purpose of this study
was to assess the validity of ChatGPT and Google Gemini in delivering references for
orthodontic literature studies. Researchers and dental clinicians can acquire insights
into the possible advantages and limitations of AI models by evaluating their performance,
which can help AI researchers to improve AI-assisted tools for literature review processes.
Materials and Methods
Search Strategy and Criteria
This study used ChatGPT and Gemini models to carry out the search. Specifically,
the Generative Pre-trained Transformer models (GPT-3.5 and GPT-4) within ChatGPT were employed
to search for topics in orthodontics and specific subdomains. The selected orthodontic
areas searched in the models included (1) general orthodontics, (2) malocclusion
classification, (3) treatment modalities, (4) orthodontic biomechanics, (5) clear
aligner therapy, (6) fixed appliances, (7) surgical orthodontics, (8) retention protocols,
(9) interdisciplinary treatments, (10) orthodontic outcomes, and (11) patient-centered
care.
The models were instructed as follows: “Please provide the references in Vancouver style and
their links in recent literature on... (name of the topic).” For each reference generated,
six essential elements were identified: (1) authors, (2) reference titles, (3) journal
names, (4) publication years, (5) digital object identifiers (DOIs), and (6) reference
links.
Reference Validation and Data Extraction
To verify the existence and accuracy of the generated references, they were searched in four
scientific sources: PubMed, Scopus, Web of Science, and Google Scholar. Each reference was
first searched for using the provided DOI in PubMed, a widely respected biomedical literature
database. If found in PubMed, the reference was considered existing and genuine. When PubMed
searches were unsuccessful or when DOIs were incomplete or missing, Scopus and Web of Science,
which are large abstract and citation databases covering articles, books, and conference
proceedings across various disciplines, were used as supplementary resources to locate
references. If a reference was not found in any of these three databases, the Google Scholar
search engine was used for additional cross-referencing. Google Scholar indexes scholarly
articles from various fields, encompassing both medical and nonmedical literature.
Searches were conducted using reference titles, authors' names, and other pertinent
information to locate and confirm the validity of the references. Because this study
focused on evaluating bibliographic accuracy, access to the full text was not needed to check
the references. However, when further confirmation was required, institutional access
was used. A reference was deemed authentic if all six components (authors' names, reference
titles, journal names, publication years, DOIs, and reference links) were accurate. The
information for these six components was extracted from the references into a Microsoft
spreadsheet for validation. Two authors independently verified the references to ensure the
correctness and accuracy of the validation process. This comprehensive approach aimed to
thoroughly evaluate the authenticity and accuracy of ChatGPT- and Gemini-generated references
within an academic context.
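The DOI-first lookup described above was carried out manually. As a minimal sketch of how that first screening step could be assisted, the following Python snippet normalizes a generated DOI, checks its syntax, and builds the doi.org resolver URL to open for manual verification. The pattern is a commonly used heuristic for DOI syntax, not an official validation rule, and the helper names are hypothetical. A syntactically valid DOI can still be nonexistent, so this does not replace the database search.

```python
import re

# Heuristic pattern (an assumption, not an official rule):
# "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Strip whitespace, a resolver or 'doi:' prefix, and trailing periods."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip().rstrip(".")


def looks_like_doi(raw: str) -> bool:
    """Return True if the string is syntactically plausible as a DOI."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))


def resolver_url(raw: str) -> str:
    """Build the doi.org URL a reviewer would open to check the DOI."""
    return "https://doi.org/" + normalize_doi(raw)


if __name__ == "__main__":
    # DOIs taken from the example references in Table 2.
    for candidate in ["10.1016/j.ajodo.2022.07.015", "10.4103/jos.JOS_88_22."]:
        print(candidate, looks_like_doi(candidate), resolver_url(candidate))
```

Both example DOIs pass the syntax check, which illustrates why each one still has to be resolved and compared against the database record.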
Research Outcomes
The primary objective of this investigation was to assess the validity of ChatGPT-
and Gemini-generated references within an academic context. Validity encompassed the
authenticity of references, requiring accurate authors' names, article titles,
journal names, publication years, DOIs, and links. References were classified as “fabricated
or nonexistent” if all citation elements were incorrect or the reference did not exist.
Existing references with one or more inaccurate components were categorized
as “incorrect/inaccurate.” References in which all six elements were accurate
were designated as “correct/accurate.” In addition to evaluating accuracy,
the study examined the frequency of incorrect components within each reference and
across various orthodontic subdomains.
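The three-way classification above can be expressed as a small decision rule. The following Python sketch is one possible encoding, assuming each reference has already been checked by hand; the class, field, and function names are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass

# The six elements checked for each reference, as listed in the text.
ELEMENTS = ("authors", "title", "journal", "year", "doi", "link")


@dataclass
class CheckedReference:
    exists: bool    # True if the reference was located in any database
    accurate: dict  # element name -> True if it matched the database record


def incorrect_elements(ref: CheckedReference) -> list:
    """List the elements that did not match the database record."""
    return [e for e in ELEMENTS if not ref.accurate.get(e, False)]


def classify(ref: CheckedReference) -> str:
    """Apply the fabricated / inaccurate / correct rule described above."""
    wrong = incorrect_elements(ref)
    if not ref.exists or len(wrong) == len(ELEMENTS):
        return "fabricated or nonexistent"
    if wrong:
        return "incorrect/inaccurate"
    return "correct/accurate"


if __name__ == "__main__":
    all_ok = {e: True for e in ELEMENTS}
    print(classify(CheckedReference(True, all_ok)))
    print(classify(CheckedReference(True, {**all_ok, "year": False})))
    print(classify(CheckedReference(False, {})))
```

Counting the output of `incorrect_elements` over all references gives the frequency of incorrect components mentioned in the text.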
Data Analysis
Descriptive statistics were employed to present the data numerically and as percentages.
All statistical analyses were conducted using Jamovi Version 2.6.19 (The Jamovi Project,
Australia), facilitating effective data summarization and analysis. Descriptive statistics
were used to evaluate the references generated by ChatGPT 3.5, ChatGPT 4, and
Gemini, focusing on three key aspects: completeness, accuracy, and fabrication. These three
aspects were also assessed separately for the 11 subdomains of orthodontics to determine
whether certain topics influenced accuracy. Bar graphs were used to show the proportion of
accurate references in each subdomain. A reliability score was calculated using Cronbach's α
to measure the internal consistency of the models in generating accurate references. A
Cronbach's α value below 0.5 was considered unacceptable, whereas a value above 0.7 was
considered acceptable. The results are presented visually as a correlation heat map
illustrating the relationship in accuracy between the three models.
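For readers unfamiliar with the statistic, Cronbach's α for k items is k/(k − 1) × (1 − Σ item variances / variance of the summed scores). The Python sketch below implements this standard formula with sample variances. The text does not state how the data were arranged, so treating each model as an item and each orthodontic subdomain as an observation is an assumption, as are the function names and the wording of the intermediate label.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items, each a list with one score per observation.

    Assumed layout: one list per AI model, one score per subdomain.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(items[0])
    if n < 2 or any(len(col) != n for col in items):
        raise ValueError("items must have equal length of at least two")
    totals = [sum(scores) for scores in zip(*items)]  # summed score per observation
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total score has zero variance")
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / total_var)


def interpret(alpha):
    """Thresholds stated in the text; the band between them is not labeled there."""
    if alpha < 0.5:
        return "unacceptable"
    if alpha > 0.7:
        return "acceptable"
    return "between the stated thresholds"


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    demo = [[1, 2, 3, 4], [2, 1, 4, 3]]
    print(cronbach_alpha(demo), interpret(cronbach_alpha(demo)))
```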
Results
A total of 203 references were generated by the AI models, with ChatGPT 4 contributing
52% (n = 105) of the references, Gemini 27% (n = 55), and ChatGPT 3.5 21% (n = 43) ([Fig. 1]).
Fig. 1 References generated by the artificial intelligence (AI) models.
Of all references generated by the AI models, only 15.76% were correct, whereas 71.92%
were fake or fabricated and 12.32% were inaccurate. Gemini had the highest proportion of
correct references (36.36%), followed by ChatGPT 3.5 (25.58%) and ChatGPT 4 (0.95%); the
difference was significant (p < 0.01). On the other hand, ChatGPT 4 produced the highest
proportion of fabricated references (99.05%), followed by ChatGPT 3.5 (53.49%) and Gemini
(34.55%) (p < 0.01). In terms of inaccurate references, Gemini accounted for the highest
proportion (29.09%), followed by ChatGPT 3.5 (20.93%) (p = 0.01).
The reliability score across the models was 0.418, indicating low-to-moderate consistency
in the accuracy of the references ([Table 1]). The correlation heat map shows a weak
correlation between GPT 3.5 and Gemini, while GPT 4 has a negligible correlation with both
models, indicating distinct output across the models ([Fig. 2]).
Table 1 Proportion of correct/fake/inaccurate responses generated in each model
| Response categories | ChatGPT 3.5, n (%) | ChatGPT 4, n (%) | Gemini, n (%) | Total references, n (%) | p-value | Reliability score |
| --- | --- | --- | --- | --- | --- | --- |
| Correct references | 11 (25.58) | 1 (0.95) | 20 (36.36) | 32 (15.76) | < 0.01 | 0.418 |
| Fake/fabricated references | 23 (53.49) | 104 (99.05) | 19 (34.55) | 146 (71.92) | < 0.01 | |
| Inaccurate/incomplete references | 9 (20.93) | 0 (0) | 16 (29.09) | 25 (12.32) | 0.01 | |
Fig. 2 Correlation heat map.
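The percentages reported for each model can be recomputed from the reference counts in Table 1. The following Python sketch does so, assuming each percentage is the count divided by the number of references that the same model generated; the variable and function names are illustrative only.

```python
# Reference counts per model and category, as reported in Table 1.
COUNTS = {
    "ChatGPT 3.5": {"correct": 11, "fabricated": 23, "inaccurate": 9},
    "ChatGPT 4": {"correct": 1, "fabricated": 104, "inaccurate": 0},
    "Gemini": {"correct": 20, "fabricated": 19, "inaccurate": 16},
}


def percentages(row):
    """Convert one row of counts into percentages of that row's total."""
    total = sum(row.values())
    return {category: round(100 * n / total, 2) for category, n in row.items()}


def overall(counts):
    """Sum the counts for each category across all models."""
    totals = {}
    for row in counts.values():
        for category, n in row.items():
            totals[category] = totals.get(category, 0) + n
    return totals


if __name__ == "__main__":
    for model, row in COUNTS.items():
        print(model, sum(row.values()), percentages(row))
    print("All models", sum(overall(COUNTS).values()), percentages(overall(COUNTS)))
```

Recomputing the shares in this way is a quick consistency check between the counts and the percentages quoted in the text.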
In terms of the content of the references generated by the LLMs, DOIs or links were fabricated
by ChatGPT 3.5 and 4, whereas Gemini in some instances provided the URL of a different paper,
returned “invalid URL,” or did not provide any link. In some cases, the models also
fabricated the author name, journal, or year. Examples of such instances are
presented in [Table 2].
Table 2 Examples of fake/accurate/incomplete references in the AI models
| Query/Prompt | GPT-3.5 | GPT-4 | Gemini |
| --- | --- | --- | --- |
| Please provide the references in Vancouver style and their links in recent literature on malocclusion classification | Macari S, et al Orthopedic devices for skeletal Class III malocclusion treatment in growing patients: A systematic review. J Clin Med. 2024;13(23):7141. https://www.mdpi.com/2077-0383/13/23/7141 (incorrect author name) | Baccetti T, Franchi L, McNamara JA Jr. The effect of Class II malocclusion on craniofacial growth in the early mixed dentition. Am J Orthod Dentofacial Orthop. 2023;163(3):413–419. 10.1016/j.ajodo.2022.07.015 | Kharbanda S, Singh R, Singh P, et al. Malocclusion and occlusal traits among dental and nursing students of Seven North-East states of India. J Indian Soc Pedodontic Prev Dent. 2023;41(4):306–311. https://pubmed.ncbi.nlm.nih.gov/37715698/ |
| | Lo Giudice A, et al. Three-stage non-surgical treatment of skeletal Class III malocclusion in adolescents: A long-term follow-up report. J Clin Med. 2024;13(15):5637. | Ghoneima A, El-Bialy T. A novel classification system for malocclusion based on three-dimensional cone-beam computed tomography. J Orthod Sci. 2022;11(4):130–136. 10.4103/jos.JOS_88_22. | Adel A, Allareddy V, Al-Hezaimi K, et al. Future of Orthodontics—A Systematic Review and Meta-Analysis on the Emerging Trends in This Field. J Dent Res. 2021;100(12):1449–1461. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9861462 |
| | Guo Y, et al. Classification and characterization of facial asymmetry in skeletal malocclusions using CBCT. Am J Orthod Dentofacial Orthop. 2024;145(5):589–598. https://www.mdpi.com/2076-3417/14/17/8095 | Al-Moghrabi D, Muntaser I, Al-Dosary M. Comparison of Angle's classification and the IOTN (Index of Orthodontic Treatment Need) for malocclusion diagnosis. J Dent Sci. 2022;17(5):287–293. 10.1016/j.jdsci.2022.07.001. | Graber LW, Vanarsdall RL, Vig KGO. Orthodontics: Current Principles and Techniques. 5th ed. St. Louis, MO: Mosby; 2015. (year incorrect) |
| | Pavoni C, et al. Genetic predisposition in skeletal Class III malocclusions: A systematic review. Eur J Orthod. 2023;45(3):239–250. https://www.mdpi.com/2077-0383/13/23/7141 | Kokich VG, Kokich VO. Malocclusion classification: Revisited and clarified. Orthodontics (Chic.). 2023;54(1):28–35. 10.1016/j.jdsci.2022.07.001 | Burstone CJ, Burstone MT. Essentials of Orthodontics. 5th ed. St. Louis, MO: Mosby; 2018. |
| | Al-Bitar Z, et al. Impact of malocclusion classification on oral health-related quality of life. BMC Oral Health. 2024;24:156. https://www.mdpi.com/2077-0383/13/23/7141 | Zaib M, Huda N, Irfan M. Evaluation of malocclusion using the new AO classification system and its correlation with cephalometric parameters. J Clin Orthod. 2023;57(10):610–617. 10.2319/021922-105.1 | American Association of Orthodontists. Clinical Practice Guidelines. [invalid URL removed] |
Abbreviations: AI, artificial intelligence; DOI, digital object identifier.
Note: Sky blue color indicates nonexistent DOI/links; Violet color indicates fabricated
author name/title/journal.
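Several examples in Table 2 attach the same link or DOI to different titles, which is a quick signal that at least one of those references needs checking. The Python sketch below flags such reuse. It is an illustrative heuristic with hypothetical names, not part of the validation procedure used in this study, and the sample data are taken from Table 2.

```python
from collections import defaultdict


def reused_identifiers(references):
    """Return identifiers (DOIs or links) that are attached to more than one title.

    `references` is a list of (title, identifier) pairs; empty identifiers are skipped.
    """
    titles_by_id = defaultdict(set)
    for title, identifier in references:
        if identifier:
            key = identifier.strip().rstrip(".").lower()
            titles_by_id[key].add(title)
    return {key: sorted(titles) for key, titles in titles_by_id.items() if len(titles) > 1}


# DOIs from the GPT-4 column of Table 2.
GPT4_EXAMPLES = [
    ("A novel classification system for malocclusion based on three-dimensional "
     "cone-beam computed tomography", "10.4103/jos.JOS_88_22."),
    ("Comparison of Angle's classification and the IOTN (Index of Orthodontic "
     "Treatment Need) for malocclusion diagnosis", "10.1016/j.jdsci.2022.07.001."),
    ("Malocclusion classification: Revisited and clarified", "10.1016/j.jdsci.2022.07.001"),
]

if __name__ == "__main__":
    for identifier, titles in reused_identifiers(GPT4_EXAMPLES).items():
        print(identifier, "is shared by", len(titles), "titles")
```

An identifier that is not reused can still be fabricated, so this check only narrows down where to look first.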
ChatGPT 3.5 showed moderate inaccuracy, with incomplete references ranging from 33.33 to 75%;
the highest proportion was in retention protocols (75%), followed by treatment modalities
(50%) and malocclusion classification (40%). Its fabricated references showed a varied
pattern, ranging from 0 to 100%: orthodontic biomechanics and clear aligner therapy showed
100% fabrication, whereas treatment modalities showed only a 20% fabrication rate and
interdisciplinary treatment and patient-centered care showed 0%. The model provided accurate
references in only two fields, treatment modalities and general orthodontics, with 28 and 80%
being accurate, respectively ([Supplementary Fig. S1], available in the online version only).
ChatGPT 4 produced fabricated references across all fields of the orthodontic literature,
ranging from 87.5% (malocclusion classification) to 100% (remaining fields) ([Supplementary Fig. S2], available in the online version only).
Gemini performed better than the other AI models. Fabricated references were below 50% in
the majority of fields, except patient-centered care (100%). Incomplete references across the
orthodontic fields ranged from 50 to 60%, with the highest proportion in general orthodontics
(60%), followed by malocclusion classification (57.14%). Gemini produced 100% correct
references in the fields of orthodontic outcomes, retention protocols, clear aligner therapy,
and orthodontic biomechanics ([Supplementary Fig. S3], available in the online version only).
Discussion
There has been an explosion of research into the accuracy and reliability of AI responses
in the scientific arena. This indicates that the usage of AI is rapidly increasing
in a variety of fields, including dental education and research. Given its frequent
use, it is vital to investigate the accuracy of the information and sources generated
by these models. For that reason, this study sought to assess the reliability and
correctness of references generated by various AI models (ChatGPT-3.5, GPT-4, and
Gemini) on specific orthodontic topics. Our findings revealed that the vast majority
of the references generated by the models were inaccurate, with weak reliability across
the three models. Such findings are concerning considering that trustworthiness is
a key component of the dissemination of scientific information.[25] [26]
The data show that the three AI models vary in performance. Gemini performed better
(36.36%) in producing accurate references than GPT 3.5 (25.58%) and GPT 4 (0.95%). These
findings are consistent with those of Omar et al,[21] who compared references generated by
Gemini with ChatGPT-4-generated references in the medical field and found that Gemini
surpassed GPT-4 in reference precision, with an accuracy of 68% versus 49.2% for GPT-4.[20]
In contrast, Pirkle et al found no significant differences between ChatGPT and Gemini in
terms of generated citations, with both producing references containing errors in publication
year and journal as well as fake authors.[27] These variations could be attributed to
differences in the architectures of ChatGPT and Gemini. The fundamental element of Gemini is
retrieval-augmented generation, which combines information retrieval from the Google platform
with text generation to produce more accurate and contextually relevant results. ChatGPT, on
the other hand, is based on reinforcement learning with human feedback, which improves answers
depending on the users' instructions. The lack of an integrated retrieval mechanism may limit
its capacity to provide accurate references, making it better suited for general-purpose
adaptation.[28]
Another significant finding of the study is that ChatGPT 4 consistently produced fabricated
references (99.05%) across all orthodontic topics, whereas ChatGPT 3.5 produced
53.49% fabricated and 20.93% incomplete references. Bhattacharyya et al[29] reported that 47%
of the references generated by ChatGPT 3.5 were fake and 46% were inaccurate. The high
proportion of fabricated references in the ChatGPT models suggests an intrinsic problem
with their reference generation algorithm or with how the models handle reference data from
the large data set on which they are trained.[30] [31] [32] Interestingly, these findings
contradict earlier studies undertaken in other domains. One study compared ChatGPT 3.5 and 4
in producing otolaryngology-related references and found that ChatGPT-4.0 outperformed
GPT 3.5.[33] Similarly, three studies[32] [34] [35] found that ChatGPT 4 performed better than
its previous version (GPT 3.5). This disparity in findings suggests that variations in the
accuracy and reliability of AI-generated references may be attributable to differences in
subject matter, emphasizing the importance of domain-specific evaluation
of AI tools. However, the difference in reliability and accuracy between the two versions
was minimal in this study, with fabricated citations persisting in both versions.
Moreover, the consistency in performance across the two versions suggests that the
improvements made in ChatGPT 4 might be more beneficial in general use cases
than in highly specialized fields where the knowledge base is critical.
While Gemini performed better overall in generating accurate references (36.36%),
it still produced some incomplete (29.09%) and fake (34.55%) references, notably for
topics such as malocclusion classification, general orthodontics, treatment modalities,
and patient-centered care. In addition, the model's references consisted primarily
of books or clinical practice guidelines that were repeated across multiple queries.
Similarly, Chelli et al[36] observed that Bard, the previous version of Gemini, appeared to
take a try-and-repeat approach, producing many versions of publications. This suggests that
Gemini depends mainly on a small number of reputable sources rather than providing diverse or
topic-specific references. While this strategy may improve accuracy in particular contexts,
it also shows a lack of adaptability and depth in generating nuanced references for specific
research needs. A positive aspect of Gemini that sets it apart from the ChatGPT
models is that its interface displays the footnote “Gemini can make mistakes, so double-check
it.” Such a cautionary statement can enhance users' trust
by promoting transparency and encouraging critical evaluation of AI-generated responses,
thus fostering a more informed and responsible use of AI in scientific writing and
referencing.
Overall, the findings showed that, while AI models can be an effective tool for assisting
dental academicians and researchers in summarizing, paraphrasing, and exploring
orthodontic-related content, the citations generated by these AI models are not reliable and
thus require human validation, particularly for author details, year of publication, title,
and DOI. LLMs have been found to produce fictitious information, a phenomenon known as AI
hallucination.[37] This includes instances in which the models produce citations with
incorrect DOIs, journal names, or author details.[38] [39] Users must double-check the
authenticity of the DOI even when the models provide one, as generated DOIs frequently diverge
from the correct DOI, which could result in referenced sources that are unavailable or
nonexistent. Therefore, user verification is essential when using AI models to maintain the
scientific integrity and dependability of scientific communication, particularly in
academic and research settings.[36] [40]
These findings have implications for research and practice. From a research
perspective, more comparative studies are needed to evaluate the performance of these models
in other disciplines. AI researchers must investigate the mechanisms underlying fabrication
and inaccuracy in the models, develop algorithms to minimize hallucination, and devise methods
to improve references, such as an integrated research library.[5] [36] Moreover, AI researchers
must collaborate with domain experts to refine AI models for orthodontics-related literature.
From a practice perspective, research scholars must use AI-generated references with caution.
Clear guidelines and policies should be developed for the responsible use of LLMs in
scientific or academic workflows, emphasizing human oversight to mitigate the risk of
inaccuracy and misuse while leveraging their potential for enhancing the efficiency and
quality of research work.[25] It is imperative to validate these citations by verifying and
cross-checking the details. Dental education curricula must incorporate training on the
efficient use of AI tools, making students aware of their strengths and limitations.[41]
Editors of scientific journals must also cross-check references to ensure that
fabricated references are not cited in scholarly literature, particularly in preprint
versions. Advances in AI models require parallel improvement in fabrication detection
capabilities to address the emerging challenges effectively.[30]
This study has a few limitations. First, only a single prompt was used for each
query; employing alternative or more specific prompts could have produced different
outcomes, underscoring the relevance of prompt variation when evaluating
AI performance.[29] In addition, all data were collected at one specific point in time from
only three AI models. Given the speed of development, refinement, and upgrading of AI models
based on the incorporation of users' feedback, our findings reflect a snapshot of current
abilities, emphasizing the need for continuing research to follow
the growth of AI performance in generating references. Future studies should
use multiple and diverse prompts to assess any change in the accuracy of AI models.
Future studies may also assess the capabilities of newly emerging AI chatbots
within various subfields of orthodontics to better understand how AI
performance depends on the depth of the topics.[42]
Conclusion
The findings indicate that references generated by LLMs are not trustworthy. While
Gemini performed better than the GPT models, significant limitations remain in
all three models in reference generation in the field of orthodontics. AI researchers
must investigate the reasons behind the fabricated references and develop methods to
improve the accuracy of the references. Additionally, these findings advocate for balanced
and cautious use of AI tools, not only in academic research related to orthodontics
but also in their general applications, emphasizing human validation of AI responses
and training of dental professionals and researchers in the efficient use of AI tools,
thus prioritizing accuracy and scientific credibility.