Keywords
artificial intelligence applications - Bard - Bing - ChatGPT - glaucoma
Introduction
Technology is developing rapidly, and its benefits are now felt in all fields of medicine [1]. One such development is artificial intelligence. Artificial intelligence is a branch of computer science that enables computer systems to think and make decisions in a manner similar to human intelligence; that is, it aims to imitate human intelligence [2]. The term was first used at a conference at Dartmouth College in 1956, but the first studies were conducted in the early 1970s [3]. Artificial intelligence applications offer a wide variety of capabilities, such as perceiving and interpreting images, developing solutions to problems, and understanding spoken language [4].
Glaucoma is a progressive optic neuropathy characterized by degeneration of retinal ganglion cells and the retinal nerve fiber layer [5]. It is a leading cause of irreversible vision loss worldwide and is closely related to quality of life [6]. Although the most important modifiable risk factor is increased intraocular pressure, there are additional risk factors such as age, myopia, genetic predisposition, and vasospasm [7], [8], [9].
This study aimed to examine how successfully the ChatGPT, Bard, and Bing artificial intelligence chatbots answer questions about glaucoma and to investigate whether any of them is superior to the others.
Method
All 32 questions in the study questions section of the American Academy of Ophthalmology 2022 – 2023 Basic and Clinical Science Course Glaucoma book that relate to glaucoma subtypes and treatment modalities were included in the study [10]. Of the 32 questions, 28 assessed knowledge of the diagnosis of glaucoma, while the remaining four assessed knowledge of its treatment. Before the questions were presented, each of the three artificial intelligence chatbots was asked about its working algorithm, the sources from which it gathered the data it uses, and its advantages and disadvantages. The features of the artificial intelligence programs were summarized from these answers.
ChatGPT GPT-3.5 (OpenAI; San Francisco, CA, USA) is an artificial intelligence chatbot belonging to the large language model (LLM) family that can understand text input and produce human-like responses. It was trained on a large dataset (drawn from books, articles, websites, and many other sources) extending up to 2021 and does not have the ability to search the Internet in real time. Bard (Google AI; California, USA) is an LLM-based artificial intelligence chatbot that uses natural language processing (NLP) algorithms to understand the meaning and structure of text, and machine learning algorithms to improve its performance in generating more advanced text, translating languages more accurately, and answering questions informatively. It obtains its data through Google Search, can access freely available data, and is constantly updated by Google. Bing (Microsoft; Washington, USA) is an LLM-based artificial intelligence chatbot that is trained on a very large dataset and obtains its information from free sources on the internet. While accessing this information, it can also provide reference sources; access to paid data is restricted, and its search results may sometimes contain incomplete information. The three programs also share common features, including the potential to misinterpret questions, generate incorrect answers, and tailor responses based on analysis of the chat history.
Each of the three chatbots, ChatGPT (GPT-3.5), Bard, and Bing, all offered free of charge by their respective manufacturers, was given the command: “I will ask you multiple choice questions. Notify me of the correct answer option.” Each question was then asked separately to each of the three artificial intelligence chatbots, and the chat history was cleared after each question to avoid any memory retention effect of the programs. The answer each chatbot gave to each question was compared with the answer key at the back of the book and recorded as correct or incorrect. The questions were grouped by topic, and the questions that all chatbots answered correctly or incorrectly in common were also noted. The questions were posed to the artificial intelligence programs on June 14, 2023.
Statistical analysis
Statistical Package for the Social Sciences version 23 (SPSS Inc., Chicago, IL, USA)
was used for statistical analysis of the data. Pearsonʼs chi-square test and Yatesʼ corrected chi-square test were used to compare percentages and nominal variables. The level of significance was set at p < 0.05.
Results
ChatGPT gave correct answers to 18 (56.3%) of the 32 questions and incorrect answers to 13 (40.6%). One (3.1%) question was answered with: “I can access information until September 2021. I have no idea about it.” Of the correctly answered questions, 17 (94.4%) were diagnosis related and 1 (5.6%) was treatment related. Of the incorrectly answered questions, 10 (76.9%) were diagnosis related and 3 (23.1%) were treatment related.
Bard gave correct answers to 25 (78.1%) of the 32 questions and incorrect answers to 7 (21.9%). Of the correctly answered questions, 23 (92%) were diagnosis related and 2 (8%) were treatment related. Of the incorrectly answered questions, 5 (71.4%) were diagnosis related and 2 (28.6%) were treatment related.
Bing gave correct answers to 19 (59.4%) of the 32 questions and incorrect answers to 12 (37.5%). One (3.1%) question was answered with: “There are different publications related to the answer to this question. I canʼt choose the answer option.” Of the correctly answered questions, 17 (89.5%) were diagnosis related and 2 (10.5%) were treatment related. Of the incorrectly answered questions, 10 (83.3%) were diagnosis related and 2 (16.7%) were treatment related. ChatGPT and Bing each failed to provide an answer to the same question, which tested knowledge of the diagnosis.
All three artificial intelligence chatbots gave correct answers to the same 13 (40.6%) questions and incorrect answers to the same 3 (9.4%) questions. All 13 of the commonly correct questions tested knowledge related to diagnosis. Two (66.7%) of the commonly incorrect questions tested knowledge related to diagnosis and 1 (33.3%) tested knowledge related to treatment ([Table 1]).
Table 1 The success of artificial intelligence chatbots on questions related to glaucoma
diagnosis and treatment modalities.
| Answers (n) | ChatGPT | Bard | Bing |
|---|---|---|---|
| Correct (Diagnosis/Treatment) | 18 (56.3%) (17/1) | 25 (78.1%) (23/2) | 19 (59.4%) (17/2) |
| Incorrect (Diagnosis/Treatment) | 13 (40.6%) (10/3) | 7 (21.9%) (5/2) | 12 (37.5%) (10/2) |
| p value | 0.195 | | |
| Same answers (n) | 16 (50%) | | |
| Correct (Diagnosis/Treatment) | 13 (40.6%) (13/0) | | |
| Incorrect (Diagnosis/Treatment) | 3 (9.4%) (2/1) | | |
Overall, the correct and incorrect answer rates did not differ significantly among the three artificial intelligence chatbots (p = 0.195, Pearsonʼs chi-square test). No significant difference was observed between the correct and incorrect response rates of ChatGPT and Bing (p = 1.0, Yatesʼ chi-square test), of Bing and Bard (p = 0.238, Yatesʼ chi-square test), or of ChatGPT and Bard (p = 0.15, Yatesʼ chi-square test).
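The comparisons above can be reproduced outside SPSS. The following is a minimal sketch in Python using scipy (not the software used in the study), under the assumption that the single unanswered question each for ChatGPT and Bing is excluded from the contingency tables; this counting recovers the reported p values.

```python
# Sketch reproducing the reported chi-square comparisons with scipy.
# Assumption (not stated explicitly in the text): the one unanswered
# question each for ChatGPT and Bing is excluded, so their tables use
# 31 questions while Bard's uses 32.
from scipy.stats import chi2_contingency

counts = {            # (correct, incorrect)
    "ChatGPT": (18, 13),
    "Bard":    (25, 7),
    "Bing":    (19, 12),
}

# Overall 3 x 2 comparison (Pearson's chi-square; scipy applies no
# continuity correction when the table has more than one degree of freedom)
table3 = [counts["ChatGPT"], counts["Bard"], counts["Bing"]]
chi2, p_all, dof, expected = chi2_contingency(table3)
print(f"all three chatbots: p = {p_all:.3f}")   # ~0.195

# Pairwise 2 x 2 comparisons; for 2 x 2 tables scipy applies Yates'
# continuity correction by default (correction=True)
def pairwise(a, b):
    _, p, _, _ = chi2_contingency([counts[a], counts[b]])
    return p

print(f"ChatGPT vs Bing: p = {pairwise('ChatGPT', 'Bing'):.2f}")  # ~1.0
print(f"Bing vs Bard:    p = {pairwise('Bing', 'Bard'):.3f}")     # ~0.238
print(f"ChatGPT vs Bard: p = {pairwise('ChatGPT', 'Bard'):.2f}")  # ~0.15
```

That the reported values fall out of this counting supports the reading that non-responses were omitted rather than scored as incorrect.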
Discussion
Artificial intelligence is a new field that develops continually with the advancement of technology and has established a strong place for itself in daily life. With this development, current programs differ considerably from the first artificial intelligence programs. Previously, artificial intelligence programs were primarily deep learning models designed to understand and recognize patterns. The recently developed LLM is a new type of artificial intelligence model that links each word to the words that precede it and is trained to predict the probabilities of possible next words in a sequence. When trained with sufficient data, such models can produce human-like, contextually appropriate responses. The ChatGPT, Bing, and Bard artificial intelligence programs were all developed on the basis of LLMs [11].
ChatGPT is a natural language processing model with 175 billion parameters, developed to imitate the human mindset and generate similar responses. Owing to these powerful features, it has gained a strong place among existing artificial intelligence programs [12]. It has found a wide variety of uses in medical education, including quick access to information, language translation, and personalized learning [13]. ChatGPT is a very useful artificial intelligence application for providing fast and accurate access to a variety of medical topics, including a wide range of diseases, differential diagnoses, and treatment modalities. Thanks to these advantages, it can benefit a wide range of users, from medical students to health professionals, who want fast and reliable access to information [14]. These programs can also be used to answer research questions and to produce a wide variety of text summaries. They provide great convenience to medical researchers in reviewing the literature and analyzing data, giving access to correct and up-to-date information within the large amount of information on the Internet [15]. The Bard and Bing artificial intelligence chatbots, both introduced in 2023, are current artificial intelligence programs that can be used for similar purposes.
The Bard artificial intelligence chatbot is particularly valuable for keeping pace with developing ophthalmology knowledge, as it uses internet data to access information and is constantly updated by its manufacturer. As for the Bing artificial intelligence chatbot, we think that, in addition to this feature, Bingʼs ability to reference the sources it uses creates an advantage in checking the accuracy of data and information and in providing quick access to current articles for more detailed information. Considering the advantages of these artificial intelligence chatbots in providing rapid access to accurate information, we think that they will have a positive impact on the training and development of researchers across a wide range of glaucoma types and treatment methods. This can be explained as follows: because glaucoma is a chronic disease that requires regular follow-up, with a cumulatively increasing number of patients, it demands a substantial number of patient examinations. Advanced theoretical knowledge is one of the prerequisites for providing effective patient care. Under such intense working conditions, tools that provide easy access to current information can support a sound differential diagnosis and help clinicians reach quick decisions about diagnosis and treatment. However, these artificial intelligence chatbots also have some disadvantages. Because the data used by ChatGPT extend only through 2021 and the program lacks real-time internet access, it may lag behind constantly developing ophthalmology knowledge and become out of date [12]. In our own series, ChatGPT answered 56.3% of the questions correctly, 40.6% incorrectly, and 3.1% with the answer: “I can access the information until September 2021. I have no idea about this topic.” This answer is a good example of ChatGPT being out of date. The limited access of the Bard and Bing artificial intelligence chatbots to paid articles on the internet can make current information difficult to reach and may even prevent them from noticing new developments. These disadvantages may negatively affect the performance of artificial intelligence chatbots in accessing accurate and up-to-date information. Another consideration is the variability of the responses that artificial intelligence chatbots generate for the same question. Although chatbots exhibit this behavior, their chat history gives researchers the opportunity to obtain more consistent information: by drawing on prior interactions, chat history shapes the responses researchers receive. However, this practice has certain limitations. Although artificial intelligence programs aim to provide accurate information, there remains a risk that misinformation is inadvertently accepted and propagated because of biases. In assessing the efficacy of the chatbots, we sought to mitigate this effect by posing each question only once, although clearing the chat history may not completely eliminate this influence [2], [3], [4]. Our objective was thus to establish a standard for posing these questions.
A wide variety of studies have investigated the ability of such models to answer medical questions. In a study by Jin et al. [16], models answering yes/no questions from PubMed achieved 68.1% accuracy. In another study, a dataset of 12,723 questions was evaluated, and an accuracy rate of 36.7% was achieved. A recent study investigating the effectiveness and reliability of ChatGPT reported that more than 50% of the questions asked were answered correctly [11]. Another study reported that ChatGPT gave correct answers to more than 60% of the questions asked, a successful response rate, and concluded that its success in correctly answering questions with similar content was equal to or better than that of previous artificial intelligence programs [17]. In our study, all chatbots had a correct answer rate of over 55%; ChatGPT had a lower correct answer rate than the Bing and Bard artificial intelligence chatbots released in 2023, but the difference was not statistically significant. The reason why ChatGPT had the lowest accuracy rate may be that it cannot access data after 2021. The higher accuracy rates of the more up-to-date Bard and Bing chatbots may be related to the fact that they draw their data from the internet in real time and thus follow current developments more closely. Even so, the Bing and Bard programs could not answer all the questions correctly, and their accuracy rates differed. Their failure to answer all questions correctly may stem from their limited access to paid data, and may also be related to their tendency to misunderstand questions and produce wrong answers. We also think that the lower correct answer rate of the Bing chatbot compared to the Bard chatbot may be related to the fact that Bing can sometimes return incomplete search results.
The closed structure of the artificial intelligence chatbots, the inability to fine-tune them, and the small number of questions are the limitations of our study.
In conclusion, with the development of technology, artificial intelligence applications continue to find a place in every field of medicine, ever more effectively and accurately. To the best of our knowledge, this is the first study to compare how correctly the Bing, Bard, and ChatGPT artificial intelligence models, released by three different manufacturers and based on LLMs, answer questions on glaucoma types and treatment methods. Although none of the chatbots was found to be statistically superior to the others, all answered questions about glaucoma types and treatment modalities correctly at a level that can be considered successful, with Bard achieving a slightly higher rate of correct answers. Although artificial intelligence chatbots can be used to access information about glaucoma quickly and accurately, it still does not seem possible for them to replace human intelligence and abilities today.
Conclusion Box
Already Known:
-
Chatbots are new applications that have emerged with the development of artificial
intelligence programs.
-
Although the usability of artificial intelligence programs in ophthalmology has been tested, the three freely available artificial intelligence chatbots had not previously been compared for superiority over each other in accessing information about glaucoma diseases and treatment methods.
Newly described:
-
Although none of the three artificial intelligence programs was statistically superior to the others in answering the questions correctly, the more up-to-date Bard and Bing artificial intelligence chatbots answered the questions with higher accuracy rates.
-
Artificial intelligence programs, including current ones such as Bard and Bing, may encounter various obstacles (e.g., paid access) when accessing current and accurate information. Further development of artificial intelligence programs is needed to remedy these deficiencies.