Klin Monbl Augenheilkd 2024; 241(10): 1140-1144
DOI: 10.1055/a-2327-8484
Clinical Study

Evaluation of Current Artificial Intelligence Programs on the Knowledge of Glaucoma

Eyupcan Sensoy, Mehmet Citirik

Ophthalmology, Ankara City Hospital, Ankara, Turkey
 

Abstract

Background To measure the success of three artificial intelligence chatbots – ChatGPT, Bard, and Bing – in correctly answering questions about glaucoma types and treatment modalities, and to examine whether any of them outperforms the others.

Materials and Methods Thirty-two questions about glaucoma types and treatment modalities were asked using the ChatGPT, Bard, and Bing chatbots. Responses were recorded as correct or incorrect, and the accuracy rates of the three chatbots were compared.

Results Of the questions asked, ChatGPT answered 56.3% correctly, Bard 78.1%, and Bing 59.4%. There was no statistically significant difference among the three artificial intelligence chatbots in the rates of correct and incorrect answers (p = 0.195).

Conclusion Artificial intelligence chatbots can be used as a tool to access information regarding glaucoma types and treatment modalities. However, the information they provide is not always accurate, and care should be taken when using it.



Introduction

Technology is developing rapidly, and the benefits of these developments are felt in every field of medicine [1]. One such branch is artificial intelligence. Artificial intelligence is a branch of computer science that enables a computer system to think and make decisions in a manner similar to human intelligence; that is, it aims to imitate human intelligence [2]. The term was first used at a conference at Dartmouth College in 1956, but the first studies were conducted in the early 1970s [3]. Artificial intelligence applications offer a wide variety of capabilities, such as perceiving and interpreting images, developing solutions to problems, and understanding spoken language [4].

Glaucoma is a progressive optic neuropathy characterized by degeneration of retinal ganglion cells and the retinal nerve fiber layer [5]. It is a leading cause of irreversible vision loss worldwide and is closely related to quality of life [6]. Although the most important modifiable risk factor is increased intraocular pressure, additional risk factors include age, myopia, genetic predisposition, and vasospasm [7], [8], [9].

This study aimed to examine the success of the ChatGPT, Bard, and Bing artificial intelligence chatbots in correctly answering questions about glaucoma and to investigate whether any of them is superior to the others.



Methods

All 32 questions in the study questions section of the American Academy of Ophthalmology 2022 – 2023 Basic and Clinical Science Course Glaucoma book that relate to glaucoma subtypes and treatment modalities were included in the study [10]. Of the 32 questions, 28 assessed knowledge regarding the diagnosis of glaucoma, while the remaining four assessed knowledge about its treatment.

Before the questions were posed, each of the three artificial intelligence chatbots was asked about its working algorithm, where it gathered its data, and what its advantages and disadvantages were. The features of the programs, summarized from their answers, are as follows.

ChatGPT GPT-3.5 (OpenAI; San Francisco, CA, USA) is an artificial intelligence chatbot of the large language model (LLM) family that can understand text-based input and produce human-like responses. It is trained on a large dataset (books, articles, websites, and many other sources) extending up to 2021 and does not have the ability to search the internet in real time.

Bard (Google AI; California, USA) is an LLM-based artificial intelligence chatbot that uses natural language processing (NLP) algorithms to understand the meaning and structure of text, and machine learning algorithms to improve its performance in generating more advanced text, translating languages more accurately, and answering questions informatively. It obtains its data through Google Search, accesses freely available data, and is constantly updated by Google.

Bing (Microsoft; Washington, USA) is an LLM-based artificial intelligence chatbot that is trained on a very large dataset and obtains its information from freely available sources on the internet. While accessing this information, it can also provide reference sources. Its access to paid content is restricted, and its search results may sometimes contain incomplete information.

These three programs also share common features, including the potential to misinterpret questions, generate incorrect answers, and tailor responses based on the analysis of chat history.

For the ChatGPT (GPT-3.5), Bard, and Bing chatbots, which are offered free of charge by their three manufacturers, the following command was given: “I will ask you multiple choice questions. Notify me of the correct answer option.” Each question was asked separately to each of the three chatbots, and the chat history was cleared after each question to avoid any memory retention effect. The answer each chatbot gave was compared with the answer key at the back of the book and noted as correct or incorrect. Questions were grouped by topic, and questions answered correctly or incorrectly by all three chatbots were also noted. The questions were posed to the artificial intelligence programs on June 14, 2023.
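To make the grading procedure concrete, the sketch below shows how the protocol could be scripted. It is a minimal illustration under stated assumptions, not the workflow actually used (the questions were entered manually into each chatbot's web interface): the `ask` callable and the `questions.json` file are hypothetical stand-ins, and issuing one independent call per question plays the role of clearing the chat history.

```python
import json
from typing import Callable, Optional

# The fixed instruction given to each chatbot before every question.
PROMPT = ("I will ask you multiple choice questions. "
          "Notify me of the correct answer option.")

def grade(ask: Callable[[str], Optional[str]], questions: list[dict]) -> dict:
    """Pose every question in a fresh session and tally the outcomes.

    `ask` stands in for one chatbot: it receives the full prompt and returns
    the option letter it chose, or None if it declined to answer.
    """
    tally = {"correct": 0, "incorrect": 0, "no_answer": 0}
    for q in questions:
        # One independent call per question, mirroring the cleared chat history.
        answer = ask(PROMPT + "\n" + q["stem"])
        if answer is None:
            tally["no_answer"] += 1   # e.g., "I have no idea about it."
        elif answer == q["key"]:      # key = answer option from the back of the book
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally

if __name__ == "__main__":
    # questions.json is a hypothetical file of the form
    # [{"stem": "...", "key": "B", "topic": "diagnosis"}, ...]
    questions = json.load(open("questions.json", encoding="utf-8"))
    print(grade(lambda prompt: "A", questions))  # dummy chatbot that always picks "A"
```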

Statistical analysis

Statistical Package for the Social Sciences version 23 (SPSS Inc., Chicago, IL, USA) was used for statistical analysis of the data. Pearsonʼs chi-square test and Yatesʼ chi-square test were used to compare percentage values and nominal values with each other in the analysis. The level of significance was set at p < 0.05.



Results

ChatGPT gave correct answers to 18 (56.3%) of the 32 questions and incorrect answers to 13 (40.6%). One question (3.1%) was answered with “I can access information until September 2021. I have no idea about it.” Of the questions answered correctly, 17 (94.4%) were diagnosis related and 1 (5.6%) was treatment related. Of the questions answered incorrectly, 10 (76.9%) were diagnosis related and 3 (23.1%) were treatment related.

Bard gave correct answers to 25 (78.1%) of the 32 questions and incorrect answers to 7 (21.9%). Of the questions answered correctly, 23 (92%) were diagnosis related and 2 (8%) were treatment related. Of the questions answered incorrectly, five (71.4%) were diagnosis related and two (28.6%) were treatment related.

Bing gave correct answers to 19 (59.4%) of the 32 questions and incorrect answers to 12 (37.5%). One question (3.1%) was answered with “There are different publications related to the answer to this question. I canʼt choose the answer option.” Of the questions answered correctly, 17 (89.5%) were diagnosis related and 2 (10.5%) were treatment related. Of the questions answered incorrectly, 10 (83.3%) were diagnosis related and 2 (16.7%) were treatment related. ChatGPT and Bing each declined to select an answer to the same question, which tested diagnostic knowledge.

All three artificial intelligence chatbots gave correct answers to the same 13 (40.6%) questions and incorrect answers to the same 3 (9.4%) questions. All 13 commonly correct questions tested diagnostic knowledge. Of the three commonly incorrect questions, two (66.7%) tested diagnostic knowledge and one (33.3%) tested treatment knowledge ([Table 1]).

Table 1 The success of the artificial intelligence chatbots on questions related to glaucoma diagnosis and treatment modalities.

Answers (n)              ChatGPT       Bard          Bing
Correct                  18 (56.3%)    25 (78.1%)    19 (59.4%)
  Diagnosis/Treatment    17/1          23/2          17/2
Incorrect                13 (40.6%)    7 (21.9%)     12 (37.5%)
  Diagnosis/Treatment    10/3          5/2           10/2
p value                  0.195 (across all three chatbots)

Same answers by all three chatbots (n): 16 (50%)
  Correct                13 (40.6%)    Diagnosis/Treatment: 13/0
  Incorrect              3 (9.4%)      Diagnosis/Treatment: 2/1

Overall, the choice of chatbot did not have a significant effect on the rates of correct and incorrect answers (p = 0.195, Pearsonʼs chi-square test). No significant difference was observed between the correct and incorrect response rates of ChatGPT and Bing (p = 1.0, Yatesʼ chi-square test), between Bing and Bard (p = 0.238, Yatesʼ chi-square test), or between ChatGPT and Bard (p = 0.15, Yatesʼ chi-square test).
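The analysis above was performed in SPSS. As a cross-check under our own assumptions (a sketch, not part of the published workflow), the same tests can be reproduced with SciPy from the correct/incorrect counts in Table 1; the single unanswered question each from ChatGPT and Bing is excluded from the counts, as in the table.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts from Table 1 (unanswered questions excluded).
counts = {"ChatGPT": (18, 13), "Bard": (25, 7), "Bing": (19, 12)}

# Pearson's chi-square over the full 3 x 2 table (no continuity correction).
chi2, p, dof, _ = chi2_contingency(list(counts.values()), correction=False)
print(f"overall: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # p ~ 0.195

# Pairwise 2 x 2 comparisons with Yates' continuity correction.
for a, b in [("ChatGPT", "Bing"), ("Bing", "Bard"), ("ChatGPT", "Bard")]:
    _, p_pair, _, _ = chi2_contingency([counts[a], counts[b]], correction=True)
    print(f"{a} vs {b}: p = {p_pair:.3f}")  # ~1.0, ~0.238, ~0.15 respectively
```

Run against these counts, the script should return p-values matching the reported ones up to rounding.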



Discussion

Artificial intelligence is a field that develops continuously with the advancement of technology and has established a strong place for itself in everyday life. With this development, the current programs differ substantially from the first artificial intelligence programs. Previously, artificial intelligence programs were primarily deep learning models designed to recognize and understand patterns. The recently developed LLM is a newer type of artificial intelligence model that links each word to the words that precede it and predicts the probabilities of the words that may follow in a given string. Trained on sufficient data, such models can produce answers comparable to those of human intelligence. The ChatGPT, Bing, and Bard artificial intelligence programs were developed on the basis of the LLM [11].

ChatGPT is a natural language processing model with 175 billion parameters, developed to imitate the human mindset and generate similar responses. It has gained a strong position among existing artificial intelligence programs owing to its powerful features [12] and has found a wide variety of uses in medical education, such as quick access to information, language translation, and personalized learning [13]. ChatGPT is a very useful artificial intelligence application for providing fast access to information on various medical topics, including a wide variety of diseases, differential diagnoses, and treatment modalities. Thanks to these advantages, it can benefit a broad range of users, from medical students to health professionals who need fast and reliable access to information [14]. A further use of these programs is answering research questions and summarizing texts; they offer medical researchers great convenience in retrieving accurate and up-to-date information from the vast amount of information on the internet, reviewing the literature, and analyzing a wide variety of data [15].

The Bard and Bing artificial intelligence chatbots, introduced in 2023, are current programs that can be used for similar purposes. The Bard chatbot is particularly valuable for keeping up with developing ophthalmologic knowledge, as it draws on internet data and is constantly updated by its manufacturer. The Bing chatbot shares this feature, and we think that its ability to cite the sources it uses additionally provides an advantage for checking the accuracy of the information and for quick access to current articles when more detail is needed. Considering these advantages in providing rapid access to accurate information, we think these chatbots will have a positive impact on the training and development of researchers regarding the wide range of glaucoma types and treatment methods. Glaucoma is a chronic disease that requires regular follow-up, and the cumulative increase in the number of patients entails a substantial number of patient examinations. Advanced theoretical knowledge is a prerequisite for effective patient care, and under such intense working conditions, tools that provide easy access to current information can support differential diagnosis and quick decisions about diagnosis and treatment.

However, these artificial intelligence chatbots also have disadvantages. Because the data used by ChatGPT extend only to 2021 and the program lacks real-time internet access, it will miss constantly developing ophthalmologic knowledge and may become outdated [12]. In our own series, ChatGPT answered 56.3% of the questions correctly, 40.6% incorrectly, and 3.1% with the response “I can access information until September 2021. I have no idea about this topic.” This answer is a good example of ChatGPT being out of date.
The limited access of the Bard and Bing chatbots to paid articles on the internet makes it difficult for them to reach such current information and may even prevent them from noticing new developments. These disadvantages may negatively affect the performance of artificial intelligence chatbots in retrieving accurate and up-to-date information. Another consideration is the variability of the responses that chatbots generate to the same question. Because chatbots tailor their output to prior interactions, researchers can obtain more consistent, or deliberately varied, information by leveraging the chat history. However, this practice has limitations: although artificial intelligence programs aim to provide accurate answers, biases carried over from earlier interactions risk the inadvertent acceptance and propagation of misinformation. In assessing the chatbots, we sought to mitigate this effect by posing each question only once and clearing the chat history afterwards, although this may not eliminate the influence completely [2], [3], [4]. Our objective was thus to establish a uniform standard for posing the questions.

A wide variety of studies have been conducted on answering medical questions. In a study by Jin et al. [16], models answering yes/no questions from PubMed achieved 68.1% accuracy. In another study, a dataset of 12,723 questions was evaluated and an accuracy rate of 36.7% was achieved. A recent study investigating the effectiveness and reliability of ChatGPT reported that it answered more than 50% of the questions correctly [11]. Another study reported that ChatGPT correctly answered more than 60% of the questions asked, a successful response rate, and concluded that its success on questions with similar content was equal to or better than that of previous artificial intelligence programs [17]. In our study, all chatbots had a correct answer rate above 55%; ChatGPT had a lower correct answer rate than the Bing and Bard chatbots released in 2023, but the difference was not statistically significant. The reason ChatGPT had the lowest accuracy rate may be that it cannot access data after 2021, whereas the higher accuracy of the more up-to-date Bard and Bing chatbots may be related to their ability to draw data from the internet in real time and follow current developments more closely. Even so, Bing and Bard could not answer all questions correctly, and their accuracy rates differed. Possible reasons are their limited access to paid data and their tendency to misunderstand questions and produce wrong answers. The lower correct answer rate of Bing compared with Bard may also be related to the fact that Bing sometimes returns incomplete search results.

The closed structure of the artificial intelligence chatbots, the inability to fine-tune them, and the small number of questions are the limitations of our study.

In conclusion, with the development of technology, artificial intelligence applications continue to find a more effective and accurate place in every field of medicine. To the best of our knowledge, this is the first study to compare the LLM-based Bing, Bard, and ChatGPT artificial intelligence models, released by three different manufacturers, on correctly answering questions related to glaucoma types and treatment methods. Although no chatbot was statistically superior to the others, all answered questions about glaucoma types and treatment modalities correctly at a level that can be considered successful, with Bard showing a slightly higher rate of correct answers. Although artificial intelligence chatbots can be used to access information related to glaucoma quickly and accurately, they still do not appear able to replace human intelligence and abilities today.

Conclusion Box

Already Known:

  • Chatbots are new applications that have emerged with the development of artificial intelligence programs.

  • Although the usability of artificial intelligence programs in ophthalmology has been tested, the three freely available artificial intelligence chatbots had not yet been compared with each other in accessing information about glaucoma diseases and treatment methods.

Newly described:

  • Although none of the three artificial intelligence programs was statistically superior to the others in answering the questions correctly, the more up-to-date Bard and Bing chatbots answered with higher accuracy rates.

  • Artificial intelligence programs, including current ones such as Bard and Bing, may encounter various obstacles (e.g., paid access) in reaching current and accurate information. Further development of artificial intelligence programs is needed to remedy these deficiencies.



Conflict of Interest

The authors declare that they have no conflict of interest.

  • References

  • 1 Evans RS. Electronic Health Records: Then, Now, and in the Future. Yearb Med Inform 2016; 25: S48
  • 2 Rahimy E. Deep learning applications in ophthalmology. Curr Opin Ophthalmol 2018; 29: 254-260
  • 3 Patel VL, Shortliffe EH, Stefanelli M. et al. The coming of age of artificial intelligence in medicine. Artif Intell Med 2009; 46: 5-17
  • 4 Mikolov T, Deoras A, Povey D. et al. Strategies for training large scale neural network language models. 2011 IEEE Workshop on Automatic Speech Recognition and Understanding. Waikoloa, HI, USA: ASRU Proceedings; 2011: 196-201
  • 5 Harasymowycz P, Birt C, Gooi P. et al. Medical Management of Glaucoma in the 21st Century from a Canadian Perspective. J Ophthalmol 2016;
  • 6 Thomas S, Hodge W, Malvankar-Mehta M. The Cost-Effectiveness Analysis of Teleglaucoma Screening Device. PLoS One 2015; 10: e0137913
  • 7 Imrie C, Tatham AJ. Glaucoma: the patientʼs perspective. Br J Gen Pract 2016; 66: e371
  • 8 McMonnies CW. Glaucoma history and risk factors. J Optom 2017; 10: 71-78
  • 9 Hashemi H, Mohammadi M, Zandvakil N. et al. Prevalence and risk factors of glaucoma in an adult population from Shahroud, Iran. J Curr Ophthalmol 2018; 31: 366-372
  • 10 Tanna AP, Boland MV, Giaconi JA. et al. Glaucoma. San Francisco: American Academy of Ophthalmology; 2022
  • 11 Kung TH, Cheatham M, Medenilla A. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023; 2: e0000198
  • 12 Wen J, Wang W. The future of ChatGPT in academic research and publishing: A commentary for clinical and translational medicine. Clin Transl Med 2023; 13: e1207
  • 13 Khan RA, Jawaid M, Khan AR. et al. ChatGPT – Reshaping medical education and clinical management. Pak J Med Sci 2023; 39: 605
  • 14 Jeblick K, Schachtner B, Dexl J. et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol 2023; 34: 2817-2825
  • 15 Gao CA, Howard FM, Markov NS. et al. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. bioRxiv 2022; 2022: 12.23.521610
  • 16 Jin Q, Dhingra B, Liu Z. et al. PubMedQA: A Dataset for Biomedical Research Question Answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics; 2019: 2567-2577
  • 17 Gilson A, Safranek CW, Huang T. et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023; 9: e45312

Correspondence

Dr. Eyupcan Sensoy, MD
Ophthalmology
Ankara City Hospital
Varlik Mahallesi, Halil Sezai Erkut Caddesi, Yenimahalle
060000 Ankara
Turkey   
Phone: + 90 (0) 54 64 17 75 38   

Publication History

Received: 12 August 2023

Accepted: 12 May 2024

Article published online:
24 July 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
