DOI: 10.1055/a-2765-6930
Baseline Evaluation of Claude Opus 4 for Diabetes Management: A Preliminary Assessment and Lessons for Implementation
Abstract
Background
Claude Opus 4 is a large language model (LLM) with improved reasoning capabilities and broader contextual understanding compared with earlier versions. Despite the growing use of LLMs for seeking medical information, structured, simulation-based evaluations of Claude Opus 4 in diabetes management remain limited, particularly across domains such as patient education, clinical reasoning, and emotional support.
Objectives
This study aimed to conduct a baseline evaluation of Claude Opus 4's performance across key domains of diabetes care (i.e., patient education, clinical reasoning, and emotional support), and to identify preliminary insights that can inform future, evidence-based integration strategies.
Methods
A three-step evaluation was conducted: (1) 30 diabetes management questions assessed using expert endocrinologist evaluation, (2) five fictional diabetes cases evaluated for clinical decision-making, and (3) emotional support responses assessed for appropriateness and empathy. Three expert endocrinologists graded responses according to American Diabetes Association guidelines.
Results
Claude Opus 4 achieved 80% accuracy on general diabetes knowledge questions, with high response reproducibility (96.7%), indicating baseline rather than clinically adequate performance. Clinical case evaluations showed moderate utility (mean expert rating: 4.4/7), while emotional-support assessments yielded high scores for empathy (6.2/7) and appropriateness (6.0/7). These findings suggest that although the model shows promising informational and emotional-support capabilities, its current performance is insufficient for autonomous clinical use and should be viewed as preliminary evidence to guide future, patient-inclusive validation studies.
Conclusion
Although this baseline assessment yields preliminary findings suggesting potential applications of Claude Opus 4 in diabetes care, education, and emotional support, its reliance on fictional cases underscores the need for real-world validation with clinical data to determine true clinical utility and patient-centered impact. The simulation-based evaluation also offers practical lessons for researchers designing future LLM assessments, highlighting the need for mixed expert–patient panels, contextual validation, and person-centered metrics beyond numerical accuracy.
Keywords
artificial intelligence - evaluation studies - diabetes management - health communication - patient education - clinical decision support - health information technology evaluation - medical informatics
Protection of Human and Animal Subjects
This study was approved by the Florida International University Institutional Review Board (IRB Protocol Number: IRB-25-0342). As the study evaluated an AI system using fictional clinical cases and expert physician evaluations of AI-generated responses, no human participants were directly involved in data collection. The study complied with the Declaration of Helsinki regarding research involving human data. The expert physicians who evaluated the AI responses provided informed consent to participate in the evaluation process.
Publication History
Submitted: July 3, 2025
Accepted: December 4, 2025
Accepted Manuscript online: December 8, 2025
Article published online: December 18, 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
