Appl Clin Inform 2025; 16(05): 1881-1891
DOI: 10.1055/a-2765-6930
Research Article

Baseline Evaluation of Claude Opus 4 for Diabetes Management: A Preliminary Assessment and Lessons for Implementation

Authors

  • Pouyan Esmaeilzadeh

    1   Department of Information Systems and Business Analytics, College of Business, Florida International University (FIU), Miami, Florida, United States

Abstract

Background

Claude Opus 4 is a large language model (LLM) that features improved reasoning capabilities and broader contextual understanding compared to earlier versions. Despite the growing use of LLM systems for seeking medical information, structured and simulation-based evaluations of Claude Opus 4's capabilities in diabetes management remain limited, particularly across domains such as patient education, clinical reasoning, and emotional support.

Objectives

This study aimed to conduct a baseline evaluation of Claude Opus 4's performance across key domains of diabetes care (i.e., patient education, clinical reasoning, and emotional support), and to identify preliminary insights that can inform future, evidence-based integration strategies.

Methods

A three-step evaluation was conducted: (1) responses to 30 diabetes management questions were assessed by expert endocrinologists, (2) responses to five fictional diabetes cases were evaluated for clinical decision-making, and (3) emotional support responses were assessed for appropriateness and empathy. Three expert endocrinologists graded responses according to American Diabetes Association guidelines.

Results

Claude Opus 4 achieved 80% accuracy in general diabetes knowledge, with high response reproducibility (96.7%), indicating baseline rather than clinically adequate performance. Clinical case evaluations showed moderate utility (mean expert rating = 4.4/7), while emotional-support assessments yielded high scores for empathy (6.2/7) and appropriateness (6.0/7). These findings suggest that although the model demonstrates promising informational and emotional-support capabilities, its current performance remains insufficient for autonomous clinical use and should be viewed as preliminary evidence to guide future, patient-inclusive validation studies.
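As a rough illustration of the arithmetic behind the reported percentages, the sketch below recomputes them from raw counts. The question total (30) comes from the Methods; the counts of 24 correct and 29 reproducible responses are assumptions inferred from the reported 80% and 96.7%, and the abstract does not state them or confirm that reproducibility was computed over the same 30 questions. The helper name `percent` is hypothetical.

```python
def percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * hits / total, 1)


TOTAL_QUESTIONS = 30  # stated in the Methods

# Assumed counts, inferred from the reported percentages (not given in the abstract).
assumed_correct = 24
assumed_reproducible = 29

accuracy = percent(assumed_correct, TOTAL_QUESTIONS)
reproducibility = percent(assumed_reproducible, TOTAL_QUESTIONS)

print(f"accuracy: {accuracy}%")
print(f"reproducibility: {reproducibility}%")
```

Under these assumed counts, 80% accuracy would correspond to 6 of 30 responses graded as not accurate, which is consistent with the description of the performance as baseline rather than clinically adequate.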

Conclusion

Although these preliminary findings suggest potential applications for Claude Opus 4 in diabetes care, education, and emotional support, this baseline assessment used fictional cases, which underscores the need for real-world validation with clinical data to determine true clinical utility and patient-centered impact. This simulation-based evaluation also offers practical lessons for researchers designing future LLM assessments, highlighting the need for mixed expert–patient panels, contextual validation, and person-centered metrics beyond numerical accuracy.

Protection of Human and Animal Subjects

This study was approved by the Florida International University's Institutional Review Board (IRB Protocol Number: IRB-25-0342). As this study evaluated an AI system using fictional clinical cases and expert physician evaluations of AI-generated responses, no human participants were directly involved in data collection. The study complied with the Declaration of Helsinki regarding research involving human data. The expert physicians who evaluated the AI responses provided informed consent to participate in the evaluation process.




Publication History

Received: 03 July 2025

Accepted: 04 December 2025

Accepted Manuscript online:
08 December 2025

Article published online:
18 December 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany