DOI: 10.1055/a-2802-2998
Can Artificial Intelligence Align with Evidence? Performance of ChatGPT-4o in Knee Osteoarthritis Surgical Guidelines
Abstract
Large language models (LLMs) such as ChatGPT are increasingly used in clinical settings, yet their reliability in reproducing evidence-based recommendations remains uncertain. This study aimed to evaluate the performance of ChatGPT-4o in addressing clinical practice guideline (CPG) recommendations for the surgical management of knee osteoarthritis and total knee arthroplasty (TKA). An observational cross-sectional study was conducted in September 2025. Twenty recommendations from the most recent American Academy of Orthopaedic Surgeons CPG on TKA were translated into structured clinical questions and submitted to ChatGPT-4o. Each query was entered three times in independent sessions to evaluate textual consistency. Two independent reviewers with expertise in musculoskeletal physiotherapy and orthopedics appraised the chatbot's answers, classifying them according to the CPG framework (“should do,” “could do,” “do not do,” “uncertain”). Agreement between reviewers and alignment with the CPG recommendations were assessed using Cohen's and Fleiss' kappa coefficients. ChatGPT-4o achieved an overall concordance of 60% with the CPG recommendations, representing fair agreement (κ = 0.392, p = 0.005). Internal text consistency across repeated trials was low, with several responses falling below the 50% similarity threshold considered acceptable. Inter-rater reliability ranged from moderate to almost perfect (κ = 0.547–0.946). Although ChatGPT-4o provided clinically acceptable answers in several domains, discrepancies persisted, particularly in recommendations regarding functional outcomes and rehabilitation strategies. Overall, ChatGPT-4o demonstrated moderate accuracy and heterogeneous reliability when reproducing CPG recommendations for TKA. While the model may serve as a supportive tool for education and patient communication, its variability and incomplete adherence to guidelines underscore the need for cautious integration and professional oversight in clinical decision-making.
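To make the agreement statistic concrete, the following minimal Python sketch computes Cohen's kappa for two raters over the study's four-category framework. The labels below are hypothetical placeholders, not the study's actual classifications; for more than two raters, Fleiss' kappa generalizes the same idea (statsmodels.stats.inter_rater.fleiss_kappa provides a ready implementation).

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for 20 recommendations using the four CPG categories;
# these are illustrative only, NOT the study's data.
cpg = ["should do"] * 8 + ["could do"] * 4 + ["do not do"] * 5 + ["uncertain"] * 3
gpt = ["should do"] * 6 + ["could do"] * 6 + ["do not do"] * 4 + ["uncertain"] * 4
print(f"kappa = {cohen_kappa(cpg, gpt):.3f}")
```

Kappa corrects the raw concordance rate for agreement expected by chance, which is why a 60% raw concordance can correspond to a much lower kappa (here, 0.392, in the "fair" band of the conventional interpretation scale).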
Note
This paper has been screened using the Turnitin antiplagiarism software to verify the originality of the manuscript.
Publication History
Received: 28 October 2025
Accepted: 30 January 2026
Article published online:
13 February 2026
© 2026. Thieme. All rights reserved.
Thieme Medical Publishers, Inc.
333 Seventh Avenue, 18th Floor, New York, NY 10001, USA