J Knee Surg
DOI: 10.1055/a-2802-2998
Original Article

Can Artificial Intelligence Align with Evidence? Performance of ChatGPT-4o in Knee Osteoarthritis Surgical Guidelines

Authors

  • Fernando García-Sanz

    1   Faculty of Medicine, Health and Sport, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
    2   Clínica CEMTRO, Madrid, Spain
  • María Bravo-Aguilar

    1   Faculty of Medicine, Health and Sport, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
  • Lorena Canosa-Carro

    1   Faculty of Medicine, Health and Sport, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
  • María Blanco-Morales

    1   Faculty of Medicine, Health and Sport, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
  • Carlos Romero-Morales

    1   Faculty of Medicine, Health and Sport, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
  • Ángel González-de-la-Flor

    1   Faculty of Medicine, Health and Sport, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain

Abstract

Artificial intelligence large language models (LLMs) such as ChatGPT are increasingly used in clinical settings, yet their reliability in reproducing evidence-based recommendations remains uncertain. This study aimed to evaluate the performance of ChatGPT-4o in addressing clinical practice guideline (CPG) recommendations for the surgical management of knee osteoarthritis and total knee arthroplasty (TKA). An observational cross-sectional study was conducted in September 2025. Twenty recommendations from the most recent American Academy of Orthopaedic Surgeons CPG on TKA were translated into structured clinical questions and submitted to ChatGPT-4o. Each query was entered three times in independent sessions to evaluate textual consistency. Two independent reviewers with expertise in musculoskeletal physiotherapy and orthopedics appraised the chatbot's answers, classifying them according to the CPG framework (“should do,” “could do,” “do not do,” “uncertain”). Agreement between reviewers and alignment with the CPG recommendations were assessed using Cohen's and Fleiss' kappa coefficients. ChatGPT-4o achieved an overall concordance of 60% with the CPG recommendations, representing fair agreement (κ = 0.392, p = 0.005). Internal text consistency across repeated trials was low, with several responses showing unacceptable similarity levels (<50%). Inter-rater reliability ranged from moderate to almost perfect (κ = 0.547–0.946). Although ChatGPT-4o provided clinically acceptable answers in several domains, discrepancies persisted, particularly in recommendations regarding functional outcomes and rehabilitation strategies. ChatGPT-4o demonstrated moderate accuracy and heterogeneous reliability when reproducing CPG recommendations for TKA. While the model may serve as a supportive tool for education and patient communication, its variability and incomplete adherence to guidelines highlight the need for cautious integration and professional oversight in clinical decision-making.
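The abstract's agreement statistic, Cohen's kappa for two raters, corrects observed agreement for the agreement expected by chance. A minimal sketch of that calculation is shown below; the ratings are hypothetical examples using the CPG categories named above, not the study's actual data.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Observed proportion of items where both raters assigned the same label
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance-expected agreement from each rater's marginal label frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    expected = sum(c1[lab] * c2[lab] for lab in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings over five recommendations (illustration only)
a = ["should do", "could do", "do not do", "should do", "uncertain"]
b = ["should do", "could do", "uncertain", "should do", "uncertain"]
print(round(cohens_kappa(a, b), 3))  # → 0.722
```

On the commonly used Landis and Koch scale, the study's reported values fall in the "fair" band (κ = 0.392, model vs. guideline) and the "moderate" to "almost perfect" band (κ = 0.547–0.946, between reviewers).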

Note

This manuscript was screened with the Turnitin antiplagiarism software to verify its originality.




Publication History

Received: 28 October 2025

Accepted: 30 January 2026

Article published online:
13 February 2026

© 2026. Thieme. All rights reserved.

Thieme Medical Publishers, Inc.
333 Seventh Avenue, 18th Floor, New York, NY 10001, USA