Open Access
CC BY 4.0 · Journal of Clinical Interventional Radiology ISVIR
DOI: 10.1055/s-0045-1809953
Original Article

Can ChatGPT Aid in Musculoskeletal Intervention?

Mohamed Ashiq Shazahan
1   Department of Orthopedics, Royal Orthopedic Hospital, Birmingham, United Kingdom
,
Saavi Reddy Pellakuru
2   Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, United Kingdom
,
Sonal Saran
3   Department of Musculoskeletal Radiology, All India Institute of Medical Sciences, Rishikesh, India
,
Shashank Chapala
4   Department of Radiology, AIG Hospitals, Hyderabad, India
,
Sindhura Mettu
5   Department of Radiology, Himagiri Hospitals, Hyderabad, India
,
Rajesh Botchu
2   Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, United Kingdom

Funding None.
 

Abstract

Objective

Radiology has continuously evolved by adopting cutting-edge technologies to improve patient care and is a prime example of how medical science is propelled forward by technological innovation.

In recent times, artificial intelligence (AI) has played a crucial role in various technological advancements. Chat Generative Pre-trained Transformer (ChatGPT)-4, an AI language model primarily focusing on natural language understanding and generation, is increasingly used to retrieve medical information. This study explores the utility of ChatGPT-4o in aiding imaging-guided musculoskeletal interventions, detailing its advantages and limitations.

Methods

Two musculoskeletal radiologists assessed the information generated by ChatGPT on common musculoskeletal interventions. They analyzed the overall utility of ChatGPT-4o in guiding musculoskeletal interventions by examining the procedure steps and pre- and post-procedure details provided. The assessment was documented on a 5-point Likert scale and subjected to statistical analysis.

Results

The statistical analysis of Likert scale scores from both readers revealed moderate inter-rater agreement, with a Cohen's kappa of 0.54. Across the categories, the modal Likert scores from both readers ranged from 1 to 3, indicating suboptimal performance. The lowest scores were observed for image quality, whereas the highest ratings were for post-procedure details.

Conclusion

ChatGPT-4o offers structured procedural guidance but falls short in complex, image-dependent tasks due to limited anatomical detail and contextual accuracy. It may aid education, but not clinical use without expert oversight. Domain-specific training, validation, and multidisciplinary collaboration are essential for safe and effective integration into practice.


Introduction

The emergence of artificial intelligence (AI) in health care has ushered in transformative changes across various medical disciplines.[1] Among these advancements, Chat Generative Pre-trained Transformer (ChatGPT), an AI language model developed by OpenAI, is a versatile tool with applications ranging from education to clinical decision support.[2] Its ability to generate human-like text responses has sparked interest in its potential role in guiding medical procedures, particularly in musculoskeletal (MSK) medicine. Despite this potential, the efficacy and utility of ChatGPT in MSK procedures remain to be established.

This study assesses the effectiveness of ChatGPT-4o in supporting MSK procedures by critically evaluating the information it provides on common imaging-guided MSK interventions, with a view to its potential for enhancing clinician performance and streamlining workflows.


Methods

No ethical committee approval was required as no patient data were used. ChatGPT-4o was asked to provide information on, as well as generate illustrations explaining, 5 MSK procedures ([Figs. 1] [2] [3] [4] [5]; [Supplementary Table S1], available in the online version).

Fig. 1 ChatGPT-4o-generated image of the steps of hip arthrogram under fluoroscopy.
Fig. 2 ChatGPT-4o-generated image of the steps of lumbar nerve root injection under CT guidance. CT, computed tomography.
Fig. 3 ChatGPT-4o-generated image of the steps of CT-guided bone biopsy. CT, computed tomography.
Fig. 4 ChatGPT-4o-generated image of the steps of ultrasound-guided hydrodissection in carpal tunnel syndrome.
Fig. 5 ChatGPT-4o-generated image of the steps of ultrasound-guided injection of corticosteroids for adhesive capsulitis.

Two fellowship-trained MSK radiologists, each with over 10 years of experience, assessed the information generated by ChatGPT-4o. The assessment was documented on a 5-point Likert scale, coded as follows: 1—poor quality, 2—fair quality, 3—good quality, 4—very good quality, and 5—excellent quality.

The criteria for evaluation included the preprocedure details, the steps of the procedure, post-procedure particulars, the overall accuracy of the steps provided, and the quality of illustrations generated by ChatGPT-4o. Emphasis was placed on the completeness and accuracy of the procedural steps, the comprehensiveness of the information, the detail on patient safety monitoring, and the model's effectiveness and utility as a learning tool.


Results

The statistical analysis revealed moderate inter-rater agreement, with a Cohen's kappa of 0.54. One evaluator provided a more favorable assessment, awarding higher scores for preprocedure details and overall accuracy, whereas the other consistently assigned lower scores ([Table 1]).
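For readers unfamiliar with the agreement statistic used here, the computation is straightforward; the following is a minimal, standard-library-only sketch of unweighted Cohen's kappa (the function name `cohens_kappa` is ours, not from the study):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: proportion of items on which the raters match.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal distribution.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

Applied to the 25 paired ratings in Table 1, this unweighted form lands in the moderate-agreement band; the reported 0.54 may reflect a weighted variant (linear or quadratic weights are common for ordinal Likert data), so exact values can differ slightly.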

Table 1  Likert scale ratings by the two readers

| Procedure | Reader | Pre-procedure details | Procedure steps | Post-procedure details | Overall accuracy | Quality of images generated |
|---|---|---|---|---|---|---|
| Hip arthrogram under fluoroscopy | A | 2 | 3 | 3 | 3 | 1 |
| | B | 3 | 3 | 4 | 4 | 1 |
| Lumbar nerve root injection under CT guidance | A | 2 | 3 | 3 | 3 | 1 |
| | B | 3 | 3 | 3 | 3 | 1 |
| CT-guided bone biopsy | A | 2 | 3 | 3 | 3 | 1 |
| | B | 2 | 3 | 3 | 3 | 1 |
| USG-guided hydrodissection in carpal tunnel syndrome | A | 2 | 3 | 3 | 3 | 1 |
| | B | 3 | 4 | 4 | 4 | 1 |
| USG-guided corticosteroid injection in adhesive capsulitis | A | 2 | 2 | 3 | 3 | 1 |
| | B | 2 | 2 | 3 | 2 | 1 |

Abbreviations: CT, computed tomography; USG, ultrasonography.


The modal Likert score from both readers for preprocedure details was 2, highlighting below-average performance in this category. The modal score for procedural steps and their overall accuracy, as well as for post-procedural details, was 3, reflecting average performance in these areas. Both raters, however, were unanimous in their assessment of the quality of illustrations, universally scoring them at 1, reflecting severe shortcomings in graphical representation.
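The modal values cited above can be checked directly against the ratings in Table 1; a short sketch (dictionary keys and layout are ours, with paired A/B scores listed per procedure in table order):

```python
from statistics import mode

# Paired ratings (reader A, reader B) from Table 1, ordered by procedure:
# hip arthrogram, lumbar nerve root injection, CT-guided bone biopsy,
# USG-guided hydrodissection, USG-guided injection for adhesive capsulitis.
ratings = {
    "pre_procedure":    [2, 3, 2, 3, 2, 2, 2, 3, 2, 2],
    "procedure_steps":  [3, 3, 3, 3, 3, 3, 3, 4, 2, 2],
    "post_procedure":   [3, 4, 3, 3, 3, 3, 3, 4, 3, 3],
    "overall_accuracy": [3, 4, 3, 3, 3, 3, 3, 4, 3, 2],
    "image_quality":    [1] * 10,
}

# Most frequent score per category across both readers and all procedures.
modes = {category: mode(scores) for category, scores in ratings.items()}
# → pre-procedure 2; steps, post-procedure, and accuracy 3; image quality 1
```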

Both radiologists gave lower ratings for ultrasound (US)-guided corticosteroid injection in adhesive capsulitis, particularly for preprocedure details and overall accuracy. ChatGPT-4o performed better in guiding US-guided hydrodissection for carpal tunnel syndrome, with one reader rating the procedure steps, post-procedure details, and overall accuracy at 4 on the Likert scale. For computed tomography (CT)-guided bone biopsy, CT-guided lumbar nerve root injection, and hip arthrogram under fluoroscopy, both radiologists consistently rated ChatGPT's performance as average or poor across most categories.


Discussion

This result highlights AI's current capabilities and limitations within specialized niches. Related studies provide perspective, allowing these results to be contextualized within the broader landscape of AI applications in radiology.

The ability of ChatGPT-4o to provide a structured framework for procedural steps aligns with prior studies suggesting that AI-based natural language models can assist in clinical decision support.[3] A 2023 study by Jeblick et al explored ChatGPT's utility in radiological reporting and found that the AI performed well in generating standardized text outputs but lacked depth in technical details.[4] Similarly, in our study, ChatGPT's scores for procedural steps and post-procedure particulars were neutral (3 on the Likert scale), reflecting adequate but nonspecific guidance. The varied performance of ChatGPT-4o across different MSK procedures reflects the challenges identified in previous studies.[4] For example, in hip arthrograms under fluoroscopy and lumbar nerve root injections under CT guidance, ChatGPT-4o provided generic step-by-step instructions but failed to account for nuanced variations in technique, such as fluoroscopic angulation or needle trajectory adjustments.[5] This limitation aligns with observations by Pesapane et al, who underlined that AI solutions perform well in broad contexts but are weak in context-specific situations, especially in interventional radiology procedures.[6] Performance also depends heavily on the quality and quantity of data used to train AI models in radiology.[7] The inconsistency in ratings across procedures likely reflects the variable availability of high-quality, procedure-specific training data.[8]

The relatively higher scores for post-procedure particulars in our study for a few procedures suggest that ChatGPT-4o is better equipped to handle standardized tasks.[9] This is consistent with previous findings that AI systems can provide general recommendations or post-procedural advice when the guidance is rooted in established protocols.[10] For instance, ChatGPT's performance in ultrasound (US)-guided hydrodissection for carpal tunnel syndrome was relatively stronger, likely due to the standardization of the procedure and reduced reliance on advanced imaging.[11] However, its utility was less pronounced in procedures requiring nuanced post-procedure instructions, such as CT-guided bone biopsies, with scores remaining neutral or below. This further underscores the need to embed procedural specificity into AI training datasets, as domain-specific data are key to enhancing AI performance in radiology.[12] [13] [14]

The inability of ChatGPT-4o to generate or provide high-quality procedural images was a significant limitation, with both radiologists consistently assigning the lowest possible score. This finding is consistent with existing literature, highlighting that text-based AI models lack the visual diagnostic capabilities required for radiology.[15] [16] For MSK procedures, particularly those guided by fluoroscopy, CT, or US, high-quality images are indispensable for accurate execution and interpretation.[17] [18] [19] For example, ChatGPT's inability to provide real-time imaging guidance or visually demonstrate needle positioning limits its clinical applicability in hip arthrograms and lumbar nerve root injections. This reinforces the conclusion that for AI tools to be practical in radiology, they must integrate imaging data into their outputs through partnerships with imaging platforms or advancements in multimodal AI models.[20] The inability to generate or manipulate images effectively is a critical shortcoming that limits ChatGPT's utility as a comprehensive guidance tool in radiology.[15] [16] [17]

ChatGPT's variable performance across procedures also warrants discussion. Procedures with higher standardization, such as US-guided hydrodissection, were rated slightly higher. In contrast, procedures requiring intricate imaging guidance, such as CT-guided bone biopsies, received lower ratings (scores of 2 from both readers for preprocedure details). This variability aligns with findings by Pesapane et al, who noted that AI systems are more effective in well-defined, repetitive tasks but face challenges in complex, dynamic environments like interventional radiology.[6] Additionally, our results suggest that US-guided procedures may be better suited for AI-based guidance, as they are less dependent on advanced imaging modalities, which the current version of ChatGPT cannot support. Thus, ChatGPT's performance varies with the complexity and specificity of the radiology procedure, suggesting that its utility may be more pronounced in less complex interventions.[6] [21] [22]

ChatGPT's evaluation across medical procedures showed mixed results. For hip arthrograms under fluoroscopy, its performance was moderate but inconclusive, with average or below-average ratings for preprocedure details and steps, aligning with Currie et al on AI's struggles with complex preprocedural planning.[23] In lumbar nerve root injections under CT guidance, ChatGPT offered adequate but nonspecific guidance, reflecting Obermeyer and Emanuel's finding that AI struggles with tasks needing advanced adaptability.[24] Performance on CT-guided bone biopsies was satisfactory but lacked depth, supporting Shen et al's observation that AI struggles with procedural guidance.[25] In US-guided hydrodissection for carpal tunnel syndrome, ChatGPT-4o performed better, indicating its suitability for less invasive procedures.[26] Finally, US-guided corticosteroid injection for adhesive capsulitis showed variability, highlighting the need for domain-specific training, as emphasized by Topol.[27]

The differences in scoring between the two radiologists highlight subjective variability in the perceived utility of AI tools. Previous studies have similarly reported variability in user satisfaction with AI in radiology based on experience, specialty, and familiarity with AI systems.[26] This suggests that ChatGPT-4o may not provide comprehensive, detailed, high-quality guidance, and it underscores the subjective nature of clinicians' approaches to AI tools.

Previous studies have also established that radiological diagnosis and procedure guidance require substantial contextual information.[28] In our study, the lower scores for preprocedure details underlined the limitation of ChatGPT-4o when the procedures were more complex. Its performance varied with the complexity of the intervention, suggesting it is better suited to less complex tasks. Its integration into radiological practice could improve diagnostic processes, saving time and enhancing workflow efficiency. However, ethical considerations and human oversight remain crucial.[24]


Conclusion

This study provides a critical evaluation of ChatGPT-4o's performance in guiding MSK interventional procedures, highlighting both its emerging utility and substantial limitations. While the model provided a structured framework for procedural steps, it consistently underperformed in complex procedures. Pre- and post-procedural steps scored neutral or below average, reflecting current AI models' dependence on generalized training data and highlighting the need for domain-specific fine-tuning. Notably, both radiologists' ratings emphasized the critical limitation of a text-only model in a field where visual interpretation is essential. The most common errors included lack of anatomical specificity, omission of safety precautions, and absence of imaging parameters—all of which are clinically significant and could lead to suboptimal or unsafe outcomes if used without expert oversight.

This variability and lack of context-specific depth support the view that ChatGPT-4o, in its current form, should not be relied upon as a standalone tool for clinical guidance in interventional radiology. Rather, it may serve as a supplementary aid for educational purposes or procedural overviews, provided domain experts critically review its outputs.

The implementation of robust safeguards, including model transparency, continuous human oversight, and strong validation protocols, is important to mitigate potential risks and uphold patient safety. Close collaboration among radiologists, AI developers, and health care policymakers is now required to ensure that such technologies are deployed responsibly, ethically, and in a clinically effective manner. While ChatGPT-4o represents a substantial advance in AI-based procedural instruction, its current limitations underscore the need for continued technical refinement, critical evaluation, and a strong commitment to context-sensitive accuracy and patient-centered care in clinical application.



Conflict of Interest

None declared.

Ethical Approval

Local ethical committee approval was not required.


Supplementary Material


Address for correspondence

Rajesh Botchu, MBBS, MS(orth), MRCSEd, MRCSI, FRCR
Department of Musculoskeletal Radiology, The Royal Orthopedic Hospital
Bristol Road South, Northfield, Birmingham B31 2AP
United Kingdom   

Publication History

Article published online:
July 3, 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)

Thieme Medical and Scientific Publishers Pvt. Ltd.
A-12, 2nd Floor, Sector 2, Noida-201301 UP, India

