DOI: 10.1055/s-0045-1809953
Can ChatGPT Aid in Musculoskeletal Intervention?
Funding: None.
Abstract
Objective
Radiology has continuously evolved by adopting cutting-edge technologies to improve patient care and is a prime example of how technological innovation propels medical science forward.
In recent years, artificial intelligence (AI) has played a crucial role in many of these advances. Chat Generative Pre-trained Transformer (ChatGPT)-4, an AI language model focused primarily on natural language understanding and generation, is increasingly used to retrieve medical information. This study explores the utility of ChatGPT-4o in aiding imaging-guided musculoskeletal interventions, detailing its advantages and limitations.
Methods
Two musculoskeletal radiologists assessed the information generated by ChatGPT-4o on common musculoskeletal interventions. They analyzed its overall utility in guiding these interventions by examining the procedure steps and the pre- and post-procedure details provided. The assessments were documented on a 5-point Likert scale and subjected to statistical analysis.
Results
Statistical analysis of the Likert scores from both readers revealed a moderate level of inter-rater agreement, with a Cohen's kappa of 0.54. Across categories, the mode of the Likert scores ranged from 1 to 3 for both readers, indicating suboptimal performance. The lowest scores were for the quality of the generated images, whereas the highest were for post-procedure details.
Conclusion
ChatGPT-4o offers structured procedural guidance but falls short in complex, image-dependent tasks owing to limited anatomical detail and contextual accuracy. It may aid education but is not suitable for clinical use without expert oversight. Domain-specific training, validation, and multidisciplinary collaboration are essential for its safe and effective integration into practice.
Keywords
musculoskeletal interventions - artificial intelligence - radiology - radiologists - language model
Introduction
The emergence of artificial intelligence (AI) in health care has ushered in transformative changes across medical disciplines.[1] Among these advancements, Chat Generative Pre-trained Transformer (ChatGPT), an AI language model developed by OpenAI, is a versatile tool with applications ranging from education to clinical decision support.[2] Its ability to generate human-like text responses has sparked interest in its potential role in guiding medical procedures, particularly in musculoskeletal (MSK) medicine. Despite this potential, the efficacy and utility of ChatGPT in MSK procedures remain largely unestablished.
This study assesses the effectiveness of ChatGPT-4o in supporting MSK procedures by critically evaluating the information it provides on common imaging-guided MSK interventions, and by examining its potential to enhance clinician performance and streamline workflow.
Methods
No ethical committee approval was required as no patient data were used. ChatGPT-4o was asked to provide information, as well as generate illustrations, explaining five MSK procedures ([Figs. 1] [2] [3] [4] [5]; [Supplementary Table S1], available in the online version).
Two fellowship-trained MSK radiologists, each with over 10 years of experience, assessed the information generated by ChatGPT-4o. The assessment was documented on a 5-point Likert scale coded as follows: 1 = poor quality, 2 = fair quality, 3 = good quality, 4 = very good quality, and 5 = excellent quality.
The evaluation criteria included preprocedure details, procedural steps, post-procedure particulars, the overall accuracy of the steps provided, and the quality of the illustrations generated by ChatGPT-4o. Emphasis was placed on the completeness and accuracy of procedural steps, the comprehensiveness of information, the detail on patient safety monitoring, and the effectiveness and utility of the output as a learning tool.
Results
The statistical analysis revealed a moderate level of inter-rater agreement, with a Cohen's kappa of 0.54. One evaluator provided more favorable ratings, awarding higher scores for preprocedure details and overall accuracy, whereas the other consistently assigned lower scores ([Table 1]).
Abbreviations: CT, computed tomography; USG, ultrasonography.
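For context, Cohen's kappa compares the observed agreement between the two raters (p_o) with the agreement expected by chance from their marginal rating frequencies (p_e): κ = (p_o − p_e) / (1 − p_e). On the widely used Landis and Koch scale, values of 0.41 to 0.60 indicate moderate agreement, consistent with the 0.54 observed here.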
The mode of the Likert scores from both readers for preprocedure details was 2, indicating below-average performance in this category. The mode for procedural steps, their accuracy, and post-procedural details was 3, reflecting an overall average performance in these areas. Both raters, however, were unanimous on the quality of the illustrations, uniformly scoring them 1, reflecting severe shortcomings in graphical representation.
Both radiologists provided lower ratings for ultrasound (US)-guided corticosteroid injection in adhesive capsulitis, particularly for preprocedure details and overall accuracy. ChatGPT-4o performed better in guiding US-guided hydrodissection for carpal tunnel syndrome, with one reader rating the procedure steps, post-procedure details, and overall accuracy at 4 on the Likert scale. For computed tomography (CT)-guided bone biopsy, CT-guided lumbar nerve root injection, and hip arthrogram under fluoroscopy, both radiologists consistently rated ChatGPT's performance as average or poor across most categories.
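To make the analysis concrete, the short Python sketch below computes an unweighted Cohen's kappa and each reader's modal score from paired Likert ratings; the ratings shown are hypothetical placeholders, not the study data reported in [Supplementary Table S1].

from collections import Counter
from statistics import mode

# Hypothetical paired 5-point Likert ratings, one entry per
# procedure/category combination; placeholders, not study data.
reader1 = [2, 3, 3, 1, 4, 2, 3, 1, 2, 3]
reader2 = [2, 2, 3, 1, 3, 2, 3, 1, 3, 3]

def cohens_kappa(r1, r2):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(r1)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from the two raters' marginal frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(f"Cohen's kappa: {cohens_kappa(reader1, reader2):.2f}")  # ~0.55 for these placeholders
print(f"Modes: reader 1 = {mode(reader1)}, reader 2 = {mode(reader2)}")

A real analysis would typically use a statistics package, but the calculation itself reduces to these two agreement terms.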
Discussion
These results highlight AI's current capabilities and limitations within specialized niches. Related studies provide perspective, allowing our findings to be contextualized within the broader landscape of AI applications in radiology.
The ability of ChatGPT-4o to provide a structured framework for procedural steps aligns with prior studies suggesting that AI-based natural language models can assist in clinical decision support.[3] A study by Jeblick et al explored ChatGPT's utility in radiological reporting and found that the AI performed well in generating standardized text outputs but lacked depth in technical details.[4] Similarly, in our study, ChatGPT's scores for procedural steps and post-procedure particulars were neutral (3 on the Likert scale), reflecting adequate but nonspecific guidance. The varied performance of ChatGPT-4o across different MSK procedures reflects the challenges identified in previous studies.[4] For example, in hip arthrograms under fluoroscopy and lumbar nerve root injections under CT guidance, ChatGPT-4o provided generic step-by-step instructions but failed to account for nuanced variations in technique, such as fluoroscopic angulation or needle trajectory adjustments.[5] This limitation aligns with observations by Pesapane et al, who underlined that AI solutions perform well in broad contexts but are weak in context-specific situations, especially in interventional radiology procedures.[6] Performance likely depends heavily on the quality and quantity of the data used to train AI models for radiology.[7] The inconsistent ratings across procedures in our study plausibly reflect the variable availability of high-quality, procedure-specific training data.[8]
The relatively higher scores for post-procedure particulars in a few procedures suggest that ChatGPT-4o is better equipped to handle standardized tasks.[9] This is consistent with previous findings that AI systems can provide general recommendations or post-procedural advice when the guidance is rooted in established protocols.[10] For instance, ChatGPT's performance in US-guided hydrodissection for carpal tunnel syndrome was relatively stronger, likely because the procedure is standardized and relies less on advanced imaging.[11] However, its utility was less pronounced in procedures requiring nuanced post-procedure instructions, such as CT-guided bone biopsies, where scores remained neutral or below. This further justifies embedding procedural personalization into AI training datasets; domain-specific data are key to enhancing the performance of AI in radiology.[12] [13] [14]
The inability of ChatGPT-4o to generate high-quality procedural images was a significant limitation, with both radiologists consistently assigning the lowest possible score. This finding is consistent with the existing literature, which highlights that text-based AI models lack the visual capabilities required in radiology.[15] [16] For MSK procedures, particularly those guided by fluoroscopy, CT, or US, high-quality images are indispensable for accurate execution and interpretation.[17] [18] [19] For example, ChatGPT's inability to provide real-time imaging guidance or visually demonstrate needle positioning limits its clinical applicability in hip arthrograms and lumbar nerve root injections. This reinforces the conclusion that, for AI tools to be practical in radiology, they must integrate imaging data into their outputs, whether through partnerships with imaging platforms or advances in multimodal AI models.[20] The inability to generate or manipulate images effectively remains a critical shortcoming that limits ChatGPT's utility as a comprehensive guidance tool in radiology.[15] [16] [17]
ChatGPT's variable performance across procedures also warrants discussion. Procedures with higher standardization, such as US-guided hydrodissection, were rated slightly higher. In contrast, procedures requiring intricate imaging guidance, such as CT-guided bone biopsies, received lower ratings (scores of 2 for preprocedure details and overall accuracy). This variability aligns with the findings of Pesapane et al, who noted that AI systems are more effective in well-defined, repetitive tasks but face challenges in complex, dynamic environments such as interventional radiology.[6] Additionally, our results suggest that US-guided procedures may be better suited to AI-based guidance, as they are less dependent on advanced imaging modalities, which the current version of ChatGPT cannot support. Thus, ChatGPT's performance varies with the complexity and specificity of the procedure, and its utility may be more pronounced in less complex interventions.[6] [21] [22]
ChatGPT's evaluation across procedures showed mixed results. For hip arthrograms under fluoroscopy, its performance was moderate but inconclusive, with average or below-average ratings for preprocedure details and steps, consistent with Currie et al's observation that AI struggles with complex preprocedural planning.[23] For lumbar nerve root injections under CT guidance, ChatGPT offered adequate but nonspecific guidance, reflecting Obermeyer and Emanuel's finding that AI falters in tasks requiring advanced adaptability.[24] For CT-guided bone biopsies, its performance was satisfactory but lacked depth, supporting Shen et al's observation that AI has difficulty with detailed procedural guidance.[25] In US-guided hydrodissection for carpal tunnel syndrome, ChatGPT-4o performed better, indicating its suitability for less invasive procedures.[26] Finally, US-guided corticosteroid injection for adhesive capsulitis showed variable ratings, highlighting the need for domain-specific training, as emphasized by Topol.[27]
The differences in scoring between the two radiologists highlight the subjective variability in the perceived utility of AI tools. Previous studies have similarly reported variability in user satisfaction with AI in radiology depending on experience, specialty, and familiarity with AI systems.[26] This suggests that ChatGPT-4o cannot yet provide comprehensive, detailed, high-quality guidance and underscores the subjective nature of how clinicians approach AI tools.
Previous studies have also established that radiological diagnosis and procedure guidance require substantial contextual information.[28] In our study, the lower scores for preprocedure details underline the limitations of ChatGPT-4o in more complex procedures. Its performance varied with the complexity of the intervention, suggesting that it is better suited to less complex tasks. Its integration into radiological practice could nevertheless improve diagnostic processes, saving time and enhancing workflow efficiency, although ethical considerations and human oversight remain crucial.[24]
Conclusion
This study provides a critical evaluation of ChatGPT-4o's performance in guiding MSK interventional procedures, highlighting both its emerging utility and its substantial limitations. While the model provided a structured framework for procedural steps, it consistently underperformed in complex procedures. Pre- and post-procedural details scored neutral or below average, reflecting current AI models' dependence on generalized training data and highlighting the need for domain-specific fine-tuning. Notably, both radiologists' ratings emphasized the critical limitation of a text-only model in a field where visual interpretation is essential. The most common errors included a lack of anatomical specificity, omission of safety precautions, and absence of imaging parameters, all of which are clinically significant and could lead to suboptimal or unsafe outcomes without expert oversight.
This variability and lack of context-specific depth support the view that ChatGPT-4o, in its current form, should not be relied upon as a standalone tool for clinical guidance in interventional radiology. Rather, it may serve as a supplementary aid for educational purposes or procedural overviews, provided domain experts critically review its outputs.
The implementation of robust safeguards, including model transparency, continuous human oversight, and strong validation protocols, is essential to mitigate potential risks and uphold patient safety. Close collaboration among radiologists, AI developers, and health care policymakers is required to ensure that such technologies are deployed responsibly, ethically, and effectively in clinical practice. While ChatGPT-4o represents a substantial advance in AI-based procedural instruction, its current limitations underscore the need for continued technical refinement, critical evaluation, and a strong commitment to context-sensitive accuracy and patient-centered care.
Conflict of Interest
None declared.
Ethical Approval
Local ethical committee approval was not required.
References
- 1 Topol E. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. New York, NY: Basic Books; 2019
- 2 Horiuchi D, Tatekawa H, Oura T. et al. ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology. Eur Radiol 2025; 35 (01) 506-516
- 3 Bizzo BC, Almeida RR, Michalski MH, Alkasab TK. Artificial intelligence and clinical decision support for radiologists and referring providers. J Am Coll Radiol 2019; 16 (9, Pt B): 1351-1356
- 4 Jeblick K, Schachtner B, Dexl J. et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol 2024; 34 (05) 2817-2825
- 5 Polisetty TS, Jain S, Pang M. et al. Concerns surrounding application of artificial intelligence in hip and knee arthroplasty: a review of literature and recommendations for meaningful adoption. Bone Joint J 2022; 104-B (12) 1292-1303
- 6 Pesapane F, Codari M, Sardanelli F. Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp 2018; 2 (01) 35
- 7 Rajpurkar P, Irvin J, Ball RL. et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med 2018; 15 (11) e1002686
- 8 Keshavarz P, Bagherieh S, Nabipoorashrafi SA. et al. ChatGPT in radiology: a systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024; 105 (7-8): 251-265
- 9 Grewal H, Dhillon G, Monga V. et al. Radiology gets chatty: the ChatGPT saga unfolds. Cureus 2023; 15 (06) e40135
- 10 Bi WL, Hosny A, Schabath MB. et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA Cancer J Clin 2019; 69 (02) 127-157
- 11 Sarangi PK, Lumbani A, Swarup MS. et al. Assessing ChatGPT's proficiency in simplifying radiological reports for healthcare professionals and patients. Cureus 2023; 15 (12) e50881
- 12 Srivastava U, Pezeshk P, Chhabra A. Patient satisfaction experience and outcomes after CT-guided bone marrow biopsy versus in-office bone marrow biopsy. Radiation 2024; 4 (03) 224-231
- 13 Droste MF, van Velden FHP, van Oosterom MN. et al. Augmenting CT-guided bone biopsies using 18F-FDG PET/CT guidance. Cancers (Basel) 2024; 16 (15) 2693
- 14 Wang S, Cao G, Wang Y. et al. Review and prospect: artificial intelligence in advanced medical imaging. Front Radiol 2021; 1: 781868
- 15 Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology 2024; 310 (01) e232756
- 16 Hayden N, Gilbert S, Poisson LM, Griffith B, Klochko C. Performance of GPT-4 with Vision on text- and image-based ACR diagnostic radiology in-training examination questions. Radiology 2024; 312 (03) e240153
- 17 Srivastav S, Chandrakar R, Gupta S. et al. ChatGPT in radiology: the advantages and limitations of artificial intelligence for medical imaging diagnosis. Cureus 2023; 15 (07) e41435
- 18 Parillo M, Vaccarino F, Beomonte Zobel B, Mallio CA. ChatGPT and radiology report: potential applications and limitations. Radiol Med 2024; 129 (12) 1849-1863
- 19 Najjar R. Redefining radiology: a review of artificial intelligence integration in medical imaging. Diagnostics (Basel) 2023; 13 (17) 2760
- 20 Simon BD, Ozyoruk KB, Gelikman DG, Harmon SA, Türkbey B. The future of multimodal artificial intelligence models for integrating imaging and clinical metadata: a narrative review. Diagn Interv Radiol 2024; (e-pub ahead of print)
- 21 Arruzza ES, Evangelista CM, Chau M. The performance of ChatGPT-4.0o in medical imaging evaluation: a cross-sectional study. J Educ Eval Health Prof 2024; 21: 29
- 22 Viderman D, Dossov M, Seitenov S, Lee MH. Artificial intelligence in ultrasound-guided regional anesthesia: a scoping review. Front Med (Lausanne) 2022; 9: 994805
- 23 Currie G, Hawk KE, Rohren E, Vial A, Klein R. Machine learning and deep learning in medical imaging: intelligent imaging. J Med Imaging Radiat Sci 2019; 50 (04) 477-487
- 24 Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 2016; 375 (13) 1216-1219
- 25 Shen D, Wu G, Suk HI. Deep learning in medical image analysis. Annu Rev Biomed Eng 2017; 19 (01) 221-248
- 26 Millischer AE, Grevent D, Sonigo P. et al. Feasibility and added value of fetal DTI tractography in the evaluation of an isolated short corpus callosum: preliminary results. AJNR Am J Neuroradiol 2022; 43 (01) 132-138
- 27 Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019; 25 (01) 44-56
- 28 Choy G, Khalilzadeh O, Michalski M. et al. Current applications and future impact of machine learning in radiology. Radiology 2018; 288 (02) 318-328
Publication History
Article published online:
03 July 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Thieme Medical and Scientific Publishers Pvt. Ltd.
A-12, 2nd Floor, Sector 2, Noida-201301 UP, India