BMC Medical Informatics and Decision Making volume 25, Article number: 163 (2025)
The integration of artificial intelligence (AI) in healthcare has rapidly expanded, particularly in clinical decision-making. Large language models (LLMs) such as GPT-4 and GPT-3.5 have shown potential in various medical applications, including diagnostics and treatment planning. However, their efficacy in specialized fields like sports surgery and physiotherapy remains underexplored. This study aims to compare the performance of GPT-4 and GPT-3.5 in clinical decision-making within these domains using a structured assessment approach.
This cross-sectional study included 56 professionals specializing in sports surgery and physiotherapy. Participants used a 5-point Likert scale to evaluate the responses generated by GPT-4 and GPT-3.5 for 10 standardized clinical scenarios. The scenarios encompassed common musculoskeletal conditions, and assessments focused on diagnostic accuracy, treatment appropriateness, surgical technique detailing, and rehabilitation plan suitability. Data were collected anonymously via Google Forms. Statistical analysis included paired t-tests for direct model comparisons, one-way ANOVA to assess performance across multiple criteria, and Cronbach’s alpha to evaluate internal consistency.
GPT-4 significantly outperformed GPT-3.5 across all evaluated criteria. Paired t-test results (t(55) = 10.45, p < 0.001) demonstrated that GPT-4 provided more accurate diagnoses, superior treatment plans, and more detailed surgical recommendations. ANOVA results confirmed the higher suitability of GPT-4 in treatment planning (F(1, 55) = 35.22, p < 0.001) and rehabilitation protocols (F(1, 55) = 32.10, p < 0.001). Cronbach’s alpha values indicated higher internal consistency in the ratings of GPT-4 (α = 0.478) than in the ratings of GPT-3.5 (α = 0.234), suggesting that participants rated GPT-4 more consistently across criteria.
GPT-4 demonstrates superior performance compared to GPT-3.5 in clinical decision-making for sports surgery and physiotherapy. These findings suggest that advanced AI models can aid in diagnostic accuracy, treatment planning, and rehabilitation strategies. However, AI should function as a decision-support tool rather than a substitute for expert clinical judgment. Future studies should explore the integration of AI into real-world clinical workflows, validate findings using larger datasets, and compare additional AI models beyond the GPT series.
The utilization of artificial intelligence (AI) technologies in healthcare has gained significant momentum in recent years, particularly in clinical decision-making processes [1,2,3,4]. Advances in natural language processing (NLP) technologies have highlighted the potential applicability of large language models (LLMs) like the GPT series in addressing complex tasks such as medical decision-making and patient management. The existing literature demonstrates the promise of AI in improving diagnostic accuracy and optimizing treatment processes across various medical fields. For instance, Lopez et al. (2020) [5] explored AI’s impact in cardiology, while Clark et al. (2024) [6] examined its accuracy in classifying dermatological conditions. However, there is a notable gap in the literature regarding the performance of AI-driven technologies in specialized fields such as sports surgery and physiotherapy, which demand highly specialized expertise.
Sports surgery and physiotherapy are domains characterized by intricate clinical decision-making processes that require expertise in both diagnosis and treatment planning. Accurate diagnosis and effective treatment strategies in these areas not only improve patient outcomes but are also critical for athletes to sustain their professional careers. Despite the growing body of research on AI applications in general medical practice, there remains a lack of comprehensive studies assessing its impact in these niche areas [7,8,9,10,11]. Although some recent studies have investigated AI’s role in musculoskeletal disorders, the extent to which AI can contribute to specific clinical decision-making processes in sports surgery and physiotherapy has not been fully explored [12,13,14,15].
In recent years, AI and large language models (LLMs) have been increasingly integrated into decision support systems in orthopedics and physiotherapy. Several studies have examined the potential role of AI in improving diagnostic and treatment decision-making. Kunze et al. (2024) evaluated the ability of GPT-4 to diagnose and triage patients with knee pain, demonstrating that AI can assist in clinical decision-making by improving diagnostic consistency and reducing variability in physician assessments [12]. Similarly, Lintz et al. (2024) investigated the capacity of AI models to accurately classify patients requiring surgical intervention for foot and ankle surgery, finding that AI-supported triage could improve patient management efficiency [13]. In another study, Nwachukwu et al. (2025) analyzed the extent to which current LLMs align with evidence-based clinical guidelines in the management of musculoskeletal diseases and highlighted discrepancies between AI-generated treatment recommendations and expert guidelines [14]. Furthermore, Truhn et al. (2023) assessed GPT-4’s capability to generate orthopedic treatment recommendations based on MRI reports, demonstrating its potential role in radiology-based decision support systems [15]. Beyond orthopedics, AI applications in physiotherapy have also been examined. Villagrán et al. (2024) explored how LLMs can be used in physiotherapy education to provide automated feedback to students, indicating that AI has the potential to support both clinical decision-making and educational frameworks [16].
Despite the emerging evidence supporting AI’s integration into musculoskeletal healthcare, there is a lack of comprehensive research evaluating its direct impact on clinical decision-making in sports surgery and physiotherapy. Current studies primarily focus on AI’s diagnostic accuracy and triage capabilities, but its role in treatment planning, surgical technique recommendations, and rehabilitation program design remains largely unexamined. Given the complexity of sports-related injuries and the necessity for individualized rehabilitation approaches, understanding AI’s ability to provide clinically relevant and evidence-based recommendations is essential.
This study represents the first attempt to systematically evaluate and compare the performance of GPT-4 and GPT-3.5 AI models in clinical decision-making within the fields of sports surgery and physiotherapy. By assessing these models in key domains—diagnostic accuracy, treatment suitability, surgical technique detailing, and rehabilitation plan validity—this research aims to address the existing knowledge gap. The findings of this study are expected to inform healthcare professionals about the strengths and limitations of AI-driven decision support systems, contributing to the broader integration of AI in specialized medical practice.
This study was designed as a cross-sectional, observational, and comparative research project to evaluate the performance of GPT-4 and GPT-3.5 AI models in clinical decision-making processes in sports surgery and physiotherapy. The study adhered to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines to ensure methodological rigor [17]. The performance of both models was analyzed based on four key clinical criteria: diagnostic accuracy, treatment suitability, surgical technique detailing, and the validity of rehabilitation plans. The study protocol was approved by the Duzce University Non-Interventional Clinical Research Ethics Committee (No: 2024/215, Date: 21/10/2024) and conducted in accordance with the principles of the Declaration of Helsinki. Informed consent was obtained from all participants before data collection.
A total of 56 professionals specializing in sports surgery and physiotherapy participated in the study. Participants were recruited through professional networks and hospital affiliations to ensure a diverse representation of expertise. Participants were required to have a minimum of 5 years of clinical experience, be actively working in their field, and have basic knowledge of AI technologies. Those from unrelated medical specialties, retirees, and individuals with less than 5 years of experience were excluded. All participants were blinded to the AI model that generated each scenario to minimize bias in their evaluations.
Data collection was conducted via Google Forms to ensure accessibility and anonymity. Participants were presented with 10 standardized clinical scenarios, which are detailed in Table 1, and asked to evaluate the AI-generated responses using a 5-point Likert scale. The clinical scenarios were developed from real-world patient data extracted from the hospital’s information system, drawing on the last 10 years of patient records from orthopedic clinics. The most frequently encountered musculoskeletal conditions in sports surgery and physiotherapy were identified based on case prevalence and clinical significance. Two independent orthopedic surgeons and two independent physiotherapists reviewed the selected cases and finalized the scenarios to ensure clinical validity and diversity. Each scenario was independently processed through GPT-4 and GPT-3.5, which generated corresponding diagnoses, treatment plans, surgical recommendations, and rehabilitation protocols. AI-generated outputs were reviewed and validated by the independent panel before being presented to study participants.
The study data were analyzed using various statistical methods to evaluate the performance of the GPT-4 and GPT-3.5 models in clinical decision-making within sports surgery and physiotherapy contexts. Descriptive statistics were used to summarize the demographic characteristics of the participants, including the distribution of clinicians and academics, as well as the average years of experience for orthopedists and physiotherapists.
A power analysis was conducted using G*Power 3.1 software to determine the minimum sample size required to detect a significant difference between the two AI models. Assuming an effect size (Cohen’s d) of 0.80, a significance level of α = 0.05, and a power of 0.80, the minimum required sample size was 32 participants. Since the study included 56 participants, the sample size was considered adequate for statistical comparisons [18]. A paired t-test was conducted to compare the overall performance scores of GPT-4 and GPT-3.5 and to assess whether they differed significantly. Additionally, an independent t-test was used to compare the evaluations of GPT-4 made by academics with those made by clinicians. A one-way analysis of variance (ANOVA) was used to compare model performance on each of the four criteria: diagnostic accuracy, treatment suitability, surgical technique detail, and rehabilitation plan appropriateness. In addition to p-values, eta squared (η²) effect sizes were calculated to assess the magnitude of the differences observed in the ANOVA.
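To illustrate the paired comparison described above, the following minimal sketch computes a paired t statistic from one mean score per rater for each model, using only the Python standard library. The function name `paired_t` and the score lists are hypothetical placeholders introduced for illustration; they are not the study's analysis code or data, and a p-value would additionally require a t-distribution routine from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Return (t, df) for a paired t-test on two equal-length score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample SD of the differences (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Placeholder ratings (NOT study data): one mean Likert score per rater.
model_a = [4.2, 4.5, 3.9, 4.8, 4.1, 4.4]
model_b = [3.1, 3.6, 3.4, 3.9, 3.0, 3.5]
t_stat, df = paired_t(model_a, model_b)
print(f"t({df}) = {t_stat:.2f}")
```

Because each rater scores both models, the test operates on within-rater differences, which is why the degrees of freedom equal the number of raters minus one (55 for the 56 participants in this study).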
To measure internal consistency, Cronbach’s alpha was calculated for both models. Furthermore, Cohen’s d was computed to quantify the effect sizes of the differences observed between GPT-4 and GPT-3.5 and between academics and clinicians’ evaluations of GPT-4. Statistical significance was set at p < 0.05 for all analyses.
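As a rough illustration of the internal-consistency measure mentioned above, the sketch below computes Cronbach's alpha from a rater-by-criterion score matrix using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores). It assumes that the four evaluation criteria act as the "items" and each participant as a row; the source does not state how the matrix was arranged, so this layout, the function name `cronbach_alpha`, and the ratings are all illustrative assumptions.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, one row of item scores per rater."""
    if len(rows) < 2:
        raise ValueError("need at least two raters")
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(r) != k for r in rows):
        raise ValueError("all rows must have the same number of items")
    # Sample variance of each item (column) across raters.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of each rater's total score.
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Placeholder Likert ratings (NOT study data): 5 raters x 4 criteria.
ratings = [
    [4, 5, 4, 4],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 3, 2],
    [4, 4, 5, 4],
]
print(f"alpha = {cronbach_alpha(ratings):.3f}")
```

Alpha approaches 1 when raters who score one criterion highly also score the others highly, and falls toward 0 when the criteria vary independently of each other.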
The performance of the GPT-4 and GPT-3.5 models was assessed on four key criteria: diagnostic accuracy, treatment suitability, surgical technique detail, and rehabilitation plan appropriateness. A total of 56 participants, consisting of 31 clinicians and 25 academics, took part in the evaluation. Among them, 28 were orthopedists (18 clinicians, 10 academics) and 28 were physiotherapists (13 clinicians, 15 academics). Mean clinical experience was 15.7 years (SD = 6.9) for orthopedists and 15.7 years (SD = 7.3) for physiotherapists. These demographic characteristics are summarized in Table 2.
A paired t-test revealed a statistically significant difference between the overall performance scores of GPT-4 and GPT-3.5, with GPT-4 outperforming GPT-3.5 across all scenarios (t(55) = 10.45, p < 0.001). Additionally, an independent t-test comparing the evaluations made by academics and clinicians for the GPT-4 model indicated that clinicians rated GPT-4 higher than academics did (t(54) = -2.12, p = 0.039). These results are summarized in Table 3.
Further analysis using one-way ANOVA confirmed that GPT-4 demonstrated significantly better performance across all specific criteria evaluated. The most pronounced differences were observed in treatment suitability (F(1, 55) = 35.22, p < 0.001) and rehabilitation plan appropriateness (F(1, 55) = 32.10, p < 0.001). Significant differences were also found in diagnostic accuracy (F(1, 55) = 28.45, p < 0.001) and surgical technique detail (F(1, 55) = 25.67, p < 0.001), reinforcing the superiority of GPT-4. These results are detailed in Table 4.
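For readers who want to relate the reported F statistics to an effect size, the sketch below applies the standard conversion η² = (F · df_effect) / (F · df_effect + df_error) to the F values quoted above. This is an illustrative calculation only: the helper name `eta_squared_from_f` is hypothetical, the conversion yields partial eta squared when the design is within-subject, and the results may differ from the η² values in Table 4 if the authors computed them from sums of squares.

```python
def eta_squared_from_f(f_value, df_effect, df_error):
    """Convert an F statistic and its degrees of freedom to (partial) eta squared."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and degrees of freedom positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics as reported in the text, all with df = (1, 55).
reported_f = {
    "treatment suitability": 35.22,
    "rehabilitation plan appropriateness": 32.10,
    "diagnostic accuracy": 28.45,
    "surgical technique detail": 25.67,
}

derived = {
    name: eta_squared_from_f(f, df_effect=1, df_error=55)
    for name, f in reported_f.items()
}
for name, eta2 in derived.items():
    print(f"{name}: eta squared ~ {eta2:.3f}")
```

With a single numerator degree of freedom, a larger F always maps to a larger η², so the ordering of the four criteria by effect size matches their ordering by F.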
Reliability analysis using Cronbach’s alpha revealed moderate internal consistency in the evaluations of GPT-4 (α = 0.478), whereas GPT-3.5 showed lower consistency (α = 0.234). This suggests that participants were more consistent in their ratings of GPT-4 across different criteria than in their ratings of GPT-3.5. To further evaluate the effect sizes of the observed differences, Cohen’s d was calculated. The comparison of GPT-4 vs. GPT-3.5 yielded a large effect size (d = 1.42), whereas the difference between academic and clinical evaluations of GPT-4 showed a medium effect size (d = 0.58). These results are presented in Table 5.
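The effect sizes above can be cross-checked against the reported t statistics with two common conversions: d = t / √n for a paired design and d = t · √(1/n₁ + 1/n₂) for two independent groups. The sketch below is a plausibility check under those assumed formulas, not a reconstruction of the authors' calculation; several variants of Cohen's d exist, and the source does not say which one was used. The helper names are hypothetical.

```python
from math import sqrt


def d_from_paired_t(t, n):
    """Cohen's d (d_z variant) from a paired t statistic and number of pairs."""
    if n < 2:
        raise ValueError("need at least two pairs")
    return abs(t) / sqrt(n)


def d_from_independent_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic and the two group sizes."""
    if n1 < 1 or n2 < 1:
        raise ValueError("group sizes must be positive")
    return abs(t) * sqrt(1 / n1 + 1 / n2)


# Statistics and group sizes as reported in the text.
d_models = d_from_paired_t(10.45, n=56)             # GPT-4 vs. GPT-3.5
d_raters = d_from_independent_t(-2.12, n1=31, n2=25)  # clinicians vs. academics

print(f"paired conversion:      d ~ {d_models:.2f} (reported: 1.42)")
print(f"independent conversion: d ~ {d_raters:.2f} (reported: 0.58)")
```

Both conversions land close to the reported values, in the large and medium ranges respectively; small discrepancies are expected if a different variant of d or unrounded statistics were used.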
The overall findings of this study demonstrate that GPT-4 significantly outperforms GPT-3.5 across all evaluated criteria, with the most notable differences observed in treatment suitability and rehabilitation plan appropriateness. These results indicate that GPT-4 may provide more clinically relevant and reliable decision-making support in the context of sports surgery and physiotherapy compared to GPT-3.5. However, further research is required to validate these findings in real-world clinical settings. The comparative performance of GPT-4 and GPT-3.5 across the four key criteria is illustrated in Fig. 1.
Figure 1 shows the mean performance scores of GPT-4 (blue bars) and GPT-3.5 (orange bars) across the four clinical decision-making criteria: diagnostic accuracy, treatment suitability, surgical technique detail, and rehabilitation plan appropriateness. GPT-4 scores higher than GPT-3.5 on every criterion, with the largest differences in treatment suitability and rehabilitation plan appropriateness. Mean scores are annotated above each bar.
This study represents an important step in evaluating and comparing the performance of GPT-4 and GPT-3.5 in clinical decision-making within sports surgery and physiotherapy. While AI technologies have been extensively studied in various medical domains, their application in specialized areas such as sports surgery and physiotherapy has been under-explored [9, 19,20,21]. By addressing this gap, the study provides valuable insights into the capabilities and limitations of AI in these high-expertise medical fields. The findings demonstrate that GPT-4 significantly outperformed GPT-3.5 across multiple clinical criteria, including diagnostic accuracy, treatment suitability, surgical technique detailing, and rehabilitation plan appropriateness. These results align with previous studies highlighting AI’s potential in enhancing clinical decision-making [22, 23], yet this is one of the first studies to directly assess AI performance in sports surgery and physiotherapy.
The results indicate that GPT-4 significantly outperformed GPT-3.5, as supported by statistical analyses. Paired t-test results (t(55) = 10.45, p < 0.001) confirmed that GPT-4 provided more accurate diagnoses, better treatment plans, and more detailed surgical techniques compared to GPT-3.5. These results are consistent with prior studies, which have found that more advanced AI models, such as GPT-4, tend to exhibit superior performance in clinical decision support systems [24, 25]. This performance improvement can likely be attributed to GPT-4’s more extensive training dataset and advanced natural language processing capabilities, which enable it to generate more accurate, reliable, and contextually appropriate medical recommendations. AI-driven decision support tools have been explored in various medical fields, with research indicating that LLMs can assist in triaging, diagnostic assessments, and treatment planning [12,13,14].
However, it is important to note that GPT-3.5 still demonstrated moderate performance, suggesting that less complex cases could still benefit from its clinical applications. Nonetheless, for more complex cases requiring precise decision-making, GPT-4 appears to be a more reliable and effective tool.
The findings of this study are in line with existing research, which has investigated the use of LLMs in medical decision-making. For instance, Kunze et al. (2024) demonstrated that GPT-4 performed well in knee pain triage, while Lintz et al. (2024) assessed its capability in surgical triage for foot and ankle conditions [12, 13]. Additionally, Nwachukwu et al. (2025) found that current LLMs do not fully align with evidence-based musculoskeletal treatment guidelines, raising concerns about the reliability of AI-generated recommendations [14]. The findings of this study further highlight that while GPT-4 can provide highly relevant clinical suggestions, it should still be used as a complementary tool rather than an independent decision-maker.
Similarly, Truhn et al. (2023) evaluated GPT-4’s ability to generate orthopedic treatment recommendations from MRI reports, suggesting that LLMs can assist in image-based diagnostics [15]. This aligns with our findings, which indicate that GPT-4 demonstrated superior accuracy in treatment suitability and rehabilitation planning. Additionally, Villagrán et al. (2024) explored the role of AI in physiotherapy education, indicating that LLMs may enhance clinical reasoning and learning in medical training [16]. The integration of AI into physiotherapy and sports surgery education could further optimize the use of AI in clinical environments.
A key finding of this study was the higher internal consistency of GPT-4 compared to GPT-3.5, as demonstrated by Cronbach’s alpha values (GPT-4: α = 0.478, GPT-3.5: α = 0.234). Although GPT-4 exhibited better consistency, the moderate reliability scores suggest some variability in AI-generated recommendations. This variation could be attributed to the nature of AI decision-making, where responses depend on training data and contextual interpretation. Additionally, participants may have had different subjective evaluations of AI-generated clinical recommendations, further contributing to inter-rater variability. The results emphasize the need for continuous validation of AI-generated recommendations in clinical settings [26].
The integration of LLMs such as GPT-4 into clinical practice offers several potential benefits. AI-driven decision support tools have been shown to reduce cognitive load on healthcare professionals, improve diagnostic accuracy, and enhance treatment planning [27]. In sports surgery and physiotherapy, where rapid and evidence-based decision-making is crucial, AI models may serve as valuable clinical assistants. These models could be integrated into electronic health record systems to provide real-time decision support and automated treatment recommendations, reducing the burden on clinicians and improving efficiency.
Recent studies have highlighted the increasing role of AI and large language models (LLMs) in clinical decision-making and healthcare applications. Naqvi et al. (2024) emphasized that while AI-driven models hold promise for physiotherapy applications, their real-world integration requires careful validation, particularly in decision-support scenarios where clinical expertise remains irreplaceable [28]. Furthermore, Rossettini et al. (2023) reviewed the implications of AI in musculoskeletal rehabilitation, stressing both its potential benefits and limitations. They suggested that AI could enhance clinical workflows by supporting diagnostic and rehabilitation strategies but should not replace human clinical reasoning [29]. These findings align with our study, which demonstrated that GPT-4 significantly outperformed GPT-3.5 in diagnostic accuracy, treatment planning, and rehabilitation protocol design. However, as both studies suggest, AI should be viewed as an assistive tool rather than a substitute for expert clinical judgment. Future research should explore how AI can be effectively integrated into sports surgery and physiotherapy practice while maintaining clinician oversight and patient safety.
However, AI-generated outputs must always be interpreted by a trained clinician, as AI models lack contextual understanding, patient-specific considerations, and the ability to adapt to unforeseen clinical complexities. Future research should focus on improving AI transparency, ensuring clinical validation, and integrating AI into interdisciplinary healthcare workflows.
While this study provides novel insights, several limitations should be acknowledged. First, the study was limited to GPT-4 and GPT-3.5, meaning that the findings may not be generalizable to other AI models such as DeepSeek, Llama, or Gemini. Future research should compare a broader range of AI models to determine their relative effectiveness in clinical decision-making.
Additionally, the study relied on standardized clinical scenarios, which, while designed to reflect real-world cases, may not fully capture the variability and complexity of actual patient cases. Future research should incorporate real patient data and assess AI recommendations in a real-time clinical setting to validate these findings.
Another limitation is the potential bias in scenario creation, as two orthopedic surgeons and two physiotherapists reviewed and finalized the cases. While efforts were made to ensure unbiased scenario development, real-world clinical cases may introduce additional complexity. Future studies should explore AI performance in a more dynamic and diverse patient population.
Finally, AI models, including GPT-4, are susceptible to “hallucination” (the generation of inaccurate or misleading information). This issue remains a critical limitation of AI in healthcare, and future research should develop more robust safeguards to prevent AI-generated misinformation from influencing clinical decision-making [30].
This study provides strong evidence that GPT-4 outperforms GPT-3.5 in clinical decision-making within sports surgery and physiotherapy. The findings demonstrate that AI models, particularly GPT-4, can enhance diagnostic accuracy, improve treatment planning, and support healthcare professionals in making more informed decisions. However, AI should be used as a complementary tool rather than a replacement for clinical expertise.
Future research should focus on expanding the variety of AI models tested, utilizing larger and more diverse datasets, and further exploring AI’s integration into real-world clinical environments. Additionally, the ethical, legal, and practical implications of AI in clinical decision-making should be further explored to ensure safe and effective AI adoption in healthcare.
All data and materials are available from Dr. Sönmez Saglam upon request ([email protected]).
The authors received no financial support for the research and/or authorship of this article.
The study protocol was approved by the Duzce University Non-Interventional Clinical Research Ethics Committee (No: 2024/215, Date: 21/10/2024). The study was conducted in accordance with the principles of the Declaration of Helsinki.
Informed consent was obtained from all individual participants included in the study.
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Saglam, S., Uludag, V., Karaduman, Z.O. et al. Comparative evaluation of artificial intelligence models GPT-4 and GPT-3.5 in clinical decision-making in sports surgery and physiotherapy: a cross-sectional study. BMC Med Inform Decis Mak 25, 163 (2025). https://doi.org/10.1186/s12911-025-02996-8