
Evaluating the Performance of Large Language Models on Multispecialty FRCS Section 1 Questions

Published 1 day ago · 2 minute read

Large language models (LLMs) have increasingly demonstrated utility in medical education and professional examinations. However, their reliability, accuracy, and consistency in answering complex surgical questions remain unclear. This study aims to assess the accuracy, consistency, and intermodel reliability of four widely used LLMs (ChatGPT-4o, Google Gemini, Perplexity AI, and Microsoft Copilot) in answering Fellowship of the Royal Colleges of Surgeons (FRCS) Section 1 single-best-answer questions.

A total of 50 single-best-answer (SBA) questions from the official Joint Committee on Intercollegiate Examinations sample set, covering ten surgical specialties, were presented to each LLM three times in independent sessions to prevent memory effects. Accuracy (correct versus incorrect responses), response consistency across repeated trials, and intermodel reliability were evaluated.
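The abstract does not publish the scoring code, but the two per-model metrics it describes are straightforward to operationalize. A minimal Python sketch, assuming a hypothetical response log keyed by question ID (the data below is purely illustrative), might look like this:

```python
# Hypothetical response log: responses[model][question_id] lists the model's
# answers ("A"-"E") across the three independent sessions. Illustrative only.
responses = {
    "ChatGPT-4o": {1: ["B", "B", "B"], 2: ["C", "A", "C"]},
}
answer_key = {1: "B", 2: "C"}  # illustrative only

def accuracy(model_answers, key):
    """Fraction of correct responses over all questions and all trials."""
    trials = [ans == key[qid]
              for qid, answers in model_answers.items()
              for ans in answers]
    return sum(trials) / len(trials)

def consistency(model_answers):
    """Fraction of questions answered identically in all three trials."""
    stable = sum(len(set(answers)) == 1 for answers in model_answers.values())
    return stable / len(model_answers)

for model, answers in responses.items():
    print(f"{model}: accuracy={accuracy(answers, answer_key):.2%}, "
          f"consistency={consistency(answers):.2%}")
```

Note that accuracy and consistency are independent: a model can be perfectly consistent while consistently wrong, which is why the study reports both.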

ChatGPT-4o had the highest accuracy (81.33%, 122/150, P < 0.0001), followed by Gemini (69.33%), Perplexity (64%), and Copilot (59.33%). ChatGPT-4o achieved 100% accuracy in cardiothoracic surgery and neurosurgery, whereas Gemini performed poorly in neurosurgery (40%) and urology (20%). Otolaryngology and plastic surgery had lower accuracy across all models. Gemini and Perplexity showed the highest consistency (90%). Intermodel reliability was low (Fleiss' kappa = 0.127, P < 0.0001), with cardiothoracic surgery showing the highest agreement (0.401) and oral and maxillofacial surgery the lowest (-0.0992).
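Fleiss' kappa, the agreement statistic quoted above, measures how often multiple raters (here, the four LLMs) choose the same option beyond what chance would predict. A self-contained sketch of the standard formula follows; the count matrix is purely illustrative, not the study's data:

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (n_subjects x n_categories) count matrix,
    where each row counts how many raters assigned each category."""
    n_raters = ratings.sum(axis=1)[0]          # assumes equal raters per subject
    p_j = ratings.sum(axis=0) / ratings.sum()  # overall category proportions
    # Per-subject observed agreement, averaged, versus chance agreement.
    P_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 questions, 4 "raters" (the four LLMs), 5 answer options A-E.
# Each row counts how many models chose each option for that question.
counts = np.array([
    [4, 0, 0, 0, 0],   # all four models agree
    [2, 2, 0, 0, 0],
    [1, 1, 1, 1, 0],   # complete disagreement
    [0, 3, 1, 0, 0],
])
print(f"kappa = {fleiss_kappa(counts):.3f}")
```

Values near zero, like the reported 0.127, indicate agreement only marginally above chance, and negative values, as seen in oral and maxillofacial surgery, indicate agreement below chance.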

ChatGPT-4o performed best overall, whereas the other models showed variable accuracy and lower agreement. Although Gemini and Perplexity demonstrated high internal consistency, intermodel reliability was limited. The study findings suggest that, although promising, these tools should be used with care in FRCS surgical assessments.

Keywords: Artificial intelligence; Education; Examination; Fellowship; Royal College of Surgeons.


Source: PubMed
