Turkish Neurosurgery
Beyond Human ‘Eyes’ in Neurosurgical Exams: Success of Artificial Intelligence (ChatGPT-4o, Grok, and Gemini) in the Image-Based Questions of Turkish Neurosurgical Society Proficiency Board Exams
Alperen Sozer1, Gokberk Erol2, Ozan Yavuz Tufek3, Batuhan Sozer4, Merve Buke Sahin5, Mustafa Caglar Sahin6
1Sincan Training and Research Hospital, Department of Neurosurgery, Ankara,
2Adiyaman Training and Research Hospital, Department of Neurosurgery, Adiyaman,
3Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara,
4Ankara Medipol University Faculty of Medicine, Ankara,
5Kulu District Health Directorate, Department of Public Health, Konya,
6Kulu State Hospital, Department of Neurosurgery, Konya,
DOI: 10.5137/1019-5149.JTN.49058-25.2

Aim: With the growing availability of large language models (LLMs), generative artificial intelligence has been transforming various fields. Medical training and neurosurgical education are no exception, as they are increasingly influenced by these advancements. One of the latest capabilities of major LLMs is image interpretation, a crucial aspect of neurosurgical training.

Material and Methods: This study evaluated the performance of three major LLMs (ChatGPT-4o, Grok, and Gemini) on image-based neurosurgical proficiency board questions and compared their latest versions.

Results: Real-life candidates answered correctly 70.75% of the time. LLMs answered correctly 47.38% of the time and were significantly outperformed by the candidates. Prompt selection was found to significantly influence the performance of GPT and Grok, but not Gemini. The LLMs matched and significantly outperformed the candidates only when the best answers from all three models across four runs were combined.

Conclusion: Although previous research has demonstrated strong capabilities of LLMs in text-only questions, the results of the present study revealed that the image analysis abilities of these models need further improvement when compared with actual candidates. Furthermore, the impact of prompt selection and repeated questioning should be emphasized, particularly when seeking correlation with real-life exam results.

Corresponding author: Mustafa Caglar Sahin