Mehmet Şahap, Michael McCarthy, M N Elmarawany, Stuart H. James, MP Grevitt, Rohan Jayasuriya, Andrew M. Jones, Andrew Bowey, D. Chan, Edward Bayley, Ian Harding, James Tomlinson, John P. Andrews, Shreya Srinivas
Abstract

Aim: Can Large Language Models (LLMs) provide the answers to common controversial spinal surgery scenarios and answer the dilemmas that we can’t?

Method: Fifty-four highly detailed questions were developed on 18 scenarios, for example, ‘Management of Painless Foot Drop’. Nine Consultant Spinal Surgeons answered the questions on 2 separate occasions. The questions were submitted to 4 LLMs, and the answers were regenerated 5 times for each. Response reproducibility and consistency were compared, and a thematic analysis of the AI answers was undertaken.

Results: Bing Chat was excluded from the study. ChatGPT3.5 refused to give a definitive answer in 14% of its answers, ChatGPT4 in 29% and Bard in 11%. ChatGPT3.5 suggested the user seek medical advice in 60% of its answers, ChatGPT4 in 99% and Bard in 45%. Surgeons stated they were confident in 96% of their answers. AI answers were deemed decisive in 71% for ChatGPT3.5, 24% for ChatGPT4 and 92% for Bard. Reproducibility of answers averaged 63% for the Consultants and 64% for AI overall. Agreement on each question averaged 66% between the Consultants and 64% between the AI models. Thematic analysis of the AI answers revealed themes including surgical and conservative management, individualised approach, risks and benefits, consideration of severity and duration of symptoms, and decision-making processes.

Conclusions: ChatGPT and Bard provide detailed answers to common controversial spinal surgery scenarios; however, they are not as decisive as consultants and agreed on fewer management plans. Many of these scenarios remained unanswered.