Clinical Accuracy, Relevance, Clarity, and Emotional Sensitivity of Large Language Models to Surgical Patient Questions: Cross-Sectional Study

This cross-sectional study evaluates the clinical accuracy, relevance, clarity, and emotional sensitivity of large language model (LLM) responses to questions from patients undergoing surgery, highlighting their potential as adjunct tools in patient communication and education. Our findings demonstrated high LLM performance across accuracy, relevance, clarity, and emotional sensitivity, with Anthropic’s Claude 2 outperforming OpenAI’s ChatGPT and Google’s Bard, suggesting that LLMs could serve as complementary tools for enhanced information delivery and patient-surgeon interaction.


Introduction
Recent advances in natural language processing (NLP) have produced large language model (LLM) applications, such as OpenAI's ChatGPT, that have captivated a worldwide audience [1]. They have permeated the health care sector, offering several benefits [2]. While LLMs have immense potential to improve clinical practice and patient outcomes, their role has not been completely established [3]. Patients who require surgery often struggle with complex, anxiety-inducing questions [4]. Thus, counseling during preoperative workup is crucial for obtaining informed consent, establishing trust, and ensuring presurgical optimization to improve patient outcomes. This process, being resource-intensive and involving numerous conversations, often delays communication, causing significant frustration for patients [5]. Therefore, the importance of clear, adequate, and timely information delivery cannot be overemphasized. LLMs with chat features could improve preoperative communication; however, LLMs' ability to answer patients' surgical questions has not been extensively studied. Thus, this study aims to assess LLMs' potential and proficiency in responding to questions from patients undergoing surgery.

Overview
In formulating our questionnaire, we used the input of 3 neurosurgical attendings, focusing on common general patient inquiries regarding surgery. We presented 38 patient questions in web sessions to 3 publicly accessible LLMs: ChatGPT (GPT-4; OpenAI), Claude 2 (Anthropic), and Bard (Google) on August 16, 2023 (Multimedia Appendix 1). Questions covered preoperative concerns, procedural aspects, and postoperative considerations. Each reply from the LLMs was reviewed by 2 independent blinded reviewers (MMD and FCO, research fellows with medical doctorates who had not completed postgraduate clinical training). A 5-point Likert scale was used to assess accuracy, relevance, and clarity of responses [6]. Emotional sensitivity was evaluated on a 7-point Likert scale to increase discriminatory power [7]. Assessment of data normality used the Shapiro-Wilk test. Homogeneity of variances (homoscedasticity) across groups was evaluated via the Levene test. For nonparametric analysis, the Kruskal-Wallis test was used to discern differences among groups. Subsequent pairwise comparisons were facilitated by the post hoc Dunn test. In instances where parametric assumptions were upheld, a 1-way ANOVA was conducted, followed by post hoc analysis with the Tukey honestly significant difference (HSD) test. P values from the post hoc analysis were adjusted for multiplicity with Bonferroni correction. Additionally, weighted percentage agreement (WPA) was used to determine agreement between raters. All statistical analyses used Python (version 3.7; Python Software Foundation).
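The statistical workflow above can be sketched in Python as follows. This is an illustrative example on synthetic ratings, not the study's data; the linear weighting scheme in `weighted_percent_agreement` and the use of Bonferroni-corrected pairwise Mann-Whitney U tests (standing in for the Dunn test, which requires the third-party scikit-posthocs package) are assumptions made to keep the sketch self-contained.

```python
# Illustrative sketch of the rating analysis pipeline on synthetic
# Likert data (NOT the study's actual ratings).
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical 5-point Likert accuracy ratings for 38 questions per model
ratings = {
    "ChatGPT": rng.integers(3, 6, 38),
    "Claude 2": rng.integers(4, 6, 38),
    "Bard": rng.integers(2, 6, 38),
}
groups = list(ratings.values())

# 1. Normality per group (Shapiro-Wilk) and homoscedasticity (Levene)
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)
equal_var = stats.levene(*groups).pvalue > 0.05

if normal and equal_var:
    # Parametric path: 1-way ANOVA, then Tukey HSD post hoc
    _, p = stats.f_oneway(*groups)
    posthoc = stats.tukey_hsd(*groups)
else:
    # Nonparametric path: Kruskal-Wallis omnibus test; pairwise
    # Mann-Whitney U with Bonferroni correction stands in for the
    # Dunn test used in the study
    _, p = stats.kruskal(*groups)
    pairs = list(combinations(ratings, 2))
    adjusted_p = {
        (a, b): min(stats.mannwhitneyu(ratings[a], ratings[b]).pvalue
                    * len(pairs), 1.0)
        for a, b in pairs
    }

def weighted_percent_agreement(r1, r2, scale_max=5):
    """Linear-weighted agreement between two raters: full credit for
    exact matches, partial credit shrinking with rating distance
    (an assumed weighting; the study does not specify its scheme)."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    weights = 1 - np.abs(r1 - r2) / (scale_max - 1)
    return weights.mean() * 100
```

For example, `weighted_percent_agreement([5, 4, 3], [5, 4, 4])` returns about 91.7 under this linear scheme, since the single 1-point disagreement receives 0.75 credit.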

Ethical Considerations
The study qualified for institutional review board exemption as it exclusively used questions sourced from surgeon input, with no direct patient involvement.

Principal Findings
Our investigation revealed potential for using LLMs in patient education. Claude 2 achieved significantly higher average ratings, above 90%, for accuracy (P=.004 and P<.001), relevance (P<.001), and clarity (P=.004 and P<.001) compared to ChatGPT and Bard. It also scored significantly better on emotional sensitivity than ChatGPT and Bard (P<.001 and P=.01), with an average rating of 74.3%. In a study parallel to ours, Sezgin et al [8] assessed the clinical accuracy of LLMs in the context of postpartum depression, demonstrating their efficacy in providing clinically accurate information, a finding that complements our study's illustration of LLMs' potential in patient education and engagement. By providing accurate and timely information, LLMs can potentially alleviate patient concerns.

Limitations
The study's limitations include the absence of direct patient input when formulating the questionnaire; the lack of repeated zero-shot questioning, which might have revealed response variability; and the absence of a dedicated analysis of overtly inaccurate "hallucinations." The principal challenge for LLM deployment in clinical settings lies in regulatory approval and secure integration within health care systems [9]. We are actively conceptualizing a randomized clinical trial addressing these limitations to investigate LLM and surgeon responses as rated by patients and surgeons.

Conclusions
While surgeons remain indispensable in patient education, LLMs can potentially serve as a complementary tool, enhancing information delivery and supporting patient-surgeon interactions.