Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study

Background: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored. Objective: This study aims to evaluate 3 large language model chatbots—Claude-2, GPT-3.5, and GPT-4—on assigning RADS categories to radiology reports and to assess the impact of different prompting strategies. Methods: This cross-sectional study compared the 3 chatbots using 30 radiology reports (10 per RADS system), applying a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in the Liver Imaging Reporting & Data System (LI-RADS) version 2018, the Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and the Ovarian-Adnexal Reporting & Data System (O-RADS) MRI.



Introduction
Since ChatGPT's public release in November 2022, large language models (LLMs) have attracted great interest in medical imaging applications [1]. Research indicates that ChatGPT holds promise across various aspects of the medical imaging workflow. Even without radiology-specific pretraining, LLMs can pass board examinations [2], provide radiology decision support [3], assist in differential diagnosis [3][4][5][6], and generate impressions from findings or structured reports [7][8][9]. These applications not only accelerate the imaging diagnosis process and alleviate physicians' workload but may also improve diagnostic accuracy [10]. However, limitations exist: one study found that ChatGPT-3 produced erroneous answers to a third of daily clinical questions and that about 63% of the references it provided could not be located [11]. This tendency to produce inaccurate responses is less frequent in GPT-4 but still limits usability in medical education and practice at present [12]. Tailoring LLMs to radiology may enhance reliability, as an appropriateness-criteria context-aware chatbot outperformed both generic chatbots and radiologists [12].
The American College of Radiology (ACR) Reporting and Data Systems (RADS) standardize the communication of imaging findings. As of August 2023, nine disease-specific systems have been endorsed by the ACR, encompassing products ranging from lexicons to report templates [13]. RADS reduce terminology variability, facilitate communication between radiologists and referring physicians, allow consistent evaluations, and convey clinical significance to improve care. However, their complexity and radiologists' unfamiliarity with them limit adoption, and efforts to broaden the implementation of RADS are therefore warranted. To that end, we conducted this study to evaluate LLMs' capabilities on a focused RADS assignment task using simulated radiology cases.
Recently, the technique of "prompt tuning" has emerged as a valuable approach to refining the performance of LLMs, particularly for specific domains or tasks [14]. By providing structured queries or exemplary responses, chatbot output can be tailored toward accurate and relevant answers. Such strategies leverage the knowledge already encoded in LLMs while guiding its appropriate delivery for particular challenges [14]. Given the complexity and specificity of RADS categorization, our investigation emphasizes the impact of different prompts to assess chatbot capabilities and the potential for performance enhancement through refined prompt tuning.
In this study, our primary objective was to rigorously evaluate the performance of three LLMs (GPT-3.5, GPT-4, and Claude-2) on RADS categorization using different prompt tuning strategies. We aimed to test their accuracy and consistency in RADS categorization and to shed light on the potential benefits and limitations of relying on chatbot-derived information for assigning specific RADS categories.

Methods
This study was deemed exempt by the Institutional Review Board, owing to the absence of human subject involvement.

Study Design
The workflow of the study is shown in Figure 1. We conducted a cross-sectional analysis in September 2023 to evaluate the competency of three chatbots (GPT-3.5, GPT-4 [OpenAI, August 30, 2023 version] [15], and Claude-2 [Anthropic] [16]) in assigning three RADS categorizations to simulated radiological findings. Given that the chatbots' knowledge cutoff was September 2021, we opted for the Liver Imaging Reporting & Data System (LI-RADS®) CT/MRI v2018 [17], the Lung CT Screening Reporting & Data System (Lung-RADS®) v2022 [18], and the Ovarian-Adnexal Reporting & Data System (O-RADS™) MRI (developed in 2022) [19] as the yardsticks against which to compare the responses generated by GPT-3.5, GPT-4, and Claude-2. A total of 30 simulated radiology reports were composed for this analysis, with 10 cases representing each of the three RADS reporting systems. The cases were drafted by three board-certified radiologists specializing in thoracic, abdominal, and gynecological imaging, respectively. The objective was to evaluate the chatbots' performance on a highly structured radiology workflow task involving cancer risk categorization based on structured report inputs. The study design focused on a defined use case to illuminate the strengths and limitations of existing natural language processing technology in this radiology subdomain.

Prompts
We collected and analyzed responses from GPT-3.5, GPT-4, and Claude-2 for each simulated case.
To mitigate bias, the radiological findings were presented individually via separate interactions, with the corresponding responses saved for analysis. Three prompt templates were designed to elicit each RADS categorization along with an explanatory rationale. Prompt-0 was a zero-shot prompt that merely introduced the RADS assignment task, such as "Your task is to follow the Lung-RADS® v2022 guideline to give the Lung-RADS category of the radiological findings delimited by angle brackets." Prompt-1 was a few-shot prompt that additionally furnished an exemplar of RADS categorization, including the reasoning, summarized impression, and final category, for example: "Your task is to follow the Lung-RADS® v2022 guideline to give the Lung-RADS category of the radiological findings delimited by angle brackets. <...Radiological Findings...> Answer: Rationale: {...} Overall: {...} Summary: {...} Lung-RADS Category: X".
Prompt-2 further instructed the chatbots to consult the PDF of the corresponding RADS guideline, compensating for their lack of radiology-specific pretraining. For Claude-2, the PDF could be ingested directly, whereas GPT-4 required an "Ask for PDF" plugin to extract the pertinent information [20,21].
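To make the three prompt levels concrete, the sketch below shows how such templates might be assembled programmatically. This is an illustration only: the function names and exact wording are hypothetical, and the prompts actually used in the study are reproduced in Appendices 1-3.

```python
# Illustrative sketch of the three prompt levels (hypothetical helper names;
# the study's exact prompt wording is given in Appendices 1-3).

EXEMPLAR = "Rationale: {...} Overall: {...} Summary: {...} Lung-RADS Category: X"

def prompt_0(findings: str, system: str = "Lung-RADS v2022") -> str:
    """Zero-shot: only the task instruction plus the findings."""
    return (
        f"Your task is to follow the {system} guideline to give the "
        f"{system.split()[0]} category of the radiological findings "
        "delimited by angle brackets.\n"
        f"<{findings}>"
    )

def prompt_1(findings: str, exemplar: str = EXEMPLAR,
             system: str = "Lung-RADS v2022") -> str:
    """Few-shot: the same instruction plus one worked exemplar showing the
    desired answer structure (Rationale / Overall / Summary / Category)."""
    return prompt_0(findings, system) + "\n\nExample answer:\n" + exemplar

def prompt_2(findings: str, exemplar: str = EXEMPLAR,
             system: str = "Lung-RADS v2022") -> str:
    """Guideline-informed: the few-shot prompt plus an instruction to consult
    the attached guideline PDF (ingested directly by Claude-2, or via a PDF
    plugin for GPT-4)."""
    return ("Consult the attached official guideline PDF before answering.\n"
            + prompt_1(findings, exemplar, system))
```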
Each case was evaluated six times with each chatbot across the three prompt levels. The detailed prompts and guideline PDFs can be found in Appendices 1-3.

Evaluation of chatbots
Two study authors (Q.W. and H.L.) independently evaluated each chatbot response in a blinded manner, with any discrepancies resolved by a third senior radiologist (Y.W.). The following were assessed for each response: (1) Patient-level RADS categorization: Judged as correct, incorrect, or unsure. "Correct" denotes that the chatbot accurately identified the patient-level RADS category, irrespective of the rationale provided. "Unsure" denotes that the chatbot's response failed to provide a decisive RADS category; for example, a response stating that "a definitive Lung-RADS category cannot be assigned" would be categorized as unsure.
(2) Overall rating: Assessed as either correct or incorrect. A response was judged incorrect if any of the following errors were identified: E1 - Factual extraction error, denoting the chatbots' inability to paraphrase the radiological findings accurately, consequently misinterpreting the information.
E2 - Hallucination, encompassing the fabrication of nonexistent RADS categories (E2a) and RADS criteria (E2b).
E3 - Reasoning error, which includes the incapacity to logically interpret the imaging description (E3a) and the RADS category (E3b). The subtypes of imaging description reasoning errors include the inability to reason about lesion signal (E3ai), lesion size (E3aii), and/or enhancement (E3aiii) accurately.
E4 - Explanatory error, encompassing inaccurate elucidation of the RADS category's meaning (E4a) and erroneous explanation of the recommended management and follow-up corresponding to the RADS category (E4b).
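For reference, the error taxonomy above can be summarized as a simple nested mapping (an illustrative encoding only, not part of the study's tooling):

```python
# Illustrative encoding of the study's error taxonomy.
ERROR_TAXONOMY = {
    "E1": "Factual extraction error (inaccurate paraphrasing of findings)",
    "E2": {  # Hallucination
        "E2a": "Fabrication of nonexistent RADS categories",
        "E2b": "Fabrication of invalid RADS criteria",
    },
    "E3": {  # Reasoning error
        "E3a": {  # Imaging description reasoning
            "E3ai": "Incorrect reasoning about lesion signal",
            "E3aii": "Incorrect reasoning about lesion size",
            "E3aiii": "Incorrect reasoning about enhancement",
        },
        "E3b": "Incorrect RADS category reasoning",
    },
    "E4": {  # Explanatory error
        "E4a": "Inaccurate explanation of RADS category meaning",
        "E4b": "Erroneous management/follow-up recommendation",
    },
}
```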
If a chatbot's response manifested any of the aforementioned errors, it was labeled as incorrect, with the specific type of error documented. To assess the consistency of the evaluations, a k-pass voting method was also applied: a case was deemed accurately categorized if it met the criteria in a minimum of 4 of the 6 runs, as sketched below.
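A minimal sketch of this voting rule (variable names are illustrative; the study's actual tooling is not described):

```python
def k_pass_correct(run_results: list[bool], k: int = 4) -> bool:
    """Return True if at least k of the repeated runs were judged correct.

    In this study, each case had 6 runs and the threshold was k = 4.
    """
    return sum(run_results) >= k

# Example: a case judged correct in 5 of 6 runs meets the 4-of-6 threshold.
print(k_pass_correct([True, True, False, True, True, True]))  # True
```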

Statistical analyses
The accuracy of the patient-level RADS categorization and of the overall rating for each chatbot was compared using the chi-squared test. Agreement across the six repeated runs was assessed using Fleiss's kappa, with agreement strength interpreted as follows: less than 0, poor; 0-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, substantial; and 0.81-1.00, almost perfect. Statistical significance was defined as p<0.05.
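A sketch of how these analyses might be carried out in Python, assuming scipy and statsmodels are available (the contingency counts and ratings below are illustrative placeholders, not study data):

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Accuracy comparison between two chatbots: a 2x2 contingency table of
# correct vs. incorrect counts (illustrative numbers only).
observed = np.array([[45, 15],   # chatbot A: correct, incorrect
                     [30, 30]])  # chatbot B: correct, incorrect
chi2, p, dof, _ = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")

# Agreement across the six repeated runs: Fleiss's kappa over a
# cases-by-runs matrix of category codes (random illustrative data).
ratings = np.random.default_rng(0).integers(0, 3, size=(30, 6))
table, _ = aggregate_raters(ratings)  # per-case counts for each category
print(f"Fleiss's kappa = {fleiss_kappa(table):.2f}")
```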

Consistency of Chatbots
As shown in Table 3, agreement across the six repeated runs was assessed with Fleiss's kappa for each chatbot and prompt level.

Subgroup analysis
Since the knowledge base of ChatGPT was frozen as of September 2021, and to account for the knowledge limitations of LLMs developed before the latest RADS guideline updates, we compared responses across the different RADS criteria. The total number of accurate responses across the six runs was computed for all prompts. Both GPT-4 and Claude-2 demonstrated superior performance on LI-RADS CT/MRI v2018 as opposed to Lung-RADS v2022 and O-RADS MRI (all p<0.05; Table 4). Figure 3 delineates the performance of the various chatbots across the different prompts and RADS criteria. For the overall rating (Figure 3A), Claude-2 exhibited a progressive improvement from Prompt-0 to Prompt-1 to Prompt-2. Conversely, GPT-4 improved with Prompt-1 and Prompt-2 over Prompt-0, but Prompt-2 did not exceed Prompt-1. For the RADS categorization (Figure 3B), Prompt-1 and Prompt-2 outperformed Prompt-0 for LI-RADS, irrespective of chatbot. However, for Lung-RADS and O-RADS, Prompt-0 sometimes outperformed Prompt-1.

Analysis of error types
A total of 1440 responses were analyzed for error types, with details provided in Appendix 4. A bar plot illustrating the distribution of errors across the three chatbots is shown in Figure 4. A typical example of a factual extraction error (E1) occurred in response to the 7th Lung-RADS question: the statement "The 3mm solid nodule in the lateral basal segmental bronchus is subsegmental" is inaccurate, as the lateral basal segmental bronchus belongs to one of the 18 defined lung segments, not a subsegment [22]. Hallucination of inappropriate RADS categories (E2a) occurred more frequently with Prompt-0 across all three chatbots. However, this error rate decreased to zero for Claude-2 when using Prompt-2, a trend not seen with GPT-3.5 or GPT-4. A recurrent E2a error in LI-RADS was the obsolete category LR-5V from the 2014 version, now superseded by LR-TIV in subsequent editions [23,24]. Furthermore, hallucination of invalid RADS criteria (E2b) was more prevalent than E2a. For instance, the response to the second LI-RADS question stating that "T2 marked hyperintensity is a feature commonly associated with hepatocellular carcinoma (HCC)" is inaccurate, as marked T2 hyperintensity is characteristic of hemangioma, not HCC. Despite higher initial E2b rates, Claude-2 demonstrated a substantial reduction with Prompt-2 (from 105 to 38 instances), exceeding the decrease seen with GPT-4 (from 71 to 57 instances).
Regarding reasoning errors, incorrect RADS category reasoning (E3b) was the most frequent error but decreased for all chatbots with Prompt-1 and Prompt-2 versus Prompt-0. Claude-2 reduced these errors by almost half with Prompt-2, while the decrease for GPT-4 was less pronounced. Lesion signal interpretation errors (E3ai) included misinterpreting hypointensity on diffusion-weighted imaging (DWI) as "restricted diffusion" rather than facilitated diffusion. Lesion size reasoning errors (E3aii) occurred in 34 of the 1440 responses, predominantly with Claude-2 (25/34, 73.5%), especially in systems such as Lung-RADS and LI-RADS, where size is critical for categorization. Examples included attributing a 12 mm pulmonary nodule to the ≥6 mm but <8 mm range, or assigning a hepatic lesion measuring 2.3 by 1.5 cm to the 10-19 mm category. Enhancement reasoning errors (E3aiii) were exclusive to Claude-2 in O-RADS, where enhancement significantly impacts categorization; misclassifying images acquired 40 seconds post-contrast as early or delayed enhancement exemplifies this error.
Explanatory errors (E4), including incorrect RADS category definitions (E4a) and inappropriate management recommendations (E4b), also declined substantially with Prompt-1 and Prompt-2. For instance, in the response to the first Lung-RADS question, the statement "The 4X designation indicates infectious/inflammatory etiology is suspected" is incorrect: Lung-RADS 4X denotes category 3 or 4 nodules with additional features or imaging findings that increase the suspicion of lung cancer [18].

Discussion
In this study, we evaluated the performance of three chatbots (GPT-3.5, GPT-4, and Claude-2) in categorizing radiological findings according to RADS criteria. Using three levels of prompts providing increasing structure, examples, and domain knowledge, the chatbots' accuracy and consistency were quantified across 30 simulated cases. The best performance was achieved by Claude-2 when provided with few-shot prompting and the RADS criteria PDFs. Interestingly, the chatbots tended to categorize better under the relatively older LI-RADS v2018 criteria than under the more recent Lung-RADS v2022 and O-RADS guidelines, which were published after the chatbots' training cutoff.
The incorporation of RADS, which standardize reporting in radiology, has been a significant advancement, although the multiplicity and complexity of these systems impose a steep learning curve on radiologists [13]. Even for subspecialized radiologists at tertiary hospitals, mastering the numerous RADS guidelines poses challenges, requiring familiarity with the lexicons, regular application in daily practice, and ongoing learning to remain current with new versions. While previous studies have shown that LLMs can assist radiologists in various tasks [2-5,7,11], their performance at RADS categorization from imaging findings had not been tested. We therefore evaluated LLMs on focused RADS categorization of simulated cases.
Without prompt engineering (Prompt-0), all chatbots performed poorly. However, accuracy improved for all chatbots when they were provided an exemplar prompt demonstrating the desired response structure (Prompt-1). This underscores the utility of prompt tuning for aligning LLMs with specific domains such as radiology. Further enriching Prompt-1 with the RADS guideline PDFs as a relevant knowledge source (Prompt-2) considerably enhanced Claude-2's accuracy, a feat not mirrored by GPT-4. This discrepancy could stem from ChatGPT's reliance on an external plugin to access documents, whereas Claude-2's architecture accommodates the direct assimilation of expansive texts, benefiting from its larger context window and superior long-document processing capabilities.
Notably, we discerned performance disparities across RADS criteria. When queried on older, established guidelines such as LI-RADS v2018 [17], the chatbots demonstrated greater accuracy than on more recent schemes such as Lung-RADS v2022 and O-RADS [18,19,25]. Specifically, GPT-4 and Claude-2 had significantly higher total correct ratings for LI-RADS than for Lung-RADS and O-RADS (all p<0.05). This could be attributed to their extensive exposure to the voluminous data related to the mature LI-RADS during their pretraining phase. With Prompt-2, Claude-2 achieved 75% (45/60) overall rating accuracy for LI-RADS categorization. The poorer performance on newer RADS criteria highlights the need for strategies to continually align LLMs with the most up-to-date knowledge.
A deeper dive into the error-type analysis revealed informative trends. Incorrect RADS category reasoning (E3b) constituted the most frequent error across chatbots, decreasing with prompt tuning. Targeted prompting also reduced critical errors such as hallucinations of RADS criteria (E2b) and categories (E2a), likely by constraining output to valid responses. During pretraining, GPT-like LLMs predict the next word over unlabeled datasets, risking the learning of fallacious relationships between RADS features. For instance, Lung-RADS v2022 lacks categories 5 and 6 [18], although some other RADS, such as the Breast Imaging Reporting and Data System, include them [26]; with Prompt-0, chatbots erroneously hallucinated Lung-RADS categories 5 and 6. Explanatory errors (E4), including inaccurate definitions of the assigned RADS category (E4a) and inappropriate management recommendations (E4b), also declined substantially with prompt tuning. For instance, when queried on the novel O-RADS criteria with Prompt-0, chatbots hallucinated follow-up recommendations from other RADS criteria, responding that "O-RADS category 3 refers to an indeterminate adnexal mass and warrants short-interval follow-up". Targeted prompting appears to mitigate such critical errors of hallucination and incorrect reasoning; careful prompt engineering is essential to properly shape LLM knowledge for radiology tasks.
There are also several limitations to this study. First, only LI-RADS CT/MRI and O-RADS MRI were included, excluding the LI-RADS ultrasound (US) and O-RADS US guidelines, which are often practiced in an independent ultrasound department [27,28]. Second, the chatbots' performance was heavily dependent on prompt quality. We tested only three types of prompts; further studies of prompting strategies are warranted to investigate the impact of more exhaustive engineering on the chatbots' accuracy. Third, GPT-4-turbo was released on November 6, 2023, representing the latest GPT-4 model, with improvements in instruction following, reproducible outputs, and more [29]. Furthermore, its training data extend to April 2023, compared with September 2021 for the base GPT-4 model tested here. The performance of this newest GPT-4-turbo model on the RADS categorization task remains uncertain, and evaluating it represents an important direction for future work. Fourth, our study focused on 3 of the 9 RADS [13], although our choice ensured a blend of old and new guidelines. While this approach likely offers a representative snapshot of LLM abilities, it may constrain broader applicability to cutting-edge knowledge. Extending evaluations to all the latest RADS guidelines could further discern limitations. Nonetheless, this initial study highlights the critical considerations of prompt design and knowledge calibration required for safely applying LLMs in radiology.
In conclusion, when equipped with structured prompts and guideline PDFs, Claude-2 demonstrates potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS v2018.However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.
Our study highlights the potential of LLMs in streamlining radiological categorizations while also pinpointing the enhancements necessary for their dependable application in clinical practice for RADS categorization tasks.

Figure 2. Bar graphs show the comparison of chatbot performance across six runs regarding (A) overall rating and (B) patient-level RADS categorization. RADS = Reporting and Data Systems.

Figure 3. The performance of chatbots and prompts within different RADS criteria: (A) overall rating, (B) patient-level RADS categorization. RADS = Reporting and Data Systems.

Figure 4. The number of error types for different chatbots. E1 - Factual extraction error, denoting the chatbots' inability to paraphrase the radiological findings accurately, consequently misinterpreting the information. E2 - Hallucination, encompassing the fabrication of nonexistent RADS categories (E2a) and RADS criteria (E2b). E3 - Reasoning error, which includes the incapacity to logically interpret the imaging description (E3a) and the RADS category (E3b); subtypes of imaging description reasoning errors include the inability to reason about lesion signal (E3ai), lesion size (E3aii), and/or enhancement (E3aiii) accurately. E4 - Explanatory error, encompassing inaccurate elucidation of RADS category meaning (E4a) and erroneous explanation of the recommended management and follow-up corresponding to the RADS category (E4b). RADS = Reporting and Data Systems.

Table 1. Correct overall ratings of different chatbots and prompts.

Table 3. The consistency of different chatbots and prompts across the six runs.
Note: Data are results from Fleiss's kappa. RADS = Reporting and Data Systems.

Table 4. The performance of chatbots within different RADS criteria.
Note: Data are aggregate numbers across the six runs. RADS = Reporting and Data Systems.