Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases With Atypical Presentation: Descriptive Research

Abstract

Background: The persistence of diagnostic errors, despite advances in medical knowledge and diagnostics, highlights the importance of understanding atypical disease presentations and their contribution to mortality and morbidity. Artificial intelligence (AI), particularly generative pre-trained transformers like GPT-4, holds promise for improving diagnostic accuracy, but requires further exploration in handling atypical presentations.

Objective: This study aimed to assess the diagnostic accuracy of ChatGPT in generating differential diagnoses for atypical presentations of common diseases, with a focus on the model's reliance on patient history during the diagnostic process.

Methods: We used 25 clinical vignettes from the Journal of Generalist Medicine characterizing atypical manifestations of common diseases. Two general medicine physicians categorized the cases based on their degree of atypicality. ChatGPT was then used to generate differential diagnoses based on the clinical information provided. The concordance between AI-generated and final diagnoses was measured, with a focus on the top-ranked disease (top 1) and the top 5 differential diagnoses (top 5).

Results: ChatGPT's diagnostic accuracy decreased as the degree of atypicality increased. For category 1 (C1) cases, the concordance rates were 17% (n=1) for the top 1 and 67% (n=4) for the top 5. Categories 3 (C3) and 4 (C4) showed 0% concordance for the top 1 and markedly lower rates for the top 5, indicating difficulties in handling highly atypical cases. The χ² test revealed no significant difference in top 1 diagnostic accuracy between the less atypical (C1+C2) and more atypical (C3+C4) groups (χ²₁=2.07; n=25; P=.13). However, a significant difference was found in the top 5 analysis, with less atypical cases showing higher accuracy (χ²₁=4.01; n=25; P=.048).

Conclusions: ChatGPT-4 demonstrates potential as an auxiliary tool for diagnosing typical and mildly atypical presentations of common diseases. However, its performance declines with greater atypicality. The study findings underscore the need for AI systems to encompass a broader range of linguistic capabilities, cultural understanding, and diverse clinical scenarios to improve diagnostic utility in real-world settings.



Original Paper

Introduction
For the past decade, medical knowledge and diagnostic techniques have expanded globally, becoming more accessible with remarkable advancements in clinical testing and useful reference systems [1]. Despite these advancements, misdiagnosis remains a significant contributor to mortality, making it a noteworthy public health issue [2,3]. Studies have revealed discrepancies between clinical and postmortem autopsy diagnoses in at least 25% of cases, with diagnostic errors contributing to approximately 10% of deaths and to 6%-17% of hospital adverse events [4-8]. The significance of atypical presentations as a contributor to diagnostic errors is especially notable, with recent findings suggesting that such presentations are prevalent in a substantial portion of outpatient consultations and are associated with a higher risk of diagnostic inaccuracies [9]. This underscores the persistent challenge of diagnosing patients correctly, owing to the variability in disease presentation and the reliance on medical history, which accounts for approximately 80% of medical diagnoses [10,11].
The advent of artificial intelligence (AI) in healthcare, particularly through natural language processing (NLP) models such as the Generative Pre-trained Transformer (GPT), has opened new avenues in medical diagnosis [12]. Recent studies on AI medical diagnosis across various specialties, including neurology [13], dermatology [14], radiology [15], and pediatrics [16], have shown promising results, improving diagnostic accuracy, efficiency, and safety. Among these developments, ChatGPT-4, a state-of-the-art AI model developed by OpenAI, has demonstrated remarkable capabilities in understanding and processing medical language, significantly outperforming its predecessors in medical knowledge assessments and potentially transforming medical education and clinical decision support systems [12,17].
Notably, one study found that ChatGPT could pass the United States Medical Licensing Examination (USMLE), highlighting its potential in medical education and medical diagnosis [18,19]. Moreover, in controlled settings, ChatGPT-4 has shown over 90% accuracy in diagnosing common diseases with typical presentations, based on chief complaints and patient history [20]. However, while research has examined the diagnostic accuracy of AI chatbots, including ChatGPT models, in generating differential diagnoses for complex clinical vignettes derived from general internal medicine (GIM) department case reports, their diagnostic accuracy in handling atypical presentations of common diseases remains less explored [21,22]. One notable study evaluated the accuracy of the differential diagnosis lists generated by both third- and fourth-generation ChatGPT models using case vignettes from case reports published by the Department of GIM of Dokkyo Medical University Hospital, Japan. ChatGPT-4 achieved correct diagnosis rates within the top 10 differential diagnosis lists, the top 5 lists, and the top diagnosis of 83%, 81%, and 60%, respectively, rates comparable to those of physicians. Although that study highlights the potential of ChatGPT-4 as a supplementary tool for physicians, particularly in the context of GIM, it also underlines the importance of further investigation into the diagnostic accuracy of ChatGPT with atypical disease presentations (Figure 1). Given the crucial role of patient history in diagnosis and the inherent variability in disease presentation, our study expands upon this foundation to assess the accuracy of ChatGPT-4 in diagnosing common diseases with atypical presentations [23].
[Figure 1 here] More specifically, this study aims to evaluate the hypothesis that the diagnostic accuracy of AI, exemplified by ChatGPT-4, declines when dealing with atypical presentations of common diseases. We hypothesize that despite the known capabilities of AI in recognizing typical disease patterns, its performance will be significantly challenged when presented with clinical cases that deviate from these patterns, leading to reduced diagnostic precision. Consequently, this study seeks to systematically assess this hypothesis and explore its implications for the integration of AI in clinical practice. By exploring the contribution of AI-assisted medical diagnoses to common diseases with atypical presentation and patient history, the study assesses the accuracy of ChatGPT in reaching a clinical diagnosis based on the medical information provided. By reevaluating the significance of medical information, our study contributes to the ongoing discourse on optimizing diagnostic processes, both conventional and AI-assisted.

Methods

Study Design, Settings, and Participants
This study utilized a series of 25 clinical vignettes from a special issue of Generalist Medicine (International Standard Serial Number 2188-8051, Japanese), published on March 5, 2024. These vignettes, which exemplify atypical presentations of common diseases, were selected for their alignment with our research aim to explore the impact of atypical disease presentations in AI-assisted diagnosis. The clinical vignettes were derived from real patient cases and curated by an editorial team specializing in general internal medicine, with final edits by KS. Each case included comprehensive details such as age, gender, chief complaints, medical history, medication history, current illness, and physical examination findings, along with the final diagnosis and the initial misdiagnosis.
An expert panel comprising two general medicine and medical education physicians, TS and YO, initially reviewed these cases. After deliberation, they selected all 25 cases as exemplifying atypical presentations of common diseases. Subsequently, TS and YO evaluated their degree of atypicality and categorized them into four distinct levels, using the following definition as a guide: "Atypical presentations have a shortage of prototypical features. These can be defined as features that are most frequently encountered in patients with the disease, features encountered in advanced presentations of the disease, or simply features of the disease commonly listed in medical textbooks. Atypical presentations may also have features with unexpected values [24]." Category 1 was assigned to cases that were closest to the typical presentations of common diseases, whereas Category 4 was designated for those that were markedly atypical. In instances where TS and YO did not reach consensus, a third expert, KS, was consulted. Through collaborative discussions, the panel reached a consensus on the final category for each case, ensuring a systematic and comprehensive evaluation of the atypical presentations of common diseases (Figure 2).
[Figure 2 here] Our analysis was conducted on March 12, 2024, utilizing ChatGPT-4's proficiency in Japanese. The language processing was enabled by the standard capabilities of the ChatGPT-4 model, requiring no additional adaptations or programming by our team. We exclusively used text-based input for the generative AI, excluding tables or images to maintain a focus on linguistic data. This approach is consistent with the typical constraints of language-based AI diagnostic tools. Inputs to ChatGPT-4 consisted of direct transcriptions of the original case reports in Japanese, ensuring that the authenticity of the medical information was preserved. We measured the concordance between AI-generated differential diagnoses and the vignettes' final diagnoses, as well as the initial misdiagnoses. Our investigation entailed inputting clinical information, including medical history, physical examination, and laboratory data, into ChatGPT, followed by posing the question "List the differential diagnoses in order of likelihood, based on the provided vignette's information"; the resulting lists were labeled "GAI differential diagnoses."

Results

Overall, the "GAI differential diagnoses" and the "Final diagnosis" coincided in 12% (3/25) of cases for the first-listed (top 1) differential diagnosis, and in 44% (11/25) of cases within the top 5 differential diagnoses. The interrater reliability for the atypicality categorization was substantial (Cohen's kappa=0.84). The analysis of the concordance rates between the "GAI differential diagnoses" generated by ChatGPT and the "Final diagnosis" from the Journal of Generalist Medicine revealed distinct patterns across the four categories of atypical presentations (Table 2). For the top 1 differential diagnosis, Category 1 (C1) cases, which were closest to typical presentations, showed a concordance rate of 17%, whereas Category 2 (C2) cases exhibited a slightly higher rate of 22%. Remarkably, Categories 3 (C3) and 4 (C4), which represent more atypical cases, demonstrated no concordance (0%) in the top 1 differential diagnosis.
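The scoring described above, that is, checking whether each vignette's final diagnosis appears within the top 1 or top 5 of the AI-generated list, and quantifying interrater agreement on the atypicality categories, can be illustrated with a minimal sketch. The case records, diagnosis strings, and rater values below are hypothetical placeholders, not the study data, and real use would additionally require clinical judgment to match synonymous diagnosis labels.

```python
# Minimal sketch (illustrative data only): top-k concordance between AI-generated
# differential lists and final diagnoses, plus interrater agreement on category.
from sklearn.metrics import cohen_kappa_score

# Each record holds the vignette's final diagnosis and the ranked "GAI differential
# diagnoses" (rank 1 first). These two cases are hypothetical examples.
cases = [
    {"final": "pulmonary embolism",
     "gai": ["pneumonia", "pulmonary embolism", "heart failure",
             "COPD exacerbation", "pleuritis"]},
    {"final": "appendicitis",
     "gai": ["gastroenteritis", "diverticulitis", "urolithiasis",
             "ovarian torsion", "constipation"]},
]

def top_k_concordance(records, k):
    """Fraction of cases whose final diagnosis appears in the top k AI suggestions."""
    hits = sum(rec["final"] in rec["gai"][:k] for rec in records)
    return hits / len(records)

print(f"Top-1 concordance: {top_k_concordance(cases, 1):.0%}")
print(f"Top-5 concordance: {top_k_concordance(cases, 5):.0%}")

# Interrater reliability for the atypicality categories (C1-C4) assigned by the
# two physicians; the rating vectors here are illustrative only.
rater_ts = [1, 2, 2, 3, 4, 1]
rater_yo = [1, 2, 3, 3, 4, 1]
print(f"Cohen's kappa: {cohen_kappa_score(rater_ts, rater_yo):.2f}")
```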
When the analysis was expanded to the top 5 differential diagnoses, the concordance rates varied across categories. C1 cases showed a marked increase in concordance to 67%, indicating better performance of the "GAI differential diagnoses" when a broader range of possibilities was considered. C2 cases had a concordance rate of 44%, followed by C3 cases at 25% and C4 cases at 17%.

Table 2. Concordance rates of AI-generated differential diagnoses by atypicality category, with Category 1 (C1) being closest to typical and Category 4 (C4) being most atypical.
To assess the diagnostic accuracy of ChatGPT across varying levels of atypical presentation, we employed chi-squared tests. Specifically, we compared the frequency of correct diagnoses in the top 1 and top 5 differential diagnoses provided by ChatGPT for cases categorized as C1+C2 (less atypical) versus C3+C4 (more atypical). For the top 1 differential diagnosis, there was no statistically significant difference in the number of correct diagnoses between the less atypical (C1+C2) and more atypical (C3+C4) groups (χ²₁=2.07; n=25; P=.13). However, when expanding the analysis to the top 5 differential diagnoses, we found a statistically significant difference, with the less atypical group (C1+C2) demonstrating a higher number of correct diagnoses compared with the more atypical group (C3+C4) (χ²₁=4.01; n=25; P=.048).
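A group comparison of this kind can be run as a chi-squared test on a 2×2 table of correct versus incorrect diagnoses for the two atypicality groups. The sketch below is illustrative only: the cell counts are hypothetical placeholders, and the study's exact counts and choice of continuity correction determine the reported χ² values of 2.07 and 4.01.

```python
# Minimal sketch (hypothetical counts): chi-squared test comparing correct vs.
# incorrect top-5 diagnoses for less atypical (C1+C2) versus more atypical (C3+C4)
# cases. The table below is a placeholder, not the study's actual cell counts.
from scipy.stats import chi2_contingency

# Rows: C1+C2, C3+C4. Columns: correct, incorrect (illustrative values, n=25).
top5_table = [[8, 7],
              [2, 8]]

chi2, p, dof, expected = chi2_contingency(top5_table, correction=False)
print(f"chi-squared({dof}, N=25) = {chi2:.2f}, P = {p:.3f}")
```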

Discussion
This study provides insightful data on the performance of ChatGPT-4 in diagnosing common diseases with atypical presentations. Our findings offer a nuanced view of the capacity of AI-driven differential diagnoses across varying levels of atypicality. In the analysis of the concordance rates between the "GAI differential diagnoses" and the "Final diagnosis," we observed a decrease in diagnostic accuracy as the degree of atypical presentation increased.
The performance of ChatGPT-4 in Category 1 (C1) cases, which are the closest to typical presentations, was moderately successful, with a concordance rate of 17% for the top 1 diagnosis and 67% within the top 5. This suggests that when the disease presentation closely aligns with the typical characteristics known to the model, ChatGPT-4 is relatively reliable at generating a differential diagnosis list that includes the final diagnosis. However, the utility of ChatGPT-4 appears to decrease as atypicality increases, as evidenced by the lower concordance rates in Category 2 (C2), and notably more so in Categories 3 (C3) and 4 (C4), where the concordance rates for the top 1 diagnosis fell to 0%. Similar challenges were observed in another 2024 study [26], in which the diagnostic accuracy of ChatGPT varied depending on the disease etiology, particularly in differentiating between central nervous system (CNS) and non-CNS tumors.
It is particularly revealing that in the more atypical presentations of common diseases (C3 and C4), the AI struggled to provide a correct diagnosis even within the top 5 differential diagnoses, with concordance rates of 25% and 17%, respectively. These categories highlight the current limitations of AI in medical diagnosis when faced with cases that deviate significantly from the established patterns within its training data [27].
By leveraging the comprehensive understanding and diagnostic capabilities of ChatGPT-4, this study aims to re-evaluate the significance of patient history in AI-assisted medical diagnosis and to contribute to optimizing diagnostic processes [28]. Our exploration of ChatGPT-4's performance in processing atypical disease presentations not only advances our understanding of AI's potential in medical diagnosis [23], but also underscores the importance of integrating advanced AI technologies with traditional diagnostic methodologies to enhance patient care and reduce diagnostic errors.
The contrast in performance between the C1 and C4 cases can be seen as indicative of the challenges AI systems currently face with complex clinical reasoning requiring pattern recognition. Atypical presentations can include uncommon symptoms, rare complications, or unexpected demographic characteristics, which may not be well represented in the datasets used to train AI systems [29]. Furthermore, these findings can inform the development of future versions of AI medical diagnosis systems and guide training curricula to include a broader spectrum of atypical presentations.
This study underscores the importance of the continued refinement of AI medical diagnosis systems, as highlighted by recent advances in AI technologies and their applications in medicine. Studies published in 2024 [30-32] provide evidence of the rapidly increasing capabilities of large language models (LLMs) like GPT-4 in various medical domains, including oncology, where AI is expected to significantly impact precision medicine [30]. The convergence of text and image processing, as seen in multimodal AI models, suggests a qualitative leap in AI's ability to process complex medical information, which is particularly relevant for our findings on AI-assisted medical diagnostics [30]. These developments reinforce the potential of AI tools like ChatGPT-4 in bridging the knowledge gap between machine learning developers and practitioners, and their role in simplifying complex data analyses in medical research and practice [31]. However, as these systems evolve, it is crucial to remain aware of their limitations and the need for rigorous verification processes to mitigate the risk of errors, which can have significant implications in clinical settings [32]. This aligns with our observation of decreased diagnostic accuracy in atypical presentations and the necessity for cautious integration of AI into clinical practice. It also points to the potential benefits of combining AI with human expertise to compensate for current AI limitations and enhance diagnostic accuracy [33].
Our research suggests that while AI, particularly ChatGPT-4, shows promise as a supplementary tool for medical diagnosis, reliance on this technology should be balanced with expert clinical judgment, especially in complex and atypical cases [28,29]. The observed concordance rate of 67% for C1 cases indicates that even when presentations are not extremely atypical, cases with potential pitfalls may yield an AI diagnostic accuracy lower than the 80%-90% estimated by existing studies [10,11]. This finding highlights the need for cautious integration of AI in clinical settings, acknowledging that its diagnostic capabilities, while robust, may still fall short in certain scenarios [34,35].

Limitations
Despite the strengths of our research, the study has certain limitations that must be noted when contextualizing our findings. First, the external validity of the results may be limited, as our dataset comprises only 25 clinical vignettes sourced from a special issue of the Journal of Generalist Medicine. While these vignettes were chosen for their relevance to the study's hypothesis on atypical presentations of common diseases, the size of the dataset and its origin from mock scenarios rather than real patient data may limit the generalizability of our findings. This sample size may not adequately capture the variability and complexities typically encountered in broader clinical practice, and thus might not be sufficient to firmly establish statistical generalizations. This limitation is compounded by the exclusion of pediatric vignettes, which narrows the demographic range of our findings and potentially reduces their applicability across diverse age groups.
Second, ChatGPT's current linguistic capabilities predominantly cater to English, presenting significant barriers to patient-provider interactions that may occur in other languages. This raises concerns about the potential for miscommunication and subsequent misdiagnosis in non-English medical consultations, and underscores the essential need for future AI models to exhibit a multilingual capacity that can grasp the subtleties inherent in various languages and dialects, as well as the cultural contexts within which they are used.
Finally, the diagnostic prioritization process of ChatGPT did not always align with clinical probabilities, potentially skewing the perceived effectiveness of the AI model. Additionally, our research utilized ChatGPT-4, a proprietary model whose architecture and training data are not publicly available. Consequently, the results obtained using ChatGPT-4 may not be directly generalizable to other large language models, especially open-source models such as Llama 3, which may have different underlying architectures and training datasets. This distinction is important to consider when extrapolating our study's findings to other AI systems. Moreover, because our study relied on clinical vignettes, which are mock scenarios, the potential for case-based bias is significant. The lack of real demographic diversity in these vignettes means that the findings may not accurately reflect social or regional nuances, such as ethnicity, disease prevalence, or cultural practices, that could influence diagnostic outcomes. This limitation suggests a need for careful consideration when applying these AI tools across different geographic and demographic contexts to ensure that findings are appropriately adapted to local populations, and it emphasizes the necessity of evaluating AI systems in diverse real-world settings to understand their effectiveness comprehensively and mitigate bias.

Future studies should not only refine AI's diagnostic reasoning but also explore the interpretability of its decision-making process, especially when errors occur. ChatGPT should be considered a supplementary tool in medical diagnosis rather than a standalone solution. This reinforces the necessity for combined expertise, where AI supports, but does not replace, human clinical judgment. Further research should expand these findings to a wider range of conditions, especially prevalent diseases with significant public health impacts, to thoroughly assess the practical utility and limitations of AI in medical diagnosis.

Conclusions
Our study contributes valuable evidence to the ongoing discourse on the role of AI in medical diagnosis. It provides a foundation for future research to explore the extent to which AI can be trained to recognize increasingly complex and atypical presentations, which is critical for its successful integration into clinical practice.
Figure 2. Categories of common disease with atypical presentation.

Table 1. List of answers and diagnoses provided by ChatGPT, with Category 1 (C1) being closest to typical and Category 4 (C4) being most atypical.
Final diagnosis: the final correct diagnosis listed in the Journal of Generalist Medicine clinical vignette as a common disease presenting with atypical symptoms.
GAI diagnosis rank: the rank of the highest-priority differential diagnosis generated by ChatGPT.