Feasibility of Multimodal Artificial Intelligence Using GPT-4 Vision for the Classification of Middle Ear Disease: Qualitative Study and Validation

Background: The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis.
Objective: In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane.
Methods: The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients’ data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images.
Results: The multimodal AI approach achieved an accuracy of 82.1%, which is superior to that of certified pediatricians at 70.6%, but trailing behind that of otolaryngologists at more than 95%. The model’s disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making.
Conclusions: Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. Further research is warranted to optimize and validate this approach in diverse clinical settings.



Introduction
The emergence of artificial intelligence (AI) has altered the landscape of medical technology, particularly in diagnosis, which leverages the identification of features based on imaging and physiological data [1][2][3]. In the field of otolaryngology, AI and deep learning models are being used for imaging; ongoing efforts focus on classifying diseases based on tympanic membrane images of middle ear disease [4][5][6]. Technological advancements, including deep learning and transfer learning using pre-trained models, have resulted in an accuracy range of 70-90% in models for analyzing otoscopic images [7]. There have also been advancements in its application, such as implementing smartphone-based point-of-care diagnostics [8]. However, these models rely on trained image data, require large image datasets, and do not consider patient information or clinical context. Consequently, the universality of these models is limited, and their optimal application in clinical practice remains unclear.
Recently, large-scale language-processing models have become available for general use. One such model, the Generative Pre-trained Transformer 4 (GPT-4), has demonstrated specialist-level medical knowledge through its language-processing abilities [9][10][11]. Since October 2023, GPT-4 Vision (GPT-4V) has gained the ability to evaluate image data, enabling a multimodal diagnostic approach that incorporates both language processing and image analysis [12]. GPT-4V enables the integration of patient information analysis and image-based deep learning models, providing valuable support in diagnosis and treatment, similar to decisions made in a clinical setting [13]. Multimodal AI, which bases diagnosis on multiple pieces of information, has been reported to be more effective than methods that rely on a single type of information. This is demonstrated in various medical applications, including the combination of pathology images with genomic information [14] and their utilization in liver cancer [15] and cervical cancer [16], where imaging information is integrated. In otorhinolaryngology, there have been few reports; however, efforts to incorporate AI for otoscopic images could further improve the quality of care.
In this study, we aimed to investigate the effectiveness of a multimodal approach using GPT-4V to diagnose middle ear disease. This approach was designed to integrate patient-specific data (age, sex, and chief complaint) with tympanic membrane images to assess the accuracy of the versatile GPT-4V. The model's accuracy was compared with physicians' diagnoses to validate its effectiveness in image-based deep learning. The potential future development of the multimodal AI approach for classifying middle ear diseases is also discussed.

Methods
GPT-4V has been available as an image recognition model since September 25, 2023. The study design was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images (Figure 1).

Collection of Otoscopic Images and Patient Information
This study included 305 otoscopic images of middle ear disease obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The endoscopes used were the Olympus ENF-VH and ENF-V3 (Olympus, Tokyo, Japan), and the video system was an Olympus VISERA ELITE OTV-S190. One image was obtained from each patient. We excluded images of poor quality, images in which multiple diseases were suspected, and images taken after otologic surgery. The remaining images were classified into four disease categories: acute otitis media (AOM), middle ear cholesteatoma (Chole), chronic otitis media (COM), and otitis media with effusion (OME). The final diagnoses were based on the judgment of the otolaryngologists who treated the patients. These images were accompanied by patient-specific information, such as age, sex, and chief complaint (e.g., fever, otalgia, otorrhea, ear fullness, deafness, facial palsy, dizziness, and tinnitus).

GPT-4V Settings and Prompt Tuning
The GPT-4V settings were established using prompts reported in previous studies [17,18]. Briefly, conditions and prompts for providing answers were verified using 10 images for each disease.
According to a report on prompts [19], image data and/or patient information were manually input into GPT-4V, and the generated results were evaluated by the physicians (N.M. and H.Y.).

Accuracy verification of GPT-4V using the optimal prompt model
The model with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images (37 AOM; 53 Chole [congenital, 6; acquired, 47]; 51 COM; and 49 OME), which were different from those used for tuning the prompts. To account for variability in the responses, each image was submitted three times, and a response given two or more times was taken as the final response.
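The two-out-of-three rule described above can be sketched as a small majority-vote helper. This is a minimal illustration, not the authors' procedure as implemented (the responses were entered and evaluated manually); the function name `majority_response` is hypothetical, and returning `None` when all three runs disagree is an assumption, since the text does not say how that case was handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_response(responses: Sequence[str]) -> Optional[str]:
    """Return the label given in at least two of three runs, else None."""
    if len(responses) != 3:
        raise ValueError("expected exactly three responses per image")
    label, count = Counter(responses).most_common(1)[0]
    # With three runs, a count of 2 or 3 is a majority.
    # A count of 1 means all three runs disagreed (assumed: no final answer).
    return label if count >= 2 else None


print(majority_response(["AOM", "OME", "AOM"]))
print(majority_response(["AOM", "OME", "COM"]))
```

The first call yields a final label because two runs agree; the second has no majority and yields `None`.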

Comparison of AI Accuracy with Physician Accuracy
To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images.
The web-based survey included tympanic membrane images and patient information (age, sex, and chief complaint) in a four-choice question format. The respondents included eight certified pediatricians, eight otolaryngology residents, eight certified otolaryngologists, and six experts in otolaryngology (more than 15 years of experience).
To show how the percentage of correct responses varied with question difficulty, the questions were divided into three levels (easy, normal, and hard) according to the overall percentage of correct responses by physicians. The percentage of correct responses for each level, and for each question, was then compared between GPT-4V and all doctors, otolaryngologists, and pediatricians.
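A stratification of this kind could be sketched as follows. The cut-off values `EASY_MIN` and `NORMAL_MIN` are purely hypothetical, because the text does not state the thresholds used to separate easy, normal, and hard questions; the function names are likewise illustrative.

```python
from typing import Dict, List

# Hypothetical cut-offs: the paper does not report the thresholds it used.
EASY_MIN = 0.95
NORMAL_MIN = 0.80


def difficulty_level(physician_correct_rate: float) -> str:
    """Map the overall physician correct-response rate of one question to a level."""
    if not 0.0 <= physician_correct_rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if physician_correct_rate >= EASY_MIN:
        return "easy"
    if physician_correct_rate >= NORMAL_MIN:
        return "normal"
    return "hard"


def stratify(rates: Dict[str, float]) -> Dict[str, List[str]]:
    """Group question IDs by difficulty level."""
    levels: Dict[str, List[str]] = {"easy": [], "normal": [], "hard": []}
    for question_id, rate in rates.items():
        levels[difficulty_level(rate)].append(question_id)
    return levels


print(stratify({"q1": 0.97, "q2": 0.85, "q3": 0.50}))
```

Once questions are grouped this way, the correct-response rate of GPT-4V and of each physician group can be computed within each level.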

Ethical Statement
Patient information was anonymized to protect privacy and used only with the approval of the Ethics Committee of the Shinshu University School of Medicine (approval number: 6088).

Statistical Analysis
Groups were compared by one-way analysis of variance (ANOVA). Subsequently, multiple comparison tests (the Bonferroni method) were used to compare groups. Statistical significance was set at p < 0.05. A one-sample proportion test was used to compare the performance of physicians with that of GPT-4V in terms of the correct response rate.
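For readers unfamiliar with these tests, the following minimal sketch computes a one-way ANOVA F statistic and a Bonferroni-adjusted significance threshold using only the standard library. It is an illustration of the textbook formulas, not the authors' analysis code; the function names and the toy data are assumptions, and the sketch stops at the F statistic rather than deriving a p value.

```python
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic: between-group mean square divided by within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n > k")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Degrees of freedom: k - 1 between groups, n_total - k within groups.
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    if n_comparisons < 1:
        raise ValueError("need at least one comparison")
    return alpha / n_comparisons


# Toy accuracy scores for three hypothetical groups (not study data).
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
# Four groups give six pairwise comparisons.
print(bonferroni_alpha(0.05, 6))
```

With four physician groups there are six pairwise comparisons, so under Bonferroni each comparison would be tested against 0.05/6 rather than 0.05.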

Establishment of optimal prompts
In the initial stage, we sought an optimal input method using 10 images for each disease (AOM, Chole, COM, or OME; 40 images total). First, we input only images or options; GPT-4V mostly asked for clinical information, such as patient history and symptoms, and generated no response regarding the disease (Figure 1 and Multimedia Appendix 1). Second, the names of the four diseases were added as candidate answers, but again, no response regarding the disease was generated. When detailed patient information, such as age, sex, and main symptoms, was input, GPT-4V provided answers, indicating that images combined with patient data were the optimal prompt for testing the accuracy of GPT-4V.
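The text part of such a query, combining patient data with the four candidate diagnoses, might be assembled as in the sketch below. The wording of the prompt is hypothetical and is not the authors' optimized prompt; the function `build_prompt` and the constant `DISEASE_OPTIONS` are illustrative names, and the otoscopic image is assumed to be attached separately, as in the manual input described above.

```python
from typing import Sequence

# The four candidate diagnoses used in the study.
DISEASE_OPTIONS = (
    "acute otitis media",
    "middle ear cholesteatoma",
    "chronic otitis media",
    "otitis media with effusion",
)


def build_prompt(age: int, sex: str, chief_complaints: Sequence[str]) -> str:
    """Assemble the text accompanying one tympanic membrane image (hypothetical wording)."""
    options = "\n".join(
        f"{i}. {name}" for i, name in enumerate(DISEASE_OPTIONS, start=1)
    )
    complaints = ", ".join(chief_complaints) if chief_complaints else "none reported"
    return (
        "The attached image shows a tympanic membrane.\n"
        f"Patient: age {age}, sex {sex}, chief complaint: {complaints}.\n"
        "Choose the single most likely diagnosis from these options:\n"
        f"{options}"
    )


print(build_prompt(5, "male", ["fever", "otalgia"]))
```

Keeping the patient fields as explicit parameters makes it easy to test prompt variants, such as omitting the options or the patient data, in the way the tuning phase did.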

Accuracy validation of the multimodal AI approach
The performance of the multimodal AI approach in this study for classifying middle ear diseases was validated, with an overall diagnostic accuracy of 82.1% for the GPT-4V-based analysis. Disease-specific accuracy rates were 89.2% for AOM, 76.5% for COM, 79.3% for Chole, and 85.7% for OME. These results indicate high discrimination among various disease types; however, there were also some incorrect responses. Representative images of correct and incorrect GPT-4V classifications for each disease are shown in Figure 3.
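As a consistency check, the overall accuracy should equal the image-count-weighted mean of the disease-specific accuracies. The calculation below uses only the per-disease image counts given in the Methods (37 AOM, 53 Chole, 51 COM, and 49 OME) and the disease-specific rates reported in the abstract; the symbols $n_d$ (number of images of disease $d$) and $a_d$ (accuracy for disease $d$) are notation introduced here for illustration.

```latex
\[
\text{Accuracy}_{\text{overall}}
  = \frac{\sum_{d} n_d \, a_d}{\sum_{d} n_d}
  = \frac{37(0.892) + 53(0.793) + 51(0.765) + 49(0.857)}{190}
  \approx \frac{156.0}{190}
  \approx 0.821
\]
```

The weighted mean reproduces the reported overall accuracy of 82.1%.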

Comparison of diagnostic accuracy by physicians and GPT-4V
The same images with patients' information used by GPT-4V were evaluated by pediatricians (n = 8), otolaryngology residents (n = 8), certificated otolaryngologists (n = 8), and experts in otolaryngology (n = 6), and the diagnostic accuracy of each group was compared. The mean diagnostic accuracy was 70.6% (standard error: 4.2%) for pediatricians, 95.5% (standard error: 1.0%) for otolaryngology residents, 97.3% (standard error: 0.8%) for certificated otolaryngologists, and 98.2% (standard error: 0.4%) for experts in otolaryngology. ANOVA revealed significant differences among the four groups (F = 13.43, p < 0.001). In the post hoc comparison, a significant difference was observed between pediatricians and the other three groups (p < 0.001). The GPT-4V correct response rate was 82.1%, surpassing that of pediatricians by 11.5% and trailing behind otolaryngologists by an average of just over 10% (Figure 4). In the confusion matrix of all doctors, there was a notable tendency to misclassify Chole as OME and AOM as OME. Among pediatricians, there were more errors in classifying Chole as AOM or COM (Figure 5).
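A confusion matrix of the kind referred to above can be tabulated from paired true and answered labels, as in this minimal sketch. The data are toy values, not study responses; the function name is hypothetical, and normalizing each row by the number of images of that true disease is an assumption about presentation (the figure caption describes percentages of total responses).

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("AOM", "Chole", "COM", "OME")


def confusion_percentages(
    truth: Sequence[str], answers: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Row-normalized confusion matrix: rows are true diseases, columns are answers."""
    if len(truth) != len(answers):
        raise ValueError("truth and answers must have the same length")
    counts = Counter(zip(truth, answers))
    matrix: Dict[str, Dict[str, float]] = {}
    for true_label in LABELS:
        # Answers outside LABELS are ignored when computing the row total.
        row_total = sum(counts[(true_label, a)] for a in LABELS)
        matrix[true_label] = {
            a: (100.0 * counts[(true_label, a)] / row_total if row_total else 0.0)
            for a in LABELS
        }
    return matrix


# Toy example (not study data).
truth = ["Chole", "Chole", "AOM", "AOM", "AOM", "OME"]
answers = ["Chole", "OME", "AOM", "AOM", "OME", "OME"]
matrix = confusion_percentages(truth, answers)
print(matrix["Chole"]["OME"])
```

Off-diagonal cells such as `matrix["Chole"]["OME"]` show which diseases tend to be mistaken for which, which is how misclassification tendencies like Chole answered as OME become visible.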
Regarding the difference between GPT-4V and physicians in the trend of the percentage of correct answers according to question difficulty, the percentage of correct answers for GPT-4V also tended to decrease gradually, from 85.7% for easy to 84.0% for normal and 71.1% for hard questions (Table 1). Furthermore, compared with otolaryngologists, GPT-4V had a significantly lower percentage of correct answers at every difficulty level (otolaryngologists: 99.7% for easy, 97.1% for normal, and 90.8% for hard questions; all P < 0.001). In contrast, the results of the "hard" and "normal" groups were similar. Compared with pediatricians, GPT-4V had a lower percentage of correct answers for easy questions (pediatricians: 96.6%; P = 0.006). However, GPT-4V had a higher percentage of correct answers for normal (pediatricians: 76.3%; P = 0.07) and hard questions (pediatricians: 45.4%; P < 0.001).
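The one-sample proportion test behind comparisons like these can be sketched with the normal approximation, as below. This is an assumption-laden illustration: the text does not state whether an exact or approximate test was used, nor which rate served as the reference proportion, and the function name and toy numbers are hypothetical.

```python
import math
from typing import Tuple


def one_sample_proportion_z(successes: int, n: int, p0: float) -> Tuple[float, float]:
    """Z statistic and two-sided p value for H0: true proportion equals p0."""
    if n <= 0 or not 0 <= successes <= n or not 0.0 < p0 < 1.0:
        raise ValueError("invalid counts or reference proportion")
    p_hat = successes / n
    # Standard error under the null hypothesis.
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Toy example (not study data): 60 correct out of 100 against a reference rate of 0.5.
z, p = one_sample_proportion_z(60, 100, 0.5)
print(round(z, 2), round(p, 4))
```

For small samples or proportions near 0 or 1, an exact binomial test would be a more cautious choice than this approximation.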

Principal Results
In this study, we assessed the accuracy of the GPT-4V multimodal AI approach in classifying middle ear disorders, yielding the following three key findings: (1) GPT-4V, a general-purpose model focusing on large-scale language models, achieved approximately 80% accuracy in classifying middle ear disease. The model's performance, evaluated using images and patient data, was superior to that of non-otolaryngologists, although it was lower than the average accuracy of otolaryngologists.
(2) The GPT-4V was able to classify diseases when patient information and disease options were input. Further improvements in accuracy could be achieved with more detailed patient information. (3) Accuracy varied by disease, suggesting the potential for optimizing AI usage and improving accuracy by understanding the specificity of GPT-4V in classifying particular diseases.

Comparison with Prior Work
The GPT-4V model has undergone training and utilizes zero-shot learning, which recognizes image features based on natural language to classify diseases based on image information and previously learned disease features [20]. GPT-4V can yield effective results with fewer resources than previous deep learning models, which typically require a large amount of image data, computational resources, time, and parameter adjustments for training. By inputting new information rather than simply classifying image data, it becomes possible to tailor diagnoses and diagnostic aids for each individual. Furthermore, GPT-4V and other large-scale language-processing models require prompts developed for the intended purpose, since the accuracy of such models varies with prompt adjustments.
Compared with physicians' accuracy, the model's performance in this study was higher than that of pediatricians but lower than that of otolaryngologists. In a previous comparison between deep learning and humans, Crowson et al. [21] classified 22 tympanic membrane images and found that the deep learning model achieved an accuracy of 95.5%, compared with an accuracy of 65% for 39 clinicians. Suresh et al. [22] also reported that a machine-learning model created from 1,000 images was more effective than pediatricians, with an accuracy rate of 90.6%, surpassing the clinicians' accuracy of 59.4%. Our results indicated that the model did not reach the proficiency level of otolaryngologists; however, it could be valuable for utilizing tympanic membrane images in medical practice outside of otolaryngology. In particular, GPT-4V judgments predominantly exceeded pediatricians' correct response rates for questions with normal to hard difficulty, suggesting that the present model may be useful for non-otolaryngologists who have difficulty in making such judgments. Moreover, previous reports on deep learning classification models have determined the presence or absence of inflammation and exudates based on photographs alone. Further studies are needed to identify the optimal stage in the examination for implementing the image classification model and the subsequent policy decisions that should follow. GPT-4V allows for the classification of diseases using patient information. While comments about medical or harmful content (with restrictions on medical advice) may result in a lower correct response rate, informative or educational responses are still possible if they are well-informed.
Efforts have been made to use large language models (LLMs) to improve the accuracy of prompts. Therefore, it is possible to develop appropriate prompts for medical imaging and middle ear disorders. The accuracy of the LLM is expected to further improve with the development of prompts that are specifically tailored for medical imaging and middle ear disease [23,24].
For the clinical application of the GPT-4V model, collecting clinical data and adjusting parameters are needed to further improve its diagnostic accuracy for each middle ear disease. Upon reviewing the incorrect responses of GPT-4V for each disease, we found that Chole might demonstrate a retraction pocket, which may be mistaken for a perforation. However, images with keratin debris accumulation in the retraction pocket were less prone to misclassification. In cases of COM with calcification, a white lesion was considered to be Chole calcification, emphasizing the importance of distinguishing between these two diseases. AOM cases without the chief complaint of acute inflammation (fever, ear pain, or ear discharge) were occasionally misclassified, even with characteristic findings such as a bulging tympanic membrane, suggesting that GPT-4V was likely to prioritize patients' information over images. In OME cases, a white lesion was sometimes considered to be a pearly tumor (Chole) or tympanic membrane perforation (COM), particularly when it involved a small amount of effusion and/or air. For physicians, Chole and AOM were often misidentified as other diseases and OME, respectively. When comparing the GPT-4V model with the entire group of physicians, the percentage of correct responses was generally higher among the physicians. However, the GPT-4V diagnostic accuracy for Chole was higher than that of pediatricians, indicating that GPT-4V could help non-otolaryngologists diagnose Chole. In a previous report, a dedicated AI model had a diagnostic accuracy of approximately 90% for Chole [25]; therefore, the combination of such a system and GPT-4V would be useful to improve the accuracy of Chole detection.
As demonstrated in this study, the application of AI, including LLM, is believed to offer advantages in terms of improving efficiency and providing assistance in clinical work, enabling the delivery of high-quality medical care, and overcoming language barriers in medicine. The use of GPT-4V has already been reported to diagnose complicated cases [26], and its application can be expanded by integrating it with imaging information. In the field of orthopedics, trials are underway to determine treatment methods based on MRI reports [27], showcasing the effectiveness of GPT-4V as an aid in image interpretation. GPT has been shown to return answers and provide details about the disease, including risk factors and treatment methods. This allows for the evaluation of images alone and assists in medical treatment. Such insights are valuable for understanding the practical use and challenges of AI in real-world applications. Unlike the simplistic deep learning models of the past, the LLM can enhance accuracy by presenting evidence for judgments and asking a series of questions. When used by physicians with a certain level of specialized knowledge, the LLM effectively aids judgment, leading to increased efficiency in medical care. GPT-4V provides answers in just a few seconds, which is significantly shorter than the time it takes a physician to provide a diagnosis, thereby confirming its efficiency. GPT-4V can be used on smartphones, potentially making medical treatment more location-independent. However, there are associated risks, including the reliance on AI for medical care, misdiagnoses due to system malfunctions, and patient information leakage. ChatGPT is trained based on information up to a certain period but may respond differently at different times or provide answers using outdated criteria. Furthermore, legal and personal literacy measures must be developed to protect personal information and address ethical concerns. Foreign countries and the United Nations are actively promoting laws and regulations governing the use of AI [28,29].

Limitations
One limitation of this study is the use of a limited number of images (n=190). Further analysis is required to assess the impact of using a larger dataset that encompasses various diseases. Additionally, as there are large variations in the quality of otoscopic images, accurate diagnosis might be challenging in some cases.
The recognition and content of the answers may change depending on the doctor, clinics, and designed prompt; the accuracy may also change due to changes in the image quality used or the method used to capture the image. While this is common to deep learning, the advantage of GPT, which does not require prior training, is that it is not affected by the data to be trained; thus, the possibility of such changes is considered to be small. For these reasons, further exploration is needed on strategies for handling challenging images and facilitating open-ended responses without giving predefined options. Furthermore, because of the rapid pace of technological evolution, it is essential to regularly fine-tune and make a standalone model that ensures reliability and consistency over time.

Conclusions
A multimodal AI approach using GPT-4V has revealed a potential new diagnostic approach for classifying middle-ear diseases. This confirms the ability of AI to assist in clinical diagnosis and identify disease-specific features. The significant improvement in accuracy compared with conventional deep learning models indicates that even general-purpose AI technology can assist in medical treatment with a certain level of accuracy. It can be applied to highly specialized diagnoses, depending on the method. Further improvements in diagnostic accuracy are expected in future studies by integrating more diverse data types.
Confusion matrix of doctors (pediatricians, otolaryngology residents, certificated otolaryngologists, and experts in otolaryngology) for classifying four middle ear diseases. A: Confusion matrix of all doctors (n=30). The average (percentage of total responses) is shown. B: Confusion matrix of doctors in each group: pediatricians (n=8), otolaryngology residents (n=8), certificated otolaryngologists (n=8), and experts in otolaryngology (n=6). The averages of each group (percentage of total responses) are shown. AOM, acute otitis media; Chole, cholesteatoma; COM, chronic otitis media; OME, otitis media with effusion.
Representative image and prompt of this study. A: Representative image of input and output to GPT-4V. Text and images can be combined in the input to obtain an output. B: Example of changing the prompt content and an output that asks for patient information. By presenting a concept as ORDER and adding conditions as Restriction, we attempted to develop appropriate prompts. The output requests patient information such as age, medical history, and chief complaint. C: An example of an answer with an optimized prompt. The answer presents the diagnosis, the rationale for the diagnosis, and treatment and prevention methods.

Table 1. Comparison of the scores by GPT-4V and human validation with physicians across various difficulty levels (N = 190).
a Statistically significant.