Large Language Models for Mental Health: A Systematic Review

Large language models (LLMs) have attracted significant attention for potential applications in digital health, while their application in mental health is subject to ongoing debate. This systematic review aims to evaluate the usage of LLMs in mental health, focusing on their strengths and limitations in early screening, digital interventions, and clinical applications. Adhering to PRISMA guidelines, we searched PubMed, IEEE Xplore, Scopus, JMIR, and ACM using the keywords 'mental health OR mental illness OR mental disorder OR psychiatry' AND 'large language models'. We included articles published between January 1, 2017, and April 30, 2024, excluding non-English articles. A total of 40 articles were evaluated, comprising research on detection of mental health conditions and suicidal ideation through text (n=15), usage of LLMs for mental health conversational agents (CAs) (n=7), and other applications and evaluations of LLMs in mental health (n=18). LLMs exhibit substantial effectiveness in detecting mental health issues and providing accessible, de-stigmatized eHealth services. However, the risks currently associated with their clinical use may outweigh the benefits. The study identifies several significant issues: the lack of multilingual datasets annotated by experts, concerns about the accuracy and reliability of generated content, challenges in interpretability due to the 'black box' nature of LLMs, and persistent ethical dilemmas. These include the absence of a clear ethical framework, concerns about data privacy, and the potential for over-reliance on LLMs by both therapists and patients, which could compromise traditional medical practice. Despite these issues, the rapid development of LLMs underscores their potential as new clinical aids, emphasizing the need for continued research and development in this area.


Mental Health
Mental health, a critical component of overall well-being, is at the forefront of global health challenges [1]. In 2019, an estimated 970 million individuals worldwide suffered from mental illness, accounting for 12.5% of the global population [2]. Anxiety and depression are among the most prevalent psychological conditions, affecting 301 million and 280 million individuals respectively [2]. Additionally, 40 million people were afflicted with bipolar disorder, 24 million with schizophrenia, and 14 million experienced eating disorders [3]. These mental disorders collectively contribute to an estimated USD 5 trillion in global economic losses annually [4]. Despite the staggering prevalence, many cases remain undetected or untreated, with the resources allocated to the diagnosis and treatment of mental illness falling far short of the burden these conditions place on society [5]. Globally, untreated mental illnesses account for 5% in high-income countries and 19% in low- and middle-income countries [3]. The COVID-19 pandemic has further exacerbated the challenges faced by mental health services worldwide [6], as the demand for these services increased while access decreased [7]. This escalating crisis underscores the urgent need for more innovative and accessible mental health care approaches.
Mental illness treatment encompasses a range of modalities including medication, psychotherapy, support groups, hospitalization, and complementary and alternative medicine [8]. However, societal stigma attached to mental illnesses often deters people from seeking appropriate care [9]. Many people with mental illness avoid or delay psychotherapy [10], influenced by fears of judgment and concerns over costly, ineffective treatments [11]. The COVID-19 crisis and other global pandemics have underscored the importance of digital tools, such as telemedicine and mobile apps, in delivering care during critical times [12]. In this evolving context, LLMs present new possibilities for enhancing the delivery and effectiveness of mental health care.
Recent technological advancements have revealed some unique advantages of LLMs in mental health. These models, capable of processing and generating text akin to human communication, provide accessible support directly to users [13]. A study analyzing 2,917 Reddit user reviews found that CAs powered by LLMs are valued for their non-judgmental listening and effective problem-solving advice. This aspect is particularly beneficial for socially marginalized individuals, as it enables them to be heard and understood without the need for direct social interaction [14]. Moreover, LLMs enhance the accessibility of mental health services, which are notably undersupplied globally [15]. Recent data reveal substantial delays in traditional mental health care delivery: 23% of individuals with mental illnesses report waiting over 12 weeks for face-to-face psychotherapy sessions, with 12% waiting more than six months and 6% over a year [16]. In addition, 43% of adults with mental illness indicate that such long waits have exacerbated their conditions [15].
Telemedicine, enhanced by LLMs, offers a practical alternative that expedites service delivery and could flatten traditional healthcare hierarchies [17]. This includes real-time counseling sessions through CAs that are not only cost-effective but also accessible anytime and from any location. By reducing the reliance on physical visits to traditional healthcare settings, telemedicine has the potential to decentralize access to medical expertise and diminish the hierarchical structures within the healthcare system [17]. Mental health chatbots built on language models, such as Woebot [18] and Wysa [19], have been gaining recognition. Both chatbots follow Cognitive Behavioural Therapy principles and are designed to equip users with self-help tools for managing their mental health issues [20]. In clinical practice, LLMs hold the potential to support the automatic assessment of therapists' adherence to evidence-based practices and the development of systems that offer real-time feedback and support for patient homework between sessions [21]. These models also have the potential to provide feedback on psychotherapy or peer support sessions, which is especially beneficial for clinicians with less training and experience [21]. Currently, these applications are still at the proposal stage. Although promising, they are not yet widely used in routine clinical settings, and further evaluation of their feasibility and effectiveness is necessary.
The deployment of LLMs in mental health also poses several risks, particularly for vulnerable groups. Challenges such as inconsistencies in the content generated and the production of 'hallucinatory' content may mislead or harm users [22], raising serious ethical concerns. In response, authorities like the World Health Organization (WHO) have developed ethical guidelines for Artificial Intelligence (AI) research in healthcare, emphasizing the importance of data privacy, human oversight, and the principle that AI tools should augment, rather than replace, human practitioners [23]. These potential problems with LLMs in healthcare have gained considerable industry attention, underscoring the need for a comprehensive and responsible evaluation of LLMs' applications in mental health. The following sections further explore the workings of LLMs and their potential applications in mental health, and critically evaluate the opportunities and challenges they introduce.

Large Language Models
LLMs represent advancements in machine learning (ML), characterized by their ability to understand and generate human-like text with high accuracy [24]. The efficacy of these models is typically evaluated using benchmarks designed to assess their linguistic fidelity and contextual relevance. Common metrics include BLEU for translation accuracy and ROUGE for summarization tasks [25]. LLMs are characterized by their scale, often encompassing billions of parameters, setting them apart from traditional language models [26]. This breakthrough is largely due to the Transformer architecture, a deep neural network structure that employs a 'self-attention' mechanism, developed by Vaswani et al. in 2017. This allows LLMs to process information in parallel rather than sequentially, greatly enhancing speed and contextual understanding [27]. To clearly define the scope of this study concerning LLMs, we specify that an LLM must utilize the Transformer architecture and contain a high number of parameters, traditionally at least one billion, to qualify as 'large' [28]. This criterion encompasses models such as Generative Pre-trained Transformers (GPT) and Bidirectional Encoder Representations from Transformers (BERT). Although the standard BERT model, with only 0.34 billion parameters [29], does not meet the traditional criteria for 'large', its sophisticated bidirectional design and pivotal role in establishing new natural language processing (NLP) benchmarks justify its inclusion among notable LLMs [30]. The introduction of ChatGPT in 2022 generated substantial public and academic interest in LLMs, underlining their transformative potential within the field of AI [31]. Other state-of-the-art LLMs include LLaMA and PaLM, as illustrated in Figure 1.
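To make the 'self-attention' idea concrete, the following is a minimal NumPy sketch of the scaled dot-product attention operation described by Vaswani et al.; the toy dimensions, random inputs, and single attention head are illustrative assumptions rather than the configuration of any model reviewed here.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single head (the core Transformer operation)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project each token into query, key, value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # similarity of every token with every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights over the whole sequence
    return weights @ V                               # each output token mixes information from all tokens

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                              # toy sizes; production LLMs use far larger dimensions
X = rng.normal(size=(seq_len, d_model))              # embeddings for a 4-token input, processed in parallel
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (4, 8): one contextualised vector per input token
```

Because the attention weights are computed for all token pairs at once, the whole sequence is processed in parallel rather than token by token, which is the speed and context advantage noted above.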
LLMs are primarily designed to learn fundamental statistical patterns of language [36]. Initially, these models were used as the basis for fine-tuning task-specific models rather than training those models from scratch, offering a more resource-efficient approach [37]. This fine-tuning process involves adjusting a pre-trained model to a specific task by further training it on a smaller, task-specific dataset [38]. However, developments in larger and more sophisticated models have reduced the need for extensive fine-tuning in some cases. Notably, some advanced LLMs can now effectively understand and execute tasks specified through natural language prompts without extensive task-specific fine-tuning [39]. Instruction fine-tuned models undergo additional training on pairs of user requests and appropriate responses. This training allows them to generalize across various complex tasks, such as sentiment analysis, which previously required explicit fine-tuning by researchers or developers [40]. A key part of the input to these models, like ChatGPT and Gemini, includes a system prompt, often hidden from the user, which guides the model on how to interpret and respond to user prompts. For example, it might direct the model to act as a helpful mental health assistant. Additionally, 'prompt engineering' has emerged as a crucial technique for optimizing model performance. Prompt engineering involves crafting input texts that guide the model to produce the desired output without additional training. For example, refining a prompt from 'Tell me about current events in healthcare' to 'Summarize today's top news stories about technology in healthcare' provides the model with more specific guidance, which can enhance the relevance and accuracy of its responses [41]. While prompt engineering can be highly effective and reduce the need to retrain the model, it is important to be wary of 'hallucinations', a phenomenon where models confidently generate incorrect or irrelevant outputs [42]. This can be particularly challenging in high-accuracy scenarios, such as healthcare and medical applications [43-46]. Thus, while prompt engineering reduces the reliance on extensive fine-tuning, it underscores the need for thorough evaluation and testing to ensure the reliability of model outputs in sensitive applications.
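As a concrete illustration of the system-prompt and prompt-engineering ideas above, the snippet below shows how a hidden system instruction and a refined user prompt might be assembled into the input of a chat-style LLM; the message format and the commented-out API call are assumptions based on common provider conventions, not the setup of any study reviewed here.

```python
# Illustrative only: combining a hidden system prompt with an engineered user prompt.
system_prompt = (
    "You are a helpful mental health assistant. Respond supportively, avoid giving "
    "diagnoses, and direct users to professional help for urgent concerns."
)

vague_prompt = "Tell me about current events in healthcare"
engineered_prompt = "Summarize today's top news stories about technology in healthcare"

messages = [
    {"role": "system", "content": system_prompt},    # hidden instructions that steer model behaviour
    {"role": "user", "content": engineered_prompt},  # more specific prompt -> more relevant output
]

# A provider client would then be called with these messages, for example (hypothetical):
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(messages)
```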
The existing literature includes a review of the application of ML and NLP in mental health [47], analyses of LLMs in medicine [32], and a scoping review of LLMs in mental health [48]. These studies have demonstrated the effectiveness of NLP for tasks such as text categorization and sentiment analysis [47] and provided a broad overview of LLM applications in mental health [48]. However, a gap remains in systematically reviewing state-of-the-art LLMs in mental health, particularly in the comprehensive assessment of literature published since the introduction of the Transformer architecture in 2017. This systematic review addresses these gaps by providing a more in-depth analysis, evaluating the quality and applicability of studies, and exploring ethical challenges specific to LLMs, such as data privacy, interpretability, and clinical integration. Unlike previous reviews, this study excludes preprints, follows a rigorous search strategy with clear inclusion and exclusion criteria (e.g., using Cohen's kappa to assess inter-reviewer agreement), and employs a detailed assessment of study quality and bias (e.g., using the Risk of Bias 2 tool) to ensure the reliability and reproducibility of the findings.
Guided by specific research questions, this systematic review critically assesses the use of LLMs in mental health, focusing on their applicability and efficacy in early screening, digital interventions, and clinical settings, as well as the methodologies and data sources employed. Our findings highlight the potential of LLMs in enhancing mental health diagnostics and interventions, while also identifying key challenges, such as inconsistencies in model outputs and the lack of robust ethical guidelines. These insights suggest that, while LLMs hold promise, their use should be supervised by physicians, and they are not yet ready for widespread clinical implementation.

Methods
This systematic review followed the Preferred Reporting Items for Systematic Review and Meta-analysis (PRISMA) guidelines [49]. The protocol was registered on PROSPERO under the ID CRD42024508617.

Search Strategies
The search was initiated on August 3, 2024, and completed on August 6, 2024, by one author (ZG). This author systematically searched five databases: MEDLINE, IEEE Xplore, Scopus, JMIR, and ACM, using the following search keywords: (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). These keywords were consistently applied across each database to ensure a uniform search strategy. To conduct a comprehensive and precise search for relevant literature, strategies were tailored for different databases. 'All Metadata' was searched in MEDLINE and IEEE Xplore, while the search in Scopus was confined to titles, abstracts, and keywords. The JMIR database utilized the 'Criteria Exact Match' feature to refine search results and enhance precision. In the ACM database, the search focused on 'Full text'. The screening of all citations involved four steps: 1) Initial Search: All relevant citations were imported into a Zotero citation manager library.
2) Preliminary Inclusion: Citations were initially screened based on predefined inclusion criteria.
3) Duplicate Removal: Citations were consolidated into a single group, from which duplicates were eliminated.

4) Final Inclusion: The remaining references were carefully evaluated against the inclusion criteria to determine their suitability.

Study Selection and Eligibility Criteria
All the articles that matched our search criteria were double-screened by two independent reviewers (ZG, KL) to ensure each article fell within the scope of LLMs in mental health. This process involved the removal of duplicates followed by a detailed manual evaluation of each article to confirm adherence to our predefined inclusion criteria, ensuring a comprehensive and focused review. To quantify the agreement level between the reviewers and ensure objectivity, inter-rater reliability was calculated using Cohen's kappa, with a score of 0.84 indicating a good level of agreement. In instances of disagreement, a third reviewer (AL) was consulted to achieve consensus.
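For readers unfamiliar with the statistic, the short sketch below shows how Cohen's kappa corrects raw agreement for agreement expected by chance; the include/exclude decisions are invented for illustration and do not reproduce the screening data of this review.

```python
# Hedged illustration of Cohen's kappa for two independent screening reviewers;
# the decision lists below are made up and yield a kappa near (not equal to) 0.84.
def cohens_kappa(a, b):
    """a, b: parallel lists of include/exclude decisions from two reviewers."""
    assert len(a) == len(b)
    n = len(a)
    labels = sorted(set(a) | set(b))
    p_observed = sum(x == y for x, y in zip(a, b)) / n                       # raw agreement
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)    # chance agreement
    return (p_observed - p_expected) / (1 - p_expected)

reviewer_1 = ["include"] * 30 + ["exclude"] * 70
reviewer_2 = ["include"] * 27 + ["exclude"] * 3 + ["exclude"] * 65 + ["include"] * 5
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # ~0.81 for these invented decisions
```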
To assess the risk of bias, we utilized the Risk of Bias 2 tool, as recommended for Cochrane Reviews. The results have been visualized in Multimedia Appendix 1. We thoroughly examined each study for potential biases that could impact the validity of the results. These included biases from the randomization process, deviations from intended interventions, missing outcome data, inaccuracies in outcome measurement, and selective reporting of results. This comprehensive assessment supports the credibility of the included studies.
The criteria for selecting articles were as follows: we limited our search to English-language publications, focusing on articles published between January 1, 2017, and April 30, 2024. This timeframe was chosen considering the significant developments in the field of LLMs in 2017, marked notably by the introduction of the Transformer architecture, which has greatly influenced academic and public interest in this area.
In this review, original research articles with available full text were carefully selected, focusing on the application of LLMs in mental health. To comply with PRISMA guidelines, articles that have not been published in a peer-reviewed venue, including those only available on a preprint server, were excluded. Due to the limited literature specifically addressing the mental health applications of LLMs, we included review articles to ensure a comprehensive perspective. Our selection criteria focused on direct applications, expert evaluations, and ethical considerations related to the use of LLMs in mental health contexts, with the goal of providing a thorough analysis of this rapidly developing field.

Information Extraction
The data extraction process was jointly conducted by two reviewers (ZG, KL), focusing on examining the application scenarios, model architecture, data sources, methodologies used, and main outcomes from selected studies on LLMs in mental health.
Initially, we categorized each study to determine its main objectives and applications. The categorization process was conducted in two steps. First, after reviewing all the included articles, we grouped them into three primary categories: detection of mental health conditions and suicidal ideation through text, usage of LLMs for mental health CAs, and other applications and evaluations of LLMs in mental health. In the second step, we performed a more detailed categorization. After a thorough, in-depth reading of each article within these broad categories, we refined the classifications based on the specific goals of the studies. Following this, we summarized the main model architectures of the LLMs used and conducted a thorough examination of data sources, covering both public and private datasets. We noted that some review articles lacked detail on dataset content, and therefore we focused on providing comprehensive information on public datasets, including their origins and sample sizes. We also investigated the various methods employed across different studies, including data collection strategies and analytical methodologies. We examined their comparative structures and statistical techniques to offer a clear understanding of how these methods are applied in practice.
Finally, we documented the main outcomes of each study, recording significant results and aligning them with relevant performance metrics and evaluation criteria. The synthesis of information was conducted using a narrative approach, in which we integrated and compared results across different studies to highlight the efficacy and impact of LLMs in mental health, providing quantitative data where applicable to underscore these findings. The results of our analysis are presented in three tables, each corresponding to one of the primary categories.

Strategy and Screening Process
The PRISMA diagram of the systematic screening process can be seen in Figure 2. Our initial search across five academic databases (MEDLINE, IEEE Xplore, Scopus, JMIR, and ACM) yielded 14,265 papers: 907 from MEDLINE, 102 from IEEE Xplore, 204 from Scopus, 211 from JMIR, and 12,841 from ACM. After removal of duplicates, 13,967 unique papers were retained. Subsequent screening, based on predefined inclusion and exclusion criteria, narrowed the selection to 40 papers included in this review. The reasons for the full-text exclusion of 61 papers can be found in Multimedia Appendix 2. In our review of the literature, we classified the included articles into three broad categories: detection of mental health conditions and suicidal ideation through text (n=15), usage of LLMs for mental health CAs (n=7), and other applications and evaluations of LLMs in mental health (n=18). The first category investigates the potential of LLMs for the early detection of mental illness and suicidal ideation via social media and other textual sources. Early screening is highlighted as essential for preventing the progression of mental disorders and mitigating more severe outcomes. The second category assesses LLM-supported CAs used as teletherapeutic interventions for mental health issues, such as loneliness. The third category covers broader applications and evaluations of LLMs in mental health, including comparisons of LLM assessments against those of mental health professionals; for example, one study asked ChatGPT-4 and ChatGPT-3.5 to evaluate a vignette depicting suicide risk and compared their assessments with those of mental health professionals.
ChatGPT-4's assessments of suicide attempts aligned closely with those of mental health professionals (average Z score 0.01), while ChatGPT-3.5 significantly underestimated these risks (Z score -0.83). ChatGPT-4 reported higher rates of suicidal ideation and psychache (Z scores 0.47 and 1.00, respectively) but assessed resilience lower than professionals (Z scores -0.89 and -0.90). Other representative studies in this category include: one that created gold-standard labels for a subset of each dataset using a panel of human raters, compared state-of-the-art sentiment analysis tools on health-related survey data, and explored few-shot learning by fine-tuning GPT on a small annotated subset and zero-shot learning using ChatGPT, finding high variability and disagreement among sentiment analysis tools, with GPT and ChatGPT outperforming all other tools and ChatGPT achieving higher accuracy and F-measure than the fine-tuned GPT; one that applied the Stress and Coping Process Questionnaire (SCPQ) to three OpenAI models (davinci-003, ChatGPT, GPT-4) and found that, while their responses aligned with human dynamics of appraisal and coping, they did not vary across key appraisal dimensions as predicted, differed significantly in response magnitude, and reacted more negatively than humans to negative scenarios, potentially influenced by their training processes; and an evaluation of prompt engineering by Grabb [109], which tested ChatGPT's response variability to four uniquely framed questions about happiness, each asked five times in distinct roles and contexts, to explore the model's adaptability and advice consistency.

Mental health conditions and suicidal ideation detection through text
Early intervention and screening are crucial in mitigating the global burden of mental health issues [132]. We examined the performance of LLMs in detecting mental health conditions and suicidal ideation through textual analysis. Six articles assessed the efficacy of early screening for depression using LLMs [50,57,60,61,66,68], while another simultaneously addressed both depression and anxiety [60]. One comprehensive study examined various psychiatric conditions, including depression, social anxiety, loneliness, anxiety, and other prevalent mental health issues [69]. Two articles assessed and compared the ability of LLMs to perform sentiment and emotion analysis [75,81]. Five articles focused on the capability of LLMs to analyze textual content for detecting suicidal ideation [54,65,70,72,78]. Most studies employed BERT and its variants as one of the primary models (n=10) [50,54,57,62]. In studies focusing on early screening for depression, comparing results horizontally is challenging due to variations in datasets, training methods, and models across different investigations. Nonetheless, substantial evidence supports the significant potential of LLMs in detecting depression from text-based data. For example, Danner et al. conducted a comparative analysis using a Convolutional Neural Network (CNN) on the DAIC-WOZ dataset, achieving F1 scores of 0.53 and 0.59; however, their use of GPT-3.5 demonstrated superior performance, with an F1 score of 0.78 [57]. Another study involving the E-DAIC dataset (an extension of DAIC-WOZ) used DepRoBERTa to predict PHQ-8 scores from textual data. This approach identified three levels of depression and achieved the lowest mean absolute error (MAE) of 3.65 on PHQ-8 scores [66].
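For reference, the snippet below illustrates how the two metrics quoted in these screening studies are computed, binary F1 for depression detection and mean absolute error for PHQ-8 score prediction; the labels and predictions are invented solely to demonstrate the calculation.

```python
# Illustrative metric computation only; values do not come from any reviewed study.
from sklearn.metrics import f1_score, mean_absolute_error

y_true = [1, 0, 1, 1, 0, 0, 1, 0]          # 1 = depressed, 0 = not depressed (reference labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]          # hypothetical model predictions from transcript text
print("F1:", round(f1_score(y_true, y_pred), 2))          # harmonic mean of precision and recall

phq8_true = [4, 12, 20, 7, 15]             # clinician-administered PHQ-8 scores
phq8_pred = [6, 10, 17, 9, 14]             # scores predicted from text (in the style of DepRoBERTa)
print("MAE:", round(mean_absolute_error(phq8_true, phq8_pred), 2))  # average absolute score error
```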
LLMs play an important role in sentiment analysis [75,81], which categorizes text into overall polarity classes such as positive, neutral, negative, and occasionally mixed, and in emotion classification, which assigns labels like 'joy', 'sadness', 'anger', and 'fear' [75]. These analyses enable the detection of emotional states and potential mental health issues from textual data, facilitating early intervention [133]. Stigall et al. demonstrated the efficacy of these models, with their study showing that EmoBERTTiny, a fine-tuned variant of BERT, achieved an accuracy of 93.14% in sentiment analysis and 85.46% in emotion analysis. This performance surpasses that of baseline models, including BERT-Base Cased and Prak-wal1 pre-trained BERTTiny [75], underscoring the advantages and validity of fine-tuning in enhancing model performance. LLMs have also demonstrated robust accuracy in detecting and classifying a range of mental health syndromes, including social anxiety, loneliness, and generalized anxiety. Vajre et al. introduced PsychBERT, developed using a diverse training dataset drawn from both social media texts and academic literature, which achieved an F1 score of 0.63, outperforming traditional deep learning approaches such as CNNs and Long Short-Term Memory networks (LSTMs), which recorded F1 scores of 0.57 and 0.51, respectively [69]. In research on detecting suicidal ideation using LLMs, Diniz et al. showcased the high efficacy of the BERTimbau Large model within a non-English (Portuguese) context, achieving an accuracy of 0.955, precision of 0.961, and an F-score of 0.954 [54]. Metzler et al.'s assessment of the BERT model found it correctly identified 88.5% of tweets as suicidal or off-topic, performing comparably to human analysts and other leading models [65]. However, Levkovich et al. noted that while ChatGPT-4 assessments of suicide risk closely aligned with those made by mental health professionals, it overestimated suicidal ideation [70]. These results underscore that while LLMs have the potential to identify tweets reflecting suicidal ideation with accuracy comparable to psychological professionals, extensive follow-up studies are required to establish their practical application in clinical settings.
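As a hedged sketch of the kind of BERT-based text classification these studies report, the snippet below runs a publicly available fine-tuned sentiment checkpoint through the Hugging Face pipeline API; the checkpoint named is a generic sentiment model used here for illustration, not one of the mental-health-specific models (e.g., EmoBERTTiny or PsychBERT) evaluated in the reviewed papers.

```python
# Minimal sentiment classification with a fine-tuned BERT-family model (illustrative only).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # generic public sentiment checkpoint
)

posts = [
    "I can't sleep and nothing feels worth doing anymore.",
    "Had a really good talk with my therapist today, feeling hopeful.",
]
for post, result in zip(posts, classifier(posts)):
    # Each result is a dict with a polarity label and a confidence score.
    print(result["label"], round(result["score"], 3), "-", post)
```

In the reviewed studies, checkpoints like this are typically further fine-tuned on expert-annotated mental health corpora and evaluated with the accuracy and F1 figures reported above.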

LLMs in mental health CAs
In the growing field of mental health digital support, the implementation of LLMs as CAs has exhibited both promising advantages [14,84,91,96] and significant challenges [92,96]. The studies by Ma et al. and Heston et al. both demonstrate the effectiveness of CAs powered by LLMs in providing timely, non-judgmental mental health support [14,96]. This intervention is particularly important for those who lack ready access to a therapist due to constraints such as time, distance, and work, as well as for certain socially marginalized populations, such as older adults who experience chronic loneliness and a lack of companionship [14,97]. Ma et al.'s qualitative analysis of user interactions on Reddit highlights that LLMs encourage users to speak up and boost their confidence by providing personalized and responsive interactions [14]. Additionally, VHope, a DialoGPT-enabled mental health CA, was evaluated by three experts who rated its responses as 67% relevant, 78% human-like, and 79% empathetic [84]. Another study found that after 717 evaluations by 100 participants on 239 autism-specific questions, 46.86% of evaluators preferred the responses of the chief physicians, whereas 34.87% preferred ChatGPT-4 (OpenAI), and 18.27% favored ERNIE Bot (version 2.2.3; Baidu, Inc). Moreover, ChatGPT (mean score 3.64, 95% CI 3.57-3.71) outperformed physicians (mean score 3.13, 95% CI 3.04-3.21) in terms of empathy [98], indicating that LLM-powered CAs are not only effective but also acceptable to users. These findings highlight the potential for LLMs to complement mental health intervention systems and provide valuable medical guidance.
The development and implementation of a non-English CA for emotion capture and categorization was explored in a study by Zygadlo et al. Faced with a scarcity of Polish datasets, the study adapted by translating an existing database of personal conversations from English into Polish, which decreased accuracy in tasks from 90% in English to 80% in Polish [92]. While the performance remained commendable, it highlighted the challenges posed by the lack of robust datasets in languages other than English, impacting the effectiveness of CAs across different linguistic environments. However, findings by He et al. suggest that the availability of language-specific datasets is not the sole determinant of CA performance. In their study, although ERNIE Bot was trained in Chinese and ChatGPT in English, ChatGPT demonstrated greater empathy for Chinese users [98]. This implies that factors beyond the training language and dataset availability, such as model architecture or training methodology, can also affect the empathetic responsiveness of LLMs, underscoring the complexity of human-AI interaction.
Meanwhile, the reliability of LLM-driven CAs in high-risk scenarios remains a concern [14,96]. An evaluation of 25 CAs found that in tests involving suicide scenarios, only two included suicide hotline referrals during the conversation [96]. This suggests that while these CAs can detect extreme emotions, few are equipped to take effective preventive measures. Furthermore, CAs often struggle with maintaining consistent communication due to limited memory capacity, leading to disruptions in conversation flow and negatively affecting user experience [14].

Other applications and evaluations of LLMs in mental health
ChatGPT has gained attention for its unparalleled ability to generate human-like text and analyze large amounts of textual data, attracting the interest of many researchers and practitioners. Numerous evaluations of LLMs in mental health have focused on ChatGPT, exploring its utility across various scenarios such as clinical diagnosis [100]. However, the direct deployment of LLMs such as ChatGPT in clinical settings carries inherent risks. The outputs of LLMs are heavily influenced by prompt engineering, which can lead to inconsistencies that undermine clinical reliability [102,105,106,107,109]. For example, Farhat et al. conducted a critical evaluation of ChatGPT's ability to generate medication guidelines through detailed cross-questioning and noted that altering prompts substantially changed the responses [105]. While ChatGPT typically provided helpful advice and recommended seeking expert consultation, it occasionally produced inappropriate medication suggestions. Perlis et al. verified this, showing that GPT-4 Turbo suggested medications that were considered poor choices or contraindicated by experts in 12% of cases [129]. Moreover, LLMs often lack the necessary clinical judgment capabilities. This issue was highlighted by Grabb's study, which revealed that despite built-in safeguards, ChatGPT remains susceptible to generating extreme and potentially hazardous recommendations [109]. A particularly alarming example was ChatGPT advising a depressed patient to engage in high-risk activities like bungee jumping as a means of seeking pleasure [109]. These LLMs depend on prompt engineering [102,105,109], which means their responses can vary widely depending on the wording and context of the prompts given. The system prompts, which are predefined instructions given to the model, and the prompts used by the experimental team, such as those in Farhat's study, guide the behavior of ChatGPT and similar LLMs. These prompts are designed to accommodate a variety of user requests within legal and ethical boundaries. However, while these boundaries are intended to ensure safe and appropriate responses, they often fail to align with the nuanced sensitivities required in psychological contexts. This mismatch underscores a significant deficiency in the clinical judgment and control of LLMs within sensitive mental health settings.
Further research into other LLMs in the mental health sector has shown a range of capabilities and limitations. For example, a study by Sezgin et al. highlighted LaMDA's proficiency in managing complex inquiries about postpartum depression (PPD) that require medical insight or nuanced understanding, yet pointed out its challenges with straightforward, factual questions, such as "What are antidepressants?" [111]. Assessments of LLMs such as LLaMA-7B, ChatGLM-6B, and Alpaca, involving 50 interns specializing in mental illness, received favorable feedback regarding the fluency of these models in a clinical context, with scores above 9.5 out of 10. However, the results also indicated that the responses of these LLMs often failed to address mental health issues adequately, demonstrated limited professionalism, and resulted in decreased usability [116]. Similarly, a study on psychiatrists' perceptions of using LLMs such as Bard and Bing AI in mental health care revealed mixed feelings. While 40% of physicians indicated that they would use such LLMs to assist in answering clinical questions, some expressed serious concerns about their reliability, confidentiality, and potential to damage the patient-physician relationship [130].

Principal findings
In the context of the wider prominence of LLMs in the literature [14,50,57,60,61,69,96,130], our research supports the assertion that interest in LLMs is growing in the field of mental health. Figure 3 indicates a rising trend in the number of mental health studies employing LLMs, with a notable surge observed in 2023 following the introduction of ChatGPT in late 2022. Although we included articles only up to the end of April 2024, it is evident that the number of articles related to LLMs in the field of mental health continues to show a steady increase in 2024. This trend marks a substantial shift in the discourse around LLMs, reflecting their broader acceptance and integration into various aspects of mental health research and practice. The progression from text analysis to a diverse range of applications highlights the academic community's recognition of the multifaceted uses of LLMs. LLMs are increasingly employed for complex psychological assessments, including early screening, diagnosis, and therapeutic interventions.
Our findings demonstrate that LLMs are highly effective in analyzing textual data to assess mental states and identify suicidal ideation [50,54,57,60,61,65,66,68,69,72,78], although their categorization often tends to be binary [50,54,65,68,69,72,78]. These LLMs possess extensive knowledge in the field of mental health and are capable of generating empathic responses that closely resemble human interactions [97,98,107]. They show great potential for providing mental health interventions with improved prognoses [50,96,110,127,128,131], with the majority being recognized by psychologists for their appropriateness and accuracy [98,100,129]. The careful and rational application of LLMs can enhance mental health care efficiently and at a lower cost, which is crucial in areas with limited healthcare capacity. However, there are currently no studies available that provide evaluative evidence to support the clinical use of LLMs.

Strengths and Limitations of Using LLMs in Mental Health
Based on the reviewed literature, the strengths and weaknesses of applying LLMs in mental health are summarized in Table 4.
LLMs have a broad range of applications in the mental health field. These models excel in user interaction, provide empathy and anonymity, and help reduce the stigma associated with mental illness [14,107], potentially encouraging more patients to participate in treatment. They also offer a convenient, personalized, and cost-effective way for individuals to access mental health services at any time and from any location, which can be particularly helpful for socially isolated populations, especially the elderly [60,84,97]. Additionally, LLMs can help reduce the burden of care during times of severe healthcare resource shortages and patient overload, such as during the COVID-19 pandemic [68]. Although previous research has highlighted the potential of LLMs in mental health, it is evident that they are not yet ready for clinical use due to unresolved technical risks and ethical issues.
The use of LLMs in mental health, particularly those fine-tuned for specific tasks such as ChatGPT, reveals clear limitations. The effectiveness of these models heavily depends on the specificity of user-generated prompts. Inappropriate or imprecise prompts can disrupt the conversation's flow and diminish the model's effectiveness [75,96,105,107,109]. Even small changes in the content or tone of prompts can sometimes lead to significant variations in responses, which can be particularly problematic in healthcare settings where interpretability and consistency are critical [14,105,107]. Furthermore, LLMs lack clinical judgment and are not equipped to handle emergencies [95,108]. While they can generally capture extreme emotions and recognize scenarios requiring urgent action, such as suicidal ideation [54,65,70,72,78], they often fail to provide direct, practical measures, typically only advising users to seek professional help [96]. In addition, the inherent bias in LLM training data [66,106] can lead to the propagation of stereotypical, discriminatory, or biased viewpoints. This bias can also give rise to hallucinations, where LLMs produce erroneous or misleading information [105,131]. Hallucinations may also stem from overfitting the training data or a lack of contextual understanding [134]. Such inaccuracies can have serious consequences, such as providing incorrect medical information, reinforcing harmful stereotypes, or failing to recognize and appropriately respond to mental health crises [131]. For example, an LLM might reinforce a harmful belief held by a user, potentially exacerbating their mental health issues, or it could generate non-factual, overly optimistic, or pessimistic medical advice, delaying appropriate professional intervention. These issues could undermine the integrity and fairness of social psychology [102,105,106,110].
Another critical concern is the 'black box' nature of LLMs [105,107,131]. This lack of interpretability complicates the application of LLMs in mental health, where trustworthiness and clarity are important. With conventional neural networks treated as black boxes, researchers at least know what the models were trained on, how they were trained, and what the weights are. With many newer LLMs such as GPT-3.5/4, however, researchers and practitioners often access the models via web interfaces or APIs without complete knowledge of the training data, training methods, and model updates. This situation not only presents the traditional challenges associated with neural networks but also introduces additional problems that stem from the 'hidden' nature of the model.

Ethical concerns are another significant challenge associated with applying LLMs in mental health. Debates are emerging around issues such as digital personhood, informed consent, the risk of manipulation, and the appropriateness of AI in mimicking human interactions [60,102,105,106,135]. A primary ethical concern is the potential alteration of the traditional therapist-patient relationship. Individuals may struggle to fully grasp the advantages and disadvantages of LLM derivatives, often choosing these options for their lower cost or greater convenience. This trend could lead to an increased reliance on the emotional support provided by AI [14], inadvertently positioning AI as the primary diagnostician and decision-maker for mental health issues and thereby undermining trust in conventional healthcare settings. Moreover, therapists may become overly reliant on LLM-generated answers and use them in clinical decision-making, overlooking the complexities involved in clinical assessment. This reliance could compromise their professional judgment and reduce opportunities for in-depth engagement with patients [17,129,130]. Furthermore, an increasingly technocratic model of mental health care risks depersonalizing and dehumanizing patients [136], with decisions driven more by algorithms than by human insight and empathy. This can lead to decisions becoming mechanized, lacking empathy, and detached from ethics [137]. AI systems may fail to recognize or adequately interpret the subtle and often non-verbal cues critical in traditional therapeutic settings [136], such as tone of voice, facial expressions, and the emotional weight behind words, which are essential for comprehensively understanding a patient's condition and providing empathetic care.
Additionally, the current roles and accuracy of LLMs in mental health are limited. For instance, while LLMs can categorize a patient's mood or symptoms, most of these categorizations are binary, such as 'depressed' or 'not depressed' [50,65]. This oversimplification can lead to misdiagnoses. Data security and user privacy in clinical settings are also of utmost concern [14,54,60,96,130]. Although nearly 70% of psychiatrists believe that managing medical documents will be more efficient using LLMs, many still have concerns about their reliability and privacy [97,130,131]. These concerns could have a devastating impact on patient privacy and undermine the trust between physicians and patients if confidential treatment records stored in LLM databases are compromised. Beyond the technical limitations of AI, the current lack of an industry-benchmarked ethical framework and accountability system hinders the true application of LLMs in clinical practice [131].

Limitations of the Selected Articles
Several limitations were identified in the literature review. A significant issue is the age bias present in the social media data used for depression and mental health screening. Social media platforms tend to attract younger demographics, leading to an underrepresentation of older age groups [65]. Furthermore, most studies have focused on social media platforms primarily used by English-speaking populations, such as Twitter, which may result in a lack of insight into mental health trends in non-English-speaking regions. Our review included studies in Polish, Chinese, Portuguese, and Malay, all of which highlighted significant limitations of LLMs caused by the availability and size of databases [54,61,92,98,116]. For instance, due to the absence of a dedicated Polish-language mental health database, a Polish study had to rely on machine-translated English databases [92]. While the LLMs achieved 80% accuracy in categorizing emotions and moods in Polish, this is still lower than the 90% accuracy observed on the original English dataset. This discrepancy highlights that the accuracy of LLMs can be affected by the quality of the database.
Another limitation of our review is the low diversity of LLMs studied. Although we used 'large language models' as keywords in our search phase, most identified studies primarily focused on BERT and its variants, as well as GPT models. Therefore, this review provides only a limited picture of the variability we might expect in applicability between different LLMs. Additionally, the rapid development of LLM technologies presents a limitation; our study can only reflect current trends and may not encompass future advances or the full potential of LLMs. For instance, in tests involving psychologically relevant questions and answers, ChatGPT 3.5 achieved an accuracy of 66.8%, while ChatGPT 4.0 reached an accuracy of 85%, compared to an average human score of 73.8% [118]. Evaluating ChatGPT at different stages separately and comparing its performance to that of humans can lead to varied conclusions. In the assessment of prognosis and treatment planning for depression using LLMs, ChatGPT 3.5 demonstrated a distinctly pessimistic prognosis that differed significantly from those of ChatGPT-4, Claude, Bard, and mental health professionals [128]. Therefore, continuous monitoring and evaluation are essential to fully understand and effectively utilize the advancements in LLM technologies.

Opportunities and Future Work
Implementing technologies involving LLMs within the healthcare provision of real patients demands thorough and multi-faceted evaluations. It is imperative that industry and researchers do not let rollout outpace the evidence required to establish safety and efficacy. At the level of the service provider, this includes providing explicit warnings to the public to discourage mistaking LLM functionality for clinical reliability. For example, ChatGPT-4 introduced the ability to process and interpret image inputs within conversational contexts, leading OpenAI to issue an official warning that ChatGPT-4 is not approved for analyzing specialized medical images, such as CT scans [138].
A key challenge to address in LLM research is the tendency to produce incoherent text or hallucinations. Future efforts could focus on training LLMs specifically for mental health applications, using datasets with expert labeling to reduce bias and creating specialized mental health lexicons [84,102,116]. The creation of specialized datasets could take advantage of the customizable nature of LLMs, fostering the development of models that cater to the distinct needs of varied demographic groups. For instance, unlike models designed for healthcare professionals, which assist in tasks like data documentation, symptom analysis, medication management, and postoperative care, LLMs intended for patient interaction might be trained with an emphasis on empathy and comfortable dialogue.
Another critical concern is the problem of outdated training data in LLMs. Traditional LLMs, such as GPT-4 (with a training data cut-off in October 2023), rely on potentially outdated training data, limiting their ability to incorporate recent events or information. This can compromise the accuracy and relevance of their responses, leading to the generation of uninformative or incorrect answers, known as 'hallucinations' [139]. Retrieval-Augmented Generation (RAG) offers a solution by retrieving facts from external knowledge bases, ensuring that LLMs use the most accurate and up-to-date information [140]. By searching for relevant information from numerous documents, RAG enhances the generation process with the most recent and contextually relevant content [141]. Additionally, RAG incorporates evidence-based information, increasing the reliability and credibility of LLM responses [139].
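The following is a minimal sketch of the retrieval step in a RAG pipeline, using TF-IDF similarity to select a supporting passage and prepend it to the prompt; the toy knowledge base, the retrieval method, and the omitted generation call are illustrative assumptions rather than the architecture proposed in the cited works.

```python
# Minimal RAG retrieval sketch: fetch the most relevant passage, then augment the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Updated 2024 guideline: CBT is a first-line treatment for mild depression.",
    "Crisis support: national suicide prevention hotlines are available 24/7.",
    "Sleep hygiene advice: keep a consistent sleep schedule and limit screens at night.",
]
question = "What is currently recommended as first-line treatment for mild depression?"

vectorizer = TfidfVectorizer().fit(knowledge_base + [question])
doc_vectors = vectorizer.transform(knowledge_base)
query_vector = vectorizer.transform([question])
scores = cosine_similarity(query_vector, doc_vectors)[0]
top_passage = knowledge_base[scores.argmax()]        # highest-similarity passage from the knowledge base

augmented_prompt = (
    f"Answer using only the context below.\n\nContext: {top_passage}\n\nQuestion: {question}"
)
print(augmented_prompt)  # this augmented prompt, not the bare question, is what the LLM would receive
```

Production systems typically replace the TF-IDF step with dense embeddings and a vector database, but the grounding principle is the same: the model generates from retrieved, up-to-date evidence rather than from its static training data alone.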
To further enhance the reliability of LLM content and minimize hallucinations, recent studies suggest adjusting model parameters, such as the 'temperature' setting [142-144]. The 'temperature' parameter influences the randomness and predictability of outputs [145]. Lowering the temperature typically results in more deterministic outputs, enhancing coherence and reducing irrelevant content [146]. However, this adjustment can also limit the model's creativity and adaptability, potentially making it less effective in scenarios requiring diverse or nuanced responses. In mental therapy, where nuanced and sensitive responses are essential, maintaining an optimal balance is crucial. While a lower temperature can ensure accuracy, which is important for tasks like clinical documentation, it may not suit therapeutic dialogues where personalized engagement is key. Low temperatures can lead to repetitive and impersonal responses, reducing patient engagement and therapeutic effectiveness. To mitigate these risks, regular updates of the model incorporating the latest therapeutic practices and clinical feedback are essential. Such updates could refine the model's understanding and response mechanisms, ensuring it remains a safe and effective tool for mental health care. Nevertheless, determining the 'optimal' temperature setting is challenging, primarily due to the variability in tasks and interaction contexts, which require different levels of creativity and precision.
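To show what the temperature parameter actually changes, the sketch below applies temperature scaling to a set of invented next-token scores before converting them to sampling probabilities; real systems expose this as a decoding parameter, and the numbers here are purely illustrative.

```python
# Illustration of temperature scaling: logits are divided by the temperature before softmax.
import numpy as np

def sampling_distribution(logits, temperature):
    logits = np.array(logits) / temperature          # lower temperature sharpens the distribution
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

logits = [2.0, 1.5, 0.3, -1.0]                       # invented scores for four candidate next tokens
for t in (0.2, 0.7, 1.5):
    print(f"temperature={t}: {np.round(sampling_distribution(logits, t), 3)}")
# At t=0.2 nearly all probability falls on the top token (deterministic, repetitive outputs);
# at t=1.5 the distribution flattens, giving more varied but less predictable text.
```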
Data privacy is another important area of concern. Many LLMs, such as ChatGPT and Claude, involve sending data to third-party servers, which poses the risk of data leakage. Current studies have found that LLMs can be enhanced by privacy-enhancing techniques, such as zero-knowledge proofs, differential privacy, and federated learning [147]. Additionally, privacy can be preserved by replacing identifying information in textual data with generic tokens. For example, when recording sensitive information (e.g., names, addresses, or credit card numbers), using mask tokens as placeholders can help protect user data from unauthorized access [148]. This obfuscation technique ensures that sensitive user information is not stored directly, thereby enhancing data security.
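A minimal sketch of the token-replacement idea is shown below, masking emails, card numbers, and phone numbers with generic placeholders before text is stored or sent to a third-party service; the regular expressions are simplified illustrations, and real de-identification would require validated tooling and broader entity coverage (e.g., names and addresses).

```python
# Simplified PII masking sketch; not a substitute for validated de-identification tools.
import re

# Order matters: mask the longer, more specific pattern (card numbers) before generic phone numbers.
PATTERNS = [
    ("[EMAIL]", r"[\w.+-]+@[\w-]+\.[\w.]+"),
    ("[CARD]", r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    ("[PHONE]", r"\+?\d[\d -]{8,14}\d"),
]

def mask_pii(text: str) -> str:
    for token, pattern in PATTERNS:
        text = re.sub(pattern, token, text)  # replace each match with a generic placeholder token
    return text

note = ("Patient Jane can be reached at jane.doe@example.com or +44 7700 900123; "
        "billing card on file is 4111 1111 1111 1111.")
print(mask_pii(note))  # identifiers are replaced before the note is stored or transmitted
```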
The lack of interpretability in LLM decision-making is another crucial area for future research on healthcare applications. Future research should examine the models' architecture, training, and inferential processes for clearer understanding. Detailed documentation of training datasets, sharing of model architectures, and third-party audits would ideally form part of this undertaking. Investigating techniques like attention mechanisms and modular architectures could illuminate aspects of neural network processing. The implementation of knowledge graphs might help in outlining logical relationships and facts [149]. Additionally, another promising approach involves creating a dedicated embedding space during training, guided by an LLM. This space aligns with a causal graph and aids in identifying matches that approximate counterfactuals [150].
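As one concrete, if partial, interpretability probe of the kind mentioned above, the sketch below loads an open-weight BERT model with attention outputs enabled and lists the tokens each input token attends to most; attention maps offer only a limited window into model behaviour, and the example text and model choice are illustrative assumptions.

```python
# Inspecting attention weights of an open-weight model (a partial interpretability probe only).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "I feel hopeless and cannot sleep at night."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_layer = outputs.attentions[-1][0]          # (heads, seq_len, seq_len) for this single sentence
avg_attention = last_layer.mean(dim=0)          # average the attention weights over all heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weights in zip(tokens, avg_attention):
    top = weights.topk(3).indices.tolist()      # the 3 tokens this token attends to most strongly
    print(f"{token:>10} -> {[tokens[i] for i in top]}")
```

Closed-weight models accessed only through an API do not expose these internals, which is precisely the 'hidden model' problem discussed earlier.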
Before deploying LLMs in mental health settings, a comprehensive assessment of their reliability, safety, fairness, abuse resistance, interpretability, compliance with social norms, robustness, performance, linguistic accuracy, and cognitive ability is essential. It is also crucial to foster collaborative relationships among mental health professionals, patients, AI researchers, and policymakers. LLMs, for instance, have demonstrated initial competence in providing medication advice, yet their responses can sometimes be inconsistent or include inappropriate suggestions. As such, LLMs require professional oversight and should not be used independently. However, when utilized as decision aids, LLMs have the potential to enhance healthcare efficiency. We call on developers of LLMs to collaborate with authoritative regulators in actively developing ethical guidelines for AI research in healthcare. These guidelines should aim to adopt a balanced approach that considers the multifaceted nature of LLMs and ensures their responsible integration into medical practice. They are expected to become industry benchmarks, facilitating the future development of LLMs in mental health.

Conclusion
This review examines the use of LLMs in mental health applications, including text-based screening for mental health conditions, detection of suicidal ideation, CAs, clinical use, and other related applications. Despite their potential, challenges such as the production of hallucinatory or harmful information, output inconsistency, and ethical concerns remain. Nevertheless, as technology advances and ethical guidelines improve, LLMs are expected to become increasingly integral and valuable in mental health services, providing alternative solutions to this global healthcare issue.

Contributors
ZG and KL contributed to the conception and design of the study. ZG, KL, and AL also contributed to the development of the search strategy. Database search outputs were screened by ZG, and data were extracted by ZG and KL. An assessment of the risk of bias in the included studies was performed by ZG and KL. ZG completed the literature review, collated the data, performed the data analysis, interpreted the results, and wrote the first draft of the manuscript. KL, AL, JHT, JF, and TK reviewed the manuscript and provided multiple rounds of guidance in the writing of the manuscript. All authors read and approved the final version of the manuscript.

Acknowledgements
This work was funded by the UKRI Centre for Doctoral Training in AI-enabled healthcare systems (grant EP/S021612/1). The funders were not involved in the study design, data collection, analysis, publication decisions, or manuscript writing. The views expressed in the text are those of the authors and not those of the funder.

Conflicts of Interest
The authors declare no conflict of interest.

Data sharing statement
The authors ensure that all pertinent data have been incorporated within the article and/or its supplementary materials. For access to the research data, interested parties may contact the corresponding author, Kezhi Li (ken.li@ucl.ac.uk), subject to a reasonable request.

Figure 1. Comparative analysis of LLMs by parameter size and developer entity. The bar chart represents the number of parameters in billions for various language models by date of publication, with the oldest models at the top. The legend is color-coded by the development entity. Data were summarized with the latest models up to June 2024, with parameters and developers from GPT to LLaMA adapted from the work of Thirunavukarasu AJ et al. [32]. Details on release dates, parameter sizes, and developer entities for the most recent LLMs are sourced from references [33-35].

Figure 3. Number of articles included in this literature review, grouped by year of publication and application field. The black line indicates the total number of articles in each year.

Multimedia Appendix 2. List of studies excluded at the full-text screening stage.