Potential roles of large language models in production of systematic reviews and meta-analyses

Large language models (LLMs) like ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be employed to expedite various steps, including defining clinical questions, literature search, document screening, information extraction



Introduction
A systematic review is the result of a systematic and rigorous evaluation of evidence, and a meta-analysis may or may not be a part of it [1]. Because of their strict methodology and comprehensive summary of evidence, high-quality systematic reviews are considered the highest level of evidence and sit at the top of the evidence pyramid [2].
Additionally, high-quality systematic reviews and meta-analyses are often used to support the development of clinical practice guidelines, aid clinical decision-making, and inform healthcare policy formulation [3]. The methods of systematic reviews and meta-analyses are now also applied in disciplines beyond medicine, such as law [4], management [5], and economics [6], where they have yielded positive results and contributed to the continuous advancement of these fields [7].
Conducting a systematic review demands a substantial investment of time, resources, human effort, and funding [8]. To expedite the development of systematic reviews and meta-analyses, various (semi-)automated tools, such as Covidence, have come into play [9,10]. More recently, the emergence of large language models (LLMs), particularly chatbots such as ChatGPT, has presented both challenges and opportunities for systematic reviews and meta-analyses [11]. This article reviews the relevant literature to investigate the potential of LLMs to accelerate the production of systematic reviews and meta-analyses, to examine their potential impact, and to delineate the crucial steps involved in this process.

The process and challenges of conducting a systematic review and meta-analysis
The procedures and workflows for conducting systematic reviews and meta-analyses are well established. Currently, researchers often refer to the Cochrane Handbooks recommended by the Cochrane Library for intervention or diagnostic reviews [12,13]. In addition, some scholars and institutions have developed detailed guidelines on the steps and methodology for performing systematic reviews and meta-analyses [14-17]. Generally speaking, researchers should take the following steps to produce a high-quality systematic review and meta-analysis: determine the clinical question, register and draft a protocol, set inclusion and exclusion criteria, develop and implement a search strategy, screen literature, extract data from included studies, assess the quality and risk of bias of included studies, analyze and process data, write up the full text, and submit for publication, as illustrated in Figure 1. These steps contain many sub-tasks; therefore, conducting a complete systematic review and meta-analysis requires fairly complex and time-consuming work.

Figure 1. The process of conducting a systematic review and meta-analysis.
Although systematic reviews and meta-analyses have been widely applied and play an important role in developing guidelines and informing clinical decision-making, their production faces many challenges. One of them is the long production time and large resource requirement. Studies suggest that the average estimated time to complete and publish a systematic review is 67.3 weeks, requiring five researchers and costing around $140,000 [18,19]. For some time, (semi-)automated tools utilizing natural language processing and machine learning have accelerated systematic review and meta-analysis production to some extent [20], with studies showing that such tools can produce a systematic review and meta-analysis within two weeks [21]. However, these tools also have limitations. First, no single tool can accelerate the entire production process of systematic reviews and meta-analyses. Second, these tools cannot process and analyze literature in different languages. Finally, the reliability of the results generated by these (semi-)automated tools needs further validation, as they are not yet widely adopted.

Large language models in medical research
Chatbots based on LLMs, such as ChatGPT, Google Gemini, and Claude, have become widely applied in medical research. These chatbots prove valuable in tasks ranging from knowledge retrieval, language refinement, content generation, and medical exam preparation to literature assessment.
Research indicates that ChatGPT excels in accuracy, completeness, nuance, and speed when generating responses to clinical inquiries in psychiatry [22]. Moreover, LLMs like ChatGPT play a pivotal role in automating the evaluation of medical literature, facilitating the identification of accurately reported research findings [23]. Despite their significant contributions, these chatbots are not without limitations. Challenges such as the potential for generating misleading content and susceptibility to academic deception necessitate further scholarly discourse on effective mitigation strategies. Standardized reporting practices may help delineate the applications of ChatGPT and mitigate research biases [24].
In the process of conducting systematic reviews and meta-analyses, ChatGPT demonstrates significant potential. Existing studies [11,25-32] indicate that ChatGPT can play a pivotal role in formulating clinical questions, determining inclusion and exclusion criteria, screening literature, assessing publications, generating meta-analysis code, and assisting in full-text composition. We describe these capabilities in detail in the following sections (Table 1).

Determine the research topic/question
Determining the clinical question is the first and most important step in conducting systematic reviews and meta-analyses. At this stage, it is crucial to ascertain whether comparable systematic reviews and meta-analyses have already been published and to delineate the scope of the forthcoming review and meta-analysis. Generally, for interventional systematic reviews, the patient, intervention, comparison, outcome (PICO) framework is used to define the scope and objectives of the research question [60]. In this context, ChatGPT serves a dual role. On one hand, it rapidly aids in searching for published systematic reviews and meta-analyses on the relevant topics (see Figures S1 and S2) [34]. On the other hand, it assists in refining the clinical question to be addressed (see Figure S3), helping researchers promptly determine the feasibility of the proposed study. However, caution is needed, because the references returned may be fabricated [35].

Register and write a research proposal
Registration and proposal writing constitute a pivotal preparatory phase for conducting systematic reviews and meta-analyses. Registration enhances research transparency, fosters collaboration among investigators, and reduces redundant research.
Drafting a proposal helps clarify the research objectives and methods, providing robust support for the smooth execution of the study. For LLMs, generating preliminary registration information and initial proposal content is remarkably convenient (see Figures S4 and S5).
For example, ChatGPT can assist researchers in generating the statistical methods for a research proposal [37]. However, because LLMs often generate fictitious literature, the content they produce may be inaccurate, so the generated content must still be checked and validated.

Define inclusion and exclusion criteria
The inclusion and exclusion criteria for systematic reviews and meta-analyses determine the screening standards for studies. Strict and detailed inclusion and exclusion criteria therefore contribute to the smooth and high-quality conduct of systematic reviews and meta-analyses. A chatbot based on LLMs can help establish the inclusion and exclusion criteria (see Figure S6) [38]. However, the inclusion criteria need to be optimized and adjusted according to the specific research objectives, and the exclusion criteria should build on the inclusion criteria, so manual adjustment and optimization remain necessary.

Develop a search strategy and conduct searches
ChatGPT can assist in formulating search strategies; we use PubMed as an example [40]. Researchers can simply list their questions using the PICO framework, and a search strategy can be quickly generated (Figures S1 and S2). With the generated search strategy, one method is to copy the strategy into the PubMed search box for direct retrieval [40,41]. Another approach uses the OpenAI application programming interfaces (APIs) to invoke the PubMed APIs with the search strategy generated by GPT. This allows for searching the PubMed database, obtaining the search results, and applying predetermined inclusion and exclusion criteria. GPT is then used to filter the search results, and the filtered results are exported and recorded in JSON format. This integrated process encompasses search strategy formulation, retrieval, and filtering. However, using LLMs directly to generate search strategies and complete searching and screening in one pass may not yet be mature, and it makes generating the PRISMA flowchart considerably more difficult.
Therefore, we suggest using LLMs to generate search strategies, which are then optimized and modified by librarians and computer experts (specializing in large language models) before the databases are searched manually. Additionally, to make search strategies transparent and reproducible, detailed prompts should be reported [40,42]. These search strategies also need validation, refinement, and modification.
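To illustrate the kind of strategy that should be recorded and checked, the following minimal sketch assembles a Boolean PubMed query from PICO terms and builds the corresponding request URL for the NCBI E-utilities ESearch endpoint. It is an illustration under stated assumptions rather than the workflow of the cited studies: the PICO terms are hypothetical, the helper names (`or_block`, `build_query`, `esearch_url`) are ours, and no LLM or network call is made. A strategy produced by an LLM would still need review by a librarian before use.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_block(terms):
    """Join the synonyms for one PICO element with OR, limited to title/abstract."""
    quoted = [f'"{t}"[Title/Abstract]' for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(pico):
    """Combine the non-empty PICO elements with AND."""
    blocks = [or_block(terms) for terms in pico.values() if terms]
    return " AND ".join(blocks)


def esearch_url(query, retmax=100):
    """Return the ESearch request URL for a query (no request is sent here)."""
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical PICO terms, used only for illustration.
pico = {
    "population": ["type 2 diabetes", "T2DM"],
    "intervention": ["metformin"],
    "comparison": [],  # left empty, so it is skipped in the query
    "outcome": ["HbA1c", "glycated hemoglobin"],
}

query = build_query(pico)
print(query)
print(esearch_url(query))
```

Keeping the query as a plain string makes it straightforward to report verbatim alongside the prompt that produced it.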

Screen the literature
Literature screening is one of the most time-consuming steps in the creation of systematic reviews and meta-analyses. Prior to the advent of ChatGPT, there were already many (semi-)automated tools available for literature screening, such as Covidence, EPPI-Reviewer, DistillerSR, and others [39]. With the emergence of ChatGPT, researchers can now train the model based on pre-defined inclusion criteria. Subsequently, they can use ChatGPT to automatically screen the records retrieved from databases and obtain the filtered results. Previous studies suggest that using ChatGPT in the literature selection process for meta-analysis substantially reduces the workload while preserving a recall rate on par with manual curation [28,44-47].
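The two quantities mentioned above, recall and workload reduction, can be checked on a manually screened validation sample before an LLM is trusted with the remaining records. The sketch below is a minimal illustration with hypothetical decisions. It assumes that human reviewers only read the records the model includes, so workload reduction is defined here as the share of records the model excludes; the cited studies may define it differently.

```python
def screening_metrics(human, model):
    """Compare model include/exclude decisions with human reference decisions.

    Both arguments are equally long lists of booleans (True = include),
    given in the same record order.
    """
    if len(human) != len(model):
        raise ValueError("decision lists must have the same length")
    true_pos = sum(h and m for h, m in zip(human, model))
    false_neg = sum(h and not m for h, m in zip(human, model))
    excluded_by_model = sum(not m for m in model)
    relevant = true_pos + false_neg
    recall = true_pos / relevant if relevant else float("nan")
    workload_reduction = excluded_by_model / len(model) if model else float("nan")
    return {
        "recall": recall,                # share of relevant records the model kept
        "missed": false_neg,             # relevant records the model excluded
        "workload_reduction": workload_reduction,
    }


# Hypothetical validation sample of ten records.
human = [True, False, False, True, False, False, True, False, False, False]
model = [True, False, True, True, False, False, True, False, False, False]

print(screening_metrics(human, model))
```

Any record counted under `missed` is a relevant study that would have been lost, which is why recall rather than overall accuracy is the quantity to watch.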

Extract the data
Data extraction involves obtaining information from primary studies and serves as a primary source for systematic reviews and meta-analyses. Generally, when conducting systematic reviews and meta-analyses, we need to extract basic information from the original studies, such as publication date, country of conduct, and journal of publication. Characteristics of the population, such as patient samples, age, and gender, and outcome data, including event occurrences, mean change values, and total sample size, are also extracted. Currently, tools based on natural language processing and LLMs, such as ChatGPT and Claude, demonstrate high accuracy in extracting information from Portable Document Format (PDF) documents (Figure S7) [47-50].
However, despite the promising capabilities of these tools, manual verification remains a necessary step in the data extraction process when AI tools are used [61].
Using large language models to extract data can help avoid random errors; however, caution is still required when extracting data from figures or tables [47-50].
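Manual verification can be supported by simple automatic plausibility checks that flag records for priority re-checking. The following sketch assumes, purely for illustration, that the LLM returns one JSON object per study arm with the fields shown; the field names, the `ArmData` class, and the example values are hypothetical and are not taken from the cited studies.

```python
import json
from dataclasses import dataclass


@dataclass
class ArmData:
    """Outcome data for one study arm, as it might be returned by an LLM."""
    study: str
    year: int
    arm: str
    events: int
    total: int


def plausibility_issues(rec):
    """Return a list of problems that should trigger manual re-checking."""
    issues = []
    if rec.total <= 0:
        issues.append("total sample size must be positive")
    if rec.events < 0:
        issues.append("event count cannot be negative")
    if rec.events > rec.total:
        issues.append("events exceed total sample size")
    if not 1900 <= rec.year <= 2100:
        issues.append("publication year outside 1900-2100")
    return issues


# Hypothetical LLM output in JSON format; the second record is deliberately wrong.
raw = """[
  {"study": "Study A", "year": 2019, "arm": "intervention", "events": 12, "total": 150},
  {"study": "Study B", "year": 2021, "arm": "control", "events": 95, "total": 59}
]"""

records = [ArmData(**item) for item in json.loads(raw)]
for rec in records:
    problems = plausibility_issues(rec)
    status = "; ".join(problems) if problems else "no automatic flags"
    print(f"{rec.study} ({rec.arm}): {status}")
```

Passing these checks does not show that a value is correct; it only means that no obvious inconsistency was detected, so every record still needs to be compared with the source article.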

Assess the risk of bias
Assessing the risk of bias involves evaluating the internal validity of the studies included in a review. For randomized controlled trials, we typically use the Risk of Bias (RoB) tool [62] or the RoB 2 tool [63], with an estimated review time of 10-15 minutes per trial. Automated tools such as RobotReviewer can streamline the extraction and evaluation process in batches [51-53] and improve efficiency, although manual verification is still necessary. Additionally, chatbots based on LLMs can aid in risk of bias assessment (see Figure S8), and studies indicate that their accuracy is comparable to human evaluations [23].
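One way to quantify how closely LLM judgments match human judgments on a verification sample is Cohen's kappa, which corrects raw agreement for agreement expected by chance. We use kappa here as an illustrative choice; it is not necessarily the measure reported in the cited studies, and the ratings below are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who judged the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments for one risk of bias domain across six trials.
human = ["low", "low", "high", "some concerns", "low", "high"]
llm = ["low", "low", "high", "low", "low", "high"]

print(round(cohen_kappa(human, llm), 2))
```

Disagreements, such as the fourth trial in this example, indicate where human review should be concentrated.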

Analyze the data/meta-analysis
Data analysis serves as the source of systematic review results, typically encompassing basic information and outcome findings. Meta-analysis may be one component, along with others such as subgroup analysis, sensitivity analysis, meta-regression, and detection of publication bias. Numerous software options are available to facilitate these analyses, including Stata, RevMan, RStudio, and others [43]. Currently, it appears that chatbots based on LLMs may not fully execute data analysis independently, but they can extract the relevant information, after which the corresponding software can be used for comprehensive data analysis. Alternatively, after information has been extracted with chatbots, the ChatGPT Code Interpreter can assist in the analysis and in generating graphical results, although this requires a ChatGPT Plus subscription. Moreover, LLMs can markedly accelerate the data analysis process, enabling researchers to handle larger datasets more efficiently [54].
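As an indication of the kind of meta-analysis code that can be generated and then checked by hand, the sketch below pools log odds ratios with a fixed-effect inverse-variance model. It is a deliberately minimal example with hypothetical study data and function names of our own; a real analysis would normally use established software such as those listed above and would also consider random-effects models and heterogeneity.

```python
import math


def log_odds_ratio(e1, n1, e2, n2):
    """Return (log odds ratio, variance) for events/total in two arms.

    A 0.5 continuity correction is applied if any cell of the 2x2 table is zero.
    """
    a, b, c, d = e1, n1 - e1, e2, n2 - e2
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pool_fixed(effects):
    """Inverse-variance fixed-effect pooling of (estimate, variance) pairs."""
    weights = [1 / var for _, var in effects]
    pooled = sum(w * est for w, (est, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return pooled, se


# Hypothetical studies: (events, total) in the intervention and control arms.
studies = [(10, 100, 20, 100), (15, 120, 25, 118), (8, 90, 12, 95)]

effects = [log_odds_ratio(*s) for s in studies]
pooled, se = pool_fixed(effects)
low, high = pooled - 1.96 * se, pooled + 1.96 * se
print(f"Pooled OR {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(low):.2f} to {math.exp(high):.2f})")
```

Each study is weighted by the inverse of its variance, so larger and more precise studies contribute more to the pooled estimate.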

Draft the full manuscript
The full manuscript of a systematic review and meta-analysis should adhere to the PRISMA reporting guidelines [64]. It is not advisable to use chatbots like ChatGPT for article composition. On one hand, the accuracy and integrity of content generated by GPT require human verification. On the other hand, different research types and journals have different requirements for full-text articles, making it challenging to achieve uniformity in generated content. However, tools like GPT can be considered for language refinement and for adjusting the logic of the content, to enhance the quality and readability of the article [33,55]. It is important to declare the use of GPT-related tools in the methods, acknowledgments, or appendices to ensure transparency [24,65].

Submit and publish
Submission and publication represent the final steps in conducting systematic reviews and meta-analyses, aside from subsequent updates. At this stage, LLM-based tools can assist authors by recommending suitable journals (Figure S9). They might also aid in crafting components such as cover letters and highlights [59]. However, content generated by these tools requires manual verification to ensure its accuracy, and all authors should be accountable for the content generated by LLMs.

Benefits and drawbacks of using large language models
Systematic reviews and meta-analyses are crucial evidence types that support the development of guidelines [3]. The benefits of employing LLM chatbots in the production of systematic reviews and meta-analyses include increased speed, for example in evidence searching, data extraction, and risk of bias assessment; they can also enhance accuracy by reducing human errors, for instance in extracting essential information and pooling data. However, there are also drawbacks: the models may generate hallucinations, their reliability requires human verification, and the systematic review process as a whole is not replicable. Moreover, data privacy must be managed when interacting with LLM chatbots; when LLMs are used to analyze data, especially data involving patient privacy, ethical approval and management must be properly addressed.

Challenges and solutions
While LLMs can help accelerate some steps in the production of systematic reviews and meta-analyses, enhance accuracy and transparency, and save resources, they also face several challenges. For instance, LLMs cannot promptly update their versions and information; ChatGPT 3.5, for example, is trained on data from around 2021. Limitations such as the length of prompts and token constraints, as well as restrictions related to context associations, may affect overall results and user experience [25]. Although LLM-based autonomous agents have made strides in systematic reviews and meta-analyses, LLMs still face numerous challenges related to personalization, updating knowledge, strategic planning, and complex problem-solving. To better facilitate the use of tools such as ChatGPT in systematic reviews and meta-analyses, we believe that, first and foremost, authors should understand the scope and scenarios for applying ChatGPT and clearly define which steps can benefit from these tools. Second, researchers should collaborate with computer scientists and artificial intelligence engineers to optimize prompts and develop integrated tools based on LLMs, such as web applications. These tools can support seamless transitions between different tasks in the systematic review process. Lastly, journal editors should collaborate with authors and reviewers to ensure adherence to the reporting and ethical principles associated with the use of GPT [24,68]. This collaboration aims to promote transparency and integrity and to prevent indiscriminate overuse of LLMs in systematic reviews and meta-analyses.

The development of LLM-driven autonomous agents adept at systematic reviews and meta-analyses warrants exploration [66]. The use of LLMs as centrally controlled intelligent agents encompasses the ability to handle precise literature screening, extract and analyze complex data, and assist in manuscript composition, as demonstrated by proof-of-concept demos like MetaGPT [67]. Moreover, as the use of LLMs continues to grow, ensuring the accuracy of information provided in systematic reviews becomes a significant challenge, particularly if LLMs are indiscriminately overused.

Abbreviations
API: Application Programming Interface
ChatGPT: Chat Generative Pre-trained Transformer
LLM: Large language model
PDF: Portable Document Format
PICO: Population, Intervention, Comparison, Outcome
PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses
RoB: Risk of Bias