ChatGPT and large language models for systematic review tasks: What are the opportunities for improvement?

Session Type
Oral presentation
Evidence synthesis innovations and technology
Qureshi R1, Shaughnessy D1, Gill K2, Robinson K3, Li T1, Agai E4
1University of Colorado Anschutz Medical Campus, United States
2University of Pittsburgh, United States
3Johns Hopkins University, United States
4PICO Portal, United States

Background: The advancement of Artificial Intelligence (AI)-assisted technologies leaves many wondering about their current capabilities, limitations, and opportunities for integration into scientific endeavors. Large language models (LLMs), such as OpenAI's ChatGPT, have recently gained widespread attention for their ability to respond to various prompts in a natural-sounding way. Systematic reviews (SRs) use secondary data and often require many months and tens to hundreds of thousands of dollars to complete, making them attractive grounds for developing AI-assistive technologies.
Objectives: To evaluate the responses of ChatGPT to SR tasks to assess applicability, correctness, and potential usefulness to reviewers.
Methods: We gave ChatGPT a set of SR tasks during a live webinar attended by over 400 people and evaluated the responses, considering our own expertise and attendees’ comments.
Results: When tasked with developing an SR question and eligibility criteria, ChatGPT appropriately formulated a structured question based on the requested PICO elements and produced criteria related to PICO, study design, language, and publication dates. ChatGPT failed to write a PubMed search strategy for a given SR question: while the output matched the structural components of a search, it was unusable due to fabricated pseudo-MeSH terms and inappropriate or incorrect filters. When tasked with screening 27 titles for a given question, ChatGPT correctly identified 8 of the 10 eligible articles while also including one ineligible article. When tasked with summarizing three abstracts, ChatGPT accurately identified relevant results from two, but misidentified background information from the third as results. In all tasks, ChatGPT's responses required interpretation and expertise to assess validity. Many attendees expressed hesitancy about and distrust of ChatGPT for completing SR tasks.
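The title-screening counts above can be restated as standard screening metrics. The sketch below is illustrative only and is not part of the authors' analysis: it assumes every one of the 27 titles received an include or exclude decision, so that the 17 titles not counted as eligible were ineligible. The function name `screening_metrics` and its parameters are hypothetical.

```python
def screening_metrics(total, eligible, true_pos, false_pos):
    """Derive screening metrics from raw counts.

    Assumption (not stated in the abstract): every title was classified
    as either included or excluded, so the remaining counts follow by
    subtraction.
    """
    ineligible = total - eligible        # titles that should be excluded
    false_neg = eligible - true_pos      # eligible titles that were missed
    true_neg = ineligible - false_pos    # ineligible titles correctly excluded
    return {
        "sensitivity": true_pos / eligible,            # share of eligible titles found
        "specificity": true_neg / ineligible,          # share of ineligible titles excluded
        "precision": true_pos / (true_pos + false_pos),  # share of inclusions that were eligible
        "false_neg": false_neg,
        "true_neg": true_neg,
    }


# Counts reported in the Results: 27 titles, 10 eligible,
# 8 eligible titles included, 1 ineligible title included.
metrics = screening_metrics(total=27, eligible=10, true_pos=8, false_pos=1)
for name, value in metrics.items():
    print(name, value)
```

Under these assumptions, sensitivity is 8/10 = 0.80, specificity is 16/17 (about 0.94), and precision is 8/9 (about 0.89). These derived figures come from a single 27-title exercise and should not be read as estimates of general screening performance.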
Conclusions: While ChatGPT and other LLMs show some promise for aiding in SR tasks, the technology is in its infancy and needs further development for such applications. In addition, everyone, particularly those who are not content experts, should take great care to critically evaluate any content produced by LLMs: much of the output appeared valid at a high level, yet some of it was erroneous and required active vetting.
Consumer involvement: The webinar was public; anyone could attend and comment.