Session Details: Colloquium 2023

Date & Time

Tuesday, September 5, 2023, 12:30 PM - 2:00 PM

Location Name

Pickwick

Session Type

Poster

Category

Evidence synthesis innovations and technology

Speakers

Jurjen van der Schans, Health-Ecore/Pitts/UMCG/Open University

Authors

Hanegraaf P¹, Wondimu A², Mosselman JJ¹, De Jong R¹, Abogunrin S³, Queiros L³, Lane M³, Postma M⁴, Boersma C⁵, Van der Schans J⁶

¹Pitts, The Netherlands
²Health-Ecore, Ethiopia
³Roche, Switzerland
⁴Health-Ecore, UMCG, University of Groningen, The Netherlands
⁵Health-Ecore, Pitts, UMCG, Open University, The Netherlands
⁶Health-Ecore, University of Groningen, UMCG, Open University, The Netherlands

Description

Background: Automating aspects of the systematic literature review (SLR) process could lead to better-informed and more up-to-date medical decision-making, and therefore to improved health outcomes for individual patients. Machine learning automation has been proposed to reduce the workload and potentially enhance the quality of SLRs. However, the level of inter-reviewer reliability (IRR) in both human-performed and machine learning-assisted SLRs is still unclear.
Objectives: Our main objective is to assess the IRR reported in SLRs. Our secondary objective is to determine the IRR that authors of SLRs expect for both human and machine-assisted reviews.
Methods: In the first part of this study, we performed a review of SLRs of randomized controlled trials using the PubMed and Embase databases. For all eligible studies, we extracted the IRR, expressed as Cohen’s kappa, for abstract/title screening, full-text screening, and data extraction, together with review team size, number of items screened, and quality of the review as assessed with the AMSTAR 2 rating. In the second part, we surveyed authors of SLRs on the IRR they expect for machine learning-assisted and human-performed SLRs.
Results: In total, 44 eligible articles were included. The average reported Cohen’s kappa was 0.8 (SD = 0.13, n = 10) for abstract screening, 0.76 (SD = 0.17, n = 14) for full-text screening, 0.86 (SD = 0.08, n = 16) for the whole screening process, and 0.88 (SD = 0.08, n = 16) for data extraction. No association was observed between IRR and team size, number of items screened, or quality of the SLR. The survey (n = 37) showed similar expected Cohen’s kappa values, ranging between 0.6 and 0.9, for both human and machine learning-assisted SLRs. No trend was observed between reviewer experience and expected IRR. In general, authors expect a better-than-average IRR for machine learning-assisted SLRs compared with human-based SLRs in both screening and data extraction.
Conclusions: IRR is rarely reported in human-performed SLRs. This mixed-methods review gives initial guidance on the human IRR benchmark, which could serve as a minimum threshold for IRR in machine learning-assisted SLRs.
Patient, public, and/or healthcare consumer involvement: NA.
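
For readers unfamiliar with the IRR measure used in this abstract, the following is a minimal sketch of how Cohen’s kappa can be computed for two reviewers making include/exclude screening decisions. The function name `cohens_kappa` and the ten example decisions are hypothetical illustrations, not data or code from the study. Kappa is the observed agreement corrected for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(ratings_a)

    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and this sketch reports it as perfect agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Invented screening decisions for ten abstracts (illustrative only).
reviewer_a = ["include"] * 4 + ["exclude"] * 6
reviewer_b = ["include"] * 3 + ["exclude"] * 6 + ["include"]

kappa = cohens_kappa(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this invented example the reviewers agree on 8 of 10 abstracts (p_o = 0.80), chance agreement is p_e = 0.52, and kappa is therefore about 0.58, which would fall below the average values reported above.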

Session Details

Determining the inter-reviewer reliability of literature screening for human and machine-assisted systematic reviews: A mixed methods review