Determining the inter-reviewer reliability of literature screening for human and machine-assisted systematic reviews: A mixed methods review

Session Type
Evidence synthesis innovations and technology
Hanegraaf P1, Wondimu A2, Mosselman JJ1, De Jong R1, Abogunrin S3, Queiros L3, Lane M3, Postma M4, Boersma C5, Van der Schans J6
1Pitts, The Netherlands
2Health-Ecore, Ethiopia
3Roche, Switzerland
4Health-Ecore, UMCG, University of Groningen, The Netherlands
5Health-Ecore, Pitts, UMCG, Open University, The Netherlands
6Health-Ecore, University of Groningen, UMCG, Open University, The Netherlands

Background: Automating aspects of the systematic literature review (SLR) process could lead to better-informed and more up-to-date medical decision making, and therefore to improved health outcomes for individual patients. Machine learning automation has been proposed to reduce the workload and potentially enhance the quality of SLRs. However, the level of inter-reviewer reliability (IRR) in both human and machine learning-automated SLRs is not yet clear.
Objectives: Our main objective is to assess the IRR reported in SLRs. Our secondary objective is to determine the IRR that authors of SLRs expect for both human and machine-assisted reviews.
Methods: In the first part of this study, we performed a review of SLRs of randomized controlled trials using the PubMed and Embase databases. From all eligible studies, we extracted the IRR, expressed as Cohen’s kappa, for title/abstract screening, full-text screening, and data extraction, together with review team size, number of items screened, and review quality as assessed with the AMSTAR 2 rating. In the second part, we surveyed authors of SLRs about the IRR they expect in machine learning-assisted and human-performed SLRs.
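For readers unfamiliar with the statistic, the following is a minimal sketch of how Cohen’s kappa could be computed from two reviewers' screening decisions. It is not the code used in this study; the function name `cohens_kappa` and the example decisions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label
    # proportions, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("Kappa is undefined when chance agreement equals 1.")

    return (observed - expected) / (1 - expected)


# Hypothetical title/abstract decisions from two reviewers (illustrative only).
reviewer_a = ["include", "include", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "exclude", "exclude"]
reviewer_b = ["include", "exclude", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "exclude", "include"]

kappa = cohens_kappa(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this made-up example the reviewers agree on 8 of 10 records, but because most records are excluded by both, chance agreement is already 0.58, so kappa is noticeably lower than the raw agreement rate.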
Results: In total, 44 eligible articles were included. The average Cohen’s kappa score reported was 0.8 (SD = 0.13, n = 10) for abstract screening, 0.76 (SD = 0.17, n = 14) for full-text screening, 0.86 (SD = 0.08, n = 16) for the whole screening process, and 0.88 (SD = 0.08, n = 16) for data extraction. No association was observed between IRR and team size, number of items screened, or quality of the SLR. The survey (n = 37) showed similar expected Cohen’s kappa values, ranging from 0.6 to 0.9, for both human and machine learning-assisted SLRs. No trend was observed between reviewer experience and expected IRR. In general, authors expect a better-than-average IRR for machine learning-assisted SLRs than for human-based SLRs, in both screening and data extraction.
Conclusions: Reporting IRR is uncommon in human-performed SLRs. This mixed-methods review offers initial guidance on the human IRR benchmark, which could serve as a minimum threshold for IRR in machine learning-assisted SLRs.
Patient, public and/or healthcare consumer involvement: NA