Inter-rater reliability of AMSTAR-2 in a review of systematic reviews about interventions to prevent adverse events in the intensive care unit
2Clinical Epidemiology and Public Health Service. Hospital de la Santa Creu i Sant Pau. Institut de Recerca IIB Sant Pau, Spain
3Vall D'Hebron University Hospital, Spain
4Iberoamerican Cochrane Centre. Department of Social Medicine and Family Health, Universidad Del Cauca, Colombia, Spain
5Hospital Universitario de Bellvitge, Instituto Catala de Salut, Nursing Research Group, Bellvitge Institute for Biomedical Research, Spain
6Iberoamerican Cochrane Centre-IIB Sant Pau. CIBERESP, Spain
Background: AMSTAR-2 (A MeaSurement Tool to Assess systematic Reviews) is a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. It comprises 16 items (7 critical and 9 non-critical). Discordances between reviewers are typically quantified with a kappa statistic and resolved by a third reviewer, which demands additional time and effort from the researchers completing an overview.
Objectives: To evaluate the inter-rater reliability of AMSTAR-2, measured with the weighted kappa statistic.
Methods: We assessed methodological quality with the AMSTAR-2 tool in an overview of systematic reviews about interventions to prevent adverse events in the intensive care unit (1). The study team was divided into pairs to evaluate 38 systematic reviews. We measured inter-rater agreement between reviewers: a weighted kappa score for agreement between pairs of raters was calculated and compared by study and by AMSTAR-2 item.
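The two agreement measures named in the Methods, percentage agreement and weighted kappa, can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the three-level ordinal coding of AMSTAR-2 responses, the linear disagreement weights, and the function names are assumptions, since the abstract does not state the weighting scheme or the software used.

```python
from collections import Counter

# Assumed ordinal coding of AMSTAR-2 responses (not stated in the abstract).
CATEGORIES = ["No", "Partial Yes", "Yes"]


def percent_agreement(rater_a, rater_b):
    """Percentage of ratings on which both raters gave the same response."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def weighted_kappa(rater_a, rater_b, categories=CATEGORIES):
    """Cohen's weighted kappa with linear disagreement weights (assumed scheme)."""
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Joint counts of (rater A category, rater B category) and the marginals.
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    obs_dis = 0.0  # observed weighted disagreement
    exp_dis = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 for No vs Yes
            obs_dis += w * observed[(i, j)] / n
            exp_dis += w * (marg_a[i] / n) * (marg_b[j] / n)

    if exp_dis == 0:
        # Both raters used one and the same category throughout: kappa is undefined.
        return float("nan")
    return 1.0 - obs_dis / exp_dis


# Hypothetical ratings for one pair of reviewers (illustrative data only).
reviewer_1 = ["Yes", "Partial Yes", "No", "Yes", "No"]
reviewer_2 = ["Yes", "Yes", "No", "Partial Yes", "No"]
print(percent_agreement(reviewer_1, reviewer_2))
print(weighted_kappa(reviewer_1, reviewer_2))
```

Linear weights give partial credit when one rater answers "Partial Yes" and the other "Yes" or "No"; quadratic weights would be an equally plausible choice and would yield different values.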
Results: Agreement between reviewers was high (77.6%), with a good strength of agreement (kw = 0.65, p < .01). These results were consistent for critical items (74.3%, kw = 0.64, p < .01) and non-critical items (80.9%, kw = 0.62, p < .01). The critical items with the least agreement were those on risk of bias and on the assessment of heterogeneity in non-randomised studies (items 9.2 and 11.2, respectively). The non-critical items with the least agreement were the explanation of the selection of study designs and the detailed description of the included studies (items 3 and 8, respectively).
Conclusions: Our results are in line with the AMSTAR-2 development and validation study (2). The levels of agreement achieved by the pairs of raters varied across items, but they were moderate to substantial for most items. Differences between raters reflect the demanding nature of some item-level judgments and should prompt group discussion of their causes and importance and, if needed, consultation with experts in the subject matter and in methods. Prior training of reviewers in the AMSTAR-2 instrument is necessary to maximise consensus when they apply it individually.
Patient, public, and/or healthcare consumer involvement: No.
References:
1. Suclupe et al. Aust Crit Care. 2022; S1036-7314(22)00237-5; doi:10.1016/j.aucc.2022.11.003
2. Shea et al. BMJ. 2017;358:j4008; doi:10.1136/bmj.j4008