An open competition involving thousands of competitors failed to construct useful search filters for new diagnostic test accuracy systematic reviews
2 Division of Rehabilitation, Department of Clinical Practice and Support, Hiroshima University Hospital, Japan
3 Department of Orthopedic Surgery, Miyamoto Orthopedic Hospital, Japan
4 Department of Psychiatry, Okayama Psychiatric Medical Center, Japan
5 Department of Emergency Medicine, National Hospital Organization Mito Medical Center, Japan
6 Division of Respiratory Medicine, Saiseikai Kumamoto Hospital, Japan
7 Department of Psychiatry, Seichiryo Hospital, Japan
8 Oku Medical Clinic, Japan
9 Department of Rehabilitation Medicine I, Fujita Health University School of Medicine, Japan
10 Center for Advanced IBD Research and Treatment, Kitasato University Kitasato Institute Hospital, Japan
11 Hospital Care Research Unit, Hyogo Prefectural Amagasaki General Medical Center, Japan
12 Department of Radiology, National Center for Geriatrics and Gerontology, Japan
13 Department of Emergency and General Internal Medicine, Fujita Health University School of Medicine, Japan
14 Section of General Internal Medicine, Department of Emergency and General Internal Medicine, Fujita Health University School of Medicine, Japan
15 Department of Neurology, Fukushima Medical University, Japan
16 Department of Critical Care Medicine, Sakai City Medical Center, Japan
17 Independent researcher, Japan
18 Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine/School of Public Health, Japan
Background: No abstract classifier is currently available for new diagnostic test accuracy (DTA) systematic reviews to select primary DTA study abstracts from database search results.
Objectives: Our goal with the FILtering of diagnostic Test accuracy studies (FILTER) Challenge was to develop machine learning (ML) filters for new DTA systematic reviews through an open competition.
Methods: We conducted an open competition. We prepared a dataset comprising titles, abstracts, and the judgement on whether to retrieve the full text, drawn from 10 DTA reviews and one mapping review. We randomly split the data into a training set (n = 27,145; labeled as DTA, n = 632), a public test set (n = 20,417; labeled as DTA, n = 474), and a private test set (n = 20,417; labeled as DTA, n = 469). Participants developed models on the training set, then validated them on the public test set to refine their development process. Finally, we ranked the submitted models on the private test set. We prespecified the F-beta score with beta set to 7, together with recall, as the evaluation metrics, because a large beta favors filters that are less likely to miss relevant studies; F-beta ranking determined which models were honored. For external validation, we used a DTA review from cardiology (n = 7,722; labeled as DTA, n = 167).
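The F-beta score with beta = 7 weights recall roughly 49 times as heavily as precision, reflecting the priority of not missing DTA abstracts. A minimal sketch of the metric, using hypothetical confusion counts (not figures from the competition):

```python
def fbeta(tp: int, fp: int, fn: int, beta: float = 7.0) -> float:
    """F-beta score from confusion counts; beta > 1 weights recall over precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical screening result: 150 DTA abstracts retrieved, 600 false
# positives, 17 DTA abstracts missed. Precision is only 0.20, but recall
# is ~0.90, so with beta = 7 the score stays high.
print(round(fbeta(150, 600, 17), 4))  # prints 0.8396
```

The same value can be obtained with `sklearn.metrics.fbeta_score(y_true, y_pred, beta=7)` on label vectors; the closed form above makes the recall weighting explicit.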
Results: The challenge ran from July 28 to October 4, 2021. We received a total of 13,774 submissions from 1,429 teams or individuals and honored the top three models. In the external validation set, the F-beta scores and recall were 0.4036 and 0.2352 for the first model, 0.3262 and 0.3313 for the second, and 0.3891 and 0.3976 for the third, respectively.
Conclusions: We were unable to develop a search filter with sufficient recall for immediate application to new DTA reviews. Further studies are needed to update and validate filters with datasets from other clinical areas. Patient, public, and/or healthcare consumer involvement: None.