Transparent and traceable data science-based real-world evidence (DS-RWE) producing: framework and practice in traditional Chinese medicine
Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, China
Background: Real-world data, including electronic medical records (EMRs), have the potential to provide evidence with high external validity. Clinical prediction models are one type of evidence that can take full advantage of such data. However, producing trustworthy evidence from real-world data can be challenging. Advanced data science methods may help address this challenge by enhancing the trustworthiness of evidence generated from real-world data.
Objectives: To propose a framework for transparent and traceable data science-based real-world evidence (DS-RWE) production, illustrated with an example of diagnostic model development for traditional Chinese medicine (TCM) syndrome pattern differentiation.
Methods: We propose a framework consisting of EMR repositories, an AI-engaged data cleaner, pre-defined clinical problems, transparent method rules, and a regularly updated evidence interface with a version control function (Figure 1). The framework allows efficient extraction of data from EMR repositories, with data accuracy ensured through AI-engaged data cleaning. Pre-defined clinical problems ensure the specificity of the evidence produced, whereas transparent method rules support the trustworthiness of the results. The regularly updated evidence interface with a version control function keeps the evidence produced up to date and traceable.
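As an illustration of how a version-controlled evidence interface could make results traceable, the following minimal sketch records each published result together with fingerprints of the data snapshot and the method rules that produced it. The names `fingerprint`, `EvidenceVersion`, and `EvidenceRegistry` are hypothetical and are not taken from the framework itself; the sketch only assumes that records and rules can be serialized to JSON.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class EvidenceVersion:
    """One published piece of evidence and the inputs it can be traced to."""
    version: int
    clinical_problem: str
    data_hash: str    # fingerprint of the cleaned data snapshot
    rules_hash: str   # fingerprint of the pre-defined method rules
    result: dict


class EvidenceRegistry:
    """Append-only log of evidence versions (hypothetical interface)."""

    def __init__(self):
        self._log = []

    def publish(self, clinical_problem, records, method_rules, result):
        """Append a new version; earlier versions are never overwritten."""
        entry = EvidenceVersion(
            version=len(self._log) + 1,
            clinical_problem=clinical_problem,
            data_hash=fingerprint(records),
            rules_hash=fingerprint(method_rules),
            result=result,
        )
        self._log.append(entry)
        return entry

    def latest(self):
        """Return the most recently published version."""
        return self._log[-1]

    def history(self):
        """Return all versions as plain dictionaries, oldest first."""
        return [asdict(entry) for entry in self._log]
```

In this sketch, two versions published under the same method rules share a `rules_hash` but differ in `data_hash` when the underlying records change, so a reader can tell whether an updated result reflects new data, new rules, or both.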
Results: Following the framework, we constructed a DS-RWE pipeline for TCM syndrome pattern diagnosis using 478,862 EMRs from 298,632 patients in our data repository. The AI-engaged data cleaner consists of Bidirectional Encoder Representations from Transformers (BERT), regular expressions, and a series of factor regulators. A pre-defined model development workflow for syndrome pattern prediction was established, and AutoGluon, an automated machine learning (AutoML) tool, was applied for model development. Updating all the diagnostic models took approximately one day.
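The regular-expression component of such a data cleaner might look like the minimal sketch below. The specific rules are not described in this abstract, so the patterns and the function names `normalize_text` and `extract_temperature` are illustrative assumptions only; the BERT component and the factor regulators are not shown.

```python
import re

# Hypothetical rules for illustration; the actual cleaning rules are not
# specified in the abstract.
_WHITESPACE = re.compile(r"\s+")
_TEMPERATURE = re.compile(
    r"\b(?:T|temp(?:erature)?)\s*[:=]?\s*(\d{2}(?:\.\d)?)\s*(?:°\s*C|℃)",
    re.IGNORECASE,
)
# Full-width punctuation that often appears in Chinese-language EMR text.
_FULLWIDTH = {":": ":", ",": ",", ";": ";"}


def normalize_text(text: str) -> str:
    """Map full-width punctuation to ASCII and collapse runs of whitespace."""
    for wide, narrow in _FULLWIDTH.items():
        text = text.replace(wide, narrow)
    return _WHITESPACE.sub(" ", text).strip()


def extract_temperature(text: str):
    """Return the first body temperature in degrees Celsius, or None."""
    match = _TEMPERATURE.search(normalize_text(text))
    return float(match.group(1)) if match else None
```

For example, `extract_temperature("T:38.5℃,  cough")` first normalizes the full-width colon and then returns `38.5`, while a note without a temperature mention returns `None`.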
Conclusions: The proposed framework provides a transparent and traceable pipeline for producing trustworthy evidence from real-world data through data science methods. It may contribute to the development of more robust evidence production methods in real-world research, providing patients with trustworthy, up-to-date, and traceable evidence for their care.