Abstract
Background Multisite research networks such as the project “Collaboration on Rare Diseases”
connect various hospitals to obtain sufficient data for clinical research. However,
data quality (DQ) remains a challenge for the secondary use of data recorded in different
health information systems. High levels of DQ as well as appropriate quality assessment
methods are needed to support the reuse of such distributed data.
Objectives This work aims to develop an interoperable methodology for assessing
the quality of data recorded in heterogeneous sources, in order to improve the quality of rare
disease (RD) documentation and support clinical research.
Methods We first developed a conceptual framework for DQ assessment. Using this theoretical
guidance, we implemented a software framework that provides appropriate tools for
calculating DQ metrics and for generating local as well as cross-institutional reports.
We further applied our methodology to synthetic data distributed across multiple hospitals
using the Personal Health Train. Finally, we used precision and recall as metrics to validate
our implementation.
Results Four DQ dimensions were defined and represented as disjoint ontological categories.
Based on these top-level dimensions, 9 DQ concepts, 10 DQ indicators, and 25 DQ parameters
were developed and applied to different data sets. All randomly introduced DQ issues were
identified and reported automatically. The generated reports show the resulting
DQ indicators and detected DQ issues.
Conclusion Our approach yields promising results and can be used for local
and cross-institutional DQ assessments. The developed frameworks provide useful methods
for interoperable and privacy-preserving assessments of DQ that meet the specified
requirements. This study has demonstrated that our methodology is capable of detecting
DQ issues such as ambiguity or implausibility of coded diagnoses. It can be used for
DQ benchmarking to improve the quality of RD documentation and to support clinical
research on distributed data.
Keywords
Data quality - rare disease - distributed analysis - ontology - semantic interoperability - healthcare standards