Definition of a Practical Taxonomy for Referencing Data Quality Problems in Health Care Databases
Introduction Health care information systems can generate and/or record huge volumes of data, some of which may be reused for research, clinical trials, or teaching. However, these databases can be affected by data quality problems; hence, an important step in the data reuse process consists of detecting and rectifying these problems. To facilitate the assessment of data quality, we developed a taxonomy of data quality problems in operational databases.
Material We searched the literature for publications that mentioned “data quality problems,” “data quality taxonomy,” “data quality assessment,” or “dirty data.” The publications were then reviewed, compared, summarized, and structured using a bottom-up approach to provide an operational taxonomy of data quality problems. These problems were illustrated with fictional but realistic examples from clinical databases.
Results Twelve publications were selected, and 286 instances of data quality problems were identified and classified according to six distinct levels of granularity. We used the classification defined by Oliveira et al to structure our taxonomy. The extracted items were grouped into 53 data quality problems.
Discussion This taxonomy facilitated the systematic assessment of data quality in databases by presenting data quality problems according to their level of granularity. The definition of this taxonomy is the first step in the data cleaning process. The subsequent steps include the definition of associated quality assessment methods and data cleaning methods.
Conclusion Our new taxonomy enabled the classification and illustration of 53 data quality problems found in hospital databases.
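To make the idea of granularity more concrete, the following minimal Python sketch checks a few fictional records at three different scopes: a single value, a single value against a plausibility range, and several records compared with each other. It is an illustration only. The field names, the plausibility thresholds, and the three problem types (missing value, implausible value, duplicated identifier) are assumptions chosen for the example; they are not taken from the 53 problems of the taxonomy, nor do they reproduce its six levels.

```python
from collections import Counter

# Fictional patient records; all field names and values are hypothetical.
records = [
    {"patient_id": 1, "birth_date": "1980-04-12", "weight_kg": 72.0},
    {"patient_id": 2, "birth_date": None, "weight_kg": 68.5},
    {"patient_id": 3, "birth_date": "1975-11-30", "weight_kg": 720.0},
    {"patient_id": 3, "birth_date": "1975-11-30", "weight_kg": 720.0},
]


def missing_values(rows, field):
    """Scope: one value at a time. Return indices of rows where `field` is missing."""
    return [i for i, row in enumerate(rows) if row.get(field) is None]


def out_of_range(rows, field, low, high):
    """Scope: one value at a time. Return indices of rows whose `field` lies outside [low, high]."""
    return [
        i
        for i, row in enumerate(rows)
        if row.get(field) is not None and not (low <= row[field] <= high)
    ]


def duplicate_keys(rows, key):
    """Scope: several rows. Return the sorted key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# The plausibility bounds below are an assumption made for this example.
print(missing_values(records, "birth_date"))
print(out_of_range(records, "weight_kg", 1.0, 400.0))
print(duplicate_keys(records, "patient_id"))
```

The first two checks inspect each value in isolation, whereas the third can only be evaluated by comparing records with each other, which is why a classification by granularity helps when choosing an assessment method.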
Received: 29 June 2022
Accepted: 02 November 2022
Accepted Manuscript online: 10 November 2022
Article published online: 09 January 2023
© 2023. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
- 1 Adler-Milstein J, Nong P. Early experiences with patient generated health data: health system and patient perspectives. J Am Med Inform Assoc 2019; 26 (10) 952-959
- 2 Weng C, Kahn MG. Clinical research informatics for big data and precision medicine. Yearb Med Inform 2016; (01) 211-218
- 3 Weiner MG, Embi PJ. Toward reuse of clinical data for research and quality improvement: the end of the beginning? Ann Intern Med 2009; 151 (05) 359-360
- 4 Nunez CM. Advanced techniques for anesthesia data analysis. Seminars Anesthesia Perioperative Medicine and Pain 2004; 23 (02) 121-124
- 5 Prokosch HU, Ganslandt T. Perspectives for medical informatics. Reusing the electronic medical record for clinical research. Methods Inf Med 2009; 48 (01) 38-44
- 6 Ebidia A, Mulder C, Tripp B, Morgan MW. Getting data out of the electronic patient record: critical steps in building a data warehouse for decision support. Proc AMIA Symp 1999; 745-749
- 7 Safran C, Bloomrosen M, Hammond WE. et al; Expert Panel. Toward a national framework for the secondary use of health data: an American Medical Informatics Association White Paper. J Am Med Inform Assoc 2007; 14 (01) 1-9
- 8 Meystre SM, Lovis C, Bürkle T, Tognola G, Budrionis A, Lehmann CU. Clinical data reuse or secondary use: current status and potential future progress. Yearb Med Inform 2017; 26 (01) 38-52
- 9 McGlynn EA, Lieu TA, Durham ML. et al. Developing a data infrastructure for a learning health system: the PORTAL network. J Am Med Inform Assoc 2014; 21 (04) 596-601
- 10 Chazard E, Ficheur G, Caron A. et al. Secondary use of healthcare structured data: the challenge of domain-knowledge based extraction of features. Stud Health Technol Inform 2018; 255: 15-19
- 11 Wade TD. Refining gold from existing data. Curr Opin Allergy Clin Immunol 2014; 14 (03) 181-185
- 12 Miller JL. The EHR solution to clinical trial recruitment in physician groups. Health Manag Technol 2006; 27 (12) 22-25
- 13 Cai T, Cai F, Dahal KP. et al. Improving the efficiency of clinical trial recruitment using an ensemble machine learning to assist with eligibility screening. ACR Open Rheumatol 2021; 3 (09) 593-600
- 14 Altman M. The clinical data repository: a challenge to medical student education. J Am Med Inform Assoc 2007; 14 (06) 697-699
- 15 Dentler K, ten Teije A, de Keizer N, Cornet R. Barriers to the reuse of routinely recorded clinical data: a field report. Stud Health Technol Inform 2013; 192: 313-317
- 16 Redman TC. The impact of poor data quality on the typical enterprise. Commun ACM 1998; 41 (02) x
- 17 Rahm E, Do H. Data cleaning: problems and current approaches. IEEE Data Eng Bull
- 18 Oliveira P, Rodrigues F, Henriques P. A formal definition of data quality problems. In: ICIQ; 2005
- 19 Kim W, Choi BJ, Hong EK, Kim SK, Lee D. A taxonomy of dirty data. Data Min Knowl Discov 2003; 7 (01) 81-99
- 20 Li L, Peng T, Kennedy J. A Rule Based Taxonomy of Dirty Data. In: 2010 DOI: 10.5176/978-981-08-6308-1_D-035
- 21 Barateiro J, Galhardas H. A survey of data quality tools. Published online 2005. Accessed June 8, 2022 at: https://www.semanticscholar.org/paper/A-Survey-of-Data-Quality-Tools-Barateiro-Galhardas/1122bf09792b2cd93ef61d9fba24e2cbfd4e8325
- 22 Dasu T, Vesonder GT, Wright J. Data quality through knowledge engineering. In: KDD '03; 2003 DOI: 10.1145/956750.956844
- 23 Gschwandtner T, Gärtner J, Aigner W, Miksch S. A taxonomy of dirty time-oriented data. In: CD-ARES; 2012 DOI: 10.1007/978-3-642-32498-7_5
- 24 Diaz-Garelli F, Long A, Bancks MP, Bertoni AG, Narayanan A, Wells BJ. Developing a data quality standard primer for cardiovascular risk assessment from electronic health record data using the DataGauge process. AMIA Annu Symp Proc AMIA Symp 2021; 2021: 388-397
- 25 Wang Z, Talburt JR, Wu N, Dagtas S, Zozus MN. A rule-based data quality assessment system for electronic health record data. Appl Clin Inform 2020; 11 (04) 622-634
- 26 Weiskopf NG, Bakken S, Hripcsak G, Weng C. A data quality assessment guideline for electronic health record data reuse. EGEMS (Wash DC) 2017; 5 (01) 14
- 27 Henley-Smith S, Boyle D, Gray K. Improving a secondary use health data warehouse: proposing a multi-level data quality framework. EGEMS (Wash DC) 2019; 7 (01) 38
- 28 Kahn MG, Callahan TJ, Barnard J. et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC) 2016; 4 (01) 1244
- 29 Müller H, Freytag J. Problems, methods, and challenges in comprehensive data cleansing. Published online 2005. Accessed June 8, 2022 at: https://www.semanticscholar.org/paper/Problems-%2C-Methods-%2C-and-Challenges-in-Data-Mueller-Freytag/0168304c626a5b186bf559bf774a1dca52b04931
- 30 de Almeida WG, de Sousa RD, de Deus FD, Nze GDA, de Mendonça FLL. Taxonomy of data quality problems in multidimensional Data Warehouse models. Paper presented at: 8th Iberian Conference on Information Systems and Technologies; Lisbon, Portugal, June 19–22, 2013
- 31 Strong D, Lee YW, Wang RY. Data quality in context. Commun ACM Published online 1997
- 32 Woodall P, Oberhofer M, Borek A. A classification of data quality assessment and improvement methods. Int J Inf Qual 2014; 3 (04) 298