Methods Inf Med 2015; 54(05): 455-460
DOI: 10.3414/ME14-02-0030
Original Articles
Schattauer GmbH

A Generic Data Harmonization Process for Cross-linked Research and Network Interaction[*]

Construction and Application for the Lung Cancer Phenotype Database of the German Center for Lung Research
D. Firnkorn (1), M. Ganzinger (1), T. Muley (2, 4), M. Thomas (3, 4), P. Knaup (1)

1   Institute of Medical Biometry and Informatics, Heidelberg University, Heidelberg, Germany
2   Translational Research Unit, Thoraxklinik at University Hospital Heidelberg, Heidelberg, Germany
3   Department of Oncology, Thoraxklinik at University Hospital Heidelberg, Heidelberg, Germany
4   Translational Lung Research Centre Heidelberg (TLRC-H), Member of the German Centre for Lung Research (DZL)

Publication History

received: 16 December 2014

accepted: 01 September 2015

Publication Date:
22 January 2018 (online)

Summary

Objective: Joint data analysis is a key requirement in medical research networks. Data are held in heterogeneous formats at each network partner, and harmonizing them is often complex. The objective of this paper is to provide a generic approach to the harmonization process in research networks. We applied the process to harmonize data from three sites for the Lung Cancer Phenotype Database of the German Center for Lung Research.

Methods: We developed a spreadsheet-based tool to support the harmonization process for lung cancer data, and a data integration procedure based on Talend Open Studio.

Results: The harmonization process consists of eight steps describing a systematic approach for defining and reviewing source data elements and standardizing common data elements. The steps for defining common data elements and harmonizing them with local data definitions are repeated until consensus is reached. Application of this process for building the phenotype database led to a common basic data set on lung cancer with 285 structured parameters. The Lung Cancer Phenotype Database was realized as an i2b2 research data warehouse.
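
The idea of harmonizing local data definitions with common data elements can be illustrated with a minimal Python sketch. It is not the authors' spreadsheet tool or their Talend Open Studio procedure. The site names, field names, and value codes below are hypothetical, and the summary does not list the eight process steps, so none are reproduced here. The sketch only shows how a mapping table might translate site-specific fields into a common data element and collect unmapped values for the next review round.

```python
from dataclasses import dataclass


@dataclass
class ElementMapping:
    """One row of a hypothetical mapping sheet: local field -> common data element."""
    common_name: str   # name of the common data element
    local_name: str    # field name used at the site
    value_map: dict    # local value code -> common value code


# Hypothetical mapping sheets for two sites; all names and codes are invented.
SITE_MAPPINGS = {
    "site_a": [ElementMapping("sex", "gender", {"m": "male", "f": "female"})],
    "site_b": [ElementMapping("sex", "sex_code", {"1": "male", "2": "female"})],
}


def harmonize(site, record):
    """Return (harmonized record, list of unmapped (field, value) pairs)."""
    harmonized, unmapped = {}, []
    for mapping in SITE_MAPPINGS[site]:
        if mapping.local_name not in record:
            continue  # the site did not deliver this field
        raw = str(record[mapping.local_name]).strip().lower()
        if raw in mapping.value_map:
            harmonized[mapping.common_name] = mapping.value_map[raw]
        else:
            # Keep the original value so reviewers can extend the mapping.
            unmapped.append((mapping.local_name, record[mapping.local_name]))
    return harmonized, unmapped
```

In such a setup, the list of unmapped values would be the input to the next review round: the mapping is extended or the common definition is revised, and the records are harmonized again until no unmapped values remain.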

Conclusion: Data harmonization is a challenging task requiring informatics skills as well as domain knowledge. Our approach facilitates data harmonization by providing guidance through a uniform process that can be applied in a wide range of projects.

* Supplementary online material published on our website www.methods-online.com


 