Methods Inf Med
DOI: 10.1055/s-0041-1739361
Original Article

A Semi-Automated Term Harmonization Pipeline Applied to Pulmonary Arterial Hypertension Clinical Trials

Ryan J. Urbanowicz
1   Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States
,
John H. Holmes
1   Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States
,
Dina Appleby
1   Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States
,
Vanamala Narasimhan
2   Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
,
Stephen Durborow
1   Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States
,
Nadine Al-Naamani
3   Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
,
Melissa Fernando
1   Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States
,
Steven M. Kawut
3   Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
› Author Affiliations
Funding This work was supported by Cardiovascular Medical Research and Education Fund (CMREF), Aldrighetti Research Award for Young Investigators, and NIH 2K24HL103844–06 and 5K23HL141584–03.

Abstract

Objective Data harmonization is essential to integrate individual participant data from multiple sites, time periods, and trials for meta-analysis. The process of mapping terms and phrases to an ontology is complicated by typographic errors, abbreviations, truncation, and plurality. We sought to harmonize medical history (MH) and adverse events (AE) term records across 21 randomized clinical trials in pulmonary arterial hypertension and chronic thromboembolic pulmonary hypertension.

Methods We developed and applied a semi-automated harmonization pipeline for use with domain-expert annotators to resolve ambiguous term mappings using exact and fuzzy matching. We summarized MH and AE term mapping success, including map quality measures, and imputation of a generalizing term hierarchy as defined by the applied Medical Dictionary for Regulatory Activities (MedDRA) ontology standard.

Results Over 99.6% of both MH (N = 37,105) and AE (N = 58,170) records were successfully mapped to MedDRA low-level terms. Automated exact matching accounted for 74.9% of MH and 85.5% of AE mappings. Term recommendations from fuzzy matching in the pipeline facilitated annotator mapping of the remaining 24.9% of MH and 13.8% of AE records. Imputation of the generalized MedDRA term hierarchy was unambiguous in 85.2% of high-level terms, 99.4% of high-level group terms, and 99.5% of system organ class in MH, and 75% of high-level terms, 98.3% of high-level group terms, and 98.4% of system organ class in AE.

Conclusion This pipeline dramatically reduced the burden of manual annotation for MH and AE term harmonization and could be adapted to other data integration efforts.

Supplementary Material



Publication History

Received: 08 July 2021

Accepted: 04 October 2021

Article published online:
24 November 2021

© 2021. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany