Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing

Antje Wulff; Marcel Mast; Marcus Hassler; Sara Montag; Michael Marschollek; Thomas Jack

doi:10.1055/s-0040-1716403


CC BY-NC-ND 4.0 · Methods Inf Med 2020; 59(S 02): e64-e78

Original Article


Antje Wulff¹, Marcel Mast¹, Marcus Hassler², Sara Montag¹, Michael Marschollek¹, Thomas Jack³

¹ Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Hannover, Germany
² Econob, Informationsdienstleistungs GmbH, Klagenfurt am Wörthersee, Austria
³ Department of Pediatric Cardiology and Intensive Care Medicine, Hannover Medical School, Hannover, Germany


Abstract

Background Merging disparate and heterogeneous datasets from clinical routine into a standardized and semantically enriched format that enables multiple uses of the data also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts by means of natural language processing (NLP) has been researched extensively, at least for the English language, obtaining structured output in an arbitrary format is not enough. NLP techniques need to be combined with clinical information standards such as openEHR so that hitherto unstructured data can be reused and exchanged sensibly.

Objectives The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories.

Methods We constructed a pipeline that allows reusing medical free texts such as pediatric medical histories in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as an expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School.
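
The three steps above can be pictured with a minimal sketch. It is not the authors' implementation: the text markers, the regular expression, the archetype identifiers, and the element paths below are all hypothetical placeholders chosen for illustration, and the sketch ignores negation, spelling correction, and temporal context.

```python
import re
from dataclasses import dataclass

# Hypothetical German text markers mapped to a normalized concept.
# These are illustrative entries, not the dictionary used in the study.
TEXT_MARKERS = {
    "fieber": "fever",
    "husten": "cough",
    "erbrechen": "vomiting",
}

# Hypothetical pattern for a body temperature such as "39,5 °C".
TEMPERATURE_RE = re.compile(r"(\d{2}(?:[.,]\d)?)\s*°\s*C", re.IGNORECASE)

# Hypothetical mapping rules: concept -> (archetype id, element path).
# Identifiers and paths are placeholders, not those used in the study.
MAPPING_RULES = {
    "fever": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "/items[symptom_name]/value"),
    "cough": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "/items[symptom_name]/value"),
    "body_temperature": (
        "openEHR-EHR-OBSERVATION.body_temperature.v2",
        "/data/events/data/items[temperature]/value",
    ),
}


@dataclass(frozen=True)
class Snippet:
    concept: str
    value: str


def extract(text: str) -> list[Snippet]:
    """Find dictionary markers and temperature values in a free text."""
    snippets = []
    lowered = text.lower()
    for marker, concept in TEXT_MARKERS.items():
        if re.search(rf"\b{re.escape(marker)}\b", lowered):
            snippets.append(Snippet(concept, "present"))
    for match in TEMPERATURE_RE.finditer(text):
        # Normalize the German decimal comma to a decimal point.
        snippets.append(Snippet("body_temperature", match.group(1).replace(",", ".")))
    return snippets


def to_archetype_entries(snippets: list[Snippet]) -> list[dict]:
    """Apply the mapping rules; snippets without a rule are skipped."""
    entries = []
    for snippet in snippets:
        rule = MAPPING_RULES.get(snippet.concept)
        if rule is None:
            continue
        archetype_id, path = rule
        entries.append({"archetype_id": archetype_id, "path": path, "value": snippet.value})
    return entries


snippets = extract("Seit 2 Tagen Fieber bis 39,5 °C und Husten.")
entries = to_archetype_entries(snippets)
```

Keeping the dictionary, the patterns, and the mapping rules as separate data structures mirrors the separation of the three steps: each can be extended without touching the other two.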

Results We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus consisting of 776 entries for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets into an openEHR-based representation so that they can be stored together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.
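
For readers unfamiliar with the two evaluation measures, the following sketch shows how precision and recall are commonly computed from a set of extracted annotations and a manually annotated gold standard. The function name and the toy annotations are assumptions for illustration; the abstract does not describe the exact matching criteria used in the study.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision = TP / predicted items, recall = TP / gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy annotations as (document id, concept) pairs; purely illustrative.
predicted = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "vomiting")}
gold = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "fever"), ("doc2", "vomiting")}

precision, recall = precision_recall(predicted, gold)
```

In this toy case every extracted annotation is correct, but one gold annotation is missed, so precision is higher than recall.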

Conclusion The use of NLP and openEHR archetypes was demonstrated as a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In the long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at the time of admission.

Keywords

natural language processing - clinical decision support systems - openEHR - pediatric intensive care - medical history taking

