CC BY-NC-ND 4.0 · Methods Inf Med 2020; 59(S 02): e64-e78
DOI: 10.1055/s-0040-1716403
Original Article

Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing

Antje Wulff¹, Marcel Mast¹, Marcus Hassler², Sara Montag¹, Michael Marschollek¹, Thomas Jack³

¹ Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Hannover, Germany
² Econob, Informationsdienstleistungs GmbH, Klagenfurt am Wörthersee, Austria
³ Department of Pediatric Cardiology and Intensive Care Medicine, Hannover Medical School, Hannover, Germany
Funding None.

Abstract

Background Merging disparate and heterogeneous datasets from clinical routine into a standardized and semantically enriched format, so that the data can be used for multiple purposes, also requires incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts, a task of natural language processing (NLP), has been researched extensively, at least for the English language, obtaining structured output in an arbitrary format is not enough. NLP techniques need to be combined with clinical information standards such as openEHR so that previously unstructured data can be reused and exchanged in a meaningful way.

Objectives The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories.

Methods We constructed a pipeline that allows reusing medical free texts such as pediatric medical histories in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers serving as an expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School.
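The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary entries, the regular expression, the archetype identifiers, and the element paths are hypothetical placeholders chosen for illustration, and the sketch ignores negation, context, and spelling correction.

```python
import re
from dataclasses import dataclass

# Step 2 (assumed shape): German text markers mapped to normalized concepts.
# These three entries are hypothetical, not taken from the study's dictionary.
TEXT_MARKERS = {
    "fieber": "fever",
    "husten": "cough",
    "frühgeburt": "preterm_birth",
}

# Hypothetical regular expression for a weight statement such as "Gewicht: 3,4 kg".
WEIGHT_RE = re.compile(r"gewicht\s*:?\s*(\d+(?:[.,]\d+)?)\s*kg", re.IGNORECASE)

# Steps 1 and 3 (assumed shape): concept -> (archetype id, element path).
# Ids and paths are illustrative placeholders, not the study's mapping rules.
MAPPING_RULES = {
    "fever": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "symptom/name"),
    "cough": ("openEHR-EHR-CLUSTER.symptom_sign.v1", "symptom/name"),
    "preterm_birth": ("openEHR-EHR-EVALUATION.problem_diagnosis.v1", "problem/name"),
    "body_weight_kg": ("openEHR-EHR-OBSERVATION.body_weight.v2", "weight/magnitude"),
}


@dataclass(frozen=True)
class Entry:
    """One extracted value bound to an archetype element."""
    archetype_id: str
    path: str
    value: object


def extract(text):
    """Return (concept, value) pairs found by dictionary lookup and regex."""
    findings = []
    lowered = text.lower()
    for marker, concept in TEXT_MARKERS.items():
        if marker in lowered:  # plain substring match; no negation handling
            findings.append((concept, True))
    match = WEIGHT_RE.search(text)
    if match:
        # German decimal comma is converted to a decimal point.
        findings.append(("body_weight_kg", float(match.group(1).replace(",", "."))))
    return findings


def to_entries(findings):
    """Apply the mapping rules; concepts without a rule are skipped."""
    return [Entry(*MAPPING_RULES[c], v) for c, v in findings if c in MAPPING_RULES]


if __name__ == "__main__":
    history = "Seit zwei Tagen Fieber und Husten. Gewicht: 3,4 kg."
    for entry in to_entries(extract(history)):
        print(entry)
```

Keeping the dictionary, the regular expressions, and the mapping rules as separate data structures mirrors the separation described in the Methods: the expert knowledge can grow without touching the extraction or mapping logic.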

Results We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus consisting of 776 entries for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets into an openEHR-based representation so that they can be stored together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.
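For readers unfamiliar with the two reported metrics, the following sketch shows how precision and recall can be computed against manual annotations. The abstract does not state the matching unit or the averaging scheme, so comparing sets of (document, concept) pairs over the whole corpus is an assumption made here; the function name and the sample data are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compare extracted items with manually annotated items.

    Both arguments are iterables of hashable items, assumed here to be
    (document id, concept) pairs. Precision is the share of extracted items
    that are correct; recall is the share of annotated items that were found.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical annotations, not data from the study.
    gold = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "fever"), ("doc2", "preterm_birth")}
    predicted = {("doc1", "fever"), ("doc2", "fever")}
    print(precision_recall(predicted, gold))
```

In this toy case every extracted item is correct but half of the annotated items are missed, which illustrates why the two figures are reported separately.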

Conclusion The use of NLP and openEHR archetypes was demonstrated as a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with the potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In the long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at the time of admission.

Authors' Contributions

A. W. was responsible for drafting the methodological approach, managed the overall project work, led the proof-of-concept evaluation, and authored the manuscript. M. M. developed the described NLP pipeline, designed the openEHR archetypes and template, and co-authored the manuscript. T. J. and S. M. provided clinical expertise for requirement analysis and dictionary construction. M. H. gave subject-specific advice on the design of NLP pipelines and provided the NLP software. M. M. provided further technical and medical expertise and, together with all authors, co-authored and proofread the manuscript. All authors read and approved the final manuscript.


Publication History

Received: May 12, 2020

Accepted: July 18, 2020

Article published online: October 14, 2020

© 2020. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial-License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/).

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

 