Abstract
Background Merging disparate and heterogeneous datasets from clinical routine into a standardized
and semantically enriched format that enables multiple uses of the data also means incorporating
unstructured data such as medical free texts. Although the extraction of structured
data from texts by means of natural language processing (NLP) has been researched
extensively, at least for the English language, producing structured output in an
arbitrary format is not enough. NLP techniques need to be combined with clinical information standards
such as openEHR so that previously unstructured data can be reused and exchanged in a meaningful way.
Objectives The aim of the study is to automatically extract crucial information from medical
free texts and to transform this unstructured clinical data into a standardized and
structured representation by designing and implementing an exemplary pipeline for
the processing of pediatric medical histories.
Methods We constructed a pipeline that allows reusing medical free texts such as pediatric
medical histories in a structured and standardized way by (1) selecting and modeling
appropriate openEHR archetypes as standard clinical information models, (2) defining
a German dictionary with crucial text markers serving as an expert knowledge base for
an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes.
The approach was evaluated in a first pilot study by using 50 manually annotated medical
histories from the pediatric intensive care unit of the Hannover Medical School.
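The three steps above can be illustrated with a minimal sketch. It is not the pipeline described here: the dictionary entries, the concept names, the element names, and the choice of archetypes are all assumptions made for illustration, and the abstract does not state which archetypes or text markers were actually used.

```python
import re
from dataclasses import dataclass

# Hypothetical dictionary of German text markers (lowercase surface form
# -> normalized concept). The real dictionary is not given in the abstract.
MARKERS = {
    "frühgeburt": "preterm_birth",
    "frühgeborenes": "preterm_birth",
    "asthma": "asthma",
    "krampfanfall": "seizure",
}

# Hypothetical mapping rules: concept -> (archetype ID, element name).
# The archetype ID is only an example; the element name is a plain label,
# not a real archetype path.
MAPPING_RULES = {
    "preterm_birth": ("openEHR-EHR-EVALUATION.problem_diagnosis.v1", "Problem/Diagnosis name"),
    "asthma": ("openEHR-EHR-EVALUATION.problem_diagnosis.v1", "Problem/Diagnosis name"),
    "seizure": ("openEHR-EHR-EVALUATION.problem_diagnosis.v1", "Problem/Diagnosis name"),
}


@dataclass
class Entry:
    archetype_id: str
    element: str
    value: str
    source_text: str


def extract_markers(text):
    """Return (concept, matched surface form) pairs found in the text."""
    hits = []
    for token in re.findall(r"\w+", text.lower()):
        concept = MARKERS.get(token)
        if concept is not None:
            hits.append((concept, token))
    return hits


def map_to_archetypes(hits):
    """Apply the mapping rules to the NLP output; unmapped concepts are skipped."""
    entries = []
    for concept, surface in hits:
        rule = MAPPING_RULES.get(concept)
        if rule is None:
            continue
        archetype_id, element = rule
        entries.append(Entry(archetype_id, element, concept, surface))
    return entries


history = "Frühgeburt in der 32. SSW, bekanntes Asthma."
entries = map_to_archetypes(extract_markers(history))
for entry in entries:
    print(entry)
```

Keeping the dictionary and the mapping rules as separate tables means that new markers or archetypes can be added without touching the extraction code, which is one plausible way to obtain the extensibility the abstract mentions.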
Results We successfully reused 24 existing international archetypes to represent the most
crucial elements of unstructured pediatric medical histories in a standardized form.
The self-developed NLP pipeline was constructed by defining 3,055 text marker entries,
132 text events, 66 regular expressions, and a text corpus consisting of 776 entries
for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented
to transform the extracted snippets into an openEHR-based representation so that they
can be stored together with other structured data in an existing openEHR-based data
repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94%
recall.
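For readers unfamiliar with the two metrics, the following sketch shows how precision and recall can be computed by comparing pipeline output with manual annotations. The abstract does not say how matches were counted, so exact matching on (document, concept) pairs is an assumption, and the toy data below is hypothetical and unrelated to the reported figures.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall over sets of (document, concept) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: share of extracted items that the annotators also marked.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of annotated items that the pipeline found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations and pipeline output for two documents.
gold = {
    ("doc1", "asthma"),
    ("doc1", "preterm_birth"),
    ("doc2", "seizure"),
    ("doc2", "asthma"),
}
predicted = {
    ("doc1", "asthma"),
    ("doc1", "preterm_birth"),
    ("doc2", "seizure"),
    ("doc2", "fever"),
}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```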
Conclusion The use of NLP and openEHR archetypes proved to be a viable approach for extracting
and representing important information from pediatric medical histories in a structured
and semantically enriched format. We designed a promising approach with potential
to be generalized, and implemented a prototype that is extensible and reusable for
other use cases concerning German medical free texts. In the long term, this will help harness
unstructured clinical data for further research purposes such as the design of clinical
decision support systems. Together with structured data already integrated in openEHR-based
representations, we aim to develop an interoperable openEHR-based application that
is capable of automatically assessing a patient's risk status based on the patient's
medical history at the time of admission.
Keywords
natural language processing - clinical decision support systems - openEHR - pediatric
intensive care - medical history taking