1 Introduction
The past decade has seen a revolutionary paradigm shift in Natural Language Processing (NLP), as a result of which Deep Learning (DL) (for a technical introduction, cf. [1]; for comprehensive surveys, cf. [2] and [3]) became the dominant mindset of researchers and developers in this field (for surveys, cf. [4], [5]). Yet, DL is by no means a new computational paradigm. Rather, it can be seen as the most recent offspring of neural computation in the evolution of computer science (cf. the historical background provided by Schmidhuber [6]). Unlike previous attempts, however, it now turns out to be extremely robust and effective for adequately dealing with the contents of unstructured visual [7], audio/speech [8], and textual data [9].
The success of Deep Neural Networks (DNNs) has many roots. Perhaps the most important methodological reason is that, with DNNs, manual feature selection and (semi-)automated feature engineering are abandoned. This time-consuming tuning step was both mandatory for and highly influential on the performance of earlier generations of ML systems in NLP based on Markov Models (MMs), Conditional Random Fields (CRFs), Support Vector Machines (SVMs), etc. In a DL system, by contrast, the relevant features (and their relative contribution to a classification decision) are computed automatically over thousands of iterative training cycles.
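To make this contrast concrete, the following minimal Python sketch juxtaposes the kind of hand-crafted token features a CRF-era tagger consumed with the trainable embedding layer that replaces them in a DL system (the feature set and dimensions are illustrative, not those of any cited system):

```python
# Hand-crafted token features, as typically fed to a CRF or SVM tagger
# (an illustrative feature set; actual systems varied widely).
def token_features(tokens, i):
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_capitalized": t[0].isupper(),
        "has_digit": any(c.isdigit() for c in t),
        "suffix3": t[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# In a DL tagger, such features are replaced by a trainable embedding
# layer whose weights are adjusted over many training iterations.
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=200)
```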
The ultimate reason for the success of DNNs, though, is a pragmatic criterion: system performance. Compared with results in biomedical Information Extraction (IE) obtained in previous years with standard ML methods, DL approaches profoundly changed the rules of the game. In a landslide manner, for the same task and domain, performance figures jumped to unprecedented levels, and DL systems consistently outperformed non-DL state-of-the-art (SOTA) systems by large margins across different tasks. Section 3 provides ample evidence for this claim and presents the new SOTA results with a deeper look at IE, a major application class of medical NLP (for alternative surveys, cf. [10], [11], [12]).
Despite the specialized hardware now at our disposal, training DNNs still requires tremendous computational resources and processing time. Luckily, for general NLP, large collections of language models (so-called embeddings) have already been trained on huge corpora (comprising hundreds of millions of Web-scraped documents, including newspaper and Wikipedia articles), so that these pre-compiled model resources can be readily reused when dealing with general-purpose language. But medical (and biological) language mirrors special-purpose language characteristics and comprises a large variety of sublanguages of its own.
This becomes obvious in Section 3, where we deal with scholarly scientific writing (with documents typically taken from PubMed). Here, differences from general language are mostly due to the use of a highly specialized technical vocabulary (covered by numerous terminologies, such as MeSH, SNOMED-CT, or ICD). Even more challenging are clinical notes and reports (with documents often taken from the MIMIC[1] (Medical Information Mart for Intensive Care) clinical database), which typically exhibit syntactically ill-formed, telegraphic language with many acronyms and abbreviations as an additional layer of complexity (cf. the seminal descriptive work distinguishing these two sublanguage types by Friedman et al. [13]). Newman-Griffis and Fosler-Lussier [14] investigated different sublanguage patterns for the many varieties of clinical reports (pathology reports, discharge summaries, nurse and Intensive Care Unit notes, etc.), while Nunez and Carenini [15] discussed the portability of embeddings across various fields of medicine reflecting characteristic sublanguage use patterns. These constraints have motivated the medical NLP community to adapt embeddings originally trained on general language to medical language. [Table 1] lists these medically informed embeddings, many of which form the basis for the IE applications discussed in Section 3.
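As a minimal usage sketch, pre-trained biomedical word vectors, such as the PubMed-trained Word2vec models distributed via bio.nlplab.org [16], can be loaded and queried with the gensim library (the file name is a placeholder for whichever distribution is downloaded):

```python
from gensim.models import KeyedVectors

# Placeholder path for a biomedical Word2vec binary, e.g., the
# PubMed-trained vectors distributed via bio.nlplab.org [16].
vectors = KeyedVectors.load_word2vec_format("PubMed-w2v.bin", binary=True)

# Nearest neighbors in the embedding space, e.g., for a disease term.
print(vectors.most_similar("diabetes", topn=5))
```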
Table 1
An Overview of Common Embeddings—Biomedical Language Models
Our survey emphasizes the fundamental methodological paradigm shift of current NLP research from symbolic to distributed representations as the basis of DL. It thus complements earlier contributions to the International Medical Informatics Association (IMIA) Yearbook of Medical Informatics, which focused exclusively on the role of social media documents [23], took a balanced view on the relevance of both Electronic Health Records (EHRs) and social media posts [24], or dealt with the importance of shared tasks for progress in medical NLP [25]. The last two Yearbook surveys of the NLP section most closely related to medical IE were published in 2015 [26] and 2008 [27]. The survey by Velupillai et al. [28] dealt with opportunities and challenges of medical NLP for health outcomes research, with particular emphasis on evaluation criteria and protocols.
We also refer readers to alternative surveys of DL as applied to medical and clinical tasks. Wu et al. [29] reviewed the literature on DL for a broader view of clinical NLP, whereas Xiao et al. [30] and Shickel et al. [31] performed systematic reviews of DL applications to several kinds of EHR data, not only text. Miotto et al. [32] and Esteva et al. [33] further extended that scope to include clinical imaging and genomic data beyond classical EHRs. From the even broader perspective of the huge amounts of biomedical data, Ching et al. [34] examined applications of DL to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discussed the unique challenges that biomedical data pose for DL methods. In the same vein, Rajkomar et al. [35] used the entire EHR, including clinical free-text notes, for DL-based clinical predictive modeling (targeted, e.g., at the prediction of in-hospital mortality or a patient's final discharge diagnoses). They also demonstrated that DL methods outperformed traditional statistical prediction models.
3 Deep Neural Networks for Medical Information Extraction
In this section, we introduce applications of DNNs to medical NLP for two different tasks, Named Entity Recognition (NER) and Relation Extraction (REX). Our discussion focuses on studies dealing with English as the reference language, since the vast majority of benchmark and reference datasets are in English[2]. After a brief description of each task, we summarize the current SOTA in tables that abstract over often subtle distinctions in experimental design and workflows. Our main goal is to show the diversity of major benchmark datasets, DL approaches, and embeddings being used. For these tables, we extracted all symbolic (e.g., corpus or DL approach) and numerical information (e.g., annotation metadata, performance scores) directly from the cited papers.
The assessment of different systems for the same task centers on their performance on gold data in evaluation experiments. We refrain from highlighting minor differences in the reported scores because of the different datasets used for evaluation, the varying volumes of metadata, and sometimes even the genres they contain. Hence, from a strict methodological perspective, the reported results have to be interpreted with utmost caution for two main reasons [37]. First, the choice of pre-processing steps (such as tokenization, inclusion/exclusion of punctuation marks, stop word removal, morphological normalization/lemmatization/stemming, n-gram variability, and entity blinding strategies) and, second, the calibration of training methods (split bias, pooling techniques, hyperparameter selection (dropout rate, window size, etc.)) have a strong impact on how a chosen embedding type and DL model finally perform, even within the same experimental setting. Nevertheless, the data we report give valuable comparative information on the SOTA, though with fuzzy edges.
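Among the pre-processing choices listed above, entity blinding is easy to illustrate: concrete mentions are replaced by type placeholders so that a model cannot simply memorize surface strings. The placeholder style below (@DRUG$, @DISEASE$) follows the convention of the BLUE benchmark preprocessing; exact schemes vary across studies, and the function is a minimal sketch rather than any surveyed system's pipeline:

```python
def blind_entities(tokens, spans):
    """Replace entity mentions with type placeholders.

    spans: list of (start, end, type) over token indices, end exclusive.
    """
    out, i = [], 0
    for start, end, etype in sorted(spans):
        out.extend(tokens[i:start])   # copy tokens before the mention
        out.append(f"@{etype}$")      # replace the mention itself
        i = end
    out.extend(tokens[i:])            # copy the remainder
    return out

tokens = "ibuprofen may reduce fever".split()
print(blind_entities(tokens, [(0, 1, "DRUG"), (3, 4, "DISEASE")]))
# ['@DRUG$', 'may', 'reduce', '@DISEASE$']
```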
The comparability problem might be remedied by a recently proposed common evaluation framework for biomedical NLP, the BLUE (Biomedical Language Understanding Evaluation) benchmark[3] [22], which consists of five different biomedical NLP tasks (including NER and REX) over ten corpora (including BC5CDR, DDI, and i2b2, which also occur in the tables below), or by the framework proposed by Chauhan et al. [37][4], which enables a more lucid comparison of various training methodologies, pre-processing, modeling techniques, and evaluation metrics.
For the tables provided in the next subsections, we used the F1 score as the main ordering criterion for the cited studies (from highest to lowest)[5]. We usually had to select among a large variety of experimental conditions (with different scores); the final choices were guided by the goal of maximizing comparability across studies. This means that higher (and lower) outcomes may have been reported in the cited studies under varying experimental conditions. Still, the top-ranked system(s) in each of the following tables defines the current SOTA for a particular application.
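For reference, the F1 score used for this ranking is the harmonic mean of precision and recall; the minimal helper below makes the computation explicit (the counts are invented for illustration):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1(tp=90, fp=10, fn=15))  # precision 0.90, recall ~0.857 -> F1 ~0.878
```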
3.1 Named Entity Recognition
The task of Named Entity Recognition (NER) is to identify crucial medical named entities
(i.e., spans of concrete mentions of semantic types such as diseases or drugs and their
attributes) in running text. For a recent survey of DL-based approaches and architectures
underlying NER as a generic NLP application, see [38].
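In practice, NER is typically cast as sequence labeling over BIO tags (Begin/Inside/Outside of a mention), and scoring is done at the entity level rather than per token. A minimal sketch, assuming the open-source seqeval package:

```python
from seqeval.metrics import f1_score

# "cerebral inflammation" is a two-token Disease mention in BIO encoding.
tokens = ["Patient", "shows", "cerebral", "inflammation", "."]
y_true = [["O", "O", "B-Disease", "I-Disease", "O"]]

# A prediction that hallucinates a second entity: 1 TP, 1 FP, 0 FN.
y_pred = [["B-Disease", "O", "B-Disease", "I-Disease", "O"]]

print(f1_score(y_true, y_pred))  # 0.666...: precision 0.5, recall 1.0
```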
3.1.1 Diseases
A primary target of NER in the medical field is the automatic identification of diseases in scientific articles and clinical reports. For instance, textual occurrences of disease mentions (e.g., “Diabetes II” or “cerebral inflammation”) are mapped to a common semantic type, Disease[6]. The crucial role of recognizing diseases in medical discourse is also emphasized by a number of surveys dealing with the recognition of specific diseases. For instance, Sheikhalishahi et al. [40] discussed NLP methods targeted at chronic diseases and found that shallow ML and rule-based approaches (as opposed to more sophisticated DL-based ones) still prevail. Koleck et al. [41] summarized the use of NLP to analyze symptom information documented in EHR free-text narratives as an indication of diseases and, like the previous survey, found little coverage of DL methods in this application area. Savova et al. [42] reviewed the current state of clinical NLP with respect to oncology and cancer phenotyping from EHRs. Datta et al. [43] focused on an even more specialized use case: the lexical representation required for the extraction of cancer information from EHR notes in a frame-semantic format.
The research summarized in [Table 2] is strictly focused on Disease recognition and, for reasons of comparability, based on the use of shared datasets and metadata (gold annotations). Two benchmarks are prominently featured, BC5CDR [44] and NCBI [45][7]. BC5CDR is a corpus of 1,500 PubMed articles, with 4,409 annotated chemicals, 5,818 diseases, and 3,116 chemical-disease interactions, created for the BioCreative V Chemical and Disease Mention Recognition Task [44]. As an alternative, the NCBI Disease Dataset [45] consists of a collection of 793 PubMed abstracts annotated with 6,892 disease mentions which are mapped to 790 unique disease concepts (thus, this corpus can also be used for grounding experiments).
Table 2
Medical Named Entity Recognition: Diseases. Benchmark Datasets from BC5CDR [44] and NCBI [45].
The current top performance for Disease recognition comes close to 90% F1[8]. Lee et al. [20] use a Transformer model with in-domain training (BioBERT), but (attention-based) BiLSTMs also perform strongly, in the range of 88–89% F1. As for the choice of embeddings, self-trained ones might be a better choice than pre-trained ones, e.g., those provided by bio.nlplab.org [16]. The incorporation of (large) dictionaries does not provide a competitive advantage in the experiments reported here. Although multi-task learning and transfer learning seem reasonable choices ([39], [46] and [47], respectively) to combat the sparsity of datasets, they generally do not boost systems to the top ranks.
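As an illustration of how such a Transformer tagger is typically assembled, the Hugging Face transformers library can load a public BioBERT checkpoint and attach a token-classification head (the model identifier and the three-label BIO scheme below are assumptions for illustration, not the exact configuration of [20]):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# One public BioBERT release; assumed here for illustration.
model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=3  # O, B-Disease, I-Disease
)

inputs = tokenizer("Cerebral inflammation was observed.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, seq_len, 3); fine-tune with
                                 # token-level cross-entropy on BC5CDR/NCBI
```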
Differences for the same approach on different evaluation datasets are noteworthy, though. For the second-best system by Sachan et al. [47], F1 scores differ between BC5CDR and NCBI by 2.0 percentage points (for the third-best [46], by 2.7), whereas for the best non-DL approach by Lou et al. [48], this difference amounts to a remarkable 4.1 percentage points. This hints at a strong dependence of a given system set-up on the specific corpus it is evaluated on and thus limits generalizability. On the other hand, the corpora obviously cannot be blamed for intrinsic analytical hardness, since cross-rankings occur: the system by Lee et al. [20] achieves the overall highest F1 score on NCBI but underperforms on BC5CDR, whereas for the tagger used by Sachan et al. [47] the ranking is reversed: their system performs better on BC5CDR than on NCBI (differences are in the range of 2 percentage points). The most stable system in this respect is the one by Zhao et al. [39]. Finally, the distance between the best- and second-best-performing DL systems ([20] and [47], respectively) and their best non-DL counterpart [48] amounts to 7.6 percentage points (for NCBI) and 3.1 percentage points (for BC5CDR), respectively.
3.1.2 Medication
The second major medical named entity type we discuss here is related to medication information. NER complexity increases for this task since it is split into several subtasks, including the recognition of drug names (Drug), frequency (Dr-Freq), (manner or) route of drug administration (Dr-Route), dosage (Dr-Dose), duration of administration (Dr-Dur), and adverse drug events (Dr-ADE). These subtypes are highly relevant in the context of medication information and are backed up by an international standard, the HL7 Fast Healthcare Interoperability Resources (FHIR)[9]. [Tables 3] and [4] provide an overview of the SOTA on this topic.
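To indicate how these subtypes line up with the FHIR standard just mentioned, the following hand-written illustration populates a MedicationStatement-style structure with the entity types from above (the field values are invented; no surveyed system emits exactly this output):

```python
# Illustrative mapping of extracted medication entities onto a FHIR
# MedicationStatement-style dict (hand-written example, not system output).
medication_statement = {
    "resourceType": "MedicationStatement",
    "medicationCodeableConcept": {"text": "ibuprofen"},        # Drug
    "dosage": [{
        "text": "400 mg twice daily by mouth for 5 days",
        "route": {"text": "oral"},                             # Dr-Route
        "timing": {"code": {"text": "BID"}},                   # Dr-Freq
        # Dr-Dur would be encoded under timing.repeat in full FHIR.
        "doseAndRate": [{
            "doseQuantity": {"value": 400, "unit": "mg"},      # Dr-Dose
        }],
    }],
}
```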
Table 3
Medical Named Entity Recognition: Drugs. Benchmark Datasets: n2c2 [56]; i2b2 2009 [57]; MADE 1.0 [59]; DDI [60].
Table 4
Medical Named Entity Recognition: Medication Attributes. Benchmark Datasets: n2c2
[56]; i2b2 2009 [57]; MADE 1.0 [59]; DDI [60].
For medication information, four gold standards have had a great impact on the field in recent years. The most recent one came out of the 2018 n2c2 Shared Task on Adverse Drug Events and Medication Extraction in Electronic Health Records [56], a successor of the 2009 i2b2 Medication Challenge [57], now with a focus on Adverse Drug Events (ADEs). It includes 505 discharge summaries (303 in the training set and 202 in the test set), which originate from the MIMIC-III clinical care database [58]. The corpus contains nine types of clinical concepts (including drug name), eight attributes (reason, ADE, frequency, strength, duration, route, form, and dosage, from which we chose five for comparison), and 83,869 concept annotations. Relations between drugs and the eight attributes were also annotated, adding up to 59,810 relation annotations (see Section 3.2.1). The third corpus, MADE 1.0 [59], formed the basis for the 2018 Challenge for Extracting Medication, Indication, and Adverse Drug Events (ADEs) from Electronic Health Record (EHR) Notes and consists of 1,092 de-identified EHR notes from 21 cancer patients. Each note was annotated with medication information (drug name, dosage, route, frequency, duration), ADEs, indication (symptom as reason for drug administration), other signs and symptoms, severity (of disease/symptom), and relations among those entities, resulting in 79,000 mention annotations. Finally, the DDI corpus [60], originally developed for the Drug-Drug Interaction (DDI) Extraction 2013 Challenge [61], is composed of 792 texts selected from the (semi-structured) DrugBank database[10] and another 233 (unstructured) MEDLINE abstracts, for a total of 1,025 documents. This fine-grained corpus has been annotated with a total of 18,502 pharmacological substances and 5,028 drug-drug interactions[11]. Hence, the medication NER task comes not only with higher entity type complexity but also with text genres different from the disease recognition task: while the former puts emphasis on clinical reports, the latter focuses on scholarly writing.
Except for route and ADE, all top scores for NER were achieved on the n2c2 corpus. For drug names, the current SOTA exceeds a 95% F1 score, established by Wei et al. [62]. As to the subtypes, their system also compares favorably to alternative architectures by a large F1 margin, ranging from 8.6 percentage points (for duration) down to 1.0 (for drug name). For route, the distance to the best system is marginal (around 1 percentage point)[12], whereas for ADE it is huge (more than 10 percentage points, a strong outlier). Overall, frequency, route, and dosage recognition reach outstanding F1 scores in the range of 95 up to 97%, while for duration information top F1 scores drop remarkably by at least 10 to 20 percentage points. Still, the recognition of ADEs seems to be the hardest task, with the best system by Wunnava et al. [67] peaking at around 64% F1 on MADE 1.0 data (here, the top-performing system by Wei et al. [62] plummets to 53% F1). Interestingly, ADEs are linguistically the least constrained type of natural language utterance compared with all the other entity types considered here.
In terms of DL methodology, BiLSTM-CRFs are the dominant approach. Yet, the type of embeddings used by different DL systems varies considerably, ranging from pre-trained Word2vec embeddings and those self-trained on MIMIC-III (for the top performers) to GloVe embeddings pre-trained on CommonCrawl, Wikipedia, EHR notes, and PubMed. There seems to be no generalizable winner for either choice of embeddings given the current state of evaluations, but self-training on medical raw data, such as MIMIC-III or challenge datasets, or, more advisably, using the now available BioSentVec [18] and BlueBERT [22] embeddings pre-trained on MIMIC-III, might be advantageous.
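A minimal sketch of the self-training route with gensim, assuming access to tokenized clinical sentences (the toy corpus and hyperparameters below are illustrative, not those of any cited system):

```python
from gensim.models import Word2Vec

# Toy stand-in for tokenized clinical sentences (e.g., MIMIC-III notes,
# access permitting).
sentences = [
    ["patient", "denies", "chest", "pain"],
    ["metoprolol", "25", "mg", "po", "bid"],
]

# Skip-gram Word2vec; hyperparameters are illustrative only.
model = Word2Vec(sentences, vector_size=200, window=5,
                 min_count=1, sg=1, epochs=10)
model.wv.save("clinical_w2v.kv")  # reusable KeyedVectors for downstream taggers
```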
Studies in which the same system configuration was tested on different corpora are still lacking, so that corpus effects are unknown (unlike for diseases; see [Table 2]). Yet, there is one interesting, though not so surprising, observation: Unanue et al. [65] explored the two slices of the DDI corpus and found a span of F1 scores of more than 16 percentage points. This clearly reflects the influence of a priori (lack of) structure: DrugBank data is considerably more structured than MEDLINE free texts and thus the former receives much higher scores than the latter.
Comparing DL approaches with non-DL ones (a CRF architecture) on the same corpus (MADE 1.0), we found that for the core entity type (Drug) the recognition performance differs by almost 3 percentage points, for frequency, route, and dose marginally by less than 1, yet for duration and ADE by roughly 5 and 12 percentage points, respectively, consistently in favor of Deep Neural Networks (DNNs).
3.2 Relation Extraction
Once named entities have been identified, a follow-up question emerges: does some
sort of semantic relation hold among these entities? We surveyed this Relation Extraction
(REX) task with reference to results that have been achieved for information related
to medication attributes and drug-drug interaction.
3.2.1 Medication-Attribute Relations
In Section 3.1.2, we dealt with single named entity types typically associated with medication information, namely drug names and administration frequency, duration, dosage, route, and ADE, yet in isolated form only. In this subsection, we are concerned with making explicit the close associative ties between Drugs and typical conceptual attributes, such as Frequency, Duration, Dosage, Route, ADE, and Reason (for prescription). Hence, the recognition of the respective named entity types (Drugs, Dr-Freq, Dr-Dur, Dr-Dose, Dr-Route, Dr-ADE, and Dr-Reason) turns out to be a good starting point for solving this REX task. Not surprisingly, the benchmarks for this task are a subset of those in [Tables 3] and [4], which depict the results for medication-related NER. [Table 5] provides an overview of the experimental results for finding medication-attribute relations in medical, in effect clinical, documents.
Table 5
Medical Relation Extraction: Medication-Attribute Relations (including ADEs). Benchmark
Datasets: n2c2 [56]; MADE 1.0 [59].
The overall results for medication-focused NER are mostly confirmed for the REX task. The n2c2 corpus is the reference dataset for top performance. The group that achieved top F1 scores for the medication NER problem also performed best on the medication-attribute REX task [62], with extraordinary figures for the Frequency, Route, and Dosage relations (in the upper 98% F1 range), a superior one for the Duration relation (93% F1), and good ones for the (hard to deal with) Adverse and Reason relations (85% F1). Still, the distances to the second-best system on the same corpus (n2c2) are not so pronounced in most cases, at around 1 percentage point (for Frequency, Route, Dosage, and Duration), yet increase to up to 3 (for Adverse) and 7 (for Reason) percentage points. For the MADE 1.0 corpus, a similar picture emerges. From a lower offset (typically around 3 F1 percentage points below n2c2), differences between the best and second-best systems were on the order of a (negligible) 1 percentage point for Frequency, Route, and Dosage, yet increased to roughly 3, 5, and 7 percentage points for Reason, Duration, and Adverse events, respectively. Yet, in 4 out of 6 cases (Frequency, Dosage, Duration, and Adverse events), non-DL systems (CRFs, SVMs) outperformed their DL counterparts, by small margins (in the range of, again, a negligible 1 percentage point) for Frequency and Dosage, yet by larger ones for Duration and Adverse events (5 and 7 percentage points, respectively). In cases where the DL approach ranked higher than a non-DL one, differences ranged between 1 and 3 percentage points (for Route and Reason, respectively). Thus, the MADE 1.0 corpus constitutes a benchmark where well-engineered standard ML classifiers can still play a competitive role. However, we did not find this pattern of partial supremacy of non-DL approaches for the n2c2 benchmark.
The top performers on the medication-attribute REX task [62] employed a joint learning approach based on a CNN-RNN (thus diverging from the most successful architectures for medication NER; see [Tables 3] and [4]) combined with rule-based post-processing, which outperformed a plain CNN-RNN. In summary, the CNN-RNN approach seems more favorable than an (attention-based) BiLSTM, with a preference for self-trained in-domain embeddings.
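A schematic PyTorch rendering of such a CNN-RNN relation classifier is sketched below; the layer sizes and the exact CNN/RNN combination are assumptions for illustration, not a reimplementation of [62]:

```python
import torch
import torch.nn as nn

class CnnRnnRelClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=200, n_relations=7):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # CNN captures local n-gram patterns around entity mentions.
        self.conv = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)
        # BiLSTM captures longer-range context across the sentence.
        self.rnn = nn.LSTM(128, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 64, n_relations)

    def forward(self, token_ids):                  # (batch, seq)
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        _, (h, _) = self.rnn(x)                    # final hidden states
        h = torch.cat([h[0], h[1]], dim=-1)        # concat both directions
        return self.out(h)                         # relation logits

# Toy forward pass: batch of 2 sentences, 40 (blinded) token ids each.
logits = CnnRnnRelClassifier(30_000)(torch.randint(0, 30_000, (2, 40)))
```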
3.2.2 Drug-Drug Interaction
The second type of medication-focused relation we consider here is drug-drug interactions as featured in the DDI challenge (for surveys on the impact of DL on recent drug-drug interaction research, cf. [82], [83]; for a survey on drug-drug interaction combining progress in data and text mining from EHRs, scientific papers, and clinical reports, but lacking in-depth coverage of DL methods, cf. [84]; for the NLP-focused recognition of ADEs, also lacking awareness of DL contributions to this topic, cf. [85]). Four main types of relations between drugs are considered: pharmacokinetic Mechanism, drug Effect, recommendation or Advice regarding a drug interaction, and Interaction between drugs without any additional information. Overall, the DDI corpus on which these evaluations were run is divided into 730 documents taken from DrugBank and 175 abstracts from MEDLINE and contains 4,999 relation annotations (4,020 train, 979 test).
Recognition rates for these relations (cf. [Table 6]) are considerably lower than for the medication-related attributes linked to drugs (cf. [Table 5]). The best systems peak at an 85% F1 score for Advice (a distance of more than 13 percentage points from the top recognition results for medication attributes), slip to 78%[13] and 77% for Mechanism and Effect, respectively, and plummet to 59% for Interaction[14]. Differences between the first- and second-ranked systems are typically small, yet become larger on subsequent ranks (roughly 3 to 4 percentage points relative to the top-ranked system). As with medication attributes, drug-drug interactions can also be recognized competitively by CNN-RNN architectures, but attention-based LSTMs also perform considerably well. Again, self-trained embeddings using in-domain corpora seem advantageous for this relation class. Reflecting the drop in performance, one may conclude that drug-drug interactions constitute a markedly harder task than the conceptually much closer medication-attribute relations.
Table 6
Medical Relation Extraction: Drug-Drug Interaction. Benchmark Dataset: DDI [60].
Finally, [Table 6] most strikingly supports our claim that DL approaches outperform non-DL ones: the difference between the two amounts to 5 percentage points for Mechanism, 7 for Effect and Interaction, and 8 for Advice.
4 Conclusions
We have presented various forms of empirical evidence that (with only one exception) Deep Learning-based neural networks outperform non-DL, feature-engineered approaches on several information extraction tasks. However, despite their success, Deep Neural Networks and their embedding models have shortcomings as well.
One of the most problematic issues is their dependence on huge amounts of training data: SOTA embedding models are currently trained on hundreds of billions of tokens [89]. This magnitude of data volume is out of reach for any training effort in the medical/clinical domain [90]. Also, embeddings are vulnerable to malicious attacks and adversarial examples: small changes at the input level may result in severe misclassification [5]. Another well-known problem is the instability of word embeddings. Word embeddings depend on their random initialization and on the processing order of the underlying examples, and therefore do not necessarily converge on exactly the same embeddings even after several thousand training iterations [91], [92] (a toy illustration follows below). Finally, although DL is celebrated for not requiring manual feature engineering, the effects of proper hyperparameter tuning on DNNs [93] remain an issue for DL [94]. Apart from these intrinsic problems, Kalyan and Sangeetha [95] and Khattak et al. [96] point to extrinsic drawbacks of neural networks, such as opaque encodings (resulting in a lack of interpretability) or the limited transferability of large models (hindering knowledge distillation for smaller models).
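As a toy illustration of this instability, two gensim Word2vec runs that differ only in their random seed can yield different nearest neighbors for the same word (the corpus below is artificial; on real data the effect is more pronounced):

```python
from gensim.models import Word2Vec

# Tiny artificial corpus, repeated to give the model something to fit.
corpus = [["fever", "cough", "fatigue", "headache"],
          ["aspirin", "reduces", "fever", "and", "pain"]] * 100

for seed in (1, 2):
    # workers=1 keeps each run deterministic for its own seed.
    m = Word2Vec(corpus, vector_size=50, min_count=1,
                 seed=seed, workers=1, epochs=5)
    print(seed, m.wv.most_similar("fever", topn=2))
```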
Still, the sparsity of corpora and the special linguistic phenomena of the medical (clinical) sublanguage(s) create intrinsic problems for data-greedy DL approaches, which have to be overcome by special learning strategies for neural systems, such as transfer learning or domain adaptation. Research on adapting general language models to the constraints of medical language is still in its infancy, and there is no simple solution to this problem. Wang et al. [97] evaluated Word2vec embeddings trained on private clinical notes, PMC, Wikipedia, and the Google News corpus, both qualitatively and quantitatively, and showed that the ones trained on Electronic Health Record data performed better in most of the tested scenarios. However, they also found that word embeddings trained on biomedical domain corpora do not necessarily perform better than those trained on general domain corpora for every downstream biomedical NLP task (other experimental evidence of the effects of in- and out-of-domain corpora and further parameters, such as corpus size, on word embedding performance is reported by Lai et al. [98]).
While this survey focused on the application domain of medical IE to demonstrate the
outstanding role of DL for medical Natural Language Processing, one might be tempted
to generalize this trend to other applications as well. There is, indeed, plenty of
evidence in the literature that other application fields, such as question answering
(and the closely related area of machine reading), summarization, machine translation,
and speech processing, reveal the same pattern. However, for text categorization (in
the sense of mapping free text to some pre-defined medical category system, such as
ICD, SNOMED, or MeSH) this preference is less obvious, since traditional Machine Learning
or rule-based models still play an important role here and, more often than for the
IE application scenario, show competitive performance against DL approaches. Whether
this exception will persist or will be swept away by future research remains an open
issue.