Keywords
incidental findings - software tools - natural language processing - electronic medical record
Background and Significance
Incidental radiological findings, such as adrenal incidentalomas, are clinically relevant
and require longitudinal management. A failure to properly track and manage incidental
radiological findings may result in missed diagnoses, including malignancies, that
predispose patients to poor health outcomes.[1] [2] [3] [4] [5] The rate of adherence to recommendations regarding these incidental findings is low. Although the number of recommended follow-ups for radiological findings continues to increase, only 58% of these recommended follow-ups are completed.[6] [7] [8] [9] The care coordination process for incidental findings is particularly complex. The
timeline for following up such findings is often months to years away, and there is
no structured system to track follow-up recommendations, particularly those found
in unstructured free text of a radiology report. If the finding is discovered in the
inpatient setting, the hospital clinicians must communicate this information to the
outpatient clinicians, who must track these outstanding incidental findings using
variable free-text formats within existing electronic medical records (EMRs). At each
step of this communication process, information is prone to loss, especially given
the increasing clinical complexity of patients in an aging population. In some cases,
there is no clear consensus on who bears the primary responsibility for keeping track
of the finding (e.g., the patient has no primary care provider [PCP]).
Improved methods for tracking and managing these findings as well as the associated
follow-up recommendations are required. Previous work has demonstrated that electronic
notification tools can be useful in improving adherence to follow-up imaging test
completion,[10] [11] highlighting the importance of exploring alternative methods of notifying providers
of the need for follow-up. In particular, a centralized system which maintains a database
of outstanding incidental findings and allows for multiuser management over time would
solve many of these issues, as it would not rely on variable individual workflows.
When integrated into an existing EMR, these systems can relay follow-up recommendations
to physicians, either directly or with the assistance of a quality improvement team
monitoring the database. Systems that extract information from unstructured free-text documents using natural language processing (NLP), rather than requiring clinicians to enter additional structured data at the time of documentation, can augment existing workflows and provide the backbone for these centralized results management tools.
In this manuscript, we develop and describe a comprehensive results management tool
as applied to a particular incidental finding—adrenal incidentalomas. An adrenal incidentaloma
is defined as a mass greater than 1 cm in diameter discovered on an image obtained
for reasons other than to evaluate the adrenal glands.[1] Adrenal incidentalomas are both common and potentially pathologic, making them a
valuable target for improved management.[2] [12] Approximately 4.4% of all abdominal computed tomography (CT) scans contain incidental
adrenal gland lesions[12]; this percentage rises to nearly 10% for elderly patients.[2] As the use of imaging increases[9] and the population ages,[13] the number of masses that need tracking and follow-up care will increase substantially.
The majority of these lesions are found to be benign and nonfunctional, but some types
of masses require intervention, including primary adrenal cancers (e.g., adrenocortical
carcinoma), metastatic lesions, hormone-secreting tumors, and life-threatening pheochromocytomas.[1] [2] [14] The true percentage of pathologic adrenal incidentalomas is not well established,
but is estimated to be 20% when accounting for both pathologic biochemical dysfunction
and malignancy.[15]
The consensus from several specialty organizations is that all patients with adrenal
incidentalomas be followed up for the possibility of malignancy and subclinical hormonal
hyperfunction,[2] [16] [17] [18] [19] at least with the patient's PCP and potentially with an endocrinologist or endocrine
surgeon. Unfortunately, the majority of patients do not receive appropriate follow-up.
Among incidental adrenal lesions, follow-up is especially low, likely because it involves
a combination of radiographic studies and multiple biochemical evaluations to rule
out pheochromocytoma and autonomous secretion of aldosterone or cortisol. One recent
study found that incidental adrenal masses are appropriately followed up less than
10% of the time.[20] An additional study found that only 18.4% of patients with adrenal incidentalomas
received a complete initial evaluation of the mass per American Association of Clinical
Endocrinologists and American Association of Endocrine Surgeons guidelines.[21] Adrenal incidentalomas exemplify many of the problems with long-term incidental
results management identified above, and therefore represent an appropriate first
target for an improved results management system.
Objectives
In this manuscript, we employ NLP to identify radiology reports which contain mentions
of previously unseen adrenal incidentalomas. On top of this information extraction backbone, we develop
a back-end database and front-end web application, designed to improve longitudinal
coordination of incidentaloma management. The application, to be deployed at a large
metropolitan safety net hospital, maintains and displays a list of incidentalomas
requiring follow-up, along with a single-page dashboard of relevant clinical data
(e.g., laboratories, appointments), and enables communication and clinical management
of these findings.
Methods
Dataset
Our dataset consisted of 4,090 radiology reports drawn from our institution's database
of reports from 2016 to 2018. Of these, 251 reports already known to contain adrenal incidentalomas, identified by existing clinical workflows between January 1 and December 31, 2016, were included, and a random sample of 3,839 additional reports was hand labeled.
The existing clinical workflow used a simple string matching protocol (matching any
report containing “adrenal lesion,” “adrenal nodule,” “adrenal mass,” or “adrenal
adenoma”) followed by manual screening of the returned results. We included the sample
of already known positive reports in addition to our random sample to ensure a sufficient
number of positive examples to effectively train the model. This addition of known
positive reports functionally acts as a form of dataset augmentation by oversampling
positive examples of new adrenal incidentalomas, improving the model's ability to
learn semantically useful features of positive reports. As the rate of new adrenal
incidentalomas among all new radiology reports is likely smaller than the rate in
our dataset, this form of data augmentation increases the likelihood of identifying
infrequent positive reports. The reports consisted of a broad set of imaging modalities—CT,
magnetic resonance imaging (MRI), and positron emission tomography-computed tomography
(PET-CT)—of the chest, abdomen, and pelvis. We did not limit our dataset to abdominal
imaging studies because adrenal findings were often observed on chest imaging. Radiology
reports for scans ordered specifically to evaluate the adrenal glands were excluded.
The report corpus includes a variety of reporting templates and patients of all ages
and genders. The study was approved by our institution's Institutional Review Board.
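For illustration, the string-matching prefilter used by the preexisting workflow can be approximated by a few lines of Python; the phrase list comes from the text above, while the function and variable names are hypothetical.

```python
# Hypothetical sketch of the existing clinical workflow's keyword prefilter:
# flag any report containing one of the four adrenal phrases; flagged reports
# were then screened manually.
ADRENAL_PHRASES = ("adrenal lesion", "adrenal nodule", "adrenal mass", "adrenal adenoma")

def matches_keyword_prefilter(report_text: str) -> bool:
    """Return True if the report mentions any of the target adrenal phrases."""
    text = report_text.lower()
    return any(phrase in text for phrase in ADRENAL_PHRASES)

# Example: only the first report would be flagged for manual screening.
reports = [
    "1.5 cm left adrenal nodule, incompletely characterized.",
    "No acute cardiopulmonary abnormality.",
]
candidates = [r for r in reports if matches_keyword_prefilter(r)]
```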
Annotation
Our data were annotated at the document level by a PGY-4 surgery resident and a post-doctoral research fellow. Annotations were reviewed with a fellowship-trained adrenal surgeon, and a fellowship-trained body imaging radiologist was consulted to answer questions as needed. A report was assigned a “true” label if it contained textual evidence of at least one incidental adrenal finding that (1) had not been observed before, (2) was greater than 1 cm in diameter, and (3) was discovered on an institutional imaging study rather than at an outside hospital. A report with an incidental finding that had been noted previously was not assigned a “true” label; for instance, a report whose only incidental finding was described as “stable,” “unchanged,” or “previously observed” was given a “false” label. Each report was labeled by one of the two authors (T.F. and J.S.).
Neural Network Design and Training Protocol
The structure of the task is such that the majority of text in a given report is not
relevant to the document output label (i.e., most of the report is not concerned with the adrenal glands), mirroring feature detection tasks found in computer vision. For
the purposes of document classification in this context, the region of interest that
determines whether a new adrenal incidentaloma is identified is roughly represented
by certain groups of words clustered together. We therefore opted to use a simple
shallow convolutional neural network (CNN), which is classically used for feature
detection tasks and has been used for text classification tasks in the past,[22] including a 2019 study on identifying incidental findings in radiology reports.[23] Our decision to use a convolutional, rather than recurrent neural network (RNN),
for text classification also relates to the efficiency of these networks for the stated
feature detection task. Less computationally intensive than RNNs, CNNs can run effectively with limited resources. The improvement in training and inference time on a machine without graphics processing units (GPUs) is not universal,[24] but we empirically found that it held for the task of identifying new adrenal incidentalomas. This makes model implementation feasible in a resource-constrained hospital setting, the final goal of the work presented here.
Our network used pretrained 300-dimensional Global Vector (GloVe) word embeddings
derived from the Common Crawl web dataset[25] and packaged within the SpaCy Python package. Documents were tokenized using the
built-in SpaCy tokenizer.[26] Although more task-specific word embeddings exist,[23] our network was trained with GloVe word embeddings to improve model robustness and
ensure similar model performance across institutions. Appendix A.1 contains further discussion on the decision to use GloVe word embeddings. The network
itself consisted of a single convolutional layer for feature detection (200 units,
convolutional kernel size 7) followed by a global max-pooling layer, which selects
the highest feature activation for each convolutional filter from across the entire
document length. A two-layer, fully connected network, with a ReLU nonlinearity between the two dense layers, was applied to the output of the max-pooling layer, and a softmax was applied to produce the final binary output. A dropout rate
of 0.3 was applied after the convolutional and first dense layers. The hyperparameter
selection strategy is described in further detail in Appendix A.2.
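The manuscript does not publish code or name a deep learning framework; the following Keras sketch shows one way to realize the architecture described above. The vocabulary size, maximum sequence length, hidden dense width, and convolutional activation are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of the described network: pretrained
# 300-dimensional embeddings, one Conv1D layer (200 filters, kernel size 7),
# global max pooling, two dense layers with a ReLU in between, a softmax over
# two classes, and dropout of 0.3 after the convolutional and first dense
# layers. VOCAB_SIZE, MAX_LEN, the hidden width of 64, and the convolutional
# activation are assumed values.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20_000, 300, 1_000    # assumed
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))   # would hold the GloVe vectors

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.Conv1D(200, kernel_size=7, activation="relu"),  # activation assumed
    layers.Dropout(0.3),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),                   # hidden width assumed
    layers.Dropout(0.3),
    layers.Dense(2, activation="softmax"),
])
model.build(input_shape=(None, MAX_LEN))
```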
Labeled documents were split into training, validation, and test sets (80%/10%/10%).
Training was performed using the Adam algorithm[27] with a learning rate of 1e-4. Training continued until the validation loss ceased to decrease, with a patience of three epochs. Final performance was calculated on the
test set. Model performance is reported in conjunction with confidence intervals,
sensitivity, and specificity. The method for the calculation of these measures is
elaborated in Appendix A.3.
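Continuing the sketch above, the stated training protocol (Adam, learning rate 1e-4, early stopping with a patience of three epochs on validation loss) could be expressed as follows; the batch size, epoch cap, loss function, and the choice to restore the best weights are assumptions, and the token-id arrays and labels come from the 80%/10%/10% split.

```python
# Sketch of the training protocol under the assumptions noted above.
# x_train/x_val/x_test are integer token-id arrays padded to MAX_LEN and
# y_train/y_val/y_test are 0/1 document labels.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",   # loss choice assumed
              metrics=["accuracy"])

early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                               restore_best_weights=True)  # assumed

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50, batch_size=32,                      # assumed values
          callbacks=[early_stopping])

print(model.evaluate(x_test, y_test))                    # final test-set evaluation
```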
Model Interpretability
Clinical machine learning systems must be interpretable by human users.[28] Many approaches for improving interpretability among “black box” systems have been
developed in the machine learning literature; for this study, we opted to use the
Gradient Class Activation Maps (Grad-CAM) algorithm, which identifies the parts of an input that were important to the model's final decision.[29] We used Grad-CAM to generate saliency maps for our model's predictions to visualize
the importance of each word token in the model's decision making process. By design,
the Grad-CAM layer does not affect model output or performance.
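The Grad-CAM implementation itself is not published in the manuscript; a hedged sketch of how the algorithm can be adapted to a one-dimensional text CNN such as the model sketched above is shown below (the layer name and normalization are assumptions).

```python
# Sketch of Grad-CAM for a 1-D text CNN: average the gradient of the predicted
# class score over the sequence dimension to obtain one weight per filter,
# form the weighted sum of the convolutional feature maps, and keep the ReLU
# of the result as a per-position saliency score.
import tensorflow as tf

def grad_cam_1d(model, token_ids, conv_layer_name="conv1d"):   # layer name assumed
    grad_model = keras.Model(model.inputs,
                             [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(token_ids[None, :])        # add batch dimension
        class_idx = int(tf.argmax(preds[0]))
        class_score = preds[:, class_idx]
    grads = tape.gradient(class_score, conv_out)                # (1, positions, filters)
    weights = tf.reduce_mean(grads, axis=1)                     # one weight per filter
    cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, None, :], axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()[0]       # normalized saliency
```

Because the convolutional kernel spans seven tokens, each saliency score maps back to a window of roughly seven words of the report, consistent with the short text spans shown later in [Table 2].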
Web Application
To implement our system into the clinical workflow, we built a custom web application
that uses the trained neural network model to generate predictions and provides a
dashboard to enable risk stratification and prioritization, tracking, and clinical
management of identified findings (e.g., to track follow-up appointment scheduling,
PCP notification).
To accomplish this, we maintain a database of reports on a secure internal web server.
Our system receives periodic transfers of data consisting of all radiology reports
for recent imaging studies and uses a scheduled task to process the new batch of reports
with the neural network model. The model generates a new set of machine labels, which
are stored in the database. Simple rules are used to identify relevant laboratories (e.g., ACTH, aldosterone), clinical appointments (e.g., with endocrinology or
surgery), and orders (e.g., radiologic studies), extract them from our institution's
EMR database, and display all of the information on a single dashboard. This minimizes
the clicks required to view and synthesize the relevant information.
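As a rough illustration of this scoring step (the institution's database schema, scheduler, and EMR interfaces are not published, so all names and the keyword lists below are hypothetical):

```python
# Illustrative sketch of the scheduled scoring task: each newly received batch
# of report texts is tokenized, scored by the trained model, and the resulting
# machine labels are returned for storage in the findings database.
import numpy as np

def score_report_batch(model, report_texts, encode_tokens, threshold=0.5):
    """Return (probability, predicted_label) for each report in the batch.

    `encode_tokens` stands in for the upstream spaCy tokenization and
    vocabulary lookup that maps a report to a fixed-length array of token ids.
    """
    token_ids = np.stack([encode_tokens(text) for text in report_texts])
    probs = model.predict(token_ids)[:, 1]        # probability of "new incidentaloma"
    return [(float(p), bool(p > threshold)) for p in probs]

# Simple keyword rules (illustrative values only) select the laboratories and
# appointments surfaced on the dashboard alongside each flagged report.
RELEVANT_LAB_KEYWORDS = {"acth", "aldosterone", "cortisol", "metanephrine"}
RELEVANT_APPOINTMENT_KEYWORDS = {"endocrinology", "endocrine surgery"}
```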
Results
Dataset
Our dataset included 404 positively labeled reports (9.9%) and 3,686 (90.1%) negatively
labeled reports ([Table 1]). Among positively labeled reports, 166 (55.5%) came from reports on CT studies
of the abdomen/pelvis, while 73 (24.4%) were discovered on CT scans of the chest.
Of note, many of these reports contained either negation phrases (e.g., “no adrenal
nodules”) or phrases indicating that the incidental finding had been previously observed;
both of these scenarios were labeled as negative.
Table 1
Description of dataset

Total reports: 4,090

New adrenal incidentalomas
  Reports with new adrenal incidental findings: 404
  Reports without new adrenal incidental findings: 3,686

Imaging study type (among reports with new incidentalomas)
  CT abdomen/pelvis (with or without contrast): 225
  CT chest/abdomen/pelvis: 37
  CT chest: 104
  CT abdomen: 15
  CT chest/abdomen: 2
  MRI abdomen: 10
  Other: 11

Abbreviations: CT, computed tomography; MRI, magnetic resonance imaging.
Model Performance
On the unseen test set, our network achieved an accuracy of 97.3% (95% confidence interval [CI]: 95.2–98.5%), a recall (sensitivity) of 92.9% (95% CI: 80.9–97.5%), a precision (positive predictive value) of 83.0% (95% CI: 69.9–91.1%), a specificity of 97.8% (95% CI: 95.8–98.9%), and an F1 score (harmonic mean of recall and precision) of 87.6%. The hyperparameter selection
strategy is discussed in further detail in Appendix A.2. Model performance on the validation set, as well as 2 × 2 frequency tables outlining
model validation and test set performance in comparison to ground truth, expert human
labels, are available in Appendices B and C. The model was trained to convergence in less than 30 minutes on a standard institutional
desktop with no specialized computing hardware (i.e., no GPUs), demonstrating the feasibility of such an approach in a non-GPU-resourced
environment. On this machine, the system is capable of processing thousands of documents
per minute.
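For reference, these metrics can be computed from the model's binary test-set predictions with scikit-learn; this sketch reuses the variable names from the training sketch in the Methods section.

```python
# Sketch of the test-set metric calculations: accuracy, recall (sensitivity),
# precision (positive predictive value), F1, and specificity derived from the
# 2x2 confusion matrix.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_pred = model.predict(x_test).argmax(axis=1)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)           # sensitivity
precision = precision_score(y_test, y_pred)     # positive predictive value
f1 = f1_score(y_test, y_pred)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
specificity = tn / (tn + fp)
```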
Interpretability
We ran the Grad-CAM algorithm over all of the reports predicted to contain a new adrenal
incidentaloma. This procedure generated a set of text spans which were most “important”
to the model's predictions, based on the gradients of the final prediction with respect
to the feature maps of the convolutional layer. [Table 2] shows the text spans with the highest Grad-CAM scores for a randomly selected set
of positively predicted reports.
Table 2
Text spans with the highest Grad-CAM scores for a randomly selected set of reports

Report 1: “3. Left adrenal nodule is incompletely”; “Is a left adrenal nodule. Bones,”
Report 2: “mm hypodense left adrenal nodule that is”
Report 3: “of the left adrenal, there is”
Report 4: “mass. Recommend adrenal MRI for further”
Report 5: “1.3-cm left adrenal adenoma. There”
Report 6: “INFORMATION: adrenal mass noted on”
Report 7: “1.7-cm right adrenal nodule which measures,”; “likely represents an adrenal adenoma. The”
Report 8: “there are incompletely characterized 1.4 cm”
Report 9: “1.6 cm-right adrenal nodule which measures”
Report 10: “consistent with an adrenal adenoma. Addendum”

Abbreviations: Grad-CAM, gradient class activation maps; MRI, magnetic resonance imaging.
Web Application
The web application displays a list of patients with newly identified findings and
includes the most relevant text from each report as identified by the Grad-CAM algorithm.
The application is intended for the hospital's quality department, but will also be
accessible to individual clinicians. It allows for a variety of user actions, including:
(1) collaborative note taking for a particular finding (e.g., follow-up appointment
scheduling, sending a letter to a patient), (2) marking a finding as closed (i.e.,
when adequate follow-up has been performed), and (3) confirmation or correction of
machine predictions. When a user verifies a machine prediction as correct, or marks
it as incorrect, this information is stored as a “human label,” which supersedes any
machine predicted labels. This not only allows for correction of machine mistakes
within the system, but also increases the size of the labeled training set. These
labels are combined with the initial label set; as these new labels accumulate, the
model will be retrained to further improve performance. [Fig. 1] provides an overview of the complete application development process. A screenshot
of the web application is shown in [Fig. 2].
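A minimal sketch of the label-resolution rule described above (field names are hypothetical; the production database schema is not published):

```python
# Sketch of label resolution and retraining-set assembly: a recorded human
# verification or correction supersedes the machine prediction, and resolved
# labels are folded back into the labeled corpus for future retraining.
def resolved_label(finding: dict):
    """Prefer the human label over the machine prediction when one exists."""
    human = finding.get("human_label")
    return human if human is not None else finding["machine_label"]

def build_retraining_set(findings, initial_examples):
    """Combine the original labeled corpus with newly human-verified findings."""
    verified = [(f["report_text"], f["human_label"])
                for f in findings if f.get("human_label") is not None]
    return initial_examples + verified
```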
Fig. 1 Overview of the framework required to develop a machine-learning enabled web application
to track incidental findings. Once all the documents are collected, the reports are
labeled and are used to train a machine learning classification algorithm. The machine
learning algorithm can prospectively evaluate new reports, and the results are made available in a front-end web application that facilitates alerting the appropriate providers of the incidental finding.
Fig. 2 Screenshot of results management web application running on simulated data (all data
are fictional, not real). The application displays the list of adrenal incidentalomas
detected by the machine learning algorithm. The displayed text is the full sentence
surrounding the Grad-CAM predicted tokens that identify this document as one containing
a new adrenal incidentaloma; the full document can also easily be viewed. The application
allows multiple concurrent users to view relevant clinical data (e.g., laboratories,
upcoming appointments, orders), manage the list over time, and make any necessary
corrections to the machine predictions. Grad-CAM, gradient class activation maps.
Discussion
In this study we trained a model to identify radiology reports containing notation
of newly discovered adrenal incidentalomas and developed a software tool that leverages
this model to enable improved management of these findings. Our model achieved a recall
(sensitivity) of 92.9%, a precision of 83.0%, and an F1 score of 87.6% on the held-out test set for the task of identifying
previously unseen radiology reports that contain textual evidence of a newly discovered
adrenal incidentaloma. Furthermore, the inclusion of a user verification active learning
component in the application enables continued model refinement.[30] This system performed comparably to expert-level human performance and, importantly, has significant resource-saving potential in terms of both time and money.
Clinical Significance
Currently, the standard protocol for long-term care coordination of adrenal incidentalomas
is provider-to-provider communication, which is highly error prone. To our knowledge,
the automated system presented here is the first created for tracking this information
prospectively.
Consider a protocol that does not utilize automated tools: (1) The reading radiologist
documents the incidentaloma in the radiology report, (2) the ordering physician reads
the report and may (or may not) communicate the incidental findings to the patient
and PCP, who then takes appropriate follow-up action—shared decision making with the
patient, interval imaging, further laboratory testing, or referral to a specialist
who performs the evaluation. A second such pathway exists for more centralized management
of incidental findings: (1) The reading radiologist documents the incidentaloma in
the radiology report; (2) A central coordinator reads all radiology reports, or a
subset generated via a keyword search, identifies the incidentaloma, and then (3)
schedules appropriate follow-up with the patient's PCP or a specialist, who then takes appropriate
follow-up action. In either case, these protocols would require a significant amount
of human effort and lead to many transitions of care, propagating errors inherent
in such handoffs. [Fig. 3] displays these pathways, as well as the more direct pathway made possible through
automated methods, such as the one presented here using NLP embedded within a custom
web application.
Fig. 3 Options for ensuring proper follow-up of incidentalomas. Incidentalomas recorded in
the radiology report can be communicated to appropriate providers by direct action
of the ordering physician, employment of an incidentaloma coordinator specifically
examining radiology reports for evidence of new incidentalomas, or with the use of
an automated identification tool such as the one described in our study.
We designed the web application around the NLP model's output to allow streamlined
management of these important clinical findings at our institution without requiring
physicians to input additional structured data. This framework is simple, readily
extendible, and a meaningful improvement over current practice. An overview of the
steps required to train and deploy a machine learning model to identify incidental
findings using this framework is shown in [Fig. 1]. The iterative process of tool development, deployment, and improvement will take
place in real time with physicians and quality department staff. This tool is meant
to augment the institution's ability to track and manage these findings without workflow
interruptions for frontline physicians, who are already time constrained in patient
interactions and chart review. Efforts to standardize the clinical judgements of a
diverse group of physicians who have had different training and habits may not be
fast enough to deliver care improvement at a sufficient pace.[31] Though we will make the tool available to interested PCPs, the target users work
in the quality department, which is particularly well suited for initial management
of these kinds of high-risk findings. As demonstrated here, the automated identification
of incidental findings, such as adrenal incidentalomas, represents an opportunity
to improve patient care and outcomes without negatively impacting the workflow of
providers.
Prior Approaches
Attempts to automatically identify imaging findings in radiology reports requiring
follow-up date back as far as 1993.[32] These prior efforts have explored the utility of an automated identification tool
across various modalities, including chest X-ray, CT, MRI, and ultrasound.[33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] These studies used rule-based systems, non-neural machine learning methods, or both,
to identify textual evidence of findings that require follow-up either across all
organ systems or focused on a particular organ. A 2019 study successfully used a CNN
to identify incidental findings in radiology reports, supporting the idea that interactive
NLP applications can be adopted as part of routine clinical care.[45]
Traditional machine learning approaches require the manual development of highly tuned text features. When the dataset used to train these models is expanded to include
more institutions, note types, or EMRs, these features must undergo nontrivial recalibration
and model performance may suffer significantly. Preliminary analyses of rule-based
approaches on our classification task produced significantly worse results, largely
due to negation and temporal statements (e.g., comments such as “as previously measured”
or “stable,” which we aimed to exclude). Unlike these prior efforts, we opted for
a neural network solution, which easily incorporates new unstructured data to improve
performance, thereby increasing model generalizability and avoiding bottlenecks in
model development and deployment. Rather than attempt to produce an exhaustive set
of rules, we produced additional document annotations to increase the size of our
dataset; these annotations are not time consuming to produce and are sufficient to
train a neural network to perform accurately. Therefore, the neural network core of our software makes extension to other institutions simple and is likely to provide performance gains at our institution and others.
Interpretability
We can use our model's Grad-CAM outputs to qualitatively assess its predictions. As
[Table 2] demonstrates, the model is attending to sentences relating to the adrenal nodules,
suggesting that the system has learned to use the “correct” parts of the document
(i.e., those relating to statements about adrenal incidentalomas) for classification.
These examples offer additional evidence that the model has been optimized to identify
higher order semantic features of the text that correspond to new adrenal incidentalomas,
rather than simply relying on string matching or other easily exploitable clues attributable
to the dataset itself. In addition to allowing for verification of machine predictions, the Grad-CAM outputs may improve the clinician user experience in the software by allowing users to quickly focus on the parts of the document that contain additional contextual information (e.g., size, location) about the identified
adrenal incidentaloma.
Limitations
There are several limitations to our study. First, two limitations pertain to dataset curation. Although a fellowship-trained radiologist was consulted when report language was unclear, positive imaging findings noted in a report were not further verified by a radiologist. Multiple physicians were involved in the annotation procedure to produce expert-level annotations, but because each report was labeled by a single annotator, the calculation of interrater reliability metrics (e.g., Cohen's
kappa) was not possible. Second, while our model achieves an F1 score of 87.6%
on the test set, our dataset is limited to a single academic institution. We employed
data augmentation techniques to improve the model's ability to recognize positive
examples, but the rate of positive examples in real hospital data is likely smaller.
The sensitivity of the model should not be affected by reduced disease prevalence, but the positive predictive value may be lower. Other institutions of varying types may introduce
additional complexities that limit our tool's usefulness in a wider setting, though
this risk is mitigated by the underlying structure of our model, which favors generalizability.
Third, the identification of new adrenal incidentalomas in radiology reports represents
only one of many such kinds of incidentally discovered findings requiring clinical
follow-up. Given the globally poor adherence to recommendations regarding incidental
findings in radiology reports, further work is needed to address other findings that
would benefit from automated identification. Lastly, as our purpose was to develop
a lightweight, accurate, and easily generalizable machine learning model to address
gaps in clinical continuity, we have not compared our chosen model, a CNN, against other architectures that could yield performance gains.
Future Directions
In future work we intend to expand the dataset to include reports from other institutions
to build a more readily generalizable model. We also plan to expand the scope of the
software and underlying NLP models to include additional clinically actionable radiological
findings (e.g., solitary pulmonary nodules). We will iterate on the front-end interface
in collaboration with users and improve integration with existing EMR systems—ideally
these actionable clinical items would be pushed back into the institution's EMR. Further
descriptive characteristics of the incidentalomas can also be identified and extracted
from these radiology reports. Though our model is able to identify new adrenal incidentaloma
findings, it does not extract specific descriptors of the nodule (e.g., size or appearance),
nor the follow-up measures specified in the report. Extracting these descriptors could facilitate an automated pipeline that schedules appropriate follow-up, reducing both the burden on providers and opportunities for human error.
Lastly, we intend to prospectively assess the impact of this tool on rates of clinical follow-up after a period of active use.
Conclusion
In conclusion, we demonstrate the feasibility of training a neural network to automatically
identify free-text radiology reports that contain textual evidence of previously unseen
adrenal incidentalomas. We achieve accurate, clinically useful performance using a
small training corpus, and train the network in less than 1 hour on a machine with
no GPU. We also develop a front-end web application for centralized tracking and management
of incidentalomas by the hospital's quality department. The workflow enabled by our
system does not require changing the practice patterns of radiologists, a requirement that can impede efforts to automate the detection, reporting, and communication of incidental radiographic findings. Furthermore, the overall system is designed to be generalizable
to any similar task (e.g., solitary pulmonary nodules, mammographic findings, etc.).
Although neural network models have been employed in medicine, relatively few actionable
tools have been created to successfully harness their predictive power in real time.
With the creation of our neural network-enabled web application, we demonstrate that these innovations can be put to use today to fill clinical care gaps that would otherwise be burdensome to address.
Clinical Relevance Statement
Machine learning models using neural networks can be deployed in clinical practice
to automate otherwise onerous chart review tasks without specialized computing hardware.
High-risk incidental findings such as adrenal incidentalomas can be tracked and managed
with a centralized web application. Practitioners and health care systems should consider
deployment of such tools for quality improvement and risk reduction purposes.
Multiple Choice Questions
1. How will automated identification of incidental findings affect the individual
workflow of the ordering provider?
a. The provider will communicate findings directly to a dedicated incidental finding coordinator.
b. The provider will communicate findings directly to the patient's primary care provider.
c. The provider will confirm suspected incidental findings in radiology reports.
d. No adjustment of individual workflow is required.
Correct Answer: The correct answer is option d. One of the greatest benefits of automated identification tools is that they do not require additional input from providers. Our application monitors radiology reports in the background, and new pipelines can be developed that leverage the model's predictions to order appropriate follow-up without direct intervention (d). Without automated identification, communication of incidental findings requires active patient handoff measures, such as (a) the employment of a dedicated coordinator who searches through radiology reports for evidence of new incidental findings or (b and c) physicians closely reading reports and relaying information to a patient's primary care provider.
2. Which of the following statements regarding the epidemiology and impact of adrenal
incidentalomas is true?
a. The majority of adrenal incidentalomas are indicators of serious underlying pathology.
b. As many as 1 in 10 abdominal CT scans contain incidental adrenal gland lesions in certain patient populations.
c. Existing estimates of appropriate follow-up for adrenal incidentalomas generally exceed 80%.
d. The projected number of reported incidental adrenal findings is expected to remain stable over time.
Correct Answer: The correct answer is option b. The overall prevalence of adrenal incidentalomas on abdominal CT is approximately 4%, but this rate rises to nearly 10% in elderly populations (b). Although the majority of incidental adrenal findings are benign and nonfunctional (a), they may be pathological by means of an underlying malignancy or functional hormonal secretion. Given that the number of reported incidental adrenal findings is expected to rise significantly as a result of an increased reliance on imaging results for treatment decisions and an aging population (d), it is important that we optimize methods for ensuring appropriate follow-up. Current rates of follow-up of incidental findings are suboptimal, hovering around 50% (c).
3. Which of the following is true regarding the machine learning model employed in
this study?
a. The features used by the model to make predictions were selected based on finely tuned parameters following manual review of the dataset.
b. The model requires the use of a graphics processing unit (GPU) to make predictions within a reasonable timeframe.
c. The model uses a shallow neural network to automatically identify important text spans in radiology reports that support its predictions.
d. There are currently no methods available for providing qualitative insight into how the model makes decisions.
Correct Answer: The correct answer is option c. Unlike traditional machine learning models or rule-based
approaches to automated prediction, neural network models such as the one in this
study (c) can learn to automatically identify and use the most salient features in
the data for the prescribed task. Earlier machine learning approaches require the
careful derivation of useful text features (a) that must then be adapted as the dataset
complexity increases. Our model intentionally uses a shallow CNN architecture to ensure
that calculations can be completed in real time without the need for a GPU (b). To
gain further insight into model function, we use gradient class activation maps, which
highlight the text spans the model pays most attention to when making predictions
(d).
Appendix A
1. Word Embedding Selection
In our study, we prioritized the development of immediately deployable, actionable,
clinically-useful applications in a setting with limited computational resources.
For these reasons, we decided to use a well-known, well-studied, publicly-accessible
set of vector embeddings. Global Vector (GloVe) pretrained vectors have been found
to perform well at a wide range of NLP tasks. Although other pretrained vector sets
may be more specific to the task of analyzing radiology reports (such as those developed
by a 2019 study[23]), they have not been validated in multiple settings over time as GloVe vectors have been. The use of a well-known, widely used set of pretrained vectors, such as GloVe vectors, allowed us to increase the robustness of our model. In addition, because GloVe vectors are trained on significantly larger corpora, they generalize more robustly. In this manuscript, we demonstrate one example of a machine-learning powered application at a large institution. We chose GloVe vectors to allow for simpler emulation by investigators and quality improvement leaders at other institutions; their use should produce similar performance across different institutions compared with word vectors that were trained on less data but dedicated to more specific tasks.
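As a concrete sketch, the pretrained vectors and tokenizer are available through spaCy's large English model (the exact model version used in the study is not stated):

```python
# Minimal sketch of accessing the pretrained 300-dimensional vectors packaged
# with spaCy's large English model and tokenizing a report with the built-in
# tokenizer. Requires: python -m spacy download en_core_web_lg
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("1.5 cm left adrenal nodule, incompletely characterized.")

tokens = [token.text for token in doc]      # spaCy tokenization
vectors = [token.vector for token in doc]   # each token vector has shape (300,)
```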
2. Hyperparameter Selection Strategy
The number of layers was selected based on the intuition that the features of interest were unlikely to span more than seven words and thus did not require multiple convolutional layers to aggregate information beyond this length. The dropout
rate was selected based on similar studies, such as a 2014 study using convolutional
neural networks (CNNs) for sentence classification.[22] We opted not to do an exhaustive hyperparameter search, as our goal was to find
a simple (i.e., not computationally expensive) model which performed sufficiently
well for our purposes, i.e., to provide a substantial improvement over the current
system, which consists of no standardized process for free-text clinical results tracking.
Furthermore, more complex models (including deeper CNNs as well as long short-term memory networks) generally require more labeled data to train effectively, suggesting that
it made sense to start with a simpler model and add complexity as necessary, rather
than starting with a more complex model. Increasing the size of the labeled dataset
proved to be the most effective intervention to improve performance, and based on
our past experience it is likely that adding more labeled examples to learn from would
be more effective than searching for optimal model hyperparameters.
3. Calculation of Confidence Intervals, Sensitivity, and Specificity
Confidence intervals were derived based on a single trained model evaluated on the
test set once (with a single 80/10/10 split), and are meant to reflect a confidence
interval for this model's performance on unseen data drawn from the same distribution
as the test set. They are not meant to be interpreted as confidence intervals for
the general training process or model architecture, the way that k-fold cross validation would be.
The decision to use a single holdout test set, as opposed to evaluating the model with cross-validation, was made for two reasons. First, the use of repeated cross-validation to assess
the performance of a model in a particular practical domain may produce misleading
results. The test set, which acts as the set of strictly unseen data the model never
has access to during the training/validation phases of model creation, most closely
resembles the data the model will be required to analyze in practice. Furthermore,
as we endeavored to make our model as reproducible as possible by investigators at
other institutions, we aimed to keep the model design and validation procedure simple.
In this binary classification setting, after the softmax is applied, the two output nodes (yes/no) have values that sum to 1.0. If the positive node's output was greater than 0.5 (i.e., greater than the negative node's output), the prediction was treated as positive; otherwise, it was treated as negative. For sensitivity, the 95% CI was calculated by treating each positively labeled example as a Bernoulli trial for the model, in which a positive prediction (TP) is a success and a negative prediction (FN) is a failure; the sensitivity is therefore a binomial proportion and can be approximated with standard binomial proportion confidence intervals, for which we used the Wilson score interval. The analogous calculation, over all positively predicted examples, was used for the positive predictive value.
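For example, the Wilson score intervals for sensitivity and positive predictive value can be reproduced from the test-set counts in Appendix C (TP = 39, FN = 3, FP = 8) using statsmodels:

```python
# Sketch of the Wilson score interval calculation for sensitivity and positive
# predictive value, using the test-set counts from Appendix C.
from statsmodels.stats.proportion import proportion_confint

tp, fn, fp = 39, 3, 8

sens_low, sens_high = proportion_confint(tp, tp + fn, alpha=0.05, method="wilson")
ppv_low, ppv_high = proportion_confint(tp, tp + fp, alpha=0.05, method="wilson")

print(f"Sensitivity {tp / (tp + fn):.1%} (95% CI {sens_low:.1%}-{sens_high:.1%})")
print(f"PPV         {tp / (tp + fp):.1%} (95% CI {ppv_low:.1%}-{ppv_high:.1%})")
```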
4. Additional Model Performance Statistics
Validation set performance was recorded to assess model performance on unseen data. On the validation set at the final training epoch, our network achieved an accuracy of 97.3% (95% confidence interval: 95.2–98.5%), a recall (sensitivity) of 100% (95% CI: 90.6–100%), a precision (positive predictive value) of 77.1% (95% CI: 63.5–86.7%), and an F1 score of 87.0%.
Appendix B
Validation set result distribution

                 Ground truth +    Ground truth −
Model +                37                11
Model −                 0               361
Appendix C
Test set result distribution

                 Ground truth +    Ground truth −
Model +                39                 8
Model −                 3               359
|