Keywords
electronic health records - hemorrhage - artificial intelligence - decision support systems - natural language processing
Background and Significance
The electronic health record (EHR) contains important information on patients' medical
history.[1] It is used by medical doctors (MDs) to obtain knowledge on patient history before
and during patient contact to diagnose and guide their treatment. However, critical
information may be embedded in unstructured, narrative text,[2] from which it can be time-consuming and complex to extract.[3] Diagnoses are registered using codes from the International Classification of Diseases, 10th revision (ICD), but detailed information is lost when coding. Furthermore, ICD codes are known to be error-prone[4][5][6][7][8] and are only used for primary diagnoses leading to admission. Complications such as hemorrhage are not coded. As a result, important information can easily be missed in clinical practice and research, and there is a risk of degraded patient treatment.
Artificial intelligence (AI) methods can be used for fast, automatic identification of information in EHR text,[9][10][11][12][13][14][15][16] for example, to retrieve information from the EHR in an acute clinical setting. However, clinical evaluation of these methods is currently lacking.[17]
A patient's hemorrhage history is one piece of information that often exists only as unstructured, narrative text. Hemorrhage history is a cornerstone of hemorrhage risk assessment, as previous hemorrhage is a prominent risk factor for future hemorrhage.[18][19] It is recommended to assess hemorrhage risk in all patients admitted to hospital, and for some applications a risk score is calculated. Assessment of hemorrhage risk can also be relevant in other situations, for example prior to surgical procedures or when initiating antithrombotic or anticoagulant therapy, as the therapy itself increases hemorrhage risk.[18] Previous studies on risk scores show a lack of adherence to guidelines.[20][21][22] A main cause of nonadherence was that the MDs did not have time for the task;[21] in the acute clinical setting, they prioritized direct patient care. Automatically directing the MD to the text passages relevant for reading during chart review would greatly reduce the time needed for the task.
Objectives
The aim of this study is to evaluate whether MDs identify more relevant information on hemorrhage events in a clinical setting when assisted by the AI model, and to measure the MDs' perception of using the AI model.
Methods
Data and Study Population
The data were extracted from the electronic health care information system COSMIC
(Cambio Healthcare Systems, Søborg, Denmark) from a 5-year period at the Odense University
Hospital (OUH), Denmark. Information regarding comorbidities was based on ICD codes. ICD codes and dates of death were retrieved from administrative registers.
The AI model was developed using 900 randomly sampled EHRs with a registered ICD code
for hemorrhage to ensure representation of hemorrhages from all organs (ICD codes
in [Supplementary Appendix A], available in the online version). The data consist of 25,862 sentences labeled
as positive or negative with respect to indicating a hemorrhage event in the patient.
Part of the data was established earlier.[16] In the current study, we increased the size and quality of the dataset to improve model performance and robustness across different patient characteristics and hemorrhage locations.
After development, the AI model was evaluated in a test cohort consisting of 566 admissions.
We sampled admissions both with and without ICD codes for hemorrhage. The evaluation
data did not contain any text that had previously been seen by the AI model during
training.
Development of Artificial Intelligence Model for Hemorrhage Identification
We developed a model that can analyze Danish EHR note text and find sentences indicative
of hemorrhage.
For developing the AI model, we used the sentences from the 900 EHRs labeled as either
positive or negative for hemorrhage. The data were reviewed and labeled by MDs with
experience in hemorrhage disorders. All sentences were reviewed by one MD. All sentences
that the MD found to mention a current, prior, or possible hemorrhage event were reviewed
by three MDs. Moreover, hemorrhage events were categorized into one of 12 groups based on the anatomical location of the hemorrhage: Central nervous system (CNS), Eye, Ear–nose–throat, Airway, Gastrointestinal, Internal, Gynecological, Urological, Muscle and joint, or Dermatological. Sentences that could not be assigned to an anatomical location were categorized as Unknown. Sentences that mentioned hemorrhages at multiple anatomical locations were assigned to the category Multiple locations.
The AI model takes a sentence as input and creates a numerical representation that is used to classify the sentence as either positive or negative for hemorrhage. The sentences classified as positive are then highlighted in the text. The 25,862 annotated
sentences were split into a training (80%), validation (10%), and test set (10%).
The training set was used to train the AI model, the validation set was used to guide
and tune hyperparameters during training, and the test set was used to evaluate the
AI model. The training, validation, and test sets each contained 50% positive sentences
and 50% negative sentences. Positive sentences of the validation and test sets were
balanced using 130 sentences from each anatomical location, except for muscle and joint hemorrhage and hemorrhages from multiple locations, which had too few samples.
The test set was used to compare the performances of three models: a logistic regression
model, a gated recurrent unit (GRU) cell combined with a convolutional neural network
(CNN), and a transformer-based ELECTRA model. Further details can be found in [Supplementary Appendix B] (available in the online version).
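As a rough sketch of this setup (not the study's actual code), the snippet below performs a class-stratified 80/10/10 split and trains the simplest of the three compared models, a logistic regression classifier, on TF-IDF sentence features; the placeholder sentences and all identifiers are illustrative assumptions.

```python
# Minimal sketch of a baseline sentence classifier with an 80/10/10 split.
# Placeholder data; the study used 25,862 labeled Danish EHR sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sentences = ["blood in urine noted"] * 10 + ["no acute findings"] * 10
labels = [1] * 10 + [0] * 10  # 1 = hemorrhage, 0 = no hemorrhage

# 80% train, 20% held out; stratification keeps the 50/50 class balance.
X_train, X_hold, y_train, y_hold = train_test_split(
    sentences, labels, test_size=0.2, stratify=labels, random_state=0)
# Split the held-out 20% evenly into validation (10%) and test (10%).
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0)

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)
print("validation accuracy:", model.score(vectorizer.transform(X_val), y_val))
```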
After development, we evaluated the AI model on a test cohort consisting of 566 full
admissions reviewed by the MDs. To detect potential bias and address the clinical
usefulness, we compared the performance of the AI model between selected groups: minor
versus major hemorrhage, young (<70 years) versus elderly (≥70 years), males versus
females, and between hemorrhage locations. Minor and major hemorrhage were defined as in Decousus et al.[18]
User Studies
For the user studies, we included a total of 16 MDs with 1 to 25 years of clinical
experience. The participants worked in departments of clinical biochemistry, genetics, gynecology, hematology, infectious medicine, orthopaedic surgery, pathology, radiology, and urology. Patient types and the roles of the MDs varied, but hemorrhage was relevant in all departments.
For each task, MDs were provided with an EHR text and given 10 minutes to complete the task, a moderate time constraint intended to imitate work in a clinical setting.
Participants were given a thorough description of the task before inclusion. The evaluation
was conducted in June and July 2022.
Evaluation of Manual Information Extraction with Eye Tracking
Six MDs participated in a workshop with the purpose of investigating their reading
workflow when doing manual chart review. The participants were told to extract information
regarding hemorrhage events from a fictive EHR containing 79 clinical notes. The fictive
EHR contained 21 sentences describing hemorrhage events.
We used Pupil Center Corneal Reflection[23] eye tracking to analyze their gaze during the task. The EHR was a fictive patient case written by an experienced MD and made available in the COSMIC EHR system, which was the system used in clinical practice at the time, to ensure familiarity with the user interface. We defined a mention of a hemorrhage event as having been read and
identified if the MD's gaze was fixed on it for at least 1 second.
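A minimal sketch of this criterion is shown below; it treats a mention as identified once the fixation time recorded on it totals at least 1 second. The (mention_id, duration) record format is an assumption for illustration, since eye-tracker exports vary by vendor, and whether the study summed fixations or required one continuous fixation is not specified here.

```python
# Sketch: flag hemorrhage mentions as "read and identified" from fixation data.
# Fixations are assumed to be (mention_id, duration_seconds) pairs; real
# eye-tracking exports differ, so this format is an illustrative assumption.
from collections import defaultdict

FIXATION_THRESHOLD_S = 1.0  # the 1-second criterion described above

def identified_mentions(fixations):
    """Return the ids of mentions whose accumulated fixation time is >= 1 s."""
    total = defaultdict(float)
    for mention_id, duration_s in fixations:
        total[mention_id] += duration_s
    return {m for m, t in total.items() if t >= FIXATION_THRESHOLD_S}

# Mention 3 accumulates 1.1 s over two fixations and counts as identified.
print(identified_mentions([(3, 0.6), (7, 0.4), (3, 0.5)]))  # {3}
```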
Clinical Use Study of Artificial Intelligence-Assisted Chart Review
To compare manual chart review with AI-assisted chart review, we performed a clinical use study with two admissions from the EHR system of the OUH. Thirteen MDs participated in the study. Seven participants were given admission A and six participants were given
given admission B. The two admissions (A and B) used to evaluate the MDs' chart review
performance with and without AI model assistance contained 63 and 51 hemorrhages,
respectively.
First, participants reviewed the admission without AI model assistance and then with
AI model assistance. When assisted by the AI model, sentences that mentioned hemorrhage
according to the AI model had been highlighted with a yellow background color. Between the nonassisted and assisted review, participants were given a random admission to review with AI model assistance, both to practice using the AI model and to reduce their recall of the location of hemorrhages from the first review.
The participants were informed about the performance of the AI model and given a definition
of what constituted a relevant hemorrhage event. They were told to review the admissions as they would in clinical practice and, to simulate clinical conditions, that they had a maximum of 10 minutes per admission. MDs reported hemorrhage events
by highlighting the sentence with red color.
The admissions were anonymized and presented in Microsoft Word (Microsoft, Redmond, WA) in black text, font size 10, formatted to resemble the electronic health care information system of the OUH.
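To make the presentation concrete, the sketch below shows one way model-positive sentences could be given a yellow highlight; it renders HTML purely for illustration, whereas the study applied the highlighting in Microsoft Word, and the classify function stands in for the trained AI model.

```python
# Sketch: give model-positive sentences a yellow background for review.
# `classify` is a stand-in for the trained model; HTML output is illustrative
# only, as the study presented the highlighted text in Microsoft Word.
import html

def highlight_note(sentences, classify):
    parts = []
    for sentence in sentences:
        text = html.escape(sentence)
        if classify(sentence):  # True = sentence indicates hemorrhage
            parts.append(f'<mark style="background: yellow">{text}</mark>')
        else:
            parts.append(text)
    return " ".join(parts)

# Toy keyword classifier, for demonstration only.
print(highlight_note(
    ["Patient reports melena.", "Vitals are stable."],
    classify=lambda s: "melena" in s.lower(),
))
```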
After completion, participants were asked to rate the AI model's usefulness in a clinical
setting on a 5-point scale from 1 to 5 where 5 was the most useful. Furthermore, participants
were interviewed in a semistructured format. They were for example asked to describe
advantages and disadvantages of the AI model and if they preferred chart review with
or without AI model assistance.
Statistics
For developing the AI model, we used Python 3.6 and TensorFlow 2.0.
For evaluation of the AI model, we reported descriptive statistics, sensitivity, and
specificity.
For the user studies, we reported sensitivity and participants' ratings. Means were reported with standard deviation (SD). Frequencies were reported as counts and percentages.
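For reference, the two reported metrics follow directly from confusion-matrix counts, as in the sketch below; the counts used in the example are invented for illustration.

```python
# Sketch: sensitivity and specificity from confusion-matrix counts.
def sensitivity(tp, fn):
    """Share of true hemorrhage sentences that the model flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of non-hemorrhage sentences that the model leaves unflagged."""
    return tn / (tn + fp)

# Invented counts for illustration only.
print(f"sensitivity = {sensitivity(tp=950, fn=50):.1%}")  # 95.0%
print(f"specificity = {specificity(tn=980, fp=20):.1%}")  # 98.0%
```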
Results
Artificial Intelligence Model for Hemorrhage Identification
We trained three AI models and found that a transformer-based architecture, ELECTRA,[24] performed best with a sensitivity and specificity of 95.8% on the balanced test
set. See [Supplementary Appendix B] (available in the online version) for further details regarding the model development
and hyperparameter tuning.
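For readers who want a concrete picture of how such a classifier is applied, the sketch below scores sentences with a fine-tuned ELECTRA model through the Hugging Face transformers TensorFlow API; the checkpoint path is a placeholder and the preprocessing is simplified, so this should be read as an illustration rather than the study's pipeline.

```python
# Sketch: sentence-level inference with a fine-tuned ELECTRA classifier.
# "path/to/finetuned-electra" is a placeholder; the study's checkpoint and
# exact preprocessing are not reproduced here.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-electra")
model = TFAutoModelForSequenceClassification.from_pretrained("path/to/finetuned-electra")

def classify_sentences(sentences):
    """Return a 0/1 hemorrhage prediction for each input sentence."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="tf")
    logits = model(**inputs).logits  # shape: (batch, 2)
    return tf.argmax(logits, axis=-1).numpy().tolist()

# Danish example sentence ("Fresh bleeding from the nose is seen.").
print(classify_sentences(["Der ses frisk blødning fra næsen."]))  # e.g. [1]
```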
Evaluation of Artificial Intelligence Model for Hemorrhage Identification
[Table 1] shows the patient characteristics of the evaluation population consisting of 566
admissions. Patients had a median age of 71 years and 47.2% were women. Admissions
had a median of 39 EHR notes, ranging from 2 to 542 notes. Essential hypertension was the most frequently coded symptom or disease, registered in 7.1% of admissions.
Table 1 Patient characteristics of the 566 admissions for evaluation

| Patient characteristic | Value |
| Admissions, n | 566 |
| EHR notes, n | 37,058 |
| EHR notes per admission, median (range) | 39 (2–542) |
| Sex, % women | 47.2% |
| Age (y), median (range) | 71 (0–97) |
| Registered diagnoses during admissions (≥2% of admissions) | Essential hypertension (7.1%); Atrial fibrillation or atrial flutter (4.2%); Pneumonia (3.5%); Chronic obstructive pulmonary disease (2.8%); Retention of urine (2.1%); Chronic kidney disease (2.1%); Anemia (2.1%) |

Abbreviations: EHR, electronic health record; ICD, International Classification of Diseases.
Notes: Registered diagnoses were based on ICD codes during admission. ICD codes for hemorrhage were omitted.
The most frequent hemorrhage location was urological, with 1,019 sentences from 143 admissions indicating hemorrhage. The least frequent was muscle and joint, with 16 sentences from nine admissions indicating hemorrhage. Patients
experienced a total of 637 hemorrhages (range: 1–7) in different anatomical locations
during 385 admissions. The patients experienced hemorrhage in a single anatomical
location during 226 admissions and in two or more anatomical locations during 159
admissions. Admissions had a median of three sentences indicating hemorrhage, ranging
from 0 to 96. The majority of admissions with sentences indicating hemorrhage in an
unknown location (43/47) had other sentences indicating hemorrhage in a specific location.
The 566 admissions included 4,413 sentences indicating hemorrhage and 421,798 sentences
not indicating hemorrhage. See [Supplementary Appendix C] for further results (available in the online version).
The AI model had a sensitivity of 93.7% and specificity of 98.1% on sentences in the
566 admissions.
[Fig. 1] shows the sentence-level sensitivity of the AI model by anatomical location on the
566 evaluation admissions. The AI model had the lowest sensitivity for dermatological
hemorrhage at 87.6%, whereas it had the highest sensitivity for muscle and joint hemorrhage
and eye hemorrhage at 100%.
Fig. 1 Sentence-level sensitivity of the artificial intelligence model by anatomical location on the 566 evaluation admissions. "Unknown" denotes sentences that could not be assigned to an anatomical location.
We compared the performance of the AI model on the 566 evaluation admissions between selected patient groups to detect potential bias. There were no major differences. The largest difference in sensitivity was between minor hemorrhage (95.0%) and major hemorrhage (91.1%). There were no noteworthy differences in sensitivity between males and females (93.6 vs. 93.8%) or young and elderly patients (93.0 vs. 94.3%).
On average, the AI model processed all sentences of an admission in approximately 0.5 seconds using an NVIDIA Tesla V100 GPU.
User Studies
Eye Tracking
The study showed that MDs overlooked 39% of possible hemorrhage events during manual
chart review. On average, MDs identified 8.5 of the 10 hemorrhage events described
in bullet-pointed text, whereas they only identified 5.3 of the 11 hemorrhage events
described in paragraphs. Moreover, MDs identified more hemorrhage events that were
described in the beginning of a note (7 out of 7) or paragraph (6.3 out of 7) compared
with hemorrhage described in the middle of a note (6.8 out of 14) or paragraph (7.5
out of 14). See [Supplementary Appendix D] for further results (available in the online version).
Clinical Use Study
[Fig. 2] shows an example of EHR text when being reviewed with AI model assistance.
Fig. 2 An example of electronic health record text when being reviewed with artificial intelligence
model assistance.
[Fig. 3] shows the performance of the participants when reviewing with and without AI model
assistance for admission A and B. All participants increased the absolute number of
identified hemorrhages when reviewing with AI model assistance.
Fig. 3 Visualization of change in identified hemorrhages between reviewing without and with artificial intelligence model assistance. A.1: absolute number of identified hemorrhages for admission A; A.2: percent identified hemorrhages in text reviewed for admission A; B.1: absolute number of identified hemorrhages for admission B; B.2: percent identified hemorrhages in text reviewed for admission B. Participants 1 to 7 reviewed admission A, and participants 8 to 13 reviewed admission B. MD, medical doctor.
For admission A, the participants identified on average 45% (SD: ± 8) of hemorrhages
on the full admission when reviewing without AI model assistance. With AI model assistance,
it increased to 93% (SD: ± 13). Without AI model assistance, participants missed 33%
(SD: ± 16) of hemorrhage sentences in the text reviewed. With AI model assistance,
participants missed no hemorrhage sentences.
For admission B, the participants identified on average 26% (SD: ± 17) of hemorrhages
on the full admission when reviewing without AI model assistance. With AI model assistance,
it increased to 75% (SD: ± 10). Without AI model assistance, participants missed 46%
(SD: ± 24) of hemorrhage sentences in the text reviewed. With AI model assistance,
participants only missed 11% (SD: ± 6) of hemorrhage sentences.
On average, participants who reviewed admission A rated the usefulness as 4.0 (SD: ± 0.9)
on a scale from 1 to 5 (additional information in [Supplementary Appendix E], available in the online version). They described the AI model as useful for fast
detection of relevant information and for sorting information. Six out of seven preferred
AI model assistance over no AI model assistance and one did not know.
Participants who reviewed admission B rated the usefulness as 3.7 (SD: ± 0.5) on a
scale from 1 to 5. They described that the AI model was useful, but that it did not
find all relevant information, leading to a risk of overlooking important information
if basing decisions solely on the AI model's findings. Five out of six preferred AI
model assistance over no AI model assistance.
In the semistructured interviews, participants stated that the AI model should save time and resources and be useful from a medical perspective. It was also important that it spared the MD a task.
Participants noted that the AI model could be of assistance in an emergency department
where MDs must retrieve information about the patient's medical history from many
different specialties. They stated that it is difficult to decide what is useful and to define search terms for specialties in which they are not well-versed, and that they would prefer the AI model to find the relevant information.
They expected that an AI model would perform better than a busy MD on the task of
extracting information from EHRs.
The participants generally expressed that the AI model's output in the form of yellow markings was an advantage. They reported that if they trusted the AI model, they would be more likely to look only at the highlighted text and thus perform chart review faster than without AI model assistance. The participants did not specify what level of performance they would require of the AI model in order to trust it.
Participants expressed concern about blindly trusting the AI model and noted that one runs the risk of losing focus in the assessment of clinical information. One participant
stated that it did not aid in the understanding of the text content. Another reported
being taught (or reminded) by the AI model that the phrase "grade 1" pertains to a scale for measuring blood in the urine and therefore indicates hematuria. Thus,
the AI model had helped with learning and understanding.
[Supplementary Appendix F] (available in the online version) contains relevant quotes from the semistructured
interviews.
Discussion
This study evaluated an AI model for finding hemorrhage events. It had a sensitivity of 93.7% and acceptable performance in relevant clinical settings. MDs missed more than 33% of relevant sentences when doing chart review without AI assistance and identified more hemorrhages when reviewing the EHR with assistance from the AI model. User satisfaction with the AI assistance was high.
This study found that the transformer-based model performed better than a logistic
regression and a GRU–CNN model. This is in line with recent developments in language
technology where the transformer-based models have achieved state-of-the-art performance
on many benchmarks.[25][26]
Automatic hemorrhage identification has been investigated in previous studies.[9][10][11][12][13][14][15][16] Pedersen et al used an AI model to classify Danish clinical text as positive or negative for hemorrhage and achieved an accuracy of 90% on a balanced test set.[16] For English EHRs, Li et al used an AI model to detect hemorrhage events in EHR sentences with an F1 score of 94%,[11] and Taggart et al detected hemorrhage events at a note level using a rule-based approach with an F1 score of 74%.[10] Mitra et al used an AI model to extract single words that indicated hemorrhage and achieved an F1 score of 75%.[13] However, most studies used small test sets during model development (<1,000 samples)[10][11][12][13][14][16] and did not investigate model performance between different hemorrhage types or patient characteristics.[9][10][11][12][13][14][16] In contrast to previous studies on hemorrhage identification, we performed an evaluation in a test cohort.
Properly investigating AI model performance is important, since studies have found that AI models for text are subject to many sources of bias, e.g., gender bias, that could influence the performance of the AI model in clinical settings.[27] In this regard, it is of clinical importance that the AI model has high performance in all settings and patient groups. In this study, we found that the AI model performed similarly on males versus females, major versus minor hemorrhage, and patients aged <70 versus ≥70 years. We also showed that the model performed similarly across hemorrhage locations. Evaluating performance across hemorrhage locations is important because the phrasing used to describe hemorrhage varies greatly depending on the location,[28] e.g., epistaxis is a word used specifically for nose hemorrhage. If the model had
not been able to detect hemorrhages for all locations, it would be reflected as bias
toward specific patient groups in clinical practice. Overall, the present study shows
that the AI model for hemorrhage identification has an acceptable performance for
clinical use.
Previous studies did not investigate MDs' performance with, or perception of using, an AI model for finding hemorrhage events in a clinical setting. Our study provides
a clinical use evaluation, which showed that important hemorrhage information can
easily be overlooked when reviewing clinical text without assistance. MDs did not
register up to 46% of hemorrhage sentences when reviewing an admission under moderate
time pressure. Although they may have reached the right conclusion about the patients' hemorrhage history, this suggests that critical information can be missed. Few hemorrhage
events were missed when assisted by an AI model for hemorrhage identification, showing
a potential role for such tools in clinical practice. Congruently, MDs were positive
toward using the AI model as decision support and stated that it was easier to review
with AI model assistance and found it useful for providing a fast overview of hemorrhage
events. The MDs did not request any explanation of the internal mechanisms of the
AI model. Instead, they reported being satisfied with the transparent output of the
model in terms of highlighting of relevant text. The characteristics of the innovation
that participants highlighted are factors that are also positively associated with
adoption of innovation. These perspectives include trialability (the ability to try
out the innovation) and observability (the ability to observe the functionality),
which were achieved in the study.[29] Ease of use is also relevant for the adoption of innovation; it concerns both the specific application itself and the context of its use. Thus, the implementation
of an innovation should not introduce an extra task but make medical decisions and
workflow more efficient. Further, users expressed a preference for reviewing text
with AI assistance, most likely because they find it easier. This is consistent with
psychological theories as automated decisions made by “system 1” require less cognitive
effort than “system 2” decisions that involve processing and judgment of information.[30] A risk of solely focusing on highlighted text is overlooking important information.
However, there is also potential for learning by highlighting text that a health care
professional may not have realized was relevant to the topic, as also demonstrated
in this study.
An AI model that highlights sentences helps MDs capture the relevant
information. However, it still requires cognitive effort to condense and evaluate
the highlighted content, and the procedure does not eliminate manual work. On the
other hand, there is no loss of information, and the approach ensures that the data
can be processed in various contexts since nothing has been filtered out. In specific
clinical scenarios, such as when the MD is interested in a specific previous hemorrhage,
all the output from the AI model might not have to be reviewed, which would reduce
manual workload.
Limitations
The generalizability of the study is limited in that the data were sampled using ICD codes, and therefore are not indicative of a typical distribution of patients at the hospital, and in that all data came from a single hospital.
The user studies may not fully reflect the potential value of the AI model as it was
tested in a mock-up, and thus, the setting may have influenced the exact results in
terms of number of hemorrhages identified and text reviewed. This study evaluated
the AI model by having the participants review an admission with and without AI assistance.
While this methodology provides a straightforward means of comparing manual chart
review with and without AI assistance, it has a disadvantage regarding potential recall
bias if participants were able to remember the exact position of hemorrhage events
in the admission. We mitigated this by using long admissions with many scattered sentences
indicative of bleeding and by having participants review a random admission in between
the nonassisted and assisted review. The time constraint is also expected to have reduced their recollection of the text. An alternative approach to accurately represent
a clinical scenario could be to ask MDs targeted questions regarding hemorrhage events
and compare the performances of two groups of MDs, one group utilizing AI assistance
and the other not. Nonetheless, the user ratings and the results clearly indicate a positive, clinically relevant effect.
For the eye-tracking study, we defined a hemorrhage event as being identified if the
MDs' gaze remained fixed on the event for at least 1 second. Future work should conduct
a sensitivity analysis of this time threshold to assess its impact.
Conclusion
We developed an AI model for hemorrhage identification that correctly identifies 93.7%
of sentences indicating hemorrhage in an evaluation on a test cohort. Moreover, we
found that MDs identified more hemorrhages during chart review when assisted by the
AI model that highlights sentences with hemorrhage, compared with manual extraction, where MDs missed more than 33% of relevant sentences. MDs were positive toward using
an AI model for hemorrhage identification in clinical practice. Overall, the study
shows that the technology is clinically useful for information extraction from EHRs.
Clinical Relevance Statement
We developed an AI model for hemorrhage identification that can support MDs during chart review. The implications are a less time-consuming chart review in which MDs find more hemorrhages, improving patient treatment.
Multiple-Choice Questions
1. What advantages are there to AI-assisted chart review for hemorrhage events?

a. An AI removes the need for manual chart review
b. MDs find more relevant information when assisted
c. MDs have more time to do chart review
d. People without medical knowledge can perform the chart review

Correct Answer: The correct answer is option b. When assisted by an AI, the MDs found more relevant information in the form of hemorrhage events than when doing nonassisted manual chart review.
2. Which opinion did some MDs express about AI-assisted chart review?

a. Assisted chart review will lead to worse patient treatment
b. Assisted chart review is confusing to the patients
c. Some MDs are not comfortable with the technology
d. Blindly trusting the AI assistance is a concern

Correct Answer: The correct answer is option d. Some MDs stated concerns about blindly trusting the AI and that one runs the risk of losing focus in assessment of clinical information. However, there is also potential for learning by highlighting text that a health care professional may not have realized was relevant to the topic.