Keywords
text summarization - emergency department - clinical conversation - pretrained language model - documentation burden
Background and Significance
Health care professionals (HCPs), including clinicians, nurses, therapists, and other
practitioners, dedicate a considerable amount of their working hours to charting and
maintaining clinical documentation.[1]
[2]
[3] This labor-intensive process has been linked to burnout among these providers, manifesting
as emotional exhaustion, decreased focus, and heightened cognitive burden.[1]
[2] This issue is particularly prevalent within emergency departments (EDs),[4]
[5] where ED crowding impacts the process: the high volume of patients waiting to be seen and low throughput due to limited space, resources, staff, and inefficient flow further contribute to delays in treating patients.[6]
[7] In addition, Meaningful Use requirements, the Affordable Care Act reimbursement
models, and a highly regulated environment significantly impact clinical documentation
workflow and communication in routine care.[8]
[9] The literature reports that clinicians spend more time on electronic documentation and administrative tasks than on providing direct patient care,[3]
[10] as some clinicians may need to allocate over half of their working hours to charting.[11]
[12] In some cases, insufficient time for documentation leads to burnout.[12]
[13] Left unaddressed, these issues may lead to unintended choices and consequences, such as
noncompliant documentation practices, increased errors and duplicates, and reduced
documentation quality.[14]
In 2022, the Surgeon General issued an advisory on clinician burnout, which includes
several recommendations to address the burden on HCPs in the United States.[15] Some of the recommendations emphasize “designing technology to serve the needs of
health workers, care teams, and patients across the continuum of care” and “improving
our understanding of how to develop and apply health information technology that more
effectively supports health workers in the delivery of care.”[15] In line with that, the American Medical Informatics Association (AMIA) 25 × 5 Task
Force issued a call for action to implement personalized clinical decision support
to improve user-specific workflows and support care recommendations[16] as well as emphasized artificial intelligence (AI) as part of current and emerging
applications to reduce documentation burden in the long term.[17]
Indeed, clinical documentation could be an AI-assisted process, interactively assisting
HCPs and easing the burden.[18]
[19]
[20] A digital scribe is an “automated clinical documentation system” that captures HCP conversations with patients and/or other providers and creates clinical documentation similar to that of a human medical scribe.[1]
[21]
[22] Several emerging natural language processing and deep learning models are being used for automated text summarization (ATS) and conversation summarization in the literature.[23] Even though emerging commercial tools show early evidence of the adoption of digital scribes in clinical practice,[24] research on digital scribing in medical informatics and health services has
been limited due to technical and algorithmic challenges in development and limited
dataset availability.[1]
[25] Another challenge with implementation is the nonlinear and redundant nature of conversations, with studies indicating that 80% of captured dialogue is superfluous for
effective note-taking.[21]
[26] To address these gaps in effective summarization in real-world settings, context-aware
models (like pretrained language models) could be utilized. Therefore, in this study,
we present and evaluate a proof-of-concept digital scribe system (as an ATS pipeline)
for clinical conversations using novel pretrained language models with a real-world
dataset of ED consultation sessions. Our goal is to investigate the feasibility, accuracy,
and impact of implementing an automated digital scribe in a clinical setting, providing
preliminary evidence with a proof-of-concept system in the case of ED consultation.
Our long-term vision is that digital scribe tools will improve documentation efficiency, reduce documentation burden, and enhance patient care and outcomes.
Background
ATS is the foundation of the digital scribe, and it aims to automatically generate
a concise and clear summary of a text, highlighting the key information for the intended
audience.[27]
[28]
[29] ATS can be broadly categorized into two approaches: extractive summarization and
abstractive summarization (ABS). Extractive summarization selects and combines important
sentences and fragments from the original text to form a summary.[30]
[31] ABS generates new summaries that incorporate the essential elements of the original
text, potentially including key phrases.[30]
[32] In this study, we used the ABS approach to reflect a more realistic and human-like approach to summarization, both identifying the important aspects of the original text and producing relevant new natural language summaries.[33]
Deep learning has been the predominant method for state-of-the-art ABS.[27]
[34] With the recent development of transformer network models and the larger generalized
language models,[35]
[36] fine-tuning and/or modifying pretrained transformer-based models have become the
leading techniques for ABS on public datasets.[27] Specialized transformer models have been developed for ABS, such as the PEGASUS
family of pretrained models,[27] BART,[37] and its modifications,[38] and T5 family.[39]
[40] Each of these models uses a sequence-to-sequence architecture, which combines an encoder block and a decoder block. While the models share this general architecture, they differ in their pretraining data and tasks (see section “Model Selection”). ABS in the biomedical field has mostly focused on online biomedical texts rather than clinical applications. Overall, ATS has been understudied for medical records, as only 11 of 58 reviewed studies (19%) used electronic medical record (EMR) information as input.[41] However, a recent survey on dialog summarization found that pretrained language model-based methods achieved the highest scores when summarizing public datasets of meeting conversations and chat logs.[42]
Methods
Study Setting and Data Collection
In the scope of our study, we use a dataset of phone conversations available at the Nationwide Children's Hospital (NCH) Physician Consult and Transfer Center (PCTC).[43] PCTC is a call service that receives calls from health care providers across the
United States to consult, admit, transfer, or refer patients. A nurse team responds
to the calls from physicians, registers their calls, connects them to physicians at
the NCH, and takes a summary note of the conversation into the corresponding patient
records (Epic EMR system).[44] ED patient transfer calls constitute a large amount of the daily PCTC calls (∼200
calls/day). Our proposed digital scribe system runs in a secure institutional network
and uses the conversational data (audio files) stored at the NCH servers. The study
used identifiable information to reflect a realistic use case and mitigate any issues that may be caused by anonymized data in summarization tasks. This study was approved by
the Institutional Ethical Board at the NCH (study ID: 00002897).
In this study, 100 phone call recordings from 100 unique callers (physicians) for
ED referrals at the NCH PCTC are used (∼412 total minutes). The calls are randomly
selected from the local server (between November and December 2022). Each call consists
of a multiturn conversation (ranging from 1 to 9 minutes each) among
PCTC nurses, an ED clinician or staff, and an external clinician or nurse. [Fig. 1] outlines the clinical flow and study design.
Fig. 1 Study design. AWS, Amazon Web Services; ED, emergency department; EMR, electronic
medical record.
Audio Transcription
To convert the audio recordings into text, we follow a two-step approach. First, we
use speech-to-text services via Amazon Web Services (AWS Transcribe)[45] and then an annotator reviews the original recordings and corrects any errors in
the transcript to generate clean transcripts. Dialog between speakers is differentiated
with a speaker label (e.g., “Speaker 1: Hello.”). The models have a maximum input
token size of 1,024 tokens. Of the 100 transcripts, 82 have fewer than 1,024 tokens, and the longest transcript is 1,987 tokens ([Fig. 2]). Longer transcripts were truncated to include only the first 1,024 tokens.
Fig. 2 A histogram of the number of tokens per transcript. The tokens were generated for
this graph using the BART tokenizer.[46] The vertical line represents the maximum input length of the models, 1,024 tokens,
and 82% of transcripts cluster to the left of this line.
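A minimal sketch of the truncation step described above is shown below, assuming the publicly available Hugging Face "facebook/bart-large-cnn" tokenizer; the function name and the sample string are illustrative, not the study's actual code.

```python
# Hedged sketch: truncate a speaker-labeled transcript to the models' 1,024-token input limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

def truncate_transcript(transcript_text: str, max_tokens: int = 1024) -> str:
    """Keep only the first `max_tokens` tokens of a transcript."""
    encoded = tokenizer(
        transcript_text,
        max_length=max_tokens,
        truncation=True,        # drop tokens beyond the model's input limit
        add_special_tokens=True,
    )
    return tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)

# Example: a short multiturn dialog is returned unchanged; longer transcripts are cut at 1,024 tokens.
sample = "Speaker 1: Hello, this is the transfer center.\nSpeaker 2: I have an ED referral to discuss."
print(truncate_transcript(sample))
```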
Dataset Creation
After audio transcription, we organize the transcription documents as text input into
the model. For reference text, we use nurse summary notes from the medical records
that accompany the transcriptions (as aforementioned in Section “Study Setting and
Data Collection”). Nurse summaries in the medical records, as-is, serve as high-quality ground truth summaries in this study. Therefore, we urge readers not to consider
the reference texts as traditional “gold standard” data but instead as a representation
of high-quality reference text, which informs our evaluation methods (see Section
“Evaluation”).
Model Selection
We employ four pretrained large language models (LLMs; T5-small,[39] T5-base,[39] PEGASUS-PubMed,[47] and BART-Large-CNN[46]) for the task of summarizing clinical conversation transcriptions based on their
unique strengths and adaptability to the health care domain. Our two T5 models use
the original T5 seq2seq architecture,[39] in a small variant (60 million parameters) and a base variant (220 million parameters). The T5 models were trained on a large corpus of English text and performed
well in tasks like summarization, question answering, and translation. PEGASUS-PubMed
(568 million parameters) comes from the class of PEGASUS models[47] developed for ABS. The inclusion of PEGASUS-PubMed in our selection is driven by
its specialization in the biomedical field (pretrained on biomedical literature from the PubMed repository).[47] BART-Large-CNN (406 million parameters) is a BART model fine-tuned on the CNN/Daily Mail dataset for summarization. BART-Large-CNN is chosen for its demonstrated
effectiveness in producing coherent and contextually accurate summaries.[46] Further comparison of the models is available in the [Supplementary Appendix 1] (available in the online version).
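For illustration, the four pretrained checkpoints can be loaded from the Hugging Face Hub as sketched below; the checkpoint identifiers are the public ones and may differ from the exact snapshots used in the study.

```python
# Hedged sketch: load the four seq2seq summarization models used in this study.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINTS = {
    "T5-small": "t5-small",                       # ~60M parameters
    "T5-base": "t5-base",                         # ~220M parameters
    "PEGASUS-PubMed": "google/pegasus-pubmed",    # ~568M parameters, pretrained on PubMed
    "BART-Large-CNN": "facebook/bart-large-cnn",  # ~406M parameters, fine-tuned on CNN/Daily Mail
}

models = {}
for name, ckpt in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
    models[name] = (tokenizer, model)
```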
Our choice of these models is influenced by their combined efficiency, domain-specific
accuracy, and ability to produce coherent, reliable summaries, which are critical
in the fast-paced and precision-oriented context of health care. In addition, these
models offer a practical solution, enabling us to process conversation transcriptions
quickly without overextending our hardware capabilities (all models were run on a
single A100 NVIDIA GPU with 40 GB of VRAM), which may represent common computational
resources in health care.[48]
[49] Furthermore, our decision is influenced by security, privacy, and compliance. Larger
and more resource-intensive LLMs require application programming interface access
via cloud services. At the time this study was conducted, our team did not have compliant service access to use such models (e.g., generative pretrained transformer [GPT], LLaMA [Large Language Model Meta AI]) with our dataset, which includes protected
health information and patient data.
Model Training
We use zero-shot (no fine-tuning) and fine-tuning approaches. For fine-tuning, each
model is fine-tuned using 10-fold cross-validation (90 training samples, 10 holdout
testing samples for each fold). The final evaluation is run over the concatenated
holdout testing samples from the 10 trials (representing all the data). Each model is trained for up to 30 epochs, with an early stopping patience of 3 epochs with no loss improvement, using the AdamW optimizer.[50] Multiple initial learning rates are tested (5 × 10⁻¹⁰, 1 × 10⁻⁶, 1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻²), and the best result is reported. Other hyperparameters include weight decay (0.01) and a batch size of 2 (PEGASUS and BART) or 5 (T5). For zero-shot, each model is run
without any fine-tuning. For training and prediction, each model is configured to
use a maximum input of 1,024 tokens and an output of up to 200 summary tokens. The input data (100 transcribed conversations) is summarized and compared with the PCTC nurse notes in each patient's medical record (structured as details of the complaint, background information, and consultation recommendations). All models were trained using the
Hugging Face library with a PyTorch backend.[51]
[52] We used Python 3.9 software to run the models and analysis.[53]
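A condensed sketch of this fine-tuning setup (up to 30 epochs, the Trainer's default AdamW optimizer, weight decay 0.01, early stopping after 3 epochs without loss improvement, and generation capped at 200 tokens) is given below. Dataset construction and the learning-rate sweep are simplified; names such as `train_ds`, `eval_ds`, and the output directory are placeholders rather than the study's actual code.

```python
# Hedged sketch of one cross-validation fold of fine-tuning with the Hugging Face Trainer API.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

def finetune_fold(train_ds, eval_ds, checkpoint="facebook/bart-large-cnn",
                  learning_rate=1e-5, batch_size=2):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    args = Seq2SeqTrainingArguments(
        output_dir="digital-scribe-fold",       # hypothetical output path
        num_train_epochs=30,
        learning_rate=learning_rate,
        weight_decay=0.01,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,             # required for early stopping
        metric_for_best_model="loss",
        predict_with_generate=True,
        generation_max_length=200,               # output up to 200 summary tokens
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    return trainer

# 10-fold cross-validation over the 100 tokenized conversation/summary pairs, e.g.:
# from sklearn.model_selection import KFold
# for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(range(100)):
#     finetune_fold(dataset.select(train_idx), dataset.select(test_idx))
```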
Evaluation
We follow a two-stage evaluation: quantitative evaluation and qualitative evaluation.
Quantitative Evaluation
We report the ROUGE-1, ROUGE-2, and ROUGE-L to compare model performance.[54] ROUGE scores are a standard set of metrics for quantitatively evaluating the similarity
of a generated text against a reference summary based on the number of common words
or word sequences. They measure the overlap of n-grams between the generated and reference texts, effectively assessing how closely automated summaries align with human-generated summaries. We compare the summaries generated by each model against
the nurse summary notes (ground truth).
For this task, we pulled nurse notes from the patient EMR intake form corresponding
to each ED referral conversation. We report ROUGE-1 (overlap scores for each word),
ROUGE-2 (overlap scores for each bigram), and ROUGE-L (the longest common subsequence
score). ROUGE scores range from a possible value of 0 (no overlapping terms) to 1
(all terms overlap). Although the expected ROUGE score depends on the task and metric, a score of 0.4 to 0.5 is usually considered good and acceptable for similar tasks.[54]
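The sketch below shows how these metrics can be computed with Google's rouge_score package; the two example strings are placeholders, not data from the study.

```python
# Hedged sketch: compute ROUGE-1, ROUGE-2, and ROUGE-L precision, recall, and F1.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

generated = "Two year old with fever and a positive flu test, transfer by private vehicle."
reference = "Two-year-old patient, febrile, flu positive; family will drive to the ED."

scores = scorer.score(reference, generated)
for metric, result in scores.items():
    # Each result carries precision, recall, and F1 (fmeasure), as reported in Tables 1 and 2.
    print(metric, round(result.precision, 2), round(result.recall, 2), round(result.fmeasure, 2))
```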
Qualitative Evaluation
We qualitatively evaluate and compare generated summaries against nurse notes to assess
the information included in the generated summary. We only evaluate the generated
summaries from the best-performing model based on the ROUGE scores. For this qualitative
assessment, we compare the amount and type of important information in the nurse notes
that is also included in the generated summary. We manually label the nurse notes
and generated summaries with the following eight tags: (1) Condition—symptoms, diagnosis,
medications related to the patient, (2) Behaviors—the patient's actions, (3) Measurements—any
numerical value measured, (4) Supplies—list of supplies that the patient has/needs,
(5) Date/Time—any mentioned relevant date or time, (6) Test—any tests given or not given to the patient, (7) Location—any locations mentioned, including where the patient should
be brought, (8) Transportation—method of transportation for the patient.
Evaluation Metrics
The common practice in evaluating generated summaries involves evaluators rating the generated summary against the ground truth summary via Likert-scale responses, usually for relevance and coherence.[55] However, that type of evaluation does not quantify how often or what type of information is omitted or incorrectly summarized. To gather that information,
we incorporate a novel two-tier annotation system to evaluate the quality of the generated
summaries. Firstly, we use Entity Linking (LINK) annotations to identify and connect specific pieces of clinical information found in the generated
summaries with their corresponding references in the nurse notes. These LINK annotations
serve to establish a direct correspondence between the generated text and the ground
truth provided by the nurse notes. Secondly, we assess the Information Accuracy (CORRECT) of these entity links (LINK). Information Accuracy is measured by evaluating whether
the linked information in the generated summary retains the same meaning as it does
in the nurse notes. For instance, if both the nurse and generated summaries report
a positive coronavirus disease test result for a patient, the LINK is labeled as CORRECT.
Conversely, if the generated summary erroneously reports a negative result, the LINK
is marked as not-CORRECT. This dual-annotation approach allows us to quantify not only the presence of key information in the generated summaries but also the accuracy with which it reflects the original nurse notes. The annotation
process is facilitated by the use of the MedTator text annotation tool.[37]
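The sketch below shows how the LINK and CORRECT counts can be turned into the recall and accuracy figures reported in the Results; the simple per-tag count structure is a hypothetical representation, not MedTator's native export format.

```python
# Hedged sketch: derive LINK recall, CORRECT recall, and CORRECT accuracy from annotation counts.
from dataclasses import dataclass

@dataclass
class TagCounts:
    total: int    # tags annotated in the nurse note (ground truth)
    linked: int   # tags also found in the generated summary (LINK)
    correct: int  # linked tags whose meaning is preserved (CORRECT)

def report(tag: str, c: TagCounts) -> None:
    link_recall = c.linked / c.total if c.total else 0.0
    correct_recall = c.correct / c.total if c.total else 0.0
    correct_accuracy = c.correct / c.linked if c.linked else 0.0
    print(f"{tag}: LINK recall={link_recall:.3f}, "
          f"CORRECT recall={correct_recall:.3f}, "
          f"CORRECT accuracy={correct_accuracy:.3f}")

# Example using the aggregate counts reported for all tags combined (Table 4).
report("All tags", TagCounts(total=867, linked=604, correct=570))
```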
Statistical Analysis
We performed a two-way analysis of variance (ANOVA) to evaluate the impact of different
summarization models (T5-small, T5-base, PEGASUS-PubMed, BART-Large-CNN) and ROUGE
metrics (precision, recall, F1-score) on performance scores. This between-groups analysis
assessed statistical significance using F-tests, with further post hoc comparisons
between models made using Tukey's Honestly Significant Difference (HSD) test. A significance
threshold of p < 0.05 was applied.
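This analysis can be reproduced along the lines sketched below with statsmodels; the long-format DataFrame layout (one row per model/metric/score observation) and the file name are assumptions about how the scores were organized, not the study's actual code.

```python
# Hedged sketch: two-way ANOVA over model and ROUGE metric, with Tukey HSD post hoc comparisons.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# scores_df columns: "model", "metric" (precision/recall/f1), "score"
scores_df = pd.read_csv("rouge_scores_long.csv")  # hypothetical file name

# Two-way ANOVA: main effects of model and metric on ROUGE score
# (add C(model):C(metric) to the formula to also test the interaction).
anova_fit = ols("score ~ C(model) + C(metric)", data=scores_df).fit()
print(sm.stats.anova_lm(anova_fit, typ=2))

# Post hoc pairwise comparisons between models at alpha = 0.05.
print(pairwise_tukeyhsd(scores_df["score"], scores_df["model"], alpha=0.05))
```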
Results
Quantitative Results
Fine-Tuned Results
Across ROUGE-1 scores, the BART-Large-CNN model displays the highest precision (0.42,
confidence interval [CI]: [0.34, 0.49]), recall (0.53, CI: [0.44, 0.62]), and F1-score
(0.49, CI: [0.38, 0.51]), indicating a strong ability to capture unigrams from the
source text ([Table 1]). The T5-base model follows closely, with a ROUGE-1 precision of 0.41 (CI: [0.30,
0.51]) and recall of 0.41 (CI: [0.32, 0.50]), but a slightly lower F1-score of 0.37
(CI: [0.30, 0.45]), suggesting comparable performance in identifying key unigrams.
The T5-small and PEGASUS-PubMed models show lower performance on these metrics, with
the PEGASUS-PubMed model exhibiting the lowest F1-score of 0.28 (CI: [0.22, 0.36]).
Similar to ROUGE-1 scores, BART-Large-CNN has the highest recall (ROUGE-2 = 0.28,
ROUGE-L = 0.43) and F1-scores (ROUGE-2 = 0.23, ROUGE-L = 0.35), whereas T5-base has
the highest precision scores (ROUGE-2 = 0.22, ROUGE-L = 0.34).
Table 1 ROUGE-1, 2, and L average precision, recall, and F1-scores for the fine-tuned models on clean transcripts

Model | Precision (CI) | Recall (CI) | F1-score (CI)

ROUGE-1 scores
T5-small | 0.34 (0.26, 0.43) | 0.40 (0.31, 0.50) | 0.35 (0.28, 0.42)
T5-base | 0.41 (0.30, 0.51) | 0.41 (0.32, 0.50) | 0.37 (0.30, 0.45)
PEGASUS-PubMed | 0.29 (0.21, 0.38) | 0.35 (0.26, 0.44) | 0.28 (0.22, 0.36)
BART-Large-CNN | 0.42 (0.34, 0.49) | 0.53 (0.44, 0.62) | 0.49 (0.38, 0.51)

ROUGE-2 scores
T5-small | 0.17 (0.13, 0.32) | 0.21 (0.15, 0.29) | 0.18 (0.13, 0.23)
T5-base | 0.22 (0.15, 0.30) | 0.22 (0.15, 0.30) | 0.20 (0.15, 0.26)
PEGASUS-PubMed | 0.11 (0.07, 0.16) | 0.14 (0.09, 0.20) | 0.11 (0.07, 0.16)
BART-Large-CNN | 0.21 (0.16, 0.27) | 0.28 (0.21, 0.36) | 0.23 (0.18, 0.29)

ROUGE-L scores
T5-small | 0.28 (0.22, 0.35) | 0.34 (0.25, 0.43) | 0.29 (0.23, 0.35)
T5-base | 0.34 (0.25, 0.44) | 0.34 (0.27, 0.44) | 0.32 (0.25, 0.39)
PEGASUS-PubMed | 0.22 (0.16, 0.30) | 0.27 (0.20, 0.30) | 0.22 (0.16, 0.29)
BART-Large-CNN | 0.33 (0.27, 0.41) | 0.43 (0.34, 0.52) | 0.35 (0.29, 0.42)

Abbreviation: CI, 95% confidence interval.
Zero-Shot Results
[Table 2] reports the performance of the zero-shot models. For ROUGE-1 scores, BART-Large-CNN
exhibits the highest precision (0.26, CI: [0.19, 0.34]) and recall (0.23, CI: [0.17,
0.30]), with a corresponding F1-score of 0.23 (CI: [0.17, 0.29]), suggesting a modest
capability to identify key unigrams without fine-tuning. The T5-base model also shows
relatively better performance compared with T5-small, with precision, recall, and
F1-score of 0.30 (CI: [0.22, 0.38]), 0.17 (CI: [0.15, 0.23]), and 0.20 (CI: [0.15,
0.26]), respectively. T5-small has lower scores, and PEGASUS-PubMed's performance
is notably minimal, with an F1-score of 0.07 (CI: [0.05, 0.10]). When examining ROUGE-2
scores, which evaluate bigram overlap, the models perform generally poorly, with BART-Large-CNN
leading at a lower precision of 0.08 (CI [0.04, 0.12]) and a corresponding F1-score
of 0.07 (CI [0.04, 0.10]). The T5 models report low scores, with T5-base obtaining
an F1-score of 0.06 (CI [0.03, 0.09]), marginally outperforming T5-small, which has
an F1 of 0.05 (CI [0.02, 0.09]). PEGASUS-PubMed has no bigram overlap in this scenario,
reflecting significant limitations in its zero-shot performance. Regarding the ROUGE-L
scores, BART-Large-CNN achieves the highest F1-score of 0.16 (CI [0.12, 0.21]), albeit
modest, indicating its relative advantage in capturing the longest common subsequences
in the zero-shot learning context. T5-base and T5-small achieve F1-scores of 0.15
(CI [0.11, 0.21]) and 0.13 (CI [0.08, 0.17]), respectively, followed by PEGASUS-PubMed
with an F1-score of 0.06 (CI [0.04, 0.07]).
Table 2 ROUGE-1, 2, and L average precision, recall, and F1-scores for the zero-shot models on clean transcripts

Model | Precision (CI) | Recall (CI) | F1-score (CI)

ROUGE-1 scores
T5-small | 0.24 (0.17, 0.32) | 0.15 (0.11, 0.22) | 0.17 (0.11, 0.24)
T5-base | 0.30 (0.22, 0.38) | 0.17 (0.15, 0.23) | 0.20 (0.15, 0.26)
PEGASUS-PubMed | 0.06 (0.04, 0.09) | 0.12 (0.05, 0.16) | 0.07 (0.05, 0.10)
BART-Large-CNN | 0.26 (0.19, 0.34) | 0.23 (0.17, 0.30) | 0.23 (0.17, 0.29)

ROUGE-2 scores
T5-small | 0.06 (0.02, 0.11) | 0.04 (0.01, 0.08) | 0.05 (0.02, 0.09)
T5-base | 0.08 (0.04, 0.12) | 0.05 (0.02, 0.08) | 0.06 (0.03, 0.09)
PEGASUS-PubMed | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.01)
BART-Large-CNN | 0.08 (0.04, 0.12) | 0.07 (0.03, 0.11) | 0.07 (0.04, 0.10)

ROUGE-L scores
T5-small | 0.18 (0.12, 0.23) | 0.11 (0.07, 0.16) | 0.13 (0.08, 0.17)
T5-base | 0.21 (0.16, 0.26) | 0.12 (0.08, 0.21) | 0.15 (0.11, 0.21)
PEGASUS-PubMed | 0.05 (0.03, 0.06) | 0.09 (0.06, 0.12) | 0.06 (0.04, 0.07)
BART-Large-CNN | 0.18 (0.13, 0.24) | 0.16 (0.11, 0.22) | 0.16 (0.12, 0.21)

Abbreviation: CI, 95% confidence interval.
Qualitative Results
We label each of the 100 ground truth summaries and the summaries generated by the
BART-Large-CNN model (fine-tuned on the 90 non-holdout samples for that cross-validation
fold) using eight tag categories: Conditions, Behaviors, Measurements, Supplies, Date/Time,
Tests, Locations, and Transportation.
[Table 3] presents the average recall for manually annotated information tags in summaries
of the fine-tuned BART-Large-CNN. All summaries contain at least one of the specified
tags, with an average of 8.67 tags per summary. When examining the average LINK recall,
the model performs consistently, with a mean recall of 0.71 (standard deviation [SD] = 0.23),
indicating that over 70% of the information present in the ground truth summaries
is also found in the generated summaries. The average CORRECT recall is marginally
lower at 0.67 (SD = 0.23), suggesting that while the model is proficient at identifying
relevant information, there is a slight decrease in accuracy when considering the
correctness of the information. [Fig. 3] illustrates the recall characteristics of the fine-tuned BART-Large-CNN model.
Fig. 3 Histograms showing information recalled (without consideration of correctness) [right]
and correctly recalled information [left] by a generated summary that appeared in
the ground truth summary.
Table 3 Average recall for tags and annotations in generated summaries by the fine-tuned BART-Large-CNN model

Tags | % Summary with at least 1 tag | Average tags per summary (SD) | Average LINK recall per summary (SD) | Average CORRECT recall per summary (SD)
All tags | 100% (100/100) | 8.670 (4.800) | 0.714 (0.231) | 0.677 (0.228)
Condition | 99% (99/100) | 4.848 (2.776) | 0.744 (0.268) | 0.731 (0.274)
Behaviors | 29% (29/100) | 1.483 (1.038) | 0.772 (0.380) | 0.772 (0.380)
Measurements | 47% (47/100) | 2.298 (1.687) | 0.736 (0.409) | 0.644 (0.425)
Supplies | 7% (7/100) | 1.143 (0.350) | 0.571 (0.495) | 0.571 (0.495)
Date/Time | 46% (46/100) | 1.304 (0.655) | 0.741 (0.409) | 0.730 (0.409)
Test | 35% (35/100) | 2.343 (1.453) | 0.673 (0.409) | 0.564 (0.423)
Location | 42% (42/100) | 1.071 (0.258) | 0.762 (0.426) | 0.667 (0.471)
Transportation | 41% (41/100) | 1.000 (0.000) | 0.439 (0.496) | 0.439 (0.496)

Abbreviation: SD, standard deviation.
The “Condition” tag appears in 99% (99/100) of the summaries, and it has a high CORRECT
recall at 0.73 (SD = 0.27), which indicates a high degree of precision in reporting
patient conditions, symptoms, and diagnoses. However, tags such as “Transportation”
are present in only 41% (41/100) of the summaries, with the lowest average LINK and
CORRECT recall scores of 0.44 (SD = 0.5). “Behaviors” and “Supplies” tags appear less
frequently at 29% (29/100) and 7% (7/100), respectively, yet show relatively high
CORRECT recall. [Fig. 4] shows an example note sample outlining CORRECT and LINK annotations and tags.
Fig. 4 Example generated and nurse note samples with LINK and CORRECT annotations.
For all summaries combined, the model demonstrates a LINK recall of 69.7% (604/867), that is, instances where tagged information in the ground truth also appears in the generated summaries ([Table 4]). The CORRECT recall, which indicates the instances where the tagged information
from the ground truth summary appears accurately in the generated summary, is slightly
lower at 65.7% (570/867). However, of the information that is LINKed correctly, the
CORRECT accuracy is high at 94.4% (570/604), indicating that when the model does capture
relevant information, it tends to be accurate. “Conditions” shows the highest LINK
recall at 72.1% (346/480), and an almost equivalent CORRECT recall at 70.8% (340/480).
The CORRECT accuracy for “Conditions” is at 98.3% (340/346), indicating that nearly
all the condition-related information captured by the model is accurate. The “Behaviors”
and “Supplies” tags have the fewest instances but achieve CORRECT recalls of 74.4% (32/43) and 62.5% (5/8), respectively, with both categories achieving a CORRECT accuracy
of 100%. Conversely, “Test” and “Transportation” tags display lower performance on
LINK and CORRECT recall.
Table 4 Information tag appearance and correctness in the summaries generated by the fine-tuned BART-Large-CNN model

Tag | Total tags | LINK recall across all summaries | CORRECT recall across all summaries | CORRECT accuracy across LINKed tags
All tags | 867 | 69.7% (604/867) | 65.7% (570/867) | 94.4% (570/604)
Condition | 480 | 72.1% (346/480) | 70.8% (340/480) | 98.3% (340/346)
Behaviors | 43 | 74.4% (32/43) | 74.4% (32/43) | 100.0% (32/32)
Measurements | 108 | 68.5% (74/108) | 57.4% (62/108) | 83.8% (62/74)
Supplies | 8 | 62.5% (5/8) | 62.5% (5/8) | 100.0% (5/5)
Date/Time | 60 | 75.0% (45/60) | 73.3% (44/60) | 97.8% (44/45)
Test | 82 | 62.2% (51/82) | 48.8% (40/82) | 78.4% (40/51)
Location | 45 | 73.3% (33/45) | 64.4% (29/45) | 87.9% (29/33)
Transportation | 41 | 43.9% (18/41) | 43.9% (18/41) | 100.0% (18/18)
Model and Metric Differences
Two-way ANOVA analysis showed that the differences in ROUGE scores between the models
are statistically significant (p < 0.001). The post hoc (Tukey's HSD) analysis results showed BART-Large-CNN generally
performs better in ROUGE metrics compared with PEGASUS-PubMed, whereas T5-small outperforms
both T5-base and PEGASUS-PubMed. Different ROUGE metrics affect the scores significantly
(p < 0.05), suggesting that the choice of model and the type of ROUGE metric both have
statistically meaningful impacts on the reported scores. (See [Supplementary Appendix 2] for the analysis results, available in online version only).
Transcription Differences
We compare the difference in performance between the original (AWS) transcripts and
the clean transcripts. BART-Large-CNN's ROUGE-1 improves by 0.06 (F1-score) when using
the clean transcripts. However, T5-base and PEGASUS-PubMed both have lower F1-scores
when using the clean transcripts. This difference is largely negligible for ROUGE-2 and ROUGE-L scores, where F1-scores differ by less than 0.02. Please see
[Supplementary Appendix 3] for ROUGE scores of AWS transcripts (available in online version only).
Discussion
Our fine-tuned text summarization models report promising results compared with similar
applications and tasks.[23] The BART-Large-CNN model shows a greater ability to comprehend and replicate the
structure and flow of clinical dialogue in medical conversations with both fine-tuned and zero-shot approaches. This is similar to the performance of high-performing models
on the nonmedical CNN/DailyMail dataset.[27] However, we need to note that BART-Large-CNN's performance may be influenced by
architectural advantages in dialog summarization task and potentially better alignment
with the characteristics of the test data and metrics.[37] In addition, the significant variation in ROUGE scores based on the model and the
metric used suggests differences in model utility in specific contexts of text summarization
tasks and underlines the importance of model selection based on the specific characteristics
and performance metrics per application domain.
Regarding the literature on digital scribes and clinical dialog summarization, reported ROUGE-L scores range from 0.42 to 0.55,[21] with recent research demonstrating a 0.43 ROUGE-L score using GPT-4 with an in-context learning approach.[56] Our work contributes to this literature by achieving similar scores with BART-Large-CNN and extends it by evaluating ED consult and referral conversations.
However, the variance in recall in our study shows inconsistency in performance: a subset of notes is replicated with high accuracy, yet broader variability indicates room for refinement, especially in achieving consistent correctness. The
differential performance across various information categories illuminates the necessity
for enhancing model recognition capabilities.[57] As the accuracy rates across most tags are promising, they also highlight the disparity
in the model's ability to uniformly identify and convey the full spectrum of clinically
relevant information present in the reference summaries.[58] In a zero-shot context, each model performed worse than its fine-tuned counterpart. BART-Large-CNN and T5 perform better, as these models tend to reproduce some lines of the transcript as the summary. PEGASUS-PubMed, by comparison, produces output resembling its original training data, which is only somewhat related to the text in the transcript. These results reinforce the idea that competent zero-shot performance might be achievable with larger model sizes as well as with different architectures and datasets.[59] Furthermore, the variability in model performance in our study, particularly in
the context of recall, denotes a significant opportunity for advancing the model's
performance with hybrid models[60]
[61] or approaches (e.g., user interface design, human-in-the-loop),[62]
[63] thereby augmenting its utility in real-world clinical documentation.
Transcription Quality
The transcription quality notably impacts the model performance, as evidenced by the
improvement in BART-Large-CNN's ROUGE-1 scores when utilizing clean transcripts. This
improvement underscores the importance of high-quality input data for the efficacy
of AI-driven clinical documentation.[64] Interestingly, T5-base and PEGASUS-PubMed models register a lower F1-score with
clean transcripts, an anomaly that suggests a complex interaction between model architecture
and data quality. This observation requires a closer examination of the preprocessing
steps and the models' resilience to variations in data quality. In the high-stress,
fast-paced ED environment, where documentation accuracy is important, these findings
highlight the necessity for robust digital scribe systems capable of handling the
inherent variability in clinical speech and text data. The minor differences in ROUGE-2
and ROUGE-L scores with different transcript quality suggest that for capturing the
broader context and relationships within the text, the models are less sensitive to
transcription errors. This resilience is critical for the practical deployment of
digital scribes, where they must perform reliably across varying conditions of data
quality.[65]
The Nature of Conversations
In our observation of audio conversations, we note a common pattern involving additional
clinicians or health care workers, often leading to multiparticipant calls and extended
discussions. The conversation starts with caller information and patient information
exchange, followed by patient health information shared later in the conversation.
Waiting times with hold tones are frequent. A notable discrepancy between audio summaries
and intake notes is that, especially when nurses follow up for additional details,
these details are not always included in the initial transcription. Another observation
is the variation in note style and content, depending on the nurse taking the notes,
indicating differences in documentation approaches among nurses. This adds an extra layer of complexity to the task of accurate digital scribing. Additionally, external
factors like background noise and coughing during conversations pose potential challenges
for automated transcription accuracy.[1] The intake notes sometimes include details from internal consultations that are
not present in the original audio, pointing to a possible mismatch in the documentation.
These insights underscore the multifaceted nature of clinical communication and the
challenges it presents for effective digital documentation.[66]
Implications
The implications of our study extend into several key areas of health care informatics
and policy. Firstly, the use of the BART-Large-CNN model in clinical documentation
points toward a potential to reduce the documentation burden on HCPs, aligning with
the broader goal of mitigating burnout.[1]
[13] The high accuracy in key information categories like “Conditions” indicates that AI-assisted tools can effectively complement HCPs' documentation tasks. However, in their current forms, the models are insufficient for fully automatic summarization in a clinical setting. While the summaries produced are generally coherent, existing summarization errors would be detrimental if the system were used in practice as a replacement for a human scribe or clinician. Using these models as assistants instead would be a more useful trajectory. Yet, the successful implementation and integration of such
tools hinges on their design elements, human factors, and usability.[67]
[68] The variability in model performance underscores the need for a user-centered design
approach and a systems thinking approach to overcome technical challenges.[69]
[70] This involves tailoring these tools to fit into clinical workflows, ensuring they
are intuitive and capable of handling the dynamic nature of clinical environments.[71]
In line with recommendations by the Surgeon General and the AMIA 25 × 5 Task Force,
the findings inform the development and application of health information technology that supports HCPs, suggesting that policies could encourage the exploration and adoption of AI tools
like digital scribes in clinical settings.[16] This could be achieved through incentives for technology adoption, support for implementation
research and technical development, and the development of evidence-based guidelines
to ensure ethical and secure use of AI in health care.[72] However, collaboration between HCPs and AI is key to success in improving the
accuracy, consistency, and completeness of medical documentation while minimizing
documentation errors.[63]
[73] It is also important to develop operationalization and implementation plans with
accountable, fair, and inclusive AI approaches to ensure the trustworthiness of the
digital scribes.[74]
[75]
Limitations
The limitations of our study are multifaceted, reflecting both methodological constraints
and broader challenges in the field. Firstly, the absence of standardized and validated
measures for assessing documentation burden presents a significant challenge.[76] Therefore, we depend on our quantitative and qualitative approaches to assess quality, assuming that higher-quality summarization will contribute to reducing documentation burden. Our scoring does not account for differences in notes, note-takers (nurses),
and conversations. ROUGE metrics are coherence-insensitive, focusing solely on word
overlap without considering the coherence and logical flow of the summaries, which
introduces a limitation for quantitative analysis.[77] Our qualitative evaluation focused on a two-tier assessment, which might limit the perspectives captured. We did not include additional information, such as patient history or
treatment plans, in summarization to reduce complexity and token size, and to focus
on task-specific performance. The study lacks qualitative feedback from nurses and
clinicians to further assess the perceived value and utility of generated summaries.
The reliance on F1-scores with corrected transcripts limits the evaluation of utility and does not fully capture the semantic completeness crucial for clinical relevance.
Additionally, using clean transcripts as model input might not reflect the complexity
of raw clinical conversations. These limitations are compounded by the small and less diverse dataset, single-annotator bias, lack of real-world testing, and the limited
scope of the dataset for ED referrals (not representing all ED clinical conversations
and related documentation), all of which contribute to potential constraints on the
generalizability and applicability of our findings. To address these limitations,
in the next stages of this study, we plan to include: (1) collaboratively developing
a highly specific guideline and compiling evaluation metrics for summaries to improve
assessment of the automated summaries, (2) auditing existing summaries against the guidelines and applying scoring to inform the model of the quality of the summaries on which it is being trained, and (3) working with nurses and clinicians to gain user feedback on this system.
Future Work
In future works, we aim to expand the scope and applicability of our research. A primary
focus will be on testing a cloud-based transcription and digital scribe pipeline using
additional language models with larger and more diverse datasets and longer conversations.
This initiative will be geared toward developing a deployable pipeline, with a specific
scenario involving a call service connection and providing immediate feedback through
a web application to nurses. Other important areas of exploration will be hybrid models[41] combining statistical, machine learning, and computational linguistics techniques,
further experimenting with zero- and few-shot learning, and conducting a comparative
study to measure effort, task completion time, user variations and human factors,
cognitive load, burden and stress, and other relevant units of analysis toward clinical
utilization.[76]
Conclusion
Our study introduces the development and testing of a digital scribe pipeline, contributing
to the field of automated clinical documentation and efficient documentation flow.
By utilizing a real-world dataset, our research addresses a critical gap in the literature,
particularly in the areas of workflow optimization and clinical and nurse informatics
applications.[1] The practical implications of our findings include potential time and resource savings for health care systems and a reduced documentation burden among nurses and clinicians, thereby enhancing overall health care delivery efficiency and quality.
Clinical Relevance Statement
The study demonstrates the performance and potential of a digital scribe system to
effectively summarize clinical conversations in emergency departments, offering a tool that could lead to more efficient clinical documentation, improved accuracy in medical records, and a reduced clinician documentation burden.
Multiple-Choice Questions
-
Which aspect of health care professionals' workload does the digital scribe system
primarily aim to address in the emergency department setting?
-
Patient diagnosis
-
Appointment scheduling
-
Documentation burden
-
Treatment planning
Correct Answer: The correct answer is option c. The study focuses on developing a digital scribe
system to summarize clinical conversations in emergency departments. The primary goal
is to alleviate the documentation burden on health care professionals, a significant
factor contributing to clinician burnout and inefficiencies in patient care.
-
In the study, which model exhibited superior performance in the task of clinical conversation
summarization, as evidenced by the highest ROUGE scores?
-
T5-small
-
T5-base
-
PEGASUS-PubMed
-
BART-Large-CNN
Correct Answer: The correct answer is option d. Among the models tested (T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN), the BART-Large-CNN model demonstrated the highest performance
in summarization tasks. This was evidenced by its superior ROUGE scores, indicating
its effectiveness in capturing key information from clinical conversations.
-
What factor is important in enhancing the accuracy and utility of AI-assisted clinical
summarization tools?
-
Increasing the speed of data processing
-
Expanding the variety of medical conditions covered
-
Improving the quality of input data (transcriptions)
-
Reducing the cost of technology implementation
Correct Answer: The correct answer is option c. The study highlights the significant impact of input
data quality on the performance of AI-assisted clinical documentation tools. High-quality
transcriptions are crucial for these tools to accurately summarize and capture essential
information from clinical conversations. The improvement in the BART-Large-CNN model's
performance with clean transcripts compared with AWS transcripts underscores this
point.