DOI: 10.1055/a-2327-4121
Evaluation of a Digital Scribe: Conversation Summarization for Emergency Department Consultation Calls
Funding: The project described was supported by Award Number UM1TR004548 from the National Center for Advancing Translational Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Center for Advancing Translational Sciences or the National Institutes of Health.
- Abstract
- Background and Significance
- Background
- Methods
- Results
- Discussion
- Conclusion
- Clinical Relevance Statement
- Multiple-Choice Questions
- References
Abstract
Objectives We present a proof-of-concept digital scribe system, a clinical conversation summarization pipeline for emergency department (ED) consultation calls, intended to support clinical documentation, and we report its performance.
Methods We use four pretrained large language models (T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN) via zero-shot and fine-tuning approaches to establish the digital scribe system. Our dataset includes 100 referral conversations among ED clinicians along with the corresponding medical records. We report ROUGE-1, ROUGE-2, and ROUGE-L scores to compare model performance. In addition, we annotated transcriptions to assess the quality of generated summaries.
Results The fine-tuned BART-Large-CNN model demonstrates the strongest summarization performance, with the highest ROUGE scores (ROUGE-1 F1 = 0.49, ROUGE-2 F1 = 0.23, ROUGE-L F1 = 0.35). In contrast, PEGASUS-PubMed lags notably (ROUGE-1 F1 = 0.28, ROUGE-2 F1 = 0.11, ROUGE-L F1 = 0.22). BART-Large-CNN's performance decreases by more than 50% with the zero-shot approach. Annotations show that BART-Large-CNN achieves 71.4% recall in identifying key information and a 67.7% accuracy rate.
Conclusion The BART-Large-CNN model demonstrates a high level of understanding of clinical dialogue structure, indicated by its performance with and without fine-tuning. Despite some instances of high recall, the model's performance varies, particularly in achieving consistent correctness, suggesting room for refinement. The model's recall also varies across information categories. The study provides evidence of the potential of artificial intelligence-assisted tools to support clinical documentation. Future work should expand the research scope with additional language models and hybrid approaches and include comparative analyses to measure documentation burden and human factors.
Keywords
text summarization - emergency department - clinical conversation - pretrained language model - documentation burden
Background and Significance
Health care professionals (HCPs), including clinicians, nurses, therapists, and other practitioners, dedicate a considerable amount of their working hours to charting and maintaining clinical documentation.[1] [2] [3] This labor-intensive process has been linked to burnout among these providers, manifesting as emotional exhaustion, decreased focus, and heightened cognitive burden.[1] [2] This issue is particularly prevalent within emergency departments (EDs),[4] [5] where crowding impacts the documentation process: the high volume of patients waiting to be seen and low throughput due to limited space, resources, staff, and inefficient flow further contribute to delays in treating patients.[6] [7] In addition, Meaningful Use requirements, the Affordable Care Act reimbursement models, and a highly regulated environment significantly impact clinical documentation workflow and communication in routine care.[8] [9] The literature reports that clinicians spend more time on electronic documentation and administrative tasks than on providing direct patient care,[3] [10] with some clinicians needing to allocate over half of their working hours to charting.[11] [12] In some cases, insufficient time for documentation leads to burnout.[12] [13] Left unaddressed, these issues may lead to unintended choices and consequences, such as noncompliant documentation practices, increased errors and duplicates, and reduced documentation quality.[14]
In 2022, the Surgeon General issued an advisory on clinician burnout, which includes several recommendations to address the burden on HCPs in the United States.[15] Some of the recommendations emphasize “designing technology to serve the needs of health workers, care teams, and patients across the continuum of care” and “improving our understanding of how to develop and apply health information technology that more effectively supports health workers in the delivery of care.”[15] In line with that, the American Medical Informatics Association (AMIA) 25 × 5 Task Force issued a call for action to implement personalized clinical decision support to improve user-specific workflows and support care recommendations[16] as well as emphasized artificial intelligence (AI) as part of current and emerging applications to reduce documentation burden in the long term.[17]
Indeed, clinical documentation could be an AI-assisted process, interactively assisting HCPs and easing the burden.[18] [19] [20] A digital scribe is an "automated clinical documentation system" that captures HCP conversations with patients and/or other providers and creates clinical documentation similar to a human medical scribe.[1] [21] [22] Several emerging natural language processing and deep learning models have been used for automated text summarization (ATS) and conversation summarization in the literature.[23] Even though emerging commercial tools show early evidence of adoption of digital scribes in clinical practice,[24] research on digital scribing in medical informatics and health services has been limited due to technical and algorithmic challenges in development and limited dataset availability.[1] [25] Another implementation challenge is the nonlinear and redundant nature of conversations, with studies indicating that 80% of captured dialogue is superfluous for effective note-taking.[21] [26] To address these gaps in effective summarization in real-world settings, context-aware models (such as pretrained language models) could be utilized. Therefore, in this study, we present and evaluate a proof-of-concept digital scribe system (as an ATS pipeline) for clinical conversations using novel pretrained language models with a real-world dataset of ED consultation sessions. Our goal is to investigate the feasibility, accuracy, and impact of implementing an automated digital scribe in a clinical setting, providing preliminary evidence with a proof-of-concept system in the case of ED consultation. Our long-term vision is that digital scribe tools will improve documentation efficiency, reduce the burden, and enhance patient care and outcomes.
Background
ATS is the foundation of the digital scribe; it aims to automatically generate a concise and clear summary of a text, highlighting the key information for the intended audience.[27] [28] [29] ATS can be broadly categorized into two approaches: extractive summarization and abstractive summarization (ABS). Extractive summarization selects and combines important sentences and fragments from the original text to form a summary.[30] [31] ABS generates new summaries that incorporate the essential elements of the original text, potentially including key phrases.[30] [32] In this study, we used the ABS approach to reflect a more realistic and human-like approach to summarization, both identifying the important aspects of the original text and producing relevant, new natural language summaries.[33]
Deep learning has been the predominant method for state-of-the-art ABS.[27] [34] With the recent development of transformer network models and larger generalized language models,[35] [36] fine-tuning and/or modifying pretrained transformer-based models have become the leading techniques for ABS on public datasets.[27] Specialized transformer models have been developed for ABS, such as the PEGASUS family of pretrained models,[27] BART[37] and its modifications,[38] and the T5 family.[39] [40] Each of these models uses a sequence-to-sequence architecture, which combines an encoder block and a decoder block into one architecture. While the models share this architecture, they differ in their pretraining data and tasks (see section "Model Selection"). ABS in the biomedical field has mostly focused on online biomedical texts rather than clinical applications. Overall, ATS has been understudied with medical records: only 11 of the 58 reviewed studies (19%) used electronic medical record (EMR) information as input.[41] However, a recent survey on dialog summarization found that pretrained language model-based methods achieved the highest scores in summarizing public datasets of meeting conversations and chat logs.[42]
Methods
Study Setting and Data Collection
In the scope of our study, we use a dataset (phone conversations) available at the Nationwide Children's Hospital (NCH) Physician Consult and Transfer Center (PCTC).[43] PCTC is a call service that receives calls from health care providers across the United States to consult, admit, transfer, or refer patients. A nurse team responds to the calls from physicians, registers their calls, connects them to physicians at the NCH, and takes a summary note of the conversation into the corresponding patient record (Epic EMR system).[44] ED patient transfer calls constitute a large portion of the daily PCTC calls (∼200 calls/day). Our proposed digital scribe system runs in a secure institutional network and uses the conversational data (audio files) stored on the NCH servers. The study used identifiable information to reflect a realistic use case and to mitigate any issues that may be caused by anonymized data in summarization tasks. This study is approved by the Institutional Ethical Board at the NCH (study ID: 00002897).
In this study, 100 phone call recordings from 100 unique callers (physicians) for ED referrals at the NCH PCTC are used (∼412 total minutes). The calls are randomly selected from the local server (between November and December 2022). Each call consists of a multiturn conversation (ranging from 1 to 9 minutes) among PCTC nurses, an ED clinician or staff member, and an external clinician or nurse. [Fig. 1] outlines the clinical flow and study design.


Audio Transcription
To convert the audio recordings into text, we follow a two-step approach. First, we use a speech-to-text service via Amazon Web Services (AWS Transcribe),[45] and then an annotator reviews the original recordings and corrects any errors in the transcript to generate clean transcripts. Dialog between speakers is differentiated with a speaker label (e.g., "Speaker 1: Hello."). The models have a maximum input size of 1,024 tokens. Of the 100 transcripts, 82 have fewer than 1,024 tokens, and the longest transcript is 1,987 tokens ([Fig. 2]). Longer transcripts were truncated to include only the first 1,024 tokens.
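For illustration only, the snippet below sketches how a call recording could be submitted to the AWS Transcribe service with speaker labels enabled; the bucket path, job name, region, media format, and speaker count are hypothetical placeholders and not the study's actual configuration.

```python
# Hedged sketch of the speech-to-text step using the AWS Transcribe API via boto3.
# All identifiers (bucket, job name, region) are placeholders for demonstration.
import time

import boto3

transcribe = boto3.client("transcribe", region_name="us-east-2")

transcribe.start_transcription_job(
    TranscriptionJobName="pctc-call-0001",  # hypothetical job name
    Media={"MediaFileUri": "s3://example-bucket/calls/call-0001.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 3},  # differentiate speakers
)

# Poll until the job finishes, then fetch the transcript location.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="pctc-call-0001")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(15)

print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```

The raw output would still require the manual review step described above to produce the clean transcripts used for modeling.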


Dataset Creation
After audio transcription, we organize the transcription documents as text input to the models. For reference text, we use the nurse summary notes from the medical records that accompany the transcriptions (as described in Section "Study Setting and Data Collection"). Nurse summaries in the medical records, as-is, serve as the high-quality, ground truth summaries in this study. Therefore, we urge readers not to consider the reference texts as traditional "gold standard" data but instead as a representation of high-quality reference text, which informs our evaluation methods (see Section "Evaluation").
Model Selection
We employ four pretrained large language models (LLMs; T5-small,[39] T5-base,[39] PEGASUS-PubMed,[47] and BART-Large-CNN[46]) for the task of summarizing clinical conversation transcriptions, based on their unique strengths and adaptability to the health care domain. Our two T5 models use the original T5 seq2seq architecture,[39] trained as a small model (60 million parameters) and a base model (220 million parameters). The T5 models were trained on a large corpus of English text and perform well in tasks like summarization, question answering, and translation. PEGASUS-PubMed (568 million parameters) comes from the class of PEGASUS models[47] developed for ABS. The inclusion of PEGASUS-PubMed in our selection is driven by its specialization in the biomedical field (pretrained on biomedical literature via the PubMed repository).[47] BART-Large-CNN (406 million parameters) is a BART model fine-tuned on the CNN/Daily Mail dataset for summarization. BART-Large-CNN is chosen for its demonstrated effectiveness in producing coherent and contextually accurate summaries.[46] Further comparison of the models is available in the [Supplementary Appendix 1] (available in the online version).
Our choice of these models is influenced by their combined efficiency, domain-specific accuracy, and ability to produce coherent, reliable summaries, which are critical in the fast-paced and precision-oriented context of health care. In addition, these models offer a practical solution, enabling us to process conversation transcriptions quickly without overextending our hardware capabilities (all models were run on a single NVIDIA A100 GPU with 40 GB of VRAM), which may represent common computational resources in health care.[48] [49] Furthermore, our decision is influenced by security, privacy, and compliance. Larger and more resource-intensive LLMs require application programming interface access via cloud services. At the time this study was conducted, our team did not have compliant service access to use such models (e.g., generative pretrained transformer [GPT], LLaMA [Large Language Model Meta AI]) with our dataset, which includes protected health information and patient data.
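As a minimal sketch (not the authors' code), the four publicly available checkpoints can be loaded from the Hugging Face Hub and applied zero-shot as shown below; the sample transcript text is hypothetical.

```python
# Loading the public checkpoints and producing a zero-shot summary with one of them.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINTS = {
    "T5-small": "t5-small",
    "T5-base": "t5-base",
    "PEGASUS-PubMed": "google/pegasus-pubmed",
    "BART-Large-CNN": "facebook/bart-large-cnn",
}

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS["BART-Large-CNN"])
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINTS["BART-Large-CNN"])

# Hypothetical example text; the real inputs are the cleaned, speaker-labeled call transcripts.
transcript = "Speaker 1: Hi, this is the referring physician calling about an ED transfer. Speaker 2: ..."

inputs = tokenizer(transcript, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=200, num_beams=4)  # 200-token output cap
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```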
Model Training
We use zero-shot (no fine-tuning) and fine-tuning approaches. For fine-tuning, each model is fine-tuned using 10-fold cross-validation (90 training samples and 10 holdout testing samples per fold). The final evaluation is run over the concatenated holdout testing samples from the 10 folds (representing all the data). Each model is trained for up to 30 epochs, with an early stopping patience of 3 epochs without loss improvement, using the AdamW optimizer.[50] Multiple initial learning rates are tested (5 × 10−10, 1 × 10−6, 1 × 10−5, 1 × 10−4, 1 × 10−3, 1 × 10−2), and the best result is reported. Other hyperparameters include a weight decay of 0.01 and a batch size of 2 (PEGASUS and BART) or 5 (T5). For zero-shot, each model is run without any fine-tuning. For training and prediction, each model is configured to use a maximum of 1,024 input tokens and output up to 200 summary tokens. The input data (100 transcribed conversations) are summarized and compared with the PCTC nurse notes in each patient's medical record (structured as details of the complaint, background information, and consultation recommendations). All models were trained using the Hugging Face library with a PyTorch backend.[51] [52] We used Python 3.9 to run the models and analysis.[53]
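A minimal fine-tuning sketch consistent with this setup (10-fold cross-validation, early stopping, 1,024 input tokens, 200 summary tokens) follows; the in-memory lists `transcripts` and `nurse_notes` and the chosen learning rate are assumptions for illustration, and the study's exact training code may differ.

```python
# Hedged sketch of the cross-validated fine-tuning loop with Hugging Face Trainer.
from sklearn.model_selection import KFold
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback)

CKPT = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(CKPT)

def preprocess(batch):
    # Truncate transcripts to 1,024 input tokens and nurse notes to 200 label tokens.
    inputs = tokenizer(batch["transcript"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=200, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

data = Dataset.from_dict({"transcript": transcripts, "summary": nurse_notes})  # 100 pairs (hypothetical)
data = data.map(preprocess, batched=True)

for fold, (train_idx, test_idx) in enumerate(
        KFold(n_splits=10, shuffle=True, random_state=0).split(list(range(len(data))))):
    model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)
    args = Seq2SeqTrainingArguments(
        output_dir=f"fold_{fold}",
        num_train_epochs=30,
        learning_rate=1e-5,             # one of the swept initial learning rates
        weight_decay=0.01,
        per_device_train_batch_size=2,  # 2 for PEGASUS/BART, 5 for T5
        evaluation_strategy="epoch",    # renamed eval_strategy in newer transformers releases
        save_strategy="epoch",
        load_best_model_at_end=True,    # required by EarlyStoppingCallback
        metric_for_best_model="loss",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=data.select(train_idx),
        eval_dataset=data.select(test_idx),
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()  # AdamW is the Trainer's default optimizer
```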
Evaluation
We follow a two-stage evaluation: quantitative evaluation and qualitative evaluation.
Quantitative Evaluation
We report the ROUGE-1, ROUGE-2, and ROUGE-L metrics to compare model performance.[54] ROUGE scores are a standard set of metrics for quantitatively evaluating the similarity of a generated text against a reference summary based on the number of common words or word sequences. They measure the overlap of n-grams between the generated and reference texts, effectively assessing the coherence of automated summaries compared with human-generated summaries. We compare the summaries generated by each model against the nurse summary notes (ground truth).
For this task, we pulled nurse notes from the patient EMR intake form corresponding to each ED referral conversation. We report ROUGE-1 (overlap scores for each word), ROUGE-2 (overlap scores for each bigram), and ROUGE-L (the longest common subsequence score). ROUGE scores range from 0 (no overlapping terms) to 1 (all terms overlap). Although the expected ROUGE score depends on the task and metric, values of 0.4 to 0.5 are usually considered good and acceptable for similar tasks.[54]
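For illustration, the rouge_score package (one possible implementation; the paper does not state which was used) reports precision, recall, and F1 for each ROUGE variant; both text snippets below are hypothetical.

```python
# Hedged example of scoring one generated summary against its nurse note reference.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Caller reports fever and abdominal pain; ED transfer accepted, arrival by private vehicle."
generated = "Patient with fever and abdominal pain accepted for ED transfer."

scores = scorer.score(reference, generated)  # score(target, prediction)
for metric, result in scores.items():
    print(metric, round(result.precision, 2), round(result.recall, 2), round(result.fmeasure, 2))
```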
Qualitative Evaluation
We qualitatively evaluate and compare generated summaries against nurse notes to assess the information included in the generated summary. We only evaluate the generated summaries from the best-performing model based on the ROUGE scores. For this qualitative assessment, we compare the amount and type of important information in the nurse notes that is also included in the generated summary. We manually label the nurse notes and generated summaries with the following eight tags: (1) Condition—symptoms, diagnosis, medications related to the patient, (2) Behaviors—the patient's actions, (3) Measurements—any numerical value measured, (4) Supplies—list of supplies that the patient has/needs, (5) Date/Time—any mentioned relevant date or time, (6) Test—any tests given or not to the patient, (7) Location—any locations mentioned including where the patient should be brought, (8) Transportation—method of transportation for the patient.
Evaluation Metrics
The common practice in evaluating generated summaries involves evaluators rating the generated summary against the ground truth summary via Likert scale responses, usually for relevance and coherence.[55] However, that type of evaluation does not quantify how often or what type of information is omitted or incorrectly summarized. To gather that information, we incorporate a novel two-tier annotation system to evaluate the quality of the generated summaries. First, we use Entity Linking (LINK) annotations to identify and connect specific pieces of clinical information found in the generated summaries with their corresponding references in the nurse notes. These LINK annotations establish a direct correspondence between the generated text and the ground truth provided by the nurse notes. Second, we assess the Information Accuracy (CORRECT) of these entity links (LINK). Information Accuracy is measured by evaluating whether the linked information in the generated summary retains the same meaning as it does in the nurse notes. For instance, if both the nurse and generated summaries report a positive coronavirus disease test result for a patient, the LINK is labeled as CORRECT. Conversely, if the generated summary erroneously reports a negative result, the LINK is marked as not-CORRECT. This dual-annotation approach allows us to quantify not only the presence of key information in the generated summaries but also the accuracy with which it reflects the original nurse notes. The annotation process is facilitated by the MedTator text annotation tool.[37]
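A simple sketch of how LINK and CORRECT annotations can be aggregated into the recall and accuracy figures reported in the Results follows; the tag records shown are hypothetical, not actual study annotations.

```python
# Hedged sketch of the two-tier scoring: each ground truth tag carries flags for whether it
# was LINKed in the generated summary and, if so, whether the linked content was CORRECT.
def two_tier_scores(tags):
    """tags: list of dicts like {"tag": "Condition", "linked": True, "correct": True}."""
    total = len(tags)
    linked = sum(t["linked"] for t in tags)
    correct = sum(t["linked"] and t["correct"] for t in tags)
    return {
        "LINK recall": linked / total,        # ground truth info also found in the summary
        "CORRECT recall": correct / total,    # ground truth info reproduced with the same meaning
        "CORRECT accuracy": correct / linked if linked else 0.0,  # accuracy among linked items
    }

example = [
    {"tag": "Condition", "linked": True, "correct": True},
    {"tag": "Test", "linked": True, "correct": False},
    {"tag": "Transportation", "linked": False, "correct": False},
]
print(two_tier_scores(example))  # LINK recall 0.67, CORRECT recall 0.33, CORRECT accuracy 0.5
```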
Statistical Analysis
We performed a two-way analysis of variance (ANOVA) to evaluate the impact of different summarization models (T5-small, T5-base, PEGASUS-PubMed, BART-Large-CNN) and ROUGE metrics (precision, recall, F1-score) on performance scores. This between-groups analysis assessed statistical significance using F-tests, with further post hoc comparisons between models made using Tukey's Honestly Significant Difference (HSD) test. A significance threshold of p < 0.05 was applied.
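For reference, the analysis below sketches a two-way ANOVA followed by Tukey's HSD on the model factor using statsmodels, assuming a long-format table of per-summary ROUGE results with columns "model", "metric", and "score" (a hypothetical file and column names, not the study's actual data file).

```python
# Hedged sketch of the statistical analysis; no interaction term is shown here.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("rouge_scores_long.csv")  # hypothetical per-summary scores in long format

# Two-way ANOVA with model and ROUGE metric as factors.
anova = sm.stats.anova_lm(ols("score ~ C(model) + C(metric)", data=df).fit(), typ=2)
print(anova)

# Post hoc pairwise comparisons between models at alpha = 0.05.
print(pairwise_tukeyhsd(endog=df["score"], groups=df["model"], alpha=0.05))
```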
Results
Quantitative Results
Fine-Tuned Results
Across ROUGE-1 scores, the BART-Large-CNN model displays the highest precision (0.42, confidence interval [CI]: [0.34, 0.49]), recall (0.53, CI: [0.44, 0.62]), and F1-score (0.49, CI: [0.38, 0.51]), indicating a strong ability to capture unigrams from the source text ([Table 1]). The T5-base model follows closely, with a ROUGE-1 precision of 0.41 (CI: [0.30, 0.51]) and recall of 0.41 (CI: [0.32, 0.50]), but a slightly lower F1-score of 0.37 (CI: [0.30, 0.45]), suggesting comparable performance in identifying key unigrams. The T5-small and PEGASUS-PubMed models show lower performance on these metrics, with the PEGASUS-PubMed model exhibiting the lowest F1-score of 0.28 (CI: [0.22, 0.36]). Similar to the ROUGE-1 scores, BART-Large-CNN has the highest recall (ROUGE-2 = 0.28, ROUGE-L = 0.43) and F1-scores (ROUGE-2 = 0.23, ROUGE-L = 0.35), whereas T5-base has the highest precision scores (ROUGE-2 = 0.22, ROUGE-L = 0.34).
Abbreviation: CI, 95% confidence interval.
Zero-Shot Results
[Table 2] reports the performance of the zero-shot models. For ROUGE-1 scores, BART-Large-CNN exhibits the highest precision (0.26, CI: [0.19, 0.34]) and recall (0.23, CI: [0.17, 0.30]), with a corresponding F1-score of 0.23 (CI: [0.17, 0.29]), suggesting a modest capability to identify key unigrams without fine-tuning. The T5-base model also shows relatively better performance compared with T5-small, with precision, recall, and F1-score of 0.30 (CI: [0.22, 0.38]), 0.17 (CI: [0.15, 0.23]), and 0.20 (CI: [0.15, 0.26]), respectively. T5-small has lower scores, and PEGASUS-PubMed's performance is notably minimal, with an F1-score of 0.07 (CI: [0.05, 0.10]). When examining ROUGE-2 scores, which evaluate bigram overlap, the models perform generally poorly, with BART-Large-CNN leading at a lower precision of 0.08 (CI [0.04, 0.12]) and a corresponding F1-score of 0.07 (CI [0.04, 0.10]). The T5 models report low scores, with T5-base obtaining an F1-score of 0.06 (CI [0.03, 0.09]), marginally outperforming T5-small, which has an F1 of 0.05 (CI [0.02, 0.09]). PEGASUS-PubMed has no bigram overlap in this scenario, reflecting significant limitations in its zero-shot performance. Regarding the ROUGE-L scores, BART-Large-CNN achieves the highest F1-score of 0.16 (CI [0.12, 0.21]), albeit modest, indicating its relative advantage in capturing the longest common subsequences in the zero-shot learning context. T5-base and T5-small achieve F1-scores of 0.15 (CI [0.11, 0.21]) and 0.13 (CI [0.08, 0.17]), respectively, followed by PEGASUS-PubMed with an F1-score of 0.06 (CI [0.04, 0.07]).
Abbreviation: CI, 95% confidence interval.
Qualitative Results
We label each of the 100 ground truth summaries and the summaries generated by the BART-Large-CNN model (fine-tuned on the 90 non-holdout samples of the corresponding cross-validation fold) using eight tag categories: Conditions, Behaviors, Measurements, Supplies, Date/Time, Tests, Locations, and Transportation.
[Table 3] presents the average recall for manually annotated information tags in summaries of the fine-tuned BART-Large-CNN. All summaries contain at least one of the specified tags, with an average of 8.67 tags per summary. When examining the average LINK recall, the model performs consistently, with a mean recall of 0.71 (standard deviation [SD] = 0.23), indicating that over 70% of the information present in the ground truth summaries is also found in the generated summaries. The average CORRECT recall is marginally lower at 0.67 (SD = 0.23), suggesting that while the model is proficient at identifying relevant information, there is a slight decrease in accuracy when considering the correctness of the information. [Fig. 3] illustrates the recall characteristics of the fine-tuned BART-Large-CNN model.


Abbreviation: SD, standard deviation.
The “Condition” tag appears in 99% (99/100) of the summaries, and it has a high CORRECT recall at 0.73 (SD = 0.27), which indicates a high degree of precision in reporting patient conditions, symptoms, and diagnoses. However, tags such as “Transportation” are present in only 41% (41/100) of the summaries, with the lowest average LINK and CORRECT recall scores of 0.44 (SD = 0.5). “Behaviors” and “Supplies” tags appear less frequently at 29% (29/100) and 7% (7/100), respectively, yet show relatively high CORRECT recall. [Fig. 4] shows an example note sample outlining CORRECT and LINK annotations and tags.


For all summaries combined, the model demonstrates a LINK recall of 69.7% (604/867), that is, instances where tagged information in the ground truth also appears in the generated summaries ([Table 4]). The CORRECT recall, which indicates the instances where the tagged information from the ground truth summary appears accurately in the generated summary, is slightly lower at 65.7% (570/867). However, of the information that is LINKed correctly, the CORRECT accuracy is high at 94.4% (570/604), indicating that when the model does capture relevant information, it tends to be accurate. "Conditions" shows the highest LINK recall at 72.1% (346/480) and an almost equivalent CORRECT recall at 70.8% (340/480). The CORRECT accuracy for "Conditions" is 98.3% (340/346), indicating that nearly all the condition-related information captured by the model is accurate. The "Behaviors" and "Supplies" tags have the fewest instances but achieve a CORRECT recall of 74.4% (32/43) and 62.5% (5/8), respectively, with both categories achieving a CORRECT accuracy of 100%. Conversely, the "Test" and "Transportation" tags display lower LINK and CORRECT recall.
Model and Metric Differences
Two-way ANOVA analysis showed that the differences in ROUGE scores between the models are statistically significant (p < 0.001). The post hoc (Tukey's HSD) analysis results showed BART-Large-CNN generally performs better in ROUGE metrics compared with PEGASUS-PubMed, whereas T5-small outperforms both T5-base and PEGASUS-PubMed. Different ROUGE metrics affect the scores significantly (p < 0.05), suggesting that the choice of model and the type of ROUGE metric both have statistically meaningful impacts on the reported scores. (See [Supplementary Appendix 2] for the analysis results, available in online version only).
Transcription Differences
We compare the difference in performance between the original (AWS) transcripts and the clean transcripts. BART-Large-CNN's ROUGE-1 F1-score improves by 0.06 when using the clean transcripts. However, T5-base and PEGASUS-PubMed both have lower F1-scores when using the clean transcripts. This difference is mostly negligible for ROUGE-2 and ROUGE-L scores, with differences between F1-scores of less than 0.02. Please see [Supplementary Appendix 3] for the ROUGE scores of AWS transcripts (available in online version only).
Discussion
Our fine-tuned text summarization models report promising results compared with similar applications and tasks.[23] The BART-Large-CNN model shows a greater ability to comprehend and replicate the structure and flow of clinical dialogue in medical conversation with both the fine-tuned and zero-shot approaches. This is similar to the performance of high-performing models on the nonmedical CNN/DailyMail dataset.[27] However, we note that BART-Large-CNN's performance may be influenced by architectural advantages in the dialog summarization task and potentially better alignment with the characteristics of the test data and metrics.[37] In addition, the significant variation in ROUGE scores based on the model and the metric used suggests differences in model utility in specific contexts of text summarization tasks and underlines the importance of model selection based on the specific characteristics and performance metrics of each application domain.
In the literature on digital scribes and clinical dialog summarization, ROUGE-L scores have ranged from 0.42 to 0.55,[21] with recent research demonstrating a 0.43 ROUGE-L score using GPT-4 with an in-context learning approach.[56] Our work contributes to this literature by achieving similar scores with BART-Large-CNN and further expands it by evaluating ED consult and referral conversations.
However, the variation in recall in our study shows an inconsistency in performance, with a subset of notes being replicated with high accuracy yet broader variability indicating room for refinement, especially in achieving consistent correctness. The differential performance across information categories illuminates the necessity of enhancing model recognition capabilities.[57] Although the accuracy rates across most tags are promising, they also highlight the disparity in the model's ability to uniformly identify and convey the full spectrum of clinically relevant information present in the reference summaries.[58] In a zero-shot context, each model performed worse than its fine-tuned counterpart. BART-Large-CNN and T5 have better performance, as these models tend to reproduce some lines of the transcript as the summary. PEGASUS-PubMed, by comparison, outputs text resembling its original training data, which is only somewhat related to the text in the transcript. These results reinforce the idea that competent zero-shot performance might be achievable at larger model sizes as well as by incorporating different architectures and datasets.[59] Furthermore, the variability in model performance in our study, particularly in the context of recall, denotes a significant opportunity for advancing the model's performance with hybrid models[60] [61] or approaches (e.g., user interface design, human-in-the-loop),[62] [63] thereby augmenting its utility in real-world clinical documentation.
Transcription Quality
The transcription quality notably impacts the model performance, as evidenced by the improvement in BART-Large-CNN's ROUGE-1 scores when utilizing clean transcripts. This improvement underscores the importance of high-quality input data for the efficacy of AI-driven clinical documentation.[64] Interestingly, T5-base and PEGASUS-PubMed models register a lower F1-score with clean transcripts, an anomaly that suggests a complex interaction between model architecture and data quality. This observation requires a closer examination of the preprocessing steps and the models' resilience to variations in data quality. In the high-stress, fast-paced ED environment, where documentation accuracy is important, these findings highlight the necessity for robust digital scribe systems capable of handling the inherent variability in clinical speech and text data. The minor differences in ROUGE-2 and ROUGE-L scores with different transcript quality suggest that for capturing the broader context and relationships within the text, the models are less sensitive to transcription errors. This resilience is critical for the practical deployment of digital scribes, where they must perform reliably across varying conditions of data quality.[65]
The Nature of Conversations
In our observation of audio conversations, we note a common pattern involving additional clinicians or health care workers, often leading to multiparticipant calls and extended discussions. The conversation starts with caller information and patient information exchange, followed by patient health information shared later in the conversation. Waiting times with hold tones are frequent. A notable discrepancy between audio summaries and intake notes is that, especially when nurses follow up for additional details, these details are not always included in the initial transcription. Another observation is the variation in note style and content, depending on the nurse taking the notes, indicating differences in documentation approaches among nurses. This added an extra layer of complexity to the task of accurate digital scribing. Additionally, external factors like background noise and coughing during conversations pose potential challenges for automated transcription accuracy.[1] The intake notes sometimes include details from internal consultations that are not present in the original audio, pointing to a possible mismatch in the documentation. These insights underscore the multifaceted nature of clinical communication and the challenges it presents for effective digital documentation.[66]
Implications
The implications of our study extend into several key areas of health care informatics and policy. Firstly, the use of the BART-Large-CNN model in clinical documentation points toward a potential to reduce the documentation burden on HCPs, aligning with the broader goal of mitigating burnout.[1] [13] The high accuracy in key information categories like "Conditions" indicates that AI-assisted tools can effectively complement HCPs in documenting patient conditions. However, in their current forms, the models are insufficient for the task of automatic summarization in a clinical setting. While the summaries produced are often coherent, existing summarization errors would be detrimental if used in practice as a replacement for a human scribe or clinician. Using these models as assistants instead would be a more useful trajectory. Yet, the successful implementation and integration of such tools hinge on their design elements, human factors, and usability.[67] [68] The variability in model performance underscores the need for a user-centered design approach and a systems thinking approach to overcome technical challenges.[69] [70] This involves tailoring these tools to fit into clinical workflows, ensuring they are intuitive and capable of handling the dynamic nature of clinical environments.[71]
In line with recommendations by the Surgeon General and the AMIA 25 × 5 Task Force, these findings inform the development and application of health information technology that supports HCPs, suggesting that policies may encourage the exploration and adoption of AI tools like digital scribes in clinical settings.[16] This could be achieved through incentives for technology adoption, support for implementation research and technical development, and the development of evidence-based guidelines to ensure the ethical and secure use of AI in health care.[72] However, collaboration between HCPs and AI is key to success in improving the accuracy, consistency, and completeness of medical documentation while minimizing documentation errors.[63] [73] It is also important to develop operationalization and implementation plans with accountable, fair, and inclusive AI approaches to ensure the trustworthiness of digital scribes.[74] [75]
Limitations
The limitations of our study are multifaceted, reflecting both methodological constraints and broader challenges in the field. Firstly, the absence of standardized and validated measures for assessing documentation burden presents a significant challenge.[76] Therefore, we depend on our quantitative and qualitative approaches to assess quality, assuming that higher-quality summarization will contribute to reducing documentation burden. Our scoring does not account for differences in notes, note-takers (nurses), and conversations. ROUGE metrics are coherence-insensitive, focusing solely on word overlap without considering the coherence and logical flow of the summaries, which introduces a limitation for the quantitative analysis.[77] Our qualitative evaluation focused on a two-tier assessment, which might limit the perspectives captured. We did not include additional information, such as patient history or treatment plans, in summarization, to reduce complexity and token size and to focus on task-specific performance. The study lacks qualitative feedback from nurses and clinicians to further assess the perceived value and utility of generated summaries. The reliance on F1-scores with corrected transcripts limits the evaluation of utility and does not fully capture the semantic completeness crucial for clinical relevance. Additionally, using clean transcripts as model input might not reflect the complexity of raw clinical conversations. These limitations are compounded by the small and less diverse dataset, single-annotator bias, lack of real-world testing, and the limited scope of the dataset for ED referrals (not representing all ED clinical conversations and related documentation), all of which constrain the generalizability and applicability of our findings. To address these limitations, in the next stages of this study, we plan to: (1) collaboratively develop a highly specific guideline and compile evaluation metrics for summaries to improve assessment of the automated summaries, (2) audit existing summaries against guidelines and apply scoring to help inform the model of the quality of the summaries on which it is being trained, and (3) work with nurses and clinicians to gain user feedback on this system.
Future Work
In future work, we aim to expand the scope and applicability of our research. A primary focus will be on testing a cloud-based transcription and digital scribe pipeline using additional language models with larger and more diverse datasets and longer conversations. This initiative will be geared toward developing a deployable pipeline, with a specific scenario involving a call service connection and providing immediate feedback to nurses through a web application. Another important area of exploration will be hybrid models[41] combining statistical, machine learning, and computational linguistics techniques, further experimenting with zero- and few-shot learning, and conducting a comparative study to measure effort, task completion time, user variation and human factors, cognitive load, burden and stress, and other relevant units of analysis toward clinical utilization.[76]
Conclusion
Our study introduces the development and testing of a digital scribe pipeline, contributing to the field of automated clinical documentation and efficient documentation flow. By utilizing a real-world dataset, our research addresses a critical gap in the literature, particularly in the areas of workflow optimization and clinical and nurse informatics applications.[1] The practical implications of our findings include potential time and resource savings for health care systems and a reduced documentation burden among nurses and clinicians, thereby enhancing overall health care delivery efficiency and quality.
Clinical Relevance Statement
The study demonstrates the performance and potential of a digital scribe system to effectively summarize clinical conversations in emergency departments, offering a tool that could lead to more efficient clinical documentation, improved accuracy in medical records, and a reduced clinician documentation burden.
Multiple-Choice Questions
1. Which aspect of health care professionals' workload does the digital scribe system primarily aim to address in the emergency department setting?
   a. Patient diagnosis
   b. Appointment scheduling
   c. Documentation burden
   d. Treatment planning
Correct Answer: The correct answer is option c. The study focuses on developing a digital scribe system to summarize clinical conversations in emergency departments. The primary goal is to alleviate the documentation burden on health care professionals, a significant factor contributing to clinician burnout and inefficiencies in patient care.
2. In the study, which model exhibited superior performance in the task of clinical conversation summarization, as evidenced by the highest ROUGE scores?
   a. T5-small
   b. T5-base
   c. PEGASUS-PubMed
   d. BART-Large-CNN
Correct Answer: The correct answer is option d. Among the models tested (T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN), the BART-Large-CNN model demonstrated the highest performance in summarization tasks. This was evidenced by its superior ROUGE scores, indicating its effectiveness in capturing key information from clinical conversations.
3. What factor is important in enhancing the accuracy and utility of AI-assisted clinical summarization tools?
   a. Increasing the speed of data processing
   b. Expanding the variety of medical conditions covered
   c. Improving the quality of input data (transcriptions)
   d. Reducing the cost of technology implementation
Correct Answer: The correct answer is option c. The study highlights the significant impact of input data quality on the performance of AI-assisted clinical documentation tools. High-quality transcriptions are crucial for these tools to accurately summarize and capture essential information from clinical conversations. The improvement in the BART-Large-CNN model's performance with clean transcripts compared with AWS transcripts underscores this point.
Conflict of interest
None declared.
Protection of Human and Animal Subjects
No human subjects were involved in the study.
-
References
- 1 Quiroz JC, Laranjo L, Kocaballi AB, Berkovsky S, Rezazadegan D, Coiera E. Challenges of developing a digital scribe to reduce clinical documentation burden. NPJ Digit Med 2019; 2: 114
- 2 Chandawarkar A, Chaparro JD. Burnout in clinicians. Curr Probl Pediatr Adolesc Health Care 2021; 51 (11) 101104
- 3 Joukes E, Abu-Hanna A, Cornet R, de Keizer NF. Time spent on dedicated patient care and documentation tasks before and after the introduction of a structured and standardized electronic health record. Appl Clin Inform 2018; 9 (01) 46-53
- 4 Moukarzel A, Michelet P, Durand AC. et al. Burnout syndrome among emergency department staff: prevalence and associated factors. BioMed Res Int 2019; 2019: 6462472
- 5 Moy AJ, Hobensack M, Marshall K. et al. Understanding the perceived role of electronic health records and workflow fragmentation on clinician documentation burden in emergency departments. J Am Med Inform Assoc 2023; 30 (05) 797-808
- 6 Morley C, Unwin M, Peterson GM, Stankovich J, Kinsman L. Emergency department crowding: a systematic review of causes, consequences and solutions. PLoS One 2018; 13 (08) e0203316
- 7 Kelen GD, Wolfe R, D'Onofrio G, Mills AM. Emergency department crowding: the canary in the health care system. NEJM Catal 2021; 2 (05)
- 8 Colicchio TK, Cimino JJ, Del Fiol G. Unintended consequences of nationwide electronic health record adoption: challenges and opportunities in the post-meaningful use era. J Med Internet Res 2019; 21 (06) e13313
- 9 Reich J. The physician's view: healthcare digital transformation priorities and challenges. In: Hübner UH, Mustata Wilson G, Morawski TS, Ball MJ. eds. Nursing Informatics: A Health Informatics, Interprofessional and Global Perspective. Springer International Publishing; 2022: 57-67
- 10 Holmgren AJ, Downing NL, Bates DW. et al. Assessment of electronic health record use between US and non-US health systems. JAMA Intern Med 2021; 181 (02) 251-259
- 11 Lavander P, Meriläinen M, Turkki L. Working time use and division of labour among nurses and health-care workers in hospitals - a systematic review. J Nurs Manag 2016; 24 (08) 1027-1040
- 12 Harris DA, Haskell J, Cooper E, Crouse N, Gardner R. Estimating the association between burnout and electronic health record-related stress among advanced practice registered nurses. Appl Nurs Res 2018; 43: 36-41
- 13 Wang J, Lavender M, Hoque E, Brophy P, Kautz H. A patient-centered digital scribe for automatic medical documentation. JAMIA Open 2021; 4 (01) ooab003
- 14 Shanafelt TD, Dyrbye LN, Sinsky C. et al. Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction. Mayo Clin Proc 2016; 91 (07) 836-848
- 15 Health worker burnout. Accessed November 27, 2023 at: https://www.hhs.gov/surgeongeneral/priorities/health-worker-burnout/index.html
- 16 AMIA 25 × 5. AMIA - American Medical Informatics Association. Accessed November 27, 2023 at: https://amia.org/about-amia/amia-25×5
- 17 25 by 5: Columbia leads symposium, ongoing efforts to reduce documentation burden on U.S. clinicians. Columbia DBMI. Accessed November 27, 2023 at: https://www.dbmi.columbia.edu/25×5/
- 18 Lin SY, Shanafelt TD, Asch SM. Reimagining clinical documentation with artificial intelligence. Mayo Clin Proc 2018; 93 (05) 563-565
- 19 Luh JY, Thompson RF, Lin S. Clinical documentation and patient care using artificial intelligence in radiation oncology. J Am Coll Radiol 2019; 16 (9 Pt B): 1343-1346
- 20 Bohr A, Memarzadeh K. Chapter 2 - The rise of artificial intelligence in healthcare applications. In: Bohr A, Memarzadeh K. eds. Artificial Intelligence in Healthcare. Academic Press; 2020: 25-60
- 21 van Buchem MM, Boosman H, Bauer MP, Kant IMJ, Cammel SA, Steyerberg EW. The digital scribe in clinical practice: a scoping review and research agenda. NPJ Digit Med 2021; 4 (01) 57
- 22 Coiera E, Kocaballi B, Halamka J, Laranjo L. The digital scribe. NPJ Digit Med 2018; 1: 58
- 23 Goodwin TR, Savery ME, Demner-Fushman D. Flight of the PEGASUS? Comparing transformers on few-shot and zero-shot multi-document abstractive summarization. Proc Int Conf Comput Ling 2020; 2020: 5640-5646
- 24 Tierney AA, Gregg G, Brian H. et al. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catal 2024; 5 (03) CAT.23.0404
- 25 Rezazadegan D, Berkovsky S, Quiroz JC. et al. Automatic speech summarisation: a scoping review. arXiv [csCL] . Accessed August 27, 2020 at: http://arxiv.org/abs/2008.11897
- 26 Ghatnekar S, Faletsky A, Nambudiri VE. Digital scribe utility and barriers to implementation in clinical practice: a scoping review. Health Technol (Berl) 2021; 11 (04) 803-809
- 27 Zhang M, Zhou G, Yu W, Huang N, Liu W. A comprehensive survey of abstractive text summarization based on deep learning. Comput Intell Neurosci 2022; 2022: 7132226
- 28 Goyal T, Xu J, Li JJ, Durrett G. Training dynamics for text summarization models. arXiv [csCL] . Accessed October 15, 2021 at: http://arxiv.org/abs/2110.08370
- 29 Zhu C, Xu R, Zeng M, Huang X. A hierarchical network for abstractive meeting summarization with cross-domain pretraining. In: Cohn T, He Y, Liu Y. eds. Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics; 2020: 194-203
- 30 Nenkova A, McKeown K. Automatic Summarization. Foundations and Trends® in Information Retrieval 2011; 5 (2-3): 103-233. doi:10.1561/1500000015
- 31 Moratanch N, Chitrakala S. A survey on extractive text summarization. In: 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP). IEEE; 2017: 1-6
- 32 Sotudeh S, Goharian N, Filice RW. Attend to medical ontologies: content selection for clinical abstractive summarization. arXiv [csCL] . Accessed May 1, 2020 at: http://arxiv.org/abs/2005.00163
- 33 Liu C, Wang P, Xu J, Li Z, Ye J. Automatic dialogue summary generation for customer service. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD '19. Association for Computing Machinery; 2019: 1957-1965
- 34 Lin H, Ng V. Abstractive summarization: a survey of the state of the art. AAAI 2019; 33 (01) 9815-9822
- 35 Brown T, Mann B, Ryder N. et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020; 33: 1877-1901
- 36 Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30
- 37 Lewis M, Liu Y, Goyal N. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv [csCL] . Accessed October 29, 2019 at: http://arxiv.org/abs/1910.13461
- 38 Aghajanyan A, Gupta A, Shrivastava A, Chen X, Zettlemoyer L, Gupta S. Muppet: massive multi-task representations with pre-finetuning. arXiv [csCL] . Accessed January 26, 2021 at: http://arxiv.org/abs/2101.11038
- 39 Raffel C, Shazeer N, Roberts A. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 2020; 21 (01) 5485-5551
- 40 Chung HW, Hou L, Longpre S. et al. Scaling instruction-finetuned language models. arXiv [csLG] . Accessed October 20, 2022 at: http://arxiv.org/abs/2210.11416
- 41 Wang M, Wang M, Yu F, Yang Y, Walker J, Mostafa J. A systematic review of automatic text summarization for biomedical literature and EHRs. J Am Med Inform Assoc 2021; 28 (10) 2287-2297
- 42 Feng X, Feng X, Qin B. A survey on dialogue summarization: recent advances and new frontiers. arXiv [csCL] . Accessed July 7, 2021 at: http://arxiv.org/abs/2107.03175
- 43 Physician Direct Connect (PDC). Accessed November 21, 2023 at: https://www.nationwidechildrens.org/for-medical-professionals/refer-a-patient/referrals-and-scheduling/pdc
- 44 Epic. Accessed December 5, 2023 at: https://www.epic.com/
- 45 Amazon Web Services - Transcribe. Accessed December 1, 2021 at: https://aws.amazon.com/transcribe/
- 46 Lewis M, Liu Y, Goyal N. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020: 7871-7880
- 47 Zhang J, Zhao Y, Saleh M, Liu P. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: Iii HD, Singh A, eds. Proceedings of the 37th International Conference on Machine Learning. Vol. 119. Proceedings of Machine Learning Research; July 13–18, 2020: 11328-11339
- 48 Jia Z, Chen J, Xu X. et al. The importance of resource awareness in artificial intelligence for healthcare. Nat Mach Intell 2023; 5 (07) 687-698
- 49 Koch M, Arlandini C, Antonopoulos G. et al. HPC+ in the medical field: overview and current examples. Technol Health Care 2023; 31 (04) 1509-1523
- 50 Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv [csLG]. Accessed November 14, 2017 at: http://arxiv.org/abs/1711.05101
- 51 Ecosystem. PyTorch. Accessed April 10, 2024 at: https://pytorch.org/ecosystem/
- 52 Models. Accessed April 10, 2024 at: https://huggingface.co/models
- 53 Python software. Python.org. Accessed April 12, 2024 at: https://www.python.org/downloads/
- 54 Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out. Association for Computational Linguistics; 2004: 74-81
- 55 Liu S, McCoy AB, Wright AP. et al. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc 2024; 31 (06) 1388-1396
- 56 Van Veen D, Van Uden C, Blankemeier L. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024; 30: 1134-1142
- 57 Cohen A, Kantor A, Hilleli S, Kolman E. Automatic rephrasing of transcripts-based action items. In: Zong C, Xia F, Li W, Navigli R. eds. Findings of the Association for Computational Linguistics. Association for Computational Linguistics; 2021: 2862-2873
- 58 Gao Y, Miller T, Xu D, Dligach D, Churpek MM, Afshar M. Summarizing patients' problems from hospital progress notes using pre-trained sequence-to-sequence models. Proc Int Conf Comput Ling 2022; 2022: 2979-2991
- 59 Gao J, Zhao H, Zhang Y, Wang W, Yu C, Xu R. Benchmarking large language models with augmented instructions for fine-grained information extraction. arXiv [csCL] . Accessed October 8, 2023 at: http://arxiv.org/abs/2310.05092
- 60 Nguyen QA, Duong QH, Nguyen MQ. et al. A hybrid multi-answer summarization model for the biomedical question-answering system. In: 2021 13th International Conference on Knowledge and Systems Engineering (KSE). IEEE; 2021: 1-6
- 61 Park J, Kotzias D, Kuo P. et al. Detecting conversation topics in primary care office visits from transcripts of patient-provider interactions. J Am Med Inform Assoc 2019; 26 (12) 1493-1504
- 62 Shortliffe EH, Sepúlveda MJ. Clinical decision support in the era of artificial intelligence. JAMA 2018; 320 (21) 2199-2200
- 63 Sezgin E. Artificial intelligence in healthcare: complementing, not replacing, doctors and healthcare providers. Digit Health 2023; 9: 20552076231186520
- 64 Rousseau I, Fosse L, Dkhissi Y, Damnati G, Lecorvé G. Darbarer @ AutoMin2023. Transcription simplification for concise minute generation from multi-party conversations. International Conference on Natural Language Generation. Accessed December 1, 2023 at: https://www.semanticscholar.org/paper/3d8c3cd49045e8310174146e571fae7092c7a770
- 65 Nanayakkara G, Wiratunga N, Corsar D, Martin K, Wijekoon A. Clinical dialogue transcription error correction with self-supervision. In: Artificial Intelligence XL. Springer Nature Switzerland; 2023: 33-46
- 66 Ganoe CH, Wu W, Barr PJ. et al. Natural language processing for automated annotation of medication mentions in primary care visit conversations. JAMIA Open 2021; 4 (03) ooab071
- 67 Smits M, Nacar M, Ludden GDS, van Goor H. Stepwise design and evaluation of a values-oriented ambient intelligence healthcare monitoring platform. Value Health 2022; 25 (06) 914-923
- 68 Rao A, Pang M, Kim J. et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J Med Internet Res 2023; 25: e48659
- 69 Rudin RS, Perez S, Rodriguez JA. et al. User-centered design of a scalable, electronic health record-integrated remote symptom monitoring intervention for patients with asthma and providers in primary care. J Am Med Inform Assoc 2021; 28 (11) 2433-2444
- 70 McNab D, McKay J, Shorrock S, Luty S, Bowie P. Development and application of 'systems thinking' principles for quality improvement. BMJ Open Qual 2020; 9 (01) e000714
- 71 Magrabi F, Ammenwerth E, McNair JB. et al. Artificial intelligence in clinical decision support: challenges for evaluating AI and practical implications. Yearb Med Inform 2019; 28 (01) 128-134
- 72 Liao F, Adelaine S, Afshar M, Patterson BW. Governance of clinical AI applications to facilitate safe and equitable deployment in a large health system: key elements and early successes. Front Digit Health 2022; 4: 931439
- 73 Bossen C, Pine KH. Batman and Robin in healthcare knowledge work: human-AI collaboration by clinical documentation integrity specialists. ACM Trans Comput Hum Interact 2023; 30 (02) 1-29
- 74 Zhang G, Jin Q, McInerney DJ. et al. Leveraging generative AI for clinical evidence summarization needs to achieve trustworthiness. arXiv [csAI] . Accessed November 19, 2023 at: http://arxiv.org/abs/2311.11211
- 75 Sezgin E, Sirrianni J, Linwood SL. Operationalizing and implementing pretrained, large artificial intelligence linguistic models in the US health care system: outlook of generative pretrained transformer 3 (GPT-3) as a service model. JMIR Med Inform 2022; 10 (02) e32875
- 76 Moy AJ, Schwartz JM, Chen R. et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inform Assoc 2021; 28 (05) 998-1008
- 77 Schluter N. The limits of automatic summarisation according to ROUGE. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics; 2017: 41-45
Publication History
Received: 08 January 2024
Accepted: 14 May 2024
Accepted Manuscript online: 15 May 2024
Article published online: 24 July 2024
© 2024. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
-
References
- 1 Quiroz JC, Laranjo L, Kocaballi AB, Berkovsky S, Rezazadegan D, Coiera E. Challenges of developing a digital scribe to reduce clinical documentation burden. NPJ Digit Med 2019; 2: 114
- 2 Chandawarkar A, Chaparro JD. Burnout in clinicians. Curr Probl Pediatr Adolesc Health Care 2021; 51 (11) 101104
- 3 Joukes E, Abu-Hanna A, Cornet R, de Keizer NF. Time spent on dedicated patient care and documentation tasks before and after the introduction of a structured and standardized electronic health record. Appl Clin Inform 2018; 9 (01) 46-53
- 4 Moukarzel A, Michelet P, Durand AC. et al. Burnout syndrome among emergency department staff: prevalence and associated factors. BioMed Res Int 2019; 2019: 6462472
- 5 Moy AJ, Hobensack M, Marshall K. et al. Understanding the perceived role of electronic health records and workflow fragmentation on clinician documentation burden in emergency departments. J Am Med Inform Assoc 2023; 30 (05) 797-808
- 6 Morley C, Unwin M, Peterson GM, Stankovich J, Kinsman L. Emergency department crowding: a systematic review of causes, consequences and solutions. PLoS One 2018; 13 (08) e0203316
- 7 Kelen GD, Wolfe R, D'Onofrio G, Mills AM. Emergency department crowding: the canary in the health care system. . NEJM Catal 2021;2(05):
- 8 Colicchio TK, Cimino JJ, Del Fiol G. Unintended consequences of nationwide electronic health record adoption: challenges and opportunities in the post-meaningful use era. J Med Internet Res 2019; 21 (06) e13313
- 9 Reich J. The physician's view: healthcare digital transformation priorities and challenges. In: Hübner UH, Mustata WilsonG, Morawski TS, Ball MJ. eds. Nursing Informatics: A Health Informatics, Interprofessional and Global Perspective. Springer International Publishing;; 2022: 57-67
- 10 Holmgren AJ, Downing NL, Bates DW. et al. Assessment of electronic health record use between US and non-US health systems. JAMA Intern Med 2021; 181 (02) 251-259
- 11 Lavander P, Meriläinen M, Turkki L. Working time use and division of labour among nurses and health-care workers in hospitals - a systematic review. J Nurs Manag 2016; 24 (08) 1027-1040
- 12 Harris DA, Haskell J, Cooper E, Crouse N, Gardner R. Estimating the association between burnout and electronic health record-related stress among advanced practice registered nurses. Appl Nurs Res 2018; 43: 36-41
- 13 Wang J, Lavender M, Hoque E, Brophy P, Kautz H. A patient-centered digital scribe for automatic medical documentation. JAMIA Open 2021; 4 (01) ooab003
- 14 Shanafelt TD, Dyrbye LN, Sinsky C. et al. Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction. Mayo Clin Proc 2016; 91 (07) 836-848
- 15 Health worker burnout. Accessed November 27, 2023 at: https://www.hhs.gov/surgeongeneral/priorities/health-worker-burnout/index.html
- 16 AMIA 25 × 5. AMIA - American Medical Informatics Association. Accessed November 27, 2023 at: https://amia.org/about-amia/amia-25×5
- 17 25 by 5: Columbia leads symposium, ongoing efforts to reduce documentation burden on U.S. clinicians. Columbia DBMI. Accessed November 27, 2023 at: https://www.dbmi.columbia.edu/25×5/
- 18 Lin SY, Shanafelt TD, Asch SM. Reimagining clinical documentation with artificial intelligence. Mayo Clin Proc 2018; 93 (05) 563-565
- 19 Luh JY, Thompson RF, Lin S. Clinical documentation and patient care using artificial intelligence in radiation oncology. J Am Coll Radiol 2019; 16 (9 Pt B): 1343-1346
- 20 Bohr A, Memarzadeh K. Chapter 2 - The rise of artificial intelligence in healthcare applications. In: Bohr A, Memarzadeh K. eds. Artificial Intelligence in Healthcare. Academic Press; 2020: 25-60
- 21 van Buchem MM, Boosman H, Bauer MP, Kant IMJ, Cammel SA, Steyerberg EW. The digital scribe in clinical practice: a scoping review and research agenda. NPJ Digit Med 2021; 4 (01) 57
- 22 Coiera E, Kocaballi B, Halamka J, Laranjo L. The digital scribe. NPJ Digit Med 2018; 1: 58
- 23 Goodwin TR, Savery ME, Demner-Fushman D. Flight of the PEGASUS? Comparing transformers on few-shot and zero-shot multi-document abstractive summarization. Proc Int Conf Comput Ling 2020; 2020: 5640-5646
- 24 Tierney AA, Gayre G, Hoberman B. et al. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catal 2024; 5 (03) CAT.23.0404
- 25 Rezazadegan D, Berkovsky S, Quiroz JC. et al. Automatic speech summarisation: a scoping review. arXiv [csCL]. Accessed August 27, 2020 at: http://arxiv.org/abs/2008.11897
- 26 Ghatnekar S, Faletsky A, Nambudiri VE. Digital scribe utility and barriers to implementation in clinical practice: a scoping review. Health Technol (Berl) 2021; 11 (04) 803-809
- 27 Zhang M, Zhou G, Yu W, Huang N, Liu W. A comprehensive survey of abstractive text summarization based on deep learning. Comput Intell Neurosci 2022; 2022: 7132226
- 28 Goyal T, Xu J, Li JJ, Durrett G. Training dynamics for text summarization models. arXiv [csCL]. Accessed October 15, 2021 at: http://arxiv.org/abs/2110.08370
- 29 Zhu C, Xu R, Zeng M, Huang X. A hierarchical network for abstractive meeting summarization with cross-domain pretraining. In: Cohn T, He Y, Liu Y. eds. Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics; 2020: 194-203
- 30 Nenkova A, McKeown K. Automatic summarization. Found Trends Inf Retr 2011; 5 (2–3): 103-233. DOI: 10.1561/1500000015
- 31 Moratanch N, Chitrakala S. A survey on extractive text summarization. In: 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP). IEEE; 2017: 1-6
- 32 Sotudeh S, Goharian N, Filice RW. Attend to medical ontologies: content selection for clinical abstractive summarization. arXiv [csCL]. Accessed May 1, 2020 at: http://arxiv.org/abs/2005.00163
- 33 Liu C, Wang P, Xu J, Li Z, Ye J. Automatic dialogue summary generation for customer service. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD '19. Association for Computing Machinery; 2019: 1957-1965
- 34 Lin H, Ng V. Abstractive summarization: a survey of the state of the art. AAAI 2019; 33 (01) 9815-9822
- 35 Brown T, Mann B, Ryder N. et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020; 33: 1877-1901
- 36 Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30
- 37 Lewis M, Liu Y, Goyal N. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv [csCL]. Accessed October 29, 2019 at: http://arxiv.org/abs/1910.13461
- 38 Aghajanyan A, Gupta A, Shrivastava A, Chen X, Zettlemoyer L, Gupta S. Muppet: massive multi-task representations with pre-finetuning. arXiv [csCL]. Accessed January 26, 2021 at: http://arxiv.org/abs/2101.11038
- 39 Raffel C, Shazeer N, Roberts A. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 2020; 21 (01) 5485-5551
- 40 Chung HW, Hou L, Longpre S. et al. Scaling instruction-finetuned language models. arXiv [csLG]. Accessed October 20, 2022 at: http://arxiv.org/abs/2210.11416
- 41 Wang M, Wang M, Yu F, Yang Y, Walker J, Mostafa J. A systematic review of automatic text summarization for biomedical literature and EHRs. J Am Med Inform Assoc 2021; 28 (10) 2287-2297
- 42 Feng X, Feng X, Qin B. A survey on dialogue summarization: recent advances and new frontiers. arXiv [csCL]. Accessed July 7, 2021 at: http://arxiv.org/abs/2107.03175
- 43 Physician Direct Connect (PDC). Accessed November 21, 2023 at: https://www.nationwidechildrens.org/for-medical-professionals/refer-a-patient/referrals-and-scheduling/pdc
- 44 Epic. Accessed December 5, 2023 at: https://www.epic.com/
- 45 Amazon Web Services - Transcribe. Accessed December 1, 2021 at: https://aws.amazon.com/transcribe/
- 46 Lewis M, Liu Y, Goyal N. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020: 7871-7880
- 47 Zhang J, Zhao Y, Saleh M, Liu P. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: Daumé III H, Singh A, eds. Proceedings of the 37th International Conference on Machine Learning. Vol. 119. Proceedings of Machine Learning Research; July 13–18, 2020: 11328-11339
- 48 Jia Z, Chen J, Xu X. et al. The importance of resource awareness in artificial intelligence for healthcare. Nat Mach Intell 2023; 5 (07) 687-698
- 49 Koch M, Arlandini C, Antonopoulos G. et al. HPC+ in the medical field: overview and current examples. Technol Health Care 2023; 31 (04) 1509-1523
- 50 Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv [csLG]. Accessed November 14, 2017 at: http://arxiv.org/abs/1711.05101
- 51 Ecosystem. PyTorch. Accessed April 10, 2024 at: https://pytorch.org/ecosystem/
- 52 Models. Accessed April 10, 2024 at: https://huggingface.co/models
- 53 Python software. Python.org. Accessed April 12, 2024 at: https://www.python.org/downloads/
- 54 Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out. Association for Computational Linguistics; 2004: 74-81
- 55 Liu S, McCoy AB, Wright AP. et al. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc 2024; 31 (06) 1388-1396
- 56 Van Veen D, Van Uden C, Blankemeier L. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024; 30: 1134-1142
- 57 Cohen A, Kantor A, Hilleli S, Kolman E. Automatic rephrasing of transcripts-based action items. In: Zong C, Xia F, Li W, Navigli R. eds. Findings of the Association for Computational Linguistics. Association for Computational Linguistics; 2021: 2862-2873
- 58 Gao Y, Miller T, Xu D, Dligach D, Churpek MM, Afshar M. Summarizing patients' problems from hospital progress notes using pre-trained sequence-to-sequence models. Proc Int Conf Comput Ling 2022; 2022: 2979-2991
- 59 Gao J, Zhao H, Zhang Y, Wang W, Yu C, Xu R. Benchmarking large language models with augmented instructions for fine-grained information extraction. arXiv [csCL]. Accessed October 8, 2023 at: http://arxiv.org/abs/2310.05092
- 60 Nguyen QA, Duong QH, Nguyen MQ. et al. A hybrid multi-answer summarization model for the biomedical question-answering system. In: 2021 13th International Conference on Knowledge and Systems Engineering (KSE). IEEE; 2021: 1-6
- 61 Park J, Kotzias D, Kuo P. et al. Detecting conversation topics in primary care office visits from transcripts of patient-provider interactions. J Am Med Inform Assoc 2019; 26 (12) 1493-1504
- 62 Shortliffe EH, Sepúlveda MJ. Clinical decision support in the era of artificial intelligence. JAMA 2018; 320 (21) 2199-2200
- 63 Sezgin E. Artificial intelligence in healthcare: complementing, not replacing, doctors and healthcare providers. Digit Health 2023; 9: 20552076231186520
- 64 Rousseau I, Fosse L, Dkhissi Y, Damnati G, Lecorvé G. Darbarer @ AutoMin2023. Transcription simplification for concise minute generation from multi-party conversations. International Conference on Natural Language Generation. Accessed December 1, 2023 at: https://www.semanticscholar.org/paper/3d8c3cd49045e8310174146e571fae7092c7a770
- 65 Nanayakkara G, Wiratunga N, Corsar D, Martin K, Wijekoon A. Clinical dialogue transcription error correction with self-supervision. In: Artificial Intelligence XL. Springer Nature Switzerland; 2023: 33-46
- 66 Ganoe CH, Wu W, Barr PJ. et al. Natural language processing for automated annotation of medication mentions in primary care visit conversations. JAMIA Open 2021; 4 (03) ooab071
- 67 Smits M, Nacar M, Ludden GDS, van Goor H. Stepwise design and evaluation of a values-oriented ambient intelligence healthcare monitoring platform. Value Health 2022; 25 (06) 914-923
- 68 Rao A, Pang M, Kim J. et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J Med Internet Res 2023; 25: e48659
- 69 Rudin RS, Perez S, Rodriguez JA. et al. User-centered design of a scalable, electronic health record-integrated remote symptom monitoring intervention for patients with asthma and providers in primary care. J Am Med Inform Assoc 2021; 28 (11) 2433-2444
- 70 McNab D, McKay J, Shorrock S, Luty S, Bowie P. Development and application of 'systems thinking' principles for quality improvement. BMJ Open Qual 2020; 9 (01) e000714
- 71 Magrabi F, Ammenwerth E, McNair JB. et al. Artificial intelligence in clinical decision support: challenges for evaluating AI and practical implications. Yearb Med Inform 2019; 28 (01) 128-134
- 72 Liao F, Adelaine S, Afshar M, Patterson BW. Governance of clinical AI applications to facilitate safe and equitable deployment in a large health system: key elements and early successes. Front Digit Health 2022; 4: 931439
- 73 Bossen C, Pine KH. Batman and Robin in healthcare knowledge work: human-AI collaboration by clinical documentation integrity specialists. ACM Trans Comput Hum Interact 2023; 30 (02) 1-29
- 74 Zhang G, Jin Q, McInerney DJ. et al. Leveraging generative AI for clinical evidence summarization needs to achieve trustworthiness. arXiv [csAI]. Accessed November 19, 2023 at: http://arxiv.org/abs/2311.11211
- 75 Sezgin E, Sirrianni J, Linwood SL. Operationalizing and implementing pretrained, large artificial intelligence linguistic models in the US health care system: outlook of generative pretrained transformer 3 (GPT-3) as a service model. JMIR Med Inform 2022; 10 (02) e32875
- 76 Moy AJ, Schwartz JM, Chen R. et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inform Assoc 2021; 28 (05) 998-1008
- 77 Schluter N. The limits of automatic summarisation according to ROUGE. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics; 2017: 41-45







