Interactive NLP in Clinical Care: Identifying Incidental Findings in Radiology Reports

Gaurav Trivedi; Esmaeel R. Dadashzadeh; Robert M. Handzel; Wendy W. Chapman; Shyam Visweswaran; Harry Hochheiser

doi:10.1055/s-0039-1695791

Subscribe to RSS

Please copy the URL and add it into your RSS Feed Reader.

https://www.thieme-connect.de/rss/thieme/en/10.1055-s-00035026.xml

Share / Bookmark

Facebook Linkedin Weibo

Download PDF

Appl Clin Inform 2019; 10(04): 655-669
DOI: 10.1055/s-0039-1695791

Research Article

Georg Thieme Verlag KG Stuttgart · New York

Interactive NLP in Clinical Care: Identifying Incidental Findings in Radiology Reports

Gaurav Trivedi

¹Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, United States

,

Esmaeel R. Dadashzadeh

²Department of Surgery and Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States

,

Robert M. Handzel

³Department of Surgery, University of Pittsburgh, Pittsburgh, Pennsylvania, United States

,

Wendy W. Chapman

⁴Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, United States

,

Shyam Visweswaran

¹Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, United States

⁵Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States

,

Harry Hochheiser

¹Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, United States

⁵Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States

› Author Affiliations

Funding The research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R01LM012095 and a Provost’s Fellowship in Intelligent Systems at the University of Pittsburgh (awarded to G.T.). The content of the paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the University of Pittsburgh.

Further Information

Address for correspondence

Harry Hochheiser, PhD

Department of Biomedical Informatics

University of Pittsburgh 5607 Baum Boulevard, Pittsburgh, PA 15206

Email: harryh@pitt.edu

Publication History

25 April 2019

09 July 2019

Publication Date:
04 September 2019 (online)

Also available at

Abstract
Full Text
References
Figures
Supplementary Material

PDF Download Permissions and Reprints

Abstract
Background and Significance

Related Work

Objectives

NLP Learning Pipeline Requirements

User Interface Requirements

(i) Review

(ii) Feedback

(iii) Retrain

Hypothesis

Methods

Data and Annotation

Learning Pipeline

User Interface

Review

Feedback

Retrain

Implementation

Evaluation

Evaluation of Model Performance

Usability Evaluation

Results

Participants

Model Performance

Usability Results

Review

Feedback

Retrain

Suggested Future Directions

Discussion

Study Limitations

Future Work

Conclusion
Clinical Relevance Statement
Multiple Choice Questions
References

Abstract

Background Despite advances in natural language processing (NLP), extracting information from clinical text is expensive. Interactive tools that are capable of easing the construction, review, and revision of NLP models can reduce this cost and improve the utility of clinical reports for clinical and secondary use.

Objectives We present the design and implementation of an interactive NLP tool for identifying incidental findings in radiology reports, along with a user study evaluating the performance and usability of the tool.

Methods Expert reviewers provided gold standard annotations for 130 patient encounters (694 reports) at sentence, section, and report levels. We performed a user study with 15 physicians to evaluate the accuracy and usability of our tool. Participants reviewed encounters split into intervention (with predictions) and control conditions (no predictions). We measured changes in model performance, the time spent, and the number of user actions needed. The System Usability Scale (SUS) and an open-ended questionnaire were used to assess usability.

Results Starting from bootstrapped models trained on 6 patient encounters, we observed an average increase in F1 score from 0.31 to 0.75 for reports, from 0.32 to 0.68 for sections, and from 0.22 to 0.60 for sentences on a held-out test data set, over an hour-long study session. We found that tool helped significantly reduce the time spent in reviewing encounters (134.30 vs. 148.44 seconds in intervention and control, respectively), while maintaining overall quality of labels as measured against the gold standard. The tool was well received by the study participants with a very good overall SUS score of 78.67.

Conclusion The user study demonstrated successful use of the tool by physicians for identifying incidental findings. These results support the viability of adopting interactive NLP tools in clinical care settings for a wider range of clinical applications.

Keywords

workflow - data display - data interpretation - statistical - medical records systems - computerized

Background and Significance

Despite advances in natural language processing (NLP), extracting relevant information from clinical text reports remains challenging and time-consuming.[1] Interactive tools capable of easing the construction, review, and revision of NLP models can reduce the cost of constructing models and improve the utility of clinical reports for physicians, administrators, and other stakeholders.

We present the design, implementation, and evaluation of an interactive NLP tool for identifying incidental findings in radiology reports of trauma patients ([Fig. 1]). The modern care of trauma patients relies on extensive use of whole-body computed tomography (CT) imaging for assessment of injuries.[2] Although CT imaging is invaluable in demonstrating the extent of injuries, unrelated incidental findings such as occult masses, lesions, and anatomic anomalies are often uncovered.[3] Incidental findings are quite common and range from an insignificant cyst in the kidney to a life-threatening nodule in the lung.[4] The members of the trauma team are responsible for interpreting the radiology reports, identifying and assessing the incidental findings, and conveying this information to the patient and other physicians. However, in a busy trauma center with acutely injured patients, the task of identifying and collating incidental findings is taxing.[5] The importance of clinical context in classifying a finding as incidental is a key source of difficulties. For example, a trauma surgeon's notion of an incidental finding may be very different from an oncologist's definition of an incidental finding in a cancer patient. This presents a challenge to automated text extraction approaches based on limited training data, making the identification of incidental findings a task best served by models customized to the clinical context and medical specialty.

Fig. 1 (1) A deidentified radiology report of computed tomography (CT) imaging in a patient with trauma. It revealed a nodule as an incidental finding that is highlighted in yellow by the prototype tool (a and c). Users are able to add incidental findings missed by the prototype (bolded in a) and also remove incorrectly highlighted findings (b). (2) The tool shows an overview of the patient case in a miniaturized view of all the records with highlights marking regions of interest (d). In the right sidebar, the tool allows the users to define search terms to be highlighted in pink. (3) These can be seen as rules which can help attract user attention to potentially important parts of the case. (4) Shows a list of predictions made by system. Clicking on a blurb item scrolls the report view to relevant prediction into view. (5) A log of feedback items and changes recorded by the user.

Interactive NLP tools that provide end-users with the ability to easily label data, refine models, and review the results of those changes have the potential to lower the costs associated with the customization, and therefore to increase the value of NLP on clinical reports ([Fig. 2]). Interactive NLP can improve the clinical workflow and decrease time spent in documenting by automatically identifying and extracting relevant content needed for tasks such as preparing discharge summaries, formulating reports for rounding, and authoring consultation notes. We present the design and implementation of an interactive NLP tool for identifying incidental findings in radiology reports, followed by results from a user study with physicians to evaluate the accuracy and usability of the tool.

Fig. 2 System overview: Physicians (1) review highlights predicted by the system, and (2) provide feedback on them. (3) Once this feedback is used to retrain models, it completes an interactive learning cycle.

Related Work

Several efforts have applied NLP pipelines and machine learning methods to radiology reports.[6] [7] [8] Yetisgen-Yildiz et al[9] demonstrated the use of NLP and supervised machine learning for identifying critical sentences in radiology reports, using an extractive summarization approach, focused on binary classification of sentences. Zech et al also worked on identifying findings in radiology reports with a similar pipeline and linear classification models.[10] Follow-up work by Yetisgen et al[11] noted that because manual annotation is time-consuming and labor-intensive, they could annotate only a small portion of their corpus. Interactive annotation approaches were recommended as a means of addressing this challenge.

Interactive machine learning (IML) is defined as the process of iteratively building machine learning models through end-user input.[12] [13] IML systems require effective displays for presenting outputs and eliciting feedback from users for retraining models.[14] Both Amershi et al[15] and Boukhelifa et al[16] provide summaries of prior work in IML. Interactive methods are particularly appealing in addressing the challenges inherent in developing NLP applications, which are further exacerbated by differences across institutions and clinical subdomains. In the traditional approach, models are built by NLP experts in linguistics and machine learning, while subject matter domain experts who are often the end-users must construct training data through laborious annotation of sample texts. This approach is expensive and inefficient, particularly when language subtleties necessitate multiple iterations through the annotation cycle (as is often the case). For clinical applications, it quickly becomes infeasible to customize models for every specific task and application. RapTAT demonstrated how interactive annotation and preannotated documents can reduce the time required to create an annotated corpus.[17] [18] To simplify the process of building models, LightSIDE[19] and CLAMP[20] provide graphical user interfaces for building customized NLP pipelines. D'Avolio et al[21] described a prototype system combining tools for creating text annotations (Knowtator[22]) and for deriving NLP features (cTAKES[23]), within a common user interface to configure statistical machine learning algorithms. Other efforts provide interfaces for information extraction using rule-based NLP such as regular expressions as well as user-defined grammars and lexicons.[24] [25] Although the majority of these tools focus on supporting different parts of an NLP expert's workflow, they do not address the challenges in designing an end-to-end interactive system for physicians or other domain experts. Our work complements these efforts by focusing not only on customizing individual components of the NLP pipeline, but also on the design of all components required for building a clinically focused closed-loop IML system. Other interactive tools designed for end-user needs have addressed specific NLP tasks including clustering,[26] classification,[27] [28] [29] topic modeling,[30] [31] and word sense disambiguation.[32]

Objectives

Our objective was to develop an intelligent interactive tool to be used by physicians in the trauma care setting for identifying incidental findings. The current workflow for identifying incidental findings at the Trauma Services at University of Pittsburgh Medical Center (UPMC) is a manual process. For each patient, physicians read full-text radiology reports in the patient's electronic medical record (EMR) and synthesize them to fill in different sections of a templated signout note. One of these sections specifically focuses on incidental findings. This process is repeated daily and the signout note is revised whenever a new radiology report is added to the patient's EMR. Typically, resident physicians in the trauma team that include surgery, internal medicine, and radiology are responsible for writing the signout notes. We conducted informal discussions with members of these teams and stakeholders that provided initial validation of the problem and requirements, along with insights and feedback for developing the tool. We built on our previous work[29] to address the challenge of integrating interactive NLP into the clinical workflow. The tool consists of (1) a learning pipeline that builds, applies, and updates an NLP model for identifying incidental findings, and (2) a user interface that enables users review, provide feedback, and understand changes to the NLP model.

NLP Learning Pipeline Requirements

The NLP pipeline should have a sectionizer that is capable of splitting notes into sections and sentences. As in Yetisgen-Yildiz et al,[9] these elements are then subject to a binary classifier predicting whether or not each element discusses incidental findings. The system should incorporate end-user input to revise the models, thus completing an interactive learning cycle capable of predicting useful elements in clinical text.

User Interface Requirements

The user interface should have functionality to help physicians in selecting relevant training examples and in providing labels appropriate for updating the NLP model. The interface should display predictions from the model and allow physicians to give feedback that will be used to revise the model. Visualization and interaction components should support these steps within the interactive learning cycle. These requirements are further itemized as follows:

(i) Review

R1: The user interface should highlight sentences as predicted by the NLP model to be relevant and, where possible, help users understand why a sentence was predicted to describe an incidental finding.

R2: The interface should help users to quickly navigate between documents as well as predictions.

(ii) Feedback

R3: Users should be able to select sentences that should have been highlighted and were missed by the NLP model. Similarly, they should be able to remove incorrect highlights.

R4: The user interface should help minimize user actions and time required for providing feedback.

(iii) Retrain

R5: Feedback provided by users should be displayed as a list of additions and deletions to help users understand changes between model revisions.

Hypothesis

We hypothesize that our tool will enable physicians to build useful NLP models for identifying incidental findings in radiology reports within a closed feedback loop, with no support from NLP experts. We split this into two subhypotheses for efficiency and usability:

H1: The interactive tool will decrease time and effort for physicians for identifying incidental findings.
We compare our IML approach to a simpler interface lacking IML, using measurements of time and effort (in terms of number of user actions) to evaluate how the interactive cycle could facilitate construction of highly accurate models.
H2: The interactive tool will be used by physicians successfully to identify incidental findings with little or no support from NLP experts.
Design of interactive learning systems require that we adopt a human-centered approach for collecting training data and building models. Simple active learning approaches that involve asking a series of questions to human “oracles” can be annoying and frustrating, as noted in Cakmak and Thomaz.[33] The focus in IML is in building tools that align the process of providing feedback with user needs. Thus, we test whether the proposed tool is usable by end-users, that is, physicians, for the task of identifying incidental findings.

Methods

We followed a three-step sequence for design, implementation, and evaluation for our tool. For designing the user interface, we used an iterative process starting with design mockups ([Fig. 3]), followed by implementation and revision phases. We also created a labeled gold standard data set for the user study.

Fig. 3 An early mock-up of the tool. The left side shows the full-text reports and the right sidebar shows suggested incidental findings.

Data and Annotation

We obtained 170,052 radiology reports for trauma patients who were treated by UPMC Trauma Services. Reports were deidentified to remove patient identifiers and identifiers regarding imaging modalities using the DE-ID software from DE-ID Data Corp.[34]

To create an annotated data set, two trauma physicians used a preliminary version of our tool to annotate 4,181 radiology reports (686 encounters, 6.09 ± 4.18 reports per encounter following a power-law distribution) for incidental findings. Annotators focused on two types of incidental findings: lesions suspected to be malignant and arterial aneurysms meeting specified size and location criteria. [Table 1] provides detailed annotation guidelines that were used by the physicians. An initial pilot set of 128 radiology reports was annotated by the two physicians independently, with good interannotator agreement (IAA) of 0.73 measured using Cohen's kappa statistic.[35] Kappa calculations were based on agreement of classification of each sentence as containing an incidental or not. After review and discussions, the annotation guidelines were revised, and a second pilot set of 144 radiology reports was annotated, resulting in a revised IAA of 0.83. Each of the remaining 4,053 reports was annotated by a single physician using the revised annotation scheme.

Table 1
Annotation guidelines: Adapted from Sperry et al[5]
Lesions		Aneurysms
Brain	Any solid lesion	Thoracic aorta	≥ 5 cm
Thyroid	Any lesion	Abdominal aorta	≥ 4 cm
Bone	Any osteolytic or osteoblastic lesion, not age-related	External iliac artery	≥ 3 cm
Breast	Any solid lesion	Common femoral artery	≥ 2 cm
Lung	Any solid lesion (except lymph)	Popliteal artery	≥ 1 cm
Liver	Any heterogeneous lesion
Kidney	Any heterogeneous lesion
Adrenal	Any lesion
Pancreas	Any lesion
Ovary	Any heterogeneous lesion
Bladder	Any lesion
Prostate	Any lesion
Intraperitoneal/Retroperitoneal	Any free lesion

Note: Potentially-malignant lesions and arterial aneurysms greater than a specified size were annotated.

We sampled a subset of encounters from the annotated data set for the user study described in the “Evaluation” section. We restricted the sample to encounters that contained one or more incidental findings and had between 3 and 7 reports. This allowed us to avoid outliers with large numbers of reports to allow for a reasonably consistent review time duration per encounter. Annotators (same physicians) reviewed this smaller sample of 694 reports (130 encounters; 5.36 ± 1.3 reports per encounter; mostly CT and X-ray reports, with a small number of other modalities such as ultrasound, magnetic resonance imaging, fluoroscopy, etc.) again to remove any inconsistencies in labeled gold standard against the annotation guidelines ([Table 1]). This sample with revised annotations was used in the user study.

Learning Pipeline

We extracted individual sentences using the spaCy Python NLP library (https://spacy.io).[36] A sentence was labeled positive if any part of the sentence (or the entire sentence) was selected by the annotators. Sections were extracted after applying regular expressions to identify section headings. A section was marked positive if it contained one or more sentences with incidental findings. Similarly, a report was marked positive if it contained one or more sentences with incidental findings. [Table 2] shows the distribution of incidental findings across sentences, sections, and reports.

Table 2
Distribution of positives at sentence, section, and report levels in the evaluation data set
	Total	Positives
Reports	694	164 (23.6%)
Sections	6,046	302 (5.0%)
Sentences	20,738	369 (1.8%)

Note: Positives denote the raw count of sentences, sections, or reports containing one or more incidental findings.

We used a simple NLP pipeline with linear-kernel support vector machine (SVM) using bag-of-words feature sets. We built separate models to classify reports, sections, and sentences, respectively. Earlier results suggest that this approach performed competitively with other sophisticated methods for classifying relevant sentences in radiology reports.[9] [10] We used the “rationale model” proposed by Zaidan and Eisner[37] for implementing IML with user feedback. Specifically, when the user identified a span of text as an incidental finding, we constructed similar synthetic text as additional training data. Using a simple classification model allowed us to focus the discussion in this article on the design of the overall system. We performed a detailed exploration into classifier modeling techniques for identifying incidental findings, as described elsewhere.[38]

User Interface

The user interface of the tool is shown in [Figs. 1] and [4]. A video demonstration is available at http://vimeo.com/trivedigaurav/incidentals. In the following sections, we describe the components of the interactive feedback loop in detail.

Fig. 4 (A) Users can add feedback by highlighting a span of text and triggering the contextual menu with a right-click. (B) By right clicking on the background, without any selected text span, users can add or remove an entire sentence, report, or encounter.

Review

The tool presents all the radiology reports from a single patient encounter, in a continuous scrolling view. A timeline view on the top indicates the number of reports associated with the encounter and provides shortcuts to individual reports. Reports are broken into individual sections and sentences, which are marked by yellow highlights when predicted to contain incidental findings ([Fig. 1] [(1)]). Varying saturation levels to draw attention to predicted incidental findings: reports with predicted incidental findings are lightly colored in yellow, followed by a darker background for sections which contains the highlighted sentence. The miniview on the right displays an overview of the full encounter ([Fig. 1] [(2)]) and helps the user navigate quickly between the reports by serving as an alternate scroll bar. A list of terms relevant for identifying incidental findings includes terms such as nodule, aneurysm, incidental, etc. ([Fig. 1] [(3)]). These terms are highlighted in pink in the main document and in the miniview. Users have an option to add or remove their own terms. Incidental findings are also listed in the suggestions box on the right along with a short excerpt ([Fig. 1] [(4)]). The user can click on these excerpts to scroll to the appropriate position in the full-text report.

Feedback

To revise models, users right-click on selected text spans to launch a feedback menu enabling addition, removal, or confirmation of predicted incidental findings ([Fig. 4 (a)]). Individual sections or sentences can be selected through a single right-click (no span selection required, [Fig. 4 (b)]). The user also has an option to specify incidental findings at the sentence, section, report, or encounter levels individually. A checked box indicates the presence of an incidental finding. Hierarchical rules are automatically applied as the user provides feedback: if the sentence is marked as an incidental then all the upper levels are also checked. A similar user action is needed to remove incorrectly predicted findings as well. The appropriate interpretation of a feedback action is inferred from the context. For example, if the only predicted sentence is removed from a section, then both the sentence as well as the section containing it are unhighlighted. Text items against which feedback is provided are bolded and underlined ([Fig. 1 (a)] and [(b)]). If a user reads through a report and makes no change to predicted incidentals findings ([Fig. 1 (c)]), the initial labels are assumed to be correct and added as implicit feedback.

Retrain

A list of all current feedback is provided on the bottom panel of the right sidebar ([Fig. 1] [(5)]), which shows a short excerpt from each selected text span. If a user removes highlighted incidental findings, these are also listed in the sidebar and are denoted by a strike through. Clicking on these items in the feedback list scrolls the full-text note to appropriate location. The “x”-button allows the users to undo feedback actions and remove them from the feedback list. Switching to different patient encounter triggers model retraining. Once the retraining is complete, the new predictions are highlighted. The refresh button can also be used to manually retrain and refresh predictions.

Fig. 5 Reports: Change in F1 scores over time at report level. Colored points represent individual participants. The gray band marks the average score and tapers off in thickness to represent the decreasing number of participants completing higher numbers of revisions.

Implementation

Our tool was implemented as a Web application using the AngularJS (angularjs.org) framework. The learning pipeline was implemented as Falcon (a Web application programming interface framework for Python; falconframework.org) Web service layer. Preprocessing steps such as sentence segmentation were performed using spaCy (spacy.io),[36] with a MongoDB (mongodb.com) NoSQL database used to store preprocessed text along with full-text reports. This architecture allowed us to perform quick retraining on the fly without any delays that were noticeable to the users. SVM models were built using the Python scikit-learn machine learning library (scikit-learn.org/ [39]). The software and the source code are available at https://github.com/trivedigaurav/lemr-vis-web/ and https://github.com/trivedigaurav/lemr-nlp-server/.

Evaluation

IML systems require evaluation from two different perspectives[16] [40]: model performance and system usability. Thus, our evaluation maps to the two subhypotheses discussed in the “Objectives” section (H1: efficiency and model accuracy and H2: tool usability).

We recruited 15 physicians as participants with experience in reading radiology reports and identifying incidental findings to participate in the evaluation study.[41] Participants were given a $50 gift card as compensation for participating. Study sessions were conducted via Web conferencing. At the start of each study session, we collected background information about the participant including their clinical experience and the extent of their knowledge and experience with NLP tools. We then introduced the annotation guidelines and allowed the participant to seek clarifications. While we presented the guidelines as shown in [Table 1], participants were asked to select incidentals without specifying any categories. They were allowed to ask questions about the guidelines throughout the study. After a short demonstration of the tool, the participant conducted a trial run of the tool before reviewing the study encounters.

Predictive models were bootstrapped by training an initial model on a set of 6 patient encounters with gold standard labels. Each encounter included 3 to 7 radiology reports. The encounters were divided into control and intervention (experimental) conditions. Participants were asked to review radiology reports and identify all incidental findings in both these conditions. User feedback was saved and used to revise models. However, highlights predicting incidental findings were shown only in the experimental condition. For control encounters, no incidental findings were highlighted for the participants to review, but all other features of the tool were provided. Thus, the control encounters simulated the approach used in current annotation tools and current practice for documenting incidental findings. Each participant was presented with intervention encounters that were interleaved with control encounters. We asked participants to review as many encounters as possible within 60 minutes. We logged time spent on each encounter along with participant interactions with the tool.

At the end of the user study, each participant completed a poststudy questionnaire about their experience, including prompts intended to encourage feedback on individual design components of the tool.

Evaluation of Model Performance

We evaluated efficiency and model accuracy through a combination of intrinsic and extrinsic approaches.

Intrinsic evaluation: We compared predictions from the models built by the participants with human-annotated gold standard data, using F1, precision, and recall metrics. Two-thirds of the data set with 130 patient encounters (694 reports; “Data and annotation” section) was used for review during the study and the remaining third was held out for testing. We maintained similar distribution of positive incidental findings for the review and test data sets at all three levels. We used the same test and train split for all participants to allow comparison of final results.
Extrinsic evaluation: We measured time spent per encounter, as well as the total number of user-actions in the intervention and control conditions. Since each participant was presented the intervention and control encounters in an interleaved manner, we obtained a total of 15 paired samples. We ignored each participant's first encounters in both control and intervention conditions from the timing calculations to minimize learning effects. We found that most participants were able to clarify any questions or concerns about the interface after the trial run and the first two encounters.

Usability Evaluation

To assess the overall usability and usefulness of the tool, we performed a System Usability Scale (SUS)[42] evaluation along with semistructured interviews. SUS offers a quick and reliable measure for overall usability, asking 10 questions with 5-point Likert scale responses, which are used to compute an overall score from 0 to 100. We also recorded subjective feedback about individual components of the tool.

Results

Participants

Study participants were physicians with training in critical care, internal medicine, or radiology ([Table 3]). All participants had experience in identifying incidental findings during their clinical training, practice, and/or research.

Table 3
Study participants: Summary of participants' responses from the prestudy questionnaire
Participant	Position	Years in position	Area	Role	Experience with NLP?
p1	Physician	< 5	Pediatric emergency medicine	Clinician	No
p2	Resident	< 5	General surgery	Clinician, researcher	No; Involved in a past project
p3	Resident	< 5	Radiology	Clinician	No; But familiar
p4	Resident	< 5	Radiology	Clinician	No
p5	Resident	< 5	Neuroradiology	Clinician, researcher	No
p6	Resident	< 5	Radiology	Clinician	No
p7	Resident	< 5	Internal medicine	Clinician	No
p8	Doctoral fellow	< 5	Biomedical informatics	Researcher	No
p9	Assistant professor	< 5	Internal medicine	Clinician	No
p10	Resident	5–10	General surgery	Clinician	No
p11	Resident	5–10	Critical care	Clinician	No
p12	Research staff	< 5	Biomedical informatics	Clinician, researcher	No
p13	Senior research scientist	10+	Biomedical informatics	Researcher	No
p14	Assistant professor	10+	Internal medicine	Clinician	No
p15	Resident	< 5	General surgery	Clinician	No

Abbreviation: NLP, natural language processing.

Model Performance

Physicians reviewed between 12 and 37 encounters (mean = 29.33 ± 6.3) in our user study. The changes in F1 scores on the test data set (relative to the gold standard labels) at each revision are shown in [Figs. 5]–[7]. Comparing the F1 scores of the initial models with the final models that were derived from participant feedback in the hour-long session, we observed an increase in the F1 score from 0.22 to 0.50 to 0.68 (mean = 0.60 ± 0.04) for sentences, from 0.32 to 0.57 to 0.73 (mean = 0.68 ± 0.04) for sections, and from 0.31 to 0.70 to 0.79 (mean = 0.75 ± 0.03) for reports. [Table 4] shows precision, recall, and F1 scores for initial and final models. Precision, recall, and F1 scores for models built by each participant are shown in [Supplementary Table S1] (available in the online version).

Table 4
Final scores: Precision (P), recall (R), and F1 scores at initial and final model revisions aggregated over 15 participants
	Initial					Final
	P	R	F1	P		R		F1
				Range	Mean	Range	Mean	Range	Mean
Reports	0.90	0.19	0.31	[0.67, 0.90]	0.77 ± 0.06	[0.62, 0.81]	0.72 ± 0.05	[0.70, 0.79]	0.75 ± 0.03
Sections	0.86	0.20	0.32	[0.73, 0.86]	0.79 ± 0.04	[0.45, 0.68]	0.60 ± 0.07	[0.57, 0.73]	0.68 ± 0.04
Sentences	0.84	0.13	0.22	[0.75, 0.88]	0.80 ± 0.04	[0.36, 0.62]	0.48 ± 0.06	[0.50, 0.68]	0.60 ± 0.04

Note: The initial model was trained on the same six encounters to bootstrap the learning cycle.

Fig. 6 Sections: Change in F1 scores over time at the section level. The colored points represent individual participants. The gray band marks the average score and tapers off in thickness to represent the decreasing number of participants completing higher numbers of revisions.

Agreement of feedback labels with gold standard labels ranged from Cohen's κ of 0.74 to 0.91 (mean = 0.82 ± 0.05) for sentences, 0.84 to 0.96 (mean = 0.90 ± 0.04) for sections, and 0.76 to 0.95 (mean = 0.88 ± 0.05) for reports.

We observed statistically significant lower time in intervention encounters compared with control encounters (mean time: 134.38 vs. 148.44 seconds; Wilcoxon, Z = 10.0, p < 0.05). The average time spent per encounter for each participant is shown in [Fig. 8].

Fig. 7 Sentences: Change in F1 scores over time at the sentence level. The colored points represent individual participants. The gray band marks the average score and tapers off in thickness to represent the decreasing number of participants completing higher numbers of revisions.

Fig. 8 Average time spent in seconds in control and intervention conditions. Dots represent individual participants. We observed statistically significant lower time in intervention versus control conditions (mean time: 134.38 vs. 148.44 seconds; Wilcoxon, Z = 10, p < 0.05). One participant spent much longer time per encounter than others and can be seen as an outlier in both the conditions.

Comparing the total number of feedback actions, we observed statistically significant lower counts of feedback actions in intervention encounters compared with control encounters (average counts: 42.00 vs. 55.07; Wilcoxon, Z = 13.5, p < 0.05) (see [Fig. 9]).

Fig. 9 Average feedback counts in control and intervention conditions. Dots represent individual participants. We observed statistically significant lower counts in intervention versus control conditions (average counts: 42 vs. 55.07; Wilcoxon, Z = 13.5, p < 0.05).

We found no statistically significant differences between final F1 scores or agreement with gold standard labels between intervention and control encounters at any level ([Fig. 10]).

Fig. 10 We found no statistical differences between final F1 scores or agreement with gold standard labels between control and intervention conditions at any level. (A) F1 scores. (B) Kappa scores.

Usability Results

SUS scores averaged 78.67 ( ± 9.44) out of 100. A SUS score of 68 is considered as average usability performance.[43] [Table 5] shows a break-up of scores received from individual participants.

Table 5
System Usability Scale (SUS): Columns Q1–Q10 represent the user assessment score against each question
Participant	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	SUS
p1	10	10	10	10	10	10	10	10	7.5	7.5	95
p2	10	7.5	10	2.5	7.5	10	7.5	10	7.5	7.5	80
p3	7.5	7.5	7.5	7.5	2.5	7.5	10	5	5	2.5	62.5
p4	7.5	7.5	5	2.5	7.5	7.5	7.5	7.5	7.5	7.5	67.5
p5	10	10	10	0	10	10	10	10	10	7.5	87.5
p6	7.5	7.5	7.5	10	7.5	7.5	7.5	7.5	7.5	7.5	77.5
p7	7.5	7.5	7.5	7.5	7.5	7.5	7.5	7.5	7.5	7.5	75
p8	5	7.5	7.5	7.5	7.5	7.5	10	7.5	5	5	70
p9	7.5	7.5	7.5	5	7.5	7.5	7.5	7.5	7.5	7.5	72.5
p10	7.5	7.5	10	10	7.5	10	10	7.5	7.5	7.5	85
Mean (SD)	8.17 ( ± 1.48)	7.83 ( ± 1.29)	8.17 ( ± 1.76)	6.67 ( ± 3.09)	7.83 ( ± 1.86)	8.17 ( ± 1.48)	8.67 ( ± 1.29)	8.17 ( ± 1.48)	7.67 ( ± 1.76)	7.33 ( ± 2)	78.67 ( ± 9.44)
List of questions from the SUS questionnaire: Q1. I think that I would like to use this system frequently
Q2. I found this system unnecessarily complex
Q3. I thought the system was easy to use
Q4. I think that I would need the support of a technical person to be able to use this system
Q5. I found the various functions in this system were well integrated
Q6. I thought there was too much inconsistency in this system
Q7. I would imagine that most people would learn to use this system very quickly
Q8. I found the system very cumbersome to use
Q9. I felt very confident in using the system
Q10. I needed to learn a lot of things before I could get going with this system

Abbreviation: SD, standard deviation.

Note: The scores are scaled and normalized from a response on the 5-point Likert scale to a 0–10 range (higher scores are better). Overall SUS scores are computed by summing these columns.

Open-ended subjective feedback revealed no major usability problems. One participant described the tool as being “intuitive and easy to use after initial training.” Overall, the idea for highlighting incidental findings was well received:

“In my personal practice, I have missed out on incidental findings [on occasion] ... if we are able to highlight them, it would be very helpful.”

“It's useful to verify that I didn't miss anything.”