Scaling-up NLP Pipelines to Process Large Corpora of Clinical Notes

G. Divita; M. Carter; A. Redd; Q. Zeng; K. Gupta; B. Trautner; M. Samore; A. Gundlapalli

doi:10.3414/ME14-02-0018


Methods Inf Med 2015; 54(06): 548-552

Focus Theme – Original Articles

Schattauer GmbH

Authors

G. Divita¹, M. Carter¹, A. Redd¹, Q. Zeng¹, K. Gupta², B. Trautner³, M. Samore¹, A. Gundlapalli¹

¹VA Salt Lake City Health Care System and University of Utah School of Medicine, Salt Lake City, Utah, USA
²VA Boston Healthcare System and Boston University School of Medicine, Boston, Massachusetts, USA
³VA Houston Health Care System and Baylor College of Medicine, Department of Medicine, Houston, Texas, USA

Further Information

Publication History

received: 03 October 2014

accepted: 03 September 2015

Publication Date: 23 January 2018 (online)

Summary

Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on “Big Data and Analytics in Healthcare”.

Objectives: This paper describes the scale-up efforts at the VA Salt Lake City Health Care System to process large corpora of clinical notes through a natural language processing (NLP) pipeline. The use case is a current project focused on detecting the presence of an indwelling urinary catheter in hospitalized patients and subsequent catheter-associated urinary tract infections.

Methods: An NLP algorithm using v3NLP was developed to detect the presence of an indwelling urinary catheter in hospitalized patients. The algorithm was tested on a small corpus of notes from patients for whom the presence or absence of a catheter was already known (reference standard). In planning for a scale-up, we estimated that the original algorithm would have taken 2.4 days to run on the larger corpus of notes for this project (550,000 notes), and 27 days on a corpus of 6 million records representative of a national sample of notes. We approached scaling up NLP pipelines through three techniques: pipeline replication via multi-threading, intra-annotator threading for tasks that can be further decomposed, and remote annotator services that enable annotator scale-out.
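
The first of the three techniques, pipeline replication via multi-threading, can be illustrated with a minimal sketch. It is not the v3NLP or UIMA API: `CatheterPipeline`, `annotate`, and `run_replicated` are hypothetical names, and the single regular expression stands in for a full annotator chain. The sketch assumes each worker thread owns its own pipeline replica and that notes are independent of one another. It shows the structure only; in Python, CPU-bound threads do not run in parallel, so it makes no claim about speed-up.

```python
import re
import threading
from concurrent.futures import ThreadPoolExecutor


class CatheterPipeline:
    """Hypothetical stand-in for a complete annotator pipeline."""

    # Toy pattern for illustration; a real pipeline would also handle
    # negation, sections, and temporal context.
    PATTERN = re.compile(
        r"\b(foley|indwelling (urinary )?catheter)\b", re.IGNORECASE
    )

    def process(self, note: str) -> bool:
        """Return True if the note mentions an indwelling catheter."""
        return bool(self.PATTERN.search(note))


# Thread-local storage holds one pipeline replica per worker thread.
_local = threading.local()


def _pipeline() -> CatheterPipeline:
    # Build the replica lazily on first use in each thread,
    # so no pipeline state is shared between threads.
    if not hasattr(_local, "pipeline"):
        _local.pipeline = CatheterPipeline()
    return _local.pipeline


def annotate(note: str) -> bool:
    return _pipeline().process(note)


def run_replicated(notes, replicas=4):
    """Process notes with `replicas` worker threads, one pipeline each."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        # Executor.map returns results in the order of the input notes.
        return list(pool.map(annotate, notes))


notes = [
    "Foley catheter placed in ED.",
    "Patient ambulating, no complaints.",
    "Indwelling urinary catheter removed on day 3.",
]
results = run_replicated(notes)
```

Because the replicas share nothing, the number of replicas can be tuned to the available cores without changing the annotators themselves. Intra-annotator threading and remote annotator services are not covered by this sketch.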

Results: The scale-up reduced the average time to process a record from 206 milliseconds to 17 milliseconds, a 12-fold increase in performance, when applied to a corpus of 550,000 notes.
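
As a rough check on these figures, the short calculation below derives the speed-up factor and the total processing time implied by the reported per-record averages. It assumes total time is simply the record count multiplied by the average time per record; the function name `total_hours` is illustrative, and the results are separate from the planning estimates given in the Methods.

```python
def total_hours(records: int, ms_per_record: float) -> float:
    """Total processing time in hours at a given average time per record."""
    return records * ms_per_record / 1000 / 3600


RECORDS = 550_000
BEFORE_MS = 206  # average per-record time before the scale-up
AFTER_MS = 17    # average per-record time after the scale-up

speedup = BEFORE_MS / AFTER_MS
hours_before = total_hours(RECORDS, BEFORE_MS)
hours_after = total_hours(RECORDS, AFTER_MS)

print(f"speed-up: {speedup:.1f}x")
print(f"implied total before: {hours_before:.1f} h")
print(f"implied total after:  {hours_after:.1f} h")
```

Under that assumption, the averages correspond to about 31.5 hours before and about 2.6 hours after the scale-up for the 550,000-note corpus.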

Conclusions: Purposely simplistic in nature, these scale-up efforts are the straightforward evolution from small-scale NLP processing to larger-scale extraction without incurring the associated complexities inherited from the use of the underlying UIMA framework. These efforts represent generalizable and widely applicable techniques that will aid other computationally complex NLP pipelines that need to be scaled out for processing and analyzing big data.

Keywords

Natural language processing - big data - scale-up

