Appl Clin Inform 2021; 12(02): 245-250
DOI: 10.1055/s-0041-1726103
Research Article

Searching the PDF Haystack: Automated Knowledge Discovery in Scanned EHR Documents

Alexander L. Kostrinsky-Thomas
1   College of Osteopathic Medicine, Pacific Northwest University of Health Sciences, 200 University Pkwy Yakima, Washington, United States
Fuki M. Hisama
2   Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, Washington, United States
Thomas H. Payne
3   Department of Medicine, University of Washington School of Medicine, Seattle, Washington, United States


Background Clinicians express concern that they may be unaware of important information in the voluminous scanned and other outside documents stored in electronic health records (EHRs). An example is “unrecognized EHR risk factor information,” defined as risk factors for heritable cancer that exist within a patient's EHR but are not known by current treating providers. In a related study using manual EHR chart review, we found that half of the women whose EHR contained risk factor information met criteria for further genetic risk evaluation for heritable forms of breast and ovarian cancer. These women had not been referred for genetic counseling.

Objectives The purpose of this study was to compare automated methods (optical character recognition with natural language processing) with human review in their ability to identify risk factors for heritable breast and ovarian cancer within EHR scanned documents.

Methods We evaluated the accuracy of automated chart review by comparing our criterion standard (physician chart review) with an automated method that combined Amazon's Textract service (Amazon Web Services, Seattle, Washington, United States), the Clinical Language Annotation, Modeling, and Processing toolkit (CLAMP) (Center for Computational Biomedicine at The University of Texas Health Science Center at Houston, Houston, Texas, United States), and a custom-written Java application.
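To illustrate the kind of step that can follow optical character recognition in such a pipeline, the sketch below scans OCR output for risk factor mentions and skips mentions preceded by a simple negation cue. It is a minimal sketch and not the authors' application: the class name `RiskFactorScan`, the term list, the negation cues, and the 60-character look-back window are all assumptions made for illustration, and the code calls neither Textract nor CLAMP.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flags risk factor terms in plain text produced by OCR.
class RiskFactorScan {
    // Illustrative term list only (assumption); not the vocabulary used in the study.
    static final Map<String, Pattern> TERMS = new LinkedHashMap<>();
    static {
        TERMS.put("breast cancer", Pattern.compile("\\bbreast\\s+(cancer|carcinoma)\\b"));
        TERMS.put("ovarian cancer", Pattern.compile("\\bovarian\\s+(cancer|carcinoma)\\b"));
        TERMS.put("BRCA", Pattern.compile("\\bbrca\\s*[12]?\\b"));
    }

    // Crude negation cues (assumption): the cue must reach the mention
    // without crossing a period or semicolon, within 40 characters.
    static final Pattern NEGATION =
        Pattern.compile("\\b(no|denies|negative for|without)\\b[^.;]{0,40}$");

    // Returns the labels of terms with at least one non-negated mention.
    static List<String> scan(String ocrText) {
        // Normalize case and collapse the irregular whitespace common in OCR output.
        String text = ocrText.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : TERMS.entrySet()) {
            Matcher m = entry.getValue().matcher(text);
            while (m.find()) {
                // Look back up to 60 characters for a negation cue.
                String before = text.substring(Math.max(0, m.start() - 60), m.start());
                if (!NEGATION.matcher(before).find()) {
                    hits.add(entry.getKey());
                    break; // one affirmed mention per term is enough
                }
            }
        }
        return hits;
    }
}
```

Calling `RiskFactorScan.scan(text)` on the text of one scanned page returns the labels with at least one affirmed mention, which could then be compared document by document against a physician reviewer's findings. A rule-based check like this does not address handwriting, tables, or colloquial phrasing.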

Results We found that automated methods identified most cancer risk factor information that would otherwise require manual review by a clinician and is therefore at risk of being missed.

Conclusion The use of automated methods for identification of heritable risk factors within EHRs may provide an accurate yet rapid review of patients' past medical histories. These methods could be further strengthened via improved analysis of handwritten notes, tables, and colloquial phrases.

Protection of Human and Animal Subjects

This project was approved by the University of Washington Institutional Review Board.

Publication History

Received: 06 December 2020

Accepted: 01 February 2021

Article published online: 24 March 2021

© 2021. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
