A Database De-identification Framework to Enable Direct Queries on Medical Data for Secondary Use

B. S. Erdal; J. Liu; J. Ding; J. Chen; C. B. Marsh; J. Kamal; B. D. Clymer

doi:10.3414/ME11-01-0048

Methods of Information in Medicine

Methods Inf Med 2012; 51(03): 229-241
Original Articles

Schattauer GmbH

Authors

B. S. Erdal¹,², J. Liu¹, J. Ding¹, J. Chen¹, C. B. Marsh³, J. Kamal¹, B. D. Clymer²

¹Information Warehouse, The Ohio State University Medical Center, Columbus, Ohio, USA
²Electrical and Computer Engineering, The Ohio State University, Columbus, Ohio, USA
³Internal Medicine, The Ohio State University Medical Center, Columbus, Ohio, USA

Abstract

Summary

Objective: For research use of patient clinical records to qualify as non-human-subject research, electronic medical record data must be de-identified so that the risk of exposing protected health information is minimal. This study demonstrated a robust framework for de-identifying structured data that can be applied to any relational data source.

Methods: Using a real-world clinical data warehouse, a pilot implementation covering a limited set of subject areas was used to demonstrate and evaluate the new de-identification process. Query results and performance were compared between the source and target systems to validate data accuracy and usability.

Results: The combination of hashing, pseudonyms, and a session-dependent randomizer provides a rigorous de-identification framework that guards against 1) exposure of source identifiers; 2) manual linking to source identifiers by internal data analysts; and 3) cross-linking of identifiers among different researchers or across multiple query sessions by the same researcher. In addition, a query rejection option refuses queries that return fewer than preset numbers of subjects and total records, which prevents users from accidentally identifying subjects because of a low volume of data.
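The layering described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the keyed SHA-256 hash, the sequential pseudonym table, the per-session key, the thresholds, the rule that either threshold alone triggers rejection, and all names (`hash_identifier`, `PseudonymTable`, `QuerySession`, `release`) are assumptions chosen only to show how the three layers and the query rejection option might fit together.

```python
import hashlib
import hmac
import secrets

# Hypothetical thresholds; the abstract only says "preset numbers".
MIN_SUBJECTS = 10
MIN_RECORDS = 20


def hash_identifier(source_id: str, site_key: bytes) -> str:
    """Layer 1 (assumed): keyed hash, so the source identifier is never
    carried into the de-identified target database."""
    return hmac.new(site_key, source_id.encode("utf-8"), hashlib.sha256).hexdigest()


class PseudonymTable:
    """Layer 2 (assumed): replace each hash with a sequential pseudonym that
    carries no information about the hash value."""

    def __init__(self) -> None:
        self._by_hash: dict[str, int] = {}

    def pseudonym(self, hashed_id: str) -> int:
        if hashed_id not in self._by_hash:
            self._by_hash[hashed_id] = len(self._by_hash) + 1
        return self._by_hash[hashed_id]


class QuerySession:
    """Layer 3 (assumed): a fresh random key per query session, so the
    identifiers shown in one session cannot be joined with another's."""

    def __init__(self) -> None:
        self._session_key = secrets.token_bytes(32)

    def session_id(self, pseudonym: int) -> str:
        digest = hmac.new(
            self._session_key, str(pseudonym).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        return digest[:16]


class QueryRejected(Exception):
    """Raised when a result set is too small to release."""


def release(rows: list[tuple[int, str]], session: QuerySession) -> list[tuple[str, str]]:
    """Apply the rejection rule, then swap pseudonyms for session identifiers.

    rows holds (pseudonym, payload) pairs. Rejecting when either count falls
    below its threshold is an assumption of this sketch.
    """
    subjects = {pseudonym for pseudonym, _ in rows}
    if len(subjects) < MIN_SUBJECTS or len(rows) < MIN_RECORDS:
        raise QueryRejected("result set below the preset subject or record count")
    return [(session.session_id(pseudonym), payload) for pseudonym, payload in rows]
```

In this sketch the same subject keeps one identifier within a session, so records can still be grouped by patient, while a second `QuerySession` yields unrelated identifiers for the same subject.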

This framework does not prevent subject re-identification based on prior knowledge and the sequence of events. It also does not address de-identification of medical free text, although text de-identification using natural language processing can be added because of the framework's modular design.

Conclusion: We demonstrated a framework that produces HIPAA-compliant databases that researchers can query directly. This technique can be augmented to facilitate inter-institutional research data sharing through existing middleware such as caGrid.

Keywords

De-identification - data warehouse
