Methods Inf Med 2012; 51(03): 229-241
DOI: 10.3414/ME11-01-0048
Original Articles
Schattauer GmbH

A Database De-identification Framework to Enable Direct Queries on Medical Data for Secondary Use

B. S. Erdal (1, 2), J. Liu (1), J. Ding (1), J. Chen (1), C. B. Marsh (3), J. Kamal (1), B. D. Clymer (2)

1  Information Warehouse, The Ohio State University Medical Center, Columbus, Ohio, USA
2  Electrical and Computer Engineering, The Ohio State University, Columbus, Ohio, USA
3  Internal Medicine, The Ohio State University Medical Center, Columbus, Ohio, USA

Publication History

received: 31 May 2011

accepted: 08 February 2011

Publication Date:
20 January 2018 (online)

Summary

Objective: For patient clinical records to qualify as non-human-subject data for research purposes, electronic medical record data must be de-identified so that the risk of exposing protected health information is minimal. This study demonstrated a robust framework for de-identifying structured data that can be applied to any relational data source.

Methods: A pilot implementation covering a limited set of subject areas in a real-world clinical data warehouse was used to demonstrate and evaluate the new de-identification process. Query results and performance were compared between the source and target systems to validate data accuracy and usability.

Results: The combination of hashing, pseudonyms, and a session-dependent randomizer provides a rigorous de-identification framework that guards against 1) exposure of source identifiers; 2) manual linking to source identifiers by internal data analysts; and 3) cross-linking of identifiers among different researchers or across multiple query sessions by the same researcher. In addition, a query rejection option refuses queries that return fewer than preset numbers of subjects and total records, which prevents accidental subject identification due to a low volume of data.

This framework does not prevent subject re-identification based on prior knowledge and sequences of events. It also does not address de-identification of medical free text, although text de-identification using natural language processing can be incorporated owing to the framework's modular design.
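
As a rough illustration of how hashing, a session-dependent randomizer, and query rejection might fit together, consider the following Python sketch. It is not the authors' implementation: the key `SITE_KEY`, the function names, the choice of HMAC-SHA-256, and the thresholds `min_subjects` and `min_records` are all assumptions made for illustration, and the pseudonym lookup step is omitted.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret held only by the data warehouse (assumption).
SITE_KEY = b"example-site-key"


def hashed_id(source_id: str, key: bytes = SITE_KEY) -> str:
    """Keyed hash of a source identifier, so the raw value is never exposed."""
    return hmac.new(key, source_id.encode("utf-8"), hashlib.sha256).hexdigest()


def new_session_salt() -> bytes:
    """Fresh random value drawn once per query session."""
    return secrets.token_bytes(16)


def session_pseudonym(source_id: str, session_salt: bytes,
                      key: bytes = SITE_KEY) -> str:
    """Identifier that is stable within one session but differs across sessions.

    Because the salt changes per session, identifiers returned to two
    researchers (or to one researcher in two sessions) cannot be joined.
    """
    stable = hashed_id(source_id, key)
    digest = hmac.new(session_salt, stable.encode("ascii"), hashlib.sha256)
    return digest.hexdigest()[:16]


def release_or_reject(rows, min_subjects=10, min_records=20):
    """Return rows only if both preset thresholds are met, else None.

    Each row is assumed to be a dict with a "subject" key; the threshold
    values are illustrative, not taken from the article.
    """
    subjects = {row["subject"] for row in rows}
    if len(subjects) < min_subjects or len(rows) < min_records:
        return None
    return rows
```

In this sketch, calling `session_pseudonym` twice with the same salt yields the same value, so records for one subject can still be grouped within a session, while a new salt from `new_session_salt` yields unrelated values.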

Conclusion: We demonstrated a framework that produces HIPAA-compliant databases that researchers can query directly. This technique can be augmented to facilitate inter-institutional research data sharing through existing middleware such as caGrid.