Summary
Objective: For research on patient clinical records to qualify as non-human-subject
research, electronic medical record data must be de-identified so that the risk of
exposing protected health information is minimal. This study demonstrates a robust
framework for de-identifying structured data that can be applied to any relational
data source.
Methods: Using a real-world clinical data warehouse, a pilot implementation covering a limited
set of subject areas was used to demonstrate and evaluate this new de-identification
process. Query results and performance were compared between the source and target
systems to validate data accuracy and usability.
Results: The combination of hashing, pseudonyms, and a session-dependent randomizer provides
a rigorous de-identification framework that guards against 1) exposure of source identifiers;
2) manual linking to source identifiers by internal data analysts; and 3) cross-linking of
identifiers among different researchers or across multiple query sessions by the same researcher.
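One way the three layers could fit together is sketched below. It is an illustration only, not the implementation evaluated in this study: the names SITE_KEY, PseudonymTable, and QuerySession are hypothetical, and the choice of HMAC-SHA256 with per-session random keys is an assumption.

```python
import hashlib
import hmac
import secrets

# Hypothetical site-wide secret (an assumption); it would be kept away from analysts.
SITE_KEY = secrets.token_bytes(32)


def hash_identifier(source_id: str) -> str:
    """Keyed one-way hash, so the source identifier itself is never stored."""
    return hmac.new(SITE_KEY, source_id.encode("utf-8"), hashlib.sha256).hexdigest()


class PseudonymTable:
    """Maps each hashed identifier to a random pseudonym unrelated to its value."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def pseudonym_for(self, hashed_id: str) -> str:
        if hashed_id not in self._by_hash:
            self._by_hash[hashed_id] = secrets.token_hex(8)
        return self._by_hash[hashed_id]


class QuerySession:
    """Re-randomizes pseudonyms with a key that exists only for this session."""

    def __init__(self, table: PseudonymTable) -> None:
        self._table = table
        self._session_key = secrets.token_bytes(32)

    def session_id_for(self, source_id: str) -> str:
        pseudonym = self._table.pseudonym_for(hash_identifier(source_id))
        digest = hmac.new(self._session_key, pseudonym.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]
```

In this sketch, identifiers stay consistent within one session, so joins across tables still work, while two sessions over the same subject yield unrelated identifiers.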
In addition, a query rejection option refuses queries that return fewer than preset
numbers of subjects and total records, preventing accidental subject identification
caused by a low volume of data.
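A minimal sketch of such a rejection check follows. The function name, the threshold values, and the subject_id column are hypothetical, and the sketch assumes a query is refused when either count falls below its threshold.

```python
class QueryRejected(Exception):
    """Raised when a result set is too small to release safely."""


def enforce_minimum_volume(rows, min_subjects=10, min_records=20, subject_key="subject_id"):
    """Return the rows only if they cover enough distinct subjects and records.

    The default thresholds are illustrative assumptions, not values from the study.
    """
    rows = list(rows)
    distinct_subjects = {row[subject_key] for row in rows}
    if len(distinct_subjects) < min_subjects or len(rows) < min_records:
        # The message omits the actual counts so the rejection itself leaks nothing.
        raise QueryRejected("result set is below the minimum size for release")
    return rows
```

A caller would wrap every query result in this check before returning it to the researcher, treating QueryRejected as a refusal rather than an error.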
This framework does not prevent subject re-identification based on prior knowledge
and sequence of events. It also does not address de-identification of medical free text,
although text de-identification using natural language processing can be added
owing to the framework's modular design.
Conclusion: We demonstrated a framework that produces HIPAA-compliant databases that researchers
can query directly. This technique can be extended to facilitate inter-institutional
research data sharing through existing middleware such as caGrid.
Keywords
De-identification - data warehouse