Abstract
A protocol is proposed to allow linkage of anonymous medical information within the
framework of epidemiological follow-up studies. The protocol comprises two steps.
The first is the irreversible transformation of identification data by a one-way
hash function, applied after spelling processing. To avoid dictionary
attacks, two large random files of keys, called pads, are introduced. The second step
consists of linking the anonymized files. The weight given to each linkage
field is estimated by a mixture model whose likelihood is maximized with
the Expectation-Maximization (EM) algorithm. The performance of this method has
been assessed by comparing record linkage based exclusively on the automatic
procedure with a manual linkage performed by the Burgundy Registry of Digestive Cancers.
Linking a file of 2,847 cancers with a file of 388,614 hospital stays
at the Dijon university hospital yielded a sensitivity of 97% and a specificity
of 93%.
Keywords
Epidemiological Survey - Record Linkage - Non-reversible Encryption - Hash-coding
- Security
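
The first step summarized in the abstract (spelling processing, then one-way hashing protected by two pads) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not say which hash function is used or how the two pads enter the computation. The sketch swaps in HMAC-SHA-256 as the keyed one-way function, applies it once per pad, and uses a crude normalization as a stand-in for the spelling processing. The names normalize, anonymize, pad1 and pad2 are hypothetical.

```python
import hashlib
import hmac
import secrets
import unicodedata


def normalize(text: str) -> str:
    """Crude stand-in for spelling processing (assumption):
    strip accents, drop everything but letters and digits, uppercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def anonymize(fields, pad1: bytes, pad2: bytes) -> str:
    """Hash the normalized identity fields twice, keying each round with
    one pad. The two-round layout is an assumed scheme, not the paper's."""
    message = "|".join(normalize(f) for f in fields).encode("utf-8")
    first = hmac.new(pad1, message, hashlib.sha256).digest()
    return hmac.new(pad2, first, hashlib.sha256).hexdigest()


# Hypothetical pads: short random keys standing in for the large
# secret random files mentioned in the abstract.
pad1 = secrets.token_bytes(64)
pad2 = secrets.token_bytes(64)

# Two spellings of the same (fictitious) identity give the same code,
# so the anonymized files can still be linked.
a = anonymize(["Dupont", "Hélène", "1950-03-12"], pad1, pad2)
b = anonymize(["DUPONT", "helene", "1950 03 12"], pad1, pad2)
print(a == b)
```

Because the pads are secret, an attacker who hashes a dictionary of plausible names and birth dates cannot reproduce the codes without them, which is the purpose the abstract assigns to the pads.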