Summary
Background: The process of merging data from different data sources is referred to as record linkage.
A medical environment with heightened privacy protection requirements demands the
transformation of clear-text attributes such as first name or date of birth into one-way
encrypted pseudonyms. In automated or privacy-preserving record linkage, a binary
classification may be needed to decide whether two records refer to the same entity.
Classification is the last of the four main phases of the record linkage process:
preprocessing, indexing, matching, and classification. The choice of binary classification
technique as a function of project specifications, in particular data quality, has not
yet been studied extensively.
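As an illustration of the pseudonymization step mentioned above, the following minimal
sketch derives a one-way pseudonym from a clear-text attribute. It assumes a keyed hash
(HMAC-SHA256) and a simple normalization step; neither is stated in this summary, and the
function name pseudonymize and the parameter secret_key are hypothetical.

```python
import hashlib
import hmac
import unicodedata


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a one-way pseudonym for a clear-text attribute.

    Assumption: HMAC-SHA256 stands in for the one-way transformation;
    the actual scheme used in the work is not specified in the summary.
    """
    # Normalize so that trivial spelling variants map to the same pseudonym.
    normalized = unicodedata.normalize("NFKC", value).strip().lower()
    # The keyed hash cannot be inverted without the secret key.
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

Two records can then be compared on their pseudonyms instead of on the clear-text values,
provided both sources use the same key and the same normalization.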
Objectives: The aim of this work is to introduce and evaluate an automatable semi-supervised
binary classification system for record linkage that is capable of competing with, or
even surpassing, advanced automated techniques from the domain of unsupervised
classification.
Methods: This work describes the rationale leading to the model and the final implementation
of an automatable semi-supervised binary classification system, and compares its
classification performance with that of an advanced active learning approach from the
domain of unsupervised learning. The performance of both systems was measured
on a broad variety of artificial test sets (n = 400), based on real patient data,
each with distinct characteristics.
Results: While the classification performance of both methods, measured as F-measure, was
close on test sets with maximum defined data quality (0.996 for semi-supervised and
0.993 for unsupervised classification), it diverged incrementally on test sets of
worse data quality, dropping to 0.964 for semi-supervised and 0.803 for unsupervised
classification.
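For reference, the F-measure reported above can be computed from the counts of a binary
match/non-match classification. The sketch below assumes the balanced variant (F1, the
harmonic mean of precision and recall); the summary does not state which weighting was
used, and the function name f_measure is hypothetical.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure (F1) for a binary match/non-match classification.

    Assumption: precision and recall are weighted equally.
    """
    if true_pos == 0:
        # No correctly classified matches: precision and recall are both 0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # share of predicted matches that are correct
    recall = true_pos / (true_pos + false_neg)     # share of true matches that were found
    return 2 * precision * recall / (precision + recall)
```

True non-matches do not enter the formula, which suits record linkage, where
non-matching record pairs usually far outnumber matching ones.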
Conclusions: Aside from supplying a viable model for semi-supervised classification in automated
probabilistic record linkage, the tests conducted on a large number of test sets suggest
that semi-supervised techniques may generally be capable of outperforming unsupervised
techniques, especially on data of lower quality.
Keywords
Medical record linkage - classification