Evaluation of Record Linkage Methods for Iterative Insertions

M. Sariyar; A. Borg; K. Pommerening

doi:10.3414/ME9238

Methods Inf Med 2009; 48(05): 429-437
DOI: 10.3414/ME9238

Original Articles

Schattauer GmbH

M. Sariyar¹, A. Borg¹, K. Pommerening¹

¹Institute of Medical Biostatistics, Epidemiology and Informatics, University Medical Center, Mainz, Germany

Publication History

20 August 2009

Publication Date:
20 January 2018 (online)

Summary

Objectives: Record linkage is an area of interdisciplinary research in which many mathematical methods have been developed and applied. However, comparative evaluations of record linkage methods are still underrepresented. In this paper, improvements of the Fellegi-Sunter model are compared with other elaborate classification methods in order to direct further research toward the most promising methodologies.

Methods: The task of linking records can be viewed as a special form of object identification. In addition to the Fellegi-Sunter model, we consider several non-stochastic methods and procedures for the record linkage task and perform an empirical evaluation on artificial and real data in the context of iterative insertions. This evaluation provides deeper insight into the empirical similarities and differences between different modelling frameworks for the record linkage problem. In addition, the effects of using string comparators on the performance of different matching algorithms are evaluated.

Results: Our central results show that stochastic record linkage based on the EM algorithm yields the best classification results when the calibration data differ structurally from the validation data. Bagging and boosting, together with support vector machines, are the best classification methods when calibration and validation data have no major structural differences.

Conclusions: The most promising methodologies for record linkage in environments similar to the one considered in this paper seem to be stochastic ones.
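
For readers unfamiliar with the Fellegi-Sunter model named in the summary, the following is a minimal sketch of its textbook decision rule, not the implementation evaluated in the paper. It assumes binary field comparisons and conditional independence between fields. The function names, the m- and u-probabilities (probability that a field agrees for true matches and for true non-matches), and the two thresholds are hypothetical illustration values; in practice such probabilities are estimated from data, for example with the EM algorithm.

```python
import math

def match_weight(agreement, m, u):
    """Sum of log2 likelihood-ratio weights over all compared fields.

    agreement: list of bools, True where the two records agree on a field
    m: per-field P(agree | records are a true match)      (assumed known)
    u: per-field P(agree | records are a true non-match)  (assumed known)
    """
    total = 0.0
    for agrees, m_i, u_i in zip(agreement, m, u):
        if agrees:
            total += math.log2(m_i / u_i)
        else:
            total += math.log2((1.0 - m_i) / (1.0 - u_i))
    return total

def classify(weight, lower, upper):
    """Three-way decision: link, non-link, or possible link (clerical review)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

# Hypothetical parameters for two fields, e.g. surname and date of birth.
m = [0.9, 0.9]
u = [0.1, 0.1]

w = match_weight([True, True], m, u)
print(classify(w, lower=-3.0, upper=3.0))
```

A record pair that agrees on every field receives a high positive weight and is linked, a pair that disagrees everywhere receives a negative weight and is rejected, and pairs in between are left for manual review.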

Keywords

Record linkage - object identification - decision trees - support vector machines - EM algorithm

