Abstract
Integrating or linking data from different sources is an increasingly important task in the preprocessing stage of many data mining projects. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. If no common unique entity identifiers (keys) are available in all data sources, the linkage needs to be performed using the available identifying attributes, like names and addresses. Data confidentiality often limits or even prohibits successful data linkage, as either no consent can be gained (for example in biomedical studies) or the data holders are not willing to release their data for linkage by other parties. We present methods for confidential data linkage based on hash encoding, public key encryption and n-gram similarity comparison techniques, and show how blind data linkage can be performed.
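To make the combination of hash encoding and n-gram similarity more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions and not the protocol presented in the paper: the function names, the choice of HMAC-SHA256, the bigram size and the Dice coefficient are all assumptions made for this example. Each party encodes the bigrams of an identifying attribute with a shared secret key, so that a comparison can be made on the hash values alone.

```python
import hashlib
import hmac
from typing import Set


def ngrams(value: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a normalised string."""
    text = value.strip().lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def encode_ngrams(value: str, key: bytes, n: int = 2) -> Set[str]:
    """Replace every n-gram by a keyed hash (HMAC-SHA256, an assumed choice)."""
    return {
        hmac.new(key, gram.encode("utf-8"), hashlib.sha256).hexdigest()
        for gram in ngrams(value, n)
    }


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient of two sets: 2 * |a & b| / (|a| + |b|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical usage: both data holders share the key, the comparing party does not.
shared_key = b"secret agreed between the data holders"
left = encode_ngrams("peter", shared_key)
right = encode_ngrams("pete", shared_key)
print(round(dice(left, right), 3))  # 0.857, i.e. 3 shared bigrams out of 4 + 3
```

Hashing each n-gram on its own, as done here, keeps the example short but leaves the encoded values open to frequency analysis, so a practical scheme needs further protection beyond this sketch.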
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Churches, T., Christen, P. (2004). Blind Data Linkage Using n-gram Similarity Comparisons. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_15
DOI: https://doi.org/10.1007/978-3-540-24775-3_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3