Abstract
Motivated by the need to link records across databases, we propose a novel graphical-model-based classifier that uses a mixture of Poisson distributions with latent variables. The idea is to gain insight into each hypothesized pair of matching records by inferring its underlying latent error rate using Bayesian modeling techniques. The novelty lies in using Gamma priors to learn the latent variables together with supervised labels. We deliberately make the naive assumption that the fields are independent in order to propose a general theory for this class of problems, not to dismiss the hierarchical dependencies that may be present in particular scenarios. The classifier works with sparse and streaming data, and its application to record linkage addresses the challenges of sparsity, data streams, and the varying nature of datasets.
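To make the Gamma-Poisson idea concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes each record pair is summarized as a vector of per-field error counts, that each field's count is Poisson with a Gamma(shape, rate) prior on its rate, and that fields are independent given the label. Under those assumptions the posterior predictive for a count is negative binomial, and the model can be updated one labelled pair at a time. The class name `GammaPoissonNB`, the hyperparameters `a0` and `b0`, and the toy counts are all hypothetical.

```python
import math


class GammaPoissonNB:
    """Naive Bayes over per-field error counts with a Gamma-Poisson predictive.

    Illustrative sketch only: label 1 = match, label 0 = non-match.
    """

    def __init__(self, n_fields, a0=1.0, b0=1.0):
        self.a0, self.b0 = a0, b0  # assumed Gamma prior (shape, rate)
        # Sufficient statistics per class: summed counts per field, pairs seen.
        self.sums = {0: [0.0] * n_fields, 1: [0.0] * n_fields}
        self.n = {0: 0, 1: 0}

    def update(self, counts, label):
        """Fold in one labelled pair; suitable for streaming updates."""
        for f, x in enumerate(counts):
            self.sums[label][f] += x
        self.n[label] += 1

    @staticmethod
    def log_nb(x, shape, rate):
        """Log negative binomial pmf: Poisson(x | lam) mixed over Gamma(shape, rate)."""
        p = rate / (rate + 1.0)
        return (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
                + shape * math.log(p) + x * math.log1p(-p))

    def log_score(self, counts, label):
        """Log class prior (add-one smoothed) plus summed per-field log predictive."""
        total = self.n[0] + self.n[1]
        score = math.log((self.n[label] + 1) / (total + 2))
        for f, x in enumerate(counts):
            shape = self.a0 + self.sums[label][f]  # posterior shape
            rate = self.b0 + self.n[label]         # posterior rate
            score += self.log_nb(x, shape, rate)
        return score

    def predict(self, counts):
        return 1 if self.log_score(counts, 1) > self.log_score(counts, 0) else 0


# Toy usage with made-up counts for three fields.
clf = GammaPoissonNB(n_fields=3)
for c in [(0, 1, 0), (0, 0, 0), (1, 0, 0)]:
    clf.update(c, 1)  # matching pairs: few errors
for c in [(5, 7, 4), (6, 3, 8), (4, 6, 5)]:
    clf.update(c, 0)  # non-matching pairs: many errors
print(clf.predict((0, 0, 1)), clf.predict((6, 5, 7)))
```

Because only running sums and counts are stored, each `update` call is constant time per field, which is one simple way to accommodate streaming data. The sketch uses conjugate updating with fixed hyperparameters and does not fit the Gamma parameters themselves.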
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kashyap, H., Byadarhaly, K. (2022). Supervised Negative Binomial Classifier for Probabilistic Record Linkage. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 507. Springer, Cham. https://doi.org/10.1007/978-3-031-10464-0_49
DOI: https://doi.org/10.1007/978-3-031-10464-0_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10463-3
Online ISBN: 978-3-031-10464-0
eBook Packages: Intelligent Technologies and Robotics (R0)