Abstract
This paper provides an overview of methods for masking microdata so that the data can be released in public-use files. It divides the methods according to whether or not they have been demonstrated to provide analytic properties. For methods that have been shown to provide one or two sets of analytic properties in the masked data, we indicate where the data may have limitations for most analyses and how re-identification can or might be performed. We cover several methods for producing synthetic data, along with possible computational extensions for better automating the creation of the underlying statistical models. We finish with background on analysis-specific and general information-loss metrics, intended to stimulate research.
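The trade-off the abstract describes, between masking, re-identification, and information loss, can be illustrated with a small sketch. It is not taken from the paper: it assumes additive Gaussian noise as the masking method, nearest-neighbour matching on standardized values as the re-identification attack, and mean absolute change in standard-deviation units as a general information-loss metric. The function names `add_noise`, `reidentification_rate`, and `information_loss`, and the toy age and income data, are hypothetical.

```python
import random
import statistics


def add_noise(records, noise_fraction, rng):
    """Mask each numeric column with zero-mean Gaussian noise.

    The noise standard deviation is noise_fraction times the column's own
    standard deviation (an illustrative choice, not the paper's method).
    """
    sds = [statistics.pstdev(col) for col in zip(*records)]
    return [
        tuple(v + rng.gauss(0.0, noise_fraction * sd) for v, sd in zip(rec, sds))
        for rec in records
    ]


def reidentification_rate(original, masked):
    """Fraction of masked records whose nearest original record is their source."""
    # Standardize each column so no single variable dominates the distance.
    sds = [statistics.pstdev(col) or 1.0 for col in zip(*original)]

    def dist(a, b):
        return sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, sds))

    hits = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: dist(original[j], m))
        hits += nearest == i
    return hits / len(masked)


def information_loss(original, masked):
    """Mean absolute change per value, in units of each column's standard deviation."""
    sds = [statistics.pstdev(col) or 1.0 for col in zip(*original)]
    total = sum(
        abs(x - y) / s
        for o, m in zip(original, masked)
        for x, y, s in zip(o, m, sds)
    )
    return total / (len(original) * len(sds))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical microdata: (age, income) pairs.
    data = [(rng.uniform(18, 90), rng.lognormvariate(10, 1)) for _ in range(200)]
    for frac in (0.0, 0.1, 0.5, 1.0):
        masked = add_noise(data, frac, rng)
        print(frac, reidentification_rate(data, masked),
              round(information_loss(data, masked), 3))
```

Running the script prints, for each noise level, the share of records an intruder holding the original values could link back correctly and the corresponding information loss. More noise typically lowers the first number and raises the second, which is the tension between disclosure risk and analytic validity that the paper surveys.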
© 2004 Springer-Verlag Berlin Heidelberg
Winkler, W.E. (2004). Masking and Re-identification Methods for Public-Use Microdata: Overview and Research Problems. In: Domingo-Ferrer, J., Torra, V. (eds) Privacy in Statistical Databases. PSD 2004. Lecture Notes in Computer Science, vol 3050. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-25955-8_18
DOI: https://doi.org/10.1007/978-3-540-25955-8_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22118-0
Online ISBN: 978-3-540-25955-8