Abstract
Estimation of the ratio of probability densities has attracted a great deal of attention, since it can be used to address a variety of statistical problems. A naive approach to density-ratio approximation is to estimate the numerator and denominator densities separately and then take their ratio. However, this two-step approach does not perform well in practice, and methods that directly estimate density ratios without going through density estimation have been explored. In this paper, we first give a comprehensive review of existing density-ratio estimation methods and discuss their pros and cons. We then propose a new framework of density-ratio estimation in which a density-ratio model is fitted to the true density ratio under the Bregman divergence. This framework includes existing approaches as special cases and is substantially more general. Finally, we develop a robust density-ratio estimation method under the power divergence, a novel instance within our framework.
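The contrast the abstract draws between the naive two-step approach and direct density-ratio estimation can be sketched numerically. The snippet below is a minimal illustration, not code from the paper: it compares a ratio of two Gaussian kernel density estimates against a direct least-squares fit in the style of unconstrained least-squares importance fitting (uLSIF). The sample sizes, the bandwidth `sigma`, and the regularizer `lam` are arbitrary choices for the example; in practice they would be selected by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: x_p ~ p (numerator density), x_q ~ q (denominator density).
x_p = rng.normal(0.0, 1.0, 200)
x_q = rng.normal(0.5, 1.2, 200)

def gauss_kernel(x, centers, sigma):
    """Gaussian kernel matrix K[i, l] = exp(-(x_i - c_l)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# --- Naive two-step baseline: estimate p and q separately, then divide. ---
def kde(x, data, sigma=0.3):
    return gauss_kernel(x, data, sigma).mean(axis=1) / (np.sqrt(2 * np.pi) * sigma)

grid = np.linspace(-2, 2, 41)
ratio_naive = kde(grid, x_p) / kde(grid, x_q)

# --- Direct fit: model r(x) = sum_l alpha_l K(x, c_l) and minimize the
# empirical squared error (1/2) mean_q[r^2] - mean_p[r] + (lam/2)||alpha||^2,
# whose minimizer solves the linear system (H + lam I) alpha = h. ---
centers = x_p[:50]       # kernel centers taken from numerator samples
sigma, lam = 0.5, 0.1    # bandwidth and regularizer (cross-validated in practice)
Phi_p = gauss_kernel(x_p, centers, sigma)
Phi_q = gauss_kernel(x_q, centers, sigma)
H = Phi_q.T @ Phi_q / len(x_q)   # empirical mean over q of K K^T
h = Phi_p.mean(axis=0)           # empirical mean over p of K
alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)

def r_hat(x):
    return gauss_kernel(x, centers, sigma) @ alpha

# Sanity check: since E_q[p(x)/q(x)] = 1, the fitted ratio averaged over the
# denominator samples should be close to one.
print(float(r_hat(x_q).mean()))
```

The direct estimator avoids the two-step approach's compounding of estimation errors: division by a small, noisy denominator estimate can blow up the naive ratio, whereas the least-squares fit targets the ratio function itself.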
Cite this article
Sugiyama, M., Suzuki, T. & Kanamori, T. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann Inst Stat Math 64, 1009–1044 (2012). https://doi.org/10.1007/s10463-011-0343-8