Abstract
We derive generalization bounds for learning algorithms based on their robustness: the property that if a testing sample is “similar” to a training sample, then the testing error is close to the training error. This provides a novel approach, different from complexity or stability arguments, to study generalization of learning algorithms. One advantage of the robustness approach, compared to previous methods, is the geometric intuition it conveys. Consequently, robustness-based analysis is easy to extend to learning in non-standard setups such as Markovian samples or quantile loss. We further show that a weak notion of robustness is both sufficient and necessary for generalizability, which implies that robustness is a fundamental property that is required for learning algorithms to work.
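In outline, the robustness notion and the flavor of the resulting bound are as follows (a sketch; the precise statement, including the required assumptions and constants, is in the body of the paper). An algorithm $\mathcal{A}$ trained on a sample set $\mathbf{s}$ of size $n$ is $(K,\epsilon(\cdot))$-robust if the sample space $\mathcal{Z}$ can be partitioned into $K$ disjoint sets $C_1,\dots,C_K$ such that a training sample and a testing sample falling in the same set incur similar losses:
\[
s \in \mathbf{s},\; z \in \mathcal{Z},\; s,z \in C_i
\;\Longrightarrow\;
\bigl|\ell(\mathcal{A}_{\mathbf{s}}, s) - \ell(\mathcal{A}_{\mathbf{s}}, z)\bigr| \le \epsilon(\mathbf{s}).
\]
For a loss bounded by $M$ and i.i.d. samples, robustness then yields, with probability at least $1-\delta$,
\[
\bigl|L(\mathcal{A}_{\mathbf{s}}) - L_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}})\bigr|
\le \epsilon(\mathbf{s}) + M\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{n}},
\]
where $L$ and $L_{\mathrm{emp}}$ denote the expected and empirical losses, respectively.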
Editor: Phil Long.
Cite this article
Xu, H., Mannor, S. Robustness and generalization. Mach Learn 86, 391–423 (2012). https://doi.org/10.1007/s10994-011-5268-1