Abstract
We study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp PAC-Bayesian risk bounds for aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We then apply these results to derive sparsity oracle inequalities.
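The exponential-weighting scheme the abstract refers to can be illustrated with a minimal sketch: each candidate function receives a weight proportional to the exponential of its negative empirical squared risk, and the aggregate is the resulting convex combination. The function name, the temperature parameter `beta`, and the toy data below are illustrative assumptions, not the paper's exact estimator or tuning.

```python
import numpy as np

def exponential_weights_aggregate(predictions, y, beta=1.0):
    """Aggregate candidate predictions by exponential weighting.

    predictions: (M, n) array; row j holds the j-th candidate's
        predictions at the n design points.
    y: (n,) vector of observed responses.
    beta: temperature parameter; smaller values concentrate the
        weights on the empirically best candidates.
    """
    # empirical squared risk of each candidate
    risks = np.mean((predictions - y) ** 2, axis=1)
    # subtract the minimum risk before exponentiating for numerical stability
    w = np.exp(-(risks - risks.min()) / beta)
    w /= w.sum()
    # convex combination of the candidate prediction vectors
    return w @ predictions

# toy usage: three constant predictors for data centered at 1.0
rng = np.random.default_rng(0)
y = 1.0 + 0.1 * rng.standard_normal(50)
preds = np.vstack([np.full(50, c) for c in (0.0, 1.0, 2.0)])
agg = exponential_weights_aggregate(preds, y, beta=0.05)
```

With a small temperature, nearly all weight falls on the predictor closest to the data, so the aggregate tracks the best candidate; larger `beta` spreads the weight and yields a smoother mixture.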
Additional information
Editors: Claudio Gentile, Nader H. Bshouty.
Research partially supported by the grant ANR-06-BLAN-0194 and by the PASCAL Network of Excellence.
Cite this article
Dalalyan, A., Tsybakov, A.B. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach Learn 72, 39–61 (2008). https://doi.org/10.1007/s10994-008-5051-0