Abstract
Kernel logistic regression (KLR) is the kernel learning method best suited to binary pattern recognition problems where estimates of the a-posteriori probability of class membership are required. Such problems occur frequently in practical applications, for instance because the operational prior class probabilities, or equivalently the relative misclassification costs, are variable or unknown at the time the model is trained. The model parameters are given by the solution of a convex optimization problem, which may be found via an efficient iteratively re-weighted least squares (IRWLS) procedure. The generalization properties of a kernel logistic regression machine are, however, governed by a small number of hyper-parameters, the values of which must be determined during the process of model selection. In this paper, we propose a novel model selection strategy for KLR, based on a computationally efficient closed-form approximation of the leave-one-out cross-validation procedure. Results obtained on a variety of synthetic and real-world benchmark datasets are given, demonstrating that the proposed model selection procedure is competitive with a more conventional k-fold cross-validation based approach, and also with Gaussian process (GP) classifiers implemented using the Laplace approximation and the Expectation Propagation (EP) algorithm.
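To make the procedure summarised above concrete, the sketch below (NumPy) fits a kernel logistic regression model by IRWLS and then evaluates an approximate leave-one-out criterion from the hat matrix of the final weighted least-squares problem, using the standard PRESS-style identity z_i - f_i^(-i) ~= (z_i - f_i)/(1 - h_ii). It is a minimal illustration of the general idea only: the function names (rbf_kernel, fit_klr_irwls, approx_loo_nll), the omission of a bias term, and other details are assumptions made for the sake of a self-contained example and may differ from the exact formulation in the paper.

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of X1 and X2.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_klr_irwls(K, y, lam=1.0, n_iter=25):
    # Fit kernel logistic regression (targets y in {0,1}) by IRWLS.
    # Each step forms working responses z and weights w, then solves the
    # resulting weighted kernel ridge regression for the dual coefficients.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_iter):
        f = K @ alpha                      # current latent values
        p = 1.0 / (1.0 + np.exp(-f))       # predicted probabilities
        w = np.clip(p * (1.0 - p), 1e-8, None)
        z = f + (y - p) / w                # working responses
        # Solve (K + lam * W^{-1}) alpha = z
        alpha = np.linalg.solve(K + lam * np.diag(1.0 / w), z)
    return alpha, w, z

def approx_loo_nll(K, y, alpha, w, z, lam):
    # Closed-form approximation of the leave-one-out negative log-likelihood,
    # computed from the hat matrix H = K (K + lam * W^{-1})^{-1} of the
    # weighted least-squares problem at the IRWLS solution.
    C = K + lam * np.diag(1.0 / w)
    H = K @ np.linalg.inv(C)
    f = K @ alpha
    h = np.diag(H)
    f_loo = z - (z - f) / (1.0 - h)        # approximate held-out latent values
    p_loo = 1.0 / (1.0 + np.exp(-f_loo))
    eps = 1e-12
    return -np.mean(y * np.log(p_loo + eps) + (1 - y) * np.log(1 - p_loo + eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    K = rbf_kernel(X, X, gamma=0.5)
    # Hypothetical grid over the regularisation parameter, scored by the
    # approximate LOO criterion; this merely stands in for model selection.
    for lam in (0.01, 0.1, 1.0, 10.0):
        alpha, w, z = fit_klr_irwls(K, y, lam=lam)
        print(lam, approx_loo_nll(K, y, alpha, w, z, lam))

In the paper the corresponding criterion is minimised with respect to the kernel and regularisation hyper-parameters during model selection; the simple grid search above is only a stand-in for that step.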
Additional information
Editor: Olivier Chapelle.
Cite this article
Cawley, G.C., Talbot, N.L.C. Efficient approximate leave-one-out cross-validation for kernel logistic regression. Mach Learn 71, 243–264 (2008). https://doi.org/10.1007/s10994-008-5055-9