Abstract
Kernel logistic regression (KLR) is the kernel learning method best suited to binary pattern recognition problems where estimates of the a-posteriori probability of class membership are required. Such problems occur frequently in practical applications, for instance because the operational prior class probabilities, or equivalently the relative misclassification costs, are variable or unknown at the time the model is trained. The model parameters are given by the solution of a convex optimization problem, which may be found via an efficient iteratively re-weighted least squares (IRWLS) procedure. The generalization properties of a kernel logistic regression machine are, however, governed by a small number of hyper-parameters, the values of which must be determined during the process of model selection. In this paper, we propose a novel model selection strategy for KLR, based on a computationally efficient closed-form approximation of the leave-one-out cross-validation procedure. Results obtained on a variety of synthetic and real-world benchmark datasets are given, demonstrating that the proposed model selection procedure is competitive with a more conventional k-fold cross-validation based approach, and also with Gaussian process (GP) classifiers implemented using the Laplace approximation and the Expectation Propagation (EP) algorithm.
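To make the procedure summarised above concrete, the sketch below (NumPy) fits a kernel logistic regression model by IRWLS and then evaluates an approximate leave-one-out criterion from the hat matrix of the final weighted least-squares problem, using the standard PRESS-style identity z_i - f_i^(-i) ~= (z_i - f_i)/(1 - h_ii). It is a minimal illustration of the general idea only: the function names (rbf_kernel, fit_klr_irwls, approx_loo_nll), the omission of a bias term, and other details are assumptions made for the sake of a self-contained example and may differ from the exact formulation in the paper.

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of X1 and X2.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_klr_irwls(K, y, lam=1.0, n_iter=25):
    # Fit kernel logistic regression (targets y in {0,1}) by IRWLS.
    # Each step forms working responses z and weights w, then solves the
    # resulting weighted kernel ridge regression for the dual coefficients.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_iter):
        f = K @ alpha                      # current latent values
        p = 1.0 / (1.0 + np.exp(-f))       # predicted probabilities
        w = np.clip(p * (1.0 - p), 1e-8, None)
        z = f + (y - p) / w                # working responses
        # Solve (K + lam * W^{-1}) alpha = z
        alpha = np.linalg.solve(K + lam * np.diag(1.0 / w), z)
    return alpha, w, z

def approx_loo_nll(K, y, alpha, w, z, lam):
    # Closed-form approximation of the leave-one-out negative log-likelihood,
    # computed from the hat matrix H = K (K + lam * W^{-1})^{-1} of the
    # weighted least-squares problem at the IRWLS solution.
    C = K + lam * np.diag(1.0 / w)
    H = K @ np.linalg.inv(C)
    f = K @ alpha
    h = np.diag(H)
    f_loo = z - (z - f) / (1.0 - h)        # approximate held-out latent values
    p_loo = 1.0 / (1.0 + np.exp(-f_loo))
    eps = 1e-12
    return -np.mean(y * np.log(p_loo + eps) + (1 - y) * np.log(1 - p_loo + eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    K = rbf_kernel(X, X, gamma=0.5)
    # Hypothetical grid over the regularisation parameter, scored by the
    # approximate LOO criterion; this merely stands in for model selection.
    for lam in (0.01, 0.1, 1.0, 10.0):
        alpha, w, z = fit_klr_irwls(K, y, lam=lam)
        print(lam, approx_loo_nll(K, y, alpha, w, z, lam))

In the paper the corresponding criterion is minimised with respect to the kernel and regularisation hyper-parameters during model selection; the simple grid search above is only a stand-in for that step.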
Additional information
Editor: Olivier Chapelle.
Cite this article
Cawley, G.C., Talbot, N.L.C. Efficient approximate leave-one-out cross-validation for kernel logistic regression. Mach Learn 71, 243–264 (2008). https://doi.org/10.1007/s10994-008-5055-9