Abstract
Most optimization methods for logistic regression and maximum entropy solve the primal problem; they include iterative scaling, coordinate descent, quasi-Newton, and truncated Newton approaches. Less effort has been made to solve the dual problem. In contrast, for linear support vector machines (SVM), dual methods have been shown to be very effective. In this paper, we apply coordinate descent methods to the dual form of logistic regression and maximum entropy. Interestingly, many details differ from the situation in linear SVM. We carefully study the theoretical convergence as well as numerical issues. The proposed method is shown to be faster than most state-of-the-art methods for training logistic regression and maximum entropy models.
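To make the dual approach concrete, the following is a minimal NumPy sketch of coordinate descent on the dual of L2-regularized logistic regression: each dual variable is updated in turn by a damped one-variable Newton method while the primal weights are maintained incrementally. This is an illustrative simplification, not the authors' implementation; the function name, the fixed iteration counts, and the crude step-halving safeguard are assumptions, and the paper itself handles the numerical issues (e.g. evaluating the logarithmic terms near the box boundary) far more carefully.

```python
import numpy as np

def dual_cd_logreg(X, y, C=1.0, outer_iters=50, inner_iters=20, tol=1e-10):
    """Coordinate descent on the dual of L2-regularized logistic regression.

    Dual problem (one variable a_i per training example):
        min_a  0.5 * a^T Q a
               + sum_i [ a_i log a_i + (C - a_i) log(C - a_i) ],
        subject to 0 <= a_i <= C,  where Q_ij = y_i y_j x_i . x_j.

    One variable is updated at a time by a damped Newton method, while
    w = sum_i a_i y_i x_i is kept up to date incrementally.
    """
    n, _ = X.shape
    a = np.full(n, C / 2.0)                # start strictly inside (0, C)
    Qii = np.einsum('ij,ij->i', X, X)      # diagonal of Q (y_i^2 = 1)
    w = X.T @ (a * y)
    rng = np.random.default_rng(0)
    for _ in range(outer_iters):
        for i in rng.permutation(n):
            # One-variable objective in z = a_i:
            #   g(z) = 0.5*Qii*z^2 + c1*z + z log z + (C - z) log(C - z),
            # with c1 chosen so g'(a_i) matches the full dual gradient.
            c1 = y[i] * (X[i] @ w) - Qii[i] * a[i]
            z = a[i]
            for _ in range(inner_iters):
                g1 = Qii[i] * z + c1 + np.log(z / (C - z))   # g'(z)
                g2 = Qii[i] + C / (z * (C - z))              # g''(z) > 0
                step = g1 / g2
                # Halve the Newton step until the iterate stays in (0, C);
                # g is strictly convex with its minimizer inside the box.
                while z - step <= 0.0 or z - step >= C:
                    step *= 0.5
                z -= step
                if abs(step) < tol:
                    break
            w += (z - a[i]) * y[i] * X[i]
            a[i] = z
    return w, a
```

At the dual optimum, `w` coincides with the primal logistic-regression solution, so predictions are simply `np.sign(X @ w)`; this primal-dual link is what makes the dual solver usable as a drop-in trainer.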
Additional information
Editors: Süreyya Özöğür-Akyüz, Devrim Ünay, and Alex Smola.
Cite this article
Yu, HF., Huang, FL. & Lin, CJ. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach Learn 85, 41–75 (2011). https://doi.org/10.1007/s10994-010-5221-8