Abstract
Most optimization methods for logistic regression and maximum entropy solve the primal problem; they include iterative scaling, coordinate descent, quasi-Newton, and truncated Newton approaches. Less effort has been made to solve the dual problem. In contrast, for linear support vector machines (SVM), dual methods have been shown to be very effective. In this paper, we apply coordinate descent methods to the dual form of logistic regression and maximum entropy. Interestingly, many details differ from the situation in linear SVM. We carefully study the theoretical convergence as well as numerical issues. The proposed method is shown to be faster than most state-of-the-art methods for training logistic regression and maximum entropy models.
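To make the dual approach concrete, the following is a minimal NumPy sketch of coordinate descent on the dual of L2-regularized logistic regression: each dual variable is updated in turn by a damped one-variable Newton method while the primal weights are maintained incrementally. This is an illustrative simplification, not the authors' implementation; the function name, the fixed iteration counts, and the crude step-halving safeguard are assumptions, and the paper itself handles the numerical issues (e.g. evaluating the logarithmic terms near the box boundary) far more carefully.

```python
import numpy as np

def dual_cd_logreg(X, y, C=1.0, outer_iters=50, inner_iters=20, tol=1e-10):
    """Coordinate descent on the dual of L2-regularized logistic regression.

    Dual problem (one variable a_i per training example):
        min_a  0.5 * a^T Q a
               + sum_i [ a_i log a_i + (C - a_i) log(C - a_i) ],
        subject to 0 <= a_i <= C,  where Q_ij = y_i y_j x_i . x_j.

    One variable is updated at a time by a damped Newton method, while
    w = sum_i a_i y_i x_i is kept up to date incrementally.
    """
    n, _ = X.shape
    a = np.full(n, C / 2.0)                # start strictly inside (0, C)
    Qii = np.einsum('ij,ij->i', X, X)      # diagonal of Q (y_i^2 = 1)
    w = X.T @ (a * y)
    rng = np.random.default_rng(0)
    for _ in range(outer_iters):
        for i in rng.permutation(n):
            # One-variable objective in z = a_i:
            #   g(z) = 0.5*Qii*z^2 + c1*z + z log z + (C - z) log(C - z),
            # with c1 chosen so g'(a_i) matches the full dual gradient.
            c1 = y[i] * (X[i] @ w) - Qii[i] * a[i]
            z = a[i]
            for _ in range(inner_iters):
                g1 = Qii[i] * z + c1 + np.log(z / (C - z))   # g'(z)
                g2 = Qii[i] + C / (z * (C - z))              # g''(z) > 0
                step = g1 / g2
                # Halve the Newton step until the iterate stays in (0, C);
                # g is strictly convex with its minimizer inside the box.
                while z - step <= 0.0 or z - step >= C:
                    step *= 0.5
                z -= step
                if abs(step) < tol:
                    break
            w += (z - a[i]) * y[i] * X[i]
            a[i] = z
    return w, a
```

At the dual optimum, `w` coincides with the primal logistic-regression solution, so predictions are simply `np.sign(X @ w)`; this primal-dual link is what makes the dual solver usable as a drop-in trainer.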
Additional information
Editors: Süreyya Özöğür-Akyüz, Devrim Ünay, and Alex Smola.
Cite this article
Yu, HF., Huang, FL. & Lin, CJ. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach Learn 85, 41–75 (2011). https://doi.org/10.1007/s10994-010-5221-8