Abstract
We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview, co-regularization learning approach in which each language is treated as a separate source. The approach minimizes a joint loss that combines the monolingual classification losses in each language while enforcing consistency of the categorization across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and, most of the time, models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which is available for benchmarking.
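To make the idea of a joint loss concrete, the following is a minimal sketch and not the authors' formulation. It assumes two languages, binary labels, linear logistic models, and a squared difference between the two predicted probabilities as the consistency penalty. The function names, the weight `lam`, and the toy feature vectors are all hypothetical; the disagreement term actually used in the article may differ.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(p, y):
    """Logistic (cross-entropy) loss for probability p and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def joint_loss(w_a, w_b, docs_a, docs_b, labels, lam):
    """Average of: loss in language A + loss in language B
    + lam * disagreement between the two classifiers.

    The squared-difference disagreement is an assumption made for
    illustration only, not the term defined in the article.
    """
    total = 0.0
    for x_a, x_b, y in zip(docs_a, docs_b, labels):
        p_a = sigmoid(sum(w * x for w, x in zip(w_a, x_a)))
        p_b = sigmoid(sum(w * x for w, x in zip(w_b, x_b)))
        total += log_loss(p_a, y) + log_loss(p_b, y) + lam * (p_a - p_b) ** 2
    return total / len(labels)

# Toy data: each document has one feature vector per language (hypothetical values).
docs_en = [[1.0, 0.0], [0.0, 1.0]]
docs_fr = [[0.9, 0.1], [0.2, 0.8]]
labels = [1, 0]

independent = joint_loss([1.0, -1.0], [0.5, 0.0], docs_en, docs_fr, labels, lam=0.0)
co_regularized = joint_loss([1.0, -1.0], [0.5, 0.0], docs_en, docs_fr, labels, lam=1.0)
```

With `lam=0.0` the objective reduces to the sum of the two monolingual losses, which corresponds to training each language independently; a positive `lam` additionally penalizes the two classifiers for disagreeing on the same document.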
Additional information
Editors: Nicolo Cesa-Bianchi, David R. Hardoon, and Gayle Leen.
Cite this article
Amini, MR., Goutte, C. A co-classification approach to learning from multilingual corpora. Mach Learn 79, 105–121 (2010). https://doi.org/10.1007/s10994-009-5151-5
DOI: https://doi.org/10.1007/s10994-009-5151-5