A co-classification approach to learning from multilingual corpora

Amini, Massih-Reza; Goutte, Cyril

doi:10.1007/s10994-009-5151-5

Published: 29 October 2009

Volume 79, pages 105–121, (2010)

Machine Learning

Abstract

We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview learning, co-regularization approach in which we consider each language as a separate source and minimize a joint loss that combines the monolingual classification losses in each language while ensuring consistency of the categorization across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and, in most cases, even models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which is available for benchmarking.
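
As a rough illustration of the joint-loss idea described in the abstract, the following is a minimal sketch for two language views and binary labels. It is not the authors' training algorithm: the squared difference between the two views' predicted probabilities is an assumed stand-in for the cross-language consistency term, and the names `fit_co_regularized`, `lam`, `lr` and `n_iter`, as well as the synthetic data, are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def joint_loss(w1, w2, X1, X2, y, lam):
    """Sum of the two monolingual log losses plus a disagreement penalty."""
    p1, p2 = sigmoid(X1 @ w1), sigmoid(X2 @ w2)
    eps = 1e-12  # guards against log(0)

    def log_loss(p):
        return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    # Assumption: squared difference of predicted probabilities as the
    # consistency term; the paper's own term is not given in this text.
    disagreement = np.mean((p1 - p2) ** 2)
    return log_loss(p1) + log_loss(p2) + lam * disagreement


def fit_co_regularized(X1, X2, y, lam=1.0, lr=0.5, n_iter=500):
    """Plain gradient descent on the joint loss (hypothetical settings)."""
    n = len(y)
    w1 = np.zeros(X1.shape[1])
    w2 = np.zeros(X2.shape[1])
    history = []
    for _ in range(n_iter):
        p1, p2 = sigmoid(X1 @ w1), sigmoid(X2 @ w2)
        # Per-example gradient with respect to each view's logit:
        # (p - y) from the log loss, plus the chain rule through the
        # sigmoid for the squared-difference penalty.
        g1 = (p1 - y) + 2 * lam * (p1 - p2) * p1 * (1 - p1)
        g2 = (p2 - y) + 2 * lam * (p2 - p1) * p2 * (1 - p2)
        w1 = w1 - lr * (X1.T @ g1) / n
        w2 = w2 - lr * (X2.T @ g2) / n
        history.append(joint_loss(w1, w2, X1, X2, y, lam))
    return w1, w2, history


# Synthetic stand-in for a bilingual corpus: each "document" has one
# feature vector per language, and both views share the same label.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n).astype(float)
shift = (2 * y - 1)[:, None]
X1 = rng.normal(size=(n, 5)) + shift  # view 1 (e.g. source language)
X2 = rng.normal(size=(n, 4)) + shift  # view 2 (e.g. translation)

w1, w2, history = fit_co_regularized(X1, X2, y)
acc1 = np.mean((sigmoid(X1 @ w1) > 0.5) == (y == 1))
acc2 = np.mean((sigmoid(X2 @ w2) > 0.5) == (y == 1))
```

Setting `lam=0` in this sketch reduces it to two logistic regressions trained independently on each view, which corresponds to the monolingual baseline mentioned in the abstract; larger values push the two classifiers toward agreeing on each document.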

Author information

Authors and Affiliations

Interactive Language Technologies group, National Research Council Canada, 283, boulevard Alexandre-Taché, Gatineau, QC, J8X 3X7, Canada
Massih-Reza Amini & Cyril Goutte

Corresponding author

Correspondence to Massih-Reza Amini.

Additional information

Editors: Nicolo Cesa-Bianchi, David R. Hardoon, and Gayle Leen.

About this article

Cite this article

Amini, MR., Goutte, C. A co-classification approach to learning from multilingual corpora. Mach Learn 79, 105–121 (2010). https://doi.org/10.1007/s10994-009-5151-5

Received: 26 February 2009
Revised: 14 September 2009
Accepted: 23 September 2009
Published: 29 October 2009
Issue Date: May 2010
DOI: https://doi.org/10.1007/s10994-009-5151-5
