Abstract
This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model that jointly models latent topics in bilingual corpora discussing comparable content, and we use the topics as features in a cross-lingual text categorization task that requires no dictionary. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers topic information from the bilingual corpora, and that the learned topics successfully transfer classification knowledge to other languages for which no labeled training data are available.
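The transfer step described in the abstract can be illustrated with a minimal sketch. It assumes that a bilingual topic model has already produced per-document topic proportions in a topic space shared by both languages. The arrays, the class labels `sports` and `politics`, and the nearest-centroid classifier are hypothetical stand-ins chosen for illustration; they are not the model, data, or classifier used in the paper.

```python
import numpy as np

def fit_centroids(theta, labels):
    """Average the topic proportions of the labeled documents in each class."""
    classes = sorted(set(labels))
    centroids = np.array([
        theta[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
        for c in classes
    ])
    return classes, centroids

def predict(theta, classes, centroids):
    """Assign each document to the class with the closest centroid in topic space."""
    # Squared Euclidean distance between every document and every centroid.
    dists = ((theta[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return [classes[j] for j in dists.argmin(axis=1)]

# Toy topic proportions over 3 shared topics (illustrative values, not real data).
# Labeled documents in the source language:
theta_source = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
])
labels_source = ["sports", "sports", "politics", "politics"]

# Unlabeled documents in the target language, expressed in the same topic space:
theta_target = np.array([
    [0.75, 0.15, 0.10],
    [0.10, 0.20, 0.70],
])

classes, centroids = fit_centroids(theta_source, labels_source)
predicted = predict(theta_target, classes, centroids)
print(predicted)
```

Because both languages share one set of topics, the classifier never sees words from either vocabulary, which is why no dictionary or labeled target-language data is needed in this sketch.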
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
De Smet, W., Tang, J., Moens, M.-F. (2011). Knowledge Transfer across Multilingual Corpora via Latent Topics. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_45
DOI: https://doi.org/10.1007/978-3-642-20841-6_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6
eBook Packages: Computer Science, Computer Science (R0)