Abstract
In the literature on text processing, many topic models have been proposed that represent documents and words as topics or latent topics in order to process text effectively and accurately. In this paper, we propose the Latent Dirichlet Allocation Category Language Model (LDACLM) for text categorization and estimate the model parameters by variational inference. As a variant of the Latent Dirichlet Allocation model, LDACLM regards the documents of each category as a language model and uses the variational parameters to compute maximum a posteriori estimates of terms. Experiments show that LDACLM is effective for text categorization, outperforming the standard Naive Bayes and Rocchio methods.
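The idea of treating each category as a language model can be illustrated with a minimal sketch. This is not the LDACLM model itself: it leaves out the latent topics and the variational inference, and simply fits one Dirichlet-smoothed unigram model per category, then assigns a document to the category under which it is most likely. Without the topic layer this reduces to a Naive Bayes-style classifier with a uniform class prior. The function names, the smoothing value `alpha`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_category_models(docs_by_category, alpha=0.1):
    """Fit one Dirichlet-smoothed unigram language model per category.

    docs_by_category maps a category name to a list of tokenized documents.
    alpha is a hypothetical symmetric Dirichlet pseudo-count (posterior-mean smoothing).
    """
    vocab = {w for docs in docs_by_category.values() for doc in docs for w in doc}
    models = {}
    for category, docs in docs_by_category.items():
        counts = Counter(w for doc in docs for w in doc)
        denom = sum(counts.values()) + alpha * len(vocab)
        # Smoothed term probability: (count + alpha) / (total + alpha * |V|)
        models[category] = {w: (counts[w] + alpha) / denom for w in vocab}
    return models, vocab


def classify(tokens, models, vocab):
    """Return the category whose language model gives the highest log-likelihood."""
    scores = {
        category: sum(math.log(probs[w]) for w in tokens if w in vocab)
        for category, probs in models.items()
    }
    return max(scores, key=scores.get)


# Toy training data (invented for this example only).
training = {
    "sports": [["team", "wins", "match"], ["player", "scores", "goal"]],
    "finance": [["stock", "market", "falls"], ["bank", "raises", "rates"]],
}
models, vocab = train_category_models(training)
predicted = classify(["team", "scores", "goal"], models, vocab)
```

In a topic-model variant, the per-category term probabilities would instead be derived from category-specific topic mixtures, which is where variational inference would come in.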
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, S., Li, K., Liu, Y. (2008). Text Categorization Based on Topic Model. In: Wang, G., Li, T., Grzymala-Busse, J.W., Miao, D., Skowron, A., Yao, Y. (eds) Rough Sets and Knowledge Technology. RSKT 2008. Lecture Notes in Computer Science, vol 5009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79721-0_77
DOI: https://doi.org/10.1007/978-3-540-79721-0_77
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79720-3
Online ISBN: 978-3-540-79721-0
eBook Packages: Computer Science, Computer Science (R0)