Abstract
The Latent Dirichlet Allocation (LDA) probabilistic topic model is an effective dimension-reduction tool that automatically extracts latent topics and represents text in a lower-dimensional semantic topic space. However, the original LDA and most of its variants are unsupervised and make no use of the category labels of the documents in the training corpus. Most of them also treat all terms in the vocabulary as equally important, although term weights differ, especially in a skewed corpus in which some categories have many more samples than others. We therefore propose a supervised parameter estimation method, based on category and document information, that estimates the parameters of LDA according to term weight. Comparative experiments show that the proposed method performs better on skewed text classification and substantially improves the recall and precision of the minority category.
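To make the idea of estimating LDA parameters according to term weight concrete, the following is a minimal sketch and not the method of this paper, whose formulas are not given in the abstract. It assumes that topic assignments are already available (for example from a Gibbs sampler) and replaces the raw topic-term counts in the usual smoothed estimate of the topic-term distribution with weighted counts. The function name `estimate_phi`, the toy corpus, and the weight values are all hypothetical.

```python
from collections import defaultdict


def estimate_phi(tokens, assignments, term_weight, num_topics, vocab, beta=0.01):
    """Smoothed topic-term estimate using weighted counts (illustrative only).

    tokens:      list of terms, one per token in the corpus
    assignments: list of topic ids, aligned with tokens
    term_weight: dict mapping term -> positive weight (hypothetical scheme);
                 terms missing from the dict get weight 1.0
    """
    weighted = [defaultdict(float) for _ in range(num_topics)]
    for term, topic in zip(tokens, assignments):
        # Each token contributes its term weight instead of a count of 1.
        weighted[topic][term] += term_weight.get(term, 1.0)

    phi = []
    for topic in range(num_topics):
        # Normaliser: total weighted mass of the topic plus Dirichlet smoothing.
        total = sum(weighted[topic].values()) + beta * len(vocab)
        phi.append({t: (weighted[topic][t] + beta) / total for t in vocab})
    return phi


# Toy data (assumed): two topics, four terms, fixed topic assignments.
vocab = ["goal", "match", "stock", "market"]
tokens = ["goal", "match", "goal", "stock", "market", "stock"]
assignments = [0, 0, 0, 1, 1, 0]

uniform = {t: 1.0 for t in vocab}
# Hypothetical weighting that emphasises a term typical of a minority category.
weighted_terms = dict(uniform, stock=3.0)

phi_uniform = estimate_phi(tokens, assignments, uniform, 2, vocab)
phi_weighted = estimate_phi(tokens, assignments, weighted_terms, 2, vocab)
```

With uniform weights the sketch reduces to the standard count-based estimate. Raising the weight of a term raises its probability in every topic it is assigned to, which is the kind of effect a term-weighted estimator is meant to produce for terms that characterise a minority category.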
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhenyan, L., Dan, M., Weiping, W., Chunxia, Z. (2015). A Supervised Parameter Estimation Method of LDA. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science, vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_33
DOI: https://doi.org/10.1007/978-3-319-25255-1_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25254-4
Online ISBN: 978-3-319-25255-1
eBook Packages: Computer Science, Computer Science (R0)