Abstract
This paper compares two topic modeling algorithms, Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI), and one feature selection algorithm, chi-square, for extracting feature words from news text. After feature extraction, three classifiers (Logistic Regression, Naive Bayes, and SVM) are compared on news classification. In the tests, the combination of LSI and Logistic Regression gives the best result among the compared methods, with a precision of 96% and a recall of 95%.
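The abstract does not show how the chi-square scores are computed, so the following is only a minimal sketch of the standard chi-square statistic for a term and a class, built from the 2x2 document-count contingency table. The toy corpus, the labels, and the helper names `chi_square` and `top_terms` are assumptions made for illustration; they are not taken from the paper.

```python
def chi_square(docs, labels, term, target):
    """Chi-square statistic between presence of `term` and class `target`.

    docs   -- list of token sets, one per document (hypothetical toy data)
    labels -- list of class labels, aligned with docs
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # in class, contains term
        elif has_term:
            b += 1  # other class, contains term
        elif in_class:
            c += 1  # in class, lacks term
        else:
            d += 1  # other class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is constant, so no dependence is measurable
    return n * (a * d - c * b) ** 2 / denom


def top_terms(docs, labels, target, k):
    """Return the k terms with the highest chi-square score for `target`."""
    vocab = sorted({t for doc in docs for t in doc})
    scored = [(chi_square(docs, labels, t, target), t) for t in vocab]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical four-document corpus with two news categories.
docs = [
    set("team wins match".split()),
    set("team loses match".split()),
    set("market shares fall".split()),
    set("market shares rise".split()),
]
labels = ["sport", "sport", "finance", "finance"]

print(top_terms(docs, labels, "sport", 4))
```

Note that the statistic measures dependence in either direction, so terms that occur only in the other class score as highly as terms that occur only in the target class. In a pipeline like the one the abstract describes, the selected terms would then define the feature vectors passed to a classifier.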
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sun, Y., Platoš, J. (2020). Text Classification Based on Topic Modeling and Chi-square. In: Pan, JS., Lin, JW., Liang, Y., Chu, SC. (eds) Genetic and Evolutionary Computing. ICGEC 2019. Advances in Intelligent Systems and Computing, vol 1107. Springer, Singapore. https://doi.org/10.1007/978-981-15-3308-2_56
DOI: https://doi.org/10.1007/978-981-15-3308-2_56
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3307-5
Online ISBN: 978-981-15-3308-2
eBook Packages: Intelligent Technologies and Robotics (R0)