Abstract
Feature extraction is one of the challenging tasks in the machine learning (ML) arena: the more features one can extract correctly, the more accurate the knowledge one can draw from data. Latent Dirichlet Allocation (LDA) is a form of topic modeling used to extract features from text data, but finding the optimal number of topics, on which the success of LDA depends, is highly challenging, especially when there is no prior knowledge about the data. Some studies suggest perplexity, some the Rate of Perplexity Change (RPC), and some topic coherence as a method for finding the number of topics that achieves both good accuracy and low processing time for LDA. In this study, the authors propose two new methods, Normalized Absolute Coherence (NAC) and Normalized Absolute Perplexity (NAP), for predicting the optimal number of topics. The authors run rigorous ML experiments to measure and compare the reliability of the existing methods (perplexity, coherence, RPC) and of the proposed NAC and NAP in searching for the optimal number of topics in LDA. The study shows that NAC and NAP work better than the existing methods, and also suggests that perplexity, coherence, and RPC can be distracting and misleading when estimating the optimal number of topics.
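To make the baseline concrete, the Rate of Perplexity Change (RPC) mentioned above is commonly computed, following the heuristic of Zhao et al. (2015), as the absolute perplexity difference between consecutive candidate topic counts divided by the step in topic count. The sketch below is illustrative only: the perplexity values are made-up numbers, not results from this paper, and the paper's own NAC/NAP formulas are not reproduced here.

```python
# Minimal sketch of the RPC baseline (Zhao et al., 2015).
# The perplexity values are hypothetical, for illustration only.

def rate_of_perplexity_change(topic_counts, perplexities):
    """RPC(i) = |P(i) - P(i-1)| / (k(i) - k(i-1)) for i >= 1,
    where k(i) is the i-th candidate topic count and P(i) its perplexity."""
    rpc = []
    for i in range(1, len(topic_counts)):
        dk = topic_counts[i] - topic_counts[i - 1]
        rpc.append(abs(perplexities[i] - perplexities[i - 1]) / dk)
    return rpc

# Candidate topic counts and illustrative held-out perplexities.
topic_counts = [5, 10, 15, 20, 25, 30]
perplexities = [820.0, 640.0, 555.0, 530.0, 521.0, 518.0]

rpc = rate_of_perplexity_change(topic_counts, perplexities)
print([round(v, 2) for v in rpc])  # → [36.0, 17.0, 5.0, 1.8, 0.6]
```

The curve flattening out (RPC shrinking toward zero) is the signal the RPC heuristic looks for; the paper argues this signal can be unreliable, which motivates the normalized NAC/NAP alternatives.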
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Hasan, M., Rahman, A., Karim, M.R., Khan, M.S.I., Islam, M.J. (2021). Normalized Approach to Find Optimal Number of Topics in Latent Dirichlet Allocation (LDA). In: Kaiser, M.S., Bandyopadhyay, A., Mahmud, M., Ray, K. (eds) Proceedings of International Conference on Trends in Computational and Cognitive Engineering. Advances in Intelligent Systems and Computing, vol 1309. Springer, Singapore. https://doi.org/10.1007/978-981-33-4673-4_27
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4672-7
Online ISBN: 978-981-33-4673-4
eBook Packages: Intelligent Technologies and Robotics (R0)