Abstract
Researchers across the globe have become increasingly interested in how important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be considered to comprise both its own words and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the content of the document itself via a probabilistic generative model. The citation-content-LDA topic model is a two-level topic model that uses citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
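To make the estimation step concrete, the following is a minimal sketch of collapsed Gibbs sampling for plain, single-level LDA over word tokens only. It is not the two-level citation-content-LDA model described above, which additionally draws ‘father’ topics from citations. The function name `collapsed_gibbs_lda`, its parameters, and the toy hyperparameter values are assumptions made for illustration.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for plain LDA (illustrative sketch only).

    docs is a list of documents, each a list of integer word ids in
    [0, vocab_size). Returns the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all count tables.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word
```

In this sketch the topic proportions and topic-word distributions are integrated out, so the sampler only tracks count tables. A two-level model of the kind the abstract describes would need further count tables and a second sampling step for the citation level.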
Additional information
Project supported by the National Basic Research Program (973) of China (No. 2012CB316400)
Electronic supplementary materials: The online version of this article (https://doi.org/10.1631/FITEE.1601125) contains supplementary materials, which are available to authorized users
About this article
Cite this article
Zhou, Hk., Yu, Hm. & Hu, R. Topic discovery and evolution in scientific literature based on content and citations. Frontiers Inf Technol Electronic Eng 18, 1511–1524 (2017). https://doi.org/10.1631/FITEE.1601125
DOI: https://doi.org/10.1631/FITEE.1601125