Abstract
Topic models have been successfully applied to uncover hidden probabilistic structures in collections of documents, where documents are treated as unstructured texts. However, it is not uncommon that some documents, which we call multi-part documents, are composed of multiple named parts. To exploit the information buried in the document-part relationships in the process of topic modeling, this paper adopts two assumptions: the first is that all parts in a given document should have similar topic distributions, and the second is that the multiple versions (corresponding to multiple named parts) of a given topic should have similar word distributions. Based on these two underlying assumptions, we propose a novel topic model for multi-part documents, called Multi-Part Topic Model (or MPTM in short), and develop its construction and inference method with the aid of the techniques of collapsed Gibbs sampling and maximum likelihood estimation. Experimental results on real datasets demonstrate that our approach has not only achieved significant improvement on the qualities of discovered topics, but also boosted the performance in information retrieval and document classification.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
References
Andrzejewski, D., Zhu, X., Craven, M.: Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. In: ICML, pp. 25–32 (2009)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
Blei, D., Lafferty, J.: Correlated topic models. Advances in neural information processing systems 18, 147–154 (2006). MIT Press, Cambridge, MA
Blei, D., McAuliffe, J.: Supervised topic models. (2010). arXiv preprint arXiv:1003.0783
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3), 27 (2011)
Chen, Z., Mukherjee, A., Liu, B., Hsu, M., Castellanos, M., Ghosh, R.: Exploiting domain knowledge in aspect extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013) (2013)
Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 101(Suppl 1), 5228–5235 (2004)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57 (1999)
Jagarlamudi, J., Daumé III, H., and Udupa, R.: Incorporating lexical priors into topic models. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213 (2012)
Lacoste-Julien, S., Sha, F., and Jordan, M.: DiscLDA: discriminative learning for dimensionality reduction and classification. In: Advances in Neural Information Processing Systems, pp. 89–904 (2008)
Lau, J.H., Baldwin, T., Newman, D.: On collocations and topic models. ACM Transactions on Speech and Language Processing (TSLP) 10(3), 10 (2013)
Li, W., McCallum, A.: Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 577–584 (2006)
Mimno, D., Wallach, H.M., Naradowsky, J., Smith, D.A., McCallum, A.: Polylingual topic models. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 880–889 (2009)
Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272 (2011)
Minka, T.: Estimating a Dirichlet distribution. Technical Report (2012). http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf
Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human Language Technologies: Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100–108 (2010)
Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 248–256 (2009)
Tam, Y.-C., Schultz, T.: Correlated latent semantic model for unsupervised LM adaptation. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 41–44 (2007)
Zhu, X., Blei, D., Lafferty, J.: TagLDA: bringing document structure knowledge into topic models. Technical Report TR-1553, University of Wisconsin (2006)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Xie, Z., Jiang, L., Ye, T., He, Z. (2015). MPTM: A Topic Model for Multi-Part Documents. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science(), vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_10
Download citation
DOI: https://doi.org/10.1007/978-3-319-18123-3_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer ScienceComputer Science (R0)