Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora

Tran, Nam Khanh; Zerr, Sergej; Bischoff, Kerstin; Niederée, Claudia; Krestel, Ralf

doi:10.1007/978-3-642-40501-3_30

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8092)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Topic modeling has become a popular means of identifying and describing the topical structure of textual documents and whole corpora. There are, however, many document collections, such as qualitative studies in the digital humanities, that cannot easily benefit from this technology. The limited size of those corpora leads to poor-quality topic models. Higher-quality topic models can be learned by incorporating additional domain-specific documents with similar topical content. This, however, requires finding or even manually composing such corpora, which takes considerable effort. To solve this problem, we propose a fully automated, adaptable process of topic cropping. To learn topics, this process automatically tailors a domain-specific cropping corpus from a general corpus such as Wikipedia. The learned topic model is then mapped to the working corpus via topic inference. Evaluation with a real-world data set shows that the learned topics are of higher quality than those learned from the working corpus alone. Specifically, we analyzed the learned topics with respect to coherence, diversity, and relevance.
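
The three steps named in the abstract (tailor a cropping corpus, learn topics on it, infer topics for the working corpus) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how documents are selected from the general corpus, so ranking by cosine similarity of term counts is an assumption made here, and a plain collapsed Gibbs sampler for LDA stands in for whatever toolkit the authors used. The function names `crop`, `train_lda`, and `infer`, and all parameter defaults, are hypothetical.

```python
import math
import random
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def crop(general_docs, working_docs, k):
    """Assumed selection rule: keep the k general documents whose term
    counts are most similar to the pooled working corpus."""
    profile = Counter(t for d in working_docs for t in tokenize(d))
    ranked = sorted(general_docs,
                    key=lambda d: cosine(Counter(tokenize(d)), profile),
                    reverse=True)
    return ranked[:k]

def train_lda(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs.
    Returns one word distribution (dict) per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    nkw = [Counter() for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                          # tokens per topic
    ndk = [[0] * n_topics for _ in docs]         # doc-topic counts
    z = []
    for d, doc in enumerate(docs):               # random initial assignment
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            nkw[k][w] += 1; nk[k] += 1; ndk[d][k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                      # remove current assignment
                nkw[k][w] -= 1; nk[k] -= 1; ndk[d][k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights)[0]
                z[d][i] = k                      # resample and restore counts
                nkw[k][w] += 1; nk[k] += 1; ndk[d][k] += 1
    return [{w: (nkw[t][w] + beta) / (nk[t] + V * beta) for w in vocab}
            for t in range(n_topics)]

def infer(doc, phi, iters=50, alpha=0.1, seed=0):
    """Estimate the topic mixture of an unseen tokenized doc while the
    topic-word distributions phi stay fixed."""
    rng = random.Random(seed)
    K = len(phi)
    words = [w for w in doc if w in phi[0]]      # drop out-of-vocabulary words
    counts = [0] * K
    z = []
    for w in words:
        k = rng.randrange(K)
        z.append(k)
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(words):
            counts[z[i]] -= 1
            weights = [(counts[t] + alpha) * phi[t][w] for t in range(K)]
            z[i] = rng.choices(range(K), weights)[0]
            counts[z[i]] += 1
    total = len(words) + K * alpha
    return [(c + alpha) / total for c in counts]
```

In this sketch, `crop` would be called with the general corpus and the working corpus, `train_lda` with the tokenized cropped documents, and `infer` once per tokenized working document. Words of the working corpus that never occur in the cropping corpus are ignored during inference, which is one reason the quality of the cropping step matters.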

Author information

Authors and Affiliations

Leibniz Universität Hannover / Forschungszentrum L3S, Hannover, Germany
Nam Khanh Tran, Sergej Zerr, Kerstin Bischoff & Claudia Niederée
Bren School of Information and Computer Sciences, University of California, Irvine, USA
Ralf Krestel

Editor information

Editors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, 7491, Trondheim, Norway
Trond Aalberg
Department of Archives and Library Science, Ionian University, 49100, Corfu, Greece
Christos Papatheodorou
Department of Library Information and Archive Sciences, University of Malta, MSD2280, Msida, Malta
Milena Dobreva
Library and Information Center, University of Patras, 26504, Patras, Greece
Giannis Tsakonas
National Archives of Malta, RBT1043, Rabat, Malta
Charles J. Farrugia

About this paper

Cite this paper

Tran, N.K., Zerr, S., Bischoff, K., Niederée, C., Krestel, R. (2013). Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40501-3_30

DOI: https://doi.org/10.1007/978-3-642-40501-3_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40500-6
Online ISBN: 978-3-642-40501-3
eBook Packages: Computer Science, Computer Science (R0)
