Topic Modeling for Short Texts: A Novel Modeling Method

Chapter in: AI and IoT for Sustainable Development in Emerging Countries

Abstract

Topic modeling is one of the major concerns in the short-text area, and mining these texts can uncover meaningful insights. However, the extreme sparsity and imbalance of short texts pose new challenges to conventional topic models. In this paper, we combine a new ranking method with a hierarchical representation for short texts. Word ranking proves indispensable for generating value from disorganized short texts; thus, a novel ranking paradigm is developed, referred to as the ordered biterm topic model (OBTM). OBTM models the semantic connection between every pair of words, regardless of whether they appear in the same short content, reinforcing its capacity to reveal the genuine semantic patterns behind the corpus. The rich contextual information maintained in the word-to-word space supports word discrimination in recognizing relevant topics with reliable quality. The paradigm is then coupled with a hierarchical representation that captures the relations connecting the generated topics. OBTM learns topics at the corpus level, which makes the inference more effective and robust with respect to hidden semantic patterns. Experiments on real-world collections reveal that OBTM discovers more relevant and coherent topics; it achieves high performance on various tasks and outperforms state-of-the-art baselines.


References

  1. Aletras N, Stevenson M (2013) Evaluating topic coherence using distributional semantics. In: Proceedings of the 10th international conference on computational semantics (IWCS 2013)—long papers, pp 13–22

  2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  3. Carlo CM (2004) Markov chain Monte Carlo and Gibbs sampling. Notes

  4. Chen M, Jin X, Shen D (2011) Short text classification improved by learning multi-granularity topics. In: Twenty-second international joint conference on artificial intelligence

  5. Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941

  6. Dai Z, Sun A, Liu XY (2013) Crest: cluster-based representation enrichment for short text classification. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 256–267

  7. Douven I, Meijs W (2007) Measuring coherence. Synthese 156(3):405–425

  8. Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 209–230

  9. Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, Ruppin E (2002) Placing search in context: the concept revisited. ACM Trans Inf Syst 20(1):116–131

  10. Gao W, Peng M, Wang H, Zhang Y, Xie Q, Tian G (2018) Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst 1–23

  11. Griffiths TL, Jordan MI, Tenenbaum JB, Blei DM (2004) Hierarchical topic models and the nested Chinese restaurant process. In: Advances in neural information processing systems, pp 17–24

  12. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(suppl 1):5228–5235

  13. Gruber A, Weiss Y, Rosen-Zvi M (2007) Hidden topic Markov models. In: Artificial intelligence and statistics, pp 163–170

  14. Hirchoua B, Ouhbi B, Frikh B (2017) A new knowledge capitalization framework in big data context. In: Proceedings of the 19th international conference on information integration and web-based applications & services, iiWAS’17. Association for Computing Machinery, New York, NY, pp 40–48. https://doi.org/10.1145/3151759.3151780

  15. Hirchoua B, Ouhbi B, Frikh B (2019) Topic hierarchies for knowledge capitalization using hierarchical Dirichlet processes in big data context. In: Ezziyyani M (ed) Advanced intelligent systems for sustainable development (AI2SD’2018). Springer International Publishing, Cham, pp 592–608

  16. Hoang T, Le H, Quan T (2019) Towards autoencoding variational inference for aspect-based opinion summary. Appl Artif Intell 1–21

  17. Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp 289–296

  18. Ishwaran H, James LF (2003) Generalized weighted Chinese restaurant processes for species sampling mixture models. Stat Sin 1211–1235

  19. Jiang L, Lu H, Xu M, Wang C (2016) Biterm pseudo document topic model for short text. In: 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI). IEEE, pp 865–872

  20. Jin O, Liu NN, Zhao K, Yu Y, Yang Q (2011) Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM international conference on information and knowledge management. ACM, pp 775–784

  21. Kou F, Du J, Yang C, Shi Y, Liang M, Xue Z, Li H (2019) A multi-feature probabilistic graphical model for social network semantic search. Neurocomputing 336:67–78

  22. Lim KW, Chen C, Buntine W (2016) Twitter-network topic model: a full Bayesian treatment for social network and text modeling. arXiv preprint arXiv:1609.06791

  23. Lin T, Tian W, Mei Q, Cheng H (2014) The dual-sparse topic model: mining focused topics and focused terms in short text. In: Proceedings of the 23rd international conference on world wide web. ACM, pp 539–550

  24. Lu H, Ge G, Li Y, Wang C, Xie J (2018) Exploiting global semantic similarity biterms for short-text topic discovery. In: 2018 IEEE 30th international conference on tools with artificial intelligence (ICTAI), pp 975–982. https://doi.org/10.1109/ICTAI.2018.00151

  25. Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 262–272

  26. Neal RM (2000) Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat 9(2):249–265

  27. Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2–3):103–134

  28. Pang J, Li X, Xie H, Rao Y (2016) SBTM: topic modeling over short texts. In: International conference on database systems for advanced applications. Springer, pp 43–56

  29. Pathak AR, Pandey M, Rautaray S (2021) Topic-level sentiment analysis of social media data using deep learning. Appl Soft Comput 108:107440. https://doi.org/10.1016/j.asoc.2021.107440

  30. Phan XH, Nguyen LM, Horiguchi S (2008) Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on world wide web. ACM, pp 91–100

  31. Pitman J et al (2002) Combinatorial stochastic processes. Technical report 621. Department of Statistics, UC Berkeley. Lecture notes for St. Flour course

  32. Qiang J, Chen P, Ding W, Wang T, Xie F, Wu X (2016) Topic discovery from heterogeneous texts. In: 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI). IEEE, pp 196–203

  33. Razavi AH, Inkpen D (2014) Text representation using multi-level latent Dirichlet allocation. In: Canadian conference on artificial intelligence. Springer, pp 215–226

  34. Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on web search and data mining. ACM, pp 399–408

  35. Ruan D, Han J, Dang Y, Zhang S, Gao K (2017) Modeling on micro-blog topic detection based on semantic dependency. In: 2017 9th international conference on modelling, identification and control (ICMIC), pp 839–844. https://doi.org/10.1109/ICMIC.2017.8321571

  36. Shen Y, Zhang Q, Zhang J, Huang J, Lu Y, Lei K (2018) Improving medical short text classification with semantic expansion using word-cluster embedding. In: International conference on information science and applications. Springer, pp 401–411

  37. Shi T, Kang K, Choo J, Reddy CK (2018) Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In: Proceedings of the 2018 world wide web conference on world wide web. International World Wide Web Conferences Steering Committee, pp 1105–1114

  38. Škrlj B, Martinc M, Kralj J, Lavrač N, Pollak S (2019) tax2vec: constructing interpretable features from taxonomies for short text classification. arXiv preprint arXiv:1902.00438

  39. Sridhar VKR (2015) Unsupervised topic modeling for short texts using distributed representations of words. In: Proceedings of the 1st workshop on vector space modeling for natural language processing, pp 192–200

  40. Vo DT, Ock CY (2015) Learning to classify short text from scientific documents using topic models with various types of knowledge. Expert Syst Appl 42(3):1684–1698

  41. Yi F, Jiang B, Wu J (2020) Topic modeling for short texts via word embedding and document correlation. IEEE Access 8:30692–30705. https://doi.org/10.1109/ACCESS.2020.2973207

  42. Zeng J, Li J, Song Y, Gao C, Lyu MR, King I (2018) Topic memory networks for short text classification. arXiv preprint arXiv:1809.03664

  43. Zhao X, Wang D, Zhao Z, Liu W, Lu C, Zhuang F (2021) A neural topic model with word vectors and entity vectors for short texts. Inf Process Manag 58(2):102455. https://doi.org/10.1016/j.ipm.2020.102455

  44. Zuo Y, Zhao J, Xu K (2016) Word network topic model: a simple but general solution for short and imbalanced texts. Knowl Inf Syst 48(2):379–398

Appendices

Appendix

Parameter Tuning

Manual tuning of the \( \alpha , \beta \) hyper-parameters is avoided, since the model infers suitable values from the statistical distribution of the data. In the preceding experiments, the model consistently achieves its best performance with \(\alpha = 0.01\) and \( \beta = 0.001\). The \(\alpha \) hyper-parameter serves as the prior of the Dirichlet process used to generate topics, while \(\beta \) is the prior used to generate words. The proposed approach determines the number of hidden topics with a Dirichlet process, in which the finite distribution over topics is sampled from a common base distribution that accounts for the countably infinite set of possible topics. For all baseline models, Table 8 lists the experimental settings used in their original papers.
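
To make the role of these hyper-parameters concrete, the following Python sketch (not part of the original chapter; all function names and values are illustrative assumptions) shows how a Dirichlet-process concentration \(\alpha\) lets the number of topics emerge from the data via a Chinese-restaurant-process draw, while a symmetric Dirichlet prior \(\beta\) smooths the per-topic word distribution.

```python
import numpy as np

def crp_topic_assignments(num_items, alpha=0.01, seed=0):
    """Chinese restaurant process: the number of topics is not fixed in
    advance but grows with the data, governed by the concentration alpha."""
    rng = np.random.default_rng(seed)
    assignments = []          # topic index for each item (e.g., biterm)
    counts = []               # number of items currently assigned to each topic
    for _ in range(num_items):
        # joining an existing topic is proportional to its size,
        # opening a new topic is proportional to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):  # a brand-new topic is created
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, len(counts)

def smoothed_topic_word_dist(word_counts, beta=0.001):
    """Per-topic word distribution smoothed by the symmetric Dirichlet prior beta."""
    word_counts = np.asarray(word_counts, dtype=float)
    return (word_counts + beta) / (word_counts.sum() + beta * len(word_counts))

if __name__ == "__main__":
    z, K = crp_topic_assignments(num_items=1000, alpha=0.01)
    print(f"{K} topics emerged for 1000 items")  # K is data-driven, not preset
    print(smoothed_topic_word_dist([5, 0, 2, 1], beta=0.001))
```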

Table 8 The main common parameter settings for OBTM and the baseline methods

Evaluation Metrics

Evaluating short-text topic models is an open problem, and many metrics have been proposed for measuring topic quality. To provide a sound evaluation, this section highlights the evaluation metrics used in this paper.

1.1 Topic Coherence

Human topic ranking remains the gold standard and, consequently, the reference measure of topic interpretability. In recent years, new automatic evaluation methods have been developed to assess topic model quality. Topic coherence reflects the homogeneity of the words that contribute to a topic’s formulation. The proposed approach adopts the \(C_V\) coherence measure proposed by Röder et al. [34]. Notably, \(C_V\) consists of four major parts. The first step segments the data into pairs; more formally, let \(W = \{w_1, \ldots , w_N \}\) be the set of top-N words describing a topic, then \(S_i = \{(W', W^* ) \mid W' = \{w_i\}; w_i \in W; W^* = W \}\) is the set of all pairs. For example, if \(W = \{w_1, w_2, w_3\}\), then the pair \(S_1 = \{(W' = \{w_1\}), (W^* = \{w_1, w_2, w_3\})\}\). Douven et al. [7] assume that the segmentation measures the extent to which the subset \(W^*\) supports, or conversely undermines, the subset \(W'\).
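
As a minimal illustration of this first step (an assumed sketch, not taken from the chapter), the one-set segmentation can be built as follows:

```python
def one_set_segmentation(top_words):
    """Build the C_V 'one-set' segmentation: each pair compares a single
    top word W' = {w_i} against the full top-word set W* = W."""
    W = list(top_words)
    return [({w}, set(W)) for w in W]

# Example: for W = {w1, w2, w3} this yields the three pairs described above.
print(one_set_segmentation(["w1", "w2", "w3"]))
```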

The second step estimates the probability of a single word, \(p(w_i)\), and the joint probability of two words, \(p(w_i, w_j )\), from their frequencies over the corpus. The \(C_V\) measure creates virtual documents using a sliding window: a window of fixed size slides over each document one word token per step, and every window position yields one virtual document. The final probabilities \(p(w_i)\) and \(p(w_i, w_j )\) are then computed as counts over the total number of virtual documents.
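
A rough sketch of this window-based estimation, assuming Boolean document frequencies over the virtual documents and illustrative function names, could look like this:

```python
from collections import Counter
from itertools import combinations

def sliding_window_probs(tokenized_docs, window_size=110):
    """Estimate p(w) and p(w_i, w_j) as Boolean sliding-window frequencies:
    each window position is treated as one virtual document."""
    single, pair, total = Counter(), Counter(), 0
    for tokens in tokenized_docs:
        n = len(tokens)
        # one virtual document per window position (at least one per document)
        for start in range(max(1, n - window_size + 1)):
            window = set(tokens[start:start + window_size])
            total += 1
            single.update(window)
            pair.update(frozenset(p) for p in combinations(sorted(window), 2))
    p_single = {w: c / total for w, c in single.items()}
    p_pair = {ws: c / total for ws, c in pair.items()}
    return p_single, p_pair

# Usage: p1, p2 = sliding_window_probs([["cat", "dog", "pet"], ["dog", "bark"]], window_size=2)
```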

Given a pair \(S_i = (W', W^* )\), the third step calculates a confirmation measure \(\phi \) that reflects how strongly \( W^* \) supports \(W'\). Similarly to Aletras et al. [1], \(W'\) and \(W^*\) are represented as mean context vectors that capture the semantic support of the words in W, using Eq. (10). The agreement between individual words \(w_i\) and \(w_j\) is calculated with the normalized pointwise mutual information (NPMI) in Eq. (11), where the \(\log \) argument is smoothed by adding \(\epsilon \) to \(p(w_i, w_j)\) and the \(\gamma \) parameter controls the weight placed on higher NPMI values. The confirmation measure \(\phi \) for a given pair \(S_i\) is obtained as the cosine similarity of the two context vectors, \(\phi _{ S_i }(\vec {u}, \vec {w})\) (Eq. 12).

$$\begin{aligned} \mathbf {v}(W') = \left\{ \sum _{w_i \in W'} NPMI(w_i, w_j)^{\gamma } \right\} _{j=1,\ldots ,|W|} \end{aligned}$$
(10)
$$\begin{aligned} NPMI(w_i, w_j)^{\gamma } = \Bigg ( \frac{ \log \frac{p(w_i, w_j) + \epsilon }{p(w_i) \cdot p(w_j)}}{-\log \big ( p(w_i, w_j) + \epsilon \big ) } \Bigg )^{\gamma } \end{aligned}$$
(11)
$$\begin{aligned} \phi _{ S_i }(\vec {u}, \vec {w}) = \frac{\sum _{i=1} ^{|W|} {u_i \cdot w_i}}{\Vert \vec {u} \Vert _2 \cdot \Vert \vec {w} \Vert _2 } \end{aligned}$$
(12)

The final step returns the arithmetic mean of all confirmation measures \(\phi \) as the final coherence score.
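
The following sketch ties Eqs. (10)–(12) and the final averaging together. It is an assumed illustration rather than the authors' implementation, and it reuses the `p_single` and `p_pair` output of the `sliding_window_probs` sketch above.

```python
import math
import numpy as np

def npmi(p_single, p_pair, wi, wj, eps=1e-12, gamma=1.0):
    """Smoothed NPMI between two words, raised to gamma (Eq. 11)."""
    if wi == wj:
        pij = p_single.get(wi, 0.0) + eps
    else:
        pij = p_pair.get(frozenset((wi, wj)), 0.0) + eps
    pmi = math.log(pij / (p_single.get(wi, eps) * p_single.get(wj, eps)))
    return (pmi / -math.log(pij)) ** gamma

def context_vector(word_subset, top_words, p_single, p_pair, gamma=1.0):
    """NPMI context vector of a word subset against all top words (Eq. 10)."""
    return np.array([
        sum(npmi(p_single, p_pair, wi, wj, gamma=gamma) for wi in word_subset)
        for wj in top_words
    ])

def c_v_coherence(top_words, p_single, p_pair, gamma=1.0):
    """C_V-style coherence: cosine between the W' and W* context vectors
    (Eq. 12), averaged over all one-set segmentation pairs."""
    v_full = context_vector(top_words, top_words, p_single, p_pair, gamma)
    scores = []
    for w in top_words:
        v_one = context_vector([w], top_words, p_single, p_pair, gamma)
        cos = v_one @ v_full / (np.linalg.norm(v_one) * np.linalg.norm(v_full))
        scores.append(cos)
    return float(np.mean(scores))  # arithmetic mean of all confirmation measures
```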

1.2 Pointwise Mutual Information

Since the \(C_V\) metric evaluates topic model quality internally, the proposed approach is also evaluated with the PMI-Score [25], which measures topic coherence through pointwise mutual information computed on external sources. Because these external data are model-independent, the PMI-Score is fair to all topic models. Given a topic k and its n most probable words \((w_1, \ldots ,w_n)\), the PMI-Score measures the pairwise association between them:

$$\begin{aligned} PMI_{Score}(k) = \frac{2}{n(n-1)} \sum _{1 \le i< j \le n} PMI(w_i, w_j) \end{aligned}$$
(13)
$$\begin{aligned} PMI(w_i, w_j) = \log \frac{p(w_i, w_j) + \epsilon }{p(w_i)p(w_j)} \end{aligned}$$
(14)

PMI is closely related to the empirical conditional log-probability \( \log p(w_j \mid w_i) = \log \frac{p(w_i, w_j)}{p(w_i)}\); here the estimate is smoothed by adding \(\epsilon \) to \(p(w_i, w_j)\).
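
A minimal sketch of the PMI-Score computation (Eqs. 13–14), assuming `p_single` and `p_pair` are estimated from an external reference corpus as the text describes; all names are illustrative.

```python
import math
from itertools import combinations

def pmi_score(top_words, p_single, p_pair, eps=1e-12):
    """Average pairwise PMI over a topic's top words (Eqs. 13-14)."""
    pairs = list(combinations(top_words, 2))
    total = 0.0
    for wi, wj in pairs:
        pij = p_pair.get(frozenset((wi, wj)), 0.0) + eps  # smoothed joint probability
        total += math.log(pij / (p_single.get(wi, eps) * p_single.get(wj, eps)))
    return total / len(pairs)
```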

1.3 Word Similarity

The conditional topic distribution of a word w can be taken as its semantic representation (Eq. 15), where the value \(p(z_k|w)\) is obtained once the Gibbs sampling has completed (Eq. 16).

$$\begin{aligned} s_w = [p(z_1|w), p(z_2|w), \ldots , p(z_k|w)], \end{aligned}$$
(15)
$$\begin{aligned} p(z_k|w) = \frac{n_{w|z_k}}{n_w} \end{aligned}$$
(16)

where \(n_{w|z_k}\) counts how many times w has been assigned to topic k during sampling, and \(n_w\) is the total number of occurrences of w in the corpus. The distance between two words \(w_i\) and \(w_j\), represented by their semantic representations \(s_i\) and \(s_j\), is measured with the Jensen–Shannon divergence:

$$\begin{aligned} JS(s_i,s_j) = \frac{1}{2} D_{KL} (s_j || m) +\frac{1}{2} D_{KL} (s_i || m) \end{aligned}$$
(17)

where \(m = \frac{1}{2}(s_i+s_j)\) and \(D_{KL} (p || q) = \sum _i p_i \ln \frac{p_i}{q_i}\) is the Kullback–Leibler divergence. Cosine similarity is also adopted to measure the distance between two word vectors, defined as:

$$\begin{aligned} Cosine(s_i,s_j) = \frac{s_i \cdot s_j}{\Vert s_i \Vert \; \Vert s_j \Vert } \end{aligned}$$
(18)
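
For completeness, a small, assumed sketch of the word-similarity computation (Eqs. 15–18), starting from hypothetical Gibbs-sampling counts \(n_{w|z_k}\):

```python
import numpy as np

def semantic_representation(topic_assignment_counts):
    """s_w = [p(z_1|w), ..., p(z_K|w)] from Gibbs counts n_{w|z_k} (Eqs. 15-16)."""
    counts = np.asarray(topic_assignment_counts, dtype=float)
    return counts / counts.sum()

def js_divergence(s_i, s_j, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions (Eq. 17)."""
    m = 0.5 * (s_i + s_j)
    kl = lambda p, q: np.sum(p * np.log((p + eps) / (q + eps)))  # Kullback-Leibler
    return 0.5 * kl(s_i, m) + 0.5 * kl(s_j, m)

def cosine_similarity(s_i, s_j):
    """Cosine similarity between two word vectors (Eq. 18)."""
    return float(s_i @ s_j / (np.linalg.norm(s_i) * np.linalg.norm(s_j)))

# Usage with hypothetical counts n_{w|z_k} for two words over three topics:
s1 = semantic_representation([8, 1, 1])
s2 = semantic_representation([7, 2, 1])
print(js_divergence(s1, s2), cosine_similarity(s1, s2))
```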

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Hirchoua, B., Ouhbi, B., Frikh, B. (2022). Topic Modeling for Short Texts: A Novel Modeling Method. In: Boulouard, Z., Ouaissa, M., Ouaissa, M., El Himer, S. (eds) AI and IoT for Sustainable Development in Emerging Countries. Lecture Notes on Data Engineering and Communications Technologies, vol 105. Springer, Cham. https://doi.org/10.1007/978-3-030-90618-4_29
