Topic Modeling for Short Texts: A Novel Modeling Method

Chapter in: AI and IoT for Sustainable Development in Emerging Countries

Abstract

Topic modeling is one of the major concerns in the short-text area, and mining these texts can uncover meaningful insights. However, the extreme sparsity and imbalance of short texts pose new challenges to conventional topic models. In this paper, we combine a new ranking method with a hierarchical representation for short texts. Word ranking proves indispensable for generating value from disorganized short texts; thus, a novel ranking paradigm is developed, referred to as the ordered biterm topic model (OBTM). OBTM models the semantic connection between every pair of words, regardless of whether they appear in the same short content, reinforcing its capacity to reveal the genuine semantic patterns behind the corpus. The rich contextual information maintained in the word-to-word space supports word discrimination in recognizing relevant topics with reliable quality. The paradigm is then coupled with a hierarchical representation that captures the relations connecting the generated topics. OBTM learns topics at the corpus level, which makes the inference more effective and robust with respect to hidden semantic patterns. Experiments on real-world collections reveal that OBTM discovers more relevant and coherent topics; it achieves high performance on various tasks and outperforms state-of-the-art baselines.


References

  1. Aletras N, Stevenson M (2013) Evaluating topic coherence using distributional semantics. In: Proceedings of the 10th international conference on computational semantics (IWCS 2013)—long papers, pp 13–22

  2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  3. Carlo CM (2004) Markov chain Monte Carlo and Gibbs sampling. Notes

  4. Chen M, Jin X, Shen D (2011) Short text classification improved by learning multi-granularity topics. In: Twenty-second international joint conference on artificial intelligence

  5. Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941

  6. Dai Z, Sun A, Liu XY (2013) Crest: cluster-based representation enrichment for short text classification. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 256–267

  7. Douven I, Meijs W (2007) Measuring coherence. Synthese 156(3):405–425

  8. Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 209–230

  9. Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, Ruppin E (2002) Placing search in context: the concept revisited. ACM Trans Inf Syst 20(1):116–131

  10. Gao W, Peng M, Wang H, Zhang Y, Xie Q, Tian G (2018) Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst 1–23

  11. Griffiths TL, Jordan MI, Tenenbaum JB, Blei DM (2004) Hierarchical topic models and the nested Chinese restaurant process. In: Advances in neural information processing systems, pp 17–24

  12. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(suppl 1):5228–5235

  13. Gruber A, Weiss Y, Rosen-Zvi M (2007) Hidden topic Markov models. In: Artificial intelligence and statistics, pp 163–170

  14. Hirchoua B, Ouhbi B, Frikh B (2017) A new knowledge capitalization framework in big data context. In: Proceedings of the 19th international conference on information integration and web-based applications & services, iiWAS’17. Association for Computing Machinery, New York, NY, pp 40–48. https://doi.org/10.1145/3151759.3151780

  15. Hirchoua B, Ouhbi B, Frikh B (2019) Topic hierarchies for knowledge capitalization using hierarchical Dirichlet processes in big data context. In: Ezziyyani M (ed) Advanced intelligent systems for sustainable development (AI2SD’2018). Springer International Publishing, Cham, pp 592–608

  16. Hoang T, Le H, Quan T (2019) Towards autoencoding variational inference for aspect-based opinion summary. Appl Artif Intell 1–21

  17. Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp 289–296

  18. Ishwaran H, James LF (2003) Generalized weighted Chinese restaurant processes for species sampling mixture models. Stat Sin 1211–1235

  19. Jiang L, Lu H, Xu M, Wang C (2016) Biterm pseudo document topic model for short text. In: 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI). IEEE, pp 865–872

  20. Jin O, Liu NN, Zhao K, Yu Y, Yang Q (2011) Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM international conference on information and knowledge management. ACM, pp 775–784

  21. Kou F, Du J, Yang C, Shi Y, Liang M, Xue Z, Li H (2019) A multi-feature probabilistic graphical model for social network semantic search. Neurocomputing 336:67–78

  22. Lim KW, Chen C, Buntine W (2016) Twitter-network topic model: a full Bayesian treatment for social network and text modeling. arXiv preprint arXiv:1609.06791

  23. Lin T, Tian W, Mei Q, Cheng H (2014) The dual-sparse topic model: mining focused topics and focused terms in short text. In: Proceedings of the 23rd international conference on world wide web. ACM, pp 539–550

  24. Lu H, Ge G, Li Y, Wang C, Xie J (2018) Exploiting global semantic similarity biterms for short-text topic discovery. In: 2018 IEEE 30th international conference on tools with artificial intelligence (ICTAI), pp 975–982. https://doi.org/10.1109/ICTAI.2018.00151

  25. Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 262–272

  26. Neal RM (2000) Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat 9(2):249–265

  27. Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2–3):103–134

  28. Pang J, Li X, Xie H, Rao Y (2016) SBTM: topic modeling over short texts. In: International conference on database systems for advanced applications. Springer, pp 43–56

  29. Pathak AR, Pandey M, Rautaray S (2021) Topic-level sentiment analysis of social media data using deep learning. Appl Soft Comput 108:107440. https://doi.org/10.1016/j.asoc.2021.107440

  30. Phan XH, Nguyen LM, Horiguchi S (2008) Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on world wide web. ACM, pp 91–100

  31. Pitman J et al (2002) Combinatorial stochastic processes. Technical report 621. Department of Statistics, UC Berkeley. Lecture notes for St. Flour course

  32. Qiang J, Chen P, Ding W, Wang T, Xie F, Wu X (2016) Topic discovery from heterogeneous texts. In: 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI). IEEE, pp 196–203

  33. Razavi AH, Inkpen D (2014) Text representation using multi-level latent Dirichlet allocation. In: Canadian conference on artificial intelligence. Springer, pp 215–226

  34. Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on web search and data mining. ACM, pp 399–408

  35. Ruan D, Han J, Dang Y, Zhang S, Gao K (2017) Modeling on micro-blog topic detection based on semantic dependency. In: 2017 9th international conference on modelling, identification and control (ICMIC), pp 839–844. https://doi.org/10.1109/ICMIC.2017.8321571

  36. Shen Y, Zhang Q, Zhang J, Huang J, Lu Y, Lei K (2018) Improving medical short text classification with semantic expansion using word-cluster embedding. In: International conference on information science and applications. Springer, pp 401–411

  37. Shi T, Kang K, Choo J, Reddy CK (2018) Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In: Proceedings of the 2018 world wide web conference on world wide web. International World Wide Web Conferences Steering Committee, pp 1105–1114

  38. Škrlj B, Martinc M, Kralj J, Lavrač N, Pollak S (2019) tax2vec: constructing interpretable features from taxonomies for short text classification. arXiv preprint arXiv:1902.00438

  39. Sridhar VKR (2015) Unsupervised topic modeling for short texts using distributed representations of words. In: Proceedings of the 1st workshop on vector space modeling for natural language processing, pp 192–200

  40. Vo DT, Ock CY (2015) Learning to classify short text from scientific documents using topic models with various types of knowledge. Expert Syst Appl 42(3):1684–1698

  41. Yi F, Jiang B, Wu J (2020) Topic modeling for short texts via word embedding and document correlation. IEEE Access 8:30692–30705. https://doi.org/10.1109/ACCESS.2020.2973207

  42. Zeng J, Li J, Song Y, Gao C, Lyu MR, King I (2018) Topic memory networks for short text classification. arXiv preprint arXiv:1809.03664

  43. Zhao X, Wang D, Zhao Z, Liu W, Lu C, Zhuang F (2021) A neural topic model with word vectors and entity vectors for short texts. Inf Process Manag 58(2):102455. https://doi.org/10.1016/j.ipm.2020.102455

  44. Zuo Y, Zhao J, Xu K (2016) Word network topic model: a simple but general solution for short and imbalanced texts. Knowl Inf Syst 48(2):379–398

Appendices

Appendix

Parameter Tuning

Manual tuning of the \( \alpha , \beta \) hyper-parameters is avoided, since the model infers suitable values from the statistical distribution of the data. In the preceding experiments, the model consistently achieves its best performance with \(\alpha = 0.01\) and \( \beta = 0.001\). The \(\alpha \) hyper-parameter serves as the prior of the Dirichlet process used to generate topics, while \(\beta \) is the prior used to generate words. The proposed approach determines the number of hidden topics with a Dirichlet process, in which the finite distribution over topics is sampled from a common base distribution that accounts for the countably infinite set of possible topics. For all baseline models, Table 8 lists the experimental settings used in their original papers.
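
To make the role of these hyper-parameters concrete, the following Python sketch (not part of the original chapter; all function names and values are illustrative assumptions) shows how a Dirichlet-process concentration \(\alpha\) lets the number of topics emerge from the data via a Chinese-restaurant-process draw, while a symmetric Dirichlet prior \(\beta\) smooths the per-topic word distribution.

```python
import numpy as np

def crp_topic_assignments(num_items, alpha=0.01, seed=0):
    """Chinese restaurant process: the number of topics is not fixed in
    advance but grows with the data, governed by the concentration alpha."""
    rng = np.random.default_rng(seed)
    assignments = []          # topic index for each item (e.g., biterm)
    counts = []               # number of items currently assigned to each topic
    for _ in range(num_items):
        # joining an existing topic is proportional to its size,
        # opening a new topic is proportional to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):  # a brand-new topic is created
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, len(counts)

def smoothed_topic_word_dist(word_counts, beta=0.001):
    """Per-topic word distribution smoothed by the symmetric Dirichlet prior beta."""
    word_counts = np.asarray(word_counts, dtype=float)
    return (word_counts + beta) / (word_counts.sum() + beta * len(word_counts))

if __name__ == "__main__":
    z, K = crp_topic_assignments(num_items=1000, alpha=0.01)
    print(f"{K} topics emerged for 1000 items")  # K is data-driven, not preset
    print(smoothed_topic_word_dist([5, 0, 2, 1], beta=0.001))
```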

Table 8 The main common parameter settings for OBTM and the baseline methods

Evaluation Metrics

Evaluating short-text topic models is an open problem, and many metrics have been proposed for measuring topic quality. To provide a sound evaluation, this section highlights the evaluation metrics used in this paper.

1.1 Topic Coherence

Human topic ranking remains the gold standard and, consequently, the reference measure of topic interpretability. In recent years, new automatic evaluation methods have been developed to assess topic model quality. Topic coherence reflects the homogeneity of the words that contribute to a topic’s formulation. The proposed approach adopts the \(C_V\) coherence measure proposed by Röder et al. [34]. Notably, \(C_V\) consists of four major parts. The first step segments the data into pairs; more formally, let \(W = \{w_1, \ldots , w_N \}\) be the set of top-N words describing a topic, then \(S_i = \{(W', W^* ) \mid W' = \{w_i\}; w_i \in W; W^* = W \}\) is the set of all pairs. For example, if \(W = \{w_1, w_2, w_3\}\), then the pair \(S_1 = \{(W' = \{w_1\}), (W^* = \{w_1, w_2, w_3\})\}\). Douven et al. [7] assume that the segmentation measures the extent to which the subset \(W^*\) supports, or conversely undermines, the subset \(W'\).
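
As a minimal illustration of this first step (an assumed sketch, not taken from the chapter), the one-set segmentation can be built as follows:

```python
def one_set_segmentation(top_words):
    """Build the C_V 'one-set' segmentation: each pair compares a single
    top word W' = {w_i} against the full top-word set W* = W."""
    W = list(top_words)
    return [({w}, set(W)) for w in W]

# Example: for W = {w1, w2, w3} this yields the three pairs described above.
print(one_set_segmentation(["w1", "w2", "w3"]))
```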

The second step estimates the probability of a single word, \(p(w_i)\), and the joint probability of two words, \(p(w_i, w_j )\), from their frequencies over the corpus. The \(C_V\) measure creates virtual documents using a sliding window: a window of fixed size slides over each document one word token per step, and every window position yields one virtual document. The final probabilities \(p(w_i)\) and \(p(w_i, w_j )\) are then computed as counts over the total number of virtual documents.
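
A rough sketch of this window-based estimation, assuming Boolean document frequencies over the virtual documents and illustrative function names, could look like this:

```python
from collections import Counter
from itertools import combinations

def sliding_window_probs(tokenized_docs, window_size=110):
    """Estimate p(w) and p(w_i, w_j) as Boolean sliding-window frequencies:
    each window position is treated as one virtual document."""
    single, pair, total = Counter(), Counter(), 0
    for tokens in tokenized_docs:
        n = len(tokens)
        # one virtual document per window position (at least one per document)
        for start in range(max(1, n - window_size + 1)):
            window = set(tokens[start:start + window_size])
            total += 1
            single.update(window)
            pair.update(frozenset(p) for p in combinations(sorted(window), 2))
    p_single = {w: c / total for w, c in single.items()}
    p_pair = {ws: c / total for ws, c in pair.items()}
    return p_single, p_pair

# Usage: p1, p2 = sliding_window_probs([["cat", "dog", "pet"], ["dog", "bark"]], window_size=2)
```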

Given a pair \(S_i = (W', W^* )\), the third step calculates a confirmation measure \(\phi \) that reflects how strongly \( W^* \) supports \(W'\). Similarly to Aletras et al. [1], \(W'\) and \(W^*\) are represented as mean context vectors that capture the semantic support of the words in W, using Eq. (10). The agreement between individual words \(w_i\) and \(w_j\) is calculated with the normalized pointwise mutual information (NPMI) in Eq. (11), where the \(\log \) argument is smoothed by adding \(\epsilon \) to \(p(w_i, w_j)\) and the \(\gamma \) parameter controls the weight placed on higher NPMI values. The confirmation measure \(\phi \) for a given pair \(S_i\) is obtained as the cosine similarity of the two context vectors, \(\phi _{ S_i }(\vec {u}, \vec {w})\) (Eq. 12).

$$\begin{aligned} \mathbf {v}(W') = \left\{ \sum _{w_i \in W'} NPMI(w_i, w_j)^{\gamma } \right\} _{j=1,\ldots ,|W|} \end{aligned}$$
(10)
$$\begin{aligned} NPMI(w_i, w_j)^{\gamma } = \Bigg ( \frac{ \log \frac{p(w_i, w_j) + \epsilon }{p(w_i) \cdot p(w_j)}}{-\log \big ( p(w_i, w_j) + \epsilon \big ) } \Bigg )^{\gamma } \end{aligned}$$
(11)
$$\begin{aligned} \phi _{ S_i }(\vec {u}, \vec {w}) = \frac{\sum _{i=1} ^{|W|} {u_i \cdot w_i}}{\Vert \vec {u} \Vert _2 \cdot \Vert \vec {w} \Vert _2 } \end{aligned}$$
(12)

The final step returns the arithmetic mean of all confirmation measures \(\phi \) as the final coherence score.
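
The following sketch ties Eqs. (10)–(12) and the final averaging together. It is an assumed illustration rather than the authors' implementation, and it reuses the `p_single` and `p_pair` output of the `sliding_window_probs` sketch above.

```python
import math
import numpy as np

def npmi(p_single, p_pair, wi, wj, eps=1e-12, gamma=1.0):
    """Smoothed NPMI between two words, raised to gamma (Eq. 11)."""
    if wi == wj:
        pij = p_single.get(wi, 0.0) + eps
    else:
        pij = p_pair.get(frozenset((wi, wj)), 0.0) + eps
    pmi = math.log(pij / (p_single.get(wi, eps) * p_single.get(wj, eps)))
    return (pmi / -math.log(pij)) ** gamma

def context_vector(word_subset, top_words, p_single, p_pair, gamma=1.0):
    """NPMI context vector of a word subset against all top words (Eq. 10)."""
    return np.array([
        sum(npmi(p_single, p_pair, wi, wj, gamma=gamma) for wi in word_subset)
        for wj in top_words
    ])

def c_v_coherence(top_words, p_single, p_pair, gamma=1.0):
    """C_V-style coherence: cosine between the W' and W* context vectors
    (Eq. 12), averaged over all one-set segmentation pairs."""
    v_full = context_vector(top_words, top_words, p_single, p_pair, gamma)
    scores = []
    for w in top_words:
        v_one = context_vector([w], top_words, p_single, p_pair, gamma)
        cos = v_one @ v_full / (np.linalg.norm(v_one) * np.linalg.norm(v_full))
        scores.append(cos)
    return float(np.mean(scores))  # arithmetic mean of all confirmation measures
```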

1.2 Pointwise Mutual Information

Since the \(C_V\) metric evaluates topic model quality internally, the proposed approach is also evaluated with the PMI-Score [25], which measures topic coherence through pointwise mutual information computed on external sources. Because these external data are model-independent, the PMI-Score is fair to all topic models. Given a topic k and its n most probable words \((w_1, \ldots ,w_n)\), the PMI-Score measures the pairwise association between them:

$$\begin{aligned} PMI_{Score}(k) = \frac{2}{n(n-1)} \sum _{1 \le i< j \le n} PMI(w_i, w_j) \end{aligned}$$
(13)
$$\begin{aligned} PMI(w_i, w_j) = \log \frac{p(w_i, w_j) + \epsilon }{p(w_i)p(w_j)} \end{aligned}$$
(14)

PMI is closely related to the empirical conditional log-probability \( \log p(w_j \mid w_i) = \log \frac{p(w_i, w_j)}{p(w_i)}\); here the estimate is smoothed by adding \(\epsilon \) to \(p(w_i, w_j)\).
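
A minimal sketch of the PMI-Score computation (Eqs. 13–14), assuming `p_single` and `p_pair` are estimated from an external reference corpus as the text describes; all names are illustrative.

```python
import math
from itertools import combinations

def pmi_score(top_words, p_single, p_pair, eps=1e-12):
    """Average pairwise PMI over a topic's top words (Eqs. 13-14)."""
    pairs = list(combinations(top_words, 2))
    total = 0.0
    for wi, wj in pairs:
        pij = p_pair.get(frozenset((wi, wj)), 0.0) + eps  # smoothed joint probability
        total += math.log(pij / (p_single.get(wi, eps) * p_single.get(wj, eps)))
    return total / len(pairs)
```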

1.3 Word Similarity

The conditional topic distribution of a word w can be taken as its semantic representation (Eq. 15), where the value \(p(z_k|w)\) is obtained once the Gibbs sampling has completed (Eq. 16).

$$\begin{aligned} s_w = [p(z_1|w), p(z_2|w), \ldots , p(z_k|w)], \end{aligned}$$
(15)
$$\begin{aligned} p(z_k|w) = \frac{n_{w|z_k}}{n_w} \end{aligned}$$
(16)

where \(n_{w|z_k}\) counts how many times w has been assigned to topic k during sampling, and \(n_w\) is the total number of occurrences of w in the corpus. The distance between two words \(w_i\) and \(w_j\), represented by their semantic representations \(s_i\) and \(s_j\), is measured with the Jensen–Shannon divergence:

$$\begin{aligned} JS(s_i,s_j) = \frac{1}{2} D_{KL} (s_j || m) +\frac{1}{2} D_{KL} (s_i || m) \end{aligned}$$
(17)

where \(m = \frac{1}{2}(s_i+s_j)\) and \(D_{KL} (p || q) = \sum _i p_i \ln \frac{p_i}{q_i}\) is the Kullback–Leibler divergence. Cosine similarity is also adopted to measure the distance between two word vectors, defined as:

$$\begin{aligned} Cosine(s_i,s_j) = \frac{s_i \cdot s_j}{\Vert s_i \Vert \; \Vert s_j \Vert } \end{aligned}$$
(18)
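
For completeness, a small, assumed sketch of the word-similarity computation (Eqs. 15–18), starting from hypothetical Gibbs-sampling counts \(n_{w|z_k}\):

```python
import numpy as np

def semantic_representation(topic_assignment_counts):
    """s_w = [p(z_1|w), ..., p(z_K|w)] from Gibbs counts n_{w|z_k} (Eqs. 15-16)."""
    counts = np.asarray(topic_assignment_counts, dtype=float)
    return counts / counts.sum()

def js_divergence(s_i, s_j, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions (Eq. 17)."""
    m = 0.5 * (s_i + s_j)
    kl = lambda p, q: np.sum(p * np.log((p + eps) / (q + eps)))  # Kullback-Leibler
    return 0.5 * kl(s_i, m) + 0.5 * kl(s_j, m)

def cosine_similarity(s_i, s_j):
    """Cosine similarity between two word vectors (Eq. 18)."""
    return float(s_i @ s_j / (np.linalg.norm(s_i) * np.linalg.norm(s_j)))

# Usage with hypothetical counts n_{w|z_k} for two words over three topics:
s1 = semantic_representation([8, 1, 1])
s2 = semantic_representation([7, 2, 1])
print(js_divergence(s1, s2), cosine_similarity(s1, s2))
```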

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Hirchoua, B., Ouhbi, B., Frikh, B. (2022). Topic Modeling for Short Texts: A Novel Modeling Method. In: Boulouard, Z., Ouaissa, M., Ouaissa, M., El Himer, S. (eds) AI and IoT for Sustainable Development in Emerging Countries. Lecture Notes on Data Engineering and Communications Technologies, vol 105. Springer, Cham. https://doi.org/10.1007/978-3-030-90618-4_29
