1 Introduction

Nowadays, people are inclined towards social network websites such as Twitter and Facebook to share their opinions and emotions. Although there is no strict limit on message size, users are more comfortable expressing their comments on distinct topics such as politics, news, events, and sports in short form. A message consists of only a few words, which makes it different from other text. Moreover, because of the limited number of words, most words occur only once in each message. Classifying emotions such as joy, surprise, and sadness is most challenging on short text because it carries little information compared to longer text. Another major challenge of short text for researchers is feature sparsity: two short texts may share no words yet still be semantically correlated, and each word can convey multiple meanings depending on its context (Erik et al. 2017).

Emotion detection using topic-level modeling is one solution, where each document is a mixture of topics, words in the corpus are semantically correlated with one another, and each topic captures the emotions of unlabelled documents. Bao et al. developed the emotion-topic model, which generates more informative and coherent topics under different emotion labels (Bao et al. 2012). The labeled latent Dirichlet allocation model defines a one-to-one mapping between topics and emotion labels, which neglects latent topic features (Ramage et al. 2009). Other joint emotion-topic models, such as the multi-label supervised topic model (MSTM) and sentiment latent topic model (SLTM) (Rao et al. 2014a) and the affective topic model (ATM) (Rao et al. 2014b), introduce an intermediate layer into LDA. In the contextual sentiment topic model (CSTM) proposed by Rao (2016), words are drawn from both a contextual theme and a background theme: the contextual theme carries discriminative contextual information, whereas the background theme carries non-discriminative information, so the classification of emotion labels relies on context-independent topics. These models rely on the bag-of-words assumption and ignore word order in the sentence. To address this issue, the hidden topic Markov model (HTMM) (Gruber et al. 2007) considers sentence structure and word order when generating topics. Although PLSA and LDA are popular and successful models in text mining, they cannot create proper topical knowledge over short text because of word co-occurrence sparsity. To handle data sparsity over short text, the biterm topic model (BTM) was proposed (Yan et al. 2013; Cheng et al. 2014), which assumes that two co-occurring words belong to the same topic and thereby learns more accurate topics from short messages. BTM is simple and easy to implement, but it is time-consuming on large datasets. Moreover, predicting emotions from the topic features generated by BTM is difficult without knowledge of labeled documents.

To the best of our knowledge, little work has been done on this problem. Motivated by the concerns mentioned above, this paper proposes two supervised topic models to detect the emotions of labeled documents: the weighted labeled topic model (WLTM) and the affective biterm emotion-topic model (ABET). WLTM models the probability distribution of biterms of unlabelled documents. ABET, a multi-labeled topic model, models the probability distribution of topics over emotions to predict emotion distributions from training data. The BTM concept was adopted to solve the data sparsity problem and learn more latent topics. Because of BTM's high time complexity, and inspired by the highly efficient models LightLDA (Yuan et al. 2015) and AliasLDA (Li et al. 2014), the alias method and the Metropolis-Hastings (MH) algorithm are applied to reduce the sampling complexity of the Gibbs sampling algorithm.

The main contributions of our work are as follows:

  1. A set of biterms for the whole corpus is generated, and the alias table method and Metropolis-Hastings (MH) algorithm are applied to reduce the sampling complexity from O(K) to O(1).

  2. The generative process of ABET is followed to generate a probability distribution from each label to multiple topics.

  3. The probability distribution of each biterm of the WLTM model is created.

  4. A radial basis function is employed to predict emotions with both the WLTM and ABET models.

  5. Two public short-text datasets, ISEAR and SemEval, were used to conduct the experiments.

The organization of this paper is as follows. Section 2 presents related work on the analysis of short text and emotion prediction on short text. Section 3 explains our research models, WLTM and ABET, in detail. The results are evaluated in Sect. 4, and the conclusion is presented in Sect. 5.

2 Related work

Research on sentiment and emotion analysis in text has attracted researchers due to its wide applications, mainly in stock prediction (Bollen et al. 2011), advertisement and product recommender systems (Bougie et al. 2003), and marketing strategies based on consumers' emotions (Mohammad and Yang 2013). The fundamental methods of emotion prediction fall into three categories: lexicon-based, supervised learning, and unsupervised learning. The lexicon-based method creates dictionaries at the concept, word, or emotion/sentiment level to detect emotions. It is domain-dependent, but no training set is required. Because lexicon-based approaches depend on the lexicon, their accuracy depends on the availability of the word-emotion pairs in the respective lexicon.

The data sparsity problem is the main issue in short text. Various solutions have been proposed to tackle it for short messages, including tweets, news headlines, Q&A posts, and status messages. Several methods aggregate short texts based on some shared information to increase message length before training a model (Zhao et al. 2011; Weng et al. 2010). Hong and Davison (2010) proposed another solution: aggregate tweets containing common terms and train the author-topic and standard topic models (Rosen-Zvi et al. 2004). In recent years, BTM has achieved great success by exploiting the co-occurrence of word pairs in documents (Yan et al. 2013; Cheng et al. 2014). For a short text over a vocabulary of V words, BTM creates biterms, unordered co-occurring word pairs, assuming that both words of a biterm share the same topic. Predicting the emotions of labeled documents via a supervised model is another challenging issue. Gibbs sampling has been employed to efficiently estimate the parameters of many topic models (Yuan et al. 2015), but it consumes much time as the number of documents or topics grows; combining the alias method and the Metropolis-Hastings algorithm in parameter estimation addresses this issue.

The principle of the topic-based method was first described for emotion classification (Bao et al. 2009, 2012): the emotion of a text is correlated with its topics rather than with individual words. A popular topic model, latent Dirichlet allocation (LDA) (Blei et al. 2003), is used in text mining tasks such as information retrieval. An intermediate layer was added to the LDA model to associate emotions with topics.

Emotion-LDA (ELDA) (Rao et al. 2014c) was proposed to deal with social emotions at the word and topic levels. However, ELDA is not a supervised model. A single word can have different meanings across documents. The probability of each social emotion is estimated by maximum likelihood estimation for each topic.

The topic-over-time (TOT) model proposed by Wang and Andrew (2006) and the supervised topic model (STM) studied by Blei and McAuliffe (2007) are single-label topic models and hence unsuitable for the emotion classification task. Two popular emotion-topic models, the multi-label supervised topic model (MSTM) and the sentiment latent topic model (SLTM) (Rao et al. 2014a), introduced an emotional layer into LDA to predict emotions; 4570 news articles collected from Sina news were used for evaluation. Another multi-label topic model, the affective topic model (ATM), was evaluated for emotion classification (Rao et al. 2014b). The Siamese network-based supervised topic model (SNSTM) (Huang et al. 2018) is built by joining documents and emotion labels, with the weight matrices treated as a conditional distribution.

Short-text documents are collected from various sources and labeled manually with readers' emotions. A supervised topic model, the universal affective model (UAM) (Liang et al. 2018), combines a word-emotion dictionary with the topic model to solve the data sparsity problem; it consists of two sub-models, one at the word level and one at the topic level. Another topic-level model feeds the results of an unsupervised topic model into a maximum entropy classifier to address data sparsity. Its performance, evaluated on a real-world dataset, is quite accurate.

Machine learning techniques have been used to automatically recognize emotions from different contexts, such as the companions, locations, and activities of 32 participants (Salido Ortega et al. 2020). Information about human emotions has been extracted from facial expressions (angry, happy, or sad), head movement (frequency and direction), and eye gaze (averted or direct), with soft computing techniques employed to infer cognitive and emotional states (Zhao et al. 2013). Machine learning and sensor techniques have been applied to recognize emotions from the facial expressions of people with autism spectrum disorder (Sivasangari et al. 2019).

Classification algorithms such as support vector machines (Pang et al. 2002), naive Bayes (Kim et al. 2006), maximum entropy (Li et al. 2016), deep memory networks (Tang et al. 2016), and H-Sentic LSTM and Sentic LSTM (Ma et al. 2018) are supervised learning methods used to detect sentiments and emotions in text documents. Unsupervised approaches detect emotion or sentiment orientation by counting co-occurrence frequencies between words; however, such methods suit standard text better than short text.

3 Affective biterm emotion-topic (ABET)

This section proposes an effective supervised topic model for dealing with emotions over short texts. Our objective is to correctly model the links between emotions and words to help predict a document's sentiment. The WLTM and ABET models have been proposed to achieve this goal, and accelerated algorithms are developed to make them more efficient.

3.1 Problem definition

A summary of the terms, notations, and variables used to represent our model is presented in Table 1. Let a corpus contain L labeled short texts \(\left\{{s}_{1}, {s}_{2}, \ldots, {s}_{L}\right\}\), each associated with words and emotions. Each labeled text s has \(L_s\) words, denoted \(w_s = \{w_1, w_2, \ldots, w_{L_s}\}\), and \(N_e\) emotions, denoted \(E_s = \{e_1, e_2, \ldots, e_{N_e}\}\). Here \(L_s\) is the length of the document and V is the total vocabulary size. For each document in the corpus, we create a set of biterms \(B = \{b_i\}_{i=1}^{N_B}\) with \(b_i = (w_{i,1}, w_{i,2})\). For example, a short text \(\{w_1, w_2, w_3\}\) generates the biterms \(\{(w_1, w_2), (w_2, w_3), (w_1, w_3)\}\). Common labels are joy, anger, sadness, surprise, etc. Each document also carries a list of presence/absence indicators \(\Lambda \left( d \right) = \left\{ {l_{1}, l_{2}, \ldots, l_{v} } \right\}\), where each \(l \in \{0, 1\}\).
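As an illustration, the biterm construction described above can be sketched in a few lines of Python (the function name is ours, not from the paper):

```python
from itertools import combinations

def extract_biterms(words):
    """Return all unordered co-occurring word pairs (biterms) of a short text."""
    return list(combinations(words, 2))

# The three-word example above yields three biterms:
print(extract_biterms(["w1", "w2", "w3"]))
# [('w1', 'w2'), ('w1', 'w3'), ('w2', 'w3')]
```

A document of \(L_s\) words thus contributes \(L_s(L_s-1)/2\) biterms, which is why the biterm count grows much faster than the word count.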

Table 1 Notations

Our objective is to correctly model the links between emotions and words so as to improve the performance of predicting the sentiment of a document. To achieve this goal, the weighted labeled topic model (WLTM) first maps each emotion of a training text onto topics; the label-to-topic projection is therefore many-to-many.

3.2 Biterm topic model (BTM)

In LDA, each document d of the corpus is a multinomial distribution over topics, and each topic is a multinomial distribution over words. BTM instead draws a single topic distribution for the whole biterm set. The generative process is given as follows:

For whole biterm set B, draw topic distribution,

$${\theta_{d}} \sim {\text{Dirichlet}} \left( \alpha \right),$$
(1)

where Dirichlet (α) denotes the Dirichlet distribution with hyperparameter α.

For each topic t, draw a word distribution,

$$\phi_{{\text{t}}} \sim {\text{Dirichlet }}\left( \beta \right),$$
(2)

where Dirichlet (β) denotes the Dirichlet distribution with hyperparameter β for topic t.

For each biterm, draw topic t from the multinomial distribution \(\theta_{d}\),

$$t \sim {\text{Multinomial}}\left( {\uptheta } \right).$$
(3)

For biterm b, draw two words from the multinomial distribution \(\phi_{t}\),

$$w_{i} ,w_{j} \sim Multinomial\left( \phi \right).$$
(4)

Gibbs sampling is applied to infer topics because it is more efficient than maximum a posteriori estimation and variational inference. The conditional probability for biterm b is given by,

$$p\left( {t{|}T_{ - b} ,B,\alpha ,\beta } \right) \propto \left( {n_{t}^{ - b} + \alpha } \right)\frac{{\left( {n_{{w_{i} |t}}^{ - b} + \beta } \right)\left( {n_{{w_{j} |t}}^{ - b} + \beta } \right)}}{{\left( {\mathop \sum \nolimits_{w = 1}^{V} n_{w|t}^{ - b} + V\beta } \right)^{2} }},$$
(5)

where B denotes the set of biterms and \(b = \left( {w_{i} ,w_{j} } \right)\). \(T_{ - b}\) denotes the topic assignments of all biterms excluding biterm \(b\). \(n_{t}^{ - b}\) is the number of biterms assigned to topic \(t\), excluding biterm \(b\). \(n_{{w_{i} |t}}^{ - b}\) and \(n_{{w_{j} |t}}^{ - b}\) are the numbers of times words \(w_{i}\) and \(w_{j}\), respectively, are assigned to topic \(t\), excluding biterm \(b\). V is the vocabulary size of the corpus. The probability of each biterm bi given the parameters θ and \(\phi\) is then,

$$P\left( {{\text{b}}_{{\text{i}}} \left| {{\uptheta },\phi } \right.} \right) = \mathop \sum \limits_{{{\text{t}} = 1}}^{{\text{T}}} {\uptheta }_{{\text{t}}} \phi_{{{\text{t}},{\text{w}}_{{{\text{i}},1}} }} \phi_{{{\text{t}},{\text{w}}_{{{\text{i}},2}} }} .$$
(6)

After that likelihood function of total biterms of the whole corpus is computed as follows:

$$P\left( {B\left| {{\Theta },{\Phi }} \right.} \right) = \mathop \prod \limits_{{i = 1}}^{{N_{B} }} \mathop \sum \limits_{t = 1}^{T} \theta_{t} \phi_{{t,w_{i,1} }} \phi_{{t,w_{i,2} }} ,$$
(7)

where T denotes the number of topics in the corpus, θ is a T-dimensional multinomial distribution, and Φ is a T × V matrix. The probability of topic t is denoted \(\theta_{t}\); \(\phi_{t}\) is a V-dimensional multinomial distribution, and the probability of word w given topic t is denoted \(\phi_{t,w}\). After the given number of iterations, the number of biterms assigned to topic t, denoted \(n_{t}\), and the number of times word w of the vocabulary is assigned to topic t, denoted \(n_{w|t}\), are recorded. The topic probabilities θ over the corpus and the word probabilities \(\phi\) conditioned on topics are then computed as follows:

$$\theta_{t} = \frac{{n_{t} + \alpha }}{B + T\alpha },$$
(8)
$$\phi_{t,w} = \frac{{n_{w\left| t \right.} + \beta }}{{\mathop \sum \nolimits_{w = 1}^{V} n_{w\left| t \right.} + V\beta }}.$$
(9)
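Equations (8) and (9) amount to smoothed normalizations of the Gibbs counts. A minimal sketch (the array names are ours) might look like:

```python
import numpy as np

def estimate_theta_phi(n_t, n_wt, alpha, beta):
    """Point estimates of theta (Eq. 8) and phi (Eq. 9) from Gibbs counts.

    n_t  : (T,)   number of biterms assigned to each topic
    n_wt : (T, V) number of times word w is assigned to topic t
    """
    B = n_t.sum()                                    # total number of biterms
    T, V = n_wt.shape
    theta = (n_t + alpha) / (B + T * alpha)          # Eq. (8)
    phi = (n_wt + beta) / (n_wt.sum(axis=1, keepdims=True) + V * beta)  # Eq. (9)
    return theta, phi
```

Both outputs are proper distributions: theta sums to 1 over topics, and each row of phi sums to 1 over the vocabulary.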

For each document d, the topic proportion is computed via the Gibbs sampling algorithm used in WLTM (He et al. 2017). Since BTM cannot model documents directly, the topic proportion of a document \(P\left( {t{|}d} \right)\) is derived from the posterior topic probabilities of its biterms \(b_{i}^{\left( d \right)}\) = (\(w_{i,1}^{\left( d \right)} ,w_{i,2}^{\left( d \right)}\)), assuming each topic t is conditionally independent of d:

$$P\left( {t{|}d} \right) = \mathop \sum \limits_{i = 1}^{N} P(t|b_{i}^{\left( d \right)} )P(b_{i}^{\left( d \right)} |d).$$
(10)

\(P(t|b_{i}^{\left( d \right)} )\) is computed using Bayes' formula as follows:

$$P\left( {t{|}b_{i}^{\left( d \right)} } \right) = \frac{{\theta_{t} \phi_{{t,w_{i,1}^{\left( d \right)} }} \phi_{{t,w_{i,2}^{\left( d \right)} }} }}{{\mathop \sum \nolimits_{{t^{ - i} }} \theta_{{t^{ - i} }} \phi_{{t^{ - i} ,w_{i,1}^{\left( d \right)} }} \phi_{{t^{ - i} ,w_{i,2}^{\left( d \right)} }} }}.$$
(11)
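Equations (10) and (11) can be sketched as below, taking \(P(b_i^{(d)}|d)\) to be uniform over the document's biterms (our assumption; the paper does not specify this distribution):

```python
import numpy as np

def doc_topic_proportion(biterms, theta, phi):
    """P(t|d) from the document's biterms via Eqs. (10)-(11).

    biterms : list of (wi, wj) vocabulary-index pairs of document d
    theta   : (T,)   corpus-level topic distribution
    phi     : (T, V) topic-word distributions
    """
    p_td = np.zeros(len(theta))
    for wi, wj in biterms:
        joint = theta * phi[:, wi] * phi[:, wj]   # numerator of Eq. (11)
        p_td += joint / joint.sum()               # P(t|b), normalized by Bayes' rule
    return p_td / len(biterms)                    # uniform P(b|d) of Eq. (10)
```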

3.3 Affective biterm emotion topic (ABET)

The generative process of ABET is shown as follows:

For emotion \(e \in \left[ {1,N_{E} } \right]\), draw

$$\delta_{e} \sim Dirichlet \left( \alpha \right);$$
(12)

For each topic \(t \in \left[ {1, N_{t} } \right]\), draw

$$\phi_{t} \sim Dirichlet \left( \beta \right);$$
(13)

For each document \(d \in D\), do:

For each biterm \(b_{i} \in d\), do:

Draw

$$e_{i} \sim Multinomial \left( {\gamma_{d} } \right);$$
(14)

Draw

$$t_{i } \sim {\text{ Multinomial}}\left( {\psi_{i} } \right);$$
(15)

Draw

$$w_{i,1} ,w_{i,2} \;\in \;b_{i} \sim Multinomial \left( {\phi_{{t_{i} }} } \right).$$
(16)

Here, the emotion and topic of a biterm are denoted \({e}_{i} \in E\) and \({t}_{i }\in T\), respectively. For each document in the corpus, the emotion vector, normalized so that its entries sum to 1, is used as the parameter of the multinomial distribution.

Based on the generative process, the joint probability of all the samples in each document is as follows:

$$P\left( {\gamma ,\varepsilon ,t,w,\psi ,\varphi ;\alpha ,\beta } \right) = P\left( {\psi ;\alpha } \right)P\left( {\phi ,\beta } \right)P\left( \gamma \right) \times P\left( {\varepsilon {|}\gamma } \right)P\left( {t{|}\varepsilon ,\psi } \right)P\left( {w{|}t,\phi } \right).$$
(17)

The posterior probability of the emotion of each biterm, conditioned on the topic assignments, is

$$P\left({\varepsilon }_{i}=e|\gamma ,{\varepsilon }_{-b},t,w:\alpha ,\beta \right)\propto \frac{\alpha +{nt}_{e,{t}_{i}}^{-b}}{\left|T\right|\alpha +\sum_{t}{nt}_{{\varepsilon }_{i},t}^{-b}}\times \frac{{\gamma }_{{d}_{i},e}}{\sum_{{e}^{^{\prime}}}{\gamma }_{{d}_{i,{e}^{^{\prime}}}}}$$
(18)

The topic of each biterm is sampled conditioned on the remaining assignments as:

$$P\left( {t_{i} = t{|}t_{ - b} ,\gamma ,\varepsilon ,w;\alpha ,\beta } \right) \propto \frac{{\alpha + nt_{{\varepsilon_{i} ,t}}^{ - b} }}{{\left| T \right|\alpha + \mathop \sum \nolimits_{t} nt_{{\varepsilon_{i} ,t}}^{ - b} }} \times \frac{{\left( {\beta + nw_{{t,w_{i} }}^{ - b} } \right)\left( {\beta + nw_{{t,w_{j} }}^{ - b} } \right)}}{{\left| V \right|\beta + \mathop \sum \nolimits_{w} nw_{t,w}^{ - b} }}.$$
(19)

where t and ε serve as the candidate topic and emotion, respectively, and \({w}_{i}\) is the current word extracted from document d. The number of times topic t is assigned to emotion e is denoted \({nt}_{\varepsilon ,t}\), and the number of times word w is assigned to topic t is denoted \({nw}_{t,w}\). The superscript −b on nt indicates topic counts for all topics t excluding the current biterm, and −b on nw indicates word-topic counts excluding biterm b.

Then the posterior probabilities of ψ and φ, estimated from the sampled topics and emotions, are as follows:

$$\psi_{\varepsilon ,t} = \frac{{\alpha + n_{t|\varepsilon } }}{{T\alpha + \mathop \sum \nolimits_{t} n_{t|\varepsilon } }}$$
(20)

and

$${\varphi }_{t,w}=\frac{\beta +{n}_{w|t}}{V\beta +\sum_{t}{n}_{w|t}}$$
(21)

With the given parameters, the probability of word w given emotion ε is predicted by integrating out the latent topic t:

$$P\left( {w{|}\varepsilon } \right) = \mathop \sum \limits_{t} \psi_{\varepsilon ,t} \varphi_{t,w} .$$
(22)

Finally, the emotion distribution of each document d is estimated via Bayes' theorem as:

$$\begin{aligned} P\left( {\varepsilon \left| d \right.} \right) &= \frac{{P(d\left| \varepsilon \right.)P\left( \varepsilon \right)}}{{P\left( d \right)}} \propto {\text{P}}({\text{d}}\left| \varepsilon \right.){\text{P}}(\varepsilon ) \hfill \\ & = {\text{P}}\left( \varepsilon \right)\mathop \prod \limits_{{w \in d}} P(w\left| \varepsilon \right.)^{{\delta _{{d,w}} }}.\end{aligned}$$
(23)
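Equations (22) and (23) combine into a short prediction routine. Working in log space is our choice (to avoid underflow on the long product over words), not something the paper prescribes:

```python
import numpy as np

def predict_emotion(doc_word_counts, psi, phi, p_emotion):
    """Emotion posterior of a document (Eqs. 22-23), computed in log space.

    doc_word_counts : (V,) word counts of the document (the exponents in Eq. 23)
    psi : (E, T) emotion-topic distributions
    phi : (T, V) topic-word distributions
    p_emotion : (E,) prior P(emotion)
    """
    p_w_given_e = psi @ phi                        # Eq. (22): (E, V)
    log_post = np.log(p_emotion) + doc_word_counts @ np.log(p_w_given_e).T
    log_post -= log_post.max()                     # numerical stability
    post = np.exp(log_post)
    return post / post.sum()                       # normalize P(e|d)
```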

A brief procedure of ABET is shown in Algorithm 3.

3.3.1 Acceleration algorithms

Acceleration algorithms for WLTM and ABET have been employed to reduce complexity through Metropolis-Hastings sampling (Geweke and Tanizaki 2001) and the alias method (Walker 1977).

3.3.1.1 Alias method

In general, drawing a sample from a discrete distribution over n outcomes takes O(n) operations. The alias method provides an algorithm that, after an O(n) preprocessing step, draws each sample in O(1) time by reducing the problem to sampling from uniform distributions. It builds a probability table and an alias table; for repeated sampling, each draw finishes in O(1) amortized time, although constructing the tables takes O(n) operations. An example of the probability table and alias table built from a discrete probability distribution is shown in Fig. 1. A detailed description of the method is given in Algorithm 1 and Algorithm 2.
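A minimal Python sketch of the table construction and the O(1) draw (Vose's variant of Walker's method; the function names are ours):

```python
import random

def build_alias_table(probs):
    """Build the probability and alias tables (Walker 1977) in O(n)."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l          # s keeps scaled[s], overflow goes to l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                       # leftovers are numerically 1
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """Draw one outcome in O(1): pick a column, then its head or its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

The design choice is that a single uniform column index plus a single biased coin flip replace the usual linear scan over the cumulative distribution.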

Fig. 1 An example of a probability table and an alias table

3.3.1.2 Metropolis–Hastings

In the Gibbs sampling algorithm for WLTM and ABET, extracting topics in each iteration consumes much time when the total number of biterms is large, so the total cost of the task is relatively high. It also wastes storage space: if an alias table and probability table were built for every biterm, tables of total size B × T would have to be stored for the B biterms. Inspired by LightLDA (Yuan et al. 2015), the alias method and MH sampling are therefore employed together to estimate the parameters cheaply.

3.3.1.3 Parameter estimation

In WLTM, the conditional distribution of BTM decomposes into three parts: \(\left({n}_{t}+\alpha \right)\), \(\frac{\left({n}_{{w}_{i}|t}+\beta \right)}{\left({\sum }_{w=1}^{V}{n}_{w|t}^{-b}+V\beta \right)}\), and \(\frac{\left({n}_{{w}_{j}|t}+\beta \right)}{\left({\sum }_{w=1}^{V}{n}_{w|t}^{-b}+V\beta \right)}\). These parts serve as proposal distributions in MH sampling: \(\left({n}_{t}+\alpha \right)\) is the corpus proposal \({p}_{c}(t)\), and \(\frac{\left({n}_{{w}_{i}|t}+\beta \right)}{\left({\sum }_{w=1}^{V}{n}_{w|t}^{-b}+V\beta \right)}\) is the word proposal \({p}_{{w}_{i}}(t)\).

Corpus proposal distribution is given as follows,

$${p}_{c} \left(t\right)\propto \left({n}_{t}+\alpha \right)$$
(24)

The acceptance probability is \(\mathrm{min}(1,{\pi }_{c})\); when topic \({t}_{1}\) transitions to topic \({t}_{2}\), \({\pi }_{c}^{{t}_{1}\to {t}_{2}}\) is given as:

$$\pi_{c} = \frac{{(n_{{t_{2} }}^{ - b} + \alpha )\left( {n_{{w_{i} |t_{2} }}^{ - b} + \beta } \right)}}{{(n_{{t_{1} }}^{ - b} + \alpha )\left( {n_{{w_{i} |t_{1} }}^{ - b} + \beta } \right)}}.\frac{{(n_{{w_{j} |t_{2} }}^{ - b} + \beta )}}{{\left( {n_{{w_{j} |t_{1} }}^{ - b} + \beta } \right)}} \cdot \frac{{\left( {\mathop \sum \nolimits_{w}^{V} n_{{w|t_{1} }}^{ - b} + V\beta } \right)^{2} \left( {n_{{t_{1} }} + \alpha } \right)}}{{\left( {\mathop \sum \nolimits_{w}^{V} n_{{w|t_{2} }}^{ - b} + V\beta } \right)^{2} \left( {n_{{t_{2} }} + \alpha } \right)}}.$$
(25)

In the corpus proposal, \({p}_{c}\left(t\right)\) decomposes into two parts, \({n}_{t}\) and α. The topic assigned to each biterm \({b}_{i}\) is stored in an array \({T}_{b}\) whose length equals the number of biterms B in the corpus. Topics are first randomly assigned to each biterm, and the current topic of \({b}_{i}\) is taken as the source topic \({t}_{1}\) of the transition.

$$p\left(t\right)=\frac{{n}_{t}}{B}=\frac{\sum_{i=1}^{B}\mathbb{1}\left[{T}_{{b}_{i}}=t\right]}{B}$$
(26)

The probability of a topic for biterm \({b}_{i}\) is given in Eq. (26): drawing an index uniformly from \({T}_{b}\) yields topic t with probability proportional to \({n}_{t}\). Therefore, a topic can be sampled from \({T}_{{b}_{i}}\) in O(1) time. Drawing a topic from the second term also takes O(1) time, because α is constant for all biterms. Both parts can thus be drawn in O(1) from the corpus proposal without building an alias table: a value x is drawn uniformly from the range [0, B + T\(\alpha\)]; if x < B, the topic of biterm int(x) is chosen, otherwise a topic is drawn uniformly via the α term using x − B.
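The O(1) corpus-proposal draw can be sketched as follows (a sketch under the stated decomposition; the array name `topic_of_biterm` is ours):

```python
import random

def sample_corpus_proposal(topic_of_biterm, T, alpha):
    """Draw a topic from p_c(t) ∝ n_t + alpha in O(1), without an alias table.

    topic_of_biterm : array T_b with the current topic of each biterm (length B)
    """
    B = len(topic_of_biterm)
    x = random.uniform(0, B + T * alpha)
    if x < B:
        # n_t part: a uniformly chosen biterm index carries topic t with prob. n_t / B
        return topic_of_biterm[int(x)]
    # alpha part: uniform over the T topics
    return min(int((x - B) / alpha), T - 1)
```

The `min(..., T - 1)` clamp guards the rare boundary case where x lands exactly on the upper end of the range.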

3.3.1.4 Word proposal

Word proposal distribution is given as,

$${\mathrm{p}}_{{\mathrm{w}}_{\mathrm{j}}}(\mathrm{t})\propto \frac{({\mathrm{n}}_{{\mathrm{w}}_{\mathrm{j}}|\mathrm{t}}+\upbeta )}{{\sum }_{\mathrm{w}=1}^{\mathrm{V}}{\mathrm{n}}_{\mathrm{w}|\mathrm{t}}+\mathrm{V\beta }}$$
(27)

The acceptance probability is \(\mathrm{min}(1,{\pi }_{w})\); when topic \({t}_{1}\) transitions to topic \({t}_{2}\), \({\pi }_{{w}_{j}}^{{t}_{1}\to {t}_{2}}\) is given as:

$${\pi }_{{w}_{j}=}\frac{{(n}_{{w}_{i}|{t}_{2}}^{-b}+\beta )\left({n}_{{w}_{j}|{t}_{2}}^{-b}+\beta \right){({\sum }_{w=1}^{V}{n}_{w|{t}_{1}}^{-b}+V\beta )}^{2}}{{{(n}_{{w}_{i}|{t}_{1}}^{-b}+\beta )(n}_{{w}_{j}|{t}_{1}}^{-b}+\beta ){({\sum }_{w=1}^{V}{n}_{w|{t}_{2}}^{-b}+V\beta )}^{2}}\times \frac{{(n}_{{t}_{2}}^{-b}+\alpha )({n}_{{w}_{j}|{t}_{1}}^{-b}+\beta )({\sum }_{w=1}^{V}{n}_{w|{t}_{2}}+V\beta )}{{(n}_{{t}_{1}}^{-b}+\alpha ){(n}_{{w}_{j}|{t}_{2}}^{-b}+\beta )({\sum }_{w=1}^{V}{n}_{w|{t}_{1}}+V\beta )}$$
(28)

Sampling topics directly from the word proposal takes O(K) operations, which is as costly as Gibbs sampling. Therefore, to reduce the cost to O(1) operations, an alias table is constructed (Li et al. 2014; Yuan et al. 2015) for computing \({p}_{{w}_{i}}\).
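Because the alias table is rebuilt only periodically, draws come from a slightly stale proposal, and the MH acceptance step corrects for this. A generic sketch of that step (the callables `draw_proposal`, `target`, and `proposal_prob` are our placeholders for an alias-table draw, the conditional of Eq. (5) or (19), and the proposal density of Eq. (27)):

```python
import random

def mh_step(t_current, draw_proposal, target, proposal_prob):
    """One Metropolis-Hastings step with an independence proposal.

    draw_proposal : () -> candidate topic, e.g. an O(1) alias-table draw
    target        : t -> unnormalized target probability of topic t
    proposal_prob : t -> unnormalized proposal probability of topic t
    """
    t_new = draw_proposal()
    # Acceptance ratio for an independence proposal q: (p(t2) q(t1)) / (p(t1) q(t2))
    pi = (target(t_new) * proposal_prob(t_current)) / (target(t_current) * proposal_prob(t_new))
    return t_new if random.random() < min(1.0, pi) else t_current
```

When the proposal matches the target exactly, π = 1 and every draw is accepted; a stale table makes π deviate from 1, and the occasional rejection restores the correct stationary distribution.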

For parameter estimation in Algorithm 2, Eq. (19) is decomposed into three parts,

\(\frac{\alpha +{nt}_{{\varepsilon }_{i},t}^{-b}}{\left|T\right|\alpha +\sum_{t}{nt}_{{\varepsilon }_{i},{t}^{^{\prime}}}^{-b}}\), \(\frac{(\beta +{nw}_{{t,w}_{i}}^{-b})}{\left|V\right|\beta +\sum_{w}{nw}_{t,w}^{-b}}\), and \(\frac{(\beta +{nw}_{{t,w}_{j}}^{-b})}{\left|V\right|\beta +\sum_{w}{nw}_{t,w}^{-b}}\). The first part is the emotion proposal, while the second and third parts are word proposals.

The emotion proposal is given as follows:

$${p}_{t|{\varepsilon }_{i}}\propto \frac{{n}_{t|{\varepsilon }_{i}}^{-b}}{\left|T\right|\alpha +\sum_{t}{nt}_{{\varepsilon }_{i},t}^{-b}}$$
(29)

The acceptance probability is \(\mathrm{min}(1,{\pi }_{{\varepsilon }_{i}}^{{t}_{1}\to {t}_{2}})\); when topic \({t}_{1}\) transitions to topic \({t}_{2}\), \({\pi }_{{\varepsilon }_{i}}^{{t}_{1}\to {t}_{2}}\) is given as:

$${\pi }_{{\varepsilon }_{i}}=\frac{{(n}_{{t}_{2}|{\varepsilon }_{i}}^{-b}+\alpha )\left({n}_{{t}_{1}}+\alpha \right){({\sum }_{w=1}^{V}{n}_{w|{t}_{1}}^{-b}+V\beta )}^{2}}{{(n}_{{t}_{1}|{\varepsilon }_{i}}^{-b}+\alpha )({n}_{{t}_{2}}+\alpha ){({\sum }_{w=1}^{V}{n}_{w|{t}_{2}}^{-b}+V\beta )}^{2}} \frac{({n}_{{w}_{i}|{t}_{2}}^{-b}+\beta )\left({n}_{{w}_{j}|{t}_{2}}^{-b}+\beta \right))(\left|T\right|\alpha +\sum_{t}{nt}_{{\varepsilon }_{i}|{t}_{2}}^{-b})}{({n}_{{w}_{i}|{t}_{1}}^{-b}+\beta )\left({n}_{{w}_{j}|{t}_{1}}^{-b}+\beta \right)(\left|T\right|\alpha +\sum_{t}{nt}_{{\varepsilon }_{i}|{t}_{1}}^{-b})}$$
(30)

Word proposal distribution is given as,

$$p_{{w_{j} }} \left( t \right) \propto \frac{{\left( {n_{{w_{j} |t}}^{ - b} + \beta } \right)}}{{\mathop \sum \nolimits_{w = 1}^{V} n_{w|t} + V\beta }}$$
(31)

When the topic transitions from \(t_{1}\) to \(t_{2}\), the acceptance probability \({\text{min}}\left( {1,\pi_{{w_{j} }}^{{t_{1} \to t_{2} }} } \right)\) is calculated as follows,

$$\frac{{(n_{{t_{{1|\varepsilon_{i} }} }}^{ - b} + \alpha )\left( {n_{{w_{i} |t_{2} }}^{ - b} + \beta } \right)\left( {n_{{w_{j} |t_{2} }}^{ - b} + \beta } \right)\left( {\mathop \sum \nolimits_{w = 1}^{V} n_{{w|t_{1} }} + V\beta } \right)}}{{(n_{{t_{{2|\varepsilon_{i} }} }}^{ - b} + \alpha )(n_{{w_{i} |t_{1} }}^{ - b} + \beta )\left( {n_{{w_{j} |t_{1} }}^{ - b} + \beta } \right)\left( {\mathop \sum \nolimits_{w = 1}^{V} n_{{w|t_{2} }} + V\beta } \right)}} \times \frac{{(n_{{w_{j} |t_{1} }}^{ - b} + \beta )(\mathop \sum \nolimits_{w = 1}^{V} n_{{w|t_{1} }}^{ - b} + V\beta )^{2} }}{{(n_{{w_{j} |t_{2} }}^{ - b} + \beta )(\mathop \sum \nolimits_{w = 1}^{V} n_{{w|t_{2} }}^{ - b} + V\beta )^{2} }}.$$
(32)

The MH sampling method is applied to infer topics conditioned on emotion \(\varepsilon_{i}\), which depends on the emotion labels of the dataset used.


4 Experiment

This section presents the results of our experiments on the proposed models and analyzes their performance in emotion prediction.

4.1 Dataset

ISEAR: This dataset consists of 7666 sentences, with roughly 1099 sentences in each emotion category. There are seven emotions: anger, fear, guilt, joy, disgust, sadness, and shame. 60% of the dataset was selected randomly for training, 20% for validation, and 20% for testing.

SemEval: This dataset contains 1246 news headlines used in Task 14 of the 4th International Workshop on Semantic Evaluations (SemEval-2007). The training set consists of 1000 documents, and the testing set includes 246. The emotion labels cover six basic emotions: joy, surprise, disgust, sadness, anger, and fear (Katz et al. 2007).

During pre-processing, stop-words and non-Latin characters are removed and each document is converted to lower case. This yields 1,571,829 biterms for the ISEAR dataset and 5123 for SemEval. The datasets are then split into training and testing sets and evaluated using five-fold cross-validation. The two proposed methods, WLTM and ABET, were implemented with the accelerated algorithms. Six topic-level baseline models, LLDA (Ramage et al. 2009), BTM (Cheng et al. 2014), ETM (Bao et al. 2012), CSTM (Rao 2016), SLTM (Rao et al. 2014a), and SNSTM (Huang et al. 2018), were implemented for comparison.

4.2 Experimental result

The performance of our model is evaluated using a fine-grained metric, the averaged Pearson's correlation coefficient (AP) (Rao 2016), given as follows:

$$AP\left( {m,n} \right) = \frac{{\mathop \sum \nolimits_{l} \left( {m\left( l \right) - m^{\prime}} \right)\left( {n\left( l \right) - n^{\prime}} \right)}}{{\sqrt {\mathop \sum \nolimits_{l} \left( {m\left( l \right) - m^{\prime}} \right)^{2} } \sqrt {\mathop \sum \nolimits_{l} \left( {n\left( l \right) - n^{\prime}} \right)^{2} } }}$$
(33)

where m and n are two vectors indexed by l, and m′ and n′ are the means of m and n, respectively. AP ranges from −1 to 1; values closer to 1 indicate stronger correlation and better prediction.
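Equation (33) is the standard Pearson correlation between two emotion vectors; a direct NumPy transcription:

```python
import numpy as np

def average_pearson(m, n):
    """Pearson correlation coefficient between vectors m and n (Eq. 33)."""
    m, n = np.asarray(m, float), np.asarray(n, float)
    dm, dn = m - m.mean(), n - n.mean()          # center both vectors
    return float((dm * dn).sum() / (np.sqrt((dm**2).sum()) * np.sqrt((dn**2).sum())))
```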

For emotion prediction, a radial basis function (RBF) is applied to WLTM, BTM, and LLDA. Five-fold cross-validation is performed on the training data of ISEAR and SemEval. The hyperparameters α and β are set to 0.1 and 0.01, respectively. For the SemEval dataset, Gibbs sampling was run for 500 iterations, whereas 200 iterations were run for the ISEAR dataset because its average number of words is large. MH sampling is run twice per step for the WLTM and ABET algorithms to obtain a more effective acceptance rate.

To predict emotion by estimating the probability \(P(\varepsilon |d)\), the emotion-term and emotion-topic models are applied. The performance of the predicted emotions is evaluated with the accuracy metric, given as:

$$Accuracy_{d} @N = \left\{ {\begin{array}{*{20}c} {1,\;{\text{if}}\;\varepsilon_{p} \in E_{top} N@d} \\ {0,\;{\text{otherwise}}} \\ \end{array} } \right..$$
(34)

Accuracy @ N for the testing set D is

$$Accuracy@N = \mathop \sum \limits_{d \in D} \frac{{Accuracy_{d} @N}}{\left| D \right|}.$$
(35)
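Equations (34) and (35) can be transcribed directly; here `ranked_true` is our name for each document's true emotions ordered by intensity, so that its first N entries form \(E_{top}N@d\):

```python
def accuracy_at_n(predicted, ranked_true, n):
    """Accuracy@N over a test set (Eqs. 34-35).

    predicted   : predicted emotion of each document
    ranked_true : per-document true emotions, ranked by intensity
    """
    hits = sum(1 for p, true in zip(predicted, ranked_true) if p in true[:n])
    return hits / len(predicted)
```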

The emotion-term and emotion-topic models can be applied to emotion prediction by estimating this probability; their prediction performance is evaluated here with the parameter N set to 1, 2, and 3 (Erik et al. 2017).

4.2.1 Influence of the number of topics

Here we focus on selecting the number of topics, which indicates the number of latent aspects and may affect the performance of our proposed model. The number of topics was varied from 2 to 200 (30 settings in total) for Gibbs sampling. The performance of the WLTM method with different numbers of topics was measured by the log-likelihood function, and based on it the top 15 topics were considered for better accuracy of the WLTM algorithm. Figure 2a, b presents the log-likelihood values of the top topics over ISEAR and SemEval, respectively. The mean Accu@1, Accu@2, and Accu@3 metrics are reported for the different models to measure the performance of our proposed algorithm as a function of the number of topics. The proposed model was compared with six popular baseline methods: LLDA (Ramage et al. 2009), BTM (Cheng et al. 2014), ETM (Bao et al. 2012), CSTM (Rao 2016), SLTM (Rao et al. 2014a), and SNSTM (Huang et al. 2018).

Fig. 2
figure 2

Log-likelihood values over the top 15 topics for a ISEAR and b SemEval dataset

4.2.2 Comparison with baselines

Experiments are conducted to analyze the mean and variance of each model in terms of AP. The best mean and variance values of AP are reported in boldface in Table 2a, b for the ISEAR and SemEval datasets, respectively.

Table 2 Performance of AP on (a) ISEAR and (b) SemEval dataset

On the SemEval dataset, APdocument and APemotion performance is measured against the baseline models LLDA, BTM, ETM, CSTM, SLTM, and SNSTM. Our proposed model WLTM outperformed the other models in terms of APemotion: compared to LLDA, BTM, ETM, CSTM, SLTM, and SNSTM, its mean APemotion improves by 0.0032, 0.0024, 0.0031, 0.0023, 0.0027, and 0.0031, respectively, while ABET ranked third with a value of 0.1998. For the variance, WLTM ranked third and ABET fourth. In terms of APdocument, ABET gives a better mean value than LLDA, BTM, ETM, CSTM, SLTM, and SNSTM, improving by 0.3092, 0.1229, 0.0856, 0.0123, 0.1378, and 0.0656, respectively, while WLTM performed slightly worse on the SemEval dataset. A possible reason is that 28 words do not appear in the 246 training documents but are present in the 1000 testing documents. Due to these missing samples during parameter tuning, SVR may underfit the document-level emotion prediction of WLTM, BTM, and LLDA. According to the variance values, WLTM ranks third and ABET fourth. Hence, WLTM is a reliable model for the SemEval dataset.

The experimental results over the ISEAR dataset indicate that WLTM outperformed the baselines on both APdocument and APemotion. On APemotion, WLTM improves over LLDA, BTM, ETM, CSTM, SLTM, and SNSTM by 0.4159, 0.0974, 0.0831, 0.2190, 0.3344, and 0.0796, respectively. The variance of WLTM ranks second on APdocument, and on APemotion its variance of 9.31E−05 is more stable than that of the baselines. The ABET model yields competitive performance on both APdocument and APemotion, with values of 0.2978 and 0.3427, respectively, and its variance ranks fourth on APdocument and third on APemotion. Although ABET cannot achieve the best Pearson correlation coefficient on either the SemEval or the ISEAR dataset, it is still a significantly stable model. Overall, WLTM's better performance on both APdocument and APemotion over ISEAR shows that it is more efficient than ABET and the baselines.

The metrics Accuracy@1, Accuracy@2, and Accuracy@3 on the ISEAR and SemEval datasets are presented in Table 3a, b, respectively. On the SemEval dataset, WLTM outperformed the other models, improving over the baselines LLDA, BTM, ETM, CSTM, SLTM, and SNSTM by 32.66%, 7.86%, 11.02%, 5.61%, 16.25%, and 4.06% in the Accuracy@3 metric, respectively. Our proposed ABET model shows competitive performance, ranking second according to Accuracy@3; compared to LLDA, BTM, ETM, CSTM, SLTM, and SNSTM, it improves by 30.87%, 6.07%, 9.23%, 3.82%, 14.46%, and 2.27%, respectively. On the ISEAR dataset, the WLTM model performs reasonably well, ranking fourth, while ABET ranks second in Accuracy@3. ETM performs better than both WLTM and ABET on the ISEAR dataset because topic sampling in ETM is constrained by a single label, so ETM can map most of the samples to their actual emotion label.

Table 3 Experimental result of accuracy in terms of percentage over (a) ISEAR and (b) SemEval on different models

A paired t test is conducted to compare the performance of the models at the conventional significance level (p value = 0.05). According to the t test, WLTM significantly outperformed BTM and LLDA, with p values of 5.32E−8 and 3.31E−11, respectively. The difference between WLTM and ABET is not statistically significant, with a p value of 0.2917.
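The significance test above amounts to a paired t-test over matched accuracy scores of two models (e.g., per fold or per run). The score arrays below are illustrative placeholders, not the paper's measurements.

```python
# Sketch of a paired t-test between two models' matched accuracy scores,
# judged at the conventional p = 0.05 significance level.
from scipy import stats

wltm_scores = [0.70, 0.68, 0.71, 0.69, 0.72]   # hypothetical per-fold scores
abet_scores = [0.69, 0.68, 0.70, 0.70, 0.71]

t_stat, p_value = stats.ttest_rel(wltm_scores, abet_scores)
significant = p_value < 0.05
print(round(p_value, 4), significant)
```

A paired test is appropriate here because both models are evaluated on the same folds, so the per-fold differences, rather than the raw scores, carry the comparison.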

Figure 3a presents the accuracy results for the ISEAR dataset. As shown in the figure, for Accuracy@1 the ABET model achieves 39.48%, versus 5.43% for the BTM model and 23.86% for the LLDA model. Similarly, Accuracy@2 is 58.27% for ABET, 0.92% for BTM, and 29.02% for LLDA, and Accuracy@3 is 69.78% for ABET, 1.75% for BTM, and 22.77% for LLDA. Figure 3b compares the accuracy results on the SemEval dataset, in percentage terms, across the different models. For Accuracy@1, the ABET model achieves the highest value of 36.12%, compared to 3.04% for BTM and 19.78% for LLDA. Similarly, Accuracy@2 is 57.02% for ABET, 4.61% for BTM, and 24.98% for LLDA, and Accuracy@3 is 76.07% for ABET, 4.09% for BTM, and 30.79% for LLDA.

Fig. 3
figure 3

Comparison of accuracy result in terms of percentage over a ISEAR and b SemEval on different models

4.2.3 Samples of the emotion lexicons

A deeper analysis of the topics generated by the WLTM model shows, in Table 4, the top terms of the topics assigned strong emotion labels. Most of the high-probability terms in topic 6, topic 13, topic 21, and topic 28 of the corpus are strongly correlated with emotions such as fear, sadness, anger, and guilt. On close analysis, the emotion-topic model successfully identifies that topic 6 relates to death-news terms and topic 21 to crime-activity terms; moreover, although topic 6 belongs to fear, the model is able to identify its association with sadness as well. Topics 13 and 28 share the same term "fail" but are associated with different emotion labels, sadness and guilt; a thorough study of the documents shows that the label depends on the type of document. For example, consider the sentences "I failed to complete a working task within the agreed time" and "I fail in the exam": the first expresses guilt, whereas the second belongs to sadness. Thus, the same term may express different emotions depending on the topics identified by our model.

Table 4 Emotion Lexicon samples from WLTM over ISEAR

The probability distribution over the seven emotions is generated for each topic; some samples are shown in Table 5. Each topic is heavily connected to one emotion label with a high probability: for example, topic 7 relates to the emotion label "Fear" with probability 97.77%, and topic 16 relates to the emotion "joy".

Table 5 Emotion Lexicon samples from ABET over ISEAR

5 Conclusion

Predicting emotions from short text is a challenging task in text mining. Here, two algorithms, WLTM and ABET, are presented to establish the connection between topics and emotions. Our algorithms can also handle the feature-sparsity issue in detecting emotion over short messages. Two acceleration techniques, the alias method and MH-sampling, are used to reduce the complexity of parameter estimation. Experiments have been conducted to evaluate the effectiveness of the proposed methods; compared with the baseline methods, the experimental results indicate that the performance of our approach is competitive.