1 Introduction

Topic modeling is a prominent text mining technique based on unsupervised machine learning and has a wide variety of applications. Since the proposal of LDA [4], many extensions have been proposed to address more realistic situations. In particular, LDA cannot model correlations among latent topics, and the correlated topic model (CTM) was proposed to remedy this [3]. However, inference for CTM is complicated, because the logistic normal prior is not conjugate to the multinomial distribution. This paper proposes a new variational Bayesian inference for CTM. The main contribution is that we simplify the inference with the stochastic gradient variational Bayes (SGVB) [7, 8]. While the proposed inference adopts a gradient-based optimization similar to the original one [3], no explicit inversion of the covariance matrix is required.

We briefly describe the variational inference for topic models. Let \(\varvec{x}_d = \{ x_{d1}, \ldots , x_{dN_d} \}\) be the multiset of the words in document d, and let \(z_{dn}\) denote the topic to which the word token \(x_{dn}\) is assigned. The log evidence of document d is then lower bounded as \(\log p(\varvec{x}_d)\ge \mathbb {E}_{q(\varvec{z}_d,\varvec{\theta }_d)}[\log p(\varvec{x}_d | \varvec{z}_d, \mathbf {\Phi })p(\varvec{z}_d|\varvec{\theta }_d) p(\varvec{\theta }_d)] -\mathbb {E}_{q(\varvec{z}_d,\varvec{\theta }_d)}[\log q(\varvec{z}_d,\varvec{\theta }_d)]\), where \(\varvec{\theta }_d\) is the topic probability distribution of document d and \(\mathbf {\Phi } = \{ \varvec{\phi }_1, \ldots , \varvec{\phi }_K \}\) is the set of the per-topic word probability distributions, K being the number of topics; \(\mathbf {\Phi }\) is MAP-estimated in this paper. Even when the posterior \(q(\varvec{z}_d,\varvec{\theta }_d)\) is assumed to factorize as \(q(\varvec{z}_d)q(\varvec{\theta }_d)\), closed-form updates cannot be obtained for several parameters of CTM, and a gradient-based optimization is therefore required. Further, since \(\varvec{\theta }_d\) is drawn from the logistic normal prior in CTM, the covariance matrix makes the inference complicated; the implementation of [3] explicitly inverts it. This paper proposes a simpler inference as an instance of SGVB [7, 8], which avoids explicit inversion of the covariance matrix.
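To make the generative assumptions of CTM concrete, the following is a minimal NumPy sketch of drawing one document under a logistic normal prior. It is only an illustration: all values (K, V, \(N_d\), \(\varvec{m}\), \(\mathbf {\Sigma }\), \(\mathbf {\Phi }\)) are hypothetical placeholders, and it uses a full K-dimensional softmax rather than the (K-1)-dimensional parameterization used in Sect. 2.

```python
# Minimal sketch of the CTM generative process (not the authors' code);
# all sizes and parameter values below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 4, 1000, 50                      # topics, vocabulary size, document length
m = np.zeros(K)                              # mean of the logistic normal prior
Sigma = np.eye(K)                            # covariance; off-diagonals encode topic correlations
Phi = rng.dirichlet(np.ones(V), size=K)      # per-topic word distributions (MAP-estimated in the paper)

eta = rng.multivariate_normal(m, Sigma)      # draw eta_d ~ N(m, Sigma)
theta = np.exp(eta) / np.exp(eta).sum()      # logistic transform -> per-document topic probabilities
z = rng.choice(K, size=N_d, p=theta)         # topic assignment z_dn for each word token
x = np.array([rng.choice(V, p=Phi[k]) for k in z])  # observed word tokens x_dn
```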

2 Our Proposal: SGVB for CTM

In CTM, the per-document topic probabilities \(\varvec{\theta }_d\) are drawn from the logistic normal distribution, parameterized by the mean \(\varvec{m}\) and the covariance matrix \(\mathbf {\Sigma }\). We assume that the variational posterior \(q(\varvec{\theta }_d ; \varvec{\mu }_d, \varvec{\sigma }_d)\) is a diagonal logistic normal, whose variational mean and standard deviation parameters for each pair (d, k) are denoted by \(\mu _{dk}\) and \(\sigma _{dk}\), respectively. The main contribution of this paper is that the inference is made simple by applying SGVB. A sample from \(q(\varvec{\theta }_d)\) is computed as \(\theta _{dk} \propto \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} )\) with the reparameterization technique [7], where \(\epsilon _{dk}\) is a sample from the noise distribution. The ELBO, i.e., the variational lower bound of the log evidence of document d, is \(\mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)} \big [ \log p(\varvec{x}_d | \varvec{z}_d ; \mathbf {\Phi }) \big ] + \mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)q(\varvec{\theta }_d)} \big [ \log p(\varvec{z}_d | \varvec{\theta }_d) \big ] + \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log p(\varvec{\theta }_d | \varvec{m}, \mathbf {\Sigma }) \big ] - \mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)} \big [ \log q(\varvec{z}_d ; \varvec{\gamma }_d) \big ] - \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log q(\varvec{\theta }_d) \big ]\), where \(q(\varvec{z}_d ; \varvec{\gamma }_d)\) is assumed to be a discrete distribution. Since no significant improvement was achieved by increasing the number of samples, the number of noise samples was set to one in our experiment.
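As a concrete illustration of the reparameterization, the following NumPy sketch draws one sample of \(\varvec{\theta }_d\) from the diagonal logistic normal posterior and forms a one-sample Monte Carlo estimate of \(\mathbb {E}_{q}[\log p(\varvec{z}_d | \varvec{\theta }_d)]\). The variational parameters below are hypothetical placeholders, and a standard Gaussian noise distribution is assumed.

```python
# Sketch of the reparameterized sampling of theta_d (not the authors' code);
# mu_d, sigma_d, gamma_d are hypothetical variational parameters, and the
# noise distribution is assumed to be standard Gaussian.
import numpy as np

rng = np.random.default_rng(0)
K, N_d = 4, 50
mu_d = rng.normal(size=K)                        # variational means mu_dk
sigma_d = np.exp(rng.normal(size=K) / 2.0)       # variational standard deviations sigma_dk
gamma_d = rng.dirichlet(np.ones(K), size=N_d)    # q(z_dn = k), one row per word token

eps_d = rng.standard_normal(K)                   # a single noise sample, as in the experiment
theta_d = np.exp(eps_d * sigma_d + mu_d)         # theta_dk proportional to exp(eps_dk sigma_dk + mu_dk)
theta_d /= theta_d.sum()

# one-sample Monte Carlo estimate of E_q[log p(z_d | theta_d)]
e_log_pz = (gamma_d * np.log(theta_d)).sum()
```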

To estimate the parameters, we take the partial derivatives of the ELBO L. The partial derivatives with respect to \(\mu _{dk}\) and \(\tau _{dk}\), defined by \(\tau _{dk}\equiv \log (\sigma _{dk}^2)\), are \(\frac{\partial L}{\partial \mu _{dk}} = \sum _{n=1}^{N_d} \gamma _{dnk} - N_d \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} ) / \zeta _d - 2 \sum _{k^\prime =1}^{K-1} \varLambda _{k k^\prime } ( \epsilon _{dk^\prime } \sigma _{dk^\prime } + \mu _{dk^\prime } - m_{k^\prime } )\) and \(\frac{\partial L}{\partial \tau _{dk}} = \frac{1}{2} + \frac{1}{2} \epsilon _{dk} \exp (\frac{\tau _{dk}}{2}) \big [ \sum _{n=1}^{N_d} \gamma _{dnk} - N_d \exp \{ \epsilon _{dk} \exp (\frac{\tau _{dk}}{2}) + \mu _{dk} \} / \zeta _d - 2 \sum _{k^\prime =1}^{K-1} \varLambda _{k k^\prime } \{ \epsilon _{d k^\prime } \exp (\frac{{\tau }_{d k^\prime }}{2}) + \mu _{dk^\prime } - m_{k^\prime } \}\big ]\), where \(\varLambda \equiv \mathbf {\Sigma }^{-1}\) is the precision matrix and \(\zeta _d = 1 + \sum _{k=1}^{K-1} \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} )\). We omit the derivation due to space limitations. Adam [6] is used to update \(\mu _{dk}\) and \(\tau _{dk}\) iteratively. Since the precision matrix, i.e., the inverse of the covariance matrix, appears only in multiplication with a vector, the Cholesky decomposition can make the explicit inversion unnecessary. In addition, the analytic shrinkage [2] is used to make the computation stable. The \(\varvec{\phi }_k\)s are MAP-estimated. The other parameters are updated by \(\gamma _{dnk} \propto \theta _{dk} \phi _{kx_{dn}}\), \(\varvec{m} = \sum _{d=1}^D ( \varvec{\mu }_d + \varvec{\epsilon }_d \circ \varvec{\sigma }_d ) / D\), and \(\mathbf {\Sigma } = \sum _{d=1}^D ( \varvec{\epsilon }_d \circ \varvec{\sigma }_d + \varvec{\mu }_d - \varvec{m} ) ( \varvec{\epsilon }_d \circ \varvec{\sigma }_d + \varvec{\mu }_d - \varvec{m} )^{\top } / 2D\), where \(\circ \) is the element-wise product.
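For concreteness, the following NumPy sketch transcribes the two gradients and the corpus-level updates of \(\varvec{m}\) and \(\mathbf {\Sigma }\) directly from the text. It is a sketch under the formulas as stated, not the authors' implementation; a plain gradient-ascent step stands in for Adam [6], and the function and argument names are ours.

```python
# Direct transcription of the gradients and updates in the text (a sketch,
# not the authors' implementation). All per-document arrays are over the
# K-1 free dimensions of the logistic normal.
import numpy as np

def elbo_gradients(sum_gamma, mu_d, tau_d, eps_d, Lambda, m, N_d):
    # sum_gamma[k] = sum_n gamma_dnk; Lambda is the precision matrix.
    # Lambda @ (u - m) could instead be obtained from a Cholesky factor of the
    # covariance matrix (e.g. scipy.linalg.cho_solve), avoiding explicit inversion.
    sigma_d = np.exp(tau_d / 2.0)
    u = eps_d * sigma_d + mu_d                           # reparameterized logits
    zeta_d = 1.0 + np.exp(u).sum()
    common = sum_gamma - N_d * np.exp(u) / zeta_d - 2.0 * Lambda @ (u - m)
    grad_mu = common                                     # dL / d mu_dk
    grad_tau = 0.5 + 0.5 * eps_d * sigma_d * common      # dL / d tau_dk
    return grad_mu, grad_tau

def gradient_step(mu_d, tau_d, grad_mu, grad_tau, lr=1e-2):
    # plain gradient ascent for brevity; the paper uses Adam [6]
    return mu_d + lr * grad_mu, tau_d + lr * grad_tau

def update_m_Sigma(mu, eps_sigma):
    # mu, eps_sigma: (D, K-1) arrays, where eps_sigma[d] = eps_d * sigma_d
    D = mu.shape[0]
    m = (mu + eps_sigma).mean(axis=0)
    diff = mu + eps_sigma - m
    Sigma = diff.T @ diff / (2.0 * D)                    # as given in the text
    return m, Sigma
```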

3 Evaluation Experiment

The evaluation experiment was conducted over the four English document sets in Table 1. NYT is the first half of the New York Times news articles in the “Bag of Words Data Set” of the UCI Machine Learning Repository. MOVIE is the set of movie reviews known as “Movie Review Data.” NSF is the “NSF Research Award Abstracts 1990-2003 Data Set” of the UCI Machine Learning Repository. MEDLINE is a subset of MEDLINE®/PubMed®. For all document sets, we applied Porter stemming and removed high- and low-frequency words.

Table 1. Specifications of the four document sets used in the experiment
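A minimal sketch of the preprocessing described above, assuming NLTK's PorterStemmer and hypothetical frequency cut-offs (the exact thresholds are not reported in the paper):

```python
# Sketch of the preprocessing: Porter stemming, then removal of high- and
# low-frequency words; min_df and max_df_ratio are hypothetical thresholds.
from collections import Counter
from nltk.stem import PorterStemmer

def preprocess(docs, min_df=5, max_df_ratio=0.5):
    stemmer = PorterStemmer()
    stemmed = [[stemmer.stem(w) for w in doc.lower().split()] for doc in docs]
    df = Counter(w for doc in stemmed for w in set(doc))   # document frequencies
    n_docs = len(docs)
    vocab = {w for w, c in df.items()
             if c >= min_df and c / n_docs <= max_df_ratio}
    return [[w for w in doc if w in vocab] for doc in stemmed]
```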

We compared the proposed SGVB for CTM to the collapsed Gibbs sampling (CGS) for LDA [5] and to the collapsed variational Bayes with a zero-order Taylor expansion approximation (CVB0) for LDA [1]. The original VB for CTM has already been compared to LDA in [3], so we did not repeat that comparison; however, CVB0 could not be considered in [3], and we therefore include it here. The evaluation measure was the predictive perplexity. We first ran each compared method on a randomly selected 90% of the documents as training data. We then ran each method on a randomly selected one third of the word tokens of each test document to estimate the topic probabilities, keeping the per-topic word probabilities fixed. Using the remaining two thirds, the perplexity was computed as \(\exp \big \{ - \frac{1}{N_{\text{test}}} \sum _{ d \in \mathcal {D}_{\text{test}} } \sum _{i \in \mathcal {I}_d} \log ( \sum _{k=1}^K \theta _{dk} \phi _{kx_{di}} ) \big \}\), where \(\mathcal {D}_{\text{test}}\) denotes the test document set, \(\mathcal {I}_d\) the set of the indices of the held-out word tokens in the dth document, and \(N_{\text{test}}\) the total number of held-out tokens. For each data set, the methods were compared for \(K=50, 100\), and 150.
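The following NumPy sketch is a direct transcription of this perplexity: theta holds the topic probabilities estimated on the held-in third of each test document, Phi the per-topic word probabilities, and test_tokens the held-out word ids; all names are ours.

```python
# Direct transcription of the predictive perplexity formula above (a sketch).
import numpy as np

def predictive_perplexity(theta, Phi, test_tokens):
    # theta: (D_test, K); Phi: (K, V); test_tokens[d]: held-out word ids of document d
    log_lik, n_tokens = 0.0, 0
    for d, tokens in enumerate(test_tokens):
        probs = theta[d] @ Phi[:, tokens]      # sum_k theta_dk * phi_{k, x_di} per token
        log_lik += np.log(probs).sum()
        n_tokens += len(tokens)
    return np.exp(-log_lik / n_tokens)
```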

Figure 1 depicts the mean and standard deviation of the perplexities obtained from ten different training/test random splits. The horizontal axis gives K, and the vertical axis the perplexity. CVB0 was better than the other methods; a similar result has been reported in [1], though only inference methods for LDA are considered there. Comparing SGVB for CTM to CGS for LDA, the former was better than the latter in the following five cases: \(K=50 \text{ and } 100\) for NYT (resp. \(p=0.00072 \text{ and } 0.001\)), \(K=50 \text{ and } 100\) for MOVIE (resp. \(p=7.6\times 10^{-7} \text{ and } 0.00013\)), and \(K=50\) for MEDLINE (\(p=9.1\times 10^{-6}\)). The latter was better in the following five cases: \(K=150\) for MOVIE, \(K=50, 100, \text{ and } 150\) for NSF, and \(K=150\) for MEDLINE. The two were comparable in the remaining two cases. The p values were obtained by the paired two-tailed t-test. In sum, the proposed SGVB for CTM was roughly as good as CGS for LDA; it was clearly worse only for NSF, where we suspect that the gradient-based optimization did not work well. Figure 2 gives the topic correlations obtained from NYT for \(K=100\). The graph is drawn with Cytoscape. The edge thickness represents the magnitude of the corresponding entry of the covariance matrix. The left panel contains the topics seemingly relating to art, and the right those seemingly relating to politics.
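The p values above can be reproduced with a paired two-tailed t-test over the ten splits, e.g. with scipy.stats.ttest_rel; the arrays in the sketch below are random placeholders, not the measured perplexities.

```python
# Sketch of the paired two-tailed t-test over the ten random splits;
# the perplexity values below are placeholders, NOT results from the paper.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
perp_sgvb_ctm = rng.normal(3000.0, 50.0, size=10)   # placeholder per-split perplexities
perp_cgs_lda = rng.normal(3050.0, 50.0, size=10)    # placeholder per-split perplexities
t_stat, p_value = ttest_rel(perp_sgvb_ctm, perp_cgs_lda)
```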

Fig. 1. Evaluation results in terms of predictive perplexity.

Fig. 2. Topic correlations obtained by our method from the NYT data set. The left and right panels contain the topics seemingly relating to art and politics, respectively.

4 Conclusion

This paper proposed a new inference for CTM. We applied SGVB to CTM and obtained a set of simple update formulas. The experiment showed that the proposed method was comparable to CGS for LDA in terms of perplexity, though CVB0 for LDA was the best in all settings. While the proposed method was as good as CGS for LDA overall, it did not work well in some cases. Further elaboration seems required for a more effective gradient-based optimization.