Abstract
This paper proposes a new inference for the correlated topic model (CTM) [3]. CTM is an extension of LDA [4] that models correlations among latent topics. The proposed inference is an instance of the stochastic gradient variational Bayes (SGVB) [7, 8]. By constructing the inference network with the diagonal logistic normal distribution, we achieve a simple inference; in particular, there is no need to invert the covariance matrix explicitly. We performed a comparison with LDA in terms of predictive perplexity, considering two inferences for LDA: the collapsed Gibbs sampling (CGS) [5] and the collapsed variational Bayes with a zero-order Taylor expansion approximation (CVB0) [1]. While CVB0 for LDA gave the best results, the proposed inference achieved perplexities comparable with those of CGS for LDA.
1 Introduction
Topic modeling is one of the outstanding text mining techniques based on unsupervised machine learning, with a wide variety of applications. After the proposal of LDA [4], many extensions have been provided to address more realistic situations. In particular, LDA cannot model correlations among latent topics, and the correlated topic model (CTM) has been proposed for this purpose [3]. However, the inference for CTM is somewhat complicated, because the logistic normal prior is not conjugate to the multinomial distribution. This paper proposes a new variational Bayesian inference for CTM. The main contribution is that we simplify the inference with the stochastic gradient variational Bayes (SGVB) [7, 8]. While the proposed inference adopts a gradient-based optimization similar to the original one [3], no explicit inversion of the covariance matrix is required.
We briefly describe the variational inference for topic models. Let \(\varvec{x}_d = \{ x_{d1}, \ldots , x_{dN_d} \}\) be the multiset of the words in the document d, and let \(z_{dn}\) denote the topic to which the word token \(x_{dn}\) is assigned. Then the log evidence of the document d is lower bounded as \(\log p(\varvec{x}_d)\ge \mathbb {E}_{q(\varvec{z}_d,\varvec{\theta }_d)}[\log p(\varvec{x}_d | \varvec{z}_d, \mathbf {\Phi })p(\varvec{z}_d|\varvec{\theta }_d) p(\varvec{\theta }_d)] -\mathbb {E}_{q(\varvec{z}_d,\varvec{\theta }_d)}[\log q(\varvec{z}_d,\varvec{\theta }_d)]\), where \(\varvec{\theta }_d\) is the topic probability distribution of the document d, and \(\mathbf {\Phi } = \{ \varvec{\phi }_1, \ldots , \varvec{\phi }_K \}\) is the set of the per-topic word probability distributions (K is the number of topics), which is MAP-estimated in this paper. Even when the posterior \(q(\varvec{z}_d,\varvec{\theta }_d)\) is assumed to factorize as \(q(\varvec{z}_d)q(\varvec{\theta }_d)\), closed-form updates cannot be obtained for several parameters in CTM, so a gradient-based optimization is required. Further, since \(\varvec{\theta }_d\) is drawn from the logistic normal prior in CTM, the covariance matrix makes the inference complicated; the implementation of [3] explicitly inverts it. This paper proposes a simpler inference as an instance of SGVB [7, 8], which, thanks to SGVB, does not invert the covariance matrix explicitly.
2 Our Proposal: SGVB for CTM
In CTM, the per-document topic probabilities \(\varvec{\theta }_d\) are drawn from the logistic normal distribution, which is parameterized by the mean parameter \(\varvec{m}\) and the covariance matrix \(\mathbf {\Sigma }\). We assume that the variational posterior \(q(\varvec{\theta }_d ; \varvec{\mu }_d, \varvec{\sigma }_d)\) is the diagonal logistic normal, where the variational mean and standard deviation parameters for each pair (d, k) are referred to by \(\mu _{dk}\) and \(\sigma _{dk}\), respectively. The main contribution of this paper is that the inference is made simple by applying SGVB. A sample from \(q(\varvec{\theta }_d)\) is computed as \(\theta _{dk} \propto \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} )\) with the reparameterization technique [7], where \(\epsilon _{dk}\) is a sample from the noise distribution. The ELBO, i.e., the variational lower bound of the log evidence, is obtained as \(\mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)} \big [ \log p(\varvec{x}_d | \varvec{z}_d ; \mathbf {\Phi }) \big ] + \mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)q(\varvec{\theta }_d)} \big [ \log p(\varvec{z}_d | \varvec{\theta }_d) \big ] + \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log p(\varvec{\theta }_d | \varvec{m}, \mathbf {\Sigma }) \big ] - \mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)} \big [ \log q(\varvec{z}_d ; \varvec{\gamma }_d) \big ] - \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log q(\varvec{\theta }_d) \big ]\) for the document d, where \(q(\varvec{z}_d ; \varvec{\gamma }_d)\) is assumed to be the discrete distribution. Since no significant improvement was achieved by increasing the number of samples, we set the number of noise distribution samples to one in our experiments.
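The reparameterized sampling step above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the function name `sample_theta` and the use of standard normal noise are our assumptions.

```python
import numpy as np

def sample_theta(mu_d, sigma_d, rng):
    """Draw one reparameterized sample from the diagonal logistic normal
    posterior q(theta_d): theta_dk is proportional to
    exp(eps_dk * sigma_dk + mu_dk)."""
    eps = rng.standard_normal(mu_d.shape)  # noise sample (assumed N(0, I))
    eta = eps * sigma_d + mu_d             # reparameterized Gaussian draw
    eta = eta - eta.max()                  # subtract max for numerical stability
    theta = np.exp(eta)
    return theta / theta.sum(), eps        # normalize onto the simplex

rng = np.random.default_rng(0)
theta, eps = sample_theta(np.zeros(5), np.ones(5), rng)
```

Because the noise enters through a deterministic, differentiable map, gradients of the ELBO can flow back to \(\mu _{dk}\) and \(\sigma _{dk}\), which is the point of the reparameterization technique.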
To estimate parameters, we take the partial derivatives of the ELBO L. The partial derivatives with respect to \(\mu _{dk}\) and \(\tau _{dk}\), defined by \(\tau _{dk}\equiv \log (\sigma _{dk}^2)\), are \(\frac{\partial L}{\partial \mu _{dk}} = \sum _{n=1}^{N_d} \gamma _{dnk} - N_d \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} ) / \zeta _d - 2 \sum _{k^\prime =1}^{K-1} \varLambda _{k k^\prime } ( \epsilon _{dk^\prime } \sigma _{dk^\prime } + \mu _{dk^\prime } - m_{k^\prime } )\) and \(\frac{\partial L}{\partial \tau _{dk}} = \frac{1}{2} + \frac{1}{2} \epsilon _{dk} \exp (\frac{\tau _{dk}}{2}) \big [ \sum _{n=1}^{N_d} \gamma _{dnk} - N_d \exp \{ \epsilon _{dk} \exp (\frac{\tau _{dk}}{2}) + \mu _{dk} \} / \zeta _d - 2 \sum _{k^\prime =1}^{K-1} \varLambda _{k k^\prime } \{ \epsilon _{d k^\prime } \exp (\frac{{\tau }_{d k^\prime }}{2}) + \mu _{dk^\prime } - m_{k^\prime } \}\big ]\), where \(\zeta _d = 1 + \sum _{k=1}^{K-1} \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} )\). We skip the derivation due to space limitations. Adam [6] is used for updating \(\mu _{dk}\) and \(\tau _{dk}\) repeatedly. Since the precision matrix \(\varvec{\varLambda }\), i.e., the inverse of the covariance matrix, appears only in multiplications with a vector, the Cholesky decomposition makes the explicit inversion unnecessary. In addition, the analytic shrinkage [2] is used to make the computation stable. The \(\varvec{\phi }_k\)s are MAP-estimated. The other parameters are updated by \(\gamma _{dnk} \propto \theta _{dk} \phi _{kx_{dn}}\), \(\varvec{m} = \sum _{d=1}^D ( \varvec{\mu }_d + \varvec{\epsilon }_d \circ \varvec{\sigma }_d ) / D\), and \(\mathbf {\Sigma } = \sum _{d=1}^D ( \varvec{\epsilon }_d \circ \varvec{\sigma }_d + \varvec{\mu }_d - \varvec{m} ) ( \varvec{\epsilon }_d \circ \varvec{\sigma }_d + \varvec{\mu }_d - \varvec{m} )^{\top } / 2D\), where \(\circ \) is the element-wise product.
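The Cholesky-based trick mentioned above can be sketched as follows: instead of forming \(\mathbf {\Sigma }^{-1}\) explicitly, a product \(\varvec{\varLambda }\varvec{v} = \mathbf {\Sigma }^{-1}\varvec{v}\) is obtained from one factorization and two triangular solves. This is a hedged NumPy sketch; the function name `precision_times` and the toy values are ours, not from the paper.

```python
import numpy as np

def precision_times(Sigma, v):
    """Compute Sigma^{-1} @ v without forming the inverse:
    factor Sigma = L @ L.T (Cholesky), then solve L y = v and L.T x = y."""
    L = np.linalg.cholesky(Sigma)
    y = np.linalg.solve(L, v)        # forward substitution: L y = v
    return np.linalg.solve(L.T, y)   # back substitution:    L.T x = y

# toy positive-definite covariance and a deviation vector (eta_d - m)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
dev = np.array([1.0, -1.0])
x = precision_times(Sigma, dev)      # equals Sigma^{-1} @ dev
```

The factorization costs \(O(K^3)\) once per update of \(\mathbf {\Sigma }\), after which each matrix-vector product needs only two \(O(K^2)\) triangular solves, and no inverse is ever materialized.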
3 Evaluation Experiment
The evaluation experiment was conducted over the four English document sets in Table 1. NYT is the first half of the New York Times news articles in the "Bag of Words Data Set" of the UCI Machine Learning Repository. MOVIE is the set of movie reviews known as "Movie Review Data." NSF is the "NSF Research Award Abstracts 1990-2003 Data Set" of the UCI Machine Learning Repository. MEDLINE is a subset of MEDLINE®/PubMed® (see Note 3). For all document sets, we applied the Porter stemming and removed high- and low-frequency words.
We compared the proposed SGVB for CTM to the collapsed Gibbs sampling (CGS) for LDA [5] and also to the collapsed variational Bayes with a zero-order Taylor expansion approximation (CVB0) for LDA [1]. The original VB for CTM has already been compared to LDA in [3], so we did not repeat that comparison; however, CVB0 was not considered in [3], and thus we included it here. The evaluation measure was the predictive perplexity. We first ran each compared method on a randomly selected 90% of the documents as training data. We then ran each method on a randomly selected one third of the word tokens of each test document to obtain an estimation of the topic probabilities, where the per-topic word probabilities were never updated. Using the remaining two thirds, the perplexity was computed as \(\exp \big \{ - \frac{1}{N_{\text{ test }}} \sum _{ d \in \mathcal {D}_{\text{ test }} } \sum _{i \in \mathcal {I}_d} \log ( \sum _{k=1}^K \theta _{dk} \phi _{kx_{di}} ) \big \}\), where \(\mathcal {D}_{\text{ test }}\) denotes the test document set, \(\mathcal {I}_d\) the set of the indices of the test word tokens in the dth document, and \(N_{\text{ test }}\) the total number of the test tokens. For each data set, the methods were compared for \(K=50, 100\), and 150.
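The perplexity formula above can be computed as in the following minimal NumPy sketch; the function name `predictive_perplexity` and the array layout are our own assumptions, not from the paper.

```python
import numpy as np

def predictive_perplexity(theta, phi, test_tokens):
    """exp{ -(1/N_test) sum_{d,i} log( sum_k theta_dk * phi_{k, x_di} ) }.
    theta: (D, K) per-document topic probabilities,
    phi:   (K, V) per-topic word probabilities,
    test_tokens: list of word-index arrays, one per test document."""
    total_log, n_tokens = 0.0, 0
    for d, tokens in enumerate(test_tokens):
        # p(w) = sum_k theta_dk * phi_kw for each held-out token w
        word_probs = theta[d] @ phi[:, tokens]
        total_log += np.log(word_probs).sum()
        n_tokens += len(tokens)
    return np.exp(-total_log / n_tokens)

# sanity check: a uniform model over a 4-word vocabulary
theta = np.full((1, 2), 0.5)
phi = np.full((2, 4), 0.25)
ppl = predictive_perplexity(theta, phi, [np.array([0, 1, 2])])
```

As a sanity check, a model that assigns uniform probability over a vocabulary of V words yields perplexity exactly V (here, 4.0), the usual baseline for this measure.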
Figure 1 depicts the mean and standard deviation of the perplexities obtained from ten different training/test random splits. The horizontal axis gives K, and the vertical axis the perplexity. CVB0 was better than the other methods. A similar result has been given in [1], though only inferences for LDA were considered there. Comparing SGVB for CTM with CGS for LDA, the former was better than the latter in the following five cases: \(K=50 \text{ and } 100\) for NYT (resp. \(p=0.00072 \text{ and } 0.001\)), \(K=50 \text{ and } 100\) for MOVIE (resp. \(p=7.6\times 10^{-7} \text{ and } 0.00013\)), and \(K=50\) for MEDLINE (\(p=9.1\times 10^{-6}\)). The latter was better in the following five cases: \(K=150\) for MOVIE, \(K=50, 100, \text{ and } 150\) for NSF, and \(K=150\) for MEDLINE. The two were comparable in the remaining two cases. The p values were obtained by the paired two-tailed t-test. In sum, the proposed SGVB for CTM was as good as CGS for LDA. However, the proposed method was far worse only for NSF, where we suspect that the gradient-based optimization did not work well. Figure 2 gives topic correlations obtained from NYT for \(K=100\). The graph was drawn with Cytoscape. The edge thickness represents the magnitude of the corresponding entry of the covariance matrix. The left panel contains the topics seemingly relating to art, and the right those seemingly relating to politics.
4 Conclusion
This paper proposed a new inference for CTM. We applied SGVB to CTM and obtained a set of simple updating formulas. The experiments showed that the proposed method was comparable with CGS for LDA in terms of perplexity, though CVB0 for LDA was the best in all settings. While the proposed method was as good as CGS for LDA overall, it did not work well in some cases. Further elaboration seems required for a more effective gradient-based optimization.
Notes
- 3. We used the XML files from medline14n0770.xml to medline14n0774.xml.
References
Asuncion, A., Welling, M., Smyth, P., Teh, Y.W.: On smoothing and inference for topic models. In: UAI, pp. 27–34 (2009)
Bartz, D., Müller, K.R.: Generalizing analytic shrinkage for arbitrary covariance structures. In: NIPS 26, pp. 1869–1877 (2013)
Blei, D.M., Lafferty, J.D.: Correlated topic models. In: NIPS, pp. 147–154 (2005)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. JMLR 3, 993–1022 (2003)
Griffiths, T.L., Steyvers, M.: Finding scientific topics. PNAS 101(Suppl 1), 5228–5235 (2004)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Kingma, D.P., Welling, M.: Stochastic gradient VB and the variational auto-encoder. In: ICLR (2014)
Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: ICML, pp. 1278–1286 (2014)
© 2016 Springer International Publishing Switzerland
Masada, T., Takasu, A. (2016). A Simple Stochastic Gradient Variational Bayes for the Correlated Topic Model. In: Li, F., Shim, K., Zheng, K., Liu, G. (eds) Web Technologies and Applications. APWeb 2016. Lecture Notes in Computer Science(), vol 9932. Springer, Cham. https://doi.org/10.1007/978-3-319-45817-5_39