1 Introduction

Topic modeling is a prominent text mining technique based on unsupervised machine learning and has a wide variety of applications. Since the proposal of LDA [4], many extensions have been proposed to address more realistic situations. In particular, LDA cannot model correlations among latent topics, and the correlated topic model (CTM) was proposed to remedy this [3]. However, inference for CTM is complicated, because the logistic normal prior is not conjugate to the multinomial distribution. This paper proposes a new variational Bayesian inference for CTM. The main contribution is that we simplify the inference with the stochastic gradient variational Bayes (SGVB) [7, 8]. While the proposed inference adopts a gradient-based optimization similar to the original one [3], no explicit inversion of the covariance matrix is required.

We briefly describe the variational inference for topic models. Let \(\varvec{x}_d = \{ x_{d1}, \ldots , x_{dN_d} \}\) be the multiset of the words in document d, and let \(z_{dn}\) denote the topic to which the word token \(x_{dn}\) is assigned. The log evidence of document d is then lower bounded as \(\log p(\varvec{x}_d)\ge \mathbb {E}_{q(\varvec{z}_d,\varvec{\theta }_d)}[\log p(\varvec{x}_d | \varvec{z}_d, \mathbf {\Phi })p(\varvec{z}_d|\varvec{\theta }_d) p(\varvec{\theta }_d)] -\mathbb {E}_{q(\varvec{z}_d,\varvec{\theta }_d)}[\log q(\varvec{z}_d,\varvec{\theta }_d)]\), where \(\varvec{\theta }_d\) is the topic probability distribution of document d and \(\mathbf {\Phi } = \{ \varvec{\phi }_1, \ldots , \varvec{\phi }_K \}\) is the set of the per-topic word probability distributions, K being the number of topics; \(\mathbf {\Phi }\) is MAP-estimated in this paper. Even when the posterior \(q(\varvec{z}_d,\varvec{\theta }_d)\) is assumed to factorize as \(q(\varvec{z}_d)q(\varvec{\theta }_d)\), closed-form updates cannot be obtained for several parameters of CTM, and a gradient-based optimization is therefore required. Further, since \(\varvec{\theta }_d\) is drawn from the logistic normal prior in CTM, the covariance matrix makes the inference complicated; the implementation of [3] explicitly inverts it. This paper proposes a simpler inference as an instance of SGVB [7, 8], which avoids explicit inversion of the covariance matrix.
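To make the generative assumptions of CTM concrete, the following is a minimal NumPy sketch of drawing one document under a logistic normal prior. It is only an illustration: all values (K, V, \(N_d\), \(\varvec{m}\), \(\mathbf {\Sigma }\), \(\mathbf {\Phi }\)) are hypothetical placeholders, and it uses a full K-dimensional softmax rather than the (K-1)-dimensional parameterization used in Sect. 2.

```python
# Minimal sketch of the CTM generative process (not the authors' code);
# all sizes and parameter values below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 4, 1000, 50                      # topics, vocabulary size, document length
m = np.zeros(K)                              # mean of the logistic normal prior
Sigma = np.eye(K)                            # covariance; off-diagonals encode topic correlations
Phi = rng.dirichlet(np.ones(V), size=K)      # per-topic word distributions (MAP-estimated in the paper)

eta = rng.multivariate_normal(m, Sigma)      # draw eta_d ~ N(m, Sigma)
theta = np.exp(eta) / np.exp(eta).sum()      # logistic transform -> per-document topic probabilities
z = rng.choice(K, size=N_d, p=theta)         # topic assignment z_dn for each word token
x = np.array([rng.choice(V, p=Phi[k]) for k in z])  # observed word tokens x_dn
```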

2 Our Proposal: SGVB for CTM

In CTM, the per-document topic probabilities \(\varvec{\theta }_d\) are drawn from the logistic normal distribution, parameterized by the mean \(\varvec{m}\) and the covariance matrix \(\mathbf {\Sigma }\). We assume that the variational posterior \(q(\varvec{\theta }_d ; \varvec{\mu }_d, \varvec{\sigma }_d)\) is a diagonal logistic normal, whose variational mean and standard deviation parameters for each pair (d, k) are denoted by \(\mu _{dk}\) and \(\sigma _{dk}\), respectively. The main contribution of this paper is that the inference is made simple by applying SGVB. A sample from \(q(\varvec{\theta }_d)\) is computed as \(\theta _{dk} \propto \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} )\) with the reparameterization technique [7], where \(\epsilon _{dk}\) is a sample from the noise distribution. The ELBO, i.e., the variational lower bound of the log evidence of document d, is \(\mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)} \big [ \log p(\varvec{x}_d | \varvec{z}_d ; \mathbf {\Phi }) \big ] + \mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)q(\varvec{\theta }_d)} \big [ \log p(\varvec{z}_d | \varvec{\theta }_d) \big ] + \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log p(\varvec{\theta }_d | \varvec{m}, \mathbf {\Sigma }) \big ] - \mathbb {E}_{q(\varvec{z}_d ; \varvec{\gamma }_d)} \big [ \log q(\varvec{z}_d ; \varvec{\gamma }_d) \big ] - \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log q(\varvec{\theta }_d) \big ]\), where \(q(\varvec{z}_d ; \varvec{\gamma }_d)\) is assumed to be a discrete distribution. Since no significant improvement was achieved by increasing the number of samples, the number of noise samples was set to one in our experiment.
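As a concrete illustration of the reparameterization, the following NumPy sketch draws one sample of \(\varvec{\theta }_d\) from the diagonal logistic normal posterior and forms a one-sample Monte Carlo estimate of \(\mathbb {E}_{q}[\log p(\varvec{z}_d | \varvec{\theta }_d)]\). The variational parameters below are hypothetical placeholders, and a standard Gaussian noise distribution is assumed.

```python
# Sketch of the reparameterized sampling of theta_d (not the authors' code);
# mu_d, sigma_d, gamma_d are hypothetical variational parameters, and the
# noise distribution is assumed to be standard Gaussian.
import numpy as np

rng = np.random.default_rng(0)
K, N_d = 4, 50
mu_d = rng.normal(size=K)                        # variational means mu_dk
sigma_d = np.exp(rng.normal(size=K) / 2.0)       # variational standard deviations sigma_dk
gamma_d = rng.dirichlet(np.ones(K), size=N_d)    # q(z_dn = k), one row per word token

eps_d = rng.standard_normal(K)                   # a single noise sample, as in the experiment
theta_d = np.exp(eps_d * sigma_d + mu_d)         # theta_dk proportional to exp(eps_dk sigma_dk + mu_dk)
theta_d /= theta_d.sum()

# one-sample Monte Carlo estimate of E_q[log p(z_d | theta_d)]
e_log_pz = (gamma_d * np.log(theta_d)).sum()
```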

To estimate the parameters, we take the partial derivatives of the ELBO L. The partial derivatives with respect to \(\mu _{dk}\) and \(\tau _{dk}\), defined by \(\tau _{dk}\equiv \log (\sigma _{dk}^2)\), are \(\frac{\partial L}{\partial \mu _{dk}} = \sum _{n=1}^{N_d} \gamma _{dnk} - N_d \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} ) / \zeta _d - 2 \sum _{k^\prime =1}^{K-1} \varLambda _{k k^\prime } ( \epsilon _{dk^\prime } \sigma _{dk^\prime } + \mu _{dk^\prime } - m_{k^\prime } )\) and \(\frac{\partial L}{\partial \tau _{dk}} = \frac{1}{2} + \frac{1}{2} \epsilon _{dk} \exp (\frac{\tau _{dk}}{2}) \big [ \sum _{n=1}^{N_d} \gamma _{dnk} - N_d \exp \{ \epsilon _{dk} \exp (\frac{\tau _{dk}}{2}) + \mu _{dk} \} / \zeta _d - 2 \sum _{k^\prime =1}^{K-1} \varLambda _{k k^\prime } \{ \epsilon _{d k^\prime } \exp (\frac{{\tau }_{d k^\prime }}{2}) + \mu _{dk^\prime } - m_{k^\prime } \}\big ]\), where \(\varLambda \equiv \mathbf {\Sigma }^{-1}\) is the precision matrix and \(\zeta _d = 1 + \sum _{k=1}^{K-1} \exp ( \epsilon _{dk} \sigma _{dk} + \mu _{dk} )\). We omit the derivation due to space limitations. Adam [6] is used to update \(\mu _{dk}\) and \(\tau _{dk}\) iteratively. Since the precision matrix, i.e., the inverse of the covariance matrix, appears only in multiplication with a vector, the Cholesky decomposition can make the explicit inversion unnecessary. In addition, the analytic shrinkage [2] is used to make the computation stable. The \(\varvec{\phi }_k\)s are MAP-estimated. The other parameters are updated by \(\gamma _{dnk} \propto \theta _{dk} \phi _{kx_{dn}}\), \(\varvec{m} = \sum _{d=1}^D ( \varvec{\mu }_d + \varvec{\epsilon }_d \circ \varvec{\sigma }_d ) / D\), and \(\mathbf {\Sigma } = \sum _{d=1}^D ( \varvec{\epsilon }_d \circ \varvec{\sigma }_d + \varvec{\mu }_d - \varvec{m} ) ( \varvec{\epsilon }_d \circ \varvec{\sigma }_d + \varvec{\mu }_d - \varvec{m} )^{\top } / 2D\), where \(\circ \) is the element-wise product.
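For concreteness, the following NumPy sketch transcribes the two gradients and the corpus-level updates of \(\varvec{m}\) and \(\mathbf {\Sigma }\) directly from the text. It is a sketch under the formulas as stated, not the authors' implementation; a plain gradient-ascent step stands in for Adam [6], and the function and argument names are ours.

```python
# Direct transcription of the gradients and updates in the text (a sketch,
# not the authors' implementation). All per-document arrays are over the
# K-1 free dimensions of the logistic normal.
import numpy as np

def elbo_gradients(sum_gamma, mu_d, tau_d, eps_d, Lambda, m, N_d):
    # sum_gamma[k] = sum_n gamma_dnk; Lambda is the precision matrix.
    # Lambda @ (u - m) could instead be obtained from a Cholesky factor of the
    # covariance matrix (e.g. scipy.linalg.cho_solve), avoiding explicit inversion.
    sigma_d = np.exp(tau_d / 2.0)
    u = eps_d * sigma_d + mu_d                           # reparameterized logits
    zeta_d = 1.0 + np.exp(u).sum()
    common = sum_gamma - N_d * np.exp(u) / zeta_d - 2.0 * Lambda @ (u - m)
    grad_mu = common                                     # dL / d mu_dk
    grad_tau = 0.5 + 0.5 * eps_d * sigma_d * common      # dL / d tau_dk
    return grad_mu, grad_tau

def gradient_step(mu_d, tau_d, grad_mu, grad_tau, lr=1e-2):
    # plain gradient ascent for brevity; the paper uses Adam [6]
    return mu_d + lr * grad_mu, tau_d + lr * grad_tau

def update_m_Sigma(mu, eps_sigma):
    # mu, eps_sigma: (D, K-1) arrays, where eps_sigma[d] = eps_d * sigma_d
    D = mu.shape[0]
    m = (mu + eps_sigma).mean(axis=0)
    diff = mu + eps_sigma - m
    Sigma = diff.T @ diff / (2.0 * D)                    # as given in the text
    return m, Sigma
```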

3 Evaluation Experiment

The evaluation experiment was conducted over the four English document sets in Table 1. NYT is the first half of the New York Times news articles in the “Bag of Words Data Set” of the UCI Machine Learning Repository. MOVIE is the set of movie reviews known as “Movie Review Data.” NSF is the “NSF Research Award Abstracts 1990-2003 Data Set” of the UCI Machine Learning Repository. MEDLINE is a subset of MEDLINE®/PubMed®. For all document sets, we applied Porter stemming and removed high- and low-frequency words.

Table 1. Specifications of the four document sets used in the experiment
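A minimal sketch of the preprocessing described above, assuming NLTK's PorterStemmer and hypothetical frequency cut-offs (the exact thresholds are not reported in the paper):

```python
# Sketch of the preprocessing: Porter stemming, then removal of high- and
# low-frequency words; min_df and max_df_ratio are hypothetical thresholds.
from collections import Counter
from nltk.stem import PorterStemmer

def preprocess(docs, min_df=5, max_df_ratio=0.5):
    stemmer = PorterStemmer()
    stemmed = [[stemmer.stem(w) for w in doc.lower().split()] for doc in docs]
    df = Counter(w for doc in stemmed for w in set(doc))   # document frequencies
    n_docs = len(docs)
    vocab = {w for w, c in df.items()
             if c >= min_df and c / n_docs <= max_df_ratio}
    return [[w for w in doc if w in vocab] for doc in stemmed]
```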

We compared the proposed SGVB for CTM to the collapsed Gibbs sampling (CGS) for LDA [5] and to the collapsed variational Bayes with a zero-order Taylor expansion approximation (CVB0) for LDA [1]. The original VB for CTM has already been compared to LDA in [3], so we did not repeat that comparison; however, CVB0 could not be considered in [3], and we therefore include it here. The evaluation measure was the predictive perplexity. We first ran each compared method on a randomly selected 90% of the documents as training data. We then ran each method on a randomly selected one third of the word tokens of each test document to estimate the topic probabilities, keeping the per-topic word probabilities fixed. Using the remaining two thirds, the perplexity was computed as \(\exp \big \{ - \frac{1}{N_{\text{test}}} \sum _{ d \in \mathcal {D}_{\text{test}} } \sum _{i \in \mathcal {I}_d} \log ( \sum _{k=1}^K \theta _{dk} \phi _{kx_{di}} ) \big \}\), where \(\mathcal {D}_{\text{test}}\) denotes the test document set, \(\mathcal {I}_d\) the set of the indices of the held-out word tokens in the dth document, and \(N_{\text{test}}\) the total number of held-out tokens. For each data set, the methods were compared for \(K=50, 100\), and 150.
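The following NumPy sketch is a direct transcription of this perplexity: theta holds the topic probabilities estimated on the held-in third of each test document, Phi the per-topic word probabilities, and test_tokens the held-out word ids; all names are ours.

```python
# Direct transcription of the predictive perplexity formula above (a sketch).
import numpy as np

def predictive_perplexity(theta, Phi, test_tokens):
    # theta: (D_test, K); Phi: (K, V); test_tokens[d]: held-out word ids of document d
    log_lik, n_tokens = 0.0, 0
    for d, tokens in enumerate(test_tokens):
        probs = theta[d] @ Phi[:, tokens]      # sum_k theta_dk * phi_{k, x_di} per token
        log_lik += np.log(probs).sum()
        n_tokens += len(tokens)
    return np.exp(-log_lik / n_tokens)
```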

Figure 1 depicts the mean and standard deviation of the perplexities obtained from ten different training/test random splits. The horizontal axis gives K, and the vertical axis the perplexity. CVB0 was better than the other methods; a similar result has been reported in [1], though only inference methods for LDA are considered there. Comparing SGVB for CTM to CGS for LDA, the former was better than the latter in the following five cases: \(K=50 \text{ and } 100\) for NYT (resp. \(p=0.00072 \text{ and } 0.001\)), \(K=50 \text{ and } 100\) for MOVIE (resp. \(p=7.6\times 10^{-7} \text{ and } 0.00013\)), and \(K=50\) for MEDLINE (\(p=9.1\times 10^{-6}\)). The latter was better in the following five cases: \(K=150\) for MOVIE, \(K=50, 100, \text{ and } 150\) for NSF, and \(K=150\) for MEDLINE. The two were comparable in the remaining two cases. The p values were obtained by the paired two-tailed t-test. In sum, the proposed SGVB for CTM was roughly as good as CGS for LDA; it was clearly worse only for NSF, where we suspect that the gradient-based optimization did not work well. Figure 2 gives the topic correlations obtained from NYT for \(K=100\). The graph is drawn with Cytoscape. The edge thickness represents the magnitude of the corresponding entry of the covariance matrix. The left panel contains the topics seemingly relating to art, and the right those seemingly relating to politics.
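The p values above can be reproduced with a paired two-tailed t-test over the ten splits, e.g. with scipy.stats.ttest_rel; the arrays in the sketch below are random placeholders, not the measured perplexities.

```python
# Sketch of the paired two-tailed t-test over the ten random splits;
# the perplexity values below are placeholders, NOT results from the paper.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
perp_sgvb_ctm = rng.normal(3000.0, 50.0, size=10)   # placeholder per-split perplexities
perp_cgs_lda = rng.normal(3050.0, 50.0, size=10)    # placeholder per-split perplexities
t_stat, p_value = ttest_rel(perp_sgvb_ctm, perp_cgs_lda)
```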

Fig. 1. Evaluation results in terms of predictive perplexity.

Fig. 2. Topic correlations obtained by our method from the NYT data set. The left and right panels contain the topics seemingly relating to art and politics, respectively.

4 Conclusion

This paper proposed a new inference for CTM. We applied SGVB to CTM and obtained a set of simple update formulas. The experiment showed that the proposed method was comparable to CGS for LDA in terms of perplexity, though CVB0 for LDA was the best in all settings. While the proposed method was as good as CGS for LDA overall, it did not work well in some cases. Further elaboration seems required for a more effective gradient-based optimization.