
1 Introduction

When we use Bayesian probabilistic models for data mining applications, we need to infer the posterior distribution. While Markov chain Monte Carlo (MCMC) methods are undoubtedly important [5, 14], this paper focuses on variational Bayesian inference (VB). We therefore first present an outline of VB.

Let \(\varvec{x}\) denote the set of random variables whose values are observed. A probabilistic model for analyzing the observed data \(\varvec{x}\) can be specified unambiguously by its full joint distribution \(p(\varvec{x}, \varvec{z}, \mathbf {\Theta })\), where \(\varvec{z}\) denotes the discrete latent variables and \(\mathbf {\Theta }\) the continuous ones. In VB, we maximize the log of the evidence \(p(\varvec{x})\), which is obtained from the full joint distribution by marginalizing \(\varvec{z}\) and \(\mathbf {\Theta }\) out, i.e., \(p(\varvec{x}) = \int \sum _{\varvec{z}} p(\varvec{x}, \varvec{z}, \mathbf {\Theta }) d\mathbf {\Theta }\). However, the maximization of \(\log p(\varvec{x})\) is generally intractable. Therefore, we instead maximize its lower bound:

$$\begin{aligned} \log p(\varvec{x}) \!=\!\log \!\int \!\sum _{\varvec{z}} q(\varvec{z}, \mathbf {\Theta }) \frac{p(\varvec{x}, \varvec{z}, \mathbf {\Theta })}{q(\varvec{z}, \mathbf {\Theta })} d\mathbf {\Theta } \ge \!\int \!\sum _{\varvec{z}} q(\varvec{z}, \mathbf {\Theta }) \log \frac{p(\varvec{x}, \varvec{z}, \mathbf {\Theta })}{q(\varvec{z}, \mathbf {\Theta })} d\mathbf {\Theta }. \end{aligned}$$
(1)

We have introduced an approximate posterior \(q(\varvec{z}, \mathbf {\Theta })\) in Eq. (1) to apply Jensen’s inequality. If we put the true posterior \(p(\varvec{z}, \mathbf {\Theta } | \varvec{x})\) in place of the approximate posterior, Jensen’s inequality holds with equality. However, the true posterior is typically intractable. Therefore, in VB, the inference of the approximate posterior is the main task.

In this paper, we consider latent Dirichlet allocation (LDA) [4], the best-known Bayesian model in text mining, as our target. LDA and its extensions have been applied to solve a wide variety of text mining problems [7, 10, 12, 15, 17]. It is known that the performance of LDA measured in terms of test set perplexity heavily depends on how the inference is conducted [2]. Therefore, a new proposal concerning posterior inference for LDA is highly relevant to text mining research.

The main contribution of this paper is to propose a new VB-type inference for LDA. The proposed inference outperformed the VB presented in [4] in terms of test set perplexity in all situations considered in our experiment. For brevity, we call the VB presented in [4] the standard VB.

In the standard VB, the true posterior distribution is approximated by the Dirichlet distribution. This is because the Dirichlet is obtained analytically by taking the functional derivative after applying a mean field approximation. It has been shown experimentally that the standard VB works as well as other inference methods [2]. In our proposed inference, we apply the same mean field approximation. However, we do not use the Dirichlet for approximating the true posterior. Nevertheless, the proposed inference achieved a better perplexity than the standard VB in our experiment. Interestingly, our method was better even than collapsed Gibbs sampling (CGS) [6] in many, though not all, situations.

The proposed inference for LDA is based on stochastic gradient variational Bayes (SGVB), which was proposed almost simultaneously by two papers [9, 13]. SGVB can be regarded as a general framework for obtaining VB-type inferences for a wide range of Bayesian probabilistic models. More precisely, SGVB provides a general framework for obtaining a Monte Carlo estimate of the log-evidence lower bound, i.e., the lower bound of \(\log p(\varvec{x})\) in Eq. (1). In this paper, we utilize SGVB to devise an easy-to-implement inference for LDA. We use the logistic normal distribution [1] for approximating the true posterior. While the logistic normal is more complicated than the Dirichlet, we can obtain a simple VB-type inference owing to SGVB.

In the next section, we describe SGVB following [9], in a manner intended to be easy to understand for those familiar with LDA. Our description does not cover the full generality of SGVB, partly because we focus only on LDA as our target. We then provide the details of our proposal in Sect. 3. Section 4 presents the results of an evaluation experiment, where we compare our proposal with other methods, including the standard VB and CGS. Section 5 concludes the paper with a discussion of worthwhile future work.

2 Stochastic Gradient Variational Bayes

The log-evidence lower bound in Eq. (1) can be rewritten as follows:

$$\begin{aligned} \mathcal {L}(\mathbf {\Lambda }) = \mathbb {E}_{q(\varvec{z}, \mathbf {\Theta }|\mathbf {\Lambda })} [ \log p(\varvec{x}, \varvec{z}, \mathbf {\Theta }) ] - \mathbb {E}_{q(\varvec{z}, \mathbf {\Theta }| \mathbf {\Lambda })} [ \log q(\varvec{z}, \mathbf {\Theta }| \mathbf {\Lambda }) ], \end{aligned}$$
(2)

where \(\mathbf {\Lambda }\) denotes the parameters of the approximate posterior \(q(\varvec{z}, \mathbf {\Theta }| \mathbf {\Lambda })\), and \(\mathbb {E}_{q(\varvec{z}, \mathbf {\Theta }|\mathbf {\Lambda })} [ \cdot ]\) denotes the expectation with respect to \(q(\varvec{z}, \mathbf {\Theta }| \mathbf {\Lambda })\). We assume that \(q(\varvec{z}, \mathbf {\Theta }| \mathbf {\Lambda })\) factorizes as \(q(\varvec{z}| \mathbf {\Lambda }_{\varvec{z}}) q(\mathbf {\Theta }| \mathbf {\Lambda }_{\mathbf {\Theta }})\). Then we can write \(\mathcal {L}(\mathbf {\Lambda })\) as

$$\begin{aligned} \mathcal {L}(\mathbf {\Lambda }) =&\mathbb {E}_{q(\varvec{z}, \mathbf {\Theta }|\mathbf {\Lambda })} [ \log p(\varvec{x}, \varvec{z}, \mathbf {\Theta }) ] \nonumber \\ {}&-\mathbb {E}_{q(\varvec{z}| \mathbf {\Lambda }_{\varvec{z}})} [ \log q(\varvec{z}| \mathbf {\Lambda }_{\varvec{z}}) ] -\mathbb {E}_{q(\mathbf {\Theta }| \mathbf {\Lambda }_{\mathbf {\Theta }})} [ \log q(\mathbf {\Theta }| \mathbf {\Lambda }_{\mathbf {\Theta }}) ]. \end{aligned}$$
(3)

Our task is to estimate the expectations on the right hand side of Eq. (3).

In this paper, we estimate the log-evidence lower bound of latent Dirichlet allocation (LDA) by using stochastic gradient variational Bayes (SGVB) [9, 13]. SGVB is a general framework for obtaining a Monte Carlo estimate of the log-evidence lower bound for a wide variety of Bayesian probabilistic models. Note that SGVB cannot estimate the expectation with respect to the distribution over the discrete random variables [11]. However, with respect to \(\varvec{z}\), we can perform the estimation as in the standard VB for LDA [4].

In SGVB, the approximate posterior \(q(\mathbf {\Theta }| \mathbf {\Lambda }_{\mathbf {\Theta }})\) may be allowed to depend on the observed data \(\varvec{x}\). We do not consider this option here and thus do not explore the full generality of SGVB. However, by making the approximate posterior independent of \(\varvec{x}\), we keep the proposed inference simple.

When we apply SGVB, the approximate posterior should meet at least two requirements. SGVB estimates the expectations for the continuous variables \(\mathbf {\Theta }\) by the Monte Carlo method. Therefore, the approximate posterior \(q(\mathbf {\Theta }| \mathbf {\Lambda }_{\mathbf {\Theta }})\) should be a distribution from which we can draw samples. This is the first requirement. Let \(\mathbf {\Theta }^{(l)}\), \(l=1,\ldots , L\) be the samples drawn from \(q(\mathbf {\Theta } | \mathbf {\Lambda }_{\mathbf {\Theta }})\). Then \(\mathcal {L}(\mathbf {\Lambda })\) in Eq. (3) is estimated as

$$\begin{aligned} \hat{\mathcal {L}}(\mathbf {\Lambda }) =&\frac{1}{L}\!\sum _{l=1}^L\!\Big \{ \mathbb {E}_{q(\varvec{z}| \mathbf {\Lambda }_{\varvec{z}})} [ \log p(\varvec{x}, \varvec{z}, \mathbf {\Theta }^{(l)}) ] -\log q(\mathbf {\Theta }^{(l)}| \mathbf {\Lambda }_{\mathbf {\Theta }}) \Big \} \nonumber \\ {}&-\mathbb {E}_{q(\varvec{z}| \mathbf {\Lambda }_{\varvec{z}})} [ \log q(\varvec{z}| \mathbf {\Lambda }_{\varvec{z}})]. \end{aligned}$$
(4)

In SGVB, we maximize \(\hat{\mathcal {L}}(\mathbf {\Lambda })\) in Eq. (4) in place of \(\mathcal {L}(\mathbf {\Lambda })\) in Eq. (3). To maximize \(\hat{\mathcal {L}}(\mathbf {\Lambda })\), we need to obtain its derivatives with respect to the relevant variables. Therefore, \(q(\mathbf {\Theta }| \mathbf {\Lambda }_{\mathbf {\Theta }})\) should be a distribution that makes \(\hat{\mathcal {L}}(\mathbf {\Lambda })\) differentiable. This is the second requirement. As described below, the inference proposed in this paper for LDA can be regarded as an example of SGVB.
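To make the two requirements concrete, the following sketch (a hypothetical toy example that is not part of the original formulation) computes the Monte Carlo estimate of Eq. (4) for a one-dimensional Gaussian model with a reparameterized Gaussian approximate posterior; the sample is easy to draw, and the estimate is a differentiable function of the variational parameters.

```python
import numpy as np

# Toy model (assumed for illustration only): x_n ~ N(theta, 1) with prior
# theta ~ N(0, 1). The approximate posterior q(theta) = N(mu, sigma^2) is
# reparameterized as theta = mu + sigma * eps with eps ~ N(0, 1), so the
# Monte Carlo estimate of the lower bound is differentiable in (mu, log_sigma).

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=100)           # observed data

def elbo_estimate(mu, log_sigma, L=10):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=L)                 # requirement 1: q must be samplable
    theta = mu + sigma * eps                 # requirement 2: differentiable mapping
    log_joint = np.array([-0.5 * np.sum((x - t) ** 2) - 0.5 * t ** 2 for t in theta])
    log_q = -0.5 * eps ** 2 - np.log(sigma)  # log q(theta^(l)), up to constants
    return np.mean(log_joint - log_q)        # cf. Eq. (4), with no discrete part

print(elbo_estimate(mu=0.0, log_sigma=0.0))
```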

3 Our Proposal

3.1 Lower Bound Estimation

We first describe LDA. Let D, K, and V denote the numbers of documents, latent topics, and vocabulary words, respectively. The parameters of the per-document topic multinomial distributions and the parameters of the per-topic word multinomial distributions are represented as \(\varvec{\theta }_d = (\theta _{d1}, \ldots , \theta _{dK})\) for \(d=1,\ldots ,D\) and \(\varvec{\phi }_k = (\phi _{k1},\ldots ,\phi _{kV})\) for \(k=1,\ldots ,K\), respectively. Then the full joint distribution of LDA is written as follows:

$$\begin{aligned}&p(\varvec{x}, \varvec{z}, \varvec{\theta }, \varvec{\phi } | \alpha , \beta ) = \prod _{d=1}^D p(\varvec{x}_d | \varvec{z}_d, \varvec{\phi }) p(\varvec{z}_d | \varvec{\theta }_d) \cdot \prod _{d=1}^D p(\varvec{\theta }_d | \alpha ) \cdot \prod _{k=1}^K p(\varvec{\phi }_k | \beta ) \nonumber \\&= \prod _{d=1}^D \prod _{i=1}^{N_d} \phi _{z_{di} x_{di}} \theta _{d z_{di}} \cdot \prod _{d=1}^D \frac{\Gamma (K\alpha )}{\Gamma (\alpha )^K} \prod _{k=1}^K \theta _{dk}^{\alpha - 1} \cdot \prod _{k=1}^K \frac{\Gamma (V\beta )}{\Gamma (\beta )^V} \prod _{v=1}^V \phi _{kv}^{\beta - 1}, \end{aligned}$$
(5)

where \(N_d\) is the length of the dth document. \(x_{di}\) is an observed variable whose value is the vocabulary word appearing as the ith token of the dth document. \(z_{di}\) is a latent variable whose value is the topic to which the ith word token of the dth document is assigned. The notation \(\phi _{z_{di} x_{di}}\) is equivalent to \(\phi _{kv}\) when \(x_{di} = v\) and \(z_{di}=k\). \(\alpha \) and \(\beta \) are the hyperparameters of the symmetric Dirichlet priors for \(\varvec{\theta }_d\) and \(\varvec{\phi }_k\), respectively.
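To fix the notation, the following sketch (illustrative only; the sizes D, K, V and the hyperparameter values are arbitrary) generates a toy corpus from the full joint distribution in Eq. (5).

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, V = 5, 3, 20                 # documents, topics, vocabulary words
alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters
N = rng.integers(10, 20, size=D)   # document lengths N_d

phi = rng.dirichlet(beta * np.ones(V), size=K)     # per-topic word distributions
theta = rng.dirichlet(alpha * np.ones(K), size=D)  # per-document topic distributions

docs = []
for d in range(D):
    z_d = rng.choice(K, size=N[d], p=theta[d])      # topic assignments z_di
    x_d = [rng.choice(V, p=phi[k]) for k in z_d]    # word tokens x_di
    docs.append(x_d)
```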

We propose a new inference method for LDA based on the SGVB framework explained in Sect. 2. However, SGVB is applicable only to continuous latent variables. Therefore, in LDA, SGVB works only for the \(\varvec{\theta }_d\)s and the \(\varvec{\phi }_k\)s.

With the mean field approximation \(q(\varvec{z},\varvec{\theta },\varvec{\phi }) \approx \prod _{d=1}^D \prod _{i=1}^{N_d} q(z_{di}) \cdot \prod _{d=1}^D q(\varvec{\theta }_d) \cdot \prod _{k=1}^K q(\varvec{\phi }_k)\), the lower bound of \(\log p(\varvec{x})\) (cf. Eq. (3)) is obtained as follows:

$$\begin{aligned}&\mathcal {L}(\mathbf {\Lambda }) \nonumber \\ {}&= \sum _{d=1}^D\!\sum _{i=1}^{N_d}\!\mathbb {E}_{q(z_{di})q(\varvec{\phi }_{z_{di}})}\!\big [ \log p(x_{di} | z_{di}, \varvec{\phi }_{z_{di}}) \big ] + \sum _{d=1}^D\!\sum _{i=1}^{N_d}\!\mathbb {E}_{q(z_{di})q(\varvec{\theta }_d)}\!\big [ \log p(z_{di} | \varvec{\theta }_d) \big ] \nonumber \\ {}&+ \sum _{d=1}^D \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log p(\varvec{\theta }_d | \alpha ) \big ] + \sum _{k=1}^K \mathbb {E}_{q(\varvec{\phi }_k)} \big [ \log p(\varvec{\phi }_k | \beta ) \big ] \nonumber \\ {}&- \sum _{d=1}^D\!\mathbb {E}_{q(\varvec{\theta }_d)}\!\big [ \log q(\varvec{\theta }_d)\big ] - \sum _{k=1}^K\!\mathbb {E}_{q(\varvec{\phi }_k)}\!\big [ \log q(\varvec{\phi }_k) \big ] - \sum _{d=1}^D\!\sum _{i=1}^{N_d}\!\mathbb {E}_{q(z_{di})}\!\big [ \log q(z_{di}) \big ]. \end{aligned}$$
(6)

In the standard VB [4], we obtain the approximate posteriors by a functional derivative method after applying the mean field approximation given above. The result is that the posterior \(q(\varvec{\theta }_d)\) for each d and the posterior \(q(\varvec{\phi }_k)\) for each k are Dirichlet distributions. However, the fact that the approximate posteriors can be found analytically by a functional derivative method does not guarantee that they lead to a good evaluation result in terms of test set perplexity. Therefore, we are free to choose a distribution other than the Dirichlet for approximating the true posterior.

In this paper, we propose to use the logistic normal distribution [1] for approximating the true posterior. We define \(\theta _{dk}\) and \(\phi _{kv}\) with the samples \(\epsilon _{\theta , dk}\) and \(\epsilon _{\phi ,kv}\) from the standard normal distribution \(\mathcal {N}(0,1)\) as follows:

$$\begin{aligned} \theta _{dk}&\equiv \frac{\exp ( \epsilon _{\theta , dk} \sigma _{\theta , dk} + \mu _{\theta , dk} )}{\sum _{k^\prime =1}^K\exp ( \epsilon _{\theta , dk^\prime } \sigma _{\theta , dk^\prime } + \mu _{\theta , dk^\prime } )} \text{ and } \nonumber \\ \phi _{kv}&\equiv \frac{\exp ( \epsilon _{\phi ,kv} \sigma _{\phi , kv} + \mu _{\phi , kv} )}{\sum _{v^\prime =1}^V \exp ( \epsilon _{\phi ,kv^\prime } \sigma _{\phi , kv^\prime } + \mu _{\phi , kv^\prime } )}. \end{aligned}$$
(7)

Note that \(\epsilon \sigma + \mu \sim \mathcal {N}(\mu , \sigma ^2)\) when \(\epsilon \sim \mathcal {N}(0,1)\). \(\mu \) and \(\sigma \) in Eq. (7) are the mean and standard deviation parameters of the logistic normal. Equation (7) gives the reparameterization trick [9] in our case, where we assume that the covariance matrix of the logistic normal is diagonal to make the inference simple.
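As a concrete illustration of Eq. (7), the following sketch (a minimal example; the sizes K and V and the parameter values are arbitrary) draws one reparameterized sample of \(\varvec{\theta }_d\) and \(\varvec{\phi }_k\) with a diagonal covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
K, V = 3, 20

# Variational parameters (means and standard deviations) for one document d and
# one topic k; in practice there is one such vector per document and per topic.
mu_theta, sigma_theta = np.zeros(K), np.full(K, 0.5)
mu_phi, sigma_phi = np.zeros(V), np.full(V, 0.5)

def logistic_normal_sample(mu, sigma):
    eps = rng.normal(size=mu.shape)              # eps ~ N(0, 1)
    y = eps * sigma + mu                         # y ~ N(mu, sigma^2)
    return np.exp(y - np.logaddexp.reduce(y))    # softmax of y, i.e., Eq. (7)

theta_d = logistic_normal_sample(mu_theta, sigma_theta)  # sums to 1 over topics
phi_k = logistic_normal_sample(mu_phi, sigma_phi)        # sums to 1 over words
```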

We can draw \(\varvec{\theta }_d \sim \text{ LogitNorm }(\varvec{\mu }_{\theta ,d}, \varvec{\sigma }_{\theta ,d})\) and \(\varvec{\phi }_k \sim \text{ LogitNorm }(\varvec{\mu }_{\phi ,k}, \varvec{\sigma }_{\phi ,k})\) efficiently based on Eq. (7). Therefore, the first requirement given in Sect. 2 is met. For the approximate posterior \(q(\varvec{z})\), we assume as in the standard VB that we have a different discrete distribution \(\mathrm{Discrete}(\varvec{\gamma }_{di})\) for each word token \(x_{di}\), where \(\gamma _{dik}\) is the probability that \(z_{di} = k\) holds, i.e., the probability that the ith token of the dth document is assigned to the kth topic.

However, computing the expectations with respect to the logistic normal distribution analytically is highly complicated. Here SGVB has an advantage, because it estimates the expectations with respect to the approximate posteriors by the Monte Carlo method. \(\mathcal {L}(\mathbf {\Lambda })\) in Eq. (6) is estimated with L samples \(\varvec{\theta }^{(l)}_d\sim \text{ LogitNorm }(\varvec{\mu }_{\theta ,d}, \varvec{\sigma }_{\theta ,d})\) and \(\varvec{\phi }^{(l)}_k\sim \text{ LogitNorm }(\varvec{\mu }_{\phi ,k}, \varvec{\sigma }_{\phi ,k})\) for \(l=1,\ldots ,L\) as

$$\begin{aligned} \hat{\mathcal {L}}&(\mathbf {\Lambda }) = \frac{1}{L}\sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \mathbb {E}_{q(z_{di})} \big [ \log p(x_{di} | z_{di}, \varvec{\phi }^{(l)}_{z_{di}}) \big ] \nonumber \\ {}&+ \frac{1}{L}\sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \mathbb {E}_{q(z_{di})} \big [ \log p(z_{di} | \varvec{\theta }^{(l)}_d) \big ] - \sum _{d=1}^D \sum _{i=1}^{N_d} \mathbb {E}_{q(z_{di})} \big [ \log q(z_{di}) \big ] \nonumber \\ {}&+ \frac{1}{L}\sum _{l=1}^L \sum _{d=1}^D \mathbb {E}_{q(\varvec{\theta }_d)} \big [ \log p(\varvec{\theta }_d^{(l)} | \alpha ) - \log q(\varvec{\theta }_d^{(l)}) \big ] \nonumber \\ {}&+ \frac{1}{L}\sum _{l=1}^L \sum _{k=1}^K \mathbb {E}_{q(\varvec{\phi }_k)} \big [ \log p(\varvec{\phi }_k^{(l)} | \beta ) - \log q(\varvec{\phi }_k^{(l)}) \big ]. \end{aligned}$$
(8)

We maximize the above estimate, denoted as \(\hat{\mathcal {L}}(\mathbf {\Lambda })\), in place of \(\mathcal {L}(\mathbf {\Lambda })\) in Eq. (6).

Due to space limitations, we only discuss the first two expectation terms on the right hand side of Eq. (8). The first term can be rewritten as follows:

$$\begin{aligned} \frac{1}{L}&\sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \mathbb {E}_{q(z_{di})} \big [ \log p(x_{di} | z_{di}, \varvec{\phi }^{(l)}_{z_{di}}) \big ] = \frac{1}{L} \sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \sum _{k=1}^K \gamma _{dik} \log \phi ^{(l)}_{k x_{di}} \nonumber \\ {}&= \frac{1}{L} \sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \sum _{k=1}^K \gamma _{dik} \Big \{ ( \epsilon _{\phi ,kx_{di}}^{(l)} \sigma _{\phi , kx_{di}} + \mu _{\phi , kx_{di}} ) - \log \sum _v \exp ( \epsilon _{\phi ,kv}^{(l)} \sigma _{\phi , kv} + \mu _{\phi , kv} ) \Big \}, \end{aligned}$$
(9)

where we use the definition of \(\phi _{kv}\) in Eq. (7). The log-sum term on the last line of Eq. (9) can be upper bounded by using a first-order Taylor expansion [3]:

$$\begin{aligned} \log \!\sum _v\!\exp ( \epsilon _{\phi ,kv}^{(l)} \sigma _{\phi , kv} + \mu _{\phi , kv} ) \!\le \! \frac{\!\sum _v\!\exp ( \epsilon _{\phi ,kv}^{(l)} \sigma _{\phi , kv} + \mu _{\phi , kv} )}{\eta _{\phi ,k}^{(l)}}\!-\!1\!+\!\log \eta _{\phi ,k}^{(l)}, \end{aligned}$$
(10)

where we have introduced a new variational parameter \(\eta _{\phi ,k}^{(l)}\). Consequently, we can lower bound Eq. (9) as follows:

$$\begin{aligned} \frac{1}{L}&\sum _{l=1}^L \mathbb {E}_{q(\varvec{z}| \varvec{\gamma })} \big [ \log p(\varvec{x} | \varvec{z}, \varvec{\phi }^{(l)}) \big ] \ge \frac{1}{L} \sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \sum _{k=1}^K \gamma _{dik} ( \epsilon _{\phi ,kx_{di}}^{(l)} \sigma _{\phi , kx_{di}} + \mu _{\phi , kx_{di}} ) \nonumber \\ {}&- \frac{1}{L} \sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \sum _{k=1}^K \gamma _{dik} \bigg \{ \frac{ \sum _v \exp ( \epsilon _{\phi ,kv}^{(l)} \sigma _{\phi , kv} + \mu _{\phi , kv} ) }{\eta _{\phi ,k}^{(l)}} + \log \eta _{\phi ,k}^{(l)} - 1 \bigg \}. \end{aligned}$$
(11)
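The bound in Eq. (10) is the standard linear bound \(\log s \le s/\eta - 1 + \log \eta \), which is tight at \(\eta = s\). The following check (illustrative only, with arbitrary random values) confirms this numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=20)                 # y_v = eps_v * sigma_v + mu_v for one topic
s = np.exp(y).sum()                     # sum_v exp(...), the argument of the log

for eta in [0.5 * s, s, 2.0 * s]:
    lhs = np.log(s)                     # log-sum term in Eq. (9)
    rhs = s / eta - 1.0 + np.log(eta)   # upper bound of Eq. (10)
    print(f"eta/s = {eta / s:.1f}: log s = {lhs:.4f} <= bound = {rhs:.4f}")
```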

Let us define \(N_{kv} \equiv \sum _{d=1}^D \sum _{i=1}^{N_d} \gamma _{dik} \delta (x_{di}=v)\), where \(\delta (\cdot )\) is an indicator function that evaluates to 1 if the condition in parentheses holds and to 0 otherwise. \(N_{kv}\) is the expected number of tokens of the vth vocabulary word assigned to the kth topic. Further, we define \(N_k \equiv \sum _{v=1}^V N_{kv}\). Then Eq. (11) can be written more compactly:

$$\begin{aligned} \frac{1}{L}&\sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \mathbb {E}_{q(z_{di})} \big [ \log p(x_{di} | z_{di}, \varvec{\phi }^{(l)}_{z_{di}}) \big ] \ge \sum _{k=1}^K \sum _{v=1}^V N_{kv} (\sigma _{\phi , kv} \bar{\epsilon }_{\phi ,kv} + \mu _{\phi , kv}) \nonumber \\ {}&- \frac{1}{L}\sum _{l=1}^L\sum _{k=1}^K N_k \bigg \{ \frac{ \sum _v \exp ( \epsilon _{\phi ,kv}^{(l)} \sigma _{\phi , kv} + \mu _{\phi , kv} )}{\eta _{\phi ,k}^{(l)}} + \log \eta _{\phi ,k}^{(l)} - 1 \bigg \}, \end{aligned}$$
(12)

where we define \(\bar{\epsilon }_{\phi ,kv} \equiv \frac{1}{L}\sum _{l=1}^L \epsilon _{\phi ,kv}^{(l)}\).

In a similar manner, we can lower bound the second expectation term of the right hand side in Eq. (8) as follows:

$$\begin{aligned} \frac{1}{L}&\sum _{l=1}^L \sum _{d=1}^D \sum _{i=1}^{N_d} \mathbb {E}_{q(z_{di})} \big [ \log p(z_{di} | \varvec{\theta }^{(l)}_d) \big ] \ge \sum _{d=1}^D \sum _{k=1}^K N_{dk} ( \sigma _{\theta , dk} \bar{\epsilon }_{\theta ,dk} + \mu _{\theta , dk} ) \nonumber \\ {}&- \frac{1}{L}\sum _{l=1}^L\sum _{d=1}^D N_d \bigg \{ \frac{ \sum _k \exp ( \epsilon _{\theta ,dk}^{(l)} \sigma _{\theta , dk} + \mu _{\theta , dk} )}{\eta _{\theta ,d}^{(l)}} + \log \eta _{\theta ,d}^{(l)} - 1 \bigg \}, \end{aligned}$$
(13)

where \(\eta _{\theta ,d}\) is a new parameter introduced in a manner similar to Eq. (10). \(N_{dk}\) is defined as \(\sum _{i=1}^{N_d} \gamma _{dik}\), i.e., the expected number of word tokens of the dth document assigned to the kth topic.
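The expected counts \(N_{dk}\), \(N_{kv}\), and \(N_k\) can be accumulated directly from the variational parameters \(\gamma \). A minimal sketch (assuming docs[d] is the list of word tokens of the dth document, as in the earlier toy corpus, and gamma[d] is an \(N_d \times K\) array with gamma[d][i, k] \(= \gamma _{dik}\)) is:

```python
import numpy as np

def expected_counts(docs, gamma, K, V):
    """Accumulate N_dk, N_kv, and N_k from the per-token responsibilities gamma."""
    N_dk = np.zeros((len(docs), K))
    N_kv = np.zeros((K, V))
    for d, x_d in enumerate(docs):
        N_dk[d] = gamma[d].sum(axis=0)        # N_dk = sum_i gamma_dik
        for i, v in enumerate(x_d):
            N_kv[:, v] += gamma[d][i]         # N_kv = sum_{d,i} gamma_dik * delta(x_di = v)
    N_k = N_kv.sum(axis=1)                    # N_k = sum_v N_kv
    return N_dk, N_kv, N_k
```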

We omit the derivations for the other expectation terms in Eq. (8) and only show the final result. We can lower bound \(\hat{\mathcal {L}}(\mathbf {\Lambda })\) as follows:

$$\begin{aligned} \hat{\mathcal {L}}(\mathbf {\Lambda }) \ge&\sum _{d,k} (N_{dk} + \alpha ) (\sigma _{\theta , dk} \bar{\epsilon }_{\theta ,dk} + \mu _{\theta ,dk}) + \sum _{k,v} (N_{kv} + \beta ) (\sigma _{\phi , kv} \bar{\epsilon }_{\phi ,kv} + \mu _{\phi , kv}) \nonumber \\ {}&- \frac{1}{L}\sum _{l=1}^L\sum _{d=1}^D (N_d + K \alpha ) \bigg \{ \frac{ \sum _k \exp ( \epsilon _{\theta ,dk}^{(l)} \sigma _{\theta , dk} + \mu _{\theta , dk} )}{\eta _{\theta ,d}^{(l)}} + \log \eta _{\theta ,d}^{(l)} - 1 \bigg \} \nonumber \\ {}&- \frac{1}{L}\sum _{l=1}^L\sum _{k=1}^K (N_k + V \beta ) \bigg \{ \frac{ \sum _v \exp ( \epsilon _{\phi ,kv}^{(l)} \sigma _{\phi , kv} + \mu _{\phi , kv} )}{\eta _{\phi ,k}^{(l)}} + \log \eta _{\phi ,k}^{(l)} - 1 \bigg \} \nonumber \\ {}&+ \sum _{k=1}^K \sum _{v=1}^V \log \sigma _{\phi ,kv} + \sum _{d=1}^D \sum _{k=1}^K \log \sigma _{\theta ,dk} - \sum _{d=1}^D \sum _{i=1}^{N_d} \sum _{k=1}^K \gamma _{dik} \log \gamma _{dik} \nonumber \\ {}&+ D \log \Gamma (K \alpha ) - D K \log \Gamma (\alpha ) + K \log \Gamma (V \beta ) - K V \log \Gamma (\beta ), \end{aligned}$$
(14)

where the constant term is omitted. We refer to the right hand side of Eq. (14) as \(\tilde{\mathcal {L}}(\mathbf {\Lambda })\), where \(\mathbf {\Lambda }\) denotes the posterior parameters \(\{ \varvec{\mu }, \varvec{\sigma }, \varvec{\eta }, \varvec{\gamma } \}\). In our version of SGVB for LDA, we maximize \(\tilde{\mathcal {L}}(\mathbf {\Lambda })\). Note that \(\tilde{\mathcal {L}}(\mathbf {\Lambda })\) is differentiable with respect to all relevant variables owing to the reparameterization trick in Eq. (7). Therefore, the second requirement given in Sect. 2 is met.

3.2 Maximization of Lower Bound

We maximize \(\tilde{\mathcal {L}}(\mathbf {\Lambda })\), i.e., the right hand side of Eq. (14), by differentiating it with respect to the relevant variables. With respect to \(\eta ^{(l)}_{\theta ,d}\), we obtain the derivative:

$$\begin{aligned} \frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \eta ^{(l)}_{\theta ,d}}&= \frac{1}{L} ( N_d + K \alpha ) \bigg \{ \frac{1}{ ( \eta ^{(l)}_{\theta ,d} )^2 } \sum _{k=1}^K \exp \big ( \epsilon ^{(l)}_{\theta ,dk} \sigma _{\theta , dk} + \mu _{\theta , dk} \big ) - \frac{1}{\eta ^{(l)}_{\theta ,d}} \bigg \}. \end{aligned}$$
(15)

The equation \(\frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \eta ^{(l)}_{\theta ,d}}=0\) gives the solution: \(\eta ^{(l)}_{\theta ,d} = \sum _{k=1}^K \exp ( \epsilon ^{(l)}_{\theta ,dk} \sigma _{\theta , dk} + \mu _{\theta , dk} )\). Similarly, \(\frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \eta ^{(l)}_{\phi ,k}}=0\) gives the solution: \(\eta ^{(l)}_{\phi ,k} = \sum _{v=1}^V \exp ( \epsilon ^{(l)}_{\phi ,kv} \sigma _{\phi ,kv} + \mu _{\phi ,kv} )\). Note that each of these solutions appears as the denominator in Eq. (7).

Further, with respect to \(\mu _{\theta ,dk}\), we obtain the derivative:

$$\begin{aligned} \frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \mu _{\theta ,dk}}&= (N_{dk} + \alpha ) - \frac{\exp ( \mu _{\theta ,dk} )}{L} ( N_d + K \alpha ) \sum _{l=1}^L \frac{ \exp ( \epsilon ^{(l)}_{\theta ,dk} \sigma _{\theta ,dk} ) }{\eta ^{(l)}_{\theta ,d}}. \end{aligned}$$
(16)

Therefore, \(\exp ( \mu _{\theta ,dk} )\) is estimated as \(\frac{N_{dk} + \alpha }{N_d + K\alpha } \cdot \frac{ L }{ \sum _l \exp ( \epsilon ^{(l)}_{\theta ,dk} \sigma _{\theta ,dk} ) / \eta ^{(l)}_{\theta ,d} }\). Based on the definition of \(\theta _{dk}^{(l)}\) in Eq. (7), we obtain the following update for \(\exp ( \mu _{\theta ,dk} )\):

$$\begin{aligned} \exp ( \mu _{\theta ,dk} ) \leftarrow \exp ( \mu _{\theta ,dk} ) \cdot \bigg ( \frac{N_{dk} + \alpha }{N_d + K\alpha } \bigg ) \bigg / \bigg ( \frac{\sum _l \theta _{dk}^{(l)}}{L} \bigg ). \end{aligned}$$
(17)

Similarly, \(\exp ( \mu _{\phi ,kv} )\) is updated as follows:

$$\begin{aligned} \exp ( \mu _{\phi ,kv} ) \leftarrow \exp ( \mu _{\phi ,kv} ) \cdot \bigg ( \frac{N_{kv} + \beta }{N_k + V\beta } \bigg ) \bigg / \bigg ( \frac{\sum _l \phi _{kv}^{(l)}}{L} \bigg ). \end{aligned}$$
(18)

We can give an intuitive explanation of the update in Eq. (17). \(\frac{N_{dk} + \alpha }{N_d + K\alpha }\) is an estimate of the per-document topic probability based on the word count expectation \(N_{dk}\). In contrast, \(\frac{\sum _l \theta _{dk}^{(l)}}{L}\) is an estimate of the same probability based on the logistic normal samples \(\theta _{dk}^{(1)},\ldots ,\theta _{dk}^{(L)}\). We adjust \(\exp ( \mu _{\theta ,dk} )\) based on the extent to which the latter estimate deviates from the former. A similar explanation can be given for the update in Eq. (18).
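In code, Eqs. (17) and (18) are simple multiplicative corrections, conveniently applied in log space. A sketch under assumed array conventions (theta_samples of shape \(L \times D \times K\) and phi_samples of shape \(L \times K \times V\) holding the logistic normal samples, and N_d the vector of document lengths) is:

```python
import numpy as np

def update_mu(mu_theta, mu_phi, N_dk, N_kv, N_k, N_d,
              theta_samples, phi_samples, alpha, beta):
    K = N_dk.shape[1]
    V = N_kv.shape[1]
    # Eq. (17): count-based estimate divided by the sample-based estimate,
    # applied to exp(mu_theta), i.e., added to mu_theta in log space.
    target_theta = (N_dk + alpha) / (N_d[:, None] + K * alpha)
    mu_theta = mu_theta + np.log(target_theta) - np.log(theta_samples.mean(axis=0))
    # Eq. (18): the same correction for the per-topic word parameters.
    target_phi = (N_kv + beta) / (N_k[:, None] + V * beta)
    mu_phi = mu_phi + np.log(target_phi) - np.log(phi_samples.mean(axis=0))
    return mu_theta, mu_phi
```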

For the standard deviation parameters \(\sigma _{\theta ,dk}\) and \(\sigma _{\phi ,kv}\), we cannot obtain any closed form update. Therefore, we perform a gradient-based optimization. For numerical reasons, we change variables as \(\tau = \log (\sigma ^2)\). Then the derivatives with respect to \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\) are obtained as follows:

$$\begin{aligned} \frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \tau _{\theta ,dk}}&= \frac{1}{2} + \frac{1}{2}\exp \Big (\frac{\tau _{\theta ,dk}}{2}\Big ) \frac{ \sum _{l=1}^L \epsilon ^{(l)}_{\theta ,dk} \{ (N_{dk} + \alpha ) - ( N_d + K\alpha ) \theta ^{(l)}_{dk} \} }{L}, \end{aligned}$$
(19)
$$\begin{aligned} \frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \tau _{\phi ,kv}}&= \frac{1}{2} + \frac{1}{2}\exp \Big (\frac{\tau _{\phi ,kv}}{2}\Big ) \frac{ \sum _{l=1}^L \epsilon ^{(l)}_{\phi ,kv} \{ (N_{kv} + \beta ) - ( N_k + V \beta ) \phi ^{(l)}_{kv} \} }{L}. \end{aligned}$$
(20)
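A direct transcription of Eqs. (19) and (20) into array form (a sketch under the same assumed conventions as above, with eps_theta of shape \(L \times D \times K\) and eps_phi of shape \(L \times K \times V\) holding the standard normal samples) is:

```python
import numpy as np

def grad_tau(tau_theta, tau_phi, N_dk, N_kv, N_k, N_d,
             eps_theta, eps_phi, theta_samples, phi_samples, alpha, beta):
    K = N_dk.shape[1]
    V = N_kv.shape[1]
    # Eq. (19): gradient with respect to tau_theta, where tau = log(sigma^2).
    resid_theta = (N_dk + alpha) - (N_d[:, None] + K * alpha) * theta_samples
    g_theta = 0.5 + 0.5 * np.exp(tau_theta / 2) * (eps_theta * resid_theta).mean(axis=0)
    # Eq. (20): gradient with respect to tau_phi.
    resid_phi = (N_kv + beta) - (N_k[:, None] + V * beta) * phi_samples
    g_phi = 0.5 + 0.5 * np.exp(tau_phi / 2) * (eps_phi * resid_phi).mean(axis=0)
    return g_theta, g_phi
```

These gradients are then passed to a first-order optimizer; the paper uses Adam [8].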

With respect to \(\gamma _{dik}\), we obtain the following derivative:

$$\begin{aligned}&\frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \gamma _{dik}} = \frac{1}{L}\sum _{l=1}^L ( \epsilon ^{(l)}_{\theta ,dk} \sigma _{\theta ,dk} + \mu _{\theta ,dk} ) + \frac{1}{L}\sum _{l=1}^L ( \epsilon ^{(l)}_{\phi ,kx_{di}} \sigma _{\phi ,kx_{di}} + \mu _{\phi ,kx_{di}} ) \nonumber \\ {}&- \frac{1}{L}\sum _{l=1}^L \Big \{ \frac{ \sum _v \exp ( \epsilon ^{(l)}_{\phi ,kv} \sigma _{\phi , kv} + \mu _{\phi , kv} ) }{\eta ^{(l)}_{\phi ,k}} + \log \eta ^{(l)}_{\phi ,k} - 1 \Big \} - \log \gamma _{dik} - 1. \end{aligned}$$
(21)

Setting \(\frac{\partial \tilde{\mathcal {L}}(\mathbf {\Lambda })}{\partial \gamma _{dik}} = 0\) under the normalization constraint \(\sum _{k=1}^K \gamma _{dik} = 1\) gives the following update:

$$\begin{aligned} \gamma _{dik} \propto \bigg ( \prod _{l=1}^L \theta ^{(l)}_{dk} \bigg )^{\frac{1}{L}} \cdot \bigg ( \prod _{l=1}^L \phi ^{(l)}_{kx_{di}} \bigg )^{\frac{1}{L}}. \end{aligned}$$
(22)

The right hand side of Eq. (22) is the product of the geometric mean of the sampled per-document topic probabilities \(\theta ^{(l)}_{dk}\) and the geometric mean of the sampled per-topic word probabilities \(\phi ^{(l)}_{kx_{di}}\). Interestingly, other inference methods for LDA also represent the per-token topic probability as the product of the per-document topic probability and the per-topic word probability [2].
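The update in Eq. (22) is conveniently computed in log space for numerical stability. A sketch for a single token \(x_{di}\) (assuming theta_samples_d and phi_samples_xdi are \(L \times K\) arrays holding the sampled \(\theta ^{(l)}_{dk}\) and \(\phi ^{(l)}_{k x_{di}}\)) is:

```python
import numpy as np

def update_gamma(theta_samples_d, phi_samples_xdi):
    # Geometric means computed as arithmetic means of logs, then normalized
    # over topics, which implements Eq. (22).
    log_gamma = np.log(theta_samples_d).mean(axis=0) + np.log(phi_samples_xdi).mean(axis=0)
    log_gamma -= np.logaddexp.reduce(log_gamma)
    return np.exp(log_gamma)
```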

Algorithm 1. Pseudocode of the proposed inference.

Algorithm 1 gives the pseudocode. As in CVB for LDA [16], we only need to maintain one copy of \(\gamma _{dik}\) for each unique document/word pair. Therefore, we can reduce the number of iterations of the loop on line 11 from \(N_d\) to the number of different words in the dth document. Consequently, the time complexity of the proposed inference for each scan of the data is \(\mathcal {O}(MK)\), where M is the total number of unique document/word pairs. For updating \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\) based on the gradients in Eqs. (19) and (20), we use Adam [8] in this paper.

3.3 Estimation Without Sampling

By setting all standard deviation parameters \(\varvec{\sigma }\) to 0, we obtain an estimate of the log-evidence lower bound without sampling; this is a degenerate version of our proposal. In this case, we only update the parameters \(\gamma _{dik}\) by

$$\begin{aligned} \gamma _{dik} \propto \bigg ( \frac{N_{dk} + \alpha }{N_d + K\alpha } \bigg ) \cdot \bigg ( \frac{N_{kv} + \beta }{N_k + V\beta } \bigg ). \end{aligned}$$
(23)

This is almost the same as the update of \(\gamma _{dik}\) in CVB0 [2], except that the contribution of \(\gamma _{dik}\) is not subtracted from \(N_{dk}\), \(N_{kv}\), and \(N_k\), where \(v = x_{di}\). In the evaluation experiment presented in the next section, we also compared the proposed method with this degenerate version to clarify the effect of the sampling.
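A minimal sketch of the degenerate update in Eq. (23) for a single token with vocabulary word \(v = x_{di}\) (reusing the expected counts defined above) is:

```python
import numpy as np

def degenerate_gamma(N_dk_d, N_kv_v, N_k, N_d_d, alpha, beta, K, V):
    """N_dk_d: (K,) counts for document d; N_kv_v: (K,) counts for word v = x_di."""
    gamma = ((N_dk_d + alpha) / (N_d_d + K * alpha)) * ((N_kv_v + beta) / (N_k + V * beta))
    return gamma / gamma.sum()    # normalize over topics, cf. Eq. (23)
```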

Table 1. Specifications of the document sets

4 Experiment

4.1 Data Sets

In the evaluation experiment, we used the four English document sets in Table 1. NYT is a part of the NYTimes news articles in the “Bag of Words Data Set” of the UCI Machine Learning Repository. We reduced the number of documents to one third of the original due to main memory limitations. MOVIE is the set of movie reviews known as “Movie Review Data.” NSF is the “NSF Research Award Abstracts 1990–2003 Data Set” of the UCI Machine Learning Repository. MED is a subset of the paper abstracts in MEDLINE®/PUBMED®, a database of the U.S. National Library of Medicine. For all document sets, we applied the Porter stemming algorithm and removed highly frequent words and extremely rare words. The average document lengths, i.e., \(\sum _d N_d / D\), of NYT and MOVIE are 344.6 and 459.0, respectively. In contrast, those of NSF and MED are 114.0 and 140.3, respectively. This difference comes from the fact that NSF and MED consist of abstracts.

4.2 Evaluation Method

By using the above four data sets, we compared our proposal with the following three inference methods for LDA: the standard VB [4], CGS [6], and the degenerate version described in Sect. 3.3. The evaluation measure is the test set perplexity. We ran each of the compared inference methods on 90 % of the word tokens randomly selected from each document and used the estimated parameters to compute the test set perplexity on the remaining 10 % of the word tokens as follows:

$$\begin{aligned} perplexity \equiv \exp \bigg \{ - \frac{1}{N_{\text{ test }}} \sum _{d=1}^D \sum _{i \in \mathcal {I}_d} \log \Big ( \sum _{k=1}^K \theta _{dk} \phi _{kx_{di}} \Big ) \bigg \}, \end{aligned}$$
(24)

where \(\mathcal {I}_d\) is the set of the indices of the test word tokens in the dth document, and \(N_\mathrm{test}\) is the total number of the test tokens. For K, i.e., the number of topics, we tested the following three settings: \(K=50, 100, \text{ and } 200\).
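A direct implementation of Eq. (24) (a sketch assuming theta of shape \(D \times K\), phi of shape \(K \times V\), and test_docs[d] holding the held-out tokens of the dth document) is:

```python
import numpy as np

def test_set_perplexity(theta, phi, test_docs):
    log_lik, n_test = 0.0, 0
    for d, tokens in enumerate(test_docs):
        for v in tokens:
            log_lik += np.log(theta[d] @ phi[:, v])  # log sum_k theta_dk * phi_k,x_di
            n_test += 1
    return np.exp(-log_lik / n_test)                 # Eq. (24)
```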

4.3 Inference Settings

The proposed inference was run on each data set in the following manner. We tuned the free parameters on a random training/test split that was prepared only for validation. On lines 7 and 16 in Algorithm 1, \(\tau _{\phi ,kv}\) and \(\tau _{\theta ,dk}\) are updated by using the gradients. For this optimization, we used Adam [8]. The stepsize parameter in Adam was chosen from \(\{0.01, 0.001, 0.0001\}\), while the other parameters were kept at their default settings. The common initial value of the parameters \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\) was chosen from \(\{-10.0, -1.0, -0.5\}\). The sample size L was set to one, because larger sample sizes gave comparable or worse results.

Based on the discussion in [2], we tuned the hyperparameters \(\alpha \) and \(\beta \) of the symmetric Dirichlet priors by a grid search. Each of \(\alpha \) and \(\beta \) was chosen from \(\{\)0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5\(\}\). The same grid search was used for the Dirichlet hyperparameters of the compared methods. The number of small batches in Algorithm 1 was set to 20. The number of iterations, i.e., how many times we scanned the entire document set, was set to 500. We computed ten test set perplexities based on the ten different random splits prepared for evaluation. The test set perplexity in Eq. (24) was computed at the 500th iteration.

4.4 Evaluation Results

Figures 1 and 2 present the evaluation results. We split the results into two figures, because the two data sets NYT and MOVIE differ widely from the other two, NSF and MED, in their average document lengths. This difference may be part of the reason why we obtained different evaluation results. The horizontal axis of the charts in Figs. 1 and 2 gives the three settings for the number of topics: \(K=50\), 100, and 200. The vertical axis gives the test set perplexity averaged over the ten different random splits. The standard deviation of the ten test set perplexities is shown by the error bars. LNV and DEG are the labels for our proposal and its degenerate version, respectively. VB and CGS refer to the standard VB [4] and the collapsed Gibbs sampling [6], respectively.

Fig. 1. Evaluation results in terms of test set perplexity for the two document sets NYT (left) and MOVIE (right), whose average document lengths are relatively long.

Figure 1 presents the results for the two data sets NYT and MOVIE, which consist of relatively long documents. Our proposal LNV led to the best perplexity in four of the six cases shown in Fig. 1. When we set \(K=50\), DEG gave almost the same test set perplexity as LNV for both the NYT and MOVIE data sets; the differences were not statistically significant, because the p-values of the two-tailed t-test were 0.463 and 0.211, respectively. For the other four cases, LNV gave the best perplexity, and the differences were statistically significant in all of them. For example, when we set \(K=100\) for the MOVIE data set, the p-value of the two-tailed t-test comparing LNV and DEG was 0.00134. It can be said that our proposal worked effectively for these two document sets.

Figure 2 shows the results for the two data sets NSF and MED, which consist of relatively short documents. Our proposal LNV gave the best perplexity in the following three cases: \(K=50\) for the NSF data set, \(K=50\) for the MED data set, and \(K=100\) for the MED data set. Note that the difference between LNV and CGS when we set \(K=50\) for the NSF data set was statistically significant, because the p-value of the two-tailed t-test was 0.000121. For the other three cases, CGS gave the best perplexity. For these two data sets, our proposal worked well only when the number of topics was not large.

Note that LNV was superior to VB for all settings presented in Figs. 1 and 2. This shows empirically that the Dirichlet distribution is not necessarily the best choice for approximating the true posterior, even under the same mean field approximation as the standard VB. In sum, we can draw the following conclusion: when LNV is available, there may be no reason to use VB, DEG, or CGS for relatively long documents; for relatively short documents, our proposal may be adopted when the number of latent topics is small.

However, in terms of computation time, LNV has a disadvantage. For example, when we set \(K=200\) for the NYT data set, LNV took 43 h to finish 500 iterations, whereas CGS took 14 h and VB took 23 h. However, inference for LDA is generally easy to parallelize, e.g., on GPUs [18, 19]. The fact that each sample \(\epsilon \sim \mathcal {N}(0,1)\) in the proposed method can be drawn independently may be an advantage for parallelization.

Fig. 2. Evaluation results in terms of test set perplexity for the two document sets NSF (left) and MED (right), whose average document lengths are relatively short.

4.5 Effect of Sampling

The degenerate version of our proposal gave a comparable perplexity only in a limited number of cases in our experiment. When the number of latent topics was large, or when the average document length was short, the degenerate version led to quite a poor perplexity. Especially in the two charts of Fig. 2, the degenerate version seems to exhibit substantial overfitting. Therefore, sampling is indispensable. Recall that L, i.e., the number of samples drawn from the standard normal, was set to 1, because larger numbers of samples did not improve the test set perplexity. Even so, a single random sample can work as a kind of perturbation for the update of the corresponding parameter, and thus change the course of the inference. Without this perturbation, the inference tended to get trapped in local minima, as shown in Fig. 2. This may be the reason why our proposal gave better perplexities than its degenerate version in many of the situations considered in the experiment.

In our experience, it is important to optimize the log-variance parameters \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\) carefully in order to avoid overfitting. When the stepsize parameter of Adam was not tuned, these parameters stayed near their initial values, which made the perplexity almost the same as that of the degenerate version. In addition, the initial values of \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\) also needed to be tuned carefully. However, the tuning was not that difficult, because the optimization was almost always successful once \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\) moved far away from their initial values. Further, we only needed to test several settings for the combination of the stepsize parameter in Adam and the common initial value of \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\). In our experiment, the stepsize parameter was chosen from \(\{0.01, 0.001, 0.0001\}\), and the common initial value of \(\tau _{\theta ,dk}\) and \(\tau _{\phi ,kv}\) was chosen from \(\{-10.0, -1.0, -0.5\}\); therefore, at most nine settings were checked. The combination of 0.001 for the stepsize and \(-0.5\) for the initial value often worked, and we considered other settings only when this one did not.

5 Conclusion

In this paper, we proposed a new VB-type inference method for LDA. Our method is based on stochastic gradient variational Bayes [9, 13] and approximates the true posterior with the logistic normal distribution. The proposed method was better than the standard VB in all situations considered in the experiment and was better even than collapsed Gibbs sampling in many, though not all, situations. Further, when deprived of sampling, the inference tended to get trapped in local minima; sampling was therefore essential.

While we used the logistic normal distribution in the proposed inference, other distributions can be chosen as long as they meet the two requirements given in Sect. 2. Further, a similar inference can be devised for other Bayesian probabilistic models. One important merit of SGVB is that the expectations with respect to the approximate posterior over continuous variables are estimated by the Monte Carlo method. Even when the full joint distribution of the target Bayesian model is complicated, SGVB may keep the computation of such expectations tractable. Therefore, it is worthwhile future work to provide new inference methods for existing Bayesian models using approximate posteriors that have so far been avoided because of the difficulty of handling their expectations analytically.