1 Introduction

In many modern statistical applications, it is necessary to model complex dependent data. In these situations, models which employ observation specific latent variables such as random effects and state space models are widely used because of their flexibility, and Bayesian approaches dealing naturally with the hierarchical structure are attractive in principle. However, incorporating observation specific latent variables leads to a parameter dimension increasing with sample size, and standard Bayesian computational methods can be challenging to implement in very high-dimensional settings. For this reason, approximate inference methods are attractive for these models, both in exploratory settings where many models need to be fitted quickly, as well as in applications involving large datasets where exact methods are infeasible. One of the most common approximate inference paradigms is variational inference (Ormerod and Wand 2010; Blei et al. 2017), which is the approach considered here.

Our main contribution is to consider partitioning the unknowns in a local latent variable model into “global” parameters and “local” latent variables and to suggest ways of structuring the dependence in a variational approximation that match the specification of these models. We go beyond standard Gaussian approximations by defining the variational approximation sequentially, through a marginal density for the global parameters and a conditional density for local parameters given global parameters. Each term in our approximation is Gaussian, but we allow the conditional covariance matrix for the local parameters to depend on the global parameters, which leads to an approximation that is not jointly Gaussian. We are particularly interested in improved inference on global variance and dependence parameters which determine the scale and dependence structure of local latent variables. With this objective, we suggest a parametrization of our conditional approximation to the local variables that is well motivated and respects the exact conditional independence structure in the true posterior distribution. Our approximations are parsimonious in terms of the number of required variational parameters, which is important since a high-dimensional variational optimization is computationally burdensome. The methods suggested improve on Gaussian variational approximation methods for a similar computational cost. Besides defining a novel and useful variational family appropriate to local latent variable models, we also employ importance weighted variational inference methods (Burda et al. 2016; Domke and Sheldon 2018) to further improve the quality of inference and elaborate further on the connections between this approach and the use of Rényi’s divergence within the variational optimization (Li and Turner 2016; Regli and Silva 2018; Yang et al. 2019).

Our method is a contribution to the literature on the development of flexible variational families, and there are many interesting existing methods for this task. One fruitful approach is based on normalizing flows (Rezende and Mohamed 2015), where a variational family is defined using an invertible transformation of a random vector with some known density function. To be useful, the transformation should have an easily computed Jacobian determinant. In the original work of Rezende and Mohamed (2015), compositions of simple flows called radial and planar flows were considered. Later authors have suggested alternatives, such as autoregressive flows (Germain et al. 2015), inverse autoregressive flows (Kingma et al. 2016), and real-valued non-volume preserving transformations (Dinh et al. 2017), among others. Spantini et al. (2018) gives a theoretical framework connecting Markov properties of a target posterior distribution to representations involving transport maps, with normalizing flows being one way to parametrize such mappings. The variational family we consider here can be thought of as a simple autoregressive flow, but carefully constructed to preserve the conditional independence structure in the true posterior and to achieve parsimony in the representation of dependence between local latent variables and global scale parameters. Our work is also related to the hierarchically structured approximations considered in Salimans and Knowles (2013, Sect. 7.1); these authors also consider other flexible approximations based on mixture models and a variety of innovative numerical approaches to the variational optimization. Hoffman and Blei (2015) propose an approach called structured stochastic variational inference which is applicable in conditionally conjugate models. Their approach is similar to ours, in the sense that local variables depend on global variables in the variational posterior. However, conditional conjugacy does not hold in the examples we consider.

The methods we describe can be thought of as extending the Gaussian variational approximation (GVA) of Tan and Nott (2018), where parametrization of the variational covariance matrix was considered in terms of a sparse Cholesky factor of the precision matrix. Similar approximations have been considered for state space models in Archer et al. (2016). The sparse structure reduces the number of free variational parameters and allows matching the exact conditional independence structure in the true posterior. Tan (2018) proposes an approach called reparametrized variational Bayes, where the model is reparametrized by applying an invertible affine transformation to the local variables to minimize their posterior dependency on global variables, before applying a mean field approximation. The affine transformation is obtained by considering a second order Taylor series approximation to the posterior of the local variables conditional on the global variables. One way of improving on Gaussian approximations is to consider mixtures of Gaussians (Jaakkola and Jordan 1998; Salimans and Knowles 2013; Miller et al. 2016; Guo et al. 2016). However, even with a parsimonious parametrization of component densities, a large number of additional variational parameters are added with each mixture component. Other flexible variational families can be formed using copulas (Tran et al. 2015; Han et al. 2016; Smith et al. 2019), hierarchical variational models (Ranganath et al. 2016) or implicit approaches (Huszár 2017).

A closely related work is the framework for hierarchical dynamical systems (HDS) proposed by Roeder et al. (2019), which appeared shortly after this manuscript. To allow their methods to scale up to massive data sets, Roeder et al. (2019) consider amortized variational inference for HDS, which similarly uses doubly reparameterized importance-weighted gradient estimators. A Gaussian mean-field variational posterior was assumed for the latent variables, which were partitioned into three blocks: population, group and individual. The mean and variance parameters of group-level variables were expressed as functions of group indicators, while those of individual-level variables were encoded using neural networks depending on both observations and group indicators. While our parameters are also classified as either local or global, we do not apply the mean-field assumption or amortized variational inference. Instead, we consider conditionally structured variational posteriors which exploit the conditional independence structure of the true posterior and express the local variational parameters as linear functions of the global ones to capture the strong posterior dependencies between local and global variables.

We specify the model and notation in Sect. 2 and introduce the conditionally structured Gaussian variational approximation (CSGVA) in Sect. 3. The algorithm for optimizing the variational parameters is described in Sect. 4, and Sect. 5 highlights the association between GVA and CSGVA. Section 6 describes how CSGVA can be improved using importance weighting. Experimental results and applications to generalized linear mixed models (GLMMs) and state space models are presented in Sects. 7, 8 and 9, respectively. Section 10 gives some concluding discussion.

2 Model specification and notation

Let \(y = (y_1, \dots , y_n)^T\) be observations from a model with global variables \(\theta _G\) and local variables \(\theta _L = (b_1, \dots , b_n)^T\), where \(b_i\) contains latent variables specific to \(y_i\) for \(i=1, \dots , n\). Suppose \(\theta _G\) is a vector of length G and each \(b_i\) is a vector of length L. Let \(\theta = (\theta _G^T, \theta _L^T)^T\). We consider models where the joint density is of the form

$$\begin{aligned} \begin{aligned} p(y, \theta ) =&p(\theta _G) p(b_1, \dots , b_\ell |\theta _G)\bigg \{ \prod _{i=1}^n p(y_i|b_i, \theta _G) \bigg \} \\&\times \bigg \{ \prod _{i> \ell } p(b_i|b_{i-1}, \dots , b_{i-\ell }, \theta _G) \bigg \}. \end{aligned} \end{aligned}$$

The observations \(\{y_i\}\) are conditionally independent given \(\{b_i\}\) and \(\theta _G\). Conditional on \(\theta _G\), the local variables \(\{b_i\}\) form an \(\ell \)th order Markov chain if \(\ell \ge 1\), and they are conditionally independent if \(\ell =0\). This class of models includes important models such as GLMMs and state space models. Next, we define some mathematical notation before discussing CSGVA for this class of models.

2.1 Notation

For an \(r \times r\) matrix A, let \(\mathrm{diag}(A)\) denote the diagonal elements of A and \(\mathrm{dg}(A)\) be the diagonal matrix obtained by setting non-diagonal elements in A to zero. Let \(\text {vec}(A)\) be the vector of length \(r^2\) obtained by stacking the columns of A under each other from left to right and \(v(A)\) be the vector of length \(r(r+1)/2\) obtained from \(\mathrm{vec}(A)\) by eliminating all superdiagonal elements of A. Let \(E_r\) be the \(r(r+1)/2 \times r^2\) elimination matrix, \(K_r\) be the \(r^2 \times r^2\) commutation matrix and \(D_r\) be the \(r^2 \times r(r+1)/2\) duplication matrix (see Magnus and Neudecker 1980). Then \(K_r \mathrm{vec}(A) = \mathrm{vec}(A^T)\), \(E_r \mathrm{vec}(A) = v(A)\), \(E_r^T v(A) = \mathrm{vec}(A)\) if A is lower triangular, and \(D_r v(A) = \mathrm{vec}(A)\) if A is symmetric. Let \(\mathbf{1}_r\) be a vector of ones of length r. Scalar functions applied to vector arguments are evaluated element by element. Let \(\mathrm{d}\) denote the differential operator (see e.g. Magnus and Neudecker 1999).
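To make the notation concrete, here is a minimal sketch in Julia (the language of the code accompanying the paper) of the half-vectorization operator \(v(\cdot )\); the helper name vech is ours and purely illustrative.

```julia
# v(A): vec(A) with the superdiagonal elements removed, i.e. the lower triangle
# (diagonal included) stacked in column-major order.
vech(A) = [A[i, j] for j in 1:size(A, 2) for i in j:size(A, 1)]

A = [1.0 0.0 0.0; 2.0 4.0 0.0; 3.0 5.0 6.0]   # a lower triangular 3 x 3 example
vec(A)     # length r^2 = 9
vech(A)    # length r(r+1)/2 = 6, here [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```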

3 Conditionally structured Gaussian variational approximation

We propose to approximate the posterior distribution \(p(\theta |y)\) of the model defined in Sect. 2 by a density of the form

$$\begin{aligned} q(\theta ) = q(\theta _G) q(\theta _L|\theta _G), \end{aligned}$$

where \(q(\theta _G) = N(\mu _1, \Omega _1^{-1})\), \(q(\theta _L|\theta _G) = N(\mu _2, \Omega _2^{-1})\), and \(\Omega _1\) and \(\Omega _2\) are the precision (inverse covariance) matrices of \(q(\theta _G)\) and \(q(\theta _L|\theta _G)\), respectively. Here \(\mu _2\) and \(\Omega _2\) depend on \(\theta _G\), but we do not denote this explicitly for notational conciseness. Let \(C_1C_1^T\) and \(C_2C_2^T\) be unique Cholesky factorizations of \(\Omega _1\) and \(\Omega _2\), respectively, where \(C_1\) and \(C_2\) are lower triangular matrices with positive diagonal entries. We further define \(C_1^*\) and \(C_2^*\) to be lower triangular matrices of order G and nL, respectively, such that \(C_{r,ii}^* = \log (C_{r,ii})\) and \(C_{r,ij}^* = C_{r,ij}\) if \(i \ne j\) for \(r=1,2\). The purpose of introducing \(C_1^*\) and \(C_2^*\) is to allow unconstrained optimization of the variational parameters in the stochastic gradient ascent algorithm, since diagonal entries of \(C_1\) and \(C_2\) are constrained to be positive. Note that \(C_2\) and \(C_2^*\) also depend on \(\theta _G\) but again we do not show this explicitly in our notation.

As \(\Omega _2\) depends on \(\theta _G\), the joint distribution \(q(\theta )\) is generally non-Gaussian even though \(q(\theta _G)\) and \(q(\theta _L|\theta _G)\) are individually Gaussian. Here we consider a first order approximation and assume that \(\mu _2\) and \(v(C_2^*)\) are linear functions of \(\theta _G\):

$$\begin{aligned} \mu _2 = d + C_2^{-T} D(\mu _1 - \theta _G), \quad v(C_2^*) = f + F \theta _G. \end{aligned}$$
(1)

In (1), d is a vector of length nL, D is a \(nL \times G\) matrix, f is a vector of length \(nL(nL+1)/2\) and F is a \(nL(nL+1)/2 \times G\) matrix. For this specification, \(q(\theta )\) is not jointly Gaussian due to dependence of the covariance matrix of \(q(\theta _L|\theta _G)\) on \(\theta _G\). It is Gaussian if and only if \(F \equiv 0\). The set of variational parameters to be optimized is denoted as

$$\begin{aligned} \lambda = (\mu _1^T, v(C_1^*)^T, d^T, \mathrm{vec}(D)^T, f^T, \mathrm{vec}(F)^T)^T. \end{aligned}$$

As motivation for the linear approximation in (1), consider the linear mixed model,

$$\begin{aligned} \begin{aligned} y_i =&X_i \beta + Z_i b_i + \epsilon _i, \quad (i=1, \dots , n) , \\&\beta \sim N(0, \sigma _\beta ^2I_p), \quad b_i \sim N(0, \Lambda ), \quad \epsilon _i \sim N(0, \sigma _\epsilon ^2 I_{n_i}), \\ \end{aligned} \end{aligned}$$

where \(y_i\) is a vector of responses of length \(n_i\) for the ith subject, \(X_i\) and \(Z_i\) are covariate matrices of dimensions \(n_i \times p\) and \(n_i \times L\), respectively, \(\beta \) is a vector of coefficients of length p and \(\{b_i\}\) are random effects. Assume \(\sigma _\epsilon ^2\) is known. Then the global parameters \(\theta _G\) consist of \(\beta \) and \(\Lambda \). The posterior of \(\theta _L\) conditional on \(\theta _G\) is

$$\begin{aligned} \begin{aligned} p(\theta _L|y, \theta _G)&\propto \prod _{i=1}^n p(y_i|\beta , b_i) p(b_i|\Lambda ) \\&\propto \prod _{i=1}^n \exp [-\{b_i^T ( Z_i^TZ_i /\sigma _\epsilon ^2 + \Lambda ^{-1}) b_i \\&\quad - 2 b_i^T Z_i^T(y_i - X_i \beta ) /\sigma _\epsilon ^2 \}/2]. \end{aligned} \end{aligned}$$

Thus \(p(\theta _L|y, \theta _G) = \prod _{i=1}^n p(b_i|y, \theta _G)\), where \(p(b_i|y, \theta _G)\) is a normal density with precision matrix \(Z_i^TZ_i /\sigma _\epsilon ^2+ \Lambda ^{-1}\) and mean \(( Z_i^TZ_i / \sigma _\epsilon ^2 + \Lambda ^{-1})^{-1} Z_i^T(y_i - X_i \beta )/\sigma _\epsilon ^2\). The precision matrix depends on \(\Lambda ^{-1}\) linearly and the mean depends on \(\beta \) linearly after scaling by the covariance matrix. The linear approximation in (1) tries to mimic this dependence relationship.

The proposed variational density is conditionally structured and highly flexible. Such dependence structure is particularly valuable in constructing variational approximations for hierarchical models, where there are global scale parameters in \(\theta _G\) which help to determine the scale of local latent variables in the conditional posterior of \(\theta _L|\theta _G\). While marginal posteriors of the global variables are often well approximated by Gaussian densities, marginal posteriors of the local variables tend to exhibit more skewness and kurtosis. This deviation from normality can be captured by \(q(\theta _L) = \int q(\theta _G) q(\theta _L|\theta _G) d\theta _G\), which is a mixture of normal densities. The formulation in (1) also allows for a reduction in the number of variational parameters if conditional independence structure consistent with that in the true posterior is imposed on the variational approximation.

3.1 Using conditional independence structure

Tan and Nott (2018) incorporate the conditional independence structure of the true posterior into Gaussian variational approximations by using the fact that zeros in the precision matrix correspond to conditional independence for Gaussian random vectors. This incorporation achieves sparsity in the precision matrix of the approximation and leads to a large reduction in the number of variational parameters to be optimized. For high-dimensional \(\theta \), this sparse structure is especially important because a full Gaussian approximation involves learning a covariance matrix where the number of elements grows quadratically with the dimension of \(\theta \).

Recall that \(\theta _L=(b_1,\dots , b_n)^T\). Suppose \(b_i\) is conditionally independent of \(b_j\) in the posterior for \(|i-j|>\ell \), given \(\theta _G\) and \(\{b_j \mid |i-j|\le \ell \}\). For instance, in a GLMM, \(\{b_i\}\) may be subject specific random effects, and these are conditionally independent given the global parameters, so this structure holds with \(\ell =0\). In the case of a state space model for a time series, \(\{b_i\}\) are the latent states, and this structure holds with \(\ell =1\). Note that ordering of the latent variables is important here.

Now partition the precision matrix \(\Omega _2\) of \(q(\theta _L|\theta _G)\) into \(L \times L\) blocks with n row and n column partitions corresponding to \(\theta _L=(b_1,\dots , b_n)^T\). Let \(\Omega _{2, ij}\) be the block corresponding to \(b_i\) horizontally and \(b_j\) vertically for \(i,j =1, \dots , n\). If \(b_i\) is conditionally independent of \(b_j\) for \(|i-j|>\ell \), given \(\theta _G\) and \(\{b_j \mid |i-j|\le \ell \}\), then we set \(\Omega _{2, ij}=0\) for all pairs (ij) with \(|i - j| > \ell \). Let \({\mathcal {I}}\) denote the indices of elements in \(v(\Omega _2)\) which are fixed at zero by this conditional independence requirement. If we choose \(f_i=0\) and \(F_{ij}=0\) for all \(i\in {\mathcal {I}}\) and all j in (1), then \(C_2^*\) has the same block sparse structure we desire for the lower triangular part of \(\Omega _2\). By Proposition 1 of Rothman et al. (2010), this means that \(\Omega _2\) will have the desired block sparse structure. Hence we impose the constraints \(f_i=0\) and \(F_{ij}=0\) for \(i\in {\mathcal {I}}\) and all j, which reduces the number of variational parameters to be optimized.
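As an illustration of how the index set \({\mathcal {I}}\) can be formed in practice, the following Julia sketch (with hypothetical argument and function names) collects the positions in \(v(C_2^*)\) whose block indices differ by more than \(\ell \).

```julia
# Indices of v(C2*) fixed at zero: theta_L consists of n blocks b_i of length L,
# and the (i, j) block is set to zero whenever |i - j| > ell.
function zero_indices(n, L, ell)
    m = n * L
    idx = Int[]
    k = 0
    for j in 1:m, i in j:m              # column-major order over the lower triangle
        k += 1
        bi, bj = cld(i, L), cld(j, L)   # block row and block column of entry (i, j)
        if abs(bi - bj) > ell
            push!(idx, k)
        end
    end
    return idx
end

zero_indices(3, 2, 0)   # e.g. n = 3 subjects, L = 2 random effects, ell = 0
```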

4 Optimization of variational parameters

To make the dependence on \(\lambda \) explicit, write \(q(\theta )\) as \(q_\lambda (\theta )\). The variational parameters \(\lambda \) are optimized by minimizing the Kullback-Leibler divergence between \(q_\lambda (\theta )\) and the true posterior \(p(\theta |y)\), where

$$\begin{aligned} \mathrm{KL}\{q_\lambda ||p(\theta |y)\}= & {} \int q_\lambda (\theta ) \log \frac{q_\lambda (\theta )}{p(\theta |y)} d\theta \\= & {} \log p(y) - \int q_\lambda (\theta ) \log \frac{p(y,\theta )}{q_\lambda (\theta )} d\theta \ge 0. \end{aligned}$$

Minimizing \(\mathrm{KL}\{q_\lambda ||p(\theta |y)\}\) is therefore equivalent to maximizing an evidence lower bound \({\mathcal {L}}^\text {VI}\) on the log marginal likelihood \(\log p(y)\), where

$$\begin{aligned} {\mathcal {L}}^\text {VI}= E_{q_\lambda } \{ \log p(y, \theta ) - \log q_\lambda (\theta ) \}. \end{aligned}$$
(2)

In (2), \(E_{q_\lambda }\) denotes expectation with respect to \(q_\lambda (\theta )\). We seek to maximize \({\mathcal {L}}^\text {VI}\) with respect to \(\lambda \) using stochastic gradient ascent. Starting with some initial estimate of \(\lambda \), we perform the following update at each iteration t,

$$\begin{aligned} \lambda _t = \lambda _{t-1} + \rho _t {\widehat{\nabla }}_\lambda {\mathcal {L}}^\text {VI}, \end{aligned}$$

where \(\rho _t\) represents a small stepsize taken in the direction of the stochastic gradient \({\widehat{\nabla }}_\lambda {\mathcal {L}}^\text {VI}\). The sequence \(\{\rho _t\}\) should satisfy the conditions \(\sum _t \rho _t = \infty \) and \(\sum _t \rho _t^2 < \infty \) (Spall 2003).

An unbiased estimate of the gradient \(\nabla _\lambda {\mathcal {L}}^\text {VI}\) can be constructed using (2) by simulating \(\theta \) from \(q_\lambda (\theta )\). However, this approach usually results in large fluctuations in the stochastic gradients. Hence we implement the “reparametrization trick” (Kingma and Welling 2014; Rezende et al. 2014; Titsias and Lázaro-Gredilla 2014), which helps to reduce the variance of the stochastic gradients. This approach writes \(\theta \sim q_\lambda (\theta )\) as a function of the variational parameters \(\lambda \) and a random vector s having a density not depending on \(\lambda \). To explain further, let \(s = [s_1^T, s_2^T]^T\), where \(s_1\) and \(s_2\) are vectors of length G and nL corresponding to \(\theta _G\) and \(\theta _L\), respectively. Consider a transformation \(\theta = r_\lambda (s)\) of the form

$$\begin{aligned} \begin{bmatrix} \theta _G \\ \theta _L \end{bmatrix} = \begin{bmatrix} \mu _1 + C_1^{-T} s_1 \\ \mu _2 + C_2^{-T} s_2 \end{bmatrix}. \end{aligned}$$
(3)

Since \(\mu _2\) and \(C_2\) are functions of \(\theta _G\) from (1),

$$\begin{aligned} \begin{aligned} \mu _2&= d + C_2^{-T} D(\mu _1 - \theta _G) = d- C_2^{-T} D C_1^{-T} s_1, \\ v(C_2^*)&= f + F \theta _G = f + F(\mu _1 + C_1^{-T} s_1). \end{aligned} \end{aligned}$$

Hence \(\mu _2\) and \(C_2\) are functions of \(s_1\), and \(\theta _L\) is a function of both \(s_1\) and \(s_2\). This transformation is invertible since given \(\theta \) and \(\lambda \), we can first recover \(s_1 = C_1^T(\theta _G - \mu _1)\), find \(\mu _2\) and \(C_2\), and then recover \(s_2 = C_2^T(\theta _L - \mu _2)\). Applying this transformation,

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^\text {VI}&= \int \phi (s) \{ \log p(y, \theta ) - \log q_\lambda (\theta ) \} d s \\&= E_\phi \{ \log p(y, \theta ) - \log q_\lambda (\theta ) \}, \end{aligned} \end{aligned}$$
(4)

where \(E_\phi \) denotes expectation with respect to \(\phi (s)\) and \(\theta = r_\lambda (s)\).
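A minimal Julia sketch of the transformation \(\theta = r_\lambda (s)\) in (3) is given below, assuming the variational parameters \((\mu _1, v(C_1^*), d, D, f, F)\) have already been unpacked from \(\lambda \); the function names are illustrative only.

```julia
using LinearAlgebra

# Rebuild a lower triangular factor C of order r from the unconstrained vector
# v(C*): fill the lower triangle in column-major order and exponentiate the diagonal.
function build_C(vCstar, r)
    C = zeros(r, r)
    k = 0
    for j in 1:r, i in j:r
        k += 1
        C[i, j] = i == j ? exp(vCstar[k]) : vCstar[k]
    end
    return LowerTriangular(C)
end

# theta = r_lambda(s): s1 and s2 are standard normal draws of length G and nL.
function r_lambda(s1, s2, mu1, vC1star, d, D, f, F)
    C1 = build_C(vC1star, length(s1))
    thetaG = mu1 + C1' \ s1                    # theta_G = mu_1 + C_1^{-T} s_1
    C2 = build_C(f + F * thetaG, length(s2))   # v(C_2^*) = f + F theta_G
    mu2 = d - C2' \ (D * (C1' \ s1))           # mu_2 = d - C_2^{-T} D C_1^{-T} s_1
    thetaL = mu2 + C2' \ s2                    # theta_L = mu_2 + C_2^{-T} s_2
    return thetaG, thetaL
end
```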

4.1 Stochastic gradients

Next, we differentiate (4) with respect to \(\lambda \) to find unbiased estimates of the gradients. As \(\log q_\lambda (\theta )\) depends on \(\lambda \) directly as well as through \(\theta \), applying the chain rule, we have

$$\begin{aligned} \nabla _\lambda {\mathcal {L}}^\text {VI}&= E_\phi [ \nabla _\lambda r_\lambda (s) \{\nabla _\theta \log p(y, \theta ) - \nabla _\theta \log q_\lambda (\theta )\} \nonumber \\&\quad - \nabla _\lambda \log q_\lambda (\theta ) ] \end{aligned}$$
(5)
$$\begin{aligned}&= E_\phi [ \nabla _\lambda r_\lambda (s) \{\nabla _\theta \log p(y, \theta ) - \nabla _\theta \log q_\lambda (\theta )\} ]. \end{aligned}$$
(6)

Note that \(E_\phi \{ \nabla _\lambda \log q_\lambda (\theta ) \} = 0\) as it is the expectation of the score function. Roeder et al. (2017) refer to the expressions inside the expectations in (5) and (6) as the total derivative and path derivative, respectively. In (6), the contributions to the gradient from \(\log p(y, \theta )\) and \(\log q_\lambda (\theta )\) cancel each other if \(q_\lambda (\theta )\) approximates the true posterior well (at convergence). However, the score function \(\nabla _\lambda \log q_\lambda (\theta )\) is not necessarily small even if \(q_\lambda (\theta )\) is a good approximation to \(p(\theta |y)\). This term adversely affects the ability of the algorithm to converge and “stick” to the optimal variational parameters, a phenomenon Roeder et al. (2017) refer to as “sticking the landing”. Hence, we consider the path derivative,

$$\begin{aligned} {\widehat{\nabla }}_\lambda {\mathcal {L}}^\text {VI}= \nabla _\lambda r_\lambda (s) \{\nabla _\theta \log p(y, \theta ) - \nabla _\theta \log q_\lambda (\theta )\} \end{aligned}$$
(7)

as an unbiased estimate of the true gradient \(\nabla _\lambda {\mathcal {L}}^\text {VI}\). Tan and Nott (2018) and Tan (2018) also demonstrate that the path derivative has smaller variation about zero when the algorithm is close to convergence.

Let \(\nabla _\theta \log p(y, \theta ) - \nabla _\theta \log q_\lambda (\theta ) = ({\mathcal {G}}_1^T, {\mathcal {G}}_2^T)^T\), where \({\mathcal {G}}_1\) and \({\mathcal {G}}_2\) are vectors of length G and nL, respectively, corresponding to the partitioning of \(\theta =[\theta _G^T, \theta _L^T]^T\). Then \({\widehat{\nabla }}_\lambda {\mathcal {L}}^\text {VI}= \nabla _\lambda r_\lambda (s) ({\mathcal {G}}_1^T, {\mathcal {G}}_2^T)^T\) is given by

$$\begin{aligned} \begin{bmatrix} {\mathcal {G}}_1 + \nabla _{\mu _1} \theta _L {\mathcal {G}}_2 \\ - D_1^* v[ C_1^{-T} s_1 \{{\mathcal {G}}_1 + (\nabla _{\mu _1} \theta _L - D^T C_2^{-1}) {\mathcal {G}}_2\} ^T C_1^{-T} ]\\ {\mathcal {G}}_2 \\ -\mathrm{vec}( C_2^{-1}{\mathcal {G}}_2 s_1^T C_1^{-1} )\\ \nabla _{f} \theta _L {\mathcal {G}}_2\\ \mathrm{vec}(\nabla _{f} \theta _L {\mathcal {G}}_2\theta _G^T) \\ \end{bmatrix}, \end{aligned}$$

where

$$\begin{aligned} \begin{aligned} \nabla _{\mu _1} \theta _L&= F^T \nabla _{f} \theta _L ,\\ \nabla _{f} \theta _L {\mathcal {G}}_2&= - D_2^*v\{C_2^{-T} (s_2 - DC_1^{-T} s_1) {\mathcal {G}}_2^T C_2^{-T} \}. \end{aligned} \end{aligned}$$

Here \(D^*_1\) and \(D^*_2\) are diagonal matrices of order \(G(G+1)/2\) and \(nL(nL+1)/2\), respectively, such that \(\mathrm{d}v(C_r) = D^*_r \mathrm{d}v(C_r^*)\) for \(r=1,2\). Formally, \(D^*_1 = \mathrm{diag}\{ v(\mathrm{dg}(C_1) + \mathbf{1}_G\mathbf{1}_G^T - I_G) \}\) and \(D^*_2 = \mathrm{diag}\{ v(\mathrm{dg}(C_2) + \mathbf{1}_{nL} \mathbf{1}_{nL}^T - I_{nL}) \}\). The full expression and derivation of \(\nabla _\lambda r_\lambda (s) \) are given in “Appendix A”. In addition, we show (in “Appendix A”) that

$$\begin{aligned} \begin{aligned}&\nabla _\theta \log q_\lambda (\theta ) = \begin{bmatrix} \nabla _{\theta _G} \log q_\lambda (\theta ) \\ \nabla _{\theta _L} \log q_\lambda (\theta ) \end{bmatrix} \\&\quad = \begin{bmatrix} F^T [v(I_{nL})- D_2^* v\{(\theta _L - d)s_2^T\}]- C_1 s_1- D^T s_2 \\ -C_2 s_2 \end{bmatrix}. \end{aligned} \end{aligned}$$

\(\nabla _\theta \log p(y, \theta )\) is model specific and we discuss the application to GLMMs and state space models in Sects. 8 and 9, respectively.
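The score \(\nabla _\theta \log q_\lambda (\theta )\) above can be transcribed directly into code; the following Julia sketch assumes \(C_1\), \(C_2\), \(s_1\) and \(s_2\) are those produced by the reparametrization (3), and the helper names are illustrative.

```julia
using LinearAlgebra

vech(A) = [A[i, j] for j in 1:size(A, 2) for i in j:size(A, 1)]   # v(.)

function grad_theta_logq(C1, C2, D, F, d, s1, s2, thetaL)
    m = size(C2, 1)                                    # m = nL
    dstar = vech(Diagonal(diag(C2)) + ones(m, m) - I)  # diagonal of D_2^*
    gG = F' * (vech(Matrix(1.0I, m, m)) - dstar .* vech((thetaL - d) * s2')) -
         C1 * s1 - D' * s2
    gL = -(C2 * s2)
    return gG, gL                                      # gradients w.r.t. theta_G and theta_L
end
```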

4.2 Stochastic variational algorithm

The stochastic gradient ascent algorithm for CSGVA is outlined in Algorithm 1. For computing the stepsize, we use Adam (Kingma and Ba 2015), which uses bias-corrected estimates of the first and second moments of the stochastic gradients to compute adaptive learning rates.

Algorithm 1 Stochastic gradient ascent algorithm for CSGVA

At iteration t, the variational parameter \(\lambda \) is updated as \(\lambda _t = \lambda _{t-1}+ \Delta _t\). Let \(g_t\) denote the stochastic gradient estimate at iteration t. In steps 3 and 4, Adam computes estimates of the mean (first moment) and uncentered variance (second moment) of the gradients using exponential moving averages, where \(\tau _1,\tau _2 \in [0,1)\) control the decay rates. In step 4, \(g_t^2\) is evaluated as \(g_t \odot g_t\), where \(\odot \) denotes the Hadamard (element-wise) product. As \(m_t\) and \(v_t\) are initialized as zero, these moment estimates tend to be biased towards zero, especially at the beginning of the algorithm if \(\tau _1\), \(\tau _2\) are close to one. As \(m_t = (1-\tau _1) \sum _{i=1}^t \tau _1^{t-i} g_i\),

$$\begin{aligned} E(m_t) = E(g_t) (1-\tau _1^t) + \zeta , \end{aligned}$$

where \(\zeta =0\) if \(E(g_i) = E(g_t)\) for \(1 \le i < t\). Otherwise, \(\zeta \) can be kept small since the weights for past gradients decrease exponentially. An analogous argument holds for \(v_t\). Thus the bias can be corrected by using the estimates \({\hat{m}}_t\) and \({\hat{v}}_t\) in steps 5 and 6. The change is then computed as

$$\begin{aligned} \Delta _t = \frac{\alpha {\hat{m}}_t}{\sqrt{{\hat{v}}_t} + \epsilon }, \end{aligned}$$

where \(\alpha \) controls the magnitude of the stepsize and \(\epsilon \) is a small positive constant which ensures that the denominator is positive. In our experiments, we set \(\alpha = 0.001\), \(\tau _1 = 0.9\), \(\tau _2=0.99\) and \(\epsilon = 10^{-8}\), values close to what is recommended by Kingma and Ba (2015).
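For concreteness, a minimal Julia sketch of the Adam update described above (corresponding to steps 3–6 of Algorithm 1 and the change \(\Delta _t\)) is as follows; it updates \(\lambda \), m and v in place given the stochastic gradient g from step 2.

```julia
function adam_update!(lambda, m, v, g, t; alpha=0.001, tau1=0.9, tau2=0.99, eps=1e-8)
    @. m = tau1 * m + (1 - tau1) * g       # first moment estimate (exponential moving average)
    @. v = tau2 * v + (1 - tau2) * g^2     # second moment estimate
    mhat = m ./ (1 - tau1^t)               # bias-corrected first moment
    vhat = v ./ (1 - tau2^t)               # bias-corrected second moment
    @. lambda = lambda + alpha * mhat / (sqrt(vhat) + eps)   # lambda_t = lambda_{t-1} + Delta_t
    return lambda
end
```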

At each iteration t, we can also compute an unbiased estimate of the lower bound,

$$\begin{aligned} {\hat{{\mathcal {L}}}}^\text {VI}= \log p(y, \theta ) - \log q_{\lambda _{t-1}}(\theta ), \end{aligned}$$

where \(\theta \) is computed in step 1. Since these estimates are stochastic, we follow the path traced by \({\bar{{\mathcal {L}}}}^\text {VI}\), the lower bound estimates averaged over every 1000 iterations, as a means to diagnose the convergence of Algorithm 1. \({\bar{{\mathcal {L}}}}^{\text {VI}}\) tends to increase monotonically at the start, but as the algorithm comes close to convergence, the values of \({\bar{{\mathcal {L}}}}^\text {VI}\) fluctuate about the maximum lower bound. Hence, we fit a least squares regression line to the past \(\kappa \) values of \({\bar{{\mathcal {L}}}}^\text {VI}\) and terminate Algorithm 1 once the gradient of the regression line becomes negative (see Tan 2018). For our experiments, we set \(\kappa = 6\).
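The stopping rule can be sketched in a few lines of Julia; here Lbar is the vector of lower bound averages over successive blocks of 1000 iterations, and the function name is illustrative.

```julia
using Statistics

# Stop once the least-squares slope over the last kappa averaged lower bounds is negative.
function stop_now(Lbar; kappa = 6)
    length(Lbar) < kappa && return false
    y = Lbar[end-kappa+1:end]
    x = collect(1.0:kappa)
    return cov(x, y) / var(x) < 0     # slope of the fitted regression line
end
```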

5 Links to Gaussian variational approximation

CSGVA is an extension of Gaussian variational approximation (GVA, Tan and Nott 2018). In both approaches, the conditional posterior independence structure of the local latent variables is used to introduce sparsity in the precision matrix of the approximation. Below we demonstrate that GVA is a special case of CSGVA when \(F \equiv 0\).

Tan and Nott (2018) consider a GVA of the form

$$\begin{aligned} \begin{bmatrix} \theta _L \\ \theta _G \end{bmatrix} \sim {\mathcal {N}}\left( \begin{bmatrix} \mu _L \\ \mu _G \end{bmatrix}, T^{-T} T^{-1} \right) \,\text {where} \; T = \begin{bmatrix} T_{LL} &{} 0 \\ T_{GL} &{} T_{GG} \end{bmatrix}. \end{aligned}$$

Note that \(T_{LL}\) and \(T_{GG}\) are lower triangular matrices. Using a vector \(s = [s_{L}^T, s_G^T]^T \sim N(0, I_{nL+G})\), we can write

$$\begin{aligned} \begin{aligned}&\begin{bmatrix} \theta _L \\ \theta _G \end{bmatrix} = \begin{bmatrix} \mu _L \\ \mu _G \end{bmatrix} + T^{-T} \begin{bmatrix} s_{L} \\ s_G \end{bmatrix}, \\&\text {where} \quad T^{-T} = \begin{bmatrix} T_{LL}^{-T} &{} - T_{LL}^{-T} T_{GL}^T T_{GG}^{-T} \\ 0 &{} T_{GG}^{-T} \end{bmatrix}. \end{aligned} \end{aligned}$$

Assuming \(F\equiv 0\) for CSGVA, we have from (3) that

$$\begin{aligned} \begin{aligned} \begin{bmatrix} \theta _L \\ \theta _G \end{bmatrix} = \begin{bmatrix} d \\ \mu _1 \end{bmatrix} + \begin{bmatrix} C_2^{-T} &{} - C_2^{-T} D C_1^{-T} \\ 0 &{} C_1^{-T} \end{bmatrix} \begin{bmatrix} s_2 \\ s_1 \end{bmatrix}, \end{aligned} \end{aligned}$$

where \([s_2^T, s_1^T]^T \sim N(0, I_{nL+G})\). Hence we can identify

$$\begin{aligned} \mu _1 = \mu _G, \; d = \mu _L, \; C_1 = T_{GG}, \; C_2 = T_{LL}, \; D = T_{GL}^T. \end{aligned}$$

If the standard way of initializing Algorithm 1 (by setting \(\lambda =0\)) does not work well, we can use this correspondence to initialize Algorithm 1 using the fit from GVA. This informative initialization can reduce computation time significantly, although there may be a risk of getting stuck in a local mode.
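A sketch of this initialization in Julia, using the identification above (the \(F \equiv 0\) case), might look as follows; the inputs are the fitted GVA quantities and all names are illustrative.

```julia
using LinearAlgebra

vech(A) = [A[i, j] for j in 1:size(A, 2) for i in j:size(A, 1)]     # v(.)
to_star(C) = C - Diagonal(diag(C)) + Diagonal(log.(diag(C)))        # C -> C^* (log the diagonal)

function init_from_gva(mu_G, mu_L, T_GG, T_LL, T_GL)
    mu1, d  = mu_G, mu_L
    vC1star = vech(to_star(T_GG))
    D       = Matrix(T_GL')
    f       = vech(to_star(T_LL))
    F       = zeros(length(f), length(mu_G))   # F = 0 recovers the GVA exactly
    return (mu1 = mu1, vC1star = vC1star, d = d, D = D, f = f, F = F)
end
```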

6 Importance weighted variational inference

Here we discuss how CSGVA can be improved by maximizing an importance weighted lower bound (IWLB, Burda et al. 2016), which leads to a tighter lower bound on the log marginal likelihood, and a variational approximation less prone to underestimation of the true posterior variance. We also relate the IWLB with Rényi’s divergence (Rényi 1961; van Erven and Harremos 2014) between \(q_\lambda (\theta )\) and \(p(\theta |y)\), demonstrating that maximizing the IWLB instead of the usual evidence lower bound leads to a transition in the behavior of the variational approximation from “mode-seeking” to “mass-covering”. We first define Rényi’s divergence and the variational Rényi bound (Li and Turner 2016), before introducing the IWLB as the expectation of a Monte Carlo approximation of the variational Rényi bound.

6.1 Rényi’s divergence and variational Rényi bound

Rényi’s divergence provides a measure of the distance between two densities q and p, and it is defined as

$$\begin{aligned} D_\alpha (q||p) = \frac{1}{\alpha -1} \log \int q(\theta )^\alpha p(\theta )^{1-\alpha } d\theta , \end{aligned}$$

for \(0<\alpha <\infty \), \(\alpha \ne 1\). This definition can be extended by continuity to the orders 0, 1 and \(\infty \), as well as to negative orders \(\alpha < 0\). Note that \(D_\alpha (q||p)\) is no longer a divergence measure if \(\alpha < 0\), but we can write \(D_\alpha (q||p)\) as \(\frac{\alpha }{1-\alpha } D_{1-\alpha }(p||q)\) for \(\alpha \notin \{0,1\}\) by the skew symmetry property. As \(\alpha \) approaches 1, the limit of \(D_\alpha (q||p)\) is the Kullback-Leibler divergence, \(\mathrm{KL}(q||p)\). In variational inference, minimizing the Kullback-Leibler divergence between the variational density \(q_\lambda (\theta )\) and the true posterior \(p(\theta |y)\) is equivalent to maximizing a lower bound \({\mathcal {L}}^{\text {VI}}\) on the log marginal likelihood due to the relationship:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^\text {VI}&= \log p(y) - \mathrm{KL}\{q_\lambda ||p(\theta |y)\} = E_{q_\lambda }\left\{ \log \frac{ p(y, \theta )}{q_\lambda (\theta )} \right\} . \end{aligned} \end{aligned}$$

Generalizing this relation using Rényi’s divergence measure, Li and Turner (2016) define the variational Rényi bound \({\mathcal {L}}_\alpha \) as

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_\alpha&= \log p(y) - D_\alpha \{q_\lambda ||p(\theta |y)\} \\&= \frac{1}{1-\alpha } \log E_{q_\lambda }\left\{ \left( \frac{ p(y, \theta ) }{q_\lambda (\theta )} \right) ^{1-\alpha } \right\} . \end{aligned} \end{aligned}$$

Note that \({\mathcal {L}}_1\), the limit of \({\mathcal {L}}_\alpha \) as \(\alpha \rightarrow 1\), is equal to \({\mathcal {L}}^\text {VI}\). A Monte Carlo approximation of \({\mathcal {L}}_\alpha \) when the expectation is intractable is

$$\begin{aligned} {\hat{{\mathcal {L}}}}_{\alpha , K} = \frac{1}{1-\alpha } \log \frac{1}{K} \sum _{k=1}^K w_k^{1-\alpha }, \end{aligned}$$
(8)

where \(\Theta _K = [\theta _1, \dots , \theta _K]\) is a set of K samples generated randomly from \(q_\lambda (\theta )\), and

$$\begin{aligned} w_k = w(\theta _k) = \frac{p(y, \theta _k)}{q_\lambda (\theta _k)}, \quad k=1, \dots , K, \end{aligned}$$

are importance weights. For each k, \(E_{q_\lambda } (w_k) = p(y)\). The limit of \({\hat{{\mathcal {L}}}}_{\alpha , K}\) as \(\alpha \rightarrow 1\) is \(\frac{1}{K} \sum _{k=1}^K \log \frac{p(y, \theta _k)}{q_\lambda (\theta _k)}\). Hence \({\hat{{\mathcal {L}}}}_{1, K}\) is an unbiased estimate of \({\mathcal {L}}_1\) as \(E_{\Theta _K}({\hat{{\mathcal {L}}}}_{1, K}) = {\mathcal {L}}_1 = {\mathcal {L}}^\text {VI}\), where \(E_{\Theta _K}\) denotes expectation with respect to \(q(\Theta _K) = \prod _{k=1}^{K} q_\lambda (\theta _k)\). For \(\alpha \ne 1\), \({\hat{{\mathcal {L}}}}_{\alpha , K} \) is not an unbiased estimate of \({\mathcal {L}}_\alpha \).
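The estimate (8) is conveniently computed from the log importance weights with a log-sum-exp; a minimal Julia sketch (with illustrative names) is given below.

```julia
# hat L_{alpha, K} from the log weights logw_k = log p(y, theta_k) - log q_lambda(theta_k).
function renyi_bound_estimate(logw, alpha)
    K = length(logw)
    isapprox(alpha, 1) && return sum(logw) / K        # limit as alpha -> 1
    a = (1 - alpha) .* logw
    m = maximum(a)
    return (m + log(sum(exp.(a .- m)) / K)) / (1 - alpha)   # log-sum-exp for stability
end
```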

6.2 Importance weighted lower bound

The importance weighted lower bound (IWLB, Burda et al. 2016) is defined as

$$\begin{aligned} {\mathcal {L}}_K^\text {IW}= E_{\Theta _K}({\hat{{\mathcal {L}}}}_{0, K}) = E_{\Theta _K}\left( \log \frac{1}{K} \sum _{k=1}^K w_k \right) , \end{aligned}$$

where \(\alpha = 0\) in (8). It reduces to \({\mathcal {L}}^{\text {VI}}\) when \(K=1\). By Jensen’s inequality,

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_K^\text {IW}\le \log E_{\Theta _K} \left( \frac{1}{K} \sum _{k=1}^K w_k \right) = \log p(y). \end{aligned} \end{aligned}$$

Thus \({\mathcal {L}}_K^\text {IW}\) provides a lower bound to the log marginal likelihood for any positive integer K. From Theorem 1 (Burda et al. 2016), this bound becomes tighter as K increases.

Theorem 1

\({\mathcal {L}}^\text {IW}_K\) increases with K and approaches \(\log p(y)\) as \(K \rightarrow \infty \) if \(w_k\) is bounded.

Proof

Let \(I = (I_1, \dots , I_K)\) be indices selected randomly without replacement from \(\{1, \dots , K+1\}\). Then \(E_{I|\Theta _{K+1}}(w_{I_j}) = \frac{1}{K+1} \sum _{k=1}^{K+1} w_k\) for any \(j=1, \dots , K\), where \(E_{I|\Theta _{K+1}}\) denotes the expectation associated with the randomness in selecting I given \(\Theta _{K+1}\). Thus

$$\begin{aligned} \begin{aligned} {\mathcal {L}}^\text {IW}_{K+1}&= E_{\Theta _{K+1}} \left( \log \frac{1}{K+1} \sum \nolimits _{k=1}^{K+1} w_k \right) \\&= E_{\Theta _{K+1}} \left\{ \log E_{I|\Theta _{K+1}} \left( \frac{1}{K} \sum \nolimits _{j=1}^K w_{I_j} \right) \right\} \\&\ge E_{\Theta _{K+1}} \left\{ E_{I|\Theta _{K+1}} \log \left( \frac{1}{K} \sum \nolimits _{j=1}^K w_{I_j}\right) \right\} \\&= E_{\Theta _{K}} \log \left( \frac{1}{K} \sum \nolimits _{k=1}^K w_k \right) = {\mathcal {L}}^\text {IW}_K. \end{aligned} \end{aligned}$$

If \(w_k\) is bounded, then \(\frac{1}{K} \sum _{k=1}^K w_k \overset{a.s.}{\rightarrow } p(y)\) as \(K \rightarrow \infty \) by the law of large numbers. Hence \({\mathcal {L}}^\text {IW}_K {=} E_{\Theta _K} ( \log \frac{1}{K} \sum _{k=1}^K w_k ) \rightarrow \log p(y)\) as \(K \rightarrow \infty \). \(\square \)

Next, we present some properties of Rényi’s divergence and \(E_{\Theta _K}({\hat{{\mathcal {L}}}}_{\alpha , K})\) which are important in understanding the behavior of the variational density arising from maximizing \({\mathcal {L}}^\text {IW}_K\). The proofs of these properties can be found in van Erven and Harremos (2014) and Li and Turner (2016).

Property 1

\(D_\alpha \) is increasing in \(\alpha \), and is continuous in \(\alpha \) on \([0,1] \cup \{\alpha \notin [0,1] \mid |D_\alpha | < \infty \}\).

Property 2

\(E_{\Theta _K}({\hat{{\mathcal {L}}}}_{\alpha , K})\) is continuous and decreasing in \(\alpha \) for fixed K.

Theorem 2

There exists \(0 \le \alpha _{q_\lambda ,K} \le 1\) for given \(q_\lambda \) and K such that

$$\begin{aligned} \log p(y) = D_{\alpha _{q_\lambda ,K}}\{q_\lambda ||p(\theta |y)\} + {\mathcal {L}}_K^\text {IW}. \end{aligned}$$

Proof

From Property 2,

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_1 =E_{\Theta _K}({\hat{{\mathcal {L}}}}_{1, K})&\le E_{\Theta _K}({\hat{{\mathcal {L}}}}_{0, K}) = {\mathcal {L}}^\text {IW}_K \le {\mathcal {L}}_0, \\ {\mathcal {L}}_1- \log p(y)&\le {\mathcal {L}}^\text {IW}_K - \log p(y) \le {\mathcal {L}}_0 - \log p(y), \\ D_0\{ q_\lambda ||p(\theta |y)\}&\le \log p(y) - {\mathcal {L}}^\text {IW}_K \le D_1\{ q_\lambda ||p(\theta |y)\} . \end{aligned} \end{aligned}$$

From Property 1, since \(D_\alpha \) is continuous and increasing for \(\alpha \in [0,1]\), there exists \(0 \le \alpha _{q_\lambda ,K} \le 1\) such that \(\log p(y) - {\mathcal {L}}^\text {IW}_K = D_{\alpha _{q_\lambda ,K}}\{q_\lambda ||p(\theta |y)\}\). \(\square \)

Minimizing Rényi’s divergence for \(\alpha \gg 1\) tends to produce approximations which are mode-seeking (zero-forcing), while maximizing Rényi’s divergence for \(\alpha \ll 0\) encourages mass-covering behavior. Theorem 2 suggests that maximizing the IWLB results in a variational approximation \(q_\lambda \) whose Rényi’s divergence from the true posterior can be captured with \(0 \le \alpha \le 1\), which represents a certain balance between mode-seeking and mass-covering behavior (Minka 2005). In our experiments, we observe that maximizing the IWLB is highly effective in correcting the underestimation of posterior variance in variational inference.

Alternatively, if we approximate \({\mathcal {L}}_K^\text {IW}\) by considering a second-order Taylor expansion of \(\log {\bar{w}}_K\) about \(E_{\Theta _K}({\bar{w}}_K) = p(y)\), where \({\bar{w}}_K =\frac{1}{K} \sum _{k=1}^K w_k\), and then take expectations, we have

$$\begin{aligned} {\mathcal {L}}_K^\text {IW}\approx \log p(y) - \frac{\mathrm{var}(w_k)}{2K p(y)^2}. \end{aligned}$$

Maddison et al. (2017) and Domke and Sheldon (2018) provide bounds on the order of the remainder term in the Taylor approximation above, and demonstrate that the “looseness” of the IWLB is governed by \(\mathrm{var}(w_k)\) as \(K \rightarrow \infty \). Minimizing \(\mathrm{var}(w_k)\) is equivalent to minimizing the \(\chi ^2\) divergence \(D_2(p||q)\). Note that if \(q_\lambda (\theta )\) has thin tails compared to \(p(\theta |y)\), then \(\mathrm{var}(w_k)\) will be large. Hence minimizing \(\mathrm{var}(w_k)\) attempts to match \(p(\theta |y)\) with \(q_\lambda (\theta )\) so that \(q_\lambda (\theta )\) is able to cover the tails.
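For completeness, the expansion behind this approximation is

$$\begin{aligned} \log {\bar{w}}_K \approx \log p(y) + \frac{{\bar{w}}_K - p(y)}{p(y)} - \frac{\{{\bar{w}}_K - p(y)\}^2}{2p(y)^2}, \end{aligned}$$

and taking expectations, using \(E_{\Theta _K}({\bar{w}}_K) = p(y)\) and \(\mathrm{var}({\bar{w}}_K) = \mathrm{var}(w_k)/K\), yields the expression for \({\mathcal {L}}_K^\text {IW}\) displayed above.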

6.3 Unbiased gradient estimate of importance weighted lower bound

To maximize the IWLB in CSGVA, we need to find an unbiased estimate of \(\nabla _\lambda {\mathcal {L}}_K^\text {IW}\) using the transformation in (3). Let \(s_k \sim N(0, I_{G+nL})\), \(\theta _k = r _\lambda (s_k)\) for \(k=1, \dots , K\), and \(S_K = [s_1, \dots , s_K]^T\).

$$\begin{aligned} \begin{aligned} \nabla _\lambda {\mathcal {L}}_K^\text {IW}&= \nabla _\lambda E_{\Theta _K} (\log {\bar{w}}_K )= \nabla _\lambda E_{S_K} ( \log {\bar{w}}_K )\\&= E_{S_K} \left[ \sum _{k=1}^K \frac{ \nabla _\lambda w_k }{ \sum _{k'=1}^K w_{k'}} \right] \\&= E_{S_K} \left[ \sum _{k=1}^K \frac{ w_k \nabla _\lambda \log w_k }{ \sum _{k'=1}^K w_{k'} } \right] \\&= E_{S_K} \left[ \sum \nolimits _{k=1}^K {\widetilde{w}}_k \nabla _\lambda \log w_k \right] , \end{aligned} \end{aligned}$$
(9)

where \(w_k = w(\theta _k) = w\{r _\lambda (s_k)\}\) and \({\widetilde{w}}_k = w_k/ \sum _{k'=1}^K w_{k'}\) for \(k=1, \dots , K\) are normalized importance weights. Applying the chain rule,

$$\begin{aligned} \begin{aligned} \nabla _\lambda \log w_k&= \nabla _\lambda r_\lambda (s_k) \nabla _{\theta _k} \log w_k - \nabla _\lambda \log q_\lambda (\theta _k). \end{aligned} \end{aligned}$$
(10)

In Sect. 4.1, we note that \(E_\phi \{ \nabla _\lambda \log q_\lambda (\theta ) \} = 0\) as it is the expectation of the score function, and hence, we can omit \(\nabla _\lambda \log q_\lambda (\theta )\) to obtain an unbiased estimate of \(\nabla _\lambda {\mathcal {L}}^\text {VI}\). However, in this case, it is unclear if

$$\begin{aligned} E_{S_K} \left[ \sum _{k=1}^K {\widetilde{w}}_k \nabla _\lambda \log q_\lambda (\theta _k) \right] = 0. \end{aligned}$$
(11)

Roeder et al. (2017) conjecture that (11) is true and report improved results when omitting the term \(\nabla _\lambda \log q_\lambda (\theta _k)\) from \(\nabla _\lambda \log w_k\) in computing gradient estimates. However, Tucker et al. (2018) demonstrate via simulations that (11) does not hold generally and that such omission will result in biased gradient estimates. Our own simulations using CSGVA also suggest that (11) does not hold even though omission of \(\nabla _\lambda \log q_\lambda (\theta _k)\) does lead to improved results. As the stochastic gradient algorithm is not guaranteed to converge with biased gradient estimates, we turn to the doubly reparametrized gradient estimate proposed by Tucker et al. (2018), which allows unbiased gradient estimates to be constructed with the omission of \(\nabla _\lambda \log q_\lambda (\theta _k)\), albeit with revised weights.

Since \({\tilde{w}}_k\) depends on \(\lambda \) directly as well as through \(\theta _k\), we use the chain rule to obtain

$$\begin{aligned} \begin{aligned} \nabla _\lambda E_{\theta _k}({\tilde{w}}_k)&= \nabla _\lambda E_{s_k}({\tilde{w}}_k) \\&= E_{s_k}(\nabla _\lambda \theta _k \nabla _{\theta _k} {\tilde{w}}_k) + E_{s_k}(\nabla _\lambda {\tilde{w}}_k), \end{aligned} \end{aligned}$$
(12)

where

$$\begin{aligned} \begin{aligned} \nabla _{\theta _k} {\tilde{w}}_k&= \left\{ \frac{1}{\sum _{k'=1}^K w_{k'}} - \frac{w_k }{(\sum _{k'=1}^K w_{k'})^2} \right\} \nabla _{\theta _k} w_k \\&= ( {\tilde{w}}_k- {\tilde{w}}_k^2 ) \nabla _{\theta _k} \log w_k. \end{aligned} \end{aligned}$$

Alternatively,

$$\begin{aligned} \begin{aligned} \nabla _\lambda E_{\theta _k}({\tilde{w}}_k)&=\nabla _\lambda \int q_\lambda (\theta _k) {\tilde{w}}_k \;d\theta _k \\&= \int {\tilde{w}}_k \nabla _\lambda q_\lambda (\theta _k) + q_\lambda (\theta _k) \nabla _\lambda {\tilde{w}}_k \; d\theta _k \\&= \int {\tilde{w}}_k q_\lambda (\theta _k) \nabla _\lambda \log q_\lambda (\theta _k) \;d\theta _k + E_{\theta _k}(\nabla _\lambda {\tilde{w}}_k) \\&= E_{\theta _k} [{\tilde{w}}_k \nabla _\lambda \log q_\lambda (\theta _k) ] + E_{s_k} (\nabla _\lambda {\tilde{w}}_k). \end{aligned} \end{aligned}$$
(13)

Comparing (12) and (13), we have

$$\begin{aligned} \begin{aligned}&E_{\Theta _K} \left( \sum \nolimits _{k=1}^K {\tilde{w}}_k \nabla _\lambda \log q_\lambda (\theta _k) \right) \\&\quad = \sum \nolimits _{k=1}^K E_{\Theta _K\backslash \theta _k} E_{\theta _k} [{\tilde{w}}_k \nabla _\lambda \log q_\lambda (\theta _k)] \\&\quad = \sum \nolimits _{k=1}^K E_{S_K\backslash s_k} E_{s_k}(\nabla _\lambda \theta _k ( {\tilde{w}}_k- {\tilde{w}}_k^2 ) \nabla _{\theta _k} \log w_k) \\&\quad = E_{S_K} \left\{ \sum \nolimits _{k=1}^K ( {\tilde{w}}_k - {\tilde{w}}_k^2 ) \nabla _\lambda r_\lambda (s_k) \nabla _{\theta _k} \log w_k \right\} . \end{aligned} \end{aligned}$$

Combining the above expression with (9) and (10), we find that

$$\begin{aligned} \nabla _\lambda {\mathcal {L}}_K^\text {IW}= E_{S_K} \left( \sum \nolimits _{k=1}^K {\tilde{w}}_k^2 \nabla _\lambda r_\lambda (s_k) \nabla _{\theta _k} \log w_k \right) . \end{aligned}$$

An unbiased gradient estimate is thus given by

$$\begin{aligned} {\widehat{\nabla }}_\lambda {\mathcal {L}}_K^\text {IW}= \sum _{k=1}^K {\tilde{w}}_k^2 \nabla _\lambda r_\lambda (s_k) \nabla _{\theta _k} \{\log p(y, \theta _k) - \log q_\lambda (\theta _k) \}. \end{aligned}$$

Thus, to use CSGVA with importance weights, we only need to modify Algorithm 1 by drawing K samples \(s_1, \dots , s_K\) in step 1 instead of a single sample and then compute the unbiased gradient estimate, \(g_t = {\widehat{\nabla }}_\lambda {\mathcal {L}}_K^\text {IW}\), in step 2. The rest of the steps in Algorithm 1 remain unchanged. In the importance weighted version of CSGVA, the gradient estimate based on a single sample s is replaced by a weighted sum of the gradients in (7) based on K samples \(s_1, \dots , s_K\). However, these weights do not necessarily sum to 1. An unbiased estimate of \({\mathcal {L}}_K^\text {IW}\) itself is given by \({\hat{{\mathcal {L}}}}_K^\text {IW}= \log \frac{1}{K} \sum _{k=1}^K w_k\).
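A minimal Julia sketch of these importance weighted quantities is given below; logw holds the log weights \(\log w_k\) and grads the K path-derivative gradients in (7), both of which are assumed to have been computed already (the names are illustrative).

```julia
using Statistics

# Doubly reparametrized gradient: combine the K path-derivative gradients using
# the squared normalized importance weights.
function dreg_combine(logw, grads)
    w = exp.(logw .- maximum(logw))       # stabilized unnormalized weights
    wt = w ./ sum(w)                      # normalized importance weights
    return sum(wt[k]^2 .* grads[k] for k in eachindex(grads))
end

# Unbiased estimate of the IWLB: log of the average importance weight (log-sum-exp form).
iwlb_estimate(logw) = maximum(logw) + log(mean(exp.(logw .- maximum(logw))))
```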

7 Experimental results

We apply CSGVA to GLMMs and state space models and compare the results with those of GVA in terms of computation time and accuracy of the posterior approximation. Lower bounds reported exclude constants which are independent of the model variables. We also illustrate how CSGVA can be improved using importance weighting (IW-CSGVA), considering \(K \in \{5, 20, 100\}\). We find that IW-CSGVA performs poorly if it is initialized in the standard manner using \(\lambda = 0\). This is because, when \(q_\lambda (\theta )\) is still far from optimal, a few of the importance weights tend to dominate with the rest close to zero, thus producing inaccurate estimates of the gradients. Hence, we initialize IW-CSGVA using the CSGVA fit, and the algorithm is terminated after a short run of 1000 iterations as IW-CSGVA is computationally intensive and improvements in the IWLB and variational approximation seem to be negligible thereafter. Posterior distributions estimated using MCMC via RStan are regarded as the ground truth. Code for all variational algorithms is written in Julia and is available as supplementary materials. All experiments are run on an Intel Core i9-9900K CPU @3.60 GHz with 16GB RAM. As the computation time of IW-CSGVA increases linearly with K, we also investigate the speedup that can be achieved through parallel computing on a machine with 8 cores. Julia retains one worker (or core) as the master process, and parallel computing is implemented using the remaining seven workers.

The parametrization of a hierarchical model plays a major role in the rate of convergence of both GVA and CSGVA. In some cases, it can even affect the ability of the algorithm to converge (to a local mode). We have attempted both the centered and noncentered parametrizations (Papaspiliopoulos et al. 2003, 2007), which are known to have a huge impact on the rate of convergence of MCMC algorithms. These two parametrizations are complementary, and neither is superior to the other. If an algorithm converges slowly under one parametrization, it often converges much faster under the other. Which parametrization works better usually depends on the nature of data. For the datasets that we use in the experiments, the centered parametrization was found to have better convergence properties than the noncentered parametrization for GLMMs while the noncentered parametrization is preferred for stochastic volatility models. These observations are further discussed below.

8 Generalized linear mixed models

Let \(y_i = (y_{i1}, \dots , y_{in_i})^T\) denote the vector of responses of length \(n_i\) for the ith subject for \( i=1,\dots ,n\), where \(y_{ij}\) is generated from some distribution in the exponential family. The mean \(\mu _{ij} = E(y_{ij})\) is connected to the linear predictor \(\eta _{ij}\) via

$$\begin{aligned} g(\mu _{ij}) = \eta _{ij} = X_{ij}^T\beta + Z_{ij}^Tb_i, \end{aligned}$$

for some smooth invertible link function \(g(\cdot )\). Here \(\beta \) is a \(p \times 1\) vector of fixed effects and \(b_i \sim N(0,\Lambda )\) is an \(L \times 1\) vector of random effects specific to the ith subject. \(X_{ij}\) and \(Z_{ij}\) are vectors of covariates of length p and L, respectively. Let \(\eta _i = [\eta _{i1}, \dots , \eta _{in_i}]^T\), \(X_i = [X_{i1}, \dots , X_{in_i}]^T\) and \(Z_i = [Z_{i1}, \dots , Z_{in_i}]^T\). We focus on the one-parameter exponential family with canonical links. This includes the Bernoulli model, \(y_{ij} \sim \) Bern(\(\mu _{ij}\)), with the logit link \(g(\mu _{ij}) = \log \{\mu _{ij}/(1-\mu _{ij})\}\) and the Poisson model, \(y_{ij} \sim \) Pois(\(\mu _{ij}\)), with the log link \(g(\mu _{ij}) = \log (\mu _{ij})\). Let \(WW^T\) be the unique Cholesky decomposition of \(\Lambda ^{-1}\), where W is a lower triangular matrix with positive diagonal entries. Define \(W^*\) such that \(W^*_{ii} = \log (W_{ii})\) and \(W^*_{ij} = W_{ij}\) if \(i \ne j\), and let \(\omega = v(W^*)\). We consider normal priors, \(\beta \sim N(0, \sigma _{\beta }^2I)\) and \(\omega \sim N(0, \sigma _\omega ^2 I)\), where \(\sigma _\beta ^2\) and \(\sigma _\omega ^2\) are set as 100.

The above parametrization of the GLMM is noncentered since \(b_i\) has mean 0. Alternatively, we can consider the centered parametrization proposed by Tan and Nott (2013). Suppose the covariates for the random effects are a subset of the covariates for the fixed effects and the first columns of \(X_i\) and \(Z_i\) are ones corresponding to an intercept and random intercept, respectively. Then we can write

$$\begin{aligned} \eta _i = X_i \beta + Z_i b_i = Z_i \beta _R + X_i^G \beta _G + Z_i b_i, \end{aligned}$$

where \(\beta = [\beta _R^T, \beta _G^T]^T\). We further split \(X_i^G\) into covariates which are subject specific (varies only with i and assumes the same value for \(j=1, \dots , n_i\)) and those which are not, and \(\beta _G = [ \beta _{G_1}^T, \beta _{G_2}^T]^T\) accordingly, where \(\beta _{G_1}\), \(\beta _{G_2}\) are vectors of length \(g_1\) and \(g_2\), respectively. Then

$$\begin{aligned} \begin{aligned} \eta _i&= Z_i \beta _R + \mathbf{1}_{n_i} {x_i^{G_1}}^T \beta _{G_1} + X_i^{G_2} \beta _{G_2} + Z_i b_i \\&= Z_i (C_i \beta _{RG_1} + b_i) + X_i^{G_2} \beta _{G_2}, \end{aligned} \end{aligned}$$

where

$$\begin{aligned} C_i =\begin{bmatrix} I_L &{} \begin{array}{l} {x_i^{G_1}}^T \\ 0_{L-1 \times g_1} \end{array} \end{bmatrix}\;\; \text {and}\;\;\beta _{RG_1}= \begin{bmatrix} \beta _R \\ \beta _{G_1} \end{bmatrix}. \end{aligned}$$

Let \({\tilde{b}}_i = C_i \beta _{RG_1}+ b_i\). The centered parametrization is represented as

$$\begin{aligned} \eta _i = Z_i {\tilde{b}}_i + X_i^{G_2} \beta _{G_2}, \quad {\tilde{b}}_i \sim N(C_i \beta _{RG_1}, \Lambda ) \end{aligned}$$
(14)

for \(i=1, \dots , n\).

Tan and Nott (2013) compare the centered, noncentered and partially noncentered parametrizations for GLMMs in the context of variational Bayes, showing that the choice of parametrization affects not only the rate of convergence, but also the accuracy of the variational approximation. For CSGVA, we observe that the accuracy of the variational approximation and the rate of convergence can also be greatly affected. Tan and Nott (2013) demonstrate that the centered parametrization is preferred when the observations are highly informative about the latent variables. In practice, a general guideline is to use the centered parametrization for Poisson models when observed counts are large and the noncentered parametrization when most counts are close to zero. For Bernoulli models, differences between using centered and noncentered parametrizations are usually minor. Here we use the centered parametrization in (14), which has been observed to yield gains in convergence rates for the datasets used for illustration.

The global parameters are \(\theta _G = (\beta ^T, \omega ^T)^T\) of dimension \(G = p + L(L+1)/2\), and the local variables are \(\theta _L = ({\tilde{b}}_1, \dots , {\tilde{b}}_n)^T\). The joint density is

$$\begin{aligned} p(y,\theta ) = p(\beta ) p(\omega ) \prod ^n_{i=1} \bigg \{ p({\tilde{b}}_i|\omega , \beta )\prod ^{n_i}_{j=1}p(y_{ij}|\beta ,{\tilde{b}}_i) \bigg \}. \end{aligned}$$

The log of the joint density is given by

$$\begin{aligned} \begin{aligned} \log p(y, \theta ) =&\sum \nolimits _{i=1}^n \{ y_i^T \eta _i - \mathbf{1}^T h(\eta _i) \\&- {({\tilde{b}}_i - C_i \beta _{RG_1})^T WW^T ({\tilde{b}}_i - C_i \beta _{RG_1})}/{2} \} \\&- \beta ^T \beta / (2\sigma _\beta ^2) - \omega ^T \omega / (2\sigma _\omega ^2) + n \log |W| + c, \end{aligned} \end{aligned}$$

where \(h(\cdot )\) is the log-partition function and c is a constant independent of \(\theta \). For the Poisson model with log link, \(h(x) = \exp (x)\). For the Bernoulli model with logit link, \(h(x) = \log \{ 1+\exp (x) \}\). The gradient \(\nabla _\theta \log p(y, \theta ) \) is given in “Appendix B”.
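As an illustration, \(\log p(y, \theta )\) above (up to the constant c) can be transcribed into Julia as follows, assuming \(\eta _i\) has already been formed from (14); h is exp for the Poisson model and \(x \mapsto \log \{1 + \exp (x)\}\) for the Bernoulli model, and all argument names are hypothetical.

```julia
using LinearAlgebra

function glmm_logjoint(y, eta, btilde, C, beta_RG1, beta, omega, W;
                       h = exp, sigma2_beta = 100.0, sigma2_omega = 100.0)
    n = length(y)
    lp = 0.0
    for i in 1:n
        r = btilde[i] - C[i] * beta_RG1
        lp += dot(y[i], eta[i]) - sum(h, eta[i]) - dot(r, W * (W' * r)) / 2
    end
    lp += -dot(beta, beta) / (2 * sigma2_beta) - dot(omega, omega) / (2 * sigma2_omega) +
          n * sum(log, diag(W))            # n log|W| for lower triangular W
    return lp
end
```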

For the GLMM, \(b_i\) and \(b_j\) are conditionally independent given \(\theta _G\) for \( i \ne j\) in \(p(\theta |y)\). Hence we impose the following sparsity structure on \(\Omega _2\) and \(C_2\),

$$\begin{aligned} \begin{aligned} \Omega _2 = \begin{bmatrix} \Omega _{2,11} &{} 0 &{} \dots &{} 0 \\ 0 &{} \Omega _{2,22} &{} \dots &{} 0 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} 0 &{} \dots &{} \Omega _{2, nn} \end{bmatrix}, \\ C_2^* = \left[ \begin{array}{cccc} C^*_{2,11} &{} 0 &{} \dots &{} 0 \\ 0 &{} C^*_{2, 22} &{} \dots &{} 0 \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} 0 &{} \dots &{} C^*_{2, nn} \end{array}\right] , \end{aligned} \end{aligned}$$

where each \(\Omega _{2,ii}\) is an \(L \times L\) block matrix and each \(C^*_{2,ii}\) is an \(L \times L\) lower triangular matrix. We set \(f_i = 0\) and \(F_{ij} =0\) for all \(i \in {\mathcal {I}}\) and all j, where \({\mathcal {I}}\) denotes the set of indices in \(v(C_2^*)\) which are fixed as zero. The number of nonzero elements in \(v(C_2^*)\) is \(nL(L+1)/2\). Hence the number of variational parameters to be optimized is reduced from \(nL(nL+1)/2\) to \(nL(L+1)/2\) for f and from \(nL(nL+1)G/2\) to \(nL(L+1)G/2\) for F.

8.1 Epilepsy data

In the epilepsy data (Thall and Vail 1990), \(n = 59\) patients are involved in a clinical trial to investigate the effects of the anti-epileptic drug Progabide. The patients are randomly assigned to receive either the drug (\(\hbox {Trt} = 1\)) or a placebo (\(\hbox {Trt} = 0\)). The response \(y_{ij}\) is the number of epileptic attacks patient i had during the jth of four follow-up periods of two weeks each. Covariates include the log of the age of the patients (Age), the log of 1/4 the baseline seizure count recorded over an eight-week period prior to the trial (Base) and Visit (coded as \(\hbox {Visit}_1 = -\,0.3\), \(\hbox {Visit}_2 = -\,0.1\), \(\hbox {Visit}_3 = 0.1\) and \(\hbox {Visit}_4 = 0.3\)). Note that Age is centered about its mean. Consider \(y_{ij} \sim \text {Pois}(\mu _{ij})\), where

$$\begin{aligned} \begin{aligned} \log \mu _{ij}&= \beta _0 + \beta _{\text {Base}}\text {Base}_i +\beta _{\text {Trt}}\text {Trt}_i +\beta _{\text {Age}}\text {Age}_i\\&\quad + \beta _{\text {Base}\times \text {Trt}}\text {Base}_i \times \text {Trt}_i + \beta _{\text {Visit}}\text {Visit}_{ij} \\&\quad + b_{i1} + b_{i2}\text {Visit}_{ij}, \end{aligned} \end{aligned}$$

for \(i = 1,\dots , 59, j = 1,\dots , 4\) (Breslow and Clayton 1993).
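To give a sense of the parsimony afforded by the sparsity constraints of Sect. 3.1 in this example, note that \(L = 2\), \(p = 6\) and \(n = 59\), so \(G = p + L(L+1)/2 = 9\) and \(nL = 118\); the vector f then has \(nL(L+1)/2 = 177\) free elements instead of \(nL(nL+1)/2 = 7021\), and F has \(177 \times 9 = 1593\) free elements instead of \(7021 \times 9 = 63189\).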

Table 1 shows the results obtained from applying the variational algorithms to this data. The lower bounds are estimated using 1000 simulations in each case, and the mean and standard deviation from these simulations are reported. CSGVA produced an improvement in the estimate of the lower bound (3139.2) as compared to GVA (3138.3) and maximizing the IWLB led to further improvements. Using a larger K of 20 or 100 resulted in minimal improvements compared with \(K=5\). As this dataset is small, parallel computing is slower than serial for a small K. This is because, even though the importance weights and gradients for K samples are computed in parallel, the cost of sending and fetching data from the workers at each iteration dwarfs the cost of computation when K is small. For \(K=100\), parallel computing reduces the computation time by about half.

Table 1 Epilepsy data. Number of iterations I (in thousands), runtimes (in s) and estimates of lower bound (SD in brackets) of the variational methods

The estimated marginal posterior distributions of the global parameters are shown in Fig. 1. The plots show that CSGVA (red) produces improved estimates of the posterior distribution as compared to GVA (blue), especially in estimating the posterior variance of the precision parameters \(\omega _2\) and \(\omega _3\). The posteriors estimated using IW-CSGVA for the different values of K are very close. Using just \(K=5\), we are able to obtain estimates that are virtually indistinguishable from those of MCMC.

Fig. 1 Epilepsy data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple (\(K=5\)), orange (\(K=20\)), green (\(K=100\)). (Color figure online)

8.2 Madras data

These data come from the Madras longitudinal schizophrenia study (Diggle et al. 2002) for detecting a psychiatric symptom called “thought disorder.” Monthly records showing whether the symptom is present in a patient are kept for \(n = 86\) patients over 12 months. The response \(y_{ij}\) is a binary indicator for presence of the symptom. Covariates include the time in months since initial hospitalization (t), gender of patient (Gender = 0 if male and 1 if female) and age of patient (Age = 0 if the patient is younger than 20 years and 1 otherwise). Consider \(y_{ij} \sim \text {Bern}(\mu _{ij})\) and

$$\begin{aligned} \text {logit}(\mu _{ij})&= \beta _0 +\beta _{\text {Age}}\text {Age}_i +\beta _{\text {Gender}}\text {Gender}_i + \beta _t t_{ij}\\&\quad + \beta _{\text {Age} \times t}\text {Age}_i \times t_{ij} + \beta _{\text {Gender} \times t}\text {Gender}_i \times t_{ij} + b_i, \end{aligned}$$

for \(i =1,\dots ,86, 1 \le j \le 12\).

The results in Table 2 are quite similar to those for the epilepsy data. CSGVA yields an improvement in the lower bound estimate compared to GVA, and IW-CSGVA provides further improvements, which are evident even for a K as small as 5. Parallel computing halved the computation time for \(K=100\) but yielded no benefit for \(K \in \{5,20\}\). From Fig. 2, CSGVA and IW-CSGVA are again able to capture the posterior variance of the precision parameter \(\omega _1\) better than GVA.

Table 2 Madras data. Number of iterations I (in thousands), runtimes (in s) and estimates of lower bound (SD in brackets) of the variational methods
Fig. 2 Madras data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple (\(K=5\)), orange (\(K=20\)), green (\(K=100\)). (Color figure online)

8.3 Six cities data

In the six cities data (Fitzmaurice and Laird 1993), \(n = 537\) children from Steubenville, Ohio, are involved in a longitudinal study to investigate the health effects of air pollution. Each child is examined yearly from age 7 to 10, and the response \(y_{ij}\) is a binary indicator for wheezing. There are two covariates, \(\text {Smoke}_i\) (a binary indicator for smoking status of the mother of child i) and \(\text {Age}_{ij}\) (age of child i at time point j, centered at 9 years). Consider \(y_{ij} \sim \text {Bern}(\mu _{ij})\), where

$$\begin{aligned} \begin{aligned} \text {logit}(\mu _{ij})&= \beta _0 +\beta _{\text {Smoke}}\text {Smoke}_i +\beta _{\text {Age}} \text { Age}_{ij} \\&\quad + \beta _{\text {Smoke} \times \text {Age}} \text { Smoke}_i \times \text {Age}_{ij} +b_{i}, \end{aligned} \end{aligned}$$

for \(i = 1,\dots , 537, j = 1,\dots , 4\).

Table 3 Six cities data. Number of iterations I (in thousands), runtimes (in s) and estimates of lower bound (SD in brackets) of the variational methods

From Table 3, CSGVA managed to achieve a higher lower bound than GVA in about half the runtime. As K increases, IW-CSGVA produces progressively tighter lower bounds for the log marginal likelihood. As in the previous two examples, parallel computing is beneficial only when \(K=100\), cutting the runtime by about half. From Fig. 3, there is slight overestimation of the posterior means of \(\beta _0\) and \(\omega _1\) by all the variational methods. However, CSGVA and IW-CSGVA capture the posterior variance of these parameters much better than GVA, especially for \(\omega _1\).

Fig. 3 Six cities data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple (\(K=5\)), orange (\(K=20\)), green (\(K=100\)). (Color figure online)

9 Application to state space models

Here we consider the stochastic volatility model (SVM), which is widely used for modeling financial time series. Each observation \(y_i\), for \(i=1, \dots , n\), is generated from a zero-mean Gaussian distribution whose variance evolves stochastically over time. The unobserved log volatility \(b_i\) is modeled using an autoregressive process of order one with Gaussian disturbances:

$$\begin{aligned} \begin{aligned}&y_i|b_i, \sigma , \kappa \sim N( 0, \mathrm{e}^{\sigma b_i + \kappa } ), \quad (i=1,\dots ,n)\\&b_i|\phi \sim N(\phi b_{i-1}, 1), \quad (i=2,\dots ,n) \\&b_1|\phi \sim N(0, 1/(1-\phi ^2) ), \end{aligned} \end{aligned}$$

where \(\kappa \in {\mathbb {R}}\), \(\sigma > 0\) and \(0< \phi < 1\). Here, \(y_i\) represents the mean-corrected return at time i and the states \(\{b_i\}\) come from a stationary process, with \(b_1\) drawn from the stationary distribution. The parametrization of the SVM above is noncentered; the centered parametrization can be obtained by replacing \(b_i\) by \(({\tilde{b}}_i - \kappa )/\sigma \). The performance of GVA and CSGVA is sensitive to the parametrization, both in terms of the rate of convergence and the local mode attained. For the data sets below, the noncentered parametrization was found to have better convergence properties. The sensitivity of the stochastic volatility model to parametrization when fitted using MCMC algorithms is well known in the literature. To “combine the best of different worlds,” Kastner and Frühwirth-Schnatter (2014) introduce a strategy that samples parameters from the centered and noncentered parametrizations alternately. Tan (2017) studies optimal partially noncentered parametrizations for Gaussian state space models fitted using EM, MCMC or variational algorithms.
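As a minimal illustration, the Python sketch below simulates data from this model under the noncentered parametrization; the parameter values are arbitrary placeholders chosen only to mimic a persistent volatility process.

```python
import numpy as np

def simulate_svm(n, phi, sigma, kappa, seed=0):
    """Simulate from the SVM above: b_1 ~ N(0, 1/(1 - phi^2)),
    b_i | b_{i-1} ~ N(phi * b_{i-1}, 1), y_i | b_i ~ N(0, exp(sigma*b_i + kappa))."""
    rng = np.random.default_rng(seed)
    b = np.empty(n)
    b[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - phi**2))
    for i in range(1, n):
        b[i] = phi * b[i - 1] + rng.normal()
    y = rng.normal(0.0, np.exp(0.5 * (sigma * b + kappa)))  # sd = exp((sigma*b+kappa)/2)
    return y, b

y, b = simulate_svm(n=945, phi=0.98, sigma=0.15, kappa=-1.0)  # placeholder values
```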

We use the following transformations to map constrained parameters to \({\mathbb {R}}\):

$$\begin{aligned} \alpha = \log (\exp (\sigma ) - 1), \quad \psi = \mathrm{logit}(\phi ). \end{aligned}$$

This transformation for \(\alpha \) works better than \(\alpha = \log (\sigma )\), which more often leads to erratic fluctuations in the lower bound and convergence issues. The global variables are \(\theta _G = [\alpha , \kappa , \psi ]^T\) of dimension \(G = 3\) and the local variables are \(\theta _L = [b_1, \dots , b_n]^T\) of length n. We assume normal priors for the global parameters, \(\alpha \sim N (0, \sigma _{\alpha }^2)\), \(\kappa \sim N (0, \sigma _{\kappa }^2)\) and \(\psi \sim N (0, \sigma _{\psi }^2)\), with \(\sigma _\alpha ^2 = \sigma _\kappa ^2 = \sigma _\psi ^2 = 10\). The joint density can be written as

$$\begin{aligned} \begin{aligned} p(y,\theta ) =&p(\alpha )p(\kappa )p(\psi )p(b_1|\psi ) \bigg \{ \prod _{i=2}^{n}p(b_i|b_{i-1}, \psi ) \bigg \} \\&\times \bigg \{ \prod ^n_{i=1}p(y_i|b_i, \alpha , \kappa ) \bigg \} . \end{aligned} \end{aligned}$$

The log joint density is

$$\begin{aligned} \begin{aligned} \log p(y, \theta )&= - \frac{n\kappa }{2} - \frac{\sigma }{2} \sum _{i=1}^n b_i - \frac{1}{2}\sum ^n_{i=1} y_i^2 \mathrm{e}^{-\sigma b_i - \kappa } \\&\quad - \frac{1}{2}\sum _{i=2}^{n} (b_i - \phi b_{i-1})^2 - \frac{1}{2} b_1^2 (1-\phi ^2) \\&\quad + \frac{1}{2}\log (1-\phi ^2) - \frac{\alpha ^2}{2\sigma _{\alpha }^2} - \frac{\kappa ^2}{2\sigma _{\kappa }^2} - \frac{\psi ^2}{2\sigma _{\psi }^2} + c, \end{aligned} \end{aligned}$$

where \(\phi = \exp (\psi )/\{1+\exp (\psi )\}\), \(\sigma = \log (\exp (\alpha )+1)\) and c is a constant independent of \(\theta \). The gradient \(\nabla _\theta \log p(y, \theta )\) is given in “Appendix C”. For this model, \(b_i\) and \(b_j\) are conditionally independent given \(\theta _G\) in the posterior whenever \( |i - j| >1\). Thus, the sparsity structure imposed on \(\Omega _2\) and \(C_2\) is

$$\begin{aligned} \Omega _2 = \begin{bmatrix} \Omega _{2,11} & \Omega _{2,12} & 0 & \dots & 0 \\ \Omega _{2,21} & \Omega _{2,22} & \Omega _{2,23} & \dots & 0 \\ 0 & \Omega _{2,32} & \Omega _{2,33} & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & \Omega _{2,nn} \end{bmatrix}, \quad C_2 = \begin{bmatrix} C_{2,11} & 0 & 0 & \dots & 0 \\ C_{2,21} & C_{2,22} & 0 & \dots & 0 \\ 0 & C_{2,32} & C_{2,33} & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & C_{2,nn} \end{bmatrix}. \end{aligned}$$

The number of nonzero elements in \(v(C_2^*)\) is \(2n-1\). Setting \(f_i = 0\) and \(F_{ij} =0\) for all \(i \in {\mathcal {I}}\) and all j, where \({\mathcal {I}}\) denotes the set of indices in \(v(C_2^*)\) which are fixed at zero, the number of variational parameters to be optimized is reduced from \(n(n+1)/2\) to \(2n-1\) for f, and from \(n(n+1)G/2\) to \((2n-1)G\) for F.
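As a check on the expressions above, the Python sketch below evaluates \(\log p(y,\theta )\) up to the constant c, applying the transformations \(\sigma = \log (1+\exp (\alpha ))\) and \(\phi = \exp (\psi )/\{1+\exp (\psi )\}\); stacking \(\theta _G\) before \(\theta _L\) is an assumption made here for illustration.

```python
import numpy as np

def log_joint_svm(theta, y, s2_alpha=10.0, s2_kappa=10.0, s2_psi=10.0):
    """log p(y, theta) for the SVM, up to the additive constant c.
    theta = [alpha, kappa, psi, b_1, ..., b_n] (global parameters first)."""
    n = len(y)
    alpha, kappa, psi, b = theta[0], theta[1], theta[2], theta[3:]
    sigma = np.logaddexp(0.0, alpha)          # sigma = log(1 + exp(alpha))
    phi = 1.0 / (1.0 + np.exp(-psi))          # phi = logistic(psi)
    loglik = (-0.5 * n * kappa - 0.5 * sigma * b.sum()
              - 0.5 * np.sum(y**2 * np.exp(-sigma * b - kappa)))
    log_pb = (-0.5 * np.sum((b[1:] - phi * b[:-1])**2)
              - 0.5 * b[0]**2 * (1.0 - phi**2) + 0.5 * np.log(1.0 - phi**2))
    log_pg = -(alpha**2 / (2 * s2_alpha) + kappa**2 / (2 * s2_kappa)
               + psi**2 / (2 * s2_psi))
    return loglik + log_pb + log_pg
```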

9.1 GBP/USD exchange rate data

These data contain 946 daily observations of the exchange rates of the US Dollar (USD) against the British Pound (GBP), recorded from 1 October 1981 to 28 June 1985. They are available as the “Garch” dataset in the R package Ecdat. The mean-corrected responses \(\{y_t|t=1, \dots , n\}\) are computed from the exchange rates \(\{r_t|t=0, \dots , n\}\) as

$$\begin{aligned} y_t = 100 \left\{ \log \left( \frac{r_t}{r_{t-1}}\right) - \frac{1}{n}\sum ^n_{i=1}\log \left( \frac{r_i}{r_{i-1}}\right) \right\} , \end{aligned}$$

where \(n=945\).
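A minimal Python sketch of this computation is given below; reading the raw exchange rates from the “Garch” dataset in Ecdat is assumed to be done separately (e.g. exported from R).

```python
import numpy as np

def mean_corrected_returns(rates):
    """y_t = 100 * {log(r_t / r_{t-1}) - mean of the log returns},
    for exchange rates r_0, ..., r_n (here n = 945)."""
    log_ret = np.diff(np.log(np.asarray(rates, dtype=float)))
    return 100.0 * (log_ret - log_ret.mean())

# rates: array of 946 exchange rates exported from the Ecdat "Garch" dataset,
# giving n = 945 mean-corrected responses y_1, ..., y_n
```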

For these data, CSGVA failed to achieve a higher lower bound than GVA when it was initialized using \(\lambda =0\). Hence we initialize CSGVA using the fit from GVA instead, via the association discussed in Sect. 5. With this informative starting point, CSGVA converged in 16,000 iterations and improved upon the GVA fit, attaining a higher lower bound. IW-CSGVA led to further improvements with increasing K. As this dataset contains a large number of observations, with correspondingly more variational parameters to be optimized, the computation is more intensive and parallel computing is very useful here, reducing the computation time by factors of 1.8, 2.9 and 4.2 for \(K =5\), 20 and 100, respectively (Table 4).

Table 4 GBP data. Number of iterations I (in thousands), runtimes (in s) and estimates of lower bound (SD in brackets) of the variational methods

Figure 4 shows the estimated marginal posteriors of the global parameters. CSGVA provides significant improvements in estimating the posterior variance of \(\alpha \) and \(\psi \) as compared to GVA. With the application of IW-CSGVA, we are able to obtain posterior estimates that are quite close to those of MCMC even with a small K.

Fig. 4 GBP data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple (\(K=5\)), orange (\(K=20\)), green (\(K=100\)). (Color figure online)

Figure 5 shows the estimated marginal posteriors of the latent states \(\{b_i\}\), summarized using the mean (solid line) and one standard deviation from the mean (dotted line) estimated by MCMC (black) and IW-CSGVA (\(K=5\), purple). For IW-CSGVA, the means and standard deviations are estimated from 2000 samples obtained by generating \(\theta _G\) from \(q(\theta _G)\) followed by \(\theta _L\) from \(q(\theta _L|\theta _G)\). For MCMC, estimation was based on 5000 samples. IW-CSGVA estimated the means quite accurately (with a little overestimation), but the standard deviations are underestimated compared to MCMC. The boxplots show the differences between the means and standard deviations estimated by MCMC and those estimated by IW-CSGVA \((K=5)\) and GVA. We can see that IW-CSGVA estimated both the means and standard deviations more accurately than GVA.
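The summaries for IW-CSGVA can be computed as in the Python sketch below, where sample_q_global and sample_q_local_given_global are hypothetical stand-ins for routines drawing from \(q(\theta _G)\) and \(q(\theta _L|\theta _G)\).

```python
import numpy as np

def latent_state_summaries(sample_q_global, sample_q_local_given_global,
                           n_draws=2000):
    """Monte Carlo estimates of the posterior means and standard deviations of
    the latent states under the variational approximation: draw theta_G from
    q(theta_G), then theta_L from q(theta_L | theta_G), and summarize."""
    draws = np.array([sample_q_local_given_global(sample_q_global())
                      for _ in range(n_draws)])        # shape (n_draws, n)
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)

# The boxplots then display mean_MCMC - mean_VA and sd_MCMC - sd_VA
# elementwise over the n latent states.
```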

Fig. 5 GBP data. Top: Posterior means (solid line) of the latent states and one standard deviation from the mean (dotted line) estimated using MCMC (black) and IW-CSGVA (\(K=5\), purple). Bottom: Boxplots of \(\text {mean}_{\text {MCMC}} - \text {mean}_{\text {VA}}\) and \(\text {sd}_{\text {MCMC}} - \text {sd}_{\text {VA}}\). (Color figure online)

9.2 New York stock exchange data

Next we consider 2000 observations of the returns of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991. These data are available as the dataset “nyse” from the R package astsa. We consider 100 times the mean-corrected returns as responses \(\{y_i\}\).

From Table 5, CSGVA was able to attain a higher lower bound than GVA when initialized in the standard manner using \(\lambda =0\). Applying IW-CSGVA led to further improvements as K increased. For this larger data set, parallel computing yielded significant reductions in computation time, by factors of 2.9, 4.5 and 5.5 for \(K=5\), 20 and 100, respectively.

Table 5 NYSE data. Number of iterations I (in thousands), runtimes (in s) and estimates of lower bound (SD in brackets) of the variational methods

Figure 6 shows that the marginal posteriors estimated using CSGVA are quite close to those of MCMC, while GVA underestimates the posterior variance of \(\alpha \) and \(\psi \) quite severely. The posteriors estimated by IW-CSGVA are virtually indistinguishable from those of MCMC.

Fig. 6 NYSE data. Marginal posterior distributions of global parameters. Black (MCMC), blue (GVA), red (CSGVA), purple (\(K=5\)), orange (\(K=20\)), green (\(K=100\)). (Color figure online)

From Fig. 7, we can see that the marginal posteriors of the latent states are also estimated very well by IW-CSGVA \((K=5)\), and, unlike in the previous example, there is no systematic underestimation of the posterior variance. GVA captures the posterior means very well but does not perform as well as IW-CSGVA in estimating the posterior variance.

Fig. 7 NYSE data. Top: Posterior means (solid line) of the latent states and one standard deviation from the mean (dotted line) estimated using MCMC (black) and IW-CSGVA (\(K=5\), purple). Bottom: Boxplots of \(\text {mean}_{\text {MCMC}} - \text {mean}_{\text {VA}}\) and \(\text {sd}_{\text {MCMC}} - \text {sd}_{\text {VA}}\). (Color figure online)

10 Conclusion

In this article, we have proposed a Gaussian variational approximation for hierarchical models that adopts a conditional structure \(q(\theta ) = q(\theta _G) q(\theta _L|\theta _G)\). The dependence of the local variables \(\theta _L\) on the global variables \(\theta _G\) is then captured using a linear approximation. This structure is very useful when there are global scale parameters in \(\theta _G\) which help to determine the scale of the local variables in the conditional posterior of \(\theta _L|\theta _G\). We further demonstrate how CSGVA can be improved by maximizing the importance weighted lower bound. In our experiments, using a K as small as 5 can lead to significant improvements in the variational approximation with only a short run. Moreover, for large datasets, computation time can be further reduced through parallel computing. Our experiments indicate that CSGVA coupled with importance weighting is particularly useful in improving the estimation of the posterior variance of the precision parameters \(\omega \) in GLMMs, and of the persistence \(\phi \) and volatility \(\sigma \) of the log-variance in SVMs.