1 Introduction

Sparse Principal Component Analysis (SPCA), a contemporary variant of PCA, has become a valuable tool for reducing the dimension of high-dimensional data. Its applications span various fields, such as chemistry, where it aids in identifying crucial chemical components from spectra (Varmuza and Filzmoser 2009); genetics, where it helps discover significant genes and pathways (Li et al. 2017); and macroeconomics, where it plays a role in selecting dominant macro variables that earn substantial risk premiums (Rapach and Zhou 2019). The success of SPCA can be attributed to two main factors. First, in typical high-dimensional datasets, the number of input variables p exceeds the number of observations n. This setting poses challenges for traditional PCA, as the leading eigenvector is inconsistently estimated when p/n does not converge to 0 (Paul 2007; Johnstone and Lu 2009); SPCA mitigates this issue. Second, the principal components derived from SPCA are linear combinations of only a few important variables, which makes them highly interpretable in practical applications and allows researchers and analysts to glean meaningful insights from complex datasets.

Several SPCA algorithms have been proposed; interested readers can refer to Zou and Xue (2018) for a comprehensive literature review. However, the algorithms discussed in that review do not include Bayesian methods. Recently, two Bayesian SPCA approaches, introduced by Gao and Zhou (2015) and Xie et al. (2022), have emerged and demonstrated impressive advantages. Both approaches adopt the spiked covariance model, which can be written as a linear regression model in which the loadings matrix serves as the coefficient and the design matrix is a random matrix with independent standard multivariate normal rows. A significant challenge in Bayesian SPCA lies in placing a prior on the coefficients while enforcing the orthogonality constraint, which requires the columns of the loadings matrix to be mutually orthogonal; this constraint needs to be incorporated through the prior distribution. Gao and Zhou (2015) tackled this challenge by constructing a prior that projects the nonzero coordinates onto a subspace spanned by a collection of mutually orthogonal unit vectors. However, their posterior becomes intractable and challenging to compute when the rank is greater than one. Xie et al. (2022) adopted a different approach, reparametrizing the likelihood by multiplying the loadings matrix with an orthogonal matrix to remove the orthogonality constraint. Their prior involves only Laplace spike and slab densities, whereas our density (see Eq. (3) in Sect. 2.2) is considerably more general; in addition, our prior can introduce dependence between coordinates, while theirs requires prior independence.

In this paper, we present a novel prior for the coefficient of the spiked covariance model. In our approach, we apply a regular spike and slab prior on the parameter, which is the product of the loadings matrix (the coefficient) and an orthogonal matrix. The orthogonal matrix is then a latent variable in the prior. By marginalizing this joint density, we derive the prior of the coefficient. The spike and slab prior is a mixture of a continuous density and a Dirac measure centered at 0. By introducing an appropriate prior on the mixture weight, one can effectively impose sparsity on the coefficient. The spike and slab prior is widely recognized as one of the most prominent priors for Bayesian high-dimensional analysis and has received extensive study. Excellent works in this area include those by Johnstone and Silverman (2004), Ročková and George (2018), Castillo and Szabó (2020), Castillo and van der Vaart (2012), Castillo et al. (2015), Martin et al. (2017), Qiu et al. (2018), Jammalamadaka et al. (2019), Qiu et al. (2020), Ning (2023b), Jeong and Ghosal (2020), Ohn et al. (2023), Ning (2023a), and Ning et al. (2020). For a comprehensive overview of this topic, readers can refer to the review paper by Banerjee et al. (2021). It is important to note that in our spike and slab formulation, the slab density, which incorporates the latent variable, differs from that in traditional linear regression models. This distinction contributes to the uniqueness and effectiveness of our proposed approach.

We employ a variational approach to compute the posterior, a method that minimizes a chosen distance or divergence (e.g., the Kullback-Leibler divergence) between the posterior distribution and a probability measure belonging to a preselected, rich, and analytically tractable class of distributions. This approach offers faster computation than sampling methods such as the Markov chain Monte Carlo algorithm. Among variational approaches, the coordinate ascent variational inference (CAVI) method stands out as the most popular algorithm (Blei et al. 2017). Several CAVI methods have been developed for sparse linear regression models with the spike and slab prior (or the subset selection prior), such as Carbonetto and Stephens (2012), Huang et al. (2016), Ray and Szabo (2022), and Yang et al. (2020), and for sparse factor analysis, such as Ghahramani and Beal (1999), Wang et al. (2020), and Hansen et al. (2023). Researchers have also studied theoretical properties of the variational posterior, such as the posterior contraction rate, as examined by Ray and Szabo (2022) and Yang et al. (2020). While variational Bayesian methods for SPCA have been developed by Guan and Dy (2009) and Bouveyron et al. (2018), they did not provide a theoretical justification for their posteriors. Moreover, the prior used by Guan and Dy (2009), which involves the Laplace distribution, and the prior of Bouveyron et al. (2018), which is similar to a spike and slab prior with a fixed mixture weight, are known not to yield the optimal (or near-optimal) posterior contraction rate.

In this paper, we show that the contraction rates of both the posterior and the variational posterior are nearly optimal. To the best of our knowledge, this is the first such result for a variational Bayesian method applied to SPCA. Additionally, we develop an EM algorithm tailored for SPCA, which yields the maximum a posteriori estimator. The EM algorithm for Bayesian variable selection has been extensively studied for the sparse linear regression model by Ročková and George (2014, 2018). Similar algorithms have been developed for other high-dimensional models, such as the dynamic time series model (Ning et al. 2019) and the sparse factor model (Ročková and George 2016). For our EM algorithm to accommodate SPCA, we replace the Dirac measure in the spike and slab prior with a continuous density, resulting in the continuous spike and slab prior. Both the variational approach and the EM algorithm employ parameter expansion on the likelihood function; consequently, these algorithms are referred to as the PX-CAVI and PX-EM algorithms, respectively, where PX stands for parameter expansion. The parameter expansion approach was initially proposed by Liu et al. (1998) and has proven effective in accelerating the convergence speed of the EM algorithm. Additionally, we found that by choosing the expanded parameter to be an orthogonal matrix, we can circumvent the need to handle the orthogonality constraint on the loadings matrix directly. This allows us to first solve for the unconstrained matrix and subsequently apply the singular value decomposition (SVD) to obtain an estimate of the loadings matrix, which streamlines the computation and enhances the efficiency of our algorithms.

The remainder of this paper is structured as follows: Sect. 2 presents the model and the prior used in this study. Section 3 introduces the variational approach and outlines the development of the PX-CAVI algorithm. In Sect. 4, we delve into the theoretical properties of both the posterior and the variational posterior. Section 5 presents the PX-EM algorithm. To evaluate the performance of our algorithms, we conduct simulation studies in Sect. 6. Furthermore, in Sect. 7, we analyze a lung cancer gene expression dataset to illustrate the application of our approach in a real-world scenario. The appendix contains proofs of the equations presented in Sect. 3. Proofs of the theorems discussed in Sect. 4, as well as the batch PX-CAVI algorithm, which does not rely on the joint row-sparsity assumption, are provided in the supplementary material. For readers interested in implementing our algorithms, we have made the \(\textsf{VBsparsePCA}\) package available on the Comprehensive R Archive Network (CRAN); this package includes both the PX-CAVI algorithm and the batch PX-CAVI algorithm.

2 Model and priors

In this section, we begin by introducing the spiked covariance model, followed by the spike and slab prior that we apply.

2.1 The spiked covariance model

Consider the spiked covariance model

$$\begin{aligned} X_i = \theta w_i + \sigma \epsilon _i, \qquad w_i \sim \mathcal {N}(0, I_r), \qquad \epsilon _i \sim \mathcal {N}(0, I_p), \qquad i = 1, \dots , n, \end{aligned}$$
(1)

where \(X_i\) is a p-dimensional vector, \(\theta \) is a \(p \times r\)-dimensional loadings matrix, \(w_i\) is an r-dimensional vector, \(\epsilon _i\) is a p-dimensional vector independent of \(w_i\), and r is the rank. We denote \(\theta _{\cdot k}\) as the k-th column of \(\theta \). The orthogonality constraint on \(\theta \) requires that \(\langle \theta _{\cdot k}, \theta _{\cdot k'} \rangle = 0\) for any \(k \ne k'\), \(k, k' \in \{1, \dots , r\}\). The model is equivalent to \(X_i \sim \mathcal {N}(0, \Sigma )\), where \(\Sigma = \theta \theta ' + \sigma ^2 I_p\). Let \(\theta = U\Lambda ^{1/2}\), where \(U = \left( {\theta _{\cdot 1}}/{\Vert \theta _{\cdot 1}\Vert _2}, \dots , {\theta _{\cdot r}}/{\Vert \theta _{\cdot r}\Vert _2}\right) \) is a \(p \times r\) matrix containing the first r eigenvectors and \(\Lambda = \text {diag}(\Vert \theta _{\cdot 1}\Vert _2^2, \dots , \Vert \theta _{\cdot r}\Vert _2^2)\) is an \(r \times r\) diagonal matrix. Then, \(\Sigma = U\Lambda U' + \sigma ^2 I_p\). One can easily check that the k-th eigenvalue of \(\Sigma \) is \(\Vert \theta _{\cdot k}\Vert _2^2 + \sigma ^2\) if \(k \le r\) and is \(\sigma ^2\) if \(k > r\). We assume \(p \gg n\) (i.e., \(n/p \rightarrow 0\)) and \(\theta \) is jointly row-sparse—that is, within the same row, the coordinates are either all zero or all non-zero. We refer to the rows containing non-zero entries as “non-zero rows” and the remaining rows as “zero rows.” With this assumption, the support of each column of \(\theta \) is the same and is denoted by \(S = \left\{ j \in \{1, \dots , p\}: \ \theta _{j}' \ne 0_r \right\} \), where \(0_r\) represents the r-dimensional zero vector. The row-sparsity assumption is convenient for practitioners, as the principal subspace is generated by the same sparse set of features; it also lets us present the main ideas with more concise notation, since the support is the same across all principal components. A more general assumption that allows the support to vary across principal components is covered in the supplementary material. Our \(\textsf{R}\) package \(\textsf{VBsparsePCA}\) can handle both assumptions.
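As a small illustration of model (1) and the joint row-sparsity assumption, the following base \(\textsf{R}\) sketch simulates one dataset; the values of \(n\), \(p\), \(s^\star \), \(r^\star \), and the eigenvalues are illustrative choices mirroring the simulation settings of Sect. 6, not prescriptions.

```r
set.seed(1)
n <- 200; p <- 1000; s_star <- 20; r_star <- 2; sigma <- sqrt(0.1)
# orthonormal columns on the support S = {1, ..., s_star}; zero rows elsewhere
U_S <- qr.Q(qr(matrix(rnorm(s_star * r_star), s_star, r_star)))
U <- matrix(0, p, r_star); U[1:s_star, ] <- U_S
Lambda_half <- diag(sqrt(seq(20, 10, length.out = r_star)), nrow = r_star)
theta <- U %*% Lambda_half                  # p x r loadings with orthogonal columns
W <- matrix(rnorm(n * r_star), n, r_star)   # rows are w_i ~ N(0, I_r)
E <- matrix(rnorm(n * p), n, p)             # rows are eps_i ~ N(0, I_p)
X <- W %*% t(theta) + sigma * E             # rows are X_i = theta w_i + sigma eps_i
```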

2.2 The spike and slab prior

We introduce our spike and slab prior, which is

$$\begin{aligned} \pi (\theta , {{\,\mathrm {\varvec{\gamma }}\,}}|\lambda _1, r)&\propto \prod _{j=1}^p \left[ \gamma _j \int _{A \in V_{r,r}} g(\theta _j|\lambda _1, A, r) \pi (A) d A\right. \nonumber \\&\quad \left. + (1-\gamma _j) \delta _0(\theta _j) \right] , \end{aligned}$$
(2)

where \(V_{r,r} = \{ A \in \mathbb {R}^{r \times r}: A'A = I_r \}\) is the Stiefel manifold of r-frames in \(\mathbb {R}^r\) and \(\delta _0\) is the Dirac measure at zero. The idea behind the prior (2) is that \(\beta = \theta A\), with A an orthogonal matrix, is free of the orthogonality constraint, so we first place the regular spike and slab prior on \(\beta \), which can be viewed as specifying the joint distribution of \(\theta \) and A. We then obtain the prior of \(\theta \) by marginalizing A out of this joint distribution. Because of the latent variable A, this prior differs from those used in sparse linear regression models. We consider a general expression for the density g, which is

$$\begin{aligned} g(\theta _j|\lambda _1, A, r) = [C(\lambda _1)]^r \exp (-\lambda _1 \Vert \beta _j\Vert _q^m), \end{aligned}$$
(3)

where \(1 \le q \le 2\), \(m \in \{1, 2\}\), and \([C(\lambda _1)]^r\) is the normalizing constant. This expression includes three common distributions as special cases. If \(q = 1\) and \(m = 1\), with \(C(\lambda _1) = \lambda _1/2\), then g is a product of r independent Laplace densities. If \(q = 2\) and \(m = 2\), with \(C(\lambda _1) = \sqrt{\lambda _1/(2\pi )}\), then it is the multivariate normal density. If \(q = 2\) and \(m = 1\), with \(C(\lambda _1) = \lambda _1/a_r\), where

$$\begin{aligned} a_r = \sqrt{\pi }\left( \Gamma (r+1)/\Gamma (r/2+1)\right) ^{1/r} > 2, \end{aligned}$$

then it is the density part of the prior introduced by Ning et al. (2020) for group sparsity. The priors for the remaining parameters are given as follows: \(\pi (A) \propto 1\) and for each j,

$$\begin{aligned}&\gamma _j| \kappa \sim \text {Bernoulli}(\kappa ), \qquad \kappa \sim \text {Beta}(\alpha _1, \alpha _2). \end{aligned}$$
(4)

If \(\sigma ^2\) and r are unknown, we let \(\sigma ^2 \sim \text {InverseGamma}(\sigma _a, \sigma _b)\) and \(r \sim \text {Poisson}(\varkappa )\). Assuming r is fixed, the joint posterior distribution of \((\theta , {{\,\mathrm{\varvec{\gamma }}\,}}, \sigma ^2)\) is

$$\begin{aligned} \pi (\theta , {{\,\mathrm {\varvec{\gamma }}\,}}, \sigma ^2|X)&\propto \prod _{i=1}^n f(X_i|\theta , \sigma ^2, r) \prod _{j=1}^p \pi (\theta _j |\gamma _j, r)\nonumber \\&\quad \times \left( \int \pi (\gamma _j|\kappa )\pi (\kappa )d\kappa \right) \pi (\sigma ^2), \end{aligned}$$
(5)

where \(X=(X_1,\ldots ,X_n)\) with each \(X_i\) being a p-dimensional vector.
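To illustrate the three special cases of the slab density g in (3), the following base \(\textsf{R}\) function (a small sketch; the function name is ours and not part of \(\textsf{VBsparsePCA}\)) evaluates \(\log g\) at a row \(\beta _j = \theta _j A\):

```r
log_slab <- function(beta_j, lambda1, q, m) {
  r <- length(beta_j)
  if (q == 1 && m == 1) {            # product of r independent Laplace densities
    r * log(lambda1 / 2) - lambda1 * sum(abs(beta_j))
  } else if (q == 2 && m == 2) {     # multivariate normal density
    r * log(sqrt(lambda1 / (2 * pi))) - lambda1 * sum(beta_j^2)
  } else if (q == 2 && m == 1) {     # group-sparsity density of Ning et al. (2020)
    a_r <- sqrt(pi) * (gamma(r + 1) / gamma(r / 2 + 1))^(1 / r)
    r * log(lambda1 / a_r) - lambda1 * sqrt(sum(beta_j^2))
  } else stop("unsupported (q, m) combination")
}
```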

3 Variational inference

In this section, we propose a variational approach for SPCA using the posterior (5). We introduce a mean-field variational class to obtain the variational posterior, and then develop the PX-CAVI algorithm to efficiently compute it.

3.1 The variational posterior and the evidence lower bound

To obtain the variational posterior, we adopt the mean-field variational approximation, which restricts the approximating distribution to a product of independent components. The variational class is defined as follows:

$$\begin{aligned} \begin{aligned}&\mathcal {P}^{\text{ MF }} = \Bigg \{ P(\theta ){:}{=} \prod _{j=1}^p \Big [ z_j \mathcal {N}(\mu _j, \sigma ^2 M_j) + (1 - z_j) \delta _0\Big ],\;\\ {}&\hspace{3cm} \mu _j \in \mathbb {R}^{r},\; \langle \mu _{\cdot k}, \mu _{\cdot k'} \rangle = 0, \;\forall k \ne k', \\ {}&\hspace{5cm} M_j \in \mathbb {M}^{r \times r},\; z_j \in [0, 1] \Bigg \}, \end{aligned} \end{aligned}$$
(6)

where \(\mathbb {M}^{r \times r}\) stands for the space of \(r \times r\) positive definite matrices. Any \(P(\theta ) \in \mathcal {P}^{{{\,\mathrm{\text {MF}}\,}}}\) is a product of p independent densities, each of which is a mixture of two distributions—a multivariate normal (or a normal density when \(r = 1\)) and the Dirac measure at zero. The mixture weight \(z_j\) is the corresponding inclusion probability. The variational posterior is obtained by minimizing the Kullback-Leibler divergence between \(P(\theta )\) and the posterior over all \(P(\theta ) \in \mathcal {P}^{\text {MF}}\), i.e.,

$$\begin{aligned} {\widehat{P}}(\theta ) = {{\,\mathrm{arg\,min}\,}}_{P(\theta ) \in \mathcal {P}^{\text {MF}}} KL\left( P(\theta ), \pi (\theta |X)\right) , \end{aligned}$$
(7)

which can also be written as

$$\begin{aligned} {\widehat{P}}(\theta )&= {{\,\mathrm {arg\,min}\,}}_{P(\theta ) \in \mathcal {P}^{\text{ MF }}} \Big (\mathbb {E}_P \log P(\theta ) - \mathbb {E}_P \log \pi (\theta |X)\Big ) \nonumber \\ {}&= {{\,\mathrm {arg\,min}\,}}_{P(\theta ) \in \mathcal {P}^{\text{ MF }}} \Big ( \mathbb {E}_P \log P(\theta ) - \mathbb {E}_P \log \pi (\theta , X) \nonumber \\&\qquad + \log \pi (X) \Big ). \end{aligned}$$
(8)

As the expression of \(\log \pi (X)\) in (8) is intractable, we define the evidence lower bound (ELBO), a lower bound on \(\log \pi (X)\), as follows:

$$\begin{aligned} {{\,\mathrm{\text {ELBO}}\,}}(\theta )&= \mathbb {E}_P \log \pi (\theta , X) - \mathbb {E}_P \log P(\theta ). \end{aligned}$$
(9)

We then solve \({\widehat{P}}(\theta ) = {{\,\mathrm{arg\,max}\,}}_{P(\theta ) \in \mathcal {P}^{\text {MF}}}{{\,\mathrm{\text {ELBO}}\,}}(\theta )\). Since

$$\begin{aligned} P(\theta ) = \prod _{j=1}^p P(\theta _j)\quad \text {and} \quad \pi (\theta , X) = \prod _{j=1}^p \pi (\theta _j, X), \end{aligned}$$

the ELBO can also be written as follows:

$$\begin{aligned} {{\,\mathrm{\text {ELBO}}\,}}(\theta ) = \sum _{j=1}^p \Big ( \mathbb {E}_P \log \pi (\theta _j, X) - \mathbb {E}_P \log P(\theta _j) \Big ). \end{aligned}$$

From the last display, we can solve each \({\widehat{P}}(\theta _j)\) independently and then obtain the variational posterior from \({\widehat{P}}(\theta ) = \prod _{j=1}^p {\widehat{P}}(\theta _j)\).

3.2 The PX-CAVI algorithm

The PX-CAVI algorithm is an iterative method that, in each iteration, optimizes each of the unknown quantities while holding the others fixed. Our algorithm incorporates two key differences from the conventional CAVI algorithm. First, we include an expectation step, similar to that used in the EM algorithm, since \(w = (w_1, \dots , w_n)\) is a random variable. Second, we apply parameter expansion to the likelihood, which enables us to handle the orthogonality constraint and accelerate the convergence of the algorithm. We now provide a step-by-step derivation of the PX-CAVI algorithm, writing \(M= (M_1, \dots , M_p)\) and \(z = (z_1, \dots , z_p)\).

1. E-step

In this step, we treat w as a latent variable and work with the joint density \(\pi (\theta , w, X)\). Let \(\Theta ^{(t)}\) be the estimated value of \(\Theta = (\mu , M, z)\) from the t-th iteration. We obtain

$$\begin{aligned}{} & {} {\widehat{P}}(w_i|\Theta ^{(t)}) =\mathcal {N}({{\widetilde{\omega }}}_i, {\widetilde{V}}_w), \nonumber \\{} & {} {{\widetilde{\omega }}}_i = \frac{1}{\sigma ^{2}} {\widetilde{V}}_w \sum \nolimits _{j=1}^p z_j^{(t)} \big [\mu _j^{(t)}\big ]' X_{ij}, \nonumber \\{} & {} {\widetilde{V}}_w = \left( \frac{1}{\sigma ^{2}} \sum \nolimits _{j=1}^p z_j^{(t)} \left( \big [\mu _j^{(t)}\big ]' {\mu _j^{(t)}} + \sigma ^2M_j^{(t)} \right) + I_r \right) ^{-1}.\nonumber \\ \end{aligned}$$
(10)
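A compact way to compute the quantities in (10) is sketched below in base \(\textsf{R}\), assuming \(\mu \) is stored as a \(p \times r\) matrix whose rows are \(\mu _j\), M as a list of \(r \times r\) matrices, and z as a length-p vector; the function name is ours, not that of the \(\textsf{VBsparsePCA}\) package.

```r
e_step <- function(X, mu, Mlist, z, sigma2) {
  n <- nrow(X); r <- ncol(mu)
  S <- matrix(0, r, r)
  for (j in seq_len(nrow(mu)))
    S <- S + z[j] * (tcrossprod(mu[j, ]) + sigma2 * Mlist[[j]])
  Vw <- solve(S / sigma2 + diag(r))          # \tilde V_w in Eq. (10)
  Omega <- X %*% (z * mu) %*% Vw / sigma2    # n x r; row i is \tilde omega_i'
  list(Omega = Omega, Vw = Vw)
}
```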

Then, the objective function is given by

$$\begin{aligned} Q(\theta |\Theta ^{(t)}) = \mathbb {E}_{w|\Theta ^{(t)}}\log \pi (\theta , w, X). \end{aligned}$$

We obtain

$$\begin{aligned} {\widehat{P}}(\theta )&= {{\,\mathrm{arg\,max}\,}}_{P(\theta ) \in \mathcal {P}^{\text {MF}}} \sum _{j=1}^p \left( \mathbb {E}_P Q(\theta _j |\Theta ^{(t)}) - \mathbb {E}_P \log P(\theta _j) \right) \\&= {{\,\mathrm{arg\,max}\,}}_{P(\theta ) \in \mathcal {P}^{\text {MF}}} \sum _{j=1}^p \Big ( \mathbb {E}_P \big [\mathbb {E}_{w|\Theta ^{(t)}} \log \pi (\theta _j, w, X)\big ] - \mathbb {E}_P \log P(\theta _j) \Big ). \end{aligned}$$

2. Parameter expansion

To obtain \({\widehat{P}}(\theta )\), special attention must be given to the orthogonality constraint on \(\mu \) defined in (6). This is where the parameter expansion technique is used. Let A be the expanded parameter and denote \(\beta = \theta A\); the likelihood after the parameter expansion becomes \(X_i = \beta w_i + \sigma \epsilon _i\), as \(A'w_i\) follows the same distribution as \(w_i\). Our spike and slab prior is then applied directly to \(\beta \). We do not require the prior to be invariant under the transformation of the parameter. After solving for \(\beta \), one can obtain \(\theta \) using the singular value decomposition (SVD). To accelerate the convergence speed of the algorithm, we apply parameter expansion a second time. This time, the expanded parameter is chosen to be a positive definite matrix, say D. We denote \({{\widetilde{\beta }}} = \beta D\). The likelihood after this parameter expansion becomes \(X_i = {{\widetilde{\beta }}} {\widetilde{w}}_i + \sigma \epsilon _i\), where \({\widetilde{w}}_i \sim \mathcal {N}(0, D)\), \({{\widetilde{\beta }}} = {{\widetilde{\theta }}} A D_L^{-1}\), and \(D_L\) is the lower triangular matrix from the Cholesky decomposition of D. Our spike and slab prior is then placed directly on \(\widetilde{\beta }\). To summarize, parameter expansion is used twice in the PX-CAVI algorithm: the first expansion primarily deals with the orthogonality constraint, and the second accelerates the convergence speed. We denote \({\widetilde{u}}\) and \({\widetilde{M}}\) as the mean and the covariance of \(P({{\widetilde{\beta }}})\). This leads us to instead maximize \(\mathbb {E}_P Q({{\widetilde{\Theta }}}|\widetilde{\Theta }^{(t)}) - \mathbb {E}_{P} \log P({{\widetilde{\beta }}})\), where \({{\widetilde{\Theta }}} = ({\widetilde{u}}, {\widetilde{M}}, z)\) and

$$\begin{aligned} P({{\widetilde{\beta }}}) = \prod _{j=1}^p \left[ z_j\mathcal {N}({\widetilde{u}}_j, \sigma ^2 {\widetilde{M}}_j) + (1-z_j) \delta _0 \right] . \end{aligned}$$
(11)

One can quickly check that \({\widetilde{u}} = \mu A D_L^{-1}\) and \({\widetilde{M}}_j = D_L^{-1} M_j {D_L^{-1}}'\). Note that since we assume \(\theta \) is jointly row-sparse, the support of \(\beta \) and that of \({{\widetilde{\beta }}}\) are the same. Thus, \(z_j\) in (11) is the same as that in (6).

To solve for \({\widetilde{u}}\) and \({\widetilde{M}}\), we explore the following two choices of the density g in (3):

\(\bullet \) When \(q = 1\) and \(m = 1\), g is a product of r independent Laplace densities. Details are given in Appendix A. In summary, denoting \(H_i = {{\widetilde{\omega }}}_i {{\widetilde{\omega }}}_i' + {\widetilde{V}}_w\), we obtain

$$\begin{aligned} \widehat{{\widetilde{u}}}_j&= {{\,\mathrm{arg\,min}\,}}_{\widetilde{u}_j} \left[ \frac{1}{2\sigma ^2} \sum _{i=1}^n \left( {\widetilde{u}}_j H_i {\widetilde{u}}_j' - 2 X_{ij} {\widetilde{u}}_j {{\widetilde{\omega }}}_i \right) \right. \nonumber \\&\qquad \left. + \lambda _1 \sum _{k=1}^r f({\widetilde{u}}_{jk}, \sigma ^2 {\widetilde{M}}_{j,kk})\right] \end{aligned}$$
(12)
$$\begin{aligned} \widehat{{\widetilde{M}}}_j&= {{\,\mathrm{arg\,min}\,}}_{\widetilde{M}_j} \left[ \frac{1}{2} \sum _{i=1}^n {{\,\text {Tr}\,}}({\widetilde{M}}_j H_i) - \frac{\log \det ({\widetilde{M}}_j) }{2} \right. \nonumber \\&\qquad \left. + \lambda _1 \sum _{k=1}^r f({\widetilde{u}}_{jk},\sigma ^2 {\widetilde{M}}_{j, kk}) \right] , \end{aligned}$$
(13)

where \(f({\widetilde{u}}_{jk},\sigma ^2 {\widetilde{M}}_{j,kk})\) is the mean of the folded normal distribution,

$$\begin{aligned} f({\widetilde{u}}_{jk},\sigma ^2 {\widetilde{M}}_{j,kk})&= \sqrt{\frac{2\sigma ^2 {\widetilde{M}}_{j, kk}}{\pi }} \exp \left( -\frac{{\widetilde{u}}_{jk}^2}{2\sigma ^2 \widetilde{M}_{j,kk}}\right) \\&\quad + {\widetilde{u}}_{jk} \left( 1-2\Phi \left( - \frac{{\widetilde{u}}_{jk}}{\sqrt{\sigma ^2 {\widetilde{M}}_{j,kk}}} \right) \right) , \end{aligned}$$

with \(\Phi \) being the cumulative distribution function of the standard normal distribution. Here, \(\det (B)\) and \({{\,\textrm{Tr}\,}}(B)\) stand for the determinant and the trace of a matrix B, respectively; a small helper computing f is sketched after Eq. (14) below.

\(\bullet \) When \(q = 2\) and \(m = 2\), g is a multivariate normal density. In this case, we use \(\mathcal {N}(0, \sigma ^2I_r/\lambda _1 )\) instead, as this makes the update of \(\sigma ^2\) simpler; equivalently, one can view this as choosing the tuning parameter to be \(\lambda _1/\sigma ^2\) instead of \(\lambda _1\). Then, we obtain

$$\begin{aligned}&\widehat{{\widetilde{u}}}_j = \widehat{{\widetilde{M}}}_j \sum _{i=1}^n X_{ij} {{\widetilde{\omega }}}_i'\quad \text {and}\quad \widehat{{\widetilde{M}}}_j = \left( \sum _{i=1}^n \left( \widetilde{\omega }_i {{\widetilde{\omega }}}_i' + {\widetilde{V}}_w \right) + \lambda _1 I_r \right) ^{-1}. \end{aligned}$$
(14)
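As a small numerical companion to the two cases above, the following base \(\textsf{R}\) sketches (function names are ours) compute the folded normal mean f used in (12)–(13) and the closed-form update (14) for the multivariate normal slab; \(\texttt{Omega}\) and \(\texttt{Vw}\) denote the E-step quantities \({{\widetilde{\omega }}}_i\) (stacked as rows) and \({\widetilde{V}}_w\).

```r
# mean of the folded normal distribution, i.e. E|Y| for Y ~ N(u, v)
folded_normal_mean <- function(u, v) {
  sqrt(2 * v / pi) * exp(-u^2 / (2 * v)) + u * (1 - 2 * pnorm(-u / sqrt(v)))
}

# update (14): with the normal slab, \tilde M_j is the same for every j
update_normal_slab <- function(X, Omega, Vw, lambda1) {
  n <- nrow(X); r <- ncol(Omega)
  Hsum <- crossprod(Omega) + n * Vw        # sum_i (omega_i omega_i' + Vw)
  Mtil <- solve(Hsum + lambda1 * diag(r))  # \hat{\tilde M}_j
  util <- crossprod(X, Omega) %*% Mtil     # p x r; row j is \hat{\tilde u}_j
  list(util = util, Mtil = Mtil, Hsum = Hsum)
}
```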

3. Updating \(\textbf{z}\)

To solve for z, we need to obtain \(\widehat{ h} = ({\widehat{h}}_1, \dots , {\widehat{h}}_p)\), where for each \(j\in \{1,\ldots ,p\}\), \({\widehat{h}}_j = \log ({\widehat{z}}_j / (1-{\widehat{z}}_j))\). In Appendix A, we derive the solution for \(\widehat{h}_j\). If g is the product of r independent Laplace densities, then

$$\begin{aligned} {\widehat{h}}_j&= \log \left( \frac{\alpha _1}{\alpha _2}\right) + r \log \left( \frac{\sqrt{\pi } \sigma \lambda _1}{\sqrt{2}}\right) \nonumber \\&\quad -\lambda _1 \sum _{k=1}^r f({\widetilde{u}}_{jk}, \sigma ^2 \widetilde{M}_{j,kk}) + \frac{\log \det ({\widetilde{M}}_j)+1}{2}\nonumber \\&\quad - \frac{1}{2\sigma ^2} \sum _{i=1}^n \left[ - 2 X_{ij} {\widetilde{u}}_j {{\widetilde{\omega }}}_i +{\widetilde{u}}_j H_i \widetilde{u}_j' + {{\,\textrm{Tr}\,}}(\sigma ^2 {\widetilde{M}}_j H_i) \right] . \end{aligned}$$
(15)

If g is the multivariate normal density, then

$$\begin{aligned} {\widehat{h}}_j&= \log \left( \frac{\alpha _1}{\alpha _2} \right) + \frac{r\log \lambda _1}{2} - \frac{\lambda _1}{2} \left( {\widetilde{u}}_j {\widetilde{u}}_j' + \sigma ^2 {{\,\text {Tr}\,}}({\widetilde{M}}_j) \right) \nonumber \\ {}&\quad + \frac{\log \det ({\widetilde{M}}_j) + 1}{2} - \frac{1}{2\sigma ^2} \sum _{i=1}^n \left( - 2X_{ij} {\widetilde{u}}_j {{\widetilde{\omega }}}_i' + {\widetilde{u}}_j H_i {\widetilde{u}}_j' \right. \nonumber \\&\quad \left. + {{\,\text {Tr}\,}}(\sigma ^2 {\widetilde{M}}_jH_i) \right) . \end{aligned}$$
(16)
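For the multivariate normal slab, the update of z can be sketched in base \(\textsf{R}\) as follows (the function name is ours; the inputs are the quantities returned by the helpers sketched above): compute \({\widehat{h}}_j\) from (16) and set \({\widehat{z}}_j = 1/(1 + e^{-{\widehat{h}}_j})\).

```r
update_z_normal <- function(X, Omega, util, Mtil, Hsum, lambda1, sigma2, alpha1, alpha2) {
  r <- ncol(util)
  XtOmega <- crossprod(X, Omega)             # p x r; row j is sum_i X_ij * omega_i'
  quad  <- rowSums((util %*% Hsum) * util)   # sum_i u_j H_i u_j'
  cross <- rowSums(XtOmega * util)           # sum_i X_ij u_j omega_i
  h <- log(alpha1 / alpha2) + r * log(lambda1) / 2 -
    lambda1 / 2 * (rowSums(util^2) + sigma2 * sum(diag(Mtil))) +
    (as.numeric(determinant(Mtil)$modulus) + 1) / 2 -
    (quad - 2 * cross + sigma2 * sum(diag(Mtil %*% Hsum))) / (2 * sigma2)
  1 / (1 + exp(-h))                          # \hat z_j
}
```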

4. Updating \({\varvec{\widehat{\mu }}}\) and \({\varvec{\widehat{M}}}\)

Having obtained \(\widehat{{\widetilde{u}}}\) and \(\widehat{{\widetilde{M}}}\), we can recover \({\widehat{\mu }}\) and \({\widehat{M}}\) as follows. Note that \({\widetilde{w}}_i \sim \mathcal {N}(0, D)\). In the E-step, we also obtained \({{\widetilde{\omega }}}_i\) and \({\widetilde{V}}_w\). Thus, D can be estimated by \({\widehat{D}} = \frac{1}{n} \sum _{i=1}^n {{\widetilde{\omega }}}_i {{\widetilde{\omega }}}_i' + {\widetilde{V}}_w\), and \({\widehat{\mu }}\) can be obtained by first computing \({\widehat{u}} = \widehat{{\widetilde{u}}} {\widehat{D}}_L\). Next, we apply the SVD to \({\widehat{u}}\) to obtain \({\widehat{A}}\). Last, we obtain \({\widehat{\mu }}\) using \({\widehat{\mu }} = {\widehat{u}} {\widehat{A}}'\). \({\widehat{M}}\) is obtained similarly, i.e., \({\widehat{M}}_j = {\widehat{D}}_L \widetilde{M}_j {\widehat{D}}_L'\).
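This recovery step can be sketched in base \(\textsf{R}\) as follows (the function name is ours); it computes \({\widehat{D}}\), its lower triangular Cholesky factor, \({\widehat{u}}\), \({\widehat{A}}\) via the SVD, and finally \({\widehat{\mu }}\) and \({\widehat{M}}_j\).

```r
rotate_back <- function(Omega, Vw, util, Mtil) {
  n <- nrow(Omega)
  D  <- crossprod(Omega) / n + Vw     # \hat D = (1/n) sum_i omega_i omega_i' + Vw
  DL <- t(chol(D))                    # lower triangular factor of \hat D
  u  <- util %*% DL                   # \hat u = \hat{\tilde u} \hat D_L
  sv <- svd(u)
  A  <- t(sv$v)                       # \hat A from the SVD of \hat u
  mu <- u %*% sv$v                    # \hat mu = \hat u \hat A'
  M  <- DL %*% Mtil %*% t(DL)         # \hat M_j = \hat D_L \tilde M_j \hat D_L'
  list(mu = mu, M = M, A = A, DL = DL)
}
```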

5. Updating \({\varvec{\sigma ^2}}\)

Recall that the prior is \(\sigma ^2 \sim \text {InverseGamma}(\sigma _a, \sigma _b)\). If g is the product of r independent Laplace densities, we obtain

$$\begin{aligned} {\widehat{\sigma }}^2&= {{\,\mathrm {arg\,min}\,}}_{\sigma ^2 \in (0, \infty )}\nonumber \\&\quad \Bigg [ \sum _{j=1}^p z_j \Big \{ \frac{1}{2\sigma ^2} \sum _{i=1}^n \Big ( {\widetilde{u}}_j H_i {\widetilde{u}}_j' - 2X_{ij} {\widetilde{u}}_j {{\widetilde{\omega }}}_i \Big ) - \frac{r\log \sigma ^2}{2} \nonumber \\&\qquad + \lambda _1 \sum _{k=1}^r f({\widetilde{u}}_{jk}, \sigma ^2 {\widetilde{M}}_{j,kk}) \Big \} \nonumber \\ {}&\quad +\frac{(np+2\sigma _a+2)\log \sigma ^2}{2} \nonumber \\&\quad + \frac{{{\,\text {Tr}\,}}(X'X) + 2\sigma _b}{2\sigma ^2} \Bigg ]. \end{aligned}$$
(17)

If g is the multivariate normal density, we obtain

$$\begin{aligned} {\widehat{\sigma }}^2 = \frac{ {{\,\textrm{Tr}\,}}(X'X) + \sum _{j=1}^p z_j \sum _{i=1}^n \left( {\widetilde{u}}_j H_i {\widetilde{u}}_j' - 2X_{ij} {\widetilde{u}}_j {{\widetilde{\omega }}}_i + \lambda _1 {\widetilde{u}}_j {\widetilde{u}}_j' \right) + 2\sigma _b }{ np + 2(\sigma _a + 1)}. \end{aligned}$$
(18)
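For the multivariate normal slab, (18) translates directly into the following small update (a sketch; the function name is ours and the InverseGamma hyperparameters are passed as arguments):

```r
update_sigma2_normal <- function(X, z, util, Omega, Hsum, lambda1,
                                 sigma_a = 1, sigma_b = 2) {
  n <- nrow(X); p <- ncol(X)
  XtOmega <- crossprod(X, Omega)
  per_j <- rowSums((util %*% Hsum) * util) - 2 * rowSums(XtOmega * util) +
    lambda1 * rowSums(util^2)
  (sum(X^2) + sum(z * per_j) + 2 * sigma_b) / (n * p + 2 * (sigma_a + 1))
}
```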

Now, we summarize the PX-CAVI algorithm.

Algorithm 1 The PX-CAVI algorithm
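The loop below gives a minimal sketch of one possible implementation of the PX-CAVI iteration for the multivariate normal slab (\(q = 2\), \(m = 2\)) with r fixed, stringing together the helper functions sketched in the previous steps. It is not the \(\textsf{VBsparsePCA}\) implementation; the initial values \((\mu , M, z, \sigma ^2)\) can be chosen as described in Sect. 6.1, and the stopping rule based on the change in \(\mu \) is an illustrative choice.

```r
px_cavi_sketch <- function(X, r, mu, Mlist, z, sigma2,
                           lambda1 = 1, alpha1 = 1, alpha2 = ncol(X) + 1,
                           n_iter = 100, tol = 1e-4) {
  p <- ncol(X)
  for (t in seq_len(n_iter)) {
    mu_old <- mu
    es <- e_step(X, mu, Mlist, z, sigma2)                   # Eq. (10)
    up <- update_normal_slab(X, es$Omega, es$Vw, lambda1)   # Eq. (14)
    z  <- update_z_normal(X, es$Omega, up$util, up$Mtil, up$Hsum,
                          lambda1, sigma2, alpha1, alpha2)  # Eq. (16)
    rb <- rotate_back(es$Omega, es$Vw, up$util, up$Mtil)    # step 4
    mu <- rb$mu
    Mlist <- rep(list(rb$M), p)
    sigma2 <- update_sigma2_normal(X, z, up$util, es$Omega,
                                   up$Hsum, lambda1)        # Eq. (18)
    if (max(abs(mu - mu_old)) < tol) break                  # convergence check
  }
  list(mu = mu, M = rb$M, z = z, sigma2 = sigma2)
}
```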

4 Asymptotic properties

This section studies the asymptotic properties of the posterior in (5) and the variational posterior in (7). We work with the subset selection prior, which includes the spike and slab prior in (2) as a special case and is constructed as follows: First, a number s is chosen from a prior \(\pi \) on the set \(\{0, \dots , p\}\). Next, a subset S of \(\{1, \dots , p\}\) with cardinality \(|S| = s\) is chosen uniformly at random. Last, conditional on S, if \(j \in S\), the prior for \(\theta _j\) is chosen to be \(\int _{A \in V_{r,r}} g(\theta _j|\lambda _1, A) d\Pi (A)\); if \(j \not \in S\), then \(\theta _j\) is set to \(0_r'\). The prior is given as follows:

$$\begin{aligned} \pi (\theta , S|\lambda _1) \propto \pi (|S|) \frac{1}{{p \atopwithdelims ()|S|}} \prod _{j \in S} \int _{A \in V_{r,r}} g(\theta _{j}| \lambda _1, A)\pi (A) dA \prod _{j \not \in S} \delta _0(\theta _j). \end{aligned}$$
(19)

Note that (2) is a special case of (19) when \(\pi (|S|)\) is the beta-binomial distribution. That is, \(s|\kappa \sim \text {binomial}(p, \kappa )\) and \(\kappa \sim \text {Beta}(\alpha _1, \alpha _2)\).
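To make this connection explicit, integrating \(\kappa \) out of (4) gives, for any configuration \({{\,\mathrm{\varvec{\gamma }}\,}}\) with \(s = \sum _{j=1}^p \gamma _j\) nonzero entries,

$$\begin{aligned} \int _0^1 \kappa ^{s}(1-\kappa )^{p-s}\, \frac{\kappa ^{\alpha _1 - 1}(1-\kappa )^{\alpha _2 - 1}}{B(\alpha _1, \alpha _2)}\, d\kappa = \frac{B(\alpha _1 + s, \alpha _2 + p - s)}{B(\alpha _1, \alpha _2)}, \end{aligned}$$

which depends on \({{\,\mathrm{\varvec{\gamma }}\,}}\) only through s. Summing over the \({p \atopwithdelims ()s}\) configurations with \(|S| = s\) yields \(\pi (|S| = s) = {p \atopwithdelims ()s}\, B(\alpha _1 + s, \alpha _2 + p - s)/B(\alpha _1, \alpha _2)\), the beta-binomial distribution, while conditionally on \(|S| = s\) every support set is equally likely, recovering the factor \(1/{p \atopwithdelims ()s}\) in (19).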

In the next subsection, we study the theoretical properties of the posterior under the subset selection prior. Before proceeding, we introduce some notation. Let \(\lesssim \) (resp. \(\gtrsim \)) stand for an inequality up to (resp. down to) a constant, \(a \asymp b\) stand for \(C_1a \le b \le C_2a\) with positive constants \(C_1 < C_2\), and \(a \ll b\) stand for \(a/b \rightarrow 0\). We denote \(\Vert b\Vert _2\) as the \(\ell _2\)-norm of a vector b and \(\Vert B\Vert \) as the spectral norm of a matrix B. The true value of an unknown parameter \(\vartheta \) is denoted by \(\vartheta ^\star \).

4.1 Contraction rate of the posterior

We study the dimensionality and the contraction rate of the posterior distribution. In this study, we assume r is unknown and \(\sigma ^2\) is fixed. Three assumptions are needed to obtain the rate.

Assumption 1

(Priors for s and r) For positive constants \(a_1\), \(a_2\), \(a_3\), and \(a_4\), assume

$$\begin{aligned} p^{-a_1} \lesssim \pi (s)/\pi (s-1) \lesssim p^{-a_2}, \qquad \exp (-a_3 r)\lesssim \pi (r) \lesssim \exp (- a_4 r). \end{aligned}$$

The above assumption imposes conditions on the tails of the priors \(\pi (s)\) and \(\pi (r)\). The first condition also appears in the study of the sparse linear regression model (e.g., Castillo et al. 2015; Martin et al. 2017; Ning et al. 2020). It requires that the logarithm of the ratio between \(\pi (s+h)\) and \(\pi (s)\) be of the same order of magnitude as \(-h\log p\): as h increases, the probability assigned to \(s+h\) decays exponentially fast. The beta-binomial prior mentioned above satisfies this condition if one chooses, for example, \(\alpha _1 = 1\) and \(\alpha _2 = p^{\nu } + 1\) for any \(\nu > \log \log p/ \log p\). The second condition is similar to that in Pati et al. (2014). It requires the tail of \(\pi (r)\) to decay exponentially fast; the Poisson distribution satisfies this condition.

Assumption 2

(Bounds for \(\lambda _1\)) For positive constants \(b_1, b_2,\) and \(b_3\), assume

$$\begin{aligned} b_1 \sqrt{\frac{n}{p^{b_2/r^\star }}} \le \lambda _1 \le b_3 \sqrt{n\log p}. \end{aligned}$$

Assumption 2 provides the permissible region for \(\lambda _1\). If \(\lambda _1\) is too large, it introduces extra shrinkage on large signals; if it is too small, the posterior contracts at a slower rate. Our upper bound is of the same order as that in Castillo et al. (2015), who studied the sparse linear regression model, but the lower bounds are different: ours is larger, and it can go to 0 only very slowly when \(r^\star \) is close to \(\log p/\log n\).

Assumption 3

(Bounds for \(r^\star \) and \(\theta ^\star \)) For some positive constants \(b_2\), \(b_4\), and \(b_5\), assume \(r^\star \le b_2\log p/\log n\), \(\Vert \theta ^\star \Vert \ge b_4\), and \(\Vert \theta ^\star \Vert _{1,1} \le b_5 s^\star \log p/\lambda _1\) if \(m = 1\) and \(1\le q\le 2\), and \(\Vert \theta ^\star \Vert ^2 \le b_5 s^\star \log p/\lambda _1\) if \(m = 2\) and \(q = 2\).

Assumption 3 requires the true values of \(\theta \) and r to be bounded. In particular, \(r^\star \) cannot be too large: if \(\log p/\log n \lesssim r^\star \lesssim \log p\), the rate obtained in Theorem 4.1 becomes slower, i.e., \(O(\sqrt{r^\star s^\star \log p/n})\). The bounds on \(\Vert \theta ^\star \Vert \) essentially control the largest eigenvalue, as \(\Vert \theta ^\star \Vert ^2 + \sigma ^2\) is the largest eigenvalue of \(\Sigma ^\star \); it can be neither too big nor too small.

We now present the main theorem.

Theorem 4.1

For the model in (1) and the subset selection prior in (19), if Assumptions 1–3 hold, then for sufficiently large constants \(M_1\), \(M_2\), and \(M_3 \ge M_2/b_4\), as n goes to infinity,

$$\begin{aligned}&{} \mathbb {E}_{f^\star } \Pi \big (\theta : |S| > M_1 s^\star | X\big ) \rightarrow 0, \end{aligned}$$
(20)
$$\begin{aligned}&{} \mathbb {E}_{f^\star } \Pi \big (\Vert \Sigma - \Sigma ^\star \Vert \ge M_2\epsilon _n| X\big ) \rightarrow 0, \end{aligned}$$
(21)
$$\begin{aligned}&\mathbb {E}_{f^\star } \Pi \left( \left\| UU' - U^\star {U^\star }'\right\| \ge M_3\epsilon _n| X \right) \rightarrow 0, \end{aligned}$$
(22)

where \(\epsilon _n = \sqrt{s^\star \log p/n}\).

In Theorem 4.1, we derive the posterior contraction rate under the spectral norm loss. The minimax rates under this loss have been studied by Cai et al. (2015). Consider the parameter space

$$\begin{aligned} \Theta _0(s, p, r, {\overline{\rho }})&= \Big \{ \Sigma : 0 \le \Vert \theta _{\cdot r}\Vert _2^2 \le \Vert \theta _{\cdot 1}\Vert _2^2 \le {\overline{\rho }},\; U \in V_{r, r},\; \\&\qquad |S| \le s \Big \}, \end{aligned}$$

For \(r \le s \le p\), the minimax rate of estimating \(\Sigma \) over this space is \( \sqrt{\frac{({\overline{\rho }} + 1)s}{n} \log \left( \frac{ep}{s} \right) + \frac{{\overline{\rho }}^2 r}{n}} \wedge {\overline{\rho }} \). Compared to the rate we obtained, assuming \({\overline{\rho }}\) is fixed and \(s \ge r\), our rate is suboptimal, as the log factor in our rate is \(\log p\), whereas in the minimax rate it is \(\log (p/s)\). Cai et al. (2015) also provided the minimax rate for the projection matrix. Assuming a more restrictive parameter space \(\Theta _1(s, p, r, \overline{\rho }, \tau )\),

$$\begin{aligned} \Theta _1(s, p, r, {\overline{\rho }}, \tau )&= \Big \{ \Sigma : {\overline{\rho }}/\tau \le \Vert \theta _{\cdot r}\Vert _2^2 \le \Vert \theta _{\cdot 1}\Vert _2^2 \le {\overline{\rho }},\; \\&\qquad U \in V_{r, r},\; |S| \le s \Big \}, \end{aligned}$$

the minimax rate is \(\sqrt{\frac{({\overline{\rho }} + 1)s}{n{\overline{\rho }}^2} \log \left( \frac{ep}{s}\right) } \wedge 1\). Again, if \({\overline{\rho }}\) is fixed, the rate we obtained is suboptimal.

One may ask whether we could obtain the same rate as in Theorem 4.1 if we use the Frobenius norm as the loss function (in short, the Frobenius loss). This is in fact possible, and the proof can simply follow the argument in Gao and Zhou (2015). However, one needs to impose a lower bound on \(\Vert \theta _{\cdot r}\Vert _2^2\). Although in practice the lower bound can be introduced through the prior, e.g., using a truncated prior, the exact value is hard to determine. Thus, we did not choose this prior.

4.2 Contraction rate of the variational posterior

We study the contraction rate of the variational posterior in (7). Recent studies on this topic have provided exciting results for variational methods and developed useful tools for studying their theoretical properties (e.g., Ray and Szabo 2022; Wang and Blei 2019; Yang et al. 2020; Zhang and Gao 2020). Ray and Szabo (2022) and Yang et al. (2020) studied the spike and slab posterior for the linear regression model and obtained a (near-)optimal rate for their variational posterior. Zhang and Gao (2020) proposed a general framework for deriving the contraction rate of a variational posterior. We derive our rate by directly applying this general framework, as our variational posterior is intractable and a direct argument (such as those used for the linear regression model) is not feasible. Theorem 4.2 shows that the rate of the variational posterior is also \(\epsilon _n\) (but with a larger constant). The proof of the theorem is provided in the supplementary material.

Theorem 4.2

With the model (1) and the subset selection prior (19), if \({\widehat{P}}(\theta ) \in \mathcal {P}^{{{\,\mathrm{\text {MF}}\,}}}\) and Assumptions 1–3 hold, then for sufficiently large constants \(M_4\) and \(M_5\), as n goes to infinity,

$$\begin{aligned}&\widehat{Q}(\Vert \Sigma - \Sigma ^\star \Vert \ge M_4\epsilon _n|X) \rightarrow 0, \end{aligned}$$
(23)
$$\begin{aligned}&\widehat{Q}(\Vert UU' - U^\star {U^\star }'\Vert \ge M_5 \epsilon _n|X) \rightarrow 0. \end{aligned}$$
(24)

5 The PX-EM algorithm

The EM algorithm is another popular algorithm in Bayesian high-dimensional analysis. In this section, to better understand the strengths of the PX-CAVI algorithm, we also develop its EM analogue, referred to as the PX-EM algorithm. The parameter expansion steps for the PX-EM algorithm mirror those used in the PX-CAVI algorithm. The PX-EM algorithm requires the continuous spike and slab prior, which is

$$\begin{aligned} \pi (\theta , S|\lambda _1, \lambda _0)&\propto \int _A \prod _{j=1}^p \Big [ \gamma _j g(\theta _j| \lambda _1, A, r)\nonumber \\&\quad + (1-\gamma _j) g(\theta _j|\lambda _0, A, r) \Big ] \pi (A) dA, \end{aligned}$$
(25)

where \(\lambda _0 \gg \lambda _1\). Compared to (2), the Dirac measure is replaced by a continuous density that is highly concentrated around zero. The priors for the remaining parameters stay the same.

Our PX-EM algorithm contains two steps: E-step and M-step. In the E-step, expectations are taken with respect to both w and \({{\,\mathrm{\varvec{\gamma }}\,}}\). We then obtain

$$\begin{aligned} w_i|\theta ^{(t)}, X&\sim \mathcal {N}(\omega _i, V_w), \end{aligned}$$
(26)
$$\begin{aligned}&\gamma _j |\theta ^{(t)}, \kappa ^{(t)}, X \sim {\text {Bernoulli}}\big ({{\widetilde{\gamma }}}_j^{(t)}\big ), \end{aligned}$$
(27)

where \(\theta ^{(t)}\) and \(\kappa ^{(t)}\) are the estimated values of \(\theta \) and \(\kappa \) from the t-th iteration and

$$\begin{aligned} V_w= & {} \sigma ^2 ({\theta ^{(t)}}'\theta ^{(t)} + \sigma ^2 I_r)^{-1}, \quad \omega _i = \sigma ^{-2} V_w {\theta ^{(t)}}' X_{i}, \end{aligned}$$
(28)
$$\begin{aligned} {{\widetilde{\gamma }}}_j^{(t)}= & {} P(\gamma _j=1|\theta ^{(t)}, \kappa ^{(t)}, X) = \frac{a_j^{(t)}}{a_j^{(t)}+b_j^{(t)}}, \end{aligned}$$
(29)

where \(a_j^{(t)} = \exp ( - \lambda _1 \Vert \theta ^{(t)}_j\Vert _q^m + \log \kappa ^{(t)})\) and \(b_j^{(t)} = \exp ( - \lambda _0 \Vert \theta ^{(t)}_j\Vert _q^m + \log (1-\kappa ^{(t)}))\).
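The E-step quantities (28)–(29) can be computed as in the following base \(\textsf{R}\) sketch (the function name is ours); computing \({{\widetilde{\gamma }}}_j\) on the log scale is an implementation choice for numerical stability.

```r
px_em_e_step <- function(X, theta, kappa, sigma2, lambda1, lambda0, q = 1, m = 1) {
  r <- ncol(theta)
  Vw <- sigma2 * solve(crossprod(theta) + sigma2 * diag(r))   # Eq. (28)
  Omega <- X %*% theta %*% Vw / sigma2                        # rows are omega_i'
  norms <- apply(theta, 1, function(th) sum(abs(th)^q)^(1 / q))^m
  log_a <- -lambda1 * norms + log(kappa)                      # Eq. (29)
  log_b <- -lambda0 * norms + log(1 - kappa)
  gamma_tilde <- 1 / (1 + exp(log_b - log_a))                 # a_j / (a_j + b_j)
  list(Omega = Omega, Vw = Vw, gamma_tilde = gamma_tilde)
}
```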

To obtain the objective function, we first apply parameter expansion to the likelihood, the same as in the PX-CAVI algorithm. The expanded parameter becomes \({{\widetilde{\beta }}} = \beta D = \theta A D\). The spike and slab prior is then applied directly to \({{\widetilde{\beta }}}\). The objective function is given by \( Q({{\widetilde{\beta }}}, \kappa | \theta ^{(t)}, A^{(t)}, D^{(t)}, \kappa ^{(t)}) \), where

$$\begin{aligned} \begin{aligned} Q&= \mathbb {E}_{w, {{\,\mathrm {\varvec{\gamma }}\,}}|\theta ^{(t)}, \kappa ^{(t)}} \log \pi ({{\widetilde{\beta }}}, w|X)\\ {}&= C - \sum _{j=1}^p \left( \frac{1}{2\sigma ^2} \left\| M_L {{\widetilde{\beta }}}_j' - d_j\right\| _2^2 \right. \\&\quad \left. + ({{\widetilde{\gamma }}}_j \lambda _1 + (1-{{\widetilde{\gamma }}}_j) \lambda _0) \Vert {{\widetilde{\beta }}}_j\Vert ^m_q\right) \\ {}&\quad + \left( \Vert {{\widetilde{{{\,\mathrm {\varvec{\gamma }}\,}}}}}\Vert _1 {+} \alpha _1 {-} 1\right) \log \kappa {+} \left( p {-} \Vert {{\widetilde{{{\,\mathrm {\varvec{\gamma }}\,}}}}}\Vert _1 {+} \alpha _2 {-} 1\right) \log (1{-}\kappa ), \end{aligned} \end{aligned}$$
(30)

where C is a constant, \(M_L\) is the lower triangular part from the Cholesky decomposition, \(M = \sum _{i=1}^n \omega _i \omega _i' + nV_w\), and \(d_j = {M_L}^{-1} \sum _{i=1}^n \omega _i X_{ij}\).

In the M-step, we maximize the objective function and obtain

$$\begin{aligned} \widehat{{{\widetilde{\beta }}}}_j&= {{\,\mathrm{arg\,min}\,}}_{{{\widetilde{\beta }}}_j} \left\{ \frac{1}{2\sigma ^2} \left\| M_L {{\widetilde{\beta }}}_j' - d_j \right\| _2^2 + \text {pen}_j \Vert {{\widetilde{\beta }}}_j\Vert _q^m \right\} , \end{aligned}$$
(31)
$$\begin{aligned} {\widehat{\kappa }}&= \frac{\alpha _1 + \Vert {{\widetilde{{{\,\mathrm{\varvec{\gamma }}\,}}}}}\Vert _1 - 1}{p + \alpha _1 + \alpha _2 - 2}, \end{aligned}$$
(32)

where \(\text {pen}_j = {{\widetilde{\gamma }}}_j \lambda _1 + (1-{{\widetilde{\gamma }}}_j) \lambda _0\). Then \({\widehat{\theta }}\) is obtained using \({{\widetilde{\beta }}} = \theta A D_L\), where \({\widehat{D}} = \frac{1}{n} \sum _{i=1}^n{\omega _i \omega _i'} + V_w\) and \({\widehat{A}}\) is obtained by applying the SVD on the matrix \(\widehat{{{\widetilde{\beta }}}} {\widehat{D}}_L^{-1}\).

In (31), we choose \(m = 1\) and let \(q = 1\) or 2. When \(q = 1\), the expression is similar to that of the adaptive lasso (Zou et al. 2006). When \(q = 2\), the penalty term is similar to that of the group lasso method (Yuan and Lin 2006). Despite these similarities, the tuning parameter in (31) is updated during each EM iteration, whereas in both of the aforementioned methods the tuning parameters are fixed. The benefit of allowing the tuning parameter to update is explored by Ročková (2018), who studied the sparse normal mean model.

Last, we obtain

$$\begin{aligned} {\widehat{\sigma }}^2 = \frac{ {{\,\textrm{Tr}\,}}(X'X) - 2 \sum _{j=1}^p d_j M_L \theta _j' + \sum _{j=1}^p \theta _j M \theta _j' + 2\sigma _b }{ np + 2(\sigma _a + 1) }. \end{aligned}$$
(33)
Algorithm 2 The PX-EM algorithm

We conclude this section by offering theoretical justification for utilizing parameter expansion to accelerate the convergence speed of the EM algorithm. We observed that the convergence speed improves with both parameter expansions. Intuitively, by Dempster et al. (1977), the speed of convergence is determined by the largest eigenvalue of \(S(\Delta ) = I^{-1}_{com}(\Delta ) I_{obs}(\Delta )\), where

$$\begin{aligned}&I_{obs}(\Delta ) = -\frac{\partial ^2 \log (\Delta |X)}{\partial \Delta \partial \Delta '}\Bigg |_{\Delta = \Delta ^\star } \quad \text{ and }\quad \nonumber \\&I_{com}(\Delta ) = -\frac{\partial ^2 Q(\Delta |\Delta )}{\partial \Delta \partial \Delta '}\Bigg |_{\Delta = \Delta ^\star }. \end{aligned}$$
(34)

We denote \(\Delta \) as the collection of all the unknown parameters and \(\Delta ^\star \) as their true values. Let \(\Psi \) be the expanded parameter and \({{\widetilde{\Delta }}} = (\Delta , \Psi )\). We found that the largest eigenvalue of \(S({{\widetilde{\Delta }}})\) is larger than that of \(S(\Delta )\); thus, the convergence speed is increased. Lemma 5.1 provides a formal statement of this result. The proof of Lemma 5.1 is given in the supplementary material.

Lemma 5.1

Given that the PX-EM algorithm converges to the posterior mode, both parameter expansions speed up the convergence of the original EM algorithm.

6 Simulation study

In this section, we conduct four simulation studies to evaluate the performance of the proposed PX-CAVI algorithm. First, we compare the use of a product of Laplace densities (i.e., \(q = 1\) and \(m=1\) in g in (3)) with the multivariate normal density (i.e., \(q = 2\) and \(m = 2\) in g in (3)) within the PX-CAVI algorithm. Next, we compare the PX-CAVI algorithm with the PX-EM algorithm. Additionally, we introduce the batch PX-CAVI algorithm, which does not require \(\theta \) to be jointly row-sparse, and compare it with two other penalty methods for SPCA and with conventional PCA. In the final study, we assume that r is unknown and demonstrate that the algorithm is not very sensitive to the choice of r. Throughout all the studies, we set \(\sigma ^2\) to be fixed; however, the \(\textsf{R}\) package we provide can estimate \(\sigma ^2\) automatically.

The datasets are generated as follows: First, given \(r^\star \), \(s^\star \), and p, we generate \(U^\star \) using the \(\textsf{randortho}\) function in \(\textsf{R}\). Next, we set \(\sigma ^2 = 0.1\) and choose the diagonal values of \(\Lambda ^\star \) to be an equally spaced sequence from 10 to 20 (i.e., the largest value is 20 and the smallest value is 10); in the first study, however, we choose different values for \(\Lambda ^\star \); see Sect. 6.2 for details. Last, we obtain \(\Sigma ^\star = U^\star \Lambda ^\star {U^\star }' + \sigma ^2 I_p\) and generate \(n = 200\) independent samples from \(\mathcal {N}(0, \Sigma ^\star )\); the dataset is then an \(n \times p\) matrix. For each simulated dataset, we compute the following quantities: the Frobenius loss of the projection matrix \(\Vert {\widehat{U}}{\widehat{U}}' - U^\star {U^\star }'\Vert _F\), the percentage of misclassification, also known as the average Hamming distance, \(\Vert {\widehat{z}} - {{\,\mathrm{\varvec{\gamma }}\,}}^\star \Vert _1/p\), the false discovery rate (FDR), and the false negative rate (FNR). The hyperparameters in the prior are chosen as follows: \(\lambda _1 = 1\), \(\alpha _1 = 1\), \(\alpha _2 = p + 1\), \(\sigma _a = 1\), and \(\sigma _b = 2\). Also, we set the total number of iterations \(T = 100\), \(\iota = 0.1\), and the convergence threshold \(\delta = 10^{-4}\). To decide whether \(\gamma _j = 1\) or 0, we threshold \({\widehat{z}}_j\) at 0.5.
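The evaluation metrics above can be computed as in the following base \(\textsf{R}\) sketch (function names are ours):

```r
# Frobenius loss of the projection matrix
proj_loss <- function(U_hat, U_star) {
  norm(U_hat %*% t(U_hat) - U_star %*% t(U_star), type = "F")
}

# average Hamming distance, FDR and FNR (gamma_j declared 1 when z_j > 0.5)
selection_metrics <- function(z_hat, gamma_star, threshold = 0.5) {
  sel <- as.numeric(z_hat > threshold)
  c(hamming = mean(abs(z_hat - gamma_star)),   # ||z_hat - gamma_star||_1 / p
    FDR = if (sum(sel) > 0) sum(sel == 1 & gamma_star == 0) / sum(sel) else 0,
    FNR = if (sum(gamma_star) > 0) sum(sel == 0 & gamma_star == 1) / sum(gamma_star) else 0)
}
```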

6.1 On choosing the initial values for PX-CAVI and PX-EM

Before presenting the simulation results, let us discuss how we obtained the initial values for the PX-CAVI algorithm (and the batch PX-CAVI algorithm) and for the PX-EM algorithm. We explored different choices of initial values and found that the PX-CAVI algorithm is robust to these variations, so it is not overly sensitive to the specific initialization. We therefore estimated \({\widehat{\mu }}^{(0)}\) using conventional PCA and set \({\widehat{z}}^{(0)} = \mathbb {1}_p'\). For \({\widehat{M}}_j^{(0)}\), we use an identity matrix times a small value (e.g., \(10^{-3}\)). Finally, \(({\widehat{\sigma }}^{(0)})^2\) is chosen to be the smallest eigenvalue of the Gramian matrix \(X'X/(n-1)\).
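A sketch of this initialization in base \(\textsf{R}\) is given below; the function name is ours, and scaling the PCA loadings by the component standard deviations, as well as taking the smallest nonzero eigenvalue of \(X'X/(n-1)\) from the singular values of X (relevant when \(p > n\)), are our assumptions.

```r
init_px_cavi <- function(X, r) {
  n <- nrow(X); p <- ncol(X)
  pc <- prcomp(X)
  mu0 <- pc$rotation[, 1:r, drop = FALSE] %*% diag(pc$sdev[1:r], nrow = r)
  sigma2_0 <- min(svd(X)$d)^2 / (n - 1)   # smallest nonzero eigenvalue of X'X/(n-1)
  list(mu = mu0,
       z = rep(1, p),
       Mlist = rep(list(1e-3 * diag(r)), p),
       sigma2 = sigma2_0)
}
```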

The PX-EM algorithm is more sensitive to poor initialization than the PX-CAVI algorithm. To address this concern, we employed two strategies to alleviate the issue. The first is prior elicitation, proposed by Ročková and Lesaffre (2014). We replaced (27) with its tempered version given by

$$\begin{aligned} {{\widetilde{\gamma }}}_j^{(t)} = P(\gamma _j=1|\theta ^{(t)}, X) = \frac{\left( a_j^{(t)}\right) ^\iota }{\left( a_j^{(t)}\right) ^\iota + \left( b_j^{(t)}\right) ^\iota }. \end{aligned}$$
(35)

where \(\iota < 1\) is fixed. In the simulation study, we fix \(\iota = 0.1\).

The second strategy is to choose a “best guess” of the initial values using the path-following approach proposed by Ročková and George (2016). First, we choose a sequence of values of \(\lambda _0\), \(\{\lambda _0^{(1)}, \dots , \lambda _0^{(I)}\}\), where \(\lambda _0^{(1)} = \lambda _1 + 2\sqrt{\rho _{\min }}\), with \(\rho _{\min }\) the smallest eigenvalue of \(X'X/(n-1)\), and \(\lambda _0^{(I)} = p^2\log p\). Next, we obtain an initial value of \(\theta \) using conventional PCA and repeat the following process: at the i-th step, set \(\lambda _0 = \lambda _0^{(i)}\) and use as input values the output values from the \((i-1)\)-th step. We repeat this I times until all values in the sequence of \(\lambda _0\) have been used. Finally, the output from the last step is used as the initial values for the PX-EM algorithm. Compared to the PX-CAVI algorithm, obtaining the initial values of the PX-EM algorithm thus takes a much longer time.
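The \(\lambda _0\) ladder can be constructed as in the following sketch; the endpoints follow the description above, while the geometric spacing between them and the use of the smallest nonzero eigenvalue of \(X'X/(n-1)\) (relevant when \(p > n\)) are our assumptions.

```r
lambda0_ladder <- function(X, lambda1, I = 10) {
  n <- nrow(X); p <- ncol(X)
  rho_min <- min(svd(X)$d)^2 / (n - 1)    # smallest nonzero eigenvalue of X'X/(n-1)
  exp(seq(log(lambda1 + 2 * sqrt(rho_min)), log(p^2 * log(p)), length.out = I))
}
```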

6.2 Laplace density vs normal density

Let \(r^\star =1\); then g is the Laplace distribution when \(m =1, q = 1\) and the normal distribution when \(m = 2, q = 2\). We conducted simulation studies of the PX-CAVI algorithm comparing the use of these two distributions. We chose \(\Vert \theta ^\star \Vert ^2 \in \{1, 3, 5, 10, 20\}\) and \(p \in \{100, 1000\}\). For each setting, 1000 datasets were generated. Simulation results are provided in Table 1.

Table 1 Simulation results of the PX-CAVI algorithm using the Laplace and Normal densities
Table 2 Simulation results of the PX-CAVI and the PX-EM algorithms. We fixed \(n = 200\) and chose \(s^\star \in \{10, 20, 40, 70, 150\}\), \(r^\star \in \{1, 2, 3, 5\}\), and \(p \in \{500, 1000, 2000, 4000\}\)

From Table 1, we observe the following. For \(p = 100\), there is no significant difference between using the normal and the Laplace densities, as their results are similar. However, when \(p = 1000\), using the normal density yields better results, as indicated by the smaller average Frobenius loss of the projection matrix. In the case of \(p = 1000\), the normal density outperforms the Laplace density in estimating weaker signals (e.g., the Frobenius loss when \(\Vert \theta ^\star \Vert ^2 = 1\)). The computation using the normal density is also faster than with the Laplace density, because with the Laplace density the algorithm needs to solve the two nonlinear problems (12) and (13) in each iteration. The computational cost increases notably when \(r > 2\) with the Laplace density, as solving Eqs. (12) and (13) becomes more challenging. Based on these findings, we recommend using the multivariate normal density, especially when the rank r is large, as it provides improved performance and computational efficiency compared to the Laplace density.

Table 3 Simulation results of the batch PX-CAVI (bPX-CAVI) algorithm, the two SPCA algorithms proposed by Zou et al. (2006) and Erichson et al. (2020) (namely, sPCA1 and sPCA2), and PCA
Table 4 Simulations for the PX-CAVI algorithm choosing different input values for r

6.3 Comparison between PX-CAVI and PX-EM

In this study, we compare the PX-CAVI algorithm with the PX-EM algorithm. Two options for q in (31) are considered for the PX-EM algorithm: \(q = 1\), representing the \(\ell _1\)-norm, and \(q = 2\), representing the \(\ell _2\)-norm. We observed that the algorithm using the \(\ell _1\)-norm outperforms the one using the \(\ell _2\)-norm in terms of parameter estimation (see the simulation results in the supplementary material); henceforth, we use the \(\ell _1\)-norm. The true parameter values were chosen as follows: we took \(s^\star = 20\) and \(r^\star = 2\) as the baseline and additionally considered \(s^\star \in \{10, 40, 70, 150\}\), \(r^\star \in \{1,3,5\}\), and \(p \in \{500, 1000, 2000, 4000\}\). We ran both the PX-CAVI and the PX-EM algorithms; the results are given in Table 2. As mentioned in Sect. 6.1, choosing the initial values for the PX-EM algorithm takes a longer time, and thus we were only able to run 100 simulations; for the PX-CAVI algorithm, the results are based on 1000 simulations.

We highlight two findings from Table 2. First, in general, the PX-CAVI algorithm outperforms the PX-EM algorithm in both parameter estimation and variable selection. When \(s^\star \) and \(r^\star \) are large, the PX-CAVI algorithm is more accurate. When \(s^\star \) and \(r^\star \) are small (e.g., \(s^\star = 10, r^\star = 1\) and \(s^\star = 40, r^\star = 1\)), the Frobenius loss and the percentage of misclassification appear larger for the PX-CAVI algorithm than for the PX-EM algorithm. However, the standard errors associated with the Frobenius loss are 0.011 when \(s^\star = 10, r^\star = 1\) and 0.015 when \(s^\star = 40, r^\star = 1\); for the percentage of misclassification, the standard errors are 0.1 when \(s^\star = 10, r^\star = 1\) and 0.2 when \(s^\star = 40, r^\star = 1\). Consequently, the observed differences between the two algorithms are insignificant. Our second notable finding is that both algorithms effectively control the FDR; however, the PX-CAVI algorithm exhibits better control over the FNR, resulting in more accurate and desirable variable selection outcomes.

Table 5 The results of the top 10 reference IDs of genes and the total numbers of active genes of the first two principal components estimated by the PX-CAVI and the batch PX-CAVI (bPX-CAVI) algorithms and PCA
Fig. 1 The three plots in the first row show the first three principal component scores estimated using the PX-CAVI algorithm, and the three plots in the bottom row show the same scores estimated using the batch PX-CAVI algorithm

6.4 The batch PX-CAVI vs other SPCA algorithms

The PX-CAVI algorithm assumes \(\theta \) to be jointly row-sparse. In the supplementary material, we propose the batch PX-CAVI algorithm, which relaxes this assumption and allows each principal component to have its own support. The batch PX-CAVI algorithm updates the coordinates belonging to the same row simultaneously. To evaluate its performance, we compare it with two other popular algorithms for SPCA: the elastic net method proposed by Zou et al. (2006) and the robust SPCA method proposed by Erichson et al. (2020). Both of these are penalty-based approaches, and their tuning parameters are fixed (unlike in the PX-EM algorithm). They are often used in practice, and their \(\textsf{R}\) packages \(\textsf{elasticnet}\) and \(\textsf{sparsepca}\) are available on CRAN.

To determine the optimal values of the tuning parameters for each algorithm, we consider a vector containing 100 values and estimate the Frobenius loss of the projection matrix for each value in ascending order; the tuning parameter that yields the smallest Frobenius loss is selected as the optimal value. The results are presented in Table 3. Notably, the batch PX-CAVI algorithm outperforms the other three algorithms listed in the table, with the smallest estimation and selection errors regardless of the values of p, \(s^\star \), and \(r^\star \). Furthermore, the algorithm proposed by Zou et al. (2006) shows better performance than Erichson et al. (2020)'s method when \(r^\star \) is large. As expected, all three algorithms (the batch PX-CAVI and the two penalty methods) outperform conventional PCA. The program was executed on a MacBook Pro laptop with a 2.9 GHz Intel Core i7 processor and 16 GB of memory. With \(n = 200, p = 1000, s = 10\), and \(r = 2\), the average running time for a single simulation was 14.96 s for the PX-CAVI algorithm, 0.281 s for the method proposed by Zou et al. (2006), and 0.498 s for the method introduced by Erichson et al. (2020). The running time of our PX-CAVI algorithm can be further reduced by using parallel computing to update \(\widehat{\widetilde{u}}_j\) in (12) and (14) for each \(j = 1, \dots , p\) in Algorithm 1.

6.5 Unknown r

The previous three studies assumed that r is known. In this study, we investigate the scenario where r is unknown. Rather than modifying our algorithms to estimate r directly—which could increase computation time and introduce complexity in choosing initial values—we propose the practical approach of plugging in a value for r before conducting the analysis. This plugged-in value can be obtained from other algorithms or from prior studies. Importantly, the accuracy of the plugged-in value is not critical. The study is designed as follows: we set \(r^\star = 4\), \(n = 200\), \(p = 1000\), and \(s^\star = 70\), and the input value of r is chosen from \(\{1, 2, 3, 4, 5, 20\}\). For each value, we ran the PX-CAVI algorithm and obtained the average values of \(|\langle {\widehat{U}}_{\cdot k}, U_{\cdot k}^\star \rangle |\) and the percentage of misclassification over 1000 simulations. Note that \({\widehat{U}}_{\cdot k}\) is the k-th eigenvector obtained from \({\widehat{\mu }}\); \({\widehat{U}}_{\cdot k}\) and \(U_{\cdot k}^\star \) are close if \(|\langle {\widehat{U}}_{\cdot k}, U_{\cdot k}^\star \rangle |\) is close to 1. The results are provided in Table 4. We found that the results are similar regardless of the input value of r, even when \(r = 20\). Additionally, we noticed that as the input rank increases, the accuracy of variable selection improves.

7 A real data study

This section applies the PX-CAVI and batch PX-CAVI algorithms to analyze a lung cancer dataset. This is a gene expression dataset, accessible through the \(\textsf{R}\) package \(\textsf{sparseBC}\), which comprises expression levels of 5000 genes from 56 subjects. These subjects encompass 20 pulmonary carcinoid subjects (carcinoid), 13 colon cancer metastasis subjects (colon), 17 normal lung subjects (normal), and 6 small cell lung subjects (small cell). The primary objective is to identify biologically relevant genes correlated with lung cancer and to distinguish the four different cancer types.

To prepare the data for analysis, we center and scale them before running each algorithm. In this study, we set the rank \(r = 8\), as it captures over \(70\%\) of the variability. Furthermore, we are particularly interested in the first three principal components (PCs), so selecting \(r = 8\) serves the purpose well. Table 5 presents the top 10 reference IDs of genes identified from the first and second PCs. Each reference ID corresponds to a specific gene, and this correspondence can be validated using the NCBI website. For instance, the reference ID ‘38691_s_at’ represents the gene 6440 (see https://www.ncbi.nlm.nih.gov/geoprofiles/62830018).

From Table 5, we make the following observations. The top ten genes of the first principal component obtained from all three algorithms are the same. In the second PC, the order varies slightly, but overall the results are similar. We conducted a gene count analysis to determine the number of genes with nonzero loading values for each PC. For PCA, which does not impose sparsity on the loadings matrix, the number of nonzeros equals the total number of genes. The PX-CAVI algorithm ensures that all PCs have the same number of nonzero loadings through the joint row-sparsity assumption. This property eases interpretation, as there is no concern about specific genes being selected in the first PC but not in the second PC. The batch PX-CAVI algorithm employs fewer genes than the PX-CAVI algorithm to construct the second PC. Comparing their score functions in Fig. 1, we observe that using either 1183 genes or 795 genes to represent the PCs does not result in significant differences, which demonstrates the advantage of the batch PX-CAVI algorithm in utilizing fewer genes to construct PCs while maintaining comparable performance. Additionally, we provide the first three PC scores estimated by both algorithms and highlight the four cancer types using different colors. As shown in the PC scores, the four cancer types are well separated, indicating the effectiveness of our algorithms in distinguishing between the different types of lung cancer.

8 Conclusion and discussion

In this paper, we proposed the PX-CAVI algorithm (and the batch PX-CAVI algorithm) and its EM analogue, the PX-EM algorithm, for Bayesian SPCA. These algorithms utilize parameter expansion to handle the orthogonality constraint on the loadings matrix and to enhance their convergence speeds. We demonstrated that the PX-CAVI algorithm outperforms the other algorithms discussed in the paper. Furthermore, we studied the posterior contraction rate of the variational posterior, providing a novel contribution to the existing literature. Additionally, our findings revealed that choosing the normal (or multivariate normal) density for g yields better results than the heavier-tailed Laplace density.

Future work includes understanding why the Laplace density in the current prior fails to yield smaller estimation errors even in the rank-one case, and exploring other shrinkage priors such as the product moment prior of Johnson and Rossell (2010) and the non-local priors considered in Avalos-Pacheco et al. (2022). Additionally, the uncertainty quantification problem for SPCA remains unexplored, despite the rich literature on this topic for the sparse linear regression model (see van der Pas et al. 2017; Belitser and Ghosal 2020; Castillo and Roquain 2020; Martin and Ning 2020). Moreover, gaining a deeper understanding of the variational posterior, e.g., the conditions under which it achieves variable selection consistency, would be valuable. Lastly, an interesting avenue for exploration involves extending our proposed method to the supervised learning setting, as explored in She (2017); in that context, the row-wise sparsity and low-rank restrictions can be imposed through priors similar to those considered in the current paper. Our \(\textsf{R}\) package \(\textsf{VBsparsePCA}\) for the PX-CAVI and batch PX-CAVI algorithms is available on CRAN, offering a practical tool for researchers to apply these algorithms in their analyses.

Supplementary Material

Supplement to “Spike and slab Bayesian sparse principal component analysis”

In this supplementary material, we present the batch PX-CAVI algorithm, include the simulation results of the PX-EM algorithm with the \(\ell _1\)-norm and \(\ell _2\)-norm penalties, give the proofs of Theorems 4.1 and 4.2 and Lemma 5.1, and provide some auxiliary lemmas.