1 Introduction

For an n-observations \(\times \, p\)-variables column-centered data matrix X, principal component analysis (PCA) can be formulated as minimizing the least squares function

$$\begin{aligned} LS=\Vert \mathbf{X}-\mathbf{FA}^{\prime } \Vert ^{2}=\Vert \mathbf{X}-\mathbf{XWA}^{\prime } \Vert ^{2} \end{aligned}$$
(1.1)

over W \((p \times m)\) and A \((p \times m)\) (e.g., Izenman 2008; Zou et al. 2006). Here, \(\Vert \bullet \Vert ^{2}\) indicates the squared Frobenius norm and \(\mathbf{F} = \mathbf{XW}\) is an \(n \times m\) matrix of PC scores, where \(m \le \hbox { min}(n,\, p)\) is the number of components. That is, PCA is regarded as an approximation of X by a lower rank matrix \(\mathbf{FA}'\). It is well known that the problem is solved through the singular value decomposition (SVD) of X (Eckart and Young 1936; Takane 2014).
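To make this least-squares view concrete, the following minimal sketch (in Python with NumPy, which is not used in the paper; the variable names are ours) recovers F, W, and A from the truncated SVD of X, using the scaling of (1.2) introduced below, and checks that the loss equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100, 6, 2
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                          # column-centre X

# The rank-m truncated SVD of X solves (1.1)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
F = np.sqrt(n) * U[:, :m]                    # PC scores scaled so that F'F / n = I_m
A = Vt[:m].T * (d[:m] / np.sqrt(n))          # loadings (p x m)
W = Vt[:m].T * (np.sqrt(n) / d[:m])          # weights (p x m): F = X W

loss = np.linalg.norm(X - F @ A.T) ** 2      # sum of the discarded squared singular values
print(np.isclose(loss, np.sum(d[m:] ** 2)))  # True
print(np.allclose(F, X @ W))                 # True: F = XW
```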

The term “loading matrix” is used by some authors for A, and by others for W. To avoid this confusion, we call W the weight matrix, as it weights the variables in X, while we call A the loading matrix, as it describes how the variables load on the components in F. The matrix W or A is interpreted as capturing the relationships between variables and components. In either case, the interpreted matrix should ideally be sparse, i.e., have a great number of zero elements, since a sparse matrix is easily interpreted by focusing only on the variables and components linked by nonzero elements. However, such a sparse A or W cannot be obtained by standard PCA. For this reason, a number of modified PCA procedures that produce sparse solutions have been proposed in the last decade (Trendafilov 2014). Such procedures are called sparse PCA.

All existing sparse PCA procedures produce a sparse weight matrix W (not A). Also, most of them use penalty functions that penalize W for having nonzero elements. Examples are SCoTLASS (Jolliffe et al. 2003), SPCA (Zou et al. 2006), and sPCA-rSVD (Shen and Huang 2008). Further developments of the penalty-based procedures have been proposed by Journée et al. (2010), d’Aspremont et al. (2008), and Witten et al. (2009). They are all formulated in a similar manner: the sum of \(\lambda \times \) P(W) and the loss function as in (1.1) is minimized, or the loss function is minimized subject to P(W) \(\le 1/\lambda \). Here, P(W) stands for a penalty function, and the penalty weight \(\lambda \) is a tuning parameter specifying the relative importance of P(W). The penalty weight \(\lambda \) controls the number of nonzero elements, called the cardinality: a larger value of \(\lambda \) leads to a W of lower cardinality, i.e., a sparser one. Thus, the exact cardinality of the solution is not known in advance; it is found only after the sparsifying procedure has been carried out with a particular \(\lambda \).

In this paper, we propose a new sparse PCA procedure which differs from the existing ones in the following points:

  [1] The new procedure gives a sparse loading matrix A rather than W.

  [2] It does not use a penalty function, and the cardinality of A is prespecified.

Because of these features, we refer to our proposed procedure as unpenalized sparse loading PCA (USLPCA). Recently, Van Deun et al. (2011) proposed a similar approach to sparsify A in the simultaneous PCA of several matrices. However, in contrast to USLPCA, they use penalty functions to achieve a sparse A. Therefore, our proposed USLPCA with both features [1] and [2] is new, to the best of our knowledge. The implications and benefits of [1] and [2] are discussed in detail in the next three paragraphs.

Whether W or A is to be sparsified depends on what the interest is when interpreting a PCA solution. If the interest is in how the original variables are summarized by the components F  =  XW, then the weight matrix W should be sparsified, because it expresses how the variables are weighted to form the components. On the other hand, when the interest is in how the original variables are explained by the components, A is to be sparsified, since A is the coefficient matrix in the regression of X onto F as given in (1.1). Recall that this interpretation is assumed in the software package SPSS, which outputs A by default, not W (SPSS Inc. 1997). Thus, USLPCA is to be used for interpreting how variables are explained by components.

There is another important implication of [1]. It helps to avoid a difficulty of the existing procedures that sparsify W: doing so destroys the components’ orthogonality, a fundamental feature of standard PCA. That is, in most of the existing procedures, the columns of F  =  XW are correlated, which complicates the definition of the percentage of explained variance (PEV) indicating to what extent the variances of the variables are explained by the components (e.g., Zou et al. 2006; Shen and Huang 2008). On the other hand, when W is not sparsified, F can be constrained so that \(\mathbf{F}'\mathbf{F}\) is a diagonal matrix D. In USLPCA, D is set to n times the \(m \times m\) identity matrix \(\mathbf{I}_{m}\), i.e. the constraint

$$\begin{aligned} \frac{1}{n}\mathbf{F}^{\prime } \mathbf{F}=\mathbf{I}_m . \end{aligned}$$
(1.2)

is used. Here, it should be noticed that (1.2) and \(\mathbf{F}'\mathbf{F} = \mathbf{D}\) are not essentially different, since \(\mathbf{FD}^{-1/2}\mathbf{D}^{1/2}\mathbf{A}^{\prime } =\mathbf{FA}^{\prime }\), which implies that F with \(\mathbf{F}'\mathbf{F} = \mathbf{D}\) can be transformed to meet (1.2) without changing (1.1) or the zero elements in A. As shown in Sect. 4, this column-orthonormality of F in USLPCA allows PEV to be readily defined in the same manner as in standard PCA. In Sect. 4, it will also be shown that (1.2) makes the nonzero loadings in A equal to the correlation coefficients between the components and the original variables when the latter are standardized. Obtaining a sparse A is then very beneficial, as it clearly describes the correlation structure of the dimension reduction.

Prespecifying cardinality in [2] is simply achieved by incorporating the constraint

$$\begin{aligned} \hbox {Card}(\mathbf{A}) = c \end{aligned}$$
(1.3)

in the formulation of PCA, where Card(A) stands for the cardinality of A, i.e., the number of its nonzero elements, and c is a specified integer. That is, (1.1) is minimized subject to (1.2) and (1.3) in USLPCA. Obviously, the cardinality of A can be predetermined to be a desired integer by substituting it into c in (1.3). Here, it should be noticed that c is also a tuning parameter to be chosen by users, like the penalty weight \(\lambda \) in the existing penalized approaches. However, c and \(\lambda \) differ in that c is an integer within a restricted range, while \(\lambda \) can take any positive real value. The former, c, is one of 1, 2, ..., pm; thus we can simply obtain the solutions for all possible c and choose the solution with the best cardinality. In contrast, we cannot consider all possible values of the penalty weight \(\lambda \), so a limited set of its values must be selected in the penalized approaches. This selection is not easy, since the resulting cardinality is unknown in advance, as already stated.

An example of the existing approaches that concretely illustrates their differences from USLPCA is Zou et al.’s (2006) SPCA, which is apparently similar to USLPCA in that the loss function is defined using (1.1). However, in SPCA a linear combination of (1.1) and the penalty functions for sparsifying W is minimized over W and A, where the loading matrix A (not F) is constrained as \(\mathbf{A}^{\prime }\mathbf{A} = \mathbf{I}_{m}\).

Here, we must mention the differences of sparse PCA from the rotation that can be used to facilitate the interpretation of loadings in standard PCA (Jolliffe 2002; Trendafilov and Adachi 2015). Here, rotation refers to obtaining an orthonormal T so that a number of the elements in AT are close to zero, by exploiting the property that \(\mathbf{FA}'\) in (1.1) equals \(\mathbf{FTT}'\mathbf{A}'\) and FT can be substituted for F in (1.2): AT can also be regarded as the loading matrix. A crucial difference of rotation from sparse PCA is that the loadings in the rotated AT cannot be exactly zero. Thus, users must decide, in a subjective manner, which elements can be viewed as approximately zero. Another important difference is that the rotation criteria are functions of AT only and do not involve the data. This implies that it cannot be known how AT influences the fit to the underlying data set.

The remaining parts of the paper are organized as follows: the cardinality-constrained minimization with (1.3) is described in Sect. 2, which builds on a decomposition of sums-of-squares for the PCA loss function (1.1). In Sect. 3, the algorithm for USLPCA is presented. The interpretation of sparse loadings is detailed and the PEV indices are introduced in Sect. 4. A procedure for selecting the cardinality c in (1.3) is described in Sect. 5. A simulation study is reported in Sect. 6, and real data examples are considered in Sect. 7.

2 Unpenalized sparse loading PCA

In USLPCA, the PCA loss function (1.1) is minimized under the orthonormality condition (1.2) for components and the cardinality constraint (1.3) for loadings. USLPCA is thus formulated as

$$\begin{aligned} \hbox {min}_{\mathbf{F},\mathbf{A}} LS\left( {\mathbf{F},\mathbf{A}} \right) =\Vert \mathbf{X}-\mathbf{FA}^{\prime } \Vert ^{2}\hbox { subject to Card}\left( \mathbf{A} \right) =c \hbox { and }\, \frac{1}{n} \mathbf{F}^{\prime } \mathbf{F}=\mathbf{I}_m.\qquad \end{aligned}$$
(2.1)

For simplicity, we have formulated USLPCA without using W explicitly. By substituting F = XW, (2.1) is rewritten as \(\hbox {min}_{\mathbf{W},\mathbf{A}} \Vert \mathbf{X} -\mathbf{XWA}^{\prime }\Vert ^{2}\) subject to \(\hbox {Card}(\mathbf{A})\,=\,c\) and \(n^{-1}\mathbf{W}^{\prime }\mathbf{X}^{\prime }\mathbf{XW} = \mathbf{I}_{m}\).

The key point of USLPCA is to use the fact that (1.2) leads to the decomposition of sum-of-squares for the loss function (1.1):

$$\begin{aligned} \Vert \mathbf{X}-\mathbf{FA}^{\prime } \Vert ^{2}=\Vert \mathbf{X}-\mathbf{FB}^{\prime } \Vert ^{2}+n\Vert \mathbf{B}-\mathbf{A}\Vert ^{2}, \end{aligned}$$
(2.2)

with B being the cross-product matrix of p-variables \(\times \,m\)-components:

$$\begin{aligned} \mathbf{B}=\frac{1}{n}\mathbf{X}^{\prime } \mathbf{F}. \end{aligned}$$
(2.3)

The equality (2.2) is proved as follows: (1.1) is rewritten as \(\Vert \mathbf{X} -\mathbf{FA}'\Vert ^{2 }= \Vert \mathbf{X} -\mathbf{FB}' +\mathbf{FB} -\mathbf{FA}'\Vert ^{2}\), which implies that the decomposition (2.2) holds true if \((\mathbf{X} -\mathbf{FB}^{\prime })^{\prime }(\mathbf{FB}^{\prime } -\mathbf{FA}^{\prime })\) equals the matrix of zeros O. This is fulfilled, using (1.2) and (2.3), as

$$\begin{aligned} (\mathbf{X}-\mathbf{FB}^{\prime })^{\prime } (\mathbf{FB}^{\prime } -\mathbf{FA}^{\prime })= & {} \mathbf{X}^{\prime } \mathbf{FB}^{\prime } -\mathbf{X}^{\prime } \mathbf{FA}^{\prime } -\mathbf{BF}^{\prime } \mathbf{FB}^{\prime } +\mathbf{BF}^{\prime } \mathbf{FA}^{\prime } \nonumber \\= & {} n\mathbf{BB}^{\prime } -n\mathbf{BA}^{\prime } -n\mathbf{BB}^{\prime } +n\mathbf{BA}^{\prime } =\mathbf{O}. \end{aligned}$$
(2.4)

The decomposition (2.2) shows that only the term \(g(\mathbf{A}) = \Vert \mathbf{B} -\mathbf{A}\Vert ^{2}\) involves A, which implies that the minimization of (1.1) over A for a fixed F amounts to \(\hbox {min}_{\mathbf{A}}\, g(\mathbf{A})\); in \(g(\mathbf{A})\), a sparse loading matrix A is simply matched to the matrix B in (2.3). Thus, \(\hbox {min}_{\mathbf{A}} \,g(\mathbf{A})\) with the cardinality constraint (1.3) is easily attained, as shown in Sect. 3.3.
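The decomposition can also be checked numerically. The sketch below (Python/NumPy; our own illustration, not taken from the paper) evaluates both sides of (2.2) for a random X, an arbitrary A, and any F satisfying (1.2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 50, 8, 3
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)

# Any F satisfying (1.2) and any A will do for checking (2.2)
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
F = np.sqrt(n) * Q                            # F'F / n = I_m
A = rng.standard_normal((p, m))

B = X.T @ F / n                               # cross-product matrix (2.3)
lhs = np.linalg.norm(X - F @ A.T) ** 2
rhs = np.linalg.norm(X - F @ B.T) ** 2 + n * np.linalg.norm(B - A) ** 2
print(np.isclose(lhs, rhs))                   # True: the decomposition (2.2) holds
```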

The algorithm described in the next section gives the solution for a specific value c in the cardinality constraint (1.3). Thus, it remains to choose the best c value. For this task, we adopt the following procedure:

$$\begin{aligned} \hbox {Choose the value } c\hbox { with the best }I(c)\hbox { among }c=c_{\min }, {\ldots }, c_{\max }, \end{aligned}$$
(2.5)

where I(c) is an index of the goodness of the USLPCA solution obtained for a specific c, and \(c_{\mathrm{min}}/c_{\mathrm{max}}\) expresses the reasonable minimum/maximum of c. The choice of the index I(c) is described in Sect. 5.

As discussed in Sect. 1, the largest interval of \([c_{\mathrm{min}}, \,c_{\mathrm{max}}\)] is obviously [1, pm]. It can be reasonably reduced as

$$\begin{aligned} c_{\min }= & {} p, \end{aligned}$$
(2.6)
$$\begin{aligned} c_{\max }= & {} pm-\frac{m(m-1)}{2}. \end{aligned}$$
(2.7)

Here, (2.6) avoids the situation where A is forced to have an empty row, which occurs if c falls below p, the number of variables. On the other hand, (2.7) takes into account that \(M=m(m - 1)/2\) elements in A can be set to zero without changing the value of the loss function (1.1): the solutions for \(c = pm - M, {\ldots }, pm\) are equivalent.

3 Algorithm

The USLPCA algorithm is outlined in Sect. 3.1. It comprises two steps which are alternately iterated until convergence. The two steps are described in detail in Sects. 3.2 and 3.3, respectively, while the algorithm as a whole is summarized in Sect. 3.4.

3.1 Outline

The solution for USLPCA (2.1) can be obtained by alternately iterating the update of F and A. Indeed, making use of (1.2) and (2.3), we can expand the loss function (1.1) in USLPCA as

$$\begin{aligned} LS\left( {\mathbf{F},\mathbf{A}} \right) = \hbox {tr}\mathbf{X}^{\prime } \mathbf{X}+\hbox {tr}\mathbf{AF}^{\prime } \mathbf{FA}^{\prime } -2\hbox {tr}\mathbf{X}^{\prime } \mathbf{FA}^{\prime }=n\hbox {tr}\mathbf{S}+n\hbox {tr}\mathbf{AA}^{\prime } -2n\hbox {tr}\mathbf{BA}^{\prime }, \end{aligned}$$
(3.1)

with

$$\begin{aligned} \mathbf{S}=\frac{1}{n}\mathbf{X}^{\prime } \mathbf{X} \end{aligned}$$
(3.2)

being the sample covariance matrix. Here, the loss function is found to be a function of B (and A), though it must be kept in mind that B is a function of the PC score matrix F satisfying (1.2), as seen in (2.3). The problem (2.1) can thus be solved by alternately iterating

  • [B-step] minimizing (1.1) or (3.1) over B subject to (1.2) and (2.3) with A kept fixed,

  • [A-step] minimizing (1.1) over A subject to (1.3) with B being kept fixed,

until convergence is reached.

3.2 B-step

As B is a function of F, we first consider minimizing (1.1) over F subject to (1.2) for a given A. Since (1.1) is expanded as (3.1), the minimization is equivalent to maximizing \(n\hbox {tr}\mathbf{BA}^{\prime } = \hbox {tr}\mathbf{X}^{\prime }\mathbf{FA}^{\prime } = \hbox {tr}(\mathbf{XA})^{\prime }\mathbf{F}\) subject to (1.2). It is attained for

$$\begin{aligned} \mathbf{F}=\sqrt{n}\mathbf{KL}^{\prime } =\mathbf{XAL}{\varvec{\Lambda }}^{-1}\mathbf{L}^{\prime }, \end{aligned}$$
(3.3)

where K and L are given by the SVD of \(n^{-1/2}\) XA defined as

$$\begin{aligned} \frac{1}{\sqrt{n}}\mathbf{XA}=\mathbf{K}{\varvec{\Lambda }}\mathbf{L}^{\prime } \end{aligned}$$
(3.4)

with \(\mathbf{K}^{\prime }\mathbf{K} = \mathbf{L}^{\prime }\mathbf{L} = \mathbf{I}_{m}\) and \({\varvec{\Lambda }}\) a diagonal matrix (e.g., Seber 2008). This SVD can be rewritten as \(\mathbf{K} = n^{-1/2}\mathbf{XAL}{\varvec{\Lambda }}^{-1}\), which leads to the last identity in (3.3). It implies that F is column-centered:

$$\begin{aligned} \mathbf{1}_n^{\prime } \mathbf{F}=\mathbf{0}_m ^{\prime }, \end{aligned}$$
(3.5)

as X is, with \(\mathbf{1}_{n}\) the \(n \times 1\) vector of ones and \(\mathbf{0}_{m}\) the \(m \times 1\) zero vector.

Using (3.3) in the definition (2.3) for B, we find that

$$\begin{aligned} \mathbf{B}=\frac{1}{n}\mathbf{X}^{\prime } \mathbf{XAL}{\varvec{\Lambda }}^{-1}\mathbf{L}^{\prime } =\mathbf{SAL}{\varvec{\Lambda }}^{-1}\mathbf{L}^{\prime } \end{aligned}$$
(3.6)

is to be obtained in this step. Here, it should be noted in (3.6) that K is not included and the matrix product \(\mathbf{L}{\varvec{\Lambda }}^{-1}\mathbf{L}^{\prime }\) can be obtained through the eigenvalue decomposition (EVD)

$$\begin{aligned} \mathbf{A}^{\prime } \mathbf{SA}=\mathbf{L}{\varvec{\Lambda }}^{2}\mathbf{L}^{\prime } \end{aligned}$$
(3.7)

following from (3.2) and (3.4). This EVD and (3.6) show that, in the B-step, [1] F need not be computed and [2] the original data matrix X is not needed, as long as the covariance matrix S in (3.7) is available.
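A minimal sketch of the B-step (Python/NumPy; the function name b_step is ours, and \(\mathbf{A}^{\prime }\mathbf{SA}\) is assumed nonsingular) could look as follows.

```python
import numpy as np

def b_step(S, A):
    """B-step: B = S A L Lambda^{-1} L', using only the covariance matrix S, cf. (3.6).

    L and Lambda come from the EVD (3.7) of A'SA; A'SA is assumed nonsingular."""
    evals, L = np.linalg.eigh(A.T @ S @ A)    # A'SA = L Lambda^2 L'
    lam_inv = 1.0 / np.sqrt(evals)            # diagonal of Lambda^{-1}
    return S @ A @ (L * lam_inv) @ L.T        # (3.6); neither X nor F is needed
```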

3.3 A-step

For fixed B, the minimization of (1.1) over the constrained A is equivalent to the minimization of \(g(\mathbf{A}) = \Vert \mathbf{B} -\mathbf{A}\Vert ^{2}\), because of the decomposition (2.2). Using \(\mathbf{A} = (a_{ij})\) and \(\mathbf{B} = (b_{ij})\), we can rewrite \(g(\mathbf{A})\) as

$$\begin{aligned} g\left( \mathbf{A} \right) =\Vert \mathbf{B}-\mathbf{A}\Vert ^{2}=\sum _{(i,j)\in Z} {b_{ij}^2 } +\sum _{(i,j)\in Z^{\bot }} {(b_{ij} -a_{ij} )^{2}} \ge \sum _{(i,j)\in Z} {b_{ij}^2} . \end{aligned}$$
(3.8)

Here, Z denotes the set of the \(q = pm - c\) indexes \((i,\, j)\) indicating the locations of the loadings \(a_{ij}\) that are to be zero. The complement set \(Z^{\bot }\) is the set containing the c indexes \((i,\, j)\) of the nonzero \(a_{ij}\). The inequality in (3.8) shows that \(g(\mathbf{A})\) attains its lower limit \(\Sigma _{(i, j)\in Z}\, b_{ij}^{2}\) when the nonzero loadings \(a_{ij}\) with \((i, j)\in Z^{\bot }\) are taken equal to the corresponding \(b_{ij}\). Moreover, the limit \(\Sigma _{(i, j)\in Z}\, b_{ij}^{2}\) is minimal when Z contains the indexes of the q smallest \(b_{ij}^{2}\) among all squared elements of B. Thus, \(g(\mathbf{A})\) is minimized for \(\mathbf{A} = (a_{ij})\) being

$$\begin{aligned} a_{ij} =\left\{ {\begin{array}{l@{\quad }l} 0 &{} \textit{iff} \ b_{ij}^2 \le b_{[q]}^2\\ b_{ij} &{} \hbox {otherwise} \\ \end{array}}\right. \end{aligned}$$
(3.9)

with \(b_{[q]}^{2}\) the \(q\hbox {th}\) smallest value among all \(b_{ij}^{2}\).
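The A-step therefore reduces to simple thresholding of B. A hedged sketch follows (Python/NumPy; a_step is our hypothetical name); as in (3.9), ties at the threshold are all set to zero.

```python
import numpy as np

def a_step(B, c):
    """A-step: update A by (3.9), zeroing the q = pm - c elements of B with smallest b_ij^2."""
    q = B.size - c
    A = B.copy()
    if q > 0:
        sq = (B ** 2).ravel()
        b_q = np.partition(sq, q - 1)[q - 1]  # q-th smallest squared element of B
        A[B ** 2 <= b_q] = 0.0                # ties at the threshold are zeroed, as in (3.9)
    return A
```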

The loading matrix A updated by (3.9) satisfies

$$\begin{aligned} a_{ij} ^{2}=a_{ij} b_{ij}, \quad \hbox { or equivalently, } \quad \mathbf{A}{\bullet }\mathbf{A}=\mathbf{A}{\bullet }\mathbf{B}, \end{aligned}$$
(3.10)

where \(\bullet \) denotes the Hadamard element-wise matrix product. This property is used later to derive the expression of the explained variance.

3.4 Whole steps

The value of loss function (1.1) or (3.1) attained after the A-step is given by

$$\begin{aligned} LS\left( \mathbf{A} \right) =n\hbox {tr}\mathbf{S}+n\hbox {tr}\mathbf{AA}^{\prime } -2n\hbox {tr}\mathbf{BA}^{\prime } =n\hbox {tr}\mathbf{S}-n\hbox {tr}\mathbf{AA}^{\prime }, \end{aligned}$$
(3.11)

which is derived using (3.10) in (3.1). Further, the loss function value (3.11) is divided by \(n\hbox {tr}\mathbf{S}\) to yield

$$\begin{aligned} LS_{\mathrm{N}} \left( \mathbf{A} \right) = 1-\frac{\hbox {tr}\mathbf{AA}^{\prime }}{\hbox {tr }\mathbf{S}}, \end{aligned}$$
(3.12)

which is convenient for checking convergence, as it is normalized so as to take values within the range [0, 1].

We should note that (3.12) does not include the original data matrix X, which is also unnecessary in the B- and A-steps. In other words, USLPCA can be used when only the covariance matrix S is available. This is convenient when \(n > p\), as the \(p \times p\) matrix S is smaller than X \((n \times p)\). Further, the updating of F described in Sect. 3.2 can be avoided. Thus, the USLPCA algorithm for obtaining the optimal A can be shortened as follows:

  [1] Initialize A.

  [2] Perform the EVD (3.7) to obtain B with (3.6).

  [3] Obtain A with (3.9).

  [4] Finish if \(\Delta LS_{\mathrm{N}}(\mathbf{A}) \le \varepsilon \); otherwise go back to [2].

Here, \(\Delta LS_{\mathrm{N}}(\mathbf{A})\) denotes the change in (3.12) from the previous iteration, and \(\varepsilon \) is set to \(0.1^{7}\) in this paper. This algorithm is run multiple times, starting from different initial A, in the procedure described in Appendix “Multiple-runs procedure”. Among the resulting solutions, we select the A with the lowest \(LS_{\mathrm{N}}(\mathbf{A})\) value as the optimal one, in order to avoid local minimizers. After this process, the PC score matrix F can be obtained using the optimal A, X, and (3.7) in (3.3), if F is of interest.
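Putting the steps together, the whole algorithm of this section might be sketched as below (Python/NumPy; our illustration, not the authors’ code). The functions b_step and a_step refer to the sketches in Sects. 3.2 and 3.3 above; only a single random start is shown, whereas the multiple-runs procedure of the Appendix would wrap this function.

```python
import numpy as np

def uslpca(S, m, c, eps=1e-7, max_iter=500, seed=None):
    """Steps [1]-[4] of Sect. 3.4 for a p x p covariance matrix S, with Card(A) = c.

    Returns the sparse loading matrix A and the normalised loss LS_N(A) of (3.12).
    Only one random start is used here; the multiple-runs procedure should wrap this."""
    rng = np.random.default_rng(seed)
    p = S.shape[0]
    A = a_step(rng.standard_normal((p, m)), c)       # [1] random sparse initial A
    ls_prev = np.inf
    for _ in range(max_iter):
        B = b_step(S, A)                             # [2] B-step via the EVD (3.7)
        A = a_step(B, c)                             # [3] A-step, thresholding rule (3.9)
        ls = 1.0 - np.sum(A ** 2) / np.trace(S)      # LS_N(A), Eq. (3.12)
        if ls_prev - ls <= eps:                      # [4] stop when the decrease is small
            break
        ls_prev = ls
    return A, ls

# Hypothetical usage: A, ls_n = uslpca(np.corrcoef(X, rowvar=False), m=3, c=20)
```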

4 Interpreting solutions

As described in Sect. 1, the weight matrix W is sparsified in the existing sparse PCA procedures, which implies that W is what is to be interpreted. On the other hand, the loading matrix A is to be interpreted in USLPCA, with A sparsified. In Sect. 4.1, we start by contrasting the interpretations of W and A. In Sect. 4.2, we show that the nonzero loadings in USLPCA are also the covariances between components and variables, and further equal the correlation coefficients when X is standardized. Finally, the indices of the percentage of explained variance are introduced in Sect. 4.3.

4.1 Interpreting weights versus interpreting loadings

In this section, we compare the proposed USLPCA with other approaches in which the weights in W are sparsified.

The role of W is to weight the original variables to form \(\mathbf{F} = \mathbf{XW}\), which is rewritten in the vector form as

$$\begin{aligned} \mathbf{f}_j =\mathbf{Xw}_j =w_{1j} \mathbf{x}_1 + \cdots +w_{pj} \mathbf{x}_p =\sum _{i\in {M}_j } {w_{ij} \mathbf{x}_i }\quad \left( {j= 1, \ldots ,m} \right) . \end{aligned}$$
(4.1)

Here, \(\mathbf{f}_{j}\) and \(\mathbf{w}_{j}\) are the \(j\hbox {th}\) columns of F and \(\mathbf{W} = (w_{ij})\), respectively, \(\mathbf{x}_{i}\) denotes the \(i\hbox {th}\) column (variable) of X, and \({ M}_{j}\) denotes the set of indexes \(\{i\}\) corresponding to the nonzero weights in \(\mathbf{w}_{j}\). That is, component j is interpreted as summarizing the variables \(\mathbf{x}_{i}\) weighted by the nonzero \(w_{ij}\). This explains why the existing sparse PCA procedures produce a sparse W.

On the other hand, USLPCA solutions can be interpreted with the equation

$$\begin{aligned} \mathbf{X}=\mathbf{FA}^{\prime } +\mathbf{E}, \end{aligned}$$
(4.2)

which may be called a PCA model, since the squared Frobenius norm of the error matrix E in (4.2) leads to the PCA loss function (1.1). Using \({\tilde{\mathbf{a}}}_i^{\prime } (1 \times m)\) for the \(i\hbox {th}\) row of \(\mathbf{A}= (a_{ij})\) and \(\mathbf{e}_{i}\) for the \(i\hbox {th}\) column of E, (4.2) is rewritten in the vector form

$$\begin{aligned} \mathbf{x}_i =\mathbf{F}{\tilde{\mathbf{a}}}_i +\mathbf{e}_i =a_{i1} \mathbf{f}_1 + \cdots +a_{im} \mathbf{f}_m +\mathbf{e}_i =\sum _{j\in {N}_i } {a_{ij} \mathbf{f}_j } +\mathbf{e}_i \end{aligned}$$
(4.3)

with \({ N}_{i}\) the set of indexes \(\{j\}\) corresponding to the nonzero loadings in \({\tilde{\mathbf{a}}}_i\). The nonzero loadings can be regarded as the coefficients in the multiple regression of \(\mathbf{x}_{i}\) onto the \(\mathbf{f}_{j}\)’s. That is, variable i is interpreted as a dependent variable explained by the components with nonzero \(a_{ij}\). To what extent the variable is explained by the components, or equivalently, how small the errors in \(\mathbf{e}_{i}\) are, is indicated by the percentages of explained variance introduced in Sect. 4.3.

Besides the above interpretation for each variable, USLPCA solutions can also be interpreted component-wise. To describe this, we use \(\mathbf{a}_{j}\) for the \(j\hbox {th}\) column of A to rewrite (4.2) as \(\mathbf{X} = \mathbf{f}_{1}\mathbf{a}_{1}^{\prime } + {\cdots } +\mathbf{f}_{m}\mathbf{a}_{m}^{\prime } + \mathbf{E} = \mathbf{f}_{j}\mathbf{a}_{j}^{\prime } + (\Sigma _{k \ne j}\mathbf{f}_{k}\mathbf{a}_{k}^{\prime } + \mathbf{E})\), i.e.,

$$\begin{aligned} \left[ {\mathbf{x}_1,\ldots ,\mathbf{x}_p} \right] =\mathbf{f}_j \left[ {a_{1j}, \ldots , a_{pj} } \right] +\mathbf{H}_{[j]}. \end{aligned}$$
(4.4)

Here, \(\mathbf{a}_{j}^{\prime }= [a_{1j}, {\ldots }, a_{pj}]\), and \(\mathbf{H}_{[j]}=\Sigma _{k \ne j}\mathbf{f}_{k}\mathbf{a}_{k}^{\prime } + \mathbf{E}\), whose columns are uncorrelated with those of \(\mathbf{f}_{j}\mathbf{a}_{j}^{\prime } = \mathbf{f}_{j}[a_{1j}, {\ldots }, a_{pj}]\), as proved in Appendix “No correlation for components and errors”. Equation (4.4) thus allows component j to be interpreted as a common factor exclusively explaining the variables associated with nonzero \(a_{ij}\). In Sect. 4.3, a percentage is introduced that indicates how well the component explains the variables.

4.2 Nonzero loadings as covariances

Since F is column-centered with (3.5), as X is, the cross-product matrix \(\mathbf{B} = n^{-1}\mathbf{X}^{\prime }\mathbf{F}\) in (2.3) contains the covariances between the p variables and the m components. Taking this into account in (3.9), we find that the nonzero loadings in the resulting A equal the corresponding covariances in B: a nonzero \(a_{ij}\) equals the covariance between the \(i\hbox {th}\) original variable and the \(j\hbox {th}\) component. Further, this implies that the nonzero loadings equal the correlation coefficients between variables and components, when the data set to be analyzed is a standardized data matrix X with unit variances or a correlation matrix S, since the PC scores, constrained as in (1.2), have unit variances. This allows us to easily gauge the magnitudes of loadings, as their range is restricted to [\(-1\), 1].

The above equivalence of nonzero loadings to covariances motivates us to relate the optimization in USLPCA to covariances as follows: using (1.2) and (2.3), the decomposed loss function (2.2) can be rewritten as \(n\hbox {tr}\mathbf{S} - n\Vert \mathbf{B}\Vert ^{2 }+n\Vert \mathbf{B} -\mathbf{A}\Vert ^{2}\), whose minimization amounts to maximizing \(\Vert \mathbf{B}\Vert ^{2} - \Vert \mathbf{B} -\mathbf{A}\Vert ^{2}\). USLPCA can thus be viewed as maximizing the sum of squared covariances \(\Vert \mathbf{B}\Vert ^{2}\) subject to B approximating sparse A.

4.3 Percentages of explained variances

From the standardized loss function value (3.12), we can derive a goodness-of-fit index

$$\begin{aligned} PEV= 100\times \frac{\hbox {tr }\mathbf{AA}^{\prime }}{\hbox {tr }\mathbf{S}}=100\times \frac{\left\| {\mathbf{FA}^{\prime }} \right\| ^{2}}{\left\| \mathbf{X} \right\| ^{2}}= 100\times \left( {1-\frac{\left\| {\mathbf{X}-\mathbf{FA}^{\prime }} \right\| ^{2}}{\left\| \mathbf{X} \right\| ^{2}}} \right) . \end{aligned}$$
(4.5)

Here, the last identity is added to stress that this index directly follows from the loss function (1.1). Note that (3.11) divided by \(n\hbox {tr}\mathbf{S} = \Vert \mathbf{X}\Vert ^{2}\) gives (3.12), and one minus (3.12) leads to (4.5). This index can be called the total percentage of explained variance (PEV), as the denominator trS in (4.5) is the total variance of the variables, while the numerator \(\hbox {tr}\mathbf{AA}^{\prime }\), which equals \(n^{-1}\Vert \mathbf{FA}^{\prime }\Vert ^{2}\) by (1.2), is the total variance of \(\mathbf{FA}^{\prime }\), since (3.5) implies that \(\mathbf{FA}^{\prime }\) is column-centered.

The total PEV (4.5) can be decomposed as a sum of

$$\begin{aligned} PEV\left( j \right) = 100\times \frac{\left\| {\mathbf{a}_j } \right\| ^{2}}{\hbox {tr }\mathbf{S}} \end{aligned}$$
(4.6)

over \(j = 1, {\ldots }, m\). Since (1.2) leads to \(\Vert \mathbf{a}_{j}\Vert ^{2}=n^{-1}\Vert \mathbf{f}_{j}\mathbf{a}_{j}^{\prime }\Vert ^{2}\), the percentage (4.6) is regarded as the amount of total variance of variables explained by component j and used for the interpretation of the component with (4.4).

The PEV for each variable is derived from the loss value (3.11), which can be rewritten as \(n\sum _{i=1}^p {(s_{ii} -\left\| {\tilde{\mathbf{a}}_i} \right\| ^{2})} =n \sum _{i=1}^p {s_{ii} (1-\left\| {\tilde{\mathbf{a}}_i} \right\| ^{2}/s_{ii} )} \, \ge 0\), with \(s_{ii}\) the \(i\hbox {th}\) diagonal element of S, i.e., the variance of variable i. This gives the statistic

$$\begin{aligned} \textit{PEV}\left[ i \right] = 100\times \frac{\left\| {\tilde{\mathbf{a}}_i} \right\| ^{2}}{s_{ii}}. \end{aligned}$$
(4.7)

Since (1.2) implies \(\left\| {\tilde{\mathbf{a}}_i } \right\| ^{2}=n^{-1}\left\| {\mathbf{F}\tilde{\mathbf{a}}}_i \right\| ^{2}\), the percentage (4.7) expresses the amount of the variance of variable i explained by the components associated with the nonzero loadings in \({\tilde{\mathbf{a}}}_i\). Therefore, (4.7) is used for the interpretation with the model (4.3) which expresses the regression of a variable onto components.

The PEV indices for standard PCA are defined in the same forms as (4.5), (4.6), and (4.7); standard PCA can be formulated as minimizing (1.1) subject to (1.2) and \(\mathbf{A}^{\prime }\mathbf{A}\) being a diagonal matrix. The solution for A is expressed as \(\mathbf{A} = \mathbf{R}{\varvec{\Delta }}\), with \({\varvec{\Delta }}\) the \(m \times m\) diagonal matrix containing the m largest singular values of \(n^{-1/2}\mathbf{X}\) and the corresponding right singular vectors forming the columns of R. The substitution of \(\mathbf{A} = \mathbf{R}{\varvec{\Delta }}\) into (4.5), (4.6), and (4.7) gives the PEV indices for PCA, with its total PEV expressed as \(PEV_{\mathrm{PCA}} = 100\times \hbox {tr}{\varvec{\Delta }}^{2}/\hbox {tr}\mathbf{S}\). The common form of the PEV definition in USLPCA and PCA implies that increasing the cardinality value c in (1.3) allows the total PEV of USLPCA (4.5) to approach and finally equal \(PEV_{\mathrm{PCA}}\). It suggests that we can find an acceptable cardinality for USLPCA as one whose value of (4.5) is not substantially less than \(PEV_{\mathrm{PCA}}\). However, whether a difference in PEV values is substantial must be decided subjectively. A procedure without such a decision is discussed in the next section.
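Under the assumption that A and S are available (e.g., from the algorithm sketch of Sect. 3), the three PEV indices can be computed directly, as in the following sketch (Python/NumPy; pev_indices is our hypothetical name).

```python
import numpy as np

def pev_indices(A, S):
    """PEV indices from a (sparse) loading matrix A and covariance matrix S.

    Returns the total PEV (4.5), the component-wise PEV(j) of (4.6),
    and the variable-wise PEV[i] of (4.7)."""
    total = 100.0 * np.sum(A ** 2) / np.trace(S)                   # tr AA' / tr S
    per_component = 100.0 * np.sum(A ** 2, axis=0) / np.trace(S)   # ||a_j||^2 / tr S
    per_variable = 100.0 * np.sum(A ** 2, axis=1) / np.diag(S)     # ||a~_i||^2 / s_ii
    return total, per_component, per_variable
```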

5 Cardinality selection by information criteria

The cardinality selection problem (2.5) can be viewed as a model selection problem seeking the optimal trade-off between the goodness-of-fit to a data set and the number of parameters to be estimated. Here, the latter is directly related to the cardinality, which is the number of loadings whose values are to be estimated. For such a model selection, indices usually called information criteria can be used, which include the popular AIC and BIC (Akaike 1974; Schwarz 1978). Their useful feature is that the model with the least index value is selected as the optimal one: model selection can be carried out numerically, without subjective decisions.

The information criteria are based on the maximum likelihood (ML) method, while USLPCA is formulated as a least squares (LS) method. However, USLPCA can be reformulated as an ML procedure, by assuming that X is generated with the PCA model (4.2) with each error in \(\mathbf{E} = (e_{ti})\) distributed independently and identically according to the normal distribution with its mean 0 and variance \(\sigma ^{2}\):

$$\begin{aligned} e_{ti} \sim N(0,\sigma ^{2}). \end{aligned}$$
(5.1)

The model (4.2) with the normality assumption (5.1) leads to the log likelihood whose part relevant to F and A is expressed as

$$\begin{aligned} l\left( {\mathbf{F},\mathbf{A}} \right) = -\frac{np}{2}\log \Vert \mathbf{X}-\mathbf{FA}^{\prime }\Vert ^{2}. \end{aligned}$$
(5.2)

This is derived using the fact that the ML estimate of the variance must satisfy \(\sigma ^{2} = (np)^{-1}\Vert \mathbf{X} -\mathbf{FA}^{\prime }\Vert ^{2}\), as described in Appendix “Likelihood for PCA model”. The ML version of USLPCA is formulated as maximizing (5.2) subject to (1.2) and (1.3), which is equivalent to the LS-based one in Sects. 2 and 3.

By substituting the USLPCA solution into (5.2), it can be rewritten as

$$\begin{aligned} l\left( \mathbf{A} \right) = -\frac{np}{2}\log \{LS_{\mathrm{N}} \left( \mathbf{A} \right) \times n\hbox {tr}\mathbf{S}\}= -\frac{np}{2}\log \left( {1-\frac{PEV}{100}} \right) -Const \end{aligned}$$
(5.3)

with \(Const = (np/2)\log (n\hbox {tr}\mathbf{S})\) irrelevant to A. Here, we have used the fact that \(\Vert \mathbf{X} -\mathbf{FA}^{\prime }\Vert ^{2}\) attains the value expressed in (3.11), which is the product of \(n\hbox {tr}\mathbf{S}\) and (3.12), with the latter leading to the PEV in (4.5). Using (5.3), AIC and BIC are given by

$$\begin{aligned} \hbox {AIC}\left( c \right)= & {} -2l\left( \mathbf{A} \right) + 2\times \eta \left( c \right) \end{aligned}$$
(5.4)
$$\begin{aligned} \hbox {BIC}\left( c \right)= & {} -2l\left( \mathbf{A} \right) + \log \left( {np} \right) \times \eta \left( c \right) \end{aligned}$$
(5.5)

as functions of the cardinality c, where the number of parameters \(\eta (c)\) is simply the cardinality of the loading matrix A, i.e., \(\eta (c)=c\), since (5.3) is a function of the loading matrix A alone, with the values of its c nonzero elements to be estimated.

We use either (5.4) or (5.5) as the index I(c) in the cardinality selection problem (2.5). Thus, it is rewritten as

$$\begin{aligned} \hbox {Choose } {\hat{c}}=\mathop {\hbox {arg min}}\limits _{c_{\min } \le c\le c_{\max }} BIC\left( c \right) \hbox { as the optimal cardinality} \end{aligned}$$
(5.6)

if (5.5) is used; otherwise BIC(c) is replaced by AIC(c). Here, \(c_{\min }\) and \(c_{\max }\) are given by (2.6) and (2.7). That is, the best cardinality can be selected through the runs of the USLPCA algorithm with c set to \(c_{\min }, {\ldots }, c_{\max }\). As a result, we have \(\Delta _{c}\) solutions with

$$\begin{aligned} \Delta _c =c_{\max } -c_{\min } +1. \end{aligned}$$
(5.7)

Those solutions give the \(\Delta _{c}\) BIC/AIC values, among which the least one gives the best cardinality with (5.6) or its AIC version.
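A rough sketch of this selection loop is given below (Python/NumPy; our illustration). It reuses the uslpca sketch of Sect. 3.4; in practice each call should itself use the multiple-runs procedure to reduce the risk of local minima.

```python
import numpy as np

def select_cardinality(S, n, m, criterion="BIC"):
    """Run USLPCA for every c in [c_min, c_max] and pick the c minimising BIC (5.5) or AIC (5.4).

    Reuses the uslpca() sketch above; n is the number of observations behind S."""
    p = S.shape[0]
    c_min = p                                    # (2.6)
    c_max = p * m - m * (m - 1) // 2             # (2.7)
    best = (np.inf, None, None)
    for c in range(c_min, c_max + 1):
        A, ls_n = uslpca(S, m, c)
        loglik = -0.5 * n * p * np.log(ls_n * n * np.trace(S))   # l(A) of (5.3)
        penalty = 2.0 * c if criterion == "AIC" else np.log(n * p) * c
        ic = -2.0 * loglik + penalty
        if ic < best[0]:
            best = (ic, c, A)
    return best                                  # (criterion value, chosen c, its loadings)
```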

6 Simulation study

We performed a simulation study to assess [1] how often local minima arise and to what degree local minimizers differ from the optimal solution in USLPCA, [2] how well the true loadings are recovered by USLPCA when the cardinality c in (1.3) is set to the true value, and [3] how accurately the true cardinality is identified by AIC and BIC. The procedure for synthesizing data is described in Sect. 6.1, which is followed by the assessment of [1] in Sect. 6.2, that of [2] in Sect. 6.3, and the results for [3] in Sect. 6.4.

6.1 Data synthesis procedure

We generate data matrices X having more observations than variables with \(n = 150 > p = 15\) and horizontal ones having more variables with \(n = 15 < p = 150\), on the basis of the PCA model (4.2) subject to (1.2), (1.3), and (5.1). Here, the approximate PEV for \(\mathbf{X}= \mathbf{FA}^{\prime }+\mathbf{E}\),

$$\begin{aligned} \textit{APEV}=100\times \frac{\left\| {\mathbf{FA}^{\prime }} \right\| ^{2}}{\left\| {\mathbf{F{A}'}} \right\| ^{2}+\left\| \mathbf{E} \right\| ^{2}}\cong 100\times \frac{\left\| {\mathbf{FA}^{\prime }} \right\| ^{2}}{\left\| {\mathbf{FA}^{\prime }+\mathbf{E}} \right\| ^{2}}=100\times \frac{\left\| {\mathbf{FA}^{\prime }} \right\| ^{2}}{\left\| \mathbf{X} \right\| ^{2}}, \end{aligned}$$
(6.1)

is controlled to be 80, 60, or 40 %. As described below, F and E are generated mutually independently, so that \(\mathbf{F}^{\prime }\mathbf{E}\) is close to the zero matrix. Thus, \(\Vert \mathbf{FA}^{\prime }+\mathbf{E}\Vert ^{2 }\cong \Vert \mathbf{FA}^{\prime }\Vert ^{2}+ \Vert \mathbf{E}\Vert ^{2}\), and (4.5) is approximated by (6.1). The values 80, 60, and 40 correspond to small, medium, and large error sizes, respectively. A set of F, A, and E is synthesized by setting \(m = 3\) with the following steps; a rough sketch of this generator is given after the list.

  [1] An integer within the interval [p, pm/2] is randomly chosen as the cardinality of the true A.

  [2] The locations of the nonzero elements in A are randomly chosen, subject to the condition that each row of A includes at least one nonzero loading and each column includes at least two nonzero ones.

  [3] Each of the nonzero elements in A is drawn randomly from U(0.5, 1) or \(U(-1, -0.5)\), with \(U(\alpha , \beta )\) the uniform distribution over the range \([\alpha , \beta ]\).

  [4] F is filled with standard normal variables and then orthonormalized so as to satisfy (1.2).

  [5] Each element of E is drawn with (5.1) independently of F, where \(\sigma ^{2}\) is set so that APEV is 80, 60, or 40 %.
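The following sketch (Python/NumPy; the function name synthesize is ours) roughly follows steps [1] through [5]. Note that the additional requirement of step [2] that each column contain at least two nonzero loadings is not enforced in this simplified version.

```python
import numpy as np

def synthesize(n, p, m=3, apev=80.0, seed=None):
    """Generate X = F A' + E roughly following steps [1]-[5] of Sect. 6.1.

    apev is the target approximate PEV (6.1) in percent; the column condition of
    step [2] (at least two nonzeros per column) is not enforced in this sketch."""
    rng = np.random.default_rng(seed)
    card = rng.integers(p, p * m // 2 + 1)                 # [1] true cardinality in [p, pm/2]
    A = np.zeros((p, m))
    A[np.arange(p), rng.integers(0, m, size=p)] = 1.0      # [2] one nonzero per row ...
    extra = rng.choice(np.flatnonzero(A == 0), size=card - p, replace=False)
    A.ravel()[extra] = 1.0                                 # ... plus the remaining locations
    nz = A != 0                                            # [3] values from U(.5, 1) or U(-1, -.5)
    A[nz] = rng.choice([-1.0, 1.0], size=nz.sum()) * rng.uniform(0.5, 1.0, size=nz.sum())
    Q, _ = np.linalg.qr(rng.standard_normal((n, m)))       # [4] F satisfying (1.2)
    F = np.sqrt(n) * Q
    signal = np.linalg.norm(F @ A.T) ** 2                  # [5] sigma^2 tuned to the target APEV
    sigma2 = signal * (100.0 - apev) / (apev * n * p)
    E = rng.normal(0.0, np.sqrt(sigma2), size=(n, p))
    return F @ A.T + E, A, F
```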

For each of the 2 (\(n > p\) and \(n < p\)) \(\times \, 3\) (APEV values) combinations, we generated 100 data matrices. For the resulting data sets, we set m at 3 to carry out USLPCA with the cardinality selection procedure from Sect. 5.

6.2 Local minima

For Card(A) set to a specific c, the USLPCA algorithm is run multiple \((K_{c})\) times in the procedure described in Appendix “Multiple-runs procedure”. As defined there, the solution A resulting from the \(k\hbox {th}\) run \((k = 1, {\ldots }, K_{c})\) with this c is denoted by \(\mathbf{A}_{ck}\), and the optimal solution is \(\mathbf{A}_{c}= \hbox {argmin}_{1\le k\le K_{c}} LS_{\mathrm{N}}(\mathbf{A}_{ck})\). We define \(\mathbf{A}_{ck}\) as a local minimizer (LM) if the averaged absolute difference of \(\mathbf{A}_{ck} = (a_{ij}^{(ck)})\) from \(\mathbf{A}_{c} = (a_{ij}^{(c)})\), defined as \(\hbox {AAD}(\mathbf{A}_{ck},\mathbf{A}_{c}) = (pm)^{-1} \Sigma _{i} \Sigma _{j}{\vert }a_{ij}^{(ck)} -a_{ij}^{(c)}{\vert }\), is greater than 0.001. We count the frequency \(L_{c}\) with which LMs are observed during the \(K_{c}\) runs of the algorithm, and obtain the average of the LM proportions \(L_{c}/K_{c}\) over c, i.e., \(P_{\mathrm{LM}}=\Delta _{c}^{-1}\sum _{c=c_{\min }}^{c_{\max }} {L_c /K_c}\), where \(\Delta _{c}\) is given in (5.7). Further, we define the difference of the LMs from the optimal solution as \(D_{\mathrm{LM}}(c)=L_{c}^{-1}\Sigma _{\varGamma }\hbox { AAD}(\mathbf{A}_{ck},\mathbf{A}_{c})\), where \(\varGamma \) denotes the set of \(\mathbf{A}_{ck}\) being LMs, and obtain its average over c, i.e., \(D_{\mathrm{LM}}=\Delta _{c}^{-1}\sum _{c=c_{\min }}^{c_{\max }} {D_{\mathrm{LM}} (c)}\).
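For concreteness, the two statistics for one cardinality c might be computed as in the following sketch (Python/NumPy; the helper names aad and local_minimizer_stats are ours), given the \(K_{c}\) solutions and their \(LS_{\mathrm{N}}\) values.

```python
import numpy as np

def aad(A1, A2):
    """Averaged absolute difference between two loading matrices."""
    return np.mean(np.abs(A1 - A2))

def local_minimizer_stats(solutions):
    """solutions: list of (A_ck, LS_N(A_ck)) pairs for one cardinality c.

    Returns the proportion of local minimizers L_c / K_c and their mean
    distance D_LM(c) from the best solution A_c."""
    losses = [ls for _, ls in solutions]
    A_best = solutions[int(np.argmin(losses))][0]
    dists = [aad(A, A_best) for A, _ in solutions]
    is_lm = [d > 0.001 for d in dists]
    prop = float(np.mean(is_lm))
    d_lm = float(np.mean([d for d, lm in zip(dists, is_lm) if lm])) if any(is_lm) else 0.0
    return prop, d_lm
```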

Table 1 shows the percentiles and averages of \(P_{\mathrm{LM}}\) and \(D_{\mathrm{LM}}\) over the 100 data sets. There, \(P_{\mathrm{LM}}\) is found to be very high, in particular when APEV is lower, i.e., the errors are larger. In particular, the 10th percentile of \(P_{\mathrm{LM}}\) in the bottom row is 0.94, which shows that 94 % of the solutions were LMs for 90 % (\(=100 - 10\hbox {th}\) percentile) of the data sets with \(n < p\) and APEV  =  40 %. The \(D_{\mathrm{LM}}\) values are also found to be large: their averages exceed 0.1 in all conditions, which implies that an LM loading corresponding to an optimal loading of 0.5 would, on average, be less than 0.4 or larger than 0.6. We can thus conclude that USLPCA is sensitive to local minima and that LMs are not similar to the optimal solutions. Despite these rather disappointing results, the optimal USLPCA solutions are close to the true values, as shown next.

Table 1 Statistics of the proportions of local minimizers (LMs) and their differences from the optimal solution

6.3 Recovery of true loadings

Let us use \({\hat{\mathbf{A}}}\) for the solution \(\mathbf{A}_{c}\) obtained for c being the true cardinality. We define two recovery indices for each pair of the true \(\mathbf{A} = (a_{ij})\) and its estimate \({\hat{\mathbf{A}}}= (\hat{{a}}_{ij})\). One is the misidentification rate of zero loadings \(\hbox {MR}_{0} = 1 - N_{00}/N_{0}\), where \(N_{0}\) is the number of the true zero loadings and \(N_{00}\) is the number of \((i,\, j)\)’s with \(a_{ij}={\hat{a}}_{ij} = 0\). The other index is the averaged absolute differences of \({\hat{\mathbf{A}}}\) to its true counterpart, i.e., \(\hbox {AAD}({\hat{\mathbf{A}}},\mathbf{A}) = (pm)^{-1}\Sigma _{i}\Sigma _{j}{\vert }{\hat{a}}_{ij} -a_{ij}{\vert }\). Table 2 presents the percentiles and averages of the indices over 100 solutions.

Table 2 Statistics of the indices for the recovery of sparse loadings

First, let us note the results for the data sets with \(n > p\) in Table 2. The panel for \(\hbox {MR}_{0}\) in Table 2 shows that no misidentification occurred for any data set with \(n > p\), except rare cases in the condition APEV  =  40 %, whose average is 0.005. The statistics of \(\hbox {AAD}({\hat{\mathbf{A}}},\mathbf{A})\) are also found to be satisfactorily small. Indeed, even the worst value 0.069 (the 90th percentile for APEV  =  40 %) does not indicate large differences between A and \({\hat{\mathbf{A}}}\): examples of \([a_{ij},\, \hat{{a}}_{ij}]\) giving the value 0.069, such as [0.500, 0.569] and [\(-0.800, -0.869\)], do not suggest bad recovery. We can thus conclude that the sparse loadings were recovered fairly well for \(n > p\) when the cardinality was set to the true value.

Next, we consider the case \(p > n\). Compared with \(n > p\), the recovery for \(p> n\) is worse and clearly depends on APEV. When it is 80 %, the statistics of \(\hbox {MR}_{0}\) and \(\hbox {AAD}({\hat{\mathbf{A}}},\mathbf{A})\) are small enough and show fairly good recovery, but they are not small for APEV  =  40 %. Also, for 10 % of the data sets with APEV  =  60 %, the recovery is found to be unsatisfactory, as the 90th percentile of \(\hbox {AAD}({\hat{\mathbf{A}}}, \mathbf{A})\) exceeds 0.1. The results imply that good recovery of the loadings is guaranteed for data sets with a PEV of 80 %, but not for those with a PEV less than 60 %.

6.4 Identification of true cardinality

Let us use c for the true cardinality and \({\hat{c}}\) for the cardinality selected by AIC or BIC. For each data set, we obtain the relative bias \(({\hat{c}}-c)/\Delta _{c}\), and its percentiles over the 100 data sets are shown in Table 3. For the data sets with \(n > p\), we find that AIC overestimates the cardinality, but it is identified fairly well by BIC, which shows that BIC should be used for the cases of \(n > p\). On the other hand, the behaviors of AIC and BIC depend on APEV for the data sets with \(p > n\). That is, although BIC identified the cardinality well for the cases with APEV  =  80 %, the overestimation by AIC is reduced and the underestimation by BIC is reinforced as APEV decreases. Eventually, AIC works better for data sets with \(p > n\) and lower APEV, and its performance is almost equivalent to that of BIC in terms of the absolute averages for APEV  =  40 %. As BIC is still superior for APEV  =  60 %, the results suggest that BIC should be used for \(APEV \ge 60\,\%\), while it is inconclusive whether AIC or BIC is to be used for \(APEV \le 40\,\%\).

Table 3 Statistics of the relative biases of the cardinality selected by AIC and BIC

7 Examples

In this section, we illustrate USLPCA with two data matrices, one with \(n > p\) and the other with \(n < p\). Each data set is standardized or given as a correlation matrix, so that the nonzero loadings to be obtained stand for the correlations between variables and components. The zero loadings are left blank in the following tables.

7.1 Pitprop data

The first example is Jeffers’s (1967) Pitprop data matrix of \(n = 180\) by \(p = 13\), which has been used as a benchmark for testing sparse PCA procedures. We carried out USLPCA with \(m = 6\), following the previous studies. Local minimizers were obtained often, with the average proportion \(P_{\mathrm{LM}} = 0.852\), and they were not similar to the optimal solution, with the average of \(\hbox {AAD}(\mathbf{A}_{ck},\mathbf{A}_{c})\) being \(D_{\mathrm{LM}}\) = 0.151. The resulting PEV and BIC values are plotted against Card(A) in Fig. 1. For choosing Card(A), we use BIC, which showed good performance for the cases of \(n > p\) in the simulation study.

Fig. 1 Total PEV and BIC against the cardinality of A for Pitprop data

BIC showed the least value for Card(A)  =  40 among all possible cardinalities. The corresponding loadings are shown in Table 4. The resulting total PEV of 86.8 is almost equivalent to the PEV of 87.0 for standard PCA: the USLPCA solution approximates the data very well, nearly as PCA does, and is more interpretable with \(13 \times 6 - 40 = 38\) vanishing loadings. Bold font is used in Table 4 for the PEVs of variables that exceed the corresponding ones for PCA: five variables are better explained by the sparse USLPCA components, while the PEVs of the other variables do not differ very much between USLPCA and PCA.

Table 4 USLPCA loadings with the least BIC for Pitprop data together with PCA’s PEV in the final row and column

Although the solution in Table 4 is optimal according to the BIC-based cardinality selection, the 40 nonzero loadings still seem too many for quickly capturing the underlying variable-component relationships. This motivates us to choose a sparser solution using PEV rather than BIC. One procedure for this choice is to try a so-called scree test on the PEV plot in Fig. 1, i.e., to find the Card(A) value at which the increment in PEV begins to be less pronounced. However, such values are found at several points, among which we cannot choose the best one. In place of this approach, a benchmark PEV value can be used. As the benchmark we choose a PEV of 80, the multiple of 10 % closest to the PEV of 87.0 for standard PCA, and select the solution with Card(A)  =  17, whose PEV just exceeds 80. The solution is depicted in Table 5, where the variables are found to be clearly clustered, with every variable loading only one or two components.

In contrast to the USLPCA solutions with sparse loadings A, the weights in W are sparsified in the existing procedures. As an example of the latter, Table 6 shows the sparse W obtained by Zou et al.’s (2006, Table 3) SPCA, which is related to USLPCA as mentioned in Sect. 1. All six components in Table 6 can be considered to correspond to those in Table 5, except for differences in a few nonzero elements. However, the interpretations of W and A are quite different, as discussed in Sect. 4.1. This difference is illustrated in the next paragraph.

Table 5 USLPCA loadings with Card(A) = 17 for Pitprop data together with PCA’s EV in the final row and column

For example, the sparse W in Table 6 shows that the PC scores for Component 3 (C3) are defined by four variables as

$$\begin{aligned} \hbox {C}3=.64\times \hbox {ovensg} +.59\times \hbox {ringtop}+.49\times \hbox {ringbut}-.02\times \hbox {diaknot}. \end{aligned}$$

On the other hand, the USLPCA loadings A in Table 5 show that those same variables without diaknot are explained by Component 3 as

$$\begin{aligned} \left[ \hbox {ovensg, ringtop, ringbut} \right] =\left[ {.81, .79, .62} \right] \times \hbox {C}3 +\mathbf{e}^{\prime }, \end{aligned}$$
(7.1)

which is a row vector version of (4.4), with \(\mathbf{e}'\) expressing the corresponding part of \(\Sigma _{k\ne j}\mathbf{f}_{k}\mathbf{a}_{k}^{\prime } + \mathbf{E}\) in (4.4). Equation (7.1) shows that Component 3 is interpreted as a common factor explaining those variables. The explanation power of Component 3 can be assessed by the corresponding PEV of 12.8. Besides this column-wise interpretation of A, we can also interpret A in a row-wise manner. For example, we find in Table 4 that the variable “ringbut” is explained by Components 1 and 3 as

$$\begin{aligned} \hbox {ringbut }= 0.67\times \hbox {C}1 + 0.62\times \hbox {C}3 +\hbox { error}, \end{aligned}$$
(7.2)

where the corresponding PEV shows that 83.4 % of the variance of “ringbut” is explained by the two components.

Table 6 Zou et al.’s (2006) solution for Pitprop data, where the signs of weights in C1 and C6 have been changed from the original table

The USLPCA loadings in Tables 4 and 5 are also correlation coefficients, as described in Sect. 4.2, since the data set is given as a correlation matrix. For example, it is found in Table 5 that the coefficient between “clear” and C4 is \(-\)0.98, which is close to the lower limit of \(-\)1 and shows a very strong negative correlation. On the other hand, the coefficient of 0.37 between “ringtop” and C1 indicates that their relationship is not strong, as 0.37 is rather close to zero (implying no correlation) within the range from zero to the upper limit of one.

An anonymous reviewer was interested in the transition of nonzero loadings into zero ones and vice versa with respect to the increase of Card(A), i.e., the cases of

$$\begin{aligned} a_{ij}^{(c)} \ne 0 \quad \hbox { and } \quad a_{ij}^{(c+1)} =0 \end{aligned}$$
(7.3)

and

$$\begin{aligned} a_{ij}^{(c)} = 0 \quad \hbox { and }\quad a_{ij}^{(c+1)} \ne 0, \end{aligned}$$
(7.4)

where \(a_{ij}^{(c)}\) denotes the (i, j) element of the resulting A for \(\hbox {Card}(\mathbf{A}) = c\). We thus obtained the switch-rate \(SR_{c}=q_{c}/c\), with \(q_{c}\) the number of the indices (i, j) with (7.3). This implies that \(q_{c} + 1\) is the number of (i, j) with (7.4). The rate \(SR_{c}\) is plotted against \(\hbox {Card}(\mathbf{A}) = c\) in the left panel of Fig. 2. We find that \(SR_{c}\) is low (the same loadings steadily remain zero/nonzero) when Card(A) is smaller, in contrast to the cases of greater Card(A), where the switchover is considerable.

The contrast may be related to the results shown in the right panel of Fig. 2, where four curves express different averages of absolute nonzero loadings. In the figure, except for Card(A) being \(c_{\min }\) and 36, the simple average of absolute nonzero loadings \(\hbox {AV}(c)=c^{-1}\Sigma _{i,j}\left| {a_{ij}^{(c)} }\right| \) is found to exceed the average \(\hbox {AV}_{\mathrm{NZ}}(c)=q_{c}^{-1}\Sigma _{(i,j)\in NZ}\left| {a_{ij}^{(c)} } \right| \) of the nonzero \(a_{ij}^{(c)} \) switched into zero, where NZ denotes the set of (i, j) satisfying (7.3). This shows that the nonzero \(a_{ij}^{(c)}\) with smaller \(\left| {a_{ij}^{(c)}}\right| \) tend to become zero. Further, we can find that \(\hbox {AV}(c)\) decreases as c increases and approaches \(\hbox {AV}_{\mathrm{NZ}}(c)\) when c is greater. This suggests that, for greater c, a number of the \(\left| {a_{ij}^{(c)} } \right| \) are small enough to switch \(a_{ij}^{(c)} \) into zero, which leads to uncertainty about which nonzero loadings are turned into zero, i.e., higher switch-rates. The other two curves in the right panel express the average \(\hbox {AV}_{\mathrm{ZN}}(c)=T_{ZN}^{-1}\Sigma _{(i,j)\in ZN}\left| {a_{ij}^{(c+1)} } \right| \) for the nonzero \(a_{ij}^{(c+1)} \) switched from \(a_{ij}^{(c)} \) = 0, and \(\hbox {AV}_{\mathrm{NN}}(c)=T_{NN}^{-1}\Sigma _{(i,j)\in NN}\left| {a_{ij}^{(c)} } \right| \) for the nonzero \(a_{ij}^{(c)}\) that are not switched. Here, ZN is the set of (i, j) with (7.4), and NN is the set of (i, j) satisfying \(a_{ij}^{(c)} \ne 0\) and \(a_{ij}^{(c+1)} \ne 0\), with \(T_{ZN}\) and \(T_{NN}\) the numbers of (i, j) in ZN and NN, respectively. We can find that \(\hbox {AV}_{\mathrm{NZ}}(c) <\hbox { AV}_{\mathrm{NN}}(c)\) and \(\hbox {AV}_{\mathrm{ZN}}(c) <\hbox { AV}_{\mathrm{NN}}(c)\), except for \(\hbox {Card}(\mathbf{A}) = c_{\min }\): the absolute values of the switched loadings tend to be smaller than those of the loadings that do not switch. This suggests that the interpretation for \(\hbox {Card}(\mathbf{A}) = c+1\) does not change drastically from that for \(\hbox {Card}(\mathbf{A}) = c\), if one focuses on the loadings with large absolute values.
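For reference, the switch-rate between the solutions for two consecutive cardinalities can be computed as in this small sketch (Python/NumPy; switch_rate is our hypothetical name).

```python
import numpy as np

def switch_rate(A_c, A_c1):
    """Switch-rate SR_c = q_c / c between the solutions for Card(A) = c and c + 1.

    q_c counts loadings that are nonzero for c but zero for c + 1, cf. (7.3)."""
    c = np.count_nonzero(A_c)
    q_c = np.count_nonzero((A_c != 0) & (A_c1 == 0))
    return q_c / c
```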

Fig. 2 Switch-rates and the averages of absolute loadings against the cardinality of A

7.2 Gene expression data

The second example is the yeast cell cycle data matrix of \(n = 17\) time points (observations) by \(p = 384\) genes (variables) presented by Yeung and Ruzzo (2001). The data matrix, publicly available at http://faculty.washington.edu/kayee/pca, has been first log-transformed and then standardized so that the column averages and variances are zero and one, respectively. The 384 genes are categorized into five phases of the cell cycle, which suggests that \(m = 5\) components corresponding to the five phases might be obtained. However, the preliminary trial detailed in Appendix “Component-wise constrained USLPCA” showed that the solution with \(m = 5\) included a component whose PEV was very low. Such a trivial component was also extracted by Enki and Trendafilov (2012, p. 620), where the genes were classified into five groups, but one of them could not be well related to the phases. We thus reduced m to 4 for performing USLPCA on the data set. Local minimizers were obtained often, with \(P_{\mathrm{LM}} = 0.771\), and they were not similar to the optimal solution, with \(D_{\mathrm{LM}} = 0.119\). The resulting PEV and BIC values are plotted against Card(A) in Fig. 3, where the PEV values are found to range from 60 to 81 %. For such PEV values, BIC works better than AIC, as suggested by the results for \(p > n\) and the approximate PEV of 60 and 80 % in Sect. 6.4. In Fig. 3, BIC is found to give the least value for Card(A) = 616 among all possible cardinalities, with the corresponding PEV of 73.4.

Fig. 3 Total PEV and BIC against the cardinality of A for gene expression data

The loadings for Card(A)  =  616 are presented block-wise in Fig. 4. There, the blocks (genes \(\times \) components) correspond to the five phases, with white, red, and blue standing for zero, positive values, and negative ones, respectively. The PEVs for Components 1, 2, 3, and 4 were 16.0, 37.0, 13.4, and 7.0, respectively. The solution is considered to be reasonable, as each phase has a unique feature of loadings: [a] the genes in Phases 1, 2, and 4 are positively loaded by Components 1, 2, and 3, respectively; [b] Phase 5 is characterized by positive loadings for Component 4 and negative ones for Component 2; [c] Phase 3 consists of genes positively loaded by Component 2, by Component 3, or by both.

8 Discussion

In this paper, we proposed the unpenalized sparse PCA procedure USLPCA, in which component loadings rather than weights are sparsified without using penalty functions. In USLPCA, the loss function of PCA is minimized subject to a direct cardinality constraint on the loadings and an orthonormality constraint on the component score matrix. The latter condition makes it possible to decompose the loss function into a sum of two terms, one of which is irrelevant to the loadings, while the other is a function easily minimized subject to the cardinality constraint. Using this decomposition, we formed the alternating least squares algorithm for USLPCA. This algorithm has two important features. First, the updating of the PC score matrix F can be avoided, and second, the algorithm can operate with either a data matrix or a covariance matrix as input.

We also presented the cardinality selection procedure using AIC and BIC. This selection can be easily attained by considering all possible values of cardinality, owing to the cardinality being prespecified to be an integer in USLPCA. The simulation study shows that the true cardinality can be identified fairly well by BIC, though AIC overestimates the cardinality considerably.

In USLPCA, the total percentage of explained variance (PEV), the PEV index for components, and the one for variables are defined in the same form as in standard PCA. This facilitates comparing USLPCA solutions with the PCA one in terms of the goodness-of-fit to a data set. As the total PEV of PCA is the upper limit of those of USLPCA, we can ascertain that a USLPCA solution is satisfactory if its total PEV is not much lower than that of PCA, as illustrated in the examples. There, it was also demonstrated that there can exist variables that are better explained by the sparse USLPCA components than by the non-sparse PCA ones.

Fig. 4 Heat map of the loadings for gene expression data, produced by polarmap.m from the MATLAB file exchange

One of the most critical differences of USLPCA from the existing sparse PCA procedures is that the loadings in A are to be interpreted in the former, while the weights in W are of interest in the latter. As the weights must be inspected to find how the PC scores in \(\mathbf{F} = \mathbf{XW}\) are defined, the existing procedures are particularly suitable when PC scores are of interest. On the other hand, the loadings express how the original variables are regressed onto the components. Thus, USLPCA is to be used when the components are treated as factors explaining the variables. To what extent they are explained by the components is assessed by the PEV indices discussed above.

One may be interested in modifying USLPCA to obtain a sparse W, that is, minimizing the PCA loss function (1.1) over W and A subject to \(\hbox {Card}(\mathbf{W}) = c\). Unfortunately, this modification is not easy, as the key point of USLPCA is the decomposition of (1.1) into (2.2), and this strategy is difficult to use for W. That is, (2.2) reduces the cardinality-constrained minimization of (1.1) over A to that of the simple function \(g(\mathbf{A}) = \Vert \mathbf{B} -\mathbf{A}\Vert ^{2}\). However, such a simple function for W does not seem to exist.

In this paper, we did not present a systematic procedure for selecting the dimensionality, i.e., the number of components m. For this selection, we can perform the cardinality selection over different dimensionalities to find the best combination of cardinality and dimensionality. For example, the combination with the least BIC can be chosen as the best one. Assessing such a procedure and considering other approaches remain for future studies.