1 Introduction

Principal component analysis (PCA) is a well-known data-analytic technique that linearly transforms a given set of data to an equivalent representation. This transformation is defined in such a manner that each variable in the new representation, called a principal component (PC), captures the maximal amount of the variance in the data that is not already expressed by the PCs preceding it. The linear combination defining each of the PCs is given by a vector of coefficients (also termed loadings). In terms of the covariance (or correlation) matrix of the data, the coefficients vector of the kth PC is the eigenvector that corresponds to the kth largest eigenvalue [1]. One major drawback of PCA is that the coefficients vectors are commonly dense, i.e., each PC is a linear combination of many, if not all, of the original variables, which makes the obtained PCs difficult to interpret. This disadvantage encouraged a wide interest in the sparsity constrained version of PCA, which imposes an additional constraint enforcing the number of nonzero entries in the coefficients vector not to exceed some predetermined sparsity level s.

Enforcing sparsity on the coefficients vector is desirable in many applications. For example, in the exploration of micro-array gene expression patterns, PCA is employed in order to classify different tissues according to their gene expression. It is also desirable that such discrimination can be carried out by utilizing only a small subset of the genes, thus encouraging sparse solutions [2]. The desire to obtain interpretable coefficients vectors is not the only reason to favor the sparse PCA model. For example, some financial applications prefer sparse solutions in order to reduce transaction costs [3]. Clearly, incorporating an additional sparsity constraint will provide a PC that, generally, does not explain all of the variance explained by the regular PC; nevertheless, in such applications, this sacrifice is acceptable with respect to the obtained benefits. We refer to this formulation as the sparsity constrained formulation, and it is merely one of several alternative formulations considered in the literature. The common alternatives are the result of treating the sparsity term, or its relaxation, by a penalty approach. The sparse PCA problem is a difficult non-convex problem and can be solved to optimality only for small-scale problems, by performing an exhaustive or branch and bound search over all possible support sets [4]. Thus, in order to handle large-scale problems, the algorithms proposed in the literature seek an approximate solution. One of the first methods, suggested by Cadima and Jolliffe [5], is to threshold the smallest, in absolute value, elements of the dominant eigenvector. Unfortunately, this remarkably simple approach is known to frequently provide poor results. In [4], Moghaddam et al. proposed several greedy methods. An advantage of these methods is that they produce a full path of solutions (i.e., a solution for each value of the sparsity level up to s), but the necessity to perform a large number of eigenvalue computations at each step renders them quite computationally expensive. In [6], d'Aspremont et al. proposed an approximate greedy approach that obviates the necessity to perform most of the eigenvalue computations by evaluating a lower bound on the eigenvalues, which results in a substantial reduction of computation time. Another approach presented by d'Aspremont et al. [6], and earlier in [7], is to consider a semidefinite programming formulation with a rank constraint for some of the relaxed and/or penalized models of PCA. These equivalent formulations are still hard non-convex problems, and thus a relaxed model is solved and an approximate solution is derived for the original problem. The algorithms used to solve the SDP relaxations are not applicable to large-scale problems, rendering this approach non-scalable. In [8], encouraged by the LASSO approach suggested for regression [9], Jolliffe et al. proposed the absolute value norm constrained formulation under the name SCoTLass (simplified component technique LASSO), which is a relaxation of the sparsity constrained problem. In practice, the numerical study was conducted on the penalized version by implementing the projected gradient algorithm. The relaxed model was further considered in the literature. An alternating minimization scheme to solve the constrained formulation was proposed in [10]. Another work that addressed the constrained formulation was motivated by the expectation maximization algorithm for probabilistic PCA [11]. Even though that work addressed the constrained formulation, the sequence generated by the method in [11] is guaranteed to be s-sparse. Penalized versions were also considered extensively. In [12], Zou et al. formulated sparse PCA as a regression-type model, where the ith principal component is approximated by a linear combination of the original variables. LASSO and ridge penalties are imposed on the coefficients vector, forming the elastic net model that generalizes the LASSO [13], and an alternating minimization algorithm, called SPCA, was proposed. In [14], Shen and Huang proposed several iterative schemes to solve the penalized versions via regularized SVD. These methods were considered further in [15], where a gradient scheme was proposed and a convergence analysis, which was missing in [14], was provided.

Recently, Luss and Teboulle showed in [16] that the seemingly different methods proposed in [10–12, 14, 15, 17] are particular realizations of the conditional gradient algorithm with unit step size. The work [16] proposed a unified algorithmic framework, referred to as ConGradU (the well-known conditional gradient with unit step size), and established convergence results, showing that the algorithm produces a point satisfying some necessary first-order optimality criteria. Some novel schemes were also provided, one of which directly addresses the sparsity constrained formulation of sparse PCA.

As already mentioned, none of the methods listed above is guaranteed to produce an optimal solution. In addition, the sparse PCA problem does not seem to possess a verifiable necessary and sufficient global optimality condition, and hence, in general, there is no efficient way to check whether a given vector is the globally optimal solution. Therefore, the comparison of the methods in the literature is based solely on numerical experiments, without any theoretical justification for the advantage of a certain method over the others. However, most of the algorithms just listed produce a solution that satisfies some necessary optimality condition. In a recent work, Beck and Eldar [18] employed some of the aforesaid conditions in order to provide an insight regarding the success of the corresponding algorithms. Under the framework of minimizing a continuously differentiable function subject to a sparsity constraint, several necessary optimality conditions were presented. The relations between the different optimality conditions were established, showing that some of the conditions are stronger (that is, more restrictive) than others. An extension to problems over sparse symmetric sets was considered in [19]. In this paper, we adopt this methodology in order to establish a hierarchy between two necessary optimality conditions for the sparsity constrained sparse PCA problem. The first condition that we consider is a well-known first-order condition, which was originally presented in the context of the sparse PCA problem in [16]. We will refer to it as the complete (co) stationarity condition. Many of the existing algorithms in the literature are actually guaranteed to converge to a co-stationary point. The second condition, which we call coordinate-wise (CW) maximality, is a generalization of one of the conditions considered in [18], and it essentially states that the function value cannot be improved by changing at most two coordinates.

In the following section, we will explicitly define the conditions under consideration. In Sect. 3, we will establish the relation between the conditions, showing that the CW-maximality condition is stronger (that is, more restrictive) than co-stationarity. In Sect. 4, we will introduce algorithms that produce points satisfying the aforementioned conditions and finally, in Sect. 5, we will provide a numerical study on simulated and real-life data that supports our assertion that algorithms that correspond to stronger conditions are more likely to provide better results.

2 Necessary Optimality Conditions

Throughout the paper, we consider the following sparsity constrained problem:

$$\begin{aligned} \max \{f(\mathbf {x}):\mathbf {x}\in S \}, \end{aligned}$$
(P)

where f is a continuously differentiable convex function over \(\mathbb {R}^n\) and

$$\begin{aligned} S := \left\{ \mathbf {x}\in \mathbb {R}^n : \Vert \mathbf {x}\Vert _2\le 1,~\Vert \mathbf {x}\Vert _0\le s \right\} , \end{aligned}$$

with \(\Vert \cdot \Vert _0\) being the so-called \(l_0\)-norm defined by \(\Vert \mathbf {x}\Vert _0 := |\{i:x_i\ne 0\}|\). As a special case, when the objective function is chosen as \(f(\mathbf {x}) = \mathbf {x}^T \mathbf {A} \mathbf {x}\), where \(\mathbf {A}\) is a given positive semidefinite matrix, problem (P) amounts to the \(l_0\)-constrained sparse PCA model:

$$\begin{aligned} \max \{\mathbf {x}^T\mathbf {Ax}:\mathbf {x}\in S \}. \end{aligned}$$
(SPCA)

In PCA applications, \(\mathbf {A}\) usually stands for the covariance matrix of the data.
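
To make the model concrete, here is a minimal NumPy sketch (ours, not part of the paper's accompanying MATLAB code; the function names are illustrative) that evaluates the (SPCA) objective and checks feasibility with respect to S on a small positive semidefinite matrix:

```python
import numpy as np

def f_spca(A, x):
    """Objective of (SPCA): f(x) = x^T A x."""
    return x @ A @ x

def in_S(x, s, tol=1e-12):
    """Feasibility with respect to S: ||x||_2 <= 1 and ||x||_0 <= s."""
    return np.linalg.norm(x) <= 1 + tol and np.count_nonzero(np.abs(x) > tol) <= s

# A small symmetric, diagonally dominant (hence positive semidefinite) matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])
x = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)   # a 2-sparse unit-norm vector
print(in_S(x, s=2), f_spca(A, x))            # True 3.0
```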

In this section, we will present two necessary optimality conditions for the general model (P). Although our main motivation is to study the sparse PCA problem, we will nonetheless consider the general model (P), since our results are also applicable in this general setting.

Prior to presenting the optimality conditions, we will introduce in the following subsection some notation and definitions that will be used in our analysis.

2.1 Notation and Definitions

A subvector of a given vector \(\mathbf {x}\in \mathbb {R}^n\) corresponding to a set of indices \(T\subseteq \{1,2,\dots ,n\}\) is denoted by \(\mathbf {x}_T\). Similarly, we will denote the subvector of the gradient \(\nabla f(\mathbf {x})\) corresponding to the indices in T by \(\nabla _Tf(\mathbf {x})\). The sign of a given \(\alpha \in {\mathbb R}\) is denoted by \(\text {sgn}(\alpha )\) and is equal to 1 for \(\alpha \ge 0\) and \(-1\) for \(\alpha <0\). The support set of some arbitrary vector \(\mathbf {x}\) will be denoted by \(I_1(\mathbf {x})= \{i:x_i\ne 0\}\) and its complement by \(I_0(\mathbf {x})= \{i:x_i=0\}\). For a given vector \(\mathbf {x}\in \mathbb {R}^n\) and an integer \(s\in \{1,2,\dots ,n\}\), we will define \(M_s(\mathbf {x})\) to be the sth largest value among \(|x_1|,|x_2|,\dots ,|x_n|\). For such \(\mathbf {x}\) and s, we will define the sets \(I_{>}(\mathbf {x},s), I_{=}(\mathbf {x},s)\) and \(I_{<}(\mathbf {x},s)\) as follows:

$$\begin{aligned}&I_{>}(\mathbf {x},s) := \{i:|x_i|>M_s(\mathbf {x})\}, \\&I_{=}(\mathbf {x},s) := {\left\{ \begin{array}{ll} \{i:|x_i|=M_s(\mathbf {x})\}, &{} \Vert \mathbf {x}\Vert _0 \ge s, \\ ~~~~~~\qquad \emptyset , &{} \Vert \mathbf {x}\Vert _0<s, \end{array}\right. } \\&I_{<}(\mathbf {x},s) := {\left\{ \begin{array}{ll} \{i:|x_i|<M_s(\mathbf {x})\}, &{} \Vert \mathbf {x}\Vert _0\ge s,\\ \{i:x_i=0\}, &{} \Vert \mathbf {x}\Vert _0<s. \end{array}\right. } \end{aligned}$$

We will also define the set \(I_{\ge }(\mathbf {x},s) := I_{>}(\mathbf {x},s)\cup I_{=}(\mathbf {x},s)\) and the set \(I_{\le }(\mathbf {x},s) := I_{<}(\mathbf {x},s)\cup I_{=}(\mathbf {x},s)\). Obviously, the sets \(I_{>}(\mathbf {x},s), I_{=}(\mathbf {x},s)\) and \(I_{<}(\mathbf {x},s)\) form a partition of \(\{1,2,\dots ,n\}\). Furthermore, when \(\Vert \mathbf {x}\Vert _0<s\), we have that \(I_{>}(\mathbf {x},s) = I_1(\mathbf {x}), I_{=}(\mathbf {x},s) = \emptyset \) and \(I_{<}(\mathbf {x},s)=I_0(\mathbf {x})\).

The sets defined above possess some convenient and elementary properties which are given in Lemma 2.1 below. Since all the properties stated in the lemma are rather simple consequences of the definition of the sets \(I_{>}(\mathbf {x},s),\) \( I_{=}(\mathbf {x},s)\), \(I_{<}(\mathbf {x},s)\), the proof is omitted.

Lemma 2.1

1. If \(\mathbf {x}\ne \mathbf {0}\), then \(I_\ge (\mathbf {x},s)\ne \emptyset \).

2. If \(|I_\ge (\mathbf {x},s)|<s\), then \(x_j=0\) for all \(j\in I_<(\mathbf {x},s)\).

3. For any \(i\in I_{>}(\mathbf {x},s),j \in I_{=}(\mathbf {x},s)\) and \(k\in I_{<}(\mathbf {x},s)\), it holds that \(|x_i|>|x_j|>|x_k|\).

We will frequently use the notation

$$\begin{aligned} R_s(\mathbf {x}) := \left\{ T\subseteq I_1(\mathbf {x}) : |T|=\min \{s,\Vert \mathbf {x}\Vert _0\},~ |x_i|\ge |x_j| \text{ for all } i\in T,~j\notin T \right\} \end{aligned}$$

for the set containing all the subsets of indices corresponding to the nonzero s largest in absolute value components of a given vector \(\mathbf {x}\). When \(\Vert \mathbf {x}\Vert _0\le s\), there are no more than s nonzero elements in \(\mathbf {x}\), and the above definition actually amounts to \(R_s(\mathbf {x})= \{ I_1(\mathbf {x})\}\). However, when \(\Vert \mathbf {x}\Vert _0> s\), there might be more than one set of indices corresponding to the s largest absolute value components of \(\mathbf {x}\). For example, consider the vector \(\mathbf {x} = (3,2,1,1,1,0,0)^T\) and the sparsity level \(s=3\). Then,

$$\begin{aligned} R_3(\mathbf {x}) = \{\{1,2,3\},\{1,2,4\},\{1,2,5\}\}. \end{aligned}$$

On the other hand, in the following examples, the set contains a single subset:

$$\begin{aligned} R_3((0,-5,4,-3,2,0)^T) = \{\{2,3,4\}\}, R_3 ((0,0,4,-3,0,0)^T) = \{\{3,4\}\}. \end{aligned}$$

The hard thresholding operator maps a vector \(\mathbf {x} \in \mathbb {R}^n\) to the set of vectors that are generated by keeping the s largest absolute value components of \(\mathbf {x}\) and setting all the others to zeros. This operator, which we denote by \(H_s\), is formally defined by

$$\begin{aligned} H_s(\mathbf {x}) := \left\{ \mathbf {z}\in \mathbb {R}^n : \mathbf {z}_T=\mathbf {x}_T,~\mathbf {z}_{\bar{T}}=\mathbf {0},~T\in R_s(\mathbf {x}) \right\} . \end{aligned}$$

Thus, for example,

$$\begin{aligned}&H_3((3,2,1,1,1,0,0)^T) \\&\quad = \{(3,2,1,0,0,0,0)^T,(3,2,0,1,0,0,0)^T, (3,2,0,0,1,0,0)^T \}. \end{aligned}$$
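
The following brute-force NumPy sketch (ours, intended only for small examples and using 0-based indices; function names are illustrative) computes \(M_s(\mathbf {x})\), \(R_s(\mathbf {x})\) and \(H_s(\mathbf {x})\) as defined above:

```python
import numpy as np
from itertools import combinations

def M_s(x, s):
    """The s-th largest value among |x_1|, ..., |x_n|."""
    return np.sort(np.abs(x))[::-1][s - 1]

def R_s(x, s):
    """All index sets of the nonzero s largest-in-absolute-value components of x."""
    nnz = np.flatnonzero(x)
    k = min(s, nnz.size)
    if k == 0:
        return [set()]
    sets, absx = [], np.abs(x)
    for T in combinations(nnz, k):
        rest = np.setdiff1d(np.arange(x.size), T)
        if absx[list(T)].min() >= (absx[rest].max() if rest.size else 0.0):
            sets.append(set(int(i) for i in T))
    return sets

def H_s(x, s):
    """Hard thresholding: keep x on some T in R_s(x) and zero out the rest."""
    out = []
    for T in R_s(x, s):
        z = np.zeros_like(x, dtype=float)
        idx = sorted(T)
        z[idx] = x[idx]
        out.append(z)
    return out

x = np.array([3, 2, 1, 1, 1, 0, 0], dtype=float)
print(M_s(x, 3))    # 1.0
print(R_s(x, 3))    # [{0, 1, 2}, {0, 1, 3}, {0, 1, 4}]
print(H_s(x, 3))    # the three vectors listed in the example above
```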

2.2 Complete (co)-Stationarity

The first condition that we consider was presented for the sparse PCA problem in [16]. We refer to it as the complete (co) stationarity condition.

Definition 2.1

(co-stationarity) Let \(\mathbf {x}\) be a feasible solution of (P). Then, \(\mathbf {x}\) is called a co-stationary point of (P) over S if and only if it satisfies:

$$\begin{aligned} \langle \nabla f(\mathbf {x}),\mathbf {v}-\mathbf {x}\rangle \le 0 \qquad \forall \mathbf {v}\in S. \end{aligned}$$

This is probably the most elementary first-order condition for constrained differentiable optimization problems. The work [16] provided a unified framework for several algorithms designed to solve different formulations of sparse PCA. Actually, [16] considered the co-stationarity condition over a general nonempty and compact set instead of S, and for this general case, the following proposition, which was originally established in [20], was recalled. This result follows from the convexity of the objective function.

Proposition 2.1

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a continuously differentiable and convex function over \(\mathbb {R}^n\), and let C be a nonempty and compact set. If \(\mathbf {x}\) is a global maximum of f over C, then \(\mathbf {x}\) is a co-stationary point over C, meaning that \(\langle \nabla f(\mathbf {x}), \mathbf {v}-\mathbf {x}\rangle \le 0\) for any \(\mathbf {v}\in C\).

2.3 CW-Maximality

The second necessary optimality condition that we will consider is coordinate-wise maximality. This is in fact a type of local optimality condition, stating that a given point \(\mathbf {x}\) is a maximizer over a neighborhood consisting of all feasible points that differ from it in at most two coordinates. We will denote the corresponding neighborhood by

$$\begin{aligned} S_2(\mathbf {x}) := \{\mathbf {z}:\Vert \mathbf {z}-\mathbf {x}\Vert _0\le 2,\mathbf {z}\in S\}. \end{aligned}$$

The formal definition of a CW-maximum point follows.

Definition 2.2

(CW-maximum point) Let \(\mathbf {x}\) be a feasible solution of (P). Then, \(\mathbf {x}\) is called a coordinate-wise (CW) maximum point of (P) if and only if \(f(\mathbf {x})\ge f(\mathbf {z})\) for every \(\mathbf {z}\in S_2(\mathbf {x})\).

Obviously, CW-maximality, by its definition, is a necessary optimality condition.

Proposition 2.2

Let \(\mathbf {x}\) be an optimal solution to (P). Then, \(\mathbf {x}\) is a CW-maximum point.

Instead of considering the neighborhood \(S_2(\mathbf {x})\) in the definition of CW-maximality (Definition 2.2), we could have alternatively considered larger neighborhoods consisting of vectors that differ from \(\mathbf {x}\) by at most k coordinates for some \(2 \le k \le s\):

$$\begin{aligned} S_k(\mathbf {x}) := \{\mathbf {z}:\Vert \mathbf {z}-\mathbf {x}\Vert _0\le k,\mathbf {z}\in S\}. \end{aligned}$$

A similar optimality condition over such a neighborhood can be defined, and clearly, since \(S_t(\mathbf {x})\subseteq S_k(\mathbf {x})\) for any \(t\le k\), considering neighborhoods that differ by a larger number of coordinates results in stronger optimality conditions. Note that the number of comparisons required in order to verify that a vector \(\mathbf {x}\in \mathbb {R}^n\) with a full support (\(|I_1(\mathbf {x})|=s\)) is CW-maximal (\(k=2\)) is \(O(s\cdot n)\), while changing the neighborhood to \(S_3\) increases the number of comparisons to \(O(s\cdot n^2)\). Hence, considering such a stronger optimality condition has a substantial computational price. Keeping in mind that we seek scalable conditions and algorithms, we restrict the discussion to the case \(k=2\).

3 Optimality Conditions Hierarchy

Our main result in this section is that CW-maximality is a stronger (that is, more restrictive) optimality condition than co-stationarity. This result also has an impact on the performance of the corresponding algorithms in the sense that, loosely speaking, algorithms that are only guaranteed to converge to a co-stationary point are less likely to produce the optimal solution of the problem than algorithms that are guaranteed to converge to a CW-maximal point. In Sect. 5, we will show that the numerical results support this assertion.

3.1 Technical Preliminaries

We will begin by providing some auxiliary technical results that will be used in order to establish the main result. Lemma 3.1 is a trivial result that follows directly from the Cauchy–Schwarz inequality (see also Lemma 4.1 in [16]).

Lemma 3.1

Suppose that \(\mathbf {0} \ne \mathbf {q} \in \mathbb {R}^d\) and \(\rho >0\). Then, the optimal solution of the optimization problem

$$\begin{aligned} \max \left\{ \mathbf {q}^T\mathbf {x} : \Vert \mathbf {x}\Vert _2\le \rho ,~\mathbf {x}\in \mathbb {R}^d \right\} \end{aligned}$$
(QCLP)

is given by \(\mathbf {x}^* = \rho \frac{\mathbf {q}}{\Vert \mathbf {q}\Vert _2}\) with the optimal value of \(\rho \Vert \mathbf {q}\Vert _2\).

The following simple lemma is an extension of Proposition 4.3 from [16].

Lemma 3.2

Assume that \(\mathbf {0} \ne \mathbf {p}\in \mathbb {R}^n\). Then, the set of optimal solutions of the optimization problem

$$\begin{aligned} \max \left\{ \mathbf {p}^T\mathbf {x} : \mathbf {x}\in S \right\} \end{aligned}$$
(S-QCLP)

is given by

$$\begin{aligned} X^*(\mathbf {p},s) := \left\{ \mathbf {x}\in \mathbb {R}^n : \mathbf {x}_T=\frac{\mathbf {p}_T}{\Vert \mathbf {p}_T\Vert _2},~\mathbf {x}_{\bar{T}} = \mathbf {0},~ T\in R_s(\mathbf {p})\right\} , \end{aligned}$$

with the optimal value of \(\Vert \mathbf {p}_T\Vert _2\), where \(T\in R_s(\mathbf {p})\).

Proof

We can write (S-QCLP) as

$$\begin{aligned} \max _{T\subseteq \{1,\dots ,n\}:|T|\le s}~ \max \left\{ \mathbf {p}^T\mathbf {x} : \Vert \mathbf {x}\Vert _2\le 1,~I_1(\mathbf {x})\subseteq T \right\} . \end{aligned}$$
(1)

According to Lemma 3.1, for each \(T\subseteq \{1,\dots ,n\}\) satisfying \(|T|\le s\), the optimal value of the inner optimization problem is \(\Vert \mathbf {p}_T\Vert _2\), and if \(\mathbf {p}_T \ne \mathbf {0}\), then a solution \(\mathbf {x}^*\) to the inner optimization problem is given by

$$\begin{aligned} \mathbf {x}^*_T=\frac{\mathbf {p}_T}{\Vert \mathbf {p}_T\Vert _2},~\mathbf {x}^*_{\bar{T}} = \mathbf {0}. \end{aligned}$$
(2)

The problem (1) thus reduces to

$$\begin{aligned} \max _{T\subseteq \{1,\dots ,n\}:|T|\le s} \Vert \mathbf {p}_T\Vert _2. \end{aligned}$$
(3)

Obviously, when \(\Vert \mathbf {p}\Vert _0 \ge s\), the optimal solutions of the latter problem are all the sets containing the indices of components corresponding to the s largest absolute values in \(\mathbf {p}\), and when \(\Vert \mathbf {p}\Vert _0 <s\), the unique optimal solution is \(I_1(\mathbf {p})\). Thus, the set of all optimal solutions of (3) is \(R_s(\mathbf {p})\). Noting that \(\mathbf {p}_T \ne \mathbf {0}\) for any \(T \in R_s(\mathbf {p})\), we conclude that the optimal solutions of (S-QCLP) are given by (2) with T being any set in \(R_s(\mathbf {p})\), which are exactly the members of \(X^*(\mathbf {p},s)\). \(\square \)
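
As an illustration, the following NumPy sketch (ours, with hypothetical function names; not the accompanying MATLAB code) returns a member of \(X^*(\mathbf {p},s)\) together with the optimal value \(\Vert \mathbf {p}_T\Vert _2\), and uses it to check the co-stationarity condition of Definition 2.1 for the quadratic objective of (SPCA); indeed, \(\max _{\mathbf {v}\in S}\langle \nabla f(\mathbf {x}),\mathbf {v}\rangle \) is exactly the optimal value of (S-QCLP) with \(\mathbf {p}=\nabla f(\mathbf {x})\):

```python
import numpy as np

def solve_s_qclp(p, s):
    """An optimal solution of (S-QCLP): max { p^T x : x in S } (Lemma 3.2).
    Returns a point of X*(p, s) and the optimal value ||p_T||_2."""
    p = np.asarray(p, dtype=float)
    T = np.argsort(np.abs(p))[::-1][:s]     # indices of the s largest |p_i|
    T = T[np.abs(p[T]) > 0]                 # keep only nonzero components
    x = np.zeros_like(p)
    if T.size == 0:                         # p = 0: every feasible point is optimal
        return x, 0.0
    x[T] = p[T] / np.linalg.norm(p[T])
    return x, float(np.linalg.norm(p[T]))

def is_co_stationary(A, x, s, tol=1e-10):
    """Definition 2.1 for f(x) = x^T A x: x is co-stationary iff
    max_{v in S} <grad f(x), v> <= <grad f(x), x>."""
    g = 2.0 * A @ x                         # gradient of the quadratic objective
    _, best = solve_s_qclp(g, s)            # max of <g, v> over S, by Lemma 3.2
    return best <= g @ x + tol
```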

Our final technical lemma states that, if a given vector \(\tilde{\mathbf {x}}\) is not an optimal solution of the problem of maximizing a linear function over a Euclidean ball, then there must be two indices \(i \ne j\) for which the subvector \(\tilde{\mathbf {x}}_{\{i,j\}}\) is also not an optimal solution of the problem restricted to the variables \(x_i,x_j\) (while fixing all the other variables). This lemma is rather simple, but will play a key role in the proof of the main result.

Lemma 3.3

Let \(\mathbf {q} \in \mathbb {R}^d\) and \(\rho >0\). Suppose that \(\tilde{\mathbf {x}}\) satisfies \(\Vert \tilde{\mathbf {x}}\Vert _2\le \rho \), and that it is not an optimal solution of (QCLP). Then, there exist indices \(i,j (i \ne j)\) such that \(\tilde{\mathbf {x}}_{\{i,j\}}\) is not the optimal solution of

$$\begin{aligned} \max \left\{ \mathbf {q}_{\{i,j\}}^T\mathbf {y} : \Vert \mathbf {y}\Vert _2^2\le \rho ^2-\sum _{l\ne i,j}\tilde{x}_l^2,~\mathbf {y}\in \mathbb {R}^2 \right\} . \end{aligned}$$
\((\text {2-QCLP}_{\{i,j\}})\)

Proof

Since \(\tilde{\mathbf {x}}\) is not the optimal solution of (QCLP), we obtain that \(\mathbf {q}\ne \mathbf {0}\) (since otherwise, if \(\mathbf {q}=\mathbf {0}\), all feasible points are also optimal). Thus, the set \(I_1(\mathbf {q})\) is nonempty. We will split the analysis into two cases.

  • If \(\Vert \tilde{\mathbf {x}}\Vert _2 < \rho \), then take any \(i \in I_1(\mathbf {q})\) and \(j\ne i\), and we can write

    $$\begin{aligned} \tilde{x}_i^2+\tilde{x}_j^2 = \Vert \tilde{\mathbf {x}}\Vert _2^2-\sum _{l\ne i,j}\tilde{x}_l^2 < \rho ^2-\sum _{l\ne i,j}\tilde{x}_l^2, \end{aligned}$$

    which together with \(\mathbf {q}_{\{i,j\}} \ne \mathbf {0}\) (since \(i\in I_1(\mathbf {q})\)) implies that \(\tilde{\mathbf {x}}_{\{i,j\}}\) is not the optimal solution of \((\text {2-QCLP}_{\{i,j\}})\), since we have, by Lemma 3.1, that the constraint at the optimal solution must be active.

  • If, on the other hand, \(\Vert \tilde{\mathbf {x}}\Vert _2 = \rho \), then assume in contradiction that for each \(i \ne j\) the vector \(\tilde{\mathbf {x}}_{\{i,j\}}\) is the optimal solution of \((\text {2-QCLP}_{\{i,j\}})\). Take some \(i \in I_1(\mathbf {q})\). For any \(j\in I_0(\mathbf {q})\), we know that \(\tilde{\mathbf {x}}_{\{i,j\}}\) is the optimal solution of \((\text {2-QCLP}_{\{i,j\}})\) and thus, according to Lemma 3.1 (employed on the problem \((\text {2-QCLP}_{\{i,j\}})\)), it must in particular satisfy \(\tilde{x}_j =0\), that is, \(j \in I_0(\tilde{\mathbf {x}})\). To summarize,

    $$\begin{aligned} \tilde{x}_j=0 \text{ for } \text{ any } j \in I_0(\mathbf {q}). \end{aligned}$$
    (4)

    Now, for any \(j\in I_1(\mathbf {q})\), according to Lemma 3.1, \(\tilde{\mathbf {x}}_{\{i,j\}}\) must satisfy

    $$\begin{aligned} \tilde{x}_i=\frac{q_i}{\Vert (q_i,q_j)^T\Vert _2}(\tilde{x}_i^2+\tilde{x}_j^2)^{1/2}, \end{aligned}$$
    (5)

    where here we used the fact that \(\rho ^2-\sum _{l \ne i,j} \tilde{x}_l^2 = \tilde{x}_i^2+\tilde{x}_j^2\). Squaring both sides of (5), we obtain that it is equivalent to \(q_j^2 \tilde{x}_i^2 = q_i^2 \tilde{x}_j^2\), and hence

    $$\begin{aligned} \tilde{x}_j^2 = \frac{q_j^2}{q_i^2}\tilde{x}_i^2 \qquad \text{ for } \text{ any } j\in I_1( \mathbf {q}). \end{aligned}$$

    By (4), \(\tilde{x}_j=0\) whenever \(j \in I_0( \mathbf {q})\), and we can therefore write

    $$\begin{aligned} \tilde{x}_j^2 = \frac{q_j^2}{q_i^2}\tilde{x}_i^2, \qquad j=1,2,\ldots ,d. \end{aligned}$$

    Summing over \(j=1,2,\ldots ,d\), and using the fact that \(\Vert \tilde{\mathbf {x}}\Vert _2^2 = \rho ^2\), it follows that

    $$\begin{aligned}\sum _{j=1}^{d} \tilde{x}_i^2\frac{q_j^2}{q_i^2} = \rho ^2,\end{aligned}$$

    implying that

    $$\begin{aligned} \tilde{x}_i^2 = \rho ^2\frac{q_i^2}{\Vert \mathbf {q}\Vert _2^2}, \end{aligned}$$

    which combined with the fact that \(\text {sgn}(\tilde{x}_i)=\text {sgn}(q_i)\) (see (5)), yields

    $$\begin{aligned} \tilde{x}_i = \rho \frac{q_i}{\Vert \mathbf {q}\Vert _2}. \end{aligned}$$

    Since we actually proved the latter for an arbitrary \(i \in I_1(\mathbf {q})\), and since \(\tilde{x}_i = 0\) for any \(i \in I_0(\mathbf {q})\) (see (4)), it follows that

    $$\begin{aligned} \tilde{\mathbf {x}} = \rho \frac{\mathbf {q}}{\Vert \mathbf {q}\Vert _2}, \end{aligned}$$

    in contradiction to the assumption that \(\tilde{\mathbf {x}}\) is not an optimal solution of (QCLP).   \(\square \)

The following corollary is a direct consequence of Lemmas 3.2 and 3.3.

Corollary 3.1

Let \(\tilde{\mathbf {x}}\in S\). If \(\tilde{\mathbf {x}}\) is not an optimal solution to (S-QCLP) and \(I_1(\tilde{\mathbf {x}}) \subseteq T\) for some \(T \in R_s(\mathbf {p})\), then there exist indices \(i,j \in T (i \ne j)\) such that \(\tilde{\mathbf {x}}_{\{i,j\}}\) is not an optimal solution of \((\text {2-QCLP}_{\{i,j\}})\).

Proof

Assume that \(|T|=k\). Since \(\tilde{\mathbf {x}}\) is not an optimal solution of (S-QCLP), it follows by Lemma 3.2 that \(\tilde{\mathbf {x}}_T \ne \frac{\mathbf {p}_T}{\Vert \mathbf {p}_T\Vert }\), which implies that \(\tilde{\mathbf {x}}_T\) is not the optimal solution of the restricted problem

$$\begin{aligned} \max _{\mathbf {y} \in \mathbb {R}^k} \left\{ \mathbf {p}_T^T \mathbf {y} : \Vert \mathbf {y}\Vert _2 \le 1 \right\} . \end{aligned}$$

Therefore, invoking Lemma 3.3 with \(d=k\), \(\mathbf {q} =\mathbf {p}_T\) and \(\rho =1\), it follows that there exist indices \(i,j \in T (i \ne j)\) such that \(\tilde{\mathbf {x}}_{\{i,j\}}\) is not an optimal solution of \((\text {2-QCLP}_{\{i,j\}})\). \(\square \)

3.2 Co-Stationarity Versus CW-Maximality

The main result of this paper is given in the following theorem, which establishes the superiority of the CW-maximality condition over the co-stationarity condition.

Theorem 3.1

Let \(\mathbf {x}\) be a CW-maximum point of problem (P). Then, \(\mathbf {x}\) is a co-stationary point of (P).

Proof

Let \(\mathbf {x}\) be a CW-maximum point of (P). Assume by contradiction that \(\mathbf {x}\) is not a co-stationary point. This means that there exists a vector \(\mathbf {v}\in S\) such that

$$\begin{aligned} \nabla f(\mathbf {x})^T (\mathbf {v}-\mathbf {x}) > 0. \end{aligned}$$
(6)

We will show that we can find a vector \(\mathbf {z}\in S_2(\mathbf {x})\) such that

$$\begin{aligned} \nabla f(\mathbf {x})^T (\mathbf {z}-\mathbf {x}) > 0. \end{aligned}$$
(7)

This will imply a contradiction to the CW-maximality of \(\mathbf {x}\) by the following simple argument: since f is a convex function, we have

$$\begin{aligned} f(\mathbf {z}) \ge f(\mathbf {x}) + \nabla f(\mathbf {x})^T(\mathbf {z}-\mathbf {x}), \end{aligned}$$

which combined with (7) implies that

$$\begin{aligned} f(\mathbf {z}) > f(\mathbf {x}), \end{aligned}$$

which is an obvious contradiction to the CW-maximality of \(\mathbf {x}\).

Since \(\mathbf {x}\) satisfies (6), we obviously have \(\nabla f(\mathbf {x})\ne \mathbf {0}\). Let \(X^*(\nabla f(\mathbf {x}),s)\) be the set of optimal solutions of (S-QCLP) with \(\mathbf {p}=\nabla f(\mathbf {x})\) and let \(\mathbf {x}^*\in X^*(\nabla f(\mathbf {x}),s)\) be some particular solution. Then,

$$\begin{aligned} \nabla f(\mathbf {x})^T\mathbf {x}^* \ge \nabla f(\mathbf {x})^T \mathbf {v} > \nabla f(\mathbf {x})^T \mathbf {x}, \end{aligned}$$

and thus \(\mathbf {x}\notin X^*(\nabla f(\mathbf {x}),s)\).

Suppose that there exists some l for which \(\nabla _l f(\mathbf {x})\cdot x_l < 0\) (and in particular \(l\in I_1(\mathbf {x})\)). Define \(\mathbf {z}\) as:

$$\begin{aligned} j = 1,\dots ,n \qquad z_j := \left\{ \begin{array}{l@{\quad }l} -x_l, &{} j=l, \\ x_j, &{} otherwise. \end{array} \right. \end{aligned}$$

Clearly, \(\mathbf {z}\in S_2(\mathbf {x})\), and \(\nabla f(\mathbf {x})^T (\mathbf {z}-\mathbf {x}) > 0\) since

$$\begin{aligned} \nabla f(\mathbf {x})^T(\mathbf {z}-\mathbf {x}) = -2\cdot \nabla _l f(\mathbf {x})\cdot x_l > 0. \end{aligned}$$

We have thus shown in this case the desired contradiction. From now on, we will therefore consider the case where \(\nabla _i f(\mathbf {x})\cdot x_i \ge 0\) for all \(i=1,\dots ,n\).

Consider the following cases:

  1. \( I_1(\mathbf {x}) \not \subseteq I_{\ge }(\nabla f(\mathbf {x}),s)\). Obviously, there is some \(h\in I_1(\mathbf {x})\cap I_<(\nabla f(\mathbf {x}),s)\). We will consider the following subcases:

    1.1 If \(|I_\ge (\nabla f(\mathbf {x}),s)|< s\), then \(\nabla _h f(\mathbf {x})=0\) (by Lemma 2.1, part 2), and since \(\nabla f(\mathbf {x})\ne \mathbf {0}\), we conclude, using Lemma 2.1 (part 1), that there is some \(l\in I_\ge (\nabla f(\mathbf {x}),s)\). Define \(\mathbf {z}\) as:

      $$\begin{aligned} j = 1,\dots ,n \qquad z_j := \left\{ \begin{array}{c@{\quad }l} \text {sgn}\left( \nabla _l f(\mathbf {x})\right) \cdot (x_h^2+x_l^2)^{1/2} &{} j=l, \\ 0 &{} j = h, \\ x_j &{} otherwise. \end{array} \right. \end{aligned}$$

      Obviously, \(\mathbf {z}\in S_2(\mathbf {x})\), and in addition \(\nabla f(\mathbf {x})^T (\mathbf {z}-\mathbf {x}) > 0\) since

      $$\begin{aligned} \begin{array}{lllc} \nabla f(\mathbf {x})^T(\mathbf {z} - \mathbf {x}) &{} = &{} \nabla _l f(\mathbf {x}) \cdot \text {sgn}\left( \nabla _l f(\mathbf {x})\right) \cdot (x_h^2+x_l^2)^{1/2} \\ &{}&{} -\nabla _l f(\mathbf {x})\cdot x_l - \nabla _h f(\mathbf {x})\cdot x_h &{} \\ &{} =&{} \big |\nabla _l f(\mathbf {x})\big |\cdot \left( (x_h^2+x_l^2)^{1/2}-|x_l|\right) &{} \left( \nabla _h f(\mathbf {x})=0,~\nabla _l f(\mathbf {x})\cdot x_l\ge 0\right) \\ &{} > &{} 0, &{} \end{array} \end{aligned}$$

      where the last inequality holds since \(x_h\ne 0\) and \(\nabla _l f(\mathbf {x})\ne 0\) (indeed, \(|I_\ge (\nabla f(\mathbf {x}),s)|<s\) implies \(\Vert \nabla f(\mathbf {x})\Vert _0<s\), so that \(I_\ge (\nabla f(\mathbf {x}),s)=I_1(\nabla f(\mathbf {x}))\)).
    1.2 If \(|I_\ge (\nabla f(\mathbf {x}),s)|\ge s\), then there is some \(l\in I_\ge (\nabla f(\mathbf {x}),s)\) such that \(l \notin I_1(\mathbf {x})\). Otherwise, \(I_\ge (\nabla f(\mathbf {x}),s)\subseteq I_1(\mathbf {x})\), and since \(|I_\ge (\nabla f(\mathbf {x}),s)|\ge s\) and \(|I_1 (\mathbf {x})|\le s\), we have that \(I_\ge (\nabla f(\mathbf {x}),s)=I_1 (\mathbf {x})\), contradicting our assumption that \(I_1(\mathbf {x}) \not \subseteq I_\ge (\nabla f(\mathbf {x}),s)\). We will define \(\mathbf {z}\) as:

      $$\begin{aligned} j = 1,\dots ,n \qquad z_j := \left\{ \begin{array}{c@{\quad }l} \text {sgn}\left( \nabla _l f(\mathbf {x})\right) \cdot |x_h| &{} j=l, \\ 0 &{} j = h, \\ x_j &{} otherwise. \end{array} \right. \end{aligned}$$

      Clearly, \(\mathbf {z}\in S_2(\mathbf {x})\). In addition, \(\nabla f(\mathbf {x})^T (\mathbf {z}-\mathbf {x}) > 0\) since:

      $$\begin{aligned} \begin{array}{lllc} \nabla f(\mathbf {x})^T(\mathbf {z} - \mathbf {x}) &{} = &{} \nabla _l f(\mathbf {x}) \cdot \text {sgn}\left( \nabla _l f(\mathbf {x})\right) \cdot |x_h| \\ &{}&{}- \nabla _h f(\mathbf {x})\cdot x_h &{} \\ &{} =&{} \big |\nabla _l f(\mathbf {x})\big |\cdot |x_h| - \big |\nabla _h f(\mathbf {x})\big |\cdot |x_h| &{} \left( \nabla _h f(\mathbf {x})\cdot x_h\ge 0\right) \\ &{} = &{} \left( \big |\nabla _l f(\mathbf {x})\big | - \big |\nabla _h f(\mathbf {x})\big | \right) \cdot |x_h| > 0, &{} \end{array} \end{aligned}$$

      where the last inequality holds since \(x_h\ne 0\) and the indices l and h are such that \(l\in I_\ge (\nabla f(\mathbf {x}),s)\) and \(h\in I_<(\nabla f(\mathbf {x}),s)\), thus according to Lemma 2.1 (part 3) \(\big |\nabla _l f(\mathbf {x})\big |>\big |\nabla _h f(\mathbf {x})\big |\).

  2. \(I_1(\mathbf {x}) \subseteq I_\ge (\nabla f(\mathbf {x}),s)\). Now we will consider the following subcases:

    2.1 If \(I_1(\mathbf {x}) \subseteq T\) for some \(T \in R_s(\nabla f(\mathbf {x}))\), then, since \(\mathbf {x} \notin X^*(\nabla f(\mathbf {x}),s)\), it follows from Corollary 3.1 that there exist indices \(h,l\in T\) such that the optimal solution \(\hat{\mathbf {x}}\in \mathbb {R}^2\) of \((\text {2-QCLP}_{\{h,l\}})\) (invoked with \(\rho =1\), \(\mathbf {q}=\nabla f(\mathbf {x})\) and \(\tilde{\mathbf {x}}=\mathbf {x}\)) satisfies

      $$\begin{aligned} \nabla _{\{h,l\}} f(\mathbf {x})^T\hat{\mathbf {x}} > \nabla _{\{h,l\}} f(\mathbf {x})^T\mathbf {x}_{\{h,l\}}. \end{aligned}$$
      (8)

      Since \(|T|\le s\) and \(\Vert \hat{\mathbf {x}}\Vert _2^2 \le 1 - \sum _{i\ne h,l} x_i^2\), the vector

      $$\begin{aligned} j = 1,\dots ,n \qquad z_j := \left\{ \begin{array}{c@{\quad }l} \hat{x}_1, &{} j=h, \\ \hat{x}_2, &{} j=l, \\ x_j, &{} otherwise, \end{array} \right. \end{aligned}$$

      is in \(S_2(\mathbf {x})\), and satisfies by (8) that \(\nabla f(\mathbf {x})^T(\mathbf {z}-\mathbf {x})>0\).

    2.2 If \(I_1(\mathbf {x}) \not \subseteq T\) for all \(T \in R_s(\nabla f(\mathbf {x}))\), then:

      • Take \(h\in I_1(\mathbf {x})\) such that \(h\notin T\) for some \(T \in R_s(\nabla f(\mathbf {x}))\). Since \(I_1(\mathbf {x})\subseteq I_{\ge }(\nabla f(\mathbf {x}),s)\), it follows that \(h\in I_{\ge }(\nabla f(\mathbf {x}),s)\). Moreover, since \(I_{>}(\nabla f(\mathbf {x}),s)\subseteq T\) and \(h\notin T\), we have that \(h\notin I_{>}(\nabla f(\mathbf {x}),s)\), implying that \(h\in I_{=}(\nabla f(\mathbf {x}),s)\). Thus, \(h\in I_{=}(\nabla f(\mathbf {x}),s)\cap I_1(\mathbf {x})\).

      • \(I_>(\nabla f(\mathbf {x}),s)\not \subseteq I_1(\mathbf {x})\). To show this, note that otherwise, \(I_>(\nabla f(\mathbf {x}),s)\subseteq I_1(\mathbf {x})\), and since \(I_1(\mathbf {x}) \subseteq I_\ge (\nabla f(\mathbf {x}),s)\) and \(|I_1(\mathbf {x})|\le s\), we obtain that \(I_>(\nabla f(\mathbf {x}),s)\subseteq I_1(\mathbf {x})\subseteq I_\ge (\nabla f(\mathbf {x}),s)\) with \(|I_1(\mathbf {x})|\le s\), implying that \(I_1(\mathbf {x})\subseteq T\) for some \(T\in R_s(\nabla f(\mathbf {x}))\), in contradiction to our assumption. Thus, there exists some \(l\in I_>(\nabla f(\mathbf {x}),s)\) such that \(l\notin I_1(\mathbf {x})\).

      Define \(\mathbf {z}\) as:

      $$\begin{aligned} j = 1,\dots ,n \qquad z_j := \left\{ \begin{array}{c@{\quad }l} \text {sgn}\left( \nabla _l f(\mathbf {x})\right) \cdot |x_h|, &{} j=l, \\ 0, &{} j = h, \\ x_j, &{} otherwise. \end{array} \right. \end{aligned}$$

      Clearly, \(\mathbf {z}\in S_2(\mathbf {x})\). Furthermore, \(\nabla f(\mathbf {x})^T (\mathbf {z}-\mathbf {x}) > 0\) since

      $$\begin{aligned} \begin{array}{lllc} \nabla f(\mathbf {x})^T(\mathbf {z} - \mathbf {x}) &{} = &{} \nabla _l f(\mathbf {x})\cdot \text {sgn}\left( \nabla _l f(\mathbf {x})\right) \cdot |x_h| &{} \\ &{}&{} - \nabla _h f(\mathbf {x})\cdot x_h &{} \\ &{} =&{} \big |\nabla _l f(\mathbf {x})\big |\cdot |x_h| - \big |\nabla _h f(\mathbf {x})\big |\cdot |x_h| &{} \left( \nabla _h f(\mathbf {x})\cdot x_h\ge 0\right) \\ &{} = &{} \left( \big |\nabla _l f(\mathbf {x})\big | - \big |\nabla _h f(\mathbf {x})\big | \right) \cdot |x_h| > 0, &{} \end{array} \end{aligned}$$

      where the last inequality holds since \(x_h\ne 0\) and the indices l and h are such that \(l\in I_>(\nabla f(\mathbf {x}),s)\) and \(h\in I_=(\nabla f(\mathbf {x}),s)\), and thus according to Lemma 2.1 (part 3) \(\big |\nabla _l f(\mathbf {x})\big |>\big |\nabla _h f(\mathbf {x})\big |\).

We have thus arrived at a contradiction, and the desired implication is established. \(\square \)

In order to show that the reverse implication is not valid, that is, that co-stationary points are not necessarily CW-maximal points, we present an example of a problem instance and a co-stationary point that is not a CW-maximum point.

Example 3.1

For any \(n>s>0\), we consider problem (SPCA) with a diagonal matrix \(\mathbf {A}\), whose entries on the main diagonal are given by the vector \(\mathbf {a}\) defined by

$$\begin{aligned} \mathbf {a}:=\left( \begin{matrix} 2\cdot \mathbf {1}_{n-s} \\ 0.5 \cdot \mathbf {1}_{s} \end{matrix}\right) , \end{aligned}$$

where for a given positive integer m, \(\mathbf {1}_m\) and \(\mathbf {0}_m\) are the vectors of size m with all entries equal to one or zero, respectively. We also define

$$\begin{aligned} \mathbf {x}:=\left( \begin{array}{l} \mathbf {0}_{n-s} \\ s^{-0.5}\cdot \mathbf {1}_{s} \end{array}\right) \qquad \text {and} \qquad \tilde{\mathbf {x}}:=\left( \begin{array}{c} \mathbf {0}_{n-s-1} \\ s^{-0.5} \\ 0\\ s^{-0.5}\cdot \mathbf {1}_{s-1} \end{array}\right) . \end{aligned}$$

It is easy to see that \(\mathbf {x},\tilde{\mathbf {x}}\in S\) and that \(\mathbf {A}\succ {\mathbf {0}}\), since it is a diagonal matrix with positive diagonal elements. The gradient of f is given by:

$$\begin{aligned} \nabla f(\mathbf {x}) = 2\mathbf {A}\mathbf {x} = \left( \begin{array}{c} \mathbf {0}_{n-s}\\ s^{-0.5}\cdot \mathbf {1}_s \end{array}\right) . \end{aligned}$$

For any \(\mathbf {v}\in S\):

$$\begin{aligned} \langle \nabla f(\mathbf {x}),\mathbf {v} - \mathbf {x} \rangle= & {} \sum _{i=n-s+1}^{n} s^{-0.5}(v_i-s^{-0.5}) \\= & {} s^{-0.5} \left( \sum _{i=n-s+1}^{n} v_i -s^{0.5}\right) \le s^{-0.5} \left( \Vert \mathbf {v}\Vert _1 -s^{0.5}\right) \le 0, \end{aligned}$$

where the last inequality holds since \(\Vert \mathbf {v}\Vert _1\le \sqrt{\Vert \mathbf {v}\Vert _0}\Vert \mathbf {v}\Vert _2 \le \sqrt{s}\). Hence, \(\mathbf {x}\) is co-stationary. The vector \(\tilde{\mathbf {x}}\) satisfies \(\tilde{\mathbf {x}} \in S_2(\mathbf {x})\) and since:

$$\begin{aligned} \begin{array}{l@{\quad }l} f(\tilde{\mathbf {x}}) = \tilde{\mathbf {x}}^T\mathbf {A}\tilde{\mathbf {x}} = (s-1)\cdot (2s)^{-1}+2s^{-1} &{}= (s+3)\cdot (2s)^{-1}\\ &{}> s\cdot (2s)^{-1} = \mathbf {x}^T\mathbf {A}\mathbf {x} = f(\mathbf {x}), \end{array} \end{aligned}$$

it follows that \(\mathbf {x}\) is not a CW-maximum point.
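
The example can be verified numerically; the following short NumPy script (ours, with the illustrative choice \(n=10\), \(s=4\)) reproduces the computations above:

```python
import numpy as np

n, s = 10, 4                                         # illustrative sizes
a = np.concatenate([2.0 * np.ones(n - s), 0.5 * np.ones(s)])
A = np.diag(a)

x = np.concatenate([np.zeros(n - s), np.ones(s) / np.sqrt(s)])
x_tilde = np.concatenate([np.zeros(n - s - 1), [1.0 / np.sqrt(s)], [0.0],
                          np.ones(s - 1) / np.sqrt(s)])

f = lambda v: v @ A @ v
g = 2.0 * A @ x                                      # gradient at x

# Co-stationarity of x: max_{v in S} <g, v> = ||g_T||_2 (Lemma 3.2) equals <g, x>.
top = np.sort(np.abs(g))[::-1][:s]
print(np.linalg.norm(top), g @ x)                    # both equal 1.0

# x_tilde lies in S_2(x) and attains a strictly larger objective value.
print(np.count_nonzero(x_tilde - x))                 # 2
print(f(x_tilde), f(x))                              # (s+3)/(2s) = 0.875 > 0.5
```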

3.3 Support Optimality

Theorem 3.1 establishes the relationship between the two stationarity conditions considered up to this point: co-stationarity and CW-maximality. A third condition, proposed in [4], that we will refer to as support optimality (SO), is given in the following definition.

Definition 3.1

(Support Optimality) A vector \(\mathbf {x}^*\in S\) is called a support optimal (SO) point of (P) with respect to an index set \(T \subseteq \{1,2,\ldots ,n\}\) if and only if it is an optimal solution of the optimization problem

$$\begin{aligned} \max \left\{ f(\mathbf {x}) : \Vert \mathbf {x}\Vert _2\le 1,~I_1(\mathbf {x})\subseteq T \right\} . \end{aligned}$$
(SO)

It is clear that, if \(\mathbf {x} \in S\) is an optimal solution of problem (P), then it must be an SO point of (P) with respect to any index set T satisfying \(|T| \le s\) and \(I_1(\mathbf {x}) \subseteq T\). In that respect, support optimality is a necessary optimality condition for problem (P). It is a remarkably weak condition and cannot be used exclusively to derive a reasonable algorithm. Nevertheless, it is not totally futile. In order to enhance the performance, the CW-based algorithms that will be presented in Sect. 4 will produce a sequence of SO points, and in Sect. 5, we will adopt the variational re-normalization strategy suggested in [4], stating that for each sparse solution obtained by any technique, it is reasonable to replace this solution with the SO point that corresponds to the same support.

We will conclude this section with an example that demonstrates the potential benefit of employing algorithms that produce a point that satisfies stronger necessary optimality conditions. Consider the pit-prop data, which consists of 13 variables measuring various physical properties of 180 pitprops. This dataset was suggested originally in [21], and since then has been extensively used as a benchmark example for sparse PCA; see, for example, [4, 8, 15]. The problem has 13 variables and we consider a sparsity level of \(s=4\). Note that we can list all the \({13 \atopwithdelims ()4}=715\) SO points that correspond to index sets with exactly 4 indices, and the optimal solution must be one of these 715 points. Out of this set of points, 28 satisfy the co-stationarity condition and only 2 satisfy the CW-maximality condition. Table 1 presents the support sets of each of the co-stationary points along with their function values.

Table 1 The supports of the co-stationary points for the pit-prop data

Since the number of CW-maximum points is significantly smaller than the number of co-stationary points, it is much more probable that the optimal solution will be found by an algorithm that produces CW-maximum points than by an algorithm that produces co-stationary points.

4 Algorithms

In this section, we will present two CW-based algorithms, GCW and PCW, that are guaranteed to converge after a finite number of iterations to a CW-maximum point. Later on, in Sect. 5, we will demonstrate the superiority of these algorithms over methods that are based on the co-stationarity optimality condition, such as the conditional gradient algorithm with unit step size (ConGradU) that was suggested in [16], where it was also proven that limit points of the sequence generated by ConGradU are co-stationary points.

In [18], several algorithms that produce a CW-minimum point were considered. These block coordinate descent type algorithms perform at each iteration an optimization step with respect to one or two variables, while keeping the rest fixed. The coordinates to be altered are chosen either as the ones that produce the maximal decrease among all possible alternatives, or by applying an index selection strategy based on local first-order information. We adopt this approach and present similar algorithms for the sparse PCA problem.

At each iteration of a CW-based algorithm applied to (P), at most two variables are updated. We can categorize each of the iterations according to whether the support is altered or not. Block coordinate algorithms suffer from a major drawback – a slow convergence rate. In order to reduce the effect of this undesirable characteristic, we will replace the point obtained at each step with an SO point that corresponds to the same support. This modification allows us to bypass the large number of iterations that would otherwise be devoted to optimizing the variables over a fixed support.

Below we present the Greedy CW (GCW) algorithm. We denote by \(\mathcal {O}(T)\) an oracle that produces an SO point with respect to a given support T by solving problem (SO). We will refer to this oracle as an SO oracle. In the specific case of the PCA problem, the SO oracle amounts to finding a normalized principal eigenvector of a submatrix of the covariance matrix. However, finding the maximum of a general convex function f over a unit ball is in principle a difficult task. We will assume that the solution produced by the oracle is uniquely defined by T. In addition, note that the oracle outputs an optimal solution of a problem consisting of maximizing a convex function over a compact and convex feasible set, and hence, by [20, Corollary 32.3.2], there exists an optimal solution of the problem which is an extreme point. In particular, this means that we can assume without any loss of generality that the oracle outputs a vector with norm 1. This assumption will be made from now on.
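
For the (SPCA) objective, the SO oracle amounts to an eigenvalue computation on a submatrix, as noted above; the following NumPy sketch (ours, not the accompanying MATLAB implementation) is one possible realization:

```python
import numpy as np

def so_oracle(A, T):
    """SO oracle O(T) for f(x) = x^T A x: a unit-norm principal eigenvector of
    the submatrix A[T, T], embedded back into R^n (zero outside T)."""
    T = np.asarray(sorted(T), dtype=int)
    x = np.zeros(A.shape[0])
    if T.size:
        w, V = np.linalg.eigh(A[np.ix_(T, T)])   # ascending eigenvalues
        x[T] = V[:, -1]                          # eigenvector of the largest one
    return x
```

The output has unit norm and support contained in T, in line with the assumption made above.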

[Algorithm box: the Greedy CW (GCW) algorithm.]
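
Since the original algorithm box is not reproduced above, the following Python sketch (ours) reconstructs the GCW steps from the surrounding description and the proof of Theorem 4.1: Step 1 performs greedy forward selection while the support is not full, and Step 2 evaluates \(f_{i,j}\), the value of the SO point on the swapped support \((I_1(\mathbf {x}^k){\setminus }\{i\})\cup \{j\}\), for every possible swap and accepts the best improving one. It should be read as a sketch under these assumptions, not as the authors' exact listing or their MATLAB code.

```python
import numpy as np

def so_oracle(A, T):
    """Unit-norm principal eigenvector of A restricted to the support T."""
    T = np.asarray(sorted(T), dtype=int)
    x = np.zeros(A.shape[0])
    if T.size:
        x[T] = np.linalg.eigh(A[np.ix_(T, T)])[1][:, -1]
    return x

def gcw(A, s, T0, tol=1e-10):
    """Sketch of the Greedy CW (GCW) idea for (SPCA), reconstructed from the text."""
    n = A.shape[0]
    f = lambda v: v @ A @ v
    T = set(int(i) for i in T0)
    x = so_oracle(A, T)
    while True:
        # Step 1: greedy forward selection while the support is not full.
        while len(T) < s:
            val, j = max((f(so_oracle(A, T | {j})), j) for j in set(range(n)) - T)
            if val <= f(x) + tol:
                break
            T |= {j}
            x = so_oracle(A, T)
        # Step 2: evaluate f_{i,j} for every possible swap and take the best one.
        swaps = [(f(so_oracle(A, (T - {i}) | {j})), i, j)
                 for i in T for j in set(range(n)) - T]
        if not swaps:
            return x
        val, i, j = max(swaps)
        if val <= f(x) + tol:
            return x        # no improving swap: the output is a CW-maximum point
        T = (T - {i}) | {j}
        x = so_oracle(A, T)
```

A typical call is gcw(A, s, T0) with T0 taken as the support of the thresholded principal eigenvector, as in the experiments of Sect. 5.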

Step 1 of the GCW algorithm is in fact the greedy forward selection method proposed in [4]. Hence, in some sense, the GCW method is a generalization of this method that does not terminate the moment a solution with a full support is obtained. However, from a more practical point of view, this resemblance is irrelevant, due to the fact that, if the initial support satisfies \(|T|=s\), then the condition \(\Vert \mathbf {x}^k\Vert _0<s\) will probably be false for all k in any reasonable practical scenario.

The following theorem summarizes the key properties of the GCW algorithm.

Theorem 4.1

Let \(\{\mathbf {x}^k \}\) be the sequence generated by the GCW algorithm. Then, the following statements hold.

(i) The sequence of function values \(\{f(\mathbf {x}^k)\}\) is monotonically increasing.

(ii) The algorithm terminates after a finite number of iterations.

(iii) At termination, the algorithm produces a CW-maximum point.

Proof

Part (i) follows immediately from the description of the GCW algorithm. Part (ii) is a consequence of the monotonicity of the algorithm (part (i)) and the fact that it only passes through SO points, of which there are only finitely many under the standing assumption that the solution produced by the oracle \(\mathcal {O}(T)\) is uniquely defined by T.

To prove (iii), consider the following partition of \(S_2(\mathbf {x})\):

$$\begin{aligned} S_2(\mathbf {x})= & {} \{\mathbf {z}:\Vert \mathbf {z}-\mathbf {x}\Vert _0\le 2,\mathbf {z}\in S\}\\= & {} S_2^0(\mathbf {x}) \cup S_2^1(\mathbf {x}) \cup S_2^2(\mathbf {x}),\\ \end{aligned}$$

where

$$\begin{aligned} S_2^0(\mathbf {x})&= \{\mathbf {z} \in S:\Vert \mathbf {z}-\mathbf {x}\Vert _0\le 2,~I_1(\mathbf {z})\subseteq I_1(\mathbf {x}) \},\\ S_2^1(\mathbf {x})&= \{\mathbf {z} \in S:\Vert \mathbf {z}-\mathbf {x}\Vert _0\le 2,~I_1(\mathbf {z})=I_1(\mathbf {x})\cup \{j\},~ j\in I_0(\mathbf {x})\}, \\ S_2^2(\mathbf {x})&= \{\mathbf {z} \in S:\Vert \mathbf {z}-\mathbf {x}\Vert _0\le 2,~I_1(\mathbf {z})=(I_1(\mathbf {x}) {\setminus } \{i\}) \cup \{j\},~i\in I_1(\mathbf {x}),~j\in I_0(\mathbf {x}) \}, \end{aligned}$$

and assume that the algorithm produced the point \(\bar{\mathbf {x}}\). Since \(\bar{\mathbf {x}}\) is an SO point and \(S_2^0(\bar{\mathbf {x}})\subseteq \{\mathbf {x}:\Vert \mathbf {x}\Vert _2 \le 1,I_1(\mathbf {x})\subseteq I_1(\bar{\mathbf {x}})\}\), it follows that \(f(\bar{\mathbf {x}})\ge f(\mathbf {x})\) for any \(\mathbf {x}\in S_2^0(\bar{\mathbf {x}})\). Now, note that the algorithm terminates only if after performing Step 2 we obtain that for any \(i\in I_1(\bar{\mathbf {x}})\) and \(j\in I_0(\bar{\mathbf {x}})\)

$$\begin{aligned} f(\bar{\mathbf {x}})\ge f_{i,j} = \max \left\{ f(\mathbf {x}) : \Vert \mathbf {x}\Vert _2\le 1,~I_1(\mathbf {x})\subseteq (I_1(\bar{\mathbf {x}}){\setminus }\{i\})\cup \{j\} \right\} , \end{aligned}$$

where the equality is due to the fact that the maximum of a convex function over a compact and convex set is attained at an extreme point, see [20, Corollary 32.3.2], so that the norm-one output of the SO oracle attains this maximum. Thus, \(f(\bar{\mathbf {x}})\ge f(\mathbf {x})\) for any \(\mathbf {x}\in S_2^2(\bar{\mathbf {x}})\). This is enough for proving that \(\bar{\mathbf {x}}\) is CW-maximal in the case when \(\Vert \bar{\mathbf {x}}\Vert _0=s\), since in this case \(S_2^1(\bar{\mathbf {x}})=\emptyset \). If \(\Vert \bar{\mathbf {x}}\Vert _0<s\), then prior to entering Step 2, Step 1 must be performed. This step is terminated only if \(f(\bar{\mathbf {x}})\ge f(\mathbf {x})\) for any

$$\begin{aligned} \mathbf {x}\in \{\mathbf {z} \in S:I_1(\mathbf {z})=I_1(\bar{\mathbf {x}})\cup \{j\}, j\in I_0(\bar{\mathbf {x}})\}, \end{aligned}$$

and since \(S_2^1(\bar{\mathbf {x}})\subseteq \{\mathbf {z} \in S:I_1(\mathbf {z})=I_1(\bar{\mathbf {x}})\cup \{j\}, j\in I_0(\bar{\mathbf {x}})\}\), it implies that \(f(\bar{\mathbf {x}})\ge f(\mathbf {x})\) for any \(\mathbf {x}\in S_2^1(\bar{\mathbf {x}})\), concluding that \(f(\bar{\mathbf {x}})\ge f(\mathbf {x})\) for any \(\mathbf {x}\in S_2(\bar{\mathbf {x}})\). \(\square \)

Practically, if the initial support T satisfies \(|T|=s\), then most of the computation time in the GCW method is consumed in computing \(f_{i,j}\) for each possible swap. This observation encourages us to consider the following variation of GCW, which we name the Partial CW (PCW) algorithm.

[Algorithm box: the Partial CW (PCW) algorithm.]
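
Similarly, the PCW listing is only summarized here. The sketch below is a rough, hypothetical rendering of the partial idea: the specific index-selection rule (choosing j maximizing \(|\nabla _j f(\mathbf {x})|\) outside the support and i minimizing \(|x_i|\) inside it) is our assumption of a cheap first-order rule and is not taken from the paper; only the full scan over all \(f_{i,j}\) before termination follows the description that follows. The sketch further assumes an initial support of size exactly s with \(s<n\).

```python
import numpy as np

def so_oracle(A, T):
    """Unit-norm principal eigenvector of A restricted to the support T."""
    T = np.asarray(sorted(T), dtype=int)
    x = np.zeros(A.shape[0])
    if T.size:
        x[T] = np.linalg.eigh(A[np.ix_(T, T)])[1][:, -1]
    return x

def pcw(A, s, T0, tol=1e-10):
    """Rough sketch of the Partial CW (PCW) idea; the candidate-selection rule
    below is an ASSUMPTION, not the authors' exact rule."""
    n = A.shape[0]
    f = lambda v: v @ A @ v
    T = set(int(i) for i in T0)          # assumed: |T0| = s and s < n
    x = so_oracle(A, T)
    while True:
        g = 2.0 * A @ x                  # first-order information at x
        outside = sorted(set(range(n)) - T)
        j = outside[int(np.argmax(np.abs(g)[outside]))]   # assumed selection of j
        i = min(T, key=lambda k: abs(x[k]))               # assumed selection of i
        cand = (T - {i}) | {j}
        if f(so_oracle(A, cand)) > f(x) + tol:
            T, x = cand, so_oracle(A, cand)               # cheap partial step
            continue
        # Full scan over all f_{i,j} before termination, as described in the text.
        swaps = [(f(so_oracle(A, (T - {k}) | {l})), k, l)
                 for k in T for l in set(range(n)) - T]
        val, k, l = max(swaps)
        if val <= f(x) + tol:
            return x                     # certified CW-maximum point (full support)
        T = (T - {k}) | {l}
        x = so_oracle(A, T)
```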

Before termination, PCW performs the computation of all possible \(f_{i,j}\), thus assuring convergence to a CW-maximum point, provided that the output has full support. In the general step, the amount of computation decreases significantly, at the expense of no longer finding the indices that provide the maximal increase in the function value. Nevertheless, the empirical study suggests that PCW provides results similar to those of GCW with respect to function values in a fraction of the time, as demonstrated in Sect. 5.

5 Numerical Results

We will illustrate the effectiveness of the algorithms proposed in the previous section on simulated data and on gene expression datasets. We compared the results with the following alternative algorithms: the novel \(l_0\)-constrained version of ConGradU [16], the expectation maximization algorithm [11], the approximate greedy algorithm [6] and thresholding [5]. The MATLAB implementation of ConGradU was kindly provided by the authors; for all the other alternative algorithms, we used a MATLAB implementation available on the authors' web-pages. For the thresholding algorithm and the algorithms proposed in this paper, we used a MATLAB implementation, which is available at the following URL: https://web.iem.technion.ac.il/images/user-files/becka/papers/CW_PCA.zip.

Whenever an initialization is required, we set the initial point to be the solution of the thresholding method. Regarding the output, we adopt the variational renormalization strategy suggested in [4]. Hence, for each of the algorithms, we extracted the sparsity pattern (the set of indices of the nonzero elements). The actual output vector is then set to \(\mathcal {O}(T)\), where T is the generated sparsity pattern. The experiments were conducted on a PC with a 3.40GHz processor and 16GB RAM.
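
As an illustration, the following NumPy sketch (ours; the reported experiments themselves were run in MATLAB) combines the thresholding initialization with the variational renormalization described above:

```python
import numpy as np

def thresholding_init(A, s):
    """Thresholding [5] followed by variational renormalization [4]: take the
    support T of the s largest (in absolute value) entries of the principal
    eigenvector of A and output the SO point O(T)."""
    w, V = np.linalg.eigh(A)
    u = V[:, -1]                                   # principal eigenvector of A
    T = np.sort(np.argsort(np.abs(u))[::-1][:s])   # sparsity pattern
    x = np.zeros(A.shape[0])
    x[T] = np.linalg.eigh(A[np.ix_(T, T)])[1][:, -1]   # O(T)
    return x
```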

5.1 Random Data

The covariance matrix \(\mathbf {A}\) is given by \(\mathbf {A}=\mathbf {D}^T \mathbf {D}\), where \(\mathbf {D}\) is the so-called "data matrix". Each entry in the data matrix was randomly generated according to the Gaussian distribution with zero mean and variance 1/m (\(D_{i,j}\sim \mathcal {N}(0,1/m)\)). We considered data matrices with \(n=2000, 5000, 10{,}000\) and 50,000 variables. The number of observations is set to \(m=150\) for all matrices. The sparsity levels considered are \(s=5,10,\dots ,250\), and for each sparsity level we generated 100 realizations. We measure the effectiveness of the algorithms according to the average proportion of variability explained by the obtained solution relative to the largest eigenvalue of the data covariance matrix (i.e., \(\mathbf {x}^T\mathbf {A}\mathbf {x}/\lambda _1(\mathbf {A})\), where \(\mathbf {x}\) is the solution and \(\lambda _1(\mathbf {A})\) is the largest eigenvalue of \(\mathbf {A}\)).
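
A minimal NumPy sketch (ours, with illustrative function names) of this data-generation scheme and of the explained-variability measure:

```python
import numpy as np

def random_instance(n, m=150, seed=0):
    """Covariance A = D^T D of a data matrix D with i.i.d. N(0, 1/m) entries."""
    rng = np.random.default_rng(seed)
    D = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    return D.T @ D

def explained_proportion(A, x):
    """Proportion of explained variability: x^T A x / lambda_1(A)."""
    return (x @ A @ x) / np.linalg.eigvalsh(A)[-1]
```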

5.1.1 GCW Versus PCW

First, we would like to compare the effectiveness and performance of the CW-based algorithms proposed in the previous section: GCW and PCW. We conducted the comparison based on data matrices with 2000 variables, and the results are given in Fig. 1.

Fig. 1 GCW versus PCW—the proportion of explained variability is given in the left figure and the computation time is given in the right one. The plots in both figures are given as a function of the sparsity level

Fig. 2 PCW versus others—the proportion of explained variability as a function of the sparsity level for \(n=5000\), 10,000 and 50,000 is given in the upper left, upper right and bottom figures, respectively

We can clearly see that both methods achieve similar results with respect to the function values, while PCW achieves these results in a fraction of the time. Thus, in the remaining numerical study, we will omit GCW. Although the partial version remarkably reduces the computation time, it is still not competitive for very large-scale problems when a full path of solutions is required. Thus, for such cases, we will also examine the effect of initializing PCW with the solution of the previous run (with the smaller sparsity level), and we will refer to such a continuation scheme as \(\text {PCW}_{cont}\).

5.1.2 PCW Versus Alternative Methods

We will now compare the effectiveness and performance of PCW with respect to the alternative algorithms mentioned earlier. The setting for this set of experiments is the same as the one described in the previous example, but with problems of \(n=5000\), 10,000 and 50,000 variables. Figure 2 provides the proportion of explained variability as a function of the sparsity level.

For small sparsity levels (\(<50\)), most of the algorithms provide similar results, but as the sparsity level increases, the CW algorithms become superior to all the other methods. This advantage is not achieved without a price. In Fig. 3, we provide the cumulative computation time of the algorithms (the cumulative time is considered since the approximate greedy algorithm provides a full set of solutions).

Fig. 3 PCW versus alternative methods—the cumulative computation times as a function of the sparsity level for \(n=5000\), 10,000 and 50,000 are given in the upper left, upper right and bottom figures, respectively. The SVD time is the time required for computing the principal eigenvector of the covariance matrix that corresponds to the generated data, which is used in order to find the thresholding solution and in order to initialize the CW and ConGradU algorithms

Even though PCW greatly decreases the computation time with respect to GCW, it still requires notably more computation time than the alternative algorithms. The scheme we referred to as \(\text {PCW}_{cont}\) achieves results similar to PCW with respect to the function value. Regarding the running time, this scheme is competitive with the EM algorithm and requires somewhat more computational effort than the ConGradU and approximate greedy algorithms, thus providing a reasonable approach when a full set of solutions is required.

5.2 Gene Expression Dataset

Sparse PCA is extensively utilized in the identification of the genes that reflect changes in gene expression patterns during different biological states, thus contributing to the diagnosis and research of certain diseases such as cancer. Figure 4 illustrates the proportion of explained variability and the cumulative running time for a Leukemia dataset [22]. This dataset is composed of the gene expression profiles of 72 patients, each with 12,582 genes. The dataset is normalized such that it has zero mean and unit variance.

Fig. 4 Leukemia gene expression data—the proportion of explained variability is given in the left figure and the cumulative computation time is given in the right one. The plots in both figures are given as a function of the sparsity level

Fig. 5 Gene expression data—the left figure illustrates, for each sparsity level, the proportion of datasets for which each algorithm obtained the best solution. The right figure illustrates, for each sparsity level, the mean error with respect to the best solution (the approximate greedy and thresholding algorithms were disregarded since both of them provide relatively poor results)

Most of the algorithms under consideration provide similar results with respect to the explained variability, which might indicate that this problem is, in a sense, rather easy to solve. We conducted similar experiments for 20 additional gene expression datasets from the GeneChip oncology database [23], which is publicly available at http://compbio.dfci.harvard.edu/compbio/tools/gcod. While, commonly, all the algorithms provided similar results, we can still see in Fig. 5 that PCW yields the best solution (with respect to the function value) more often than the alternative algorithms, and consequently it obtains the smallest mean error with respect to the best solution.

6 Conclusions

In this paper, we considered the problem of maximizing a continuously differentiable convex function over the intersection of an \(l_2\) unit ball and a sparsity constraint. We have shown that coordinate-wise maximality is a more restrictive condition than co-stationarity, which is the basis of many well-known methods for solving the sparse PCA problem. We introduced two algorithms (GCW and PCW) that are guaranteed to produce a CW-maximal solution and demonstrated empirically the potential benefit of using these algorithms over some common algorithms proposed for this problem.