1 Introduction

In the big data era, people often face intrinsically multi-dimensional, multi-modal, and multi-view data sets that are too complex to be processed and analyzed by traditional data mining tools based on vectors or matrices. Higher-order tensors (hypermatrices) can naturally represent such complex data sets. Tensor decomposition tools, developed for understanding tensor-based data sets, have shown their power in various fields such as signal processing, image processing, statistics, and machine learning; see the surveys [4, 5, 6, 8, 16, 29].

In high-dimensional data sets, another structure that cannot be ignored is sparsity. Examples of sparse tensors arise from clustering problems, online advertising, web link analysis, ranking of authors based on citations [17, 24, 26, 31, 32], and so on. Therefore, recent advances incorporate sparsity into tensor decomposition models and tensor-based statistics and machine learning problems [1, 3, 21, 26, 30, 32, 35, 38], just to name a few. As pointed out by Allen [1], introducing sparsity into tensor problems is desirable for feature selection, for compression, for statistical consistency, and for better visualization and interpretation of the data analysis results.

In (dense) tensor decomposition and related tensor models and problems, the best rank-1 approximation (BR1Approx) is one of the most important and fundamental problems [10, 28]. Just as its dense counterpart, the sparse tensor BR1Approx serves as a keystone in the computation of sparse tensor decomposition and related sparse tensor models [1, 26, 30, 32, 35]. Roughly speaking, the sparse tensor BR1Approx finds a projection of a given data tensor onto the set of sparse rank-1 tensors in the sense of Euclidean distance. This is equivalent to maximizing a multilinear function subject to both unit-sphere and \(\ell _0\) constraints; the mathematical models will be detailed in Section 2. Such a problem is also closely connected to the sparse tensor spectral norm defined in [30, 32].

For (dense) tensor BR1Approx, several methods have been proposed; e.g., power methods [10, 18], approximation algorithms [12, 13, 39], and convex relaxations [15, 25, 37]. For sparse matrix BR1Approx, solution methods have been studied extensively in the context of sparse PCA and sparse SVD [36, 40]; see, e.g., iterative methods [20, 36], approximation algorithms [2, 11], and semidefinite relaxation [9], just to name a few. For sparse tensor BR1Approx, in the context of sparse tensor decomposition, Allen [1] was the first in the literature to study models and iterative algorithms based on \(\ell _1\) regularization, whereas Sun et al. [32] developed \(\ell _0\)-based models and algorithms, and analyzed their statistical performance. Wang et al. [35] considered another \(\ell _1\) regularized model and algorithm, which is different from that of [1]. In the study of co-clustering, Papalexakis et al. [26] proposed alternating minimization methods for nonnegative sparse tensor BR1Approx.

For nonconvex and NP-hard problems, approximation algorithms are nevertheless desirable. However, in the context of sparse tensor BR1Approx problems, little attention has been paid to this type of algorithm. To fill this gap, by fully exploiting the multilinearity and sparsity of the model, we develop four polynomial-time approximation algorithms, some of which extend their matrix or dense tensor counterparts; in particular, the last algorithm, which is the most efficient one, is new even when reduced to the matrix or dense tensor cases. The proposed algorithms are easy to implement, and their computational complexity is low: the most expensive operation, when needed, is computing the leading singular vector pair of certain matrices. Therefore, the algorithms can serve as initialization procedures for iterative algorithms. Moreover, for each algorithm, we derive theoretically guaranteed approximation lower bounds. Experiments on synthetic as well as real data show the usefulness of the introduced algorithms.

The rest of this work is organized as follows. Sect. 2 introduces the sparse tensor BR1Approx model. Approximation algorithms are presented in Sect. 3. Numerical results are illustrated in Sect. 4. Section 5 draws some conclusions.

2 Sparse tensor best rank-1 approximation

Throughout this work, vectors are written as \(({\mathbf {x}},{\mathbf {y}},\ldots )\), matrices as \((A,B,\ldots )\), and tensors as \((\mathcal {A}, \mathcal {B}, \ldots )\). \({\mathbb {R}}^{n_1\times \cdots \times n_d}\) denotes the space of \(n_1\times \cdots \times n_d\) real tensors. For two tensors \({\mathcal {A}},{\mathcal {B}}\) of the same size, their inner product \(\langle {\mathcal {A}},{\mathcal {B}}\rangle \) is given by the sum of their entry-wise products. The Frobenius norm of \({\mathcal {A}}\) is defined by \(\Vert {\mathcal {A}}\Vert _F = \langle {\mathcal {A}},{\mathcal {A}}\rangle ^{1/2}\). \(\circ \) denotes the outer product; in particular, for \({\mathbf {x}}_j\in {\mathbb {R}}^{n_j}\), \(j=1,\ldots ,d\), \({\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\) denotes a rank-1 tensor in \( \mathbb R^{n_1\times \cdots \times n_d} \). \( \left\| {\mathbf {x}} \right\| _0 \) represents the number of nonzero entries of a vector \({\mathbf {x}}\), and \( \left\| {\mathbf {x}} \right\| \) denotes its Euclidean norm.

Given \({\mathcal {A}} \in \mathbb R^{n_1\times \cdots \times n_d} \) with \(d\ge 3\), the tensor BR1Approx consists of finding a scalar \(\lambda \) and vectors \({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_d\) such that

$$\begin{aligned} \min _{\lambda \in {\mathbb {R}},{\mathbf {x}}_j\in {\mathbb {R}}^{n_j},1\le j\le d}~ \left\| \lambda \cdot {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d- {\mathcal {A}} \right\| _F ^2~\mathrm{s.t.} \left\| {\mathbf {x}}_j \right\| = 1,~1\le j\le d, \end{aligned}$$
(2.1)

When \({\mathcal {A}}\) is sparse, it may be necessary to also take into account the sparsity of the latent factors \({\mathbf {x}}_j\), \(1\le j\le d\). Assume that the true sparsity level of each latent factor is known a priori, or can be estimated; then the sparse tensor BR1Approx can be modeled as follows [32]:

$$\begin{aligned} \min _{\lambda \in {\mathbb {R}}, {\mathbf {x}}_j\in {\mathbb {R}}^{n_j},1\le j\le d}~ \left\| \lambda \cdot {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d- {\mathcal {A}} \right\| _F ^2~\mathrm{s.t.} \left\| {\mathbf {x}}_j \right\| =1,~ \left\| {\mathbf {x}}_j \right\| _0 \le r_j,1\le j\le d, \end{aligned}$$
(2.2)

where \(1\le r_j\le n_j\) are positive integers standing for the sparsity level. Allen [1] and Wang et al. [35] proposed \(\ell _1\) regularized models for sparse tensor BR1Approx problems.

Since the \({\mathbf {x}}_j\)’s are normalized in (2.2), we have

$$\begin{aligned} \left\| \lambda \cdot {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d- {\mathcal {A}} \right\| _F ^2 = \lambda ^2 - 2\lambda \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle + \left\| {\mathcal {A}} \right\| _F ^2, \end{aligned}$$

minimizing which with respect to \(\lambda \) gives \(\lambda = \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle \), and so

$$\begin{aligned} \left\| \lambda \cdot {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d - {\mathcal {A}} \right\| _F ^2 = \left\| {\mathcal {A}} \right\| _F ^2 - \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle ^2. \end{aligned}$$

Due to the multilinearity of \( \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle \), maximizing \( \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle ^2\) is equivalent to maximizing \( \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle \) (the sign of the inner product can always be flipped by negating one factor \({\mathbf {x}}_j\) without affecting the constraints). Thus (2.2) can be equivalently recast as

$$\begin{aligned} \max _{{\mathbf {x}}_j\in {\mathbb {R}}^{n_j},1\le j\le d}~ \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle ~\mathrm{s.t.}~ \left\| {\mathbf {x}}_j \right\| =1,~ \left\| {\mathbf {x}}_j \right\| _0 \le r_j,~1\le j\le d. \end{aligned}$$
(2.3)

This is the main model to be focused on. When \(r_j=n_j\), (2.3) boils down exactly to the tensor singular value problem [19]. Thus (2.2) can be regarded as a sparse tensor singular value problem.

When \(r_j=n_j\), (2.3) is already NP-hard in general [14]; on the other hand, when \(d=2\) and \({\mathbf {x}}_1={\mathbf {x}}_2\), it is also NP-hard [22]. Therefore, we may deduce that (2.3) is also NP-hard, and its hardness comes from two sources: the multilinearity of the objective function and the sparsity constraints. In view of this, approximation algorithms for solving (2.3) are necessary.

In the rest of this work, to simplify notations, we denote

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}_1\cdots {\mathbf {x}}_d:= \left\langle {\mathcal {A}} , {\mathbf {x}}_1\circ \cdots \circ {\mathbf {x}}_d\right\rangle . \end{aligned}$$

In addition, we also use the following notation to denote the partial gradients of \({\mathcal {A}}{\mathbf {x}}_1 \cdots {\mathbf {x}}_d\) with respect to \({\mathbf {x}}_j\):

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}_1\cdots {\mathbf {x}}_{j-1}{\mathbf {x}}_{j+1}\cdots {\mathbf {x}}_d ~=~ \nabla _{{\mathbf {x}}_j}{\mathcal {A}}{\mathbf {x}}_1\cdots {\mathbf {x}}_d \in {\mathbb {R}}^{n_j},~1\le j\le d. \end{aligned}$$

For example, \(({\mathcal {A}}{\mathbf {x}}_2\cdots {\mathbf {x}}_d)_{i_1} = \sum ^{n_2,\ldots ,n_d}_{i_2=1,\ldots ,i_d=1}{\mathcal {A}}_{i_1i_2\cdots i_d}{\mathbf {x}}_{2,i_2}\cdots {\mathbf {x}}_{d,i_d}\), where we write \({\mathbf {x}}_j := [x_{j,1},\ldots ,x_{j,n_j}]^\top \). The partial Hessian of \({\mathcal {A}}{\mathbf {x}}_1\cdots {\mathbf {x}}_d\) with respect to \({\mathbf {x}}_{d-1}\) and \({\mathbf {x}}_d\) is denoted as:

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}_1\cdots {\mathbf {x}}_{d-2} = \nabla _{{\mathbf {x}}_{d-1},{\mathbf {x}}_d}{\mathcal {A}}{\mathbf {x}}_1\cdots {\mathbf {x}}_d \in {\mathbb {R}}^{n_{d-1}\times n_d}, \end{aligned}$$

with \(({\mathcal {A}}{\mathbf {x}}_1\cdots {\mathbf {x}}_{d-2})_{i_{d-1},i_d} = \sum ^{n_1,\ldots ,n_{d-2}}_{i_1=1,\ldots ,i_{d-2}=1}{\mathcal {A}}_{i_1\cdots i_{d-1}i_d} {\mathbf {x}}_{1,i_1}\cdots {\mathbf {x}}_{d-2,i_{d-2}}\).
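To make the contraction notation concrete, the following minimal Matlab snippet (ours, for illustration only; all variable names are assumptions) evaluates the multilinear form, a partial gradient, and a partial Hessian for a third-order tensor using only reshape and matrix products.

```matlab
% Illustration of the contraction notation for d = 3 (variable names are ours)
n1 = 4; n2 = 5; n3 = 6;
A  = randn(n1, n2, n3);
x1 = randn(n1, 1); x2 = randn(n2, 1); x3 = randn(n3, 1);

H   = reshape(reshape(A, n1, n2*n3)' * x1, n2, n3);  % partial Hessian A x1 (an n2-by-n3 matrix)
g2  = H * x3;                                        % partial gradient A x1 x3 (an n2-vector)
val = x2' * g2;                                      % multilinear form <A, x1 o x2 o x3>
```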

3 Approximation algorithms and approximation bounds

Four approximation algorithms are proposed in this section, all of which admit theoretical lower bounds. We first present some preparations used in this section. For any nonzero \(\mathbf {a}\in {\mathbb {R}}^n\), \(1\le r\le n\), denote \( \left[ \mathbf {a} \right] ^{\downarrow ,r} \in {\mathbb {R}}^n\) as a truncation of \({\mathbf {a}}\) as

$$\begin{aligned} \left[ {\mathbf {a}} \right] ^{\downarrow ,r} _i= \left\{ \begin{array}{lr} a_i, &{} \mathrm{if}~| a_i|\mathrm{~ is ~one~of~the~} r~\mathrm{largest~(in~magnitude)~entries~of~} {\mathbf {a}}, \\ 0, &{} \mathrm{otherwise}. \end{array} \right. \end{aligned}$$

In particular, if \( a_{i_1}\), \( a_{i_2}\), \( a_{i_3},\ldots \) are respectively the r-, \((r+1)\)-, \((r+2)\)-, \(\ldots \) largest entries (in magnitude) with \(i_1<i_2<i_3<\cdots \), and \(| a_{i_1}|=| a_{i_2}| = |a_{i_3}| = \cdots \), then we set \( \left[ {\mathbf {a}} \right] ^{\downarrow ,r} _{i_1}= a_{i_1}\) and \( \left[ {\mathbf {a}} \right] ^{\downarrow ,r} _{i_2} = \left[ {\mathbf {a}} \right] ^{\downarrow ,r} _{i_3}=\cdots =0\). Thus \( \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \) is uniquely defined. We can see that \( \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \) is a best r-approximation to \({\mathbf {a}}\) [20, Proposition 4.3], i.e.,

$$\begin{aligned} \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \in \arg \min _{ \left\| {\mathbf {x}} \right\| _0 \le r} \left\| {\mathbf {x}}-{\mathbf {a}} \right\| ~~\Leftrightarrow ~~\frac{ \left[ {\mathbf {a}} \right] ^{\downarrow ,r} }{ \Vert \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \Vert } \in \arg \max _{ \left\| {\mathbf {x}} \right\| =1, \left\| {\mathbf {x}} \right\| _0 \le r } {\mathbf {a}}^\top {\mathbf {x}}. \end{aligned}$$
(3.4)
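As an illustration, a minimal Matlab sketch of the truncation followed by normalization is given below; the function name trunc_norm is ours, and by (3.4) its output solves the sparse linear maximization problem on the right-hand side.

```matlab
function x = trunc_norm(a, r)
% [a]^{down,r} followed by normalization; Matlab's sort is stable, so ties are
% broken by keeping the smaller index, matching the convention described above
[~, idx] = sort(abs(a), 'descend');
t = zeros(size(a));
t(idx(1:r)) = a(idx(1:r));       % keep the r largest-magnitude entries of a
x = t / norm(t);                 % by (3.4), x maximizes a'*x over the sparse unit vectors
end
```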

It is not hard to see that the following proposition holds.

Proposition 3.1

Let \({\mathbf {a}}\in {\mathbb {R}}^n\), \({\mathbf {a}}\ne 0\) and let \({\mathbf {a}}^0 = \left[ {\mathbf {a}} \right] ^{\downarrow ,r} / \Vert \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \Vert \) with \(1\le r\le n\). Then

$$\begin{aligned} \left\langle {\mathbf {a}} , {\mathbf {a}}^0\right\rangle \ge \sqrt{\frac{r}{n}} \left\| {\mathbf {a}} \right\| . \end{aligned}$$
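Indeed, Proposition 3.1 follows from a one-line computation (our sketch of the omitted argument):

$$\begin{aligned} \left\langle {\mathbf {a}} , {\mathbf {a}}^0\right\rangle = \frac{ \Vert \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \Vert ^2}{ \Vert \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \Vert } = \Vert \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \Vert \ge \sqrt{\frac{r}{n}} \left\| {\mathbf {a}} \right\| , \end{aligned}$$

where the inequality holds because \( \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \) keeps the r largest (in magnitude) of the n entries of \({\mathbf {a}}\), so that \( \Vert \left[ {\mathbf {a}} \right] ^{\downarrow ,r} \Vert ^2 \ge \frac{r}{n}\Vert {\mathbf {a}}\Vert ^2\).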

Let \(\lambda _{\max }(\cdot )\) denote the largest singular value of a given matrix. The following lemma is important.

Lemma 3.1

Given a nonzero \(A\in {\mathbb {R}}^{m\times n}\), let \(({\mathbf {y}},{\mathbf {z}})\) be the normalized singular vector pair corresponding to \(\lambda _{\max }(A)\), and let \({\mathbf {z}}^0 = \left[ {\mathbf {z}} \right] ^{\downarrow ,r} / \Vert \left[ {\mathbf {z}} \right] ^{\downarrow ,r} \Vert \) with \(1\le r\le n\). Then there holds

$$\begin{aligned} \left\| A {\mathbf {z}}^0 \right\| \ge \sqrt{\frac{r}{n}}\,\lambda _{\max }(A). \end{aligned}$$

Proof

From the definition of \({\mathbf {z}}\), we see that \( \left\| A {\mathbf {z}} \right\| = \lambda _{\max }(A)\) and \(A^\top A{\mathbf {z}} = \lambda _{\max }^2(A){\mathbf {z}}\). Therefore,

$$\begin{aligned} \left\| A {\mathbf {z}}^0 \right\| \ge \frac{\left\langle A{\mathbf {z}}^0 , A{\mathbf {z}}\right\rangle }{ \left\| A{\mathbf {z}} \right\| } = \frac{\left\langle {\mathbf {z}}^0 , A^\top A{\mathbf {z}}\right\rangle }{\lambda _{\max }(A)} = \lambda _{\max }(A) \left\langle {\mathbf {z}}^0 , {\mathbf {z}}\right\rangle \ge \sqrt{\frac{r}{n}}\,\lambda _{\max }(A), \end{aligned}$$

where the last inequality follows from Proposition 3.1, the definition of \({\mathbf {z}}^0 \), and that \( \left\| {{\mathbf {z}}} \right\| =1\). This completes the proof. \(\square \)

3.1 The first algorithm

To illustrate the first algorithm, we denote \({\mathbf {e}}_j^{i_j}\in {\mathbb {R}}^{n_j}\), \(1\le i_j\le n_j\), \(j=1,\ldots ,d\), as standard basis vectors in \({\mathbb {R}}^{n_j}\). For example, \({\mathbf {e}}_2^1\) is a vector in \({\mathbb {R}}^{n_2}\) with the first entry being one and the remaining ones being zero. Denote \(\varvec{r}:=(r_1,\ldots ,r_d)\); without loss of generality, we assume that \(r_1\le \cdots \le r_d\).

[Algorithm A (pseudocode figure)]

It is clear that the \({\mathcal {A}}{\mathbf {e}}_1^{i_1}\cdots {\mathbf {e}}_{d-1}^{i_{d-1}}\)’s are mode-d fibers of \({\mathcal {A}}\). For the definition of fibers, one can refer to [16]; following the Matlab notation we have \({\mathcal {A}}(i_1,\ldots ,i_{d-1},:) = {\mathcal {A}}{\mathbf {e}}_1^{i_1}\cdots {\mathbf {e}}_{d-1}^{i_{d-1}}\). Algorithm A is a straightforward extension of [2, Algorithm 1] for sparse symmetric matrix PCA to the higher-order tensor case. Intuitively, the first two steps of Algorithm A enumerate all the mode-d fibers of \({\mathcal {A}}\) and select the one with the largest length restricted to its largest \(r_d\) entries (in magnitude). \({\mathbf {x}}^0_d\) is then given by normalizing the truncation of this fiber. Then, according to (3.4), the remaining \({\mathbf {x}}^0_j\) are in fact obtained by the sequential updates

$$\begin{aligned} {\mathbf {x}}^0_j \in \arg \max _{ \left\| {\mathbf {y}} \right\| =1, \left\| {\mathbf {y}} \right\| _0 \le r_j } {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{j-1}^{\bar{i}_{j-1}}{\mathbf {y}} {\mathbf {x}}^{0}_{j+1}\cdots {\mathbf {x}}^0_d,~j=d-1,\ldots ,1. \end{aligned}$$
(3.5)

We consider the computational complexity of Algorithm A. Computing \({\mathcal {A}}{\mathbf {e}}_1^{i_1}\cdots {\mathbf {e}}_{d-1}^{i_{d-1}}\) takes \(O(n_d)\) flops, while it takes \(O(n_d\log (n_d))\) flops to compute \( \left[ {\mathcal {A}}{\mathbf {e}}_1^{i_1}\cdots {\mathbf {e}}_{d-1}^{i_{d-1}} \right] ^{\downarrow ,r} \) by using quick-sort. Thus the first two steps take \(O(n_1\cdots n_d\log (n_d) + n_d\log (n_d))\) flops. Computing \({\mathbf {x}}^0_j,j=1,\ldots ,d-1\) respectively takes \(O(\prod ^j_{k=1}n_k \cdot n_d+n_j\log (n_j) )\) flops. Thus the total complexity can be roughly estimated as \(O(\prod ^d_{j=1}n_j\cdot \log (n_d) + \sum ^{d}_{j=1}n_j\log (n_j))\).
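For illustration, a possible Matlab implementation of Algorithm A for \(d=3\), reconstructed from the description above, is sketched below; function and variable names are ours, and the sketch is not the authors' reference implementation.

```matlab
% Hedged sketch of Algorithm A for d = 3, reconstructed from the text above.
function [x1, x2, x3] = algorithm_A(A, r1, r2, r3)
[n1, n2, n3] = size(A);
best = -inf;
% steps 1-2: pick the mode-3 fiber whose r3-truncation has the largest norm
for i1 = 1:n1
    for i2 = 1:n2
        f = squeeze(A(i1, i2, :));            % fiber A e_1^{i1} e_2^{i2}
        s = sort(abs(f), 'descend');
        if norm(s(1:r3)) > best
            best = norm(s(1:r3));
            fbar = f;
            Abar = squeeze(A(i1, :, :));      % slice A e_1^{i1}, kept for step 3
        end
    end
end
x3 = trunc_norm(fbar, r3);
% step 3: sequential updates (3.5), each solved by truncating a partial gradient
x2 = trunc_norm(Abar * x3, r2);                             % maximizes A e_1^{i1bar} y x3
x1 = trunc_norm(reshape(A, n1, n2*n3) * kron(x3, x2), r1);  % maximizes A y x2 x3
end

function x = trunc_norm(a, r)   % keep the r largest-magnitude entries, then normalize
[~, p] = sort(abs(a), 'descend'); x = zeros(size(a)); x(p(1:r)) = a(p(1:r)); x = x / norm(x);
end
```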

We first show that Algorithm A is well-defined.

Proposition 3.2

If \({\mathcal {A}}\ne 0\) and \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\) is generated by Algorithm A, then \({\mathbf {x}}^0_j\ne 0\), \(1\le j\le d\).

Proof

Since \({\mathcal {A}}\) is a nonzero tensor, there exists at least one fiber that is not identically zero. Therefore, the definition of \(\bar{{\mathbf {x}}}^0_d\) shows that \(\bar{{\mathbf {x}}}^0_d\) is not identically zero, and hence neither is \({\mathbf {x}}^0_d\). We also observe that \( \left\langle {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}{\mathbf {x}}^0_d , {\mathbf {e}}^{\bar{i}_{d-1}}_{d-1} \right\rangle ={\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-1}^{\bar{i}_{d-1}}{\mathbf {x}}^0_d = \Vert \left[ {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-1}^{\bar{i}_{d-1}} \right] ^{\downarrow ,r_d} \Vert >0\), which implies that \({\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}{\mathbf {x}}^0_d \ne 0\), and so \({\mathbf {x}}^0_{d-1}\ne 0\). (3.5) shows that \({\mathcal {A}}{\mathbf {e}}^{\bar{i}_1}_1\cdots {\mathbf {e}}^{\bar{i}_{d-2}}_{d-2}{\mathbf {x}}^0_{d-1}{\mathbf {x}}^0_{d} \ge {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-1}^{\bar{i}_{d-1}}{\mathbf {x}}^0_d >0\). Similarly, we can show that \({\mathbf {x}}^0_{d-2}\ne 0,\ldots ,{\mathbf {x}}^0_1\ne 0\). \(\square \)

We next present the approximation bound when \(d=3\). The bound extends that of [2] to the higher-order case.

Theorem 3.1

Denote \(v^{\mathrm{opt}}\) as the optimal value of (2.3) when \(d=3\).

Let \(({\mathbf {x}}^0_1,{\mathbf {x}}^0_2,{\mathbf {x}}^0_3)\) be generated by Algorithm A. Then it holds that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3\ge \frac{v^{\mathrm{opt}}}{\sqrt{r_1r_2}}. \end{aligned}$$

Proof

Denote \(({\mathbf {x}}^*_1,{\mathbf {x}}^*_2,{\mathbf {x}}^*_3)\) as a maximizer of (2.3). By noticing that \( \left\| {\mathbf {x}}^*_1 \right\| _0 \le r_1\) and \({\mathcal {A}}{\mathbf {x}}^*_1{\mathbf {x}}^*_2{\mathbf {x}}^*_3=\sum _{\{i_1: x^*_{1,i_1}\ne 0\}}^{n_1} x^*_{1,i_1} \cdot \left( {\mathcal {A}}{\mathbf {e}}_1^{i_1} {\mathbf {x}}^*_2{\mathbf {x}}^*_3 \right) \); recalling that we write \({\mathbf {x}}_j := [x_{j,1},\ldots ,x_{j,n_j}]^\top \), we have

$$\begin{aligned} v^{\mathrm{opt}} = \sum _{\{i_1: x^*_{1,i_1}\ne 0\}} x^*_{1,i_1} \cdot \left( {\mathcal {A}}{\mathbf {e}}_1^{i_1} {\mathbf {x}}^*_2{\mathbf {x}}^*_3 \right) \le \left\| {\mathbf {x}}^*_1 \right\| \sqrt{\sum _{\{i_1: x^*_{1,i_1}\ne 0\}} \left( {\mathcal {A}}{\mathbf {e}}_1^{i_1} {\mathbf {x}}^*_2{\mathbf {x}}^*_3 \right) ^2} \le \sqrt{r_1} \left\| {\mathbf {x}}^*_1 \right\| \max _{i_1} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {x}}^*_2{\mathbf {x}}^*_3 \right| = \sqrt{r_1}\max _{i_1} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {x}}^*_2{\mathbf {x}}^*_3 \right| , \end{aligned}$$
(3.6)

where the first inequality uses the Cauchy-Schwarz inequality, and the last equality follows from \( \left\| {\mathbf {x}}^*_1 \right\| =1\). In the same vein, we have

$$\begin{aligned} \max _{i_1} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {x}}^*_2{\mathbf {x}}^*_3 \right| \le \sqrt{r_2}\max _{i_1,i_2} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {e}}_2^{i_2}{\mathbf {x}}^*_3 \right| , \quad \mathrm{and~so}\quad v^{\mathrm{opt}} \le \sqrt{r_1r_2}\max _{i_1,i_2} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {e}}_2^{i_2}{\mathbf {x}}^*_3 \right| . \end{aligned}$$

Assume that \( \left| {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2}{\mathbf {x}}^*_3 \right| = \max _{i_1,i_2} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {e}}_2^{i_2}{\mathbf {x}}^*_3 \right| \); denote

$$\begin{aligned} \hat{{\mathbf {x}}}_3 := \left[ {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2} \right] ^{\downarrow ,r_3} \Big / \left\| \left[ {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2} \right] ^{\downarrow ,r_3} \right\| . \end{aligned}$$

(3.4) shows that \( \left| {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2}{\mathbf {x}}^*_3 \right| \le \left| {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2}\hat{{\mathbf {x}}}_3 \right| \). By noticing the definition of \({\mathbf {x}}^0_3\), we then have

$$\begin{aligned} \left| {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2}{\mathbf {x}}^*_3 \right|\le & {} \left| {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2}\hat{{\mathbf {x}}}_3 \right| = \left\| \left[ {\mathcal {A}}{\mathbf {e}}_1^{\hat{i}_1}{\mathbf {e}}_2^{\hat{i}_2} \right] ^{\downarrow ,r_3} \right\| \le \max _{i_1,i_2} \left\| \left[ {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {e}}_2^{i_2} \right] ^{\downarrow ,r_3} \right\| \\= & {} {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}{\mathbf {e}}_2^{\bar{i}_2}{\mathbf {x}}^0_3. \end{aligned}$$

Finally, recalling the definitions of \({\mathbf {x}}^0_1\) and \({\mathbf {x}}^0_2\) and combining the above pieces, we arrive at

$$\begin{aligned} v^{\mathrm{opt}} \le \sqrt{r_1r_2}{\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}{\mathbf {e}}_2^{\bar{i}_2}{\mathbf {x}}^0_3\le \sqrt{r_1r_2}{\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}{\mathbf {x}}^0_{2}{\mathbf {x}}^0_3\le \sqrt{r_1r_2}{\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3, \end{aligned}$$

as desired. \(\square \)

In the same spirit as the proof of Theorem 3.1, for general \(d\ge 3\), one can show the following result.

Theorem 3.2

Denote \(v^{\mathrm{opt}}\) as the optimal value of the problem (2.3).

Let \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\) be generated by Algorithm A for a general d-th order tensor \({\mathcal {A}}\). Then it holds that

$$\begin{aligned}\small {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d\ge \frac{v^\mathrm{opt}}{\sqrt{\prod ^{d-1}_{j=1}r_j}}. \end{aligned}$$

3.2 The second algorithm

The second algorithm is presented as follows.

[Algorithm B (pseudocode figure)]

The main difference from Algorithm A lies in the first step: Algorithm A needs to find the fiber with the largest length restricted to its largest \(r_d\) entries (in magnitude), while the first step of Algorithm B looks for the matrix \({\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}} \in {\mathbb {R}}^{n_{d-1}\times n_d}\) with the largest singular value among all \({\mathcal {A}}{\mathbf {e}}_1^{i_1}\cdots {\mathbf {e}}_{d-2}^{i_{d-2}}\). Algorithm B combines the ideas of both Algorithms 1 and 2 of [2] and extends them to higher-order tensors. When reduced to the matrix case, our algorithm here is still different from [2], as we find sparse singular vector pairs, while [2] pursues sparse eigenvectors of a symmetric matrix.

The computational complexity of Algorithm B is as follows. Computing the largest singular value of \({\mathcal {A}}{\mathbf {e}}_1^{i_1}\cdots {\mathbf {e}}_{d-2}^{i_{d-2}}\) takes \(O(\min \{ n_{d-1}^2 n_d, n_{d-1} n_d^2 \} )\) flops in theory. Computing \({\mathbf {x}}^0_d\) takes \(O(n_d\log (n_d))\) flops. Thus the first two steps take \(O(n_1\cdots n_{d-2}\min \{ n_{d-1}^2 n_d, n_{d-1} n_d^2 \} + n_d\log (n_d))\) flops. The flops of the third step are the same as those of Algorithm A. As a result, the total complexity is dominated by \(O(n_1\cdots n_{d-2}\min \{ n_{d-1}^2 n_d, n_{d-1} n_d^2 \}+ \sum ^d_{j=1}n_j\log (n_j))\).
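A hedged Matlab sketch of Algorithm B for \(d=3\), reconstructed from the description above and the proofs of Proposition 3.3 and Theorem 3.3, is given below; names are ours and the sketch is for illustration only.

```matlab
% Hedged sketch of Algorithm B for d = 3, reconstructed from the text above.
function [x1, x2, x3] = algorithm_B(A, r1, r2, r3)
[n1, n2, n3] = size(A);
best = -inf;
% step 1: slice A e_1^{i1} with the largest singular value and its singular pair
for i1 = 1:n1
    M = squeeze(A(i1, :, :));                  % n2-by-n3 slice
    [~, s, v] = svds(M, 1);
    if s(1, 1) > best
        best = s(1, 1); Mbar = M; xbar3 = v;
    end
end
% step 2: truncate/normalize the right singular vector, then the matching factor
x3 = trunc_norm(xbar3, r3);
x2 = trunc_norm(Mbar * x3, r2);
% step 3: last factor from the truncated partial gradient A x2 x3 (as in Algorithm A)
x1 = trunc_norm(reshape(A, n1, n2*n3) * kron(x3, x2), r1);
end

function x = trunc_norm(a, r)   % keep the r largest-magnitude entries, then normalize
[~, p] = sort(abs(a), 'descend'); x = zeros(size(a)); x(p(1:r)) = a(p(1:r)); x = x / norm(x);
end
```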

Algorithm B is also well-defined, as the following proposition shows.

Proposition 3.3

If \({\mathcal {A}}\ne 0\) and \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\) is generated by Algorithm B, then \({\mathbf {x}}^0_j\ne 0\), \(1\le j\le d\).

Proof

The definition of \((\bar{i}_1,\ldots , \bar{i}_{d-2})\) shows that the matrix \( {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}} \ne 0\), and hence \(\bar{{\mathbf {x}}}_d\ne 0\) and \({\mathbf {x}}^0_d\ne 0\). We also observe from step 2 that \(\bar{{\mathbf {x}}}_d = {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}} \bar{{\mathbf {x}}}_{d-1}/ \Vert {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}} \bar{{\mathbf {x}}}_{d-1} \Vert \), and so

$$\begin{aligned} \left\langle {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}{\mathbf {x}}^0_d , \bar{{\mathbf {x}}}_{d-1}\right\rangle= & {} \left\langle {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}} \bar{{\mathbf {x}}}_{d-1} , {\mathbf {x}}^0_d\right\rangle \\= & {} \left\langle \bar{{\mathbf {x}}}_d , {\mathbf {x}}^0_d\right\rangle \Vert {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}} \bar{{\mathbf {x}}}_{d-1} \Vert >0, \end{aligned}$$

implying that \( {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}{\mathbf {x}}^0_d\ne 0\), and so \({\mathbf {x}}^0_{d-1} = \left[ {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}{\mathbf {x}}^0_d \right] ^{\downarrow ,r_{d-1}} \ne 0\). Similar arguments apply to show that \({\mathbf {x}}^0_{d-2} \ne 0,\ldots ,{\mathbf {x}}^0_1\ne 0\) then. \(\square \)

To analyze the approximation bound, we need the following proposition.

Proposition 3.4

Let \(\bar{i}_1,\ldots ,\bar{i}_{d-2}\) and \((\bar{{\mathbf {x}}}_{d-1},\bar{{\mathbf {x}}}_d)\) be defined in Algorithm B. Then it holds that

$$\begin{aligned} {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}\bar{{\mathbf {x}}}_{d-1}\bar{{\mathbf {x}}}_d \ge \max _{i_1,\ldots ,i_{d-2}, \Vert {\mathbf {y}} \Vert = \Vert {\mathbf {z}} \Vert =1, \left\| {\mathbf {y}} \right\| _0 \le r_{d-1}, \left\| {\mathbf {z}} \right\| _0 \le r_d} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}\cdots {\mathbf {e}}_{d-2}^{i_{d-2}}{\mathbf {y}}{\mathbf {z}} \right| . \end{aligned}$$

Proof

The result holds by noticing the definition of \(({\mathbf {e}}^{\bar{i}_1}_1,\ldots ,{\mathbf {e}}^{\bar{i}_{d-2}}_{d-2},\bar{{\mathbf {x}}}_{d-1},\bar{{\mathbf {x}}}_d)\), and the additional sparsity constraints in the right-hand side of the inequality. \(\square \)

We first derive the approximation bound with \(d=3\) as an illustration.

Theorem 3.3

Let \(v^{\mathrm{opt}}\) be defined as that in Theorem 3.1 when \(d=3\), and let \(({\mathbf {x}}^0_1,{\mathbf {x}}^0_2,{\mathbf {x}}^0_3)\) be generated by Algorithm B. Then it holds that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3\ge \sqrt{\frac{r_2r_3}{n_2n_3r_1}}v^{\mathrm{opt}}. \end{aligned}$$

Proof

Denote \(({\mathbf {x}}^*_1,{\mathbf {x}}^*_2,{\mathbf {x}}^*_3)\) as a maximizer to (2.3). We have

$$\begin{aligned} v^{\mathrm{opt}}\le & {} \sqrt{r_1}\max _{i_1} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {x}}_2^*{\mathbf {x}}^*_3 \right| \nonumber \\\le & {} \sqrt{r_1}\max _{i_1, \left\| {\mathbf {y}} \right\| = \left\| {\mathbf {z}} \right\| =1, \left\| {\mathbf {y}} \right\| _0 \le r_2, \left\| {\mathbf {z}} \right\| _0 \le r_3} \left| {\mathcal {A}}{\mathbf {e}}_1^{i_1}{\mathbf {y}}{\mathbf {z}} \right| \nonumber \\\le & {} \sqrt{r_1}{\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\bar{{\mathbf {x}}}_2\bar{{\mathbf {x}}}_3, \end{aligned}$$
(3.7)

where the first inequality is the same as (3.6), while Proposition 3.4 gives the last one. Now denote \(A:= {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\in {\mathbb {R}}^{n_2\times n_3}\). Our remaining task is to show that

$$\begin{aligned} ({\mathbf {x}}^0_2)^\top A{\mathbf {x}}^0_3 \ge \sqrt{\frac{r_2r_3}{n_2n_3}}\bar{{\mathbf {x}}}_2^\top A\bar{{\mathbf {x}}}_3. \end{aligned}$$
(3.8)

Recalling the definition of \((\bar{{\mathbf {x}}}_2,\bar{{\mathbf {x}}}_3)\), we have \(\lambda _{\max }(A) = \bar{{\mathbf {x}}}_2^\top A\bar{{\mathbf {x}}}_3\). Lemma 3.1 tells us that

$$\begin{aligned} \left\| A{\mathbf {x}}^0_3 \right\| \ge \sqrt{\frac{r_3}{n_3}}\lambda _{\max }(A) = \sqrt{\frac{r_3}{n_3}}\bar{{\mathbf {x}}}_2^\top A\bar{{\mathbf {x}}}_3; \end{aligned}$$
(3.9)

on the other hand, since

$$\begin{aligned} \bar{{\mathbf {x}}}^0_2 = \left[ {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}{\mathbf {x}}^0_3 \right] ^{\downarrow ,r_2} = \left[ A{\mathbf {x}}^0_3 \right] ^{\downarrow ,r_2} , ~{\mathbf {x}}^0_2 = \bar{{\mathbf {x}}}_2^0/ \left\| \bar{{\mathbf {x}}}_2^0 \right\| , \end{aligned}$$

it follows from Proposition 3.1 that

$$\begin{aligned} \left\langle A{\mathbf {x}}^0_3 , {\mathbf {x}}^0_2\right\rangle \ge \sqrt{\frac{r_2}{n_2}} \left\| A{\mathbf {x}}^0_3 \right\| , \end{aligned}$$

combining which with (3.9) gives (3.8). Finally, (3.8) together with (3.7) and the definition of \({\mathbf {x}}^0_1\) yields

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3\ge & {} {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}{\mathbf {x}}^0_2{\mathbf {x}}^0_3 = \left\langle A{\mathbf {x}}^0_3 , {\mathbf {x}}^0_2\right\rangle \ge \sqrt{\frac{r_2r_3}{n_2n_3}}\bar{{\mathbf {x}}}_2^\top A\bar{{\mathbf {x}}}_3 = \sqrt{\frac{r_2r_3}{n_2n_3}} {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\bar{{\mathbf {x}}}_2\bar{{\mathbf {x}}}_3 \\\ge & {} \sqrt{\frac{r_2r_3}{n_2n_3r_1}}v^{\mathrm{opt}}, \end{aligned}$$

as desired. \(\square \)

The following approximation bound is presented for general order \(d\ge 3\).

Theorem 3.4

Let \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\) be generated by Algorithm B. Then it holds that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d\ge \sqrt{\frac{r_{d-1}r_d}{n_{d-1}n_d \prod ^{d-2}_{j=1}r_j }}v^\mathrm{opt}. \end{aligned}$$

Proof

Similar to (3.7), we have

$$\begin{aligned} v^{\mathrm{opt}} \le \sqrt{\prod ^{d-2}_{j=1}r_j}\; {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}\bar{{\mathbf {x}}}_{d-1}\bar{{\mathbf {x}}}_d. \end{aligned}$$

Similar to the proof of (3.8), we can obtain

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d \ge {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}{\mathbf {x}}^0_{d-1}{\mathbf {x}}^0_d \ge \sqrt{\frac{r_{d-1}r_d}{n_{d-1}n_d}}\, {\mathcal {A}}{\mathbf {e}}_1^{\bar{i}_1}\cdots {\mathbf {e}}_{d-2}^{\bar{i}_{d-2}}\bar{{\mathbf {x}}}_{d-1}\bar{{\mathbf {x}}}_d. \end{aligned}$$

Combining the above two relations gives the desired bound. \(\square \)

3.3 The third algorithm

We begin by illustrating the idea on third-order tensors. We will employ the Matlab function reshape to denote tensor folding/unfolding operations. For instance, given \({\mathcal {A}}\in \mathbb R^{n_1\times \cdots \times n_d} \), \({\mathbf {a}} = \texttt {reshape}({\mathcal {A}}, \prod ^d_{j=1}n_j,1)\) means the unfolding of \({\mathcal {A}}\) into a vector \({\mathbf {a}}\) in \({\mathbb {R}}^{\prod ^d_{j=1}n_j}\), while \({\mathcal {A}} = \texttt {reshape}({\mathbf {a}},n_1,\ldots , n_d)\) means the folding of \({\mathbf {a}}\) back to \({\mathcal {A}}\).

[Algorithm C0 (pseudocode figure)]

Different from Algorithm B, Algorithm C0 is mainly based on a series of computing leading singular vector pairs of certain matrices. In fact, Algorithm C0 generalizes the approximation algorithm for dense tensor BR1Approx [13, Algorithm 1 with DR 2] to our sparse setting; the main difference lies in the truncation of \(\bar{{\mathbf {x}}}_j\) to obtain the sparse solution \({\mathbf {x}}^0_j\). In particular, if no sparsity is required, i.e., \(r_j=n_j,j=1,2,3\), then Algorithm C0 boils down essentially to [13, Algorithm 1 with DR 2]. The next proposition shows that Algorithm C0 is well-defined.
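A hedged Matlab sketch of Algorithm C0 for \(d=3\), reconstructed from the description above and the proof of Theorem 3.5, is as follows (names are ours).

```matlab
% Hedged sketch of Algorithm C0 (d = 3), reconstructed from the text; names are ours.
function [x1, x2, x3] = algorithm_C0(A, r1, r2, r3)
[n1, n2, n3] = size(A);
A1 = reshape(A, n1, n2*n3);            % unfold A along the first mode
[u1, ~, ~] = svds(A1, 1);              % leading left singular vector of A1
x1 = trunc_norm(u1, r1);
A2 = reshape(A1' * x1, n2, n3);        % contract with x1 and fold back
[u2, ~, ~] = svds(A2, 1);              % leading left singular vector of A2
x2 = trunc_norm(u2, r2);
x3 = trunc_norm(A2' * x2, r3);         % last factor from the partial gradient
end

function x = trunc_norm(a, r)   % keep the r largest-magnitude entries, then normalize
[~, p] = sort(abs(a), 'descend'); x = zeros(size(a)); x(p(1:r)) = a(p(1:r)); x = x / norm(x);
end
```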

Proposition 3.5

If \({\mathcal {A}}\ne 0\) and \(({\mathbf {x}}^0_1,{\mathbf {x}}^0_2,{\mathbf {x}}^0_3)\) is generated by Algorithm C0, then \({\mathbf {x}}^0_j\ne 0\), \(1\le j\le 3\).

Proof

It is clear that \(\bar{{\mathbf {x}}}_1\ne 0\), \({\mathbf {x}}^0_1\ne 0\), and \(\bar{\mathbf{w}}_1\ne 0\), since \(A_1\ne 0\). We have \( \left\langle {\mathbf {x}}^0_1 , A_1\bar{\mathbf{w}}_1\right\rangle = \left\langle {\mathbf {x}}^0_1 , \bar{{\mathbf {x}}}_1\right\rangle \Vert A_1\bar{\mathbf{w}}_1 \Vert >0\), and so \(A_1^\top {\mathbf {x}}^0_1 \ne 0\) and \(A_2\ne 0\). A similar argument shows that \({\mathbf {x}}^0_2\ne 0\) and \({\mathbf {x}}^0_3\ne 0\). \(\square \)

Due to the presence of the truncation, deriving the approximation bound is different from that of [13]. In particular, we need Lemma 3.1 to build bridges in the analysis. The following relation

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3 = \left\langle A^\top _1{\mathbf {x}}^0_1 , {\mathbf {x}}^0_3\otimes {\mathbf {x}}^0_2\right\rangle = \left\langle A_2 , {\mathbf {x}}^0_2({\mathbf {x}}^0_3)^\top \right\rangle = \left\langle A^\top _2{\mathbf {x}}^0_2 , {\mathbf {x}}^0_3\right\rangle \end{aligned}$$
(3.10)

is also helpful in the analysis, where \(\otimes \) denotes the Kronecker product [16].

Theorem 3.5

Let \(v^{\mathrm{opt}}\) be defined as that in Theorem 3.1 when \(d=3\), and let \(({\mathbf {x}}^0_1,{\mathbf {x}}^0_2,{\mathbf {x}}^0_3)\) be generated by Algorithm C0. Then it holds that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3 \ge \sqrt{\frac{r_1r_2r_3}{n_1n_2n_3 n_2}} \lambda _{\max }(A_1) \ge \sqrt{\frac{r_1r_2r_3}{n_1n_2n_3 n_2}} v^{\mathrm{opt}}. \end{aligned}$$

Proof

From the definition of \(A_1\), \(\bar{{\mathbf {x}}}_1\) and \(\bar{\mathbf{w}}_1\), we see that

$$\begin{aligned} \lambda _{\max }(A_1) = \bar{{\mathbf {x}}}_1^\top A_1\bar{\mathbf{w}}_1 \ge \max _{ \left\| {\mathbf {x}} \right\| = \left\| {\mathbf {y}} \right\| = \left\| {\mathbf {z}} \right\| =1}{\mathcal {A}}{\mathbf {x}}{\mathbf {y}}{\mathbf {z}} \ge v^{\mathrm{opt}}. \end{aligned}$$

Therefore, to prove the approximation bound, it suffices to show that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3 \ge \sqrt{\frac{r_1r_2r_3}{n_1n_2n_3 n_2}} \lambda _{\max }(A_1). \end{aligned}$$
(3.11)

To this end, recalling Lemma 3.1 and the definition of \({\mathbf {x}}^0_1\) (which is a truncation of the leading left singular vector of \(A_1\)), we obtain

$$\begin{aligned} \left\| A_1^\top {\mathbf {x}}^0_1 \right\| \ge \sqrt{\frac{r_1}{n_1}}\lambda _{\max }(A_1). \end{aligned}$$
(3.12)

Since \(A_2 = \texttt {reshape}(A^\top _1{\mathbf {x}}^0_1,n_2,n_3)\), it holds that \( \left\| A_2 \right\| _F = \left\| A^\top _1{\mathbf {x}}^0_1 \right\| \). Using again Lemma 3.1 and recalling the definition of \({\mathbf {x}}^0_2\) (which is a truncation of the leading left singular vector of \(A_2\)), we get

$$\begin{aligned} \left\| A_2^\top {\mathbf {x}}^0_2 \right\| \ge \sqrt{\frac{r_2}{n_2}}\lambda _{\max }(A_2) \ge \sqrt{\frac{r_2}{n_2}}\cdot \frac{ \left\| A_2 \right\| _F }{\sqrt{n_2}} = \sqrt{\frac{r_2}{n_2^2}} \left\| A_1^\top {\mathbf {x}}^0_1 \right\| , \end{aligned}$$
(3.13)

where the second inequality follows from the relation between the spectral norm and the Frobenius norm of a matrix. Finally, Proposition 3.1 and the definition of \({\mathbf {x}}^0_3\) gives that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3= \left\langle A_2^\top {\mathbf {x}}^0_2 , {\mathbf {x}}^0_3\right\rangle \ge \sqrt{\frac{r_3}{n_3}} \left\| A_2^\top {\mathbf {x}}^0_2 \right\| , \end{aligned}$$
(3.14)

where the equality follows from (3.10). Combining the above relation with (3.12) and (3.13) gives (3.11). This completes the proof. \(\square \)

When extending to d-th order tensors, the algorithm is presented as follows.

[Algorithm C (pseudocode figure)]

Remark 3.1

In step 2, by noticing the recursive definition of \(A_j\), one can check that \(A_j\) is in fact the same as \({\texttt {\texttt {reshape}}}({\mathcal {A}}{\mathbf {x}}_1^0\cdots {\mathbf {x}}_{j-1}^0,n_j,\prod ^d_{k=j+1}n_k)\), where \({\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_{j-1}\) is regarded as a tensor of size \(n_{j}\times \cdots \times n_d\) with \( \left( {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_{j-1} \right) _{i_j\cdots i_d} = \sum ^{n_1,\ldots , n_{j-1}}_{i_1=1,\ldots ,i_{j-1}=1} {\mathcal {A}}_{i_1\cdots i_{j-1} i_j\cdots i_d} ({\mathbf {x}}^0_1)_{i_1}\cdots ({\mathbf {x}}^0_{j-1})_{i_{j-1}} \).

The computational complexity of the first step is \(O(n_1^2 n_2\cdots n_d+n_1\log (n_1))\), where \(O(n_1^2 n_2\cdots n_d)\) comes from solving the singular value problem of an \(n_1\times \prod ^d_{j=2}n_j\) matrix. In the second step, for each j, computing \(A^\top _{j-1}{\mathbf {x}}^0_{j-1} \) requires \(O(n_{j-1}\cdots n_d)\) flops; computing the singular value problem requires \(O(n^2_j n_{j+1}\cdots n_d )\) flops; thus the complexity is \(O(n_{j-1}\cdots n_d + n^2_j n_{j+1}\cdots n_d + n_j\log (n_j))\). The third step is \(O(n_{d-1}n_d + n_d\log (n_d))\). Thus the total complexity is dominated by \(O(n_1\prod ^d_{j=1}n_j + \sum ^d_{j=1}n_j\log (n_j))\), which is the same as Algorithm B in theory.
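Following Remark 3.1 and the complexity discussion, a hedged Matlab sketch of Algorithm C for a general d-th order tensor might look as follows (names are ours; not the authors' reference implementation).

```matlab
% Hedged sketch of Algorithm C for a general d-th order tensor, following
% Remark 3.1 (A_j is built recursively); names are ours.
function xs = algorithm_C(A, r)        % r is a vector of sparsity levels r(1),...,r(d)
n = size(A); d = numel(n);
xs = cell(d, 1);
Aj = reshape(A, n(1), prod(n(2:d)));   % A_1
for j = 1:d-1
    [u, ~, ~] = svds(Aj, 1);           % leading left singular vector of A_j
    xs{j} = trunc_norm(u, r(j));
    v = Aj' * xs{j};                   % contract mode j
    if j < d-1
        Aj = reshape(v, n(j+1), prod(n(j+2:d)));   % A_{j+1}
    end
end
xs{d} = trunc_norm(v, r(d));           % last factor from A_{d-1}^T x_{d-1}
end

function x = trunc_norm(a, r)   % keep the r largest-magnitude entries, then normalize
[~, p] = sort(abs(a), 'descend'); x = zeros(size(a)); x(p(1:r)) = a(p(1:r)); x = x / norm(x);
end
```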

Similar to Proposition 3.5, we can show that Algorithm C is well-defined, i.e., if \({\mathcal {A}}\ne 0\), then \({\mathbf {x}}^0_j\ne 0\), \(1\le j\le d\). The proof is omitted.

Concerning the approximation bound, similar to (3.14), one has

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d = \left\langle {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_{d-1} , {\mathbf {x}}^0_d\right\rangle = \langle A^\top _{d-1}{\mathbf {x}}^0_{d-1},{\mathbf {x}}^0_d\rangle \ge \sqrt{\frac{r_d}{n_d}}\Vert A^\top _{d-1}{\mathbf {x}}^0_{d-1}\Vert , \end{aligned}$$

where the second equality comes from Remark 3.1 that \({\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_{d-2} = A_{d-1}\) (up to a reshaping) and the inequality is due to Proposition 3.1. Analogous to (3.13), one can prove the following relation:

$$\begin{aligned} \left\| A_j^\top {\mathbf {x}}^0_j \right\| \ge \sqrt{\frac{r_j}{n_j^2}} \left\| A^\top _{j-1}{\mathbf {x}}^0_{j-1} \right\| ,~2\le j\le d-1. \end{aligned}$$

Based on the above relations and (3.12), for order \(d\ge 3\), we present the approximation bound without proof.

Theorem 3.6

Let \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\) be generated by Algorithm C. Then it holds that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d\ge \sqrt{\frac{\prod ^d_{j=1}r_j}{\prod ^d_{j=1}n_j}}\frac{\lambda _{\max }(A_1)}{\sqrt{\prod ^{d-1}_{j=2}n_j}} \ge \sqrt{\frac{\prod ^d_{j=1}r_j}{\prod ^d_{j=1}n_j}}\frac{v^\mathrm{opt}}{\sqrt{\prod ^{d-1}_{j=2}n_j}} . \end{aligned}$$

3.4 The fourth algorithm

Algorithm C computes \(\bar{{\mathbf {x}}}_j\) from \(A_j\) by solving singular value problems. When the size of the tensor is huge, this might be time-consuming. To further accelerate the computation, we propose the following algorithm, which is similar to Algorithm C but obtains \(\bar{{\mathbf {x}}}_j\) without solving singular value problems. Denote \(A^k\) as the k-th row of a matrix A. The algorithm is presented as follows:

[Algorithm D (pseudocode figure)]

It is clear from the above algorithm that computing \(\bar{{\mathbf {x}}}_j\) only requires some matrix-vector products, which have lower computational complexity than computing singular vectors. Numerical results presented in the next section will show the efficiency and effectiveness of this simple modification.

In Algorithm D, the first step needs \(O(n_1\cdots n_d + n_1\log (n_1))\) flops; in the second step, for each j, the complexity is \(O(n_{j-1}\cdots n_d + n_j\log (n_j))\); the third step is \(O(n_{d-1}n_d + n_d\log (n_d))\). Thus the total complexity is dominated by \(O(n_1\cdots n_d + \sum ^d_{j=1}n_j\log (n_j))\), which is lower than that of Algorithm C, due to the SVD-free computation of \(\bar{{\mathbf {x}}}_j\)’s.
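A hedged Matlab sketch of Algorithm D, reconstructed from Lemma 3.2 and the description above (names are ours); the only difference from the Algorithm C sketch is that the leading singular vector is replaced by the largest-norm-row surrogate.

```matlab
% Hedged sketch of Algorithm D (SVD-free), reconstructed from Lemma 3.2 and the
% surrounding description; names are ours.
function xs = algorithm_D(A, r)        % r is a vector of sparsity levels r(1),...,r(d)
n = size(A); d = numel(n);
xs = cell(d, 1);
Aj = reshape(A, n(1), prod(n(2:d)));   % A_1
for j = 1:d-1
    [~, k] = max(sum(Aj.^2, 2));       % row of A_j with the largest norm
    w = Aj(k, :)' / norm(Aj(k, :));    % surrogate for the leading right singular vector
    xs{j} = trunc_norm(Aj * w, r(j));  % truncate/normalize A_j w (cf. Lemma 3.2)
    v = Aj' * xs{j};
    if j < d-1
        Aj = reshape(v, n(j+1), prod(n(j+2:d)));   % A_{j+1}
    end
end
xs{d} = trunc_norm(v, r(d));
end

function x = trunc_norm(a, r)   % keep the r largest-magnitude entries, then normalize
[~, p] = sort(abs(a), 'descend'); x = zeros(size(a)); x(p(1:r)) = a(p(1:r)); x = x / norm(x);
end
```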

Reducing to the dense tensor setting, i.e., \(r_j=n_j\) for each j, Algorithm D is even new for dense tensor BR1Approx problems; when \(d=2\), similar ideas have not been applied to approximation algorithms for sparse matrix PCA/SVD yet.

The next proposition shows that Algorithm D is well-defined, whose proof is quite similar to that of Proposition 3.5 and is omitted.

Proposition 3.6

If \({\mathcal {A}}\ne 0\) and \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\) is generated by Algorithm D, then \({\mathbf {x}}^0_j\ne 0\), \(1\le j\le d\).

The approximation bound analysis essentially relies on the following lemma.

Lemma 3.2

Given a nonzero \(A\in {\mathbb {R}}^{m\times n}\), let \(A^{\tilde{k}}\) be the row of A with the largest Euclidean norm. Let \(\mathbf{w}=(A^{\tilde{k}})^\top / \Vert (A^{\tilde{k}})^\top \Vert \), \({\mathbf {x}} = A\mathbf{w}\), and \({\mathbf {x}}^0 = \left[ {\mathbf {x}} \right] ^{\downarrow ,r} / \Vert \left[ {\mathbf {x}} \right] ^{\downarrow ,r} \Vert \) with \(1\le r\le m\). Then there holds

$$\begin{aligned} \left\| A^\top {\mathbf {x}}^0 \right\| \ge \sqrt{\frac{r}{m^2} } \left\| A \right\| _F . \end{aligned}$$

Proof

We have

$$\begin{aligned} \left\| A^\top {\mathbf {x}}^0 \right\| = \left\| A^\top {\mathbf {x}}^0 \right\| \left\| \mathbf{w} \right\| \ge \left\langle A^\top {\mathbf {x}}^0 , \mathbf{w}\right\rangle = \left\langle {\mathbf {x}}^0 , A\mathbf{w}\right\rangle = \left\langle {\mathbf {x}}^0 , {\mathbf {x}}\right\rangle \ge \sqrt{\frac{r}{m}} \left\| {\mathbf {x}} \right\| , \end{aligned}$$

where the first equality and the first inequality follow from the Cauchy-Schwarz inequality and the fact that \(\mathbf{w}\) is normalized, and the last inequality comes from Proposition 3.1. We also have from the definition of \({\mathbf {x}}\) that

$$\begin{aligned} \Vert {\mathbf {x}} \Vert ^2= & {} \sum \limits ^m_{k=1} ({A^k\mathbf{w}})^2 \ge ({A^{\tilde{k}}\mathbf{w}})^2 \\= & {} \left\langle A^{\tilde{k}} , \frac{A^{\tilde{k}}}{ \Vert A^{\tilde{k}} \Vert }\right\rangle ^2 = \left\| A^{\tilde{k}} \right\| ^2 \\\ge & {} \frac{1}{m} \left\| A \right\| _F ^2, \end{aligned}$$

where the last inequality comes from the definition of \(A^{\tilde{k}}\). Combining the above analysis gives the desired result. \(\square \)

Theorem 3.7

Let \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\) be generated by Algorithm D. Then it holds that

$$\begin{aligned} {\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d\ge \sqrt{\frac{\prod ^d_{j=1}r_j}{\prod ^d_{j=1}n_j}}\frac{ \left\| {\mathcal {A}} \right\| _F }{\sqrt{\prod ^{d-1}_{j=1}n_j}} \ge \sqrt{\frac{\prod ^d_{j=1}r_j}{\prod ^d_{j=1}n_j}}\frac{v^\mathrm{opt}}{\sqrt{\prod ^{d-1}_{j=1}n_j}} . \end{aligned}$$

Proof

As in the discussion above Theorem 3.6, we get \({\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d \ge \sqrt{ \frac{r_d}{n_d} } \left\| A^\top _{d-1} {\mathbf {x}}^0_{d-1} \right\| \). Using Lemma 3.2, for \(2\le j\le d-1\), we have

$$\begin{aligned} \left\| A^\top _{j}{\mathbf {x}}^0_{j} \right\| \ge \sqrt{\frac{r_{j}}{n_{j}^2}} \left\| A_{j} \right\| _F = \sqrt{\frac{r_{j}}{n_{j}^2}} \left\| A^\top _{j-1}{\mathbf {x}}^0_{j-1} \right\| . \end{aligned}$$

In particular, from step 1 of the algorithm, \( \left\| A^\top _1{\mathbf {x}}^0_1 \right\| \ge \sqrt{\frac{r_1}{n_1^2}} \left\| A_1 \right\| _F = \sqrt{\frac{r_1}{n_1^2}} \left\| {\mathcal {A}} \right\| _F \). It is clear that \( \left\| {\mathcal {A}} \right\| _F \ge \lambda _{\max }(A_1) \ge v^\mathrm{opt}\). Combining the above relations gives the desired result. \(\square \)

Before ending this section, we summarize the approximation ratios and computational complexities of the proposed algorithms in Table 1. For convenience we set \(r_1=\cdots =r_d=r\) and \(n_1=\cdots = n_d=n\).

Table 1 Comparisons of the proposed approximation algorithms on the approximation bound and computational complexity (entries follow from Theorems 3.2, 3.4, 3.6, and 3.7 and the complexity estimates above, with \(r_1=\cdots =r_d=r\) and \(n_1=\cdots =n_d=n\)):

Algorithm A: approximation ratio \(1/\sqrt{r^{d-1}}\); complexity \(O(n^d\log n)\).
Algorithm B: approximation ratio \(\sqrt{r^2/(n^2 r^{d-2})}\); complexity \(O(n^{d+1})\).
Algorithm C: approximation ratio \(\sqrt{r^d/n^{2d-2}}\); complexity \(O(n^{d+1})\).
Algorithm D: approximation ratio \(\sqrt{r^d/n^{2d-1}}\); complexity \(O(n^d)\).

Concerning the approximation ratio \({\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d/v^{\mathrm{opt}}\), we see that if r is a constant, then the ratio of Algorithm A is also a constant, while those of the other algorithms depend on n. This is the advantage of Algorithm A compared with the other algorithms. When \(d=2\), the ratios of Algorithms A and B are respectively \(1/\sqrt{r}\) and r/n, which coincide with their matrix counterparts; see [2, Algorithms 1 and 2]. We also observe that Algorithm B generalizes the approximation algorithm and bound for dense tensor BR1Approx in [39, Theorem 5.1]: if sparsity is not required, i.e., \(r=n\), the bound of Algorithm B boils down to that of [39, Theorem 5.1]. For Algorithm C, when \(r=n\), or r is proportional to n up to a constant, the ratio reduces to \( O(1/\sqrt{ n^{d-2}})\), which recovers that of [13, Algorithm 1]. Although the ratio of Algorithm D is worse than that of Algorithm C by an additional factor \(1/\sqrt{n}\), we should also observe that the numerator of the bound of Algorithm D is \( \left\| {\mathcal {A}} \right\| _F \), while that of Algorithm C is \(\lambda _{\max }(A_1)\) (see Algorithm C for the definition of \(A_1\)), where the latter is usually much smaller than the former. Note that two randomized approximation algorithms used as initializations were proposed in [32, Algorithms 3 and 4] (see its arXiv version), where the performance analysis was conducted on structured tensors. It would also be interesting to study their approximation bounds for general tensors. Concerning the computational complexity, we see that Algorithm D admits the lowest one in theory. This is also confirmed by our numerical observations that will be presented in the next section. Note that if the involved singular value problems in Algorithm C are solved by the Lanczos algorithm with k steps [7], where k is a user-defined parameter, then the computational complexity is \(O(k n^d + dn\log (n))\).

Finally, we discuss whether the approximation bounds are achieved. We only consider the cases where \(r_j<n_j\). We first consider Algorithm C and take \(d=3\) as an example. In (3.13), achieving \(\lambda _{\max }(A_2)=\frac{\Vert A_2\Vert _F}{\sqrt{n_2}}\) requires that \(\mathrm{rank}(A_2)=n_2\) (assuming that \(n_2\le n_3\)) and that all the singular values are the same. If these are true, then \(\Vert A_2^{\top }{\mathbf {x}}_2^0\Vert > \sqrt{ \frac{r_2}{n_2} }\lambda _{\max }(A_2)\). Thus the bound of Algorithm C may not be tight. The approximation ratio of Algorithm D relies on Lemma 3.2. However, there do not exist matrices achieving the approximation ratio \(\sqrt{\frac{r}{m^2}}\) in Lemma 3.2, implying that the bound of Algorithm D may not be tight either.

For Algorithm A, if step 3 is not executed (use \({\mathbf {e}}_1^{\bar{i}_1}, \ldots , {\mathbf {e}}_{d-1}^{\bar{i}_{d-1}}, {\mathbf {x}}_d^0\) obtained in step 2 as the output), then the bound is tight: Consider \({\mathcal {A}}\in {\mathbb {R}}^{n_1\times n_2\times n_3}\) with \(n_1=n_2=n_3=n\) being an all-one tensor and let \(1< r_1=r_2=r_3 = r<n\); then \(v^{\mathrm{opt}} = r^{\frac{3}{2}}\). The generated feasible point is \({\mathbf {x}}^0_j = {\mathbf {e}}^1_j\), \(j=1,2\), and \({\mathbf {x}}^0_3\) is the vector whose first r entries are \(\frac{1}{\sqrt{r}}\) and whose remaining ones are zero. Then \({\mathcal {A}}{\mathbf {x}}^0_1{\mathbf {x}}^0_2{\mathbf {x}}^0_3/v^\mathrm{opt} = \frac{1}{r}\), achieving the bound. For Algorithm B, in the case \(d=3\), if \({\mathbf {x}}_1^0\) is not updated by step 3 (use \({\mathbf {e}}_1^{\bar{i}_1}\) obtained in step 2 as the output), then the bound can be tight: Consider \({\mathcal {A}}\in {\mathbb {R}}^{4\times 4\times 4}\), with \({\mathcal {A}}(i,:,:) = A:= \left[ {\begin{matrix} 0&1&0&1\\ 0&1&0&1\\ 1& 0&1&0\\ 1& 0&1&0\\ \end{matrix}}\right] \), \(i=1,\ldots ,4\), and let \(r_1=r_2=r_3=2\). Then \(v^{\mathrm{opt}}= 2\sqrt{2}\). Applying Algorithm B to \({\mathcal {A}}\) yields \(\bar{i}_1=1\), with \(\bar{{\mathbf {x}}}_3=[1~1~1~1]^{\top }/2\); then \({\mathbf {x}}_3^0= [1~1~0~0]^{\top }/\sqrt{2}\), and \({\mathbf {x}}_2^0={\mathbf {x}}_3^0\). Finally, \( {{\mathcal {A}}{\mathbf {e}}_1^1{\mathbf {x}}_2^0{\mathbf {x}}_3^0 }/{v^{\mathrm{opt}}}= \frac{1}{2\sqrt{2}} = \sqrt{\frac{r^2}{n^2 r}}\). In summary, the reason why the bounds may not be achieved is that although we derive worst-case inequalities in every step of the analysis, after putting them together, the final approximation bounds might not be tight. How to improve them still needs further research.

4 Numerical experiments

We evaluate the proposed algorithms in this section on synthetic and real data. All the computations are conducted on an Intel i7 CPU desktop computer with 16 GB of RAM. The supporting software is Matlab R2019a. Tensorlab [33] is employed for basic tensor operations.

Performance of approximation algorithms We first compare Algorithms A, B, C, and D on solving (2.3). The tensor is given by

$$\begin{aligned} {\mathcal {A}} = \sum ^{R}_{i=1} {\mathbf {x}}_{1,i}\circ \cdots \circ {\mathbf {x}}_{d,i}, \end{aligned}$$
(4.15)

where \({\mathbf {x}}_{j,i}\in {\mathbb {R}}^{n_j}\), \(i=1,\ldots ,R\), \(j=1,\ldots ,d\), and we let \(R=10\). Here the vectors are first randomly drawn from the normal distribution, and then \(sr = 70\%\) of the entries are randomly set to be zero. We set \(d=3,4\), and let \(n_1=\cdots =n_d=n\) with n varying from 5 to 100. For each case, we randomly generated 50 instances, and the averaged results are presented. \(r_j = \lfloor (1-sr)n_j\rfloor \) for each j in (2.3). As a baseline, we also randomly generate feasible points \(({\mathbf {x}}^\mathrm{random}_1,\ldots ,{\mathbf {x}}^\mathrm{random}_d)\) and evaluate its performance. On the other hand, we easily see that

$$\begin{aligned} v^\mathrm{ub} := \min \{\lambda _{\max }(A_{(1)}),\ldots ,\lambda _{\max }(A_{(d)}) \} \end{aligned}$$
(4.16)

is an upper bound for problem (2.3), where \(A_{(j)} = \texttt {reshape}({\mathcal {A}},n_j,\prod ^d_{k\ne j}n_k)\) denotes the j-mode unfolding of \({\mathcal {A}}\). Thus we also evaluate \(v^\mathrm{ub}\). The results are depicted in Fig. 1, where the left panels show the curves of the objective values \({\mathcal {A}}{\mathbf {x}}^{0}_1\cdots {\mathbf {x}}^0_d\) versus n for the different algorithms, whose colors are respectively cyan (Algorithm A), magenta (Algorithm B), blue (Algorithm C), and red (Algorithm D); the curve of the random value \({\mathcal {A}}{\mathbf {x}}^\mathrm{random}_1\cdots {\mathbf {x}}^\mathrm{random}_d\) is in black with hexagram markers, while the curve of the upper bound \(v^\mathrm{ub}\) is in black with diamond markers. The right panels plot the curves of CPU time versus n.
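For concreteness, the following hedged Matlab sketch shows one way to generate a tensor of the form (4.15) and to evaluate the upper bound (4.16) for \(d=3\); the specific size and all names are our assumptions.

```matlab
% Hedged sketch: generate a sparse low-rank tensor as in (4.15) and evaluate
% the upper bound (4.16); all names are ours.
n = 50; d = 3; R = 10; sr = 0.7;
A = zeros(n, n, n);
for i = 1:R
    x = cell(d, 1);
    for j = 1:d
        v = randn(n, 1);
        v(randperm(n, round(sr*n))) = 0;           % randomly zero out sr of the entries
        x{j} = v;
    end
    A = A + reshape(kron(x{3}, kron(x{2}, x{1})), n, n, n);   % add x1 o x2 o x3
end
% upper bound (4.16): smallest leading singular value among the mode unfoldings
vub = inf;
for j = 1:d
    Aj = reshape(permute(A, [j, setdiff(1:d, j)]), n, []);    % j-mode unfolding
    vub = min(vub, svds(Aj, 1));
end
```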

Fig. 1 Comparisons of Algorithms A, B, C, and D for solving (2.3) where \({\mathcal {A}}\) is given by (4.15). n varies from 5 to 100. Left panels: objective value \({\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d\) versus n; right panels: CPU time

From the left panels, we observe that the objective values generated by Algorithms A, B, C, and D are similar, with Algorithm C performing better; Algorithm D performs second best when \(d=4\) and is comparable with Algorithm B when \(d=3\); Algorithm A gives the worst results, which may be because Algorithm A does not exploit the structure of the problem as much as the others. We also observe that the objective values of all the algorithms are quite close to the upper bound (4.16), which demonstrates the effectiveness of the proposed algorithms. In fact, the ratio \(\frac{{\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d}{v^\mathrm{ub}}\) lies in (0.7, 1), which is far better than the approximation ratios presented in Sect. 3. This implies that, at least for this kind of tensors, the approximation ratios might be independent of the size of the tensor. The value \({\mathcal {A}}{\mathbf {x}}^\mathrm{random}_1\cdots {\mathbf {x}}^\mathrm{random}_d\) is close to zero (the curve almost coincides with the x-axis). Concerning the computational time, Algorithm D is the most efficient one, confirming the theoretical results in Table 1. Algorithm C is the second most efficient one. Algorithms A and B do not perform well compared with Algorithms C and D, although their computational complexity is similar in theory. The reason may be that the first two algorithms require for-loop operations, which are time-consuming in Matlab.

Fig. 2 Comparisons of Algorithms A, B, C, and D for solving (2.3) with fixed n and sparsity ratio varying from 10 to \(90\%\). Figure 2a: \({\mathcal {A}}\in {\mathbb {R}}^{100\times 100\times 100}\); Fig. 2b: \({\mathcal {A}}\in {\mathbb {R}}^{100\times 100\times 100\times 100}\). Left panels: the objective values \({\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d\) versus sparsity ratio; right panels: CPU time

We then fix \(n=100\), vary the sparsity ratio sr of \({\mathcal {A}}\) in (4.15) from 10 to \(90\%\), and compare the performance of the four proposed algorithms. \(r_j\) in (2.3) is set to \(\lfloor (1-sr) n_j\rfloor \) correspondingly. The objective values \({\mathcal {A}}{\mathbf {x}}^0_1\cdots {\mathbf {x}}^0_d\) together with the upper bound are depicted in the left panels of Fig. 2, from which we still observe that all the algorithms are close to the upper bound (4.16); among them, Algorithm C is still slightly better than the other three, followed by Algorithms B and D. The CPU time is plotted in the right panels of Fig. 2, which still shows that Algorithm D is the most efficient one. In fact, on average, Algorithm D is 5 times faster than Algorithm C, which ranks second, and is about 80 times faster than Algorithm B, which is the slowest one.

Overall, compared with Algorithms A and B, which find the solutions fiber by fiber or slice by slice, Algorithm C admits a hierarchical structure that takes the whole tensor into account, and so it can better exploit the structure of the data tensor. This may be the reason why Algorithm C is better than Algorithms A and B in terms of effectiveness. When compared with Algorithm D, Algorithm C computes each \({\mathbf {x}}^0_j\) “optimally” via SVD, while Algorithm D computes \({\mathbf {x}}^0_j\) “sub-optimally” but more efficiently; this explains why Algorithm C performs better than Algorithm D. Concerning the efficiency, Algorithms A and B require for-loop operations, which are known to be slow in Matlab. As a result, although Algorithms B and C have similar computational complexity in theory, their running times after implementation are quite different.

Fig. 3 Performances of Algorithm C + AM, Algorithm D + AM and Random + AM for solving (2.3) where \({\mathcal {A}}\) is given by (4.15). n varies from 10 to 100. The CPU time counts both that of approximation algorithms and AM

Performance of approximation plus iterative algorithms In this part, we first use approximation algorithms to generate \(({\mathbf {x}}^0_1,\ldots ,{\mathbf {x}}^0_d)\), and then use it as an initializer for iterative algorithms. The goal is to see if approximation algorithms can help in improving the solution quality of iterative algorithms. The iterative algorithm used for solving problem (2.3) is simply an alternating maximization method (termed AM in the sequel) with the scheme

$$\begin{aligned} \boxed { \mathrm{(AM)}~~~~ {\mathbf {x}}^{k+1}_j \in \arg \max \nolimits ~ {{\mathcal {A}}}{{\mathbf {x}}_1^{k+1}\cdots {\mathbf {x}}^{k+1}_{j-1}{\mathbf {x}}_j {\mathbf {x}}^k_{j+1} \cdots {\mathbf {x}}_d^k}~\mathrm{s.t.}~ \left\| {\mathbf {x}}_j \right\| =1, \left\| {\mathbf {x}}_j \right\| _0 \le r_j} \end{aligned}$$

for \(j=1,\ldots ,d\) and \(k=1,2,\ldots \). The stopping criterion used for AM is \(\max _j\{ \left\| {\mathbf {x}}_j^{k+1} -{\mathbf {x}}_j^k \right\| \}\le 10^{-5}\), or \(k\ge 2000\). We employ Algorithms C and D in this part, and denote the approximation plus iterative algorithms as Algorithm C + AM and D + AM in the sequel. As a baseline, we also evaluate AM initialized by randomly generated feasible point \(({\mathbf {x}}^\mathrm{random}_1,\ldots ,{\mathbf {x}}^\mathrm{random}_d)\), which is generated the same as the previous part. This algorithm is denoted as Random + AM. The data tensors are also (4.15), where 50 instances are randomly generated for each n. \(r_j=\lfloor 0.3n_j\rfloor \). The objective values \({\mathcal {A}}{\mathbf {x}}^\mathrm{out}_1\cdots {\mathbf {x}}^\mathrm{out}_d\) and CPU time for third- and fourth-order tensors with n varying from 10 to 100 are plotted in Fig. 3a, b, where \(({\mathbf {x}}^\mathrm{out}_1,\ldots ,{\mathbf {x}}^\mathrm{out}_d)\) is the output of AM. Here the CPU time counts both that of approximation algorithms and AM. Figure 3c depicts the number of iterations of AM initialized by different strategies. Algorithm C + AM is in blue, D + AM is in red, and Random + AM is in black.
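For completeness, a hedged Matlab sketch of the AM scheme for \(d=3\) is given below; by (3.4) each subproblem is solved in closed form by truncating and normalizing the corresponding partial gradient. All names are ours.

```matlab
% Hedged sketch of AM for d = 3; each subproblem is solved by truncating and
% normalizing the corresponding partial gradient, cf. (3.4). Names are ours.
function [x1, x2, x3] = am_sparse(A, r1, r2, r3, x1, x2, x3, tol, maxit)
[n1, n2, n3] = size(A);
A1 = reshape(A, n1, n2*n3);
for k = 1:maxit
    y1 = x1; y2 = x2; y3 = x3;
    x1 = trunc_norm(A1 * kron(x3, x2), r1);        % gradient A x2 x3
    H  = reshape(A1' * x1, n2, n3);                % A x1
    x2 = trunc_norm(H * x3, r2);                   % gradient A x1 x3
    x3 = trunc_norm(H' * x2, r3);                  % gradient A x1 x2
    if max([norm(x1 - y1), norm(x2 - y2), norm(x3 - y3)]) <= tol
        break;
    end
end
end

function x = trunc_norm(a, r)   % keep the r largest-magnitude entries, then normalize
[~, p] = sort(abs(a), 'descend'); x = zeros(size(a)); x(p(1:r)) = a(p(1:r)); x = x / norm(x);
end
```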

Fig. 4 Performances of Algorithm C + AM (\(\ell _1\)), Algorithm D + AM (\(\ell _1\)) and Random + AM (\(\ell _1\)) for solving (4.17) where \({\mathcal {A}}\) is given by (4.15). n varies from 10 to 100

In terms of objective value, we see from Fig. 3a, b that Algorithm C + AM performs the best, while Algorithm D + AM is slightly worse. Both of them are better than Random + AM, demonstrating that approximation algorithms can indeed help in improving the solution quality of iterative algorithms. Considering the efficiency, Algorithm D + AM is the best among the three, followed by Random + AM. This further demonstrates the advantage of approximation algorithms. In fact, from Fig. 3c, we can see that both Algorithms C and D can help in reducing the number of iterations of AM. However, as observed in the previous part, Algorithm C is more time-consuming than Algorithm D, which makes Algorithm C + AM the slowest one.

We also try to use Algorithms C and D to initialize AM for the \(\ell _1\) regularized model [1]:

(4.17)

where \(\rho _j=0.2\) is set in the experiment. The algorithms are denoted as Algorithm C + AM (\(\ell _1\)), Algorithm D + AM (\(\ell _1\)) and Random + AM (\(\ell _1\)). The results are plotted in Fig. 4, from which we observe similar results as those of Fig. 3; in particular, the efficiency of approximation algorithms plus AM (\(\ell _1\)) is significantly better than Random + AM (\(\ell _1\)), and their number of iterations is much more stable.

Overall, based on the observations of this part, compared with random initializations, iterative algorithms initialized by the approximate solutions generated by the approximation algorithms are superior both in terms of solution quality and running time.

Sparse tensor clustering Tensor clustering aims to cluster matrix or tensor samples into their underlying groups: for N samples \({\mathcal {A}}_i \in \mathbb R^{n_1\times \cdots \times n_d} ,i = 1,\dots ,N\) with \(d\ge 2\) and a given number of clusters \(K\ge 2\), a clustering \(\psi (\cdot )\) is defined as a mapping \(\psi :{\mathbb {R}}^{n_1\times \dots \times n_d}\rightarrow \{1,\dots ,K\}\). Several tensor methods have been proposed for tensor clustering; see, e.g., [26, 31, 32]. Usually, one first performs a dimension reduction on the samples by means of (sparse) tensor decomposition, and then applies classic methods such as K-means to the reduced samples for clustering. Here, we use a deflation method for tensor decomposition, with the proposed approximation algorithms serving as initialization procedures for the AM subroutines. The whole method for tensor clustering is presented in Algorithm STC, which has some similarities to [31, Algorithm 2]. We respectively use STC (A), STC (B), STC (C), STC (D) to distinguish AM involved in STC initialized by the different approximation algorithms. We also denote by STC (Random) the variant of STC in which AM is randomly initialized.

[Algorithm STC (pseudocode figure)]

Denote \(\psi _0\) as the true clustering mapping and let |S| represent the cardinality of a set S. The following metric is commonly used to measure clustering performance [34]:

$$\begin{aligned} \mathrm{cluster~ err.} := \genfrac(){0.0pt}1{N}{2}^{-1}\left| \left\{ \{i,j\}:\big (\psi ({\mathcal {A}}_i)=\psi ({\mathcal {A}}_j)\big )\not =\big (\psi _0({\mathcal {A}}_i)=\psi _0 ({\mathcal {A}}_j)\big ),\, i<j\right\} \right| . \end{aligned}$$
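A hedged Matlab sketch of this metric, assuming the estimated and true clusterings are stored as length-N label vectors (names are ours):

```matlab
% Hedged sketch of the clustering-error metric; psi and psi0 are length-N
% label vectors (estimated and true cluster assignments). Names are ours.
function err = cluster_err(psi, psi0)
N = numel(psi);
same  = psi(:)  == psi(:)';      % indicator of psi(A_i) = psi(A_j)
same0 = psi0(:) == psi0(:)';
mask  = triu(true(N), 1);        % pairs {i,j} with i < j
err   = sum(same(mask) ~= same0(mask)) / nchoosek(N, 2);
end
```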

Synthetic data This experiment is similar to [31, Sect. 5.2]. Consider N sparse matrix samples \(\{A_i\in \mathbb {R}^{n_1\times n_2},i = 1,\dots ,N\}\) with \(n_1=n_2=20\) as follows:

$$\begin{aligned}&A_1 = \dots = A_{\frac{N}{4}} ~= \mu ^3\cdot \begin{bmatrix} \Sigma &{}~&{}~ \\ ~&{} -\Sigma &{} ~\\ ~&{} ~ &{} {\mathbf {0}} \end{bmatrix}, A_{\frac{N}{4}+1} = \dots = A_{\frac{N}{2}} ~= \mu ^3\cdot \begin{bmatrix} \Sigma &{} ~&{}~ \\ ~&{} \Sigma &{} ~\\ ~&{}~ &{} {\mathbf {0}} \end{bmatrix}\\&A_{\frac{N}{2}+1} = \dots = A_{\frac{3N}{4}} = \mu ^3\cdot \begin{bmatrix} -\Sigma &{} &{} \\ &{} \Sigma &{} \\ &{} &{} {\mathbf {0}} \end{bmatrix}, A_{\frac{3N}{4}+1} = \dots = A_{N} = \mu ^3\cdot \begin{bmatrix} -\Sigma &{} &{} \\ &{} -\Sigma &{} \\ &{} &{} {\mathbf {0}} \end{bmatrix}, \end{aligned}$$

where \(\Sigma ={\mathbf {v}}{\mathbf {v}}^{\top } \in {\mathbb {R}}^{4\times 4}\) with \({\mathbf {v}} = [1,-1,0.5,-0.5]^{\top }\), and \({\mathbf {0}}\) is a zero matrix of size \(12\times 12\). We stack these samples into a third order tensor \(\mathcal {T}\in \mathbb {R}^{n_1\times n_2 \times N}\) with \(\mathcal T(:,:,i)=A_i\). We apply STC (A), STC (B), STC (C), STC (D), and the vanilla K-means to the tensor \(\mathcal T^* = \frac{\mathcal T}{||\mathcal T||_F} + \sigma \frac{\mathcal E}{||\mathcal E||_F}\) where \(\mathcal E\) is a noisy tensor and \(\sigma \) is the noise level. Here vanilla K-means stands for directly applying K-means to the vectorizations of \(A_i\)’s without tensor methods. We vary the sample number \(N = \{20,40\}\) and noise level \(\sigma = \{0.1,0.5,0.9\}\), and set \(\mu = 0.5\). To select parameters, we apply a similar Bayesian Information Criterion as [1]: For a set of parameter combinations defined as the cartesian product of the set \( \mathfrak {R}\) of given ranks and the sets \( \mathfrak {r}\) of tuning parameters, namely, \( \mathfrak {R}\times \mathfrak {r} = \{(R,\varvec{r})|R\in \mathfrak {R},\varvec{r}\in \mathfrak {r}\}\), we select \(( R^*, {\varvec{r}^*})=\arg \min _{\mathfrak {R}\times \mathfrak {r}} BIC(R,\varvec{r})\) with

(4.18)

where \( {\alpha }_l\) and \({\mathbf {x}}^l_j\) are computed by Algorithm STC.

For the tuning procedure, we set \( \mathfrak {R} = \{4,6\}\) and \(\mathfrak {r}= \{(7,7,N),(8,8,N)\}\). For the number K of clusters, we employ the widely used evalclusters function in Matlab, which evaluates each proposed number of clusters in a set \(\{K_1,\cdots ,K_m\}\) and selects the smallest number of clusters satisfying \(Gap(K)\ge Gap(K+1) - SE(K+1),\) where Gap(K) is the gap value for the clustering solution with K clusters, and \(SE(K + 1)\) is the standard error of the clustering solution with \(K + 1\) clusters. The results, averaged over 50 instances in each case, are reported in Table 2, where bold means that the cluster error is the lowest among all the methods.
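A hedged sketch of this selection step in Matlab, assuming the reduced samples are stored row-wise in a matrix X (X and the candidate list are our assumptions):

```matlab
% Hedged sketch of selecting K via the gap criterion and then clustering;
% X holds the reduced samples row-wise (an assumption of this sketch).
eva = evalclusters(X, 'kmeans', 'gap', 'KList', 2:10);
K   = eva.OptimalK;              % smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
idx = kmeans(X, K);              % final cluster assignment
```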

Table 2 Sparse tensor clustering via STC (A), STC (B), STC (C), STC (D) on synthetic data with different size N and noise level \(\sigma \)

Table 2 shows that the cluster error of STC with any of the approximation algorithms is smaller than that of the vanilla K-means in all cases, and is even zero in some cases when the noise level is not high; the best one is STC (C), followed by STC (D). This shows the accuracy and robustness of our method. Considering the CPU time, among the four tensor methods, STC (D) is the most efficient one, followed by STC (C), while STC (B) needs more time than the other three. However, the computational time of the tensor methods is not as good as that of the vectorized K-means, because of the increasing cost of the tuning procedure via (4.18) with a pre-specified set of parameter combinations.

Real data We test the clustering performance of STC (D), STC (Random), and the vanilla K-means on the Columbia Object Image Library COIL-20 [23] for image clustering. The data contains 20 objects viewed from varying angles, and each image is of size \(128\times 128\). The images are in grayscale, and the background of the objects is black, so that the images can be seen as sparse matrices. In this experiment, we consider \(K = \{5, 6, 7, 8, 9, 10\}\) objects and pick 36 images from each object for clustering, giving \(\mathcal T\in {\mathbb {R}}^{128\times 128\times 36K}\). We still tune the parameters of Algorithm STC via (4.18) and the evalclusters function in Matlab, and set \( \mathfrak {R} = \{40,45\}\) and \(\mathfrak {r}= \{(75,75,36K),(80,80,36K)\}\). The experiment has been repeated 20 times for each K, and the averaged results are shown in Table 3.

Table 3 Sparse tensor clustering via STC (D), STC (Random) and vanilla K-means on COIL-20 with different K

Table 3 shows that the cluster error of tensor methods is still better than that of the vanilla K-means, while STC (D) is slightly better than STC (Random). On the other hand, the computational time of STC (D) is less than that of STC (Random). However, it still needs more time than that of the vanilla K-means. There are two possible reasons for this phenomenon: first, it takes more time to tune parameters via (4.18); and second, our approach deals directly with data tensors instead of reshaping them into vectors, which preserves their intrinsic structures but naturally increases the computational complexity at the same time.

Overall, tensor methods armed with an approximation algorithm to generate good initializers show their ability to better exploit the data structure for clustering, and it would be interesting to further improve their efficiency.

5 Conclusions

The sparse tensor BR1Approx problem can be seen as a sparse generalization of the dense tensor BR1Approx problem and a higher-order extension of the matrix BR1Approx problem. In the literature, little attention has been paid to approximation algorithms for sparse tensor BR1Approx. To fill this gap, four approximation algorithms were developed in this work, which are of low computational complexity, easily implemented, and all admit theoretically guaranteed approximation bounds. Some of the proposed algorithms and the associated approximation bounds generalize their matrix or dense tensor counterparts, while Algorithm D, which is the most efficient one, is new even when reduced to the matrix or dense tensor cases. Numerical experiments on synthetic as well as real data showed the effectiveness and efficiency of the developed algorithms; in particular, we observed that compared with random initializations, our algorithms can improve the performance of iterative algorithms for solving the problem in question. Possible future work is to design algorithms with better approximation ratios; following [12, Theorem 4.3], we conjecture that the best ratio might be \(O(\sqrt{\prod ^{d}_{j=1}\frac{r_j}{n_j}} \cdot \sqrt{\prod ^{d-2}_{j=1} \frac{\ln n_j }{ n_j} } )\). On the other hand, Qi defined and studied the best rank-1 approximation ratio of a tensor space [27]; it would be interesting to extend this notion to the sparse best rank-1 approximation setting and study its bounds.