1 Introduction

In practical applications, high-dimensional data, such as color images, hyperspectral images, and videos, often exhibit a low-rank structure. Low-rank tensor approximation has become a general tool for compressing and approximating high-dimensional data and is widely used in scientific computing, machine learning, signal/image processing, data mining, and many other fields [1]. Classical low-rank tensor factorization models include, e.g., the Canonical Polyadic (CP) decomposition [2, 3], the Tucker decomposition [4,5,6], the Hierarchical Tucker (HT) decomposition [7, 8], and the Tensor Train (TT) decomposition [9]. This paper focuses on the low-rank Tucker decomposition, also known as the low multilinear rank approximation of tensors. When the target rank of the Tucker decomposition is much smaller than the original dimensions, the decomposition offers good compression performance. For a given Nth-order tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\), the low-rank Tucker decomposition can be formulated as the following optimization problem:

$$\begin{aligned} \min _{{\varvec{\mathcal {Y}}}}\Vert {\varvec{\mathcal {X}}}-{\varvec{\mathcal {Y}}}\Vert _F^2 \end{aligned}$$
(1)

where \({\varvec{\mathcal {Y}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\) satisfies \(\texttt{rank}(\textbf{Y}_{(n)})\le r_n\) for \(n=1,2,\ldots ,N\), \(\textbf{Y}_{(n)}\) is the mode-n unfolding matrix of \({\varvec{\mathcal {Y}}}\), and \(r_n\) is the prescribed target rank for mode n.

For the Tucker approximation of higher-order tensors, the most frequently used non-iterative algorithms are the higher-order singular value decomposition (HOSVD) [5] and its refinements, namely the truncated higher-order SVD (THOSVD) [10] and the sequentially truncated higher-order SVD (STHOSVD) [11]. Although the results of THOSVD and STHOSVD are usually sub-optimal, they can be used as reasonable initial solutions for iterative methods such as higher-order orthogonal iteration (HOOI) [10]. However, both algorithms rely directly on the SVD when computing the singular vectors of intermediate matrices, which requires large memory and high computational cost when the tensor is large.

Notably, randomized algorithms can reduce the communication among different levels of memory and are easy to parallelize. In recent years, many researchers have become increasingly interested in randomized algorithms for computing approximate Tucker decompositions of large-scale data tensors [12,13,14,15,16,17, 19, 20]. For example, Zhou et al. [12] proposed a randomized version of the HOOI algorithm for Tucker decomposition. Che and Wei [13] proposed an adaptive randomized algorithm for computing low multilinear rank approximations of tensors. Minster et al. [14] designed randomized versions of the THOSVD and STHOSVD algorithms, i.e., R-THOSVD and R-STHOSVD. Sun et al. [17] presented a single-pass randomized algorithm to compute the low-rank Tucker approximation of tensors based on a practical matrix sketching algorithm for streaming data; see also [18] for more details. For a detailed review of further randomized algorithms for Tucker decomposition proposed in recent years, involving, e.g., random projection, sampling, count-sketch, randomized least-squares, single-pass, and multi-pass algorithms, please refer to [15, 16, 19, 20].

This paper presents two efficient randomized algorithms for computing the low-rank Tucker approximation of tensors, i.e., Sketch-STHOSVD and sub-Sketch-STHOSVD, summarized in Algorithms 6 and 8, respectively. The main contributions of this paper are threefold. Firstly, we propose a new one-pass sketching algorithm (i.e., Algorithm 6) for low-rank Tucker approximation, which can significantly improve the computational efficiency of STHOSVD. Secondly, we present a new matrix sketching algorithm (i.e., Algorithm 7) by combining the two-sided sketching algorithm proposed by Tropp et al. [18] with subspace power iteration. Algorithm 7 can accurately and efficiently compute the low-rank approximation of large-scale matrices. Thirdly, the proposed Algorithm 8 incorporates subspace power iteration and thereby delivers a more accurate Tucker approximation than simpler randomized algorithms. More importantly, sub-Sketch-STHOSVD converges quickly for any data tensor, independently of singular value gaps.

The rest of this paper is organized as follows. Section 2 briefly introduces some basic notations, definitions, and tensor-matrix operations used in this paper and recalls some classical algorithms, including THOSVD, STHOSVD, and R-STHOSVD, for low-rank Tucker approximation. Our proposed two-sided sketching algorithm for STHOSVD is given in Sect. 3. In Sect. 4, we present an improved algorithm with subspace power iteration. The effectiveness of the proposed algorithms is validated thoroughly in Sect. 5 by numerical experiments on synthetic and real-world data tensors. We conclude in Sect. 6.

2 Preliminary

2.1 Notations and Basic Operations

The common symbols used in this paper are listed in Table 1.

Table 1 Common symbols used in this paper

We denote an Nth-order tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\) with entries given by \(\textbf{x}_{i_1,i_2,\ldots ,i_N},1\le i_n\le I_n, n=1,2,\ldots ,N.\) The Frobenius norm of \({\varvec{\mathcal {X}}}\) is defined as

$$\begin{aligned} \Vert {\varvec{\mathcal {X}}}\Vert _F=\sqrt{\sum _{i_1,i_2,\ldots ,i_N}^{I_1,I_2,\ldots ,I_N}{} \textbf{x}_{i_1,i_2,\ldots ,i_N}^2} \ \end{aligned}$$
(2)

The mode-n tensor-matrix multiplication is a frequently encountered operation in tensor computation. The mode-n product of a tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\) by a matrix \(\textbf{A}\in \mathbb {R}^{K\times I_n}\) (with entries \(\textbf{a}_{k,i_n}\)) is denoted as \({\varvec{\mathcal {Y}}}={\varvec{\mathcal {X}}}\times _{n}\textbf{A}\in \mathbb {R}^{I_1\times ...\times I_{n-1}\times K \times I_{n+1}\times ...\times I_N}\), with entries:

$$\begin{aligned} \textbf{y}_{i_1,\ldots ,i_{n-1},k,i_{n+1},\ldots ,i_N}=\sum _{i_n=1}^{I_n}{} \textbf{x}_{i_1,\ldots ,i_{n-1},i_n,i_{n+1},\ldots ,i_N}{} \textbf{a}_{k,i_n} \end{aligned}$$
(3)

The mode-n matricization of a higher-order tensor is the reordering of the tensor elements into a matrix. The columns of the mode-n unfolding matrix \(\textbf{X}_{(n)}\in \mathbb {R}^{I_n\times (\prod _{m\ne n}I_m)}\) are the mode-n fibers of the tensor \({\varvec{\mathcal {X}}}\). More specifically, element \((i_1,i_2,\ldots ,i_N)\) of \({\varvec{\mathcal {X}}}\) maps to element \((i_n,j)\) of \(\textbf{X}_{(n)}\), where

$$\begin{aligned} j=1+\sum _{k=1,k\ne n}^{N}\left[ (i_k-1)\prod _{m=1,m\ne n}^{k-1}I_m\right] \end{aligned}$$
(4)
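To make these operations concrete, the following NumPy sketch (our own illustration; the helper names unfold, fold and mode_n_product are not from the paper) implements the mode-n unfolding in (4) and the mode-n product in (3), and checks the identity \(({\varvec{\mathcal {X}}}\times _n\textbf{A})_{(n)}=\textbf{A}{} \textbf{X}_{(n)}\).

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding (4): the mode-n fibers become the columns of X_(n).

    Moving axis n to the front and reshaping in Fortran order reproduces the
    column ordering j = 1 + sum_{k != n} (i_k - 1) * prod_{m < k, m != n} I_m.
    """
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def fold(M, n, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from its mode-n unfolding."""
    full_shape = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(np.reshape(M, full_shape, order="F"), 0, n)

def mode_n_product(X, A, n):
    """Mode-n product (3): Y = X x_n A, computed as A @ X_(n) followed by refolding."""
    Y_shape = list(X.shape)
    Y_shape[n] = A.shape[0]
    return fold(A @ unfold(X, n), n, tuple(Y_shape))

# Sanity check: (X x_n A)_(n) = A X_(n).
X = np.random.randn(3, 4, 5)
A = np.random.randn(6, 4)
Y = mode_n_product(X, A, 1)
print(np.allclose(unfold(Y, 1), A @ unfold(X, 1)))  # True
```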

Let the rank of the mode-n unfolding matrix \(\textbf{X}_{(n)}\) be \(r_n\). The integer array \((r_1,r_2,\ldots ,r_N)\) is then the Tucker rank of the Nth-order tensor \({\varvec{\mathcal {X}}}\), also known as the multilinear rank. The Tucker decomposition of \({\varvec{\mathcal {X}}}\) with rank \((r_1,r_2,\ldots ,r_N)\) is expressed as

$$\begin{aligned} {\varvec{\mathcal {X}}}={\varvec{\mathcal {G}}}\times _1\textbf{U}^{(1)}\times _2\textbf{U}^{(2)}\ldots \times _N\textbf{U}^{(N)} \end{aligned}$$
(5)

where \({\varvec{\mathcal {G}}}\in \mathbb {R}^{r_1\times r_2\times \ldots \times r_N}\) is the core tensor, and \(\{\textbf{U}^{(n)}\}_{n=1}^{N}\) with \(\textbf{U}^{(n)}\in \mathbb {R}^{I_n\times r_n}\) are the mode-n factor matrices. We denote an optimal rank-\((r_1,r_2,\ldots ,r_N)\) approximation of a tensor \({\varvec{\mathcal {X}}}\) as \(\hat{{\varvec{\mathcal {X}}}}_\textrm{opt}\), i.e., the optimal Tucker approximation obtained by solving the minimization problem in (1).

Below we present the definitions of some concepts used in this paper.

Definition 1

(Kronecker products) The Kronecker product of matrices \(\textbf{A}\in \mathbb {R}^{m\times n}\) and \(\textbf{B}\in \mathbb {R}^{k\times l}\) is defined as

$$\begin{aligned} \textbf{A}\otimes \textbf{B}=\begin{bmatrix} \textbf{a}_{11}\textbf{B} &{} \textbf{a}_{12}\textbf{B} &{} \cdots &{} \textbf{a}_{1n}\textbf{B} \\ \textbf{a}_{21}\textbf{B} &{} \textbf{a}_{22}\textbf{B} &{} \cdots &{} \textbf{a}_{2n}\textbf{B} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \textbf{a}_{m1}\textbf{B} &{} \textbf{a}_{m2}\textbf{B} &{} \cdots &{} \textbf{a}_{mn}\textbf{B} \end{bmatrix}\in \mathbb {R}^{mk\times nl} \end{aligned}$$
(6)

The Kronecker product helps express Tucker decomposition. The Tucker decomposition in (5) implies:

$$\begin{aligned} \textbf{X}_{(n)}=\textbf{U}^{(n)}{} \textbf{G}_{(n)}(\textbf{U}^{(N)}\otimes ...\otimes \textbf{U}^{(n+1)}\otimes \textbf{U}^{(n-1)}\otimes ...\otimes \textbf{U}^{(1)} )^\top \end{aligned}$$
(7)
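As a quick sanity check of identity (7), the following self-contained NumPy snippet (a third-order example of our own, not from the paper) builds a Tucker tensor and verifies that its mode-1 unfolding equals \(\textbf{U}^{(1)}{} \textbf{G}_{(1)}(\textbf{U}^{(3)}\otimes \textbf{U}^{(2)})^\top \).

```python
import numpy as np

# Build a small third-order Tucker tensor X = G x_1 U1 x_2 U2 x_3 U3 via einsum.
r1, r2, r3 = 2, 3, 4
I1, I2, I3 = 5, 6, 7
G = np.random.randn(r1, r2, r3)
U1 = np.random.randn(I1, r1)
U2 = np.random.randn(I2, r2)
U3 = np.random.randn(I3, r3)
X = np.einsum("abc,ia,jb,kc->ijk", G, U1, U2, U3)

def unfold(T, n):
    # Mode-n unfolding consistent with (4).
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order="F")

# Identity (7) for mode 1 (0-based axis 0): X_(1) = U1 G_(1) (U3 kron U2)^T.
lhs = unfold(X, 0)
rhs = U1 @ unfold(G, 0) @ np.kron(U3, U2).T
print(np.allclose(lhs, rhs))  # True
```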

Definition 2

(Standard normal matrix) The entries of a standard normal matrix form an independent family of standard normal random variables, i.e., they are drawn independently from the real standard normal distribution (Gaussian with mean zero and variance one).

Definition 3

(Standard Gaussian tensor) The entries of a standard Gaussian tensor are independent random variables following the standard Gaussian distribution.

Definition 4

(Tail energy) The jth tail energy of a matrix \(\textbf{X}\) is defined as

$$\begin{aligned} \tau _{j}^{2}({\textbf{X}}):=\min _{\textrm{rank}(\textbf{Y})<j}\Vert \textbf{X}-\textbf{Y}\Vert _{F}^{2}=\sum _{i\ge j}\sigma _{i}^{2}(\textbf{X}) \end{aligned}$$
(8)
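For concreteness, a minimal NumPy implementation of the tail energy (8) follows; the helper name tail_energy is our own.

```python
import numpy as np

def tail_energy(X, j):
    """j-th tail energy (8): sum of the squared singular values sigma_i^2 for i >= j (1-based j)."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s[j - 1:] ** 2))

A = np.random.randn(50, 30)
print(tail_energy(A, 5))  # equals min_{rank(Y) < 5} ||A - Y||_F^2
```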

2.2 Truncated Higher-Order SVD

Since the actual Tucker rank of a large-scale higher-order tensor is hard to compute, the truncated Tucker decomposition with a pre-determined truncation \((r_1,r_2,\ldots ,r_N)\) is widely used in practice. THOSVD is a popular approach to computing the truncated Tucker approximation, also known as the best low multilinear rank-\((r_1,r_2,\ldots ,r_N)\) approximation, which reads:

$$\begin{aligned} \begin{aligned} \min _{{\varvec{\mathcal {G}}};\textbf{U}^{(1)},\textbf{U}^{(2)},\cdots ,\textbf{U}^{(N)}}\Vert {\varvec{\mathcal {X}}}-{\varvec{\mathcal {G}}}\times _1\textbf{U}^{(1)}\times _2\textbf{U}^{(2)}\cdots \times _N\textbf{U}^{(N)}\Vert _F^2\\ \mathrm{s.t.}\quad \textbf{U}^{(n)\top }{} \textbf{U}^{(n)}=\textbf{I}_{r_n}, n\in \{1,2,\ldots ,N\} \end{aligned} \end{aligned}$$
(9)
figure a

Algorithm 1 summarizes the THOSVD approach, in which each mode is processed individually. The low-rank factor matrix of the mode-n unfolding matrix \(\textbf{X}_{(n)}\) is computed through the truncated SVD, i.e.,

$$\begin{aligned} \textbf{X}_{(n)}=\begin{bmatrix} \textbf{U}^{(n)}&\tilde{\textbf{U}^{(n)}} \end{bmatrix}\begin{bmatrix} \textbf{S}^{(n)}&{} \\ &{} \tilde{\textbf{S}^{(n)}} \end{bmatrix}\begin{bmatrix} \textbf{V}^{(n)\top }\\ \tilde{\textbf{V}^{(n)\top }} \end{bmatrix}\cong \textbf{U}^{(n)} \textbf{S}^{(n)} \textbf{V}^{(n)\top } \end{aligned}$$
(10)

where \(\textbf{U}^{(n)} \textbf{S}^{(n)} \textbf{V}^{(n)\top }\) is a rank-\(r_n\) approximation of \(\textbf{X}_{(n)}\), the orthogonal matrix \(\textbf{U}^{(n)}\in \mathbb {R}^{I_n\times r_n}\) is the mode-n factor matrix of \({\varvec{\mathcal {X}}}\) in Tucker decomposition, \( \textbf{S}^{(n)}\in \mathbb {R}^{r_n\times r_n}\) and \(\textbf{V}^{(n)}\in \mathbb {R}^{ I_1...I_{n-1}I_{n+1}...I_N\times r_n}\). Once all factor matrices have been computed, the core tensor \({\varvec{\mathcal {G}}}\) can be computed as

$$\begin{aligned} {\varvec{\mathcal {G}}}={{\varvec{\mathcal {X}}}\times }_1\textbf{U}^{\left( 1\right) \top }\times _2\textbf{U}^{\left( 2\right) \top }\cdots \times _N\textbf{U}^{\left( N\right) \top }\in \mathbb {R}^{r_1\times r_2\times ...\times r_N} \end{aligned}$$
(11)

Then, the Tucker approximation \(\hat{{\varvec{\mathcal {X}}}}\) of \({\varvec{\mathcal {X}}}\) can be computed:

$$\begin{aligned} \begin{aligned} \hat{{\varvec{\mathcal {X}}}} =&\ {\varvec{\mathcal {G}}}\times _1\textbf{U}^{(1)}\times _2\textbf{U}^{(2)}\cdots \times _N\textbf{U}^{(N)} \\ =&\ {\varvec{\mathcal {X}}}\times _1(\textbf{U}^{(1)}\textbf{U}^{(1)\top })\times _2(\textbf{U}^{(2)}\textbf{U}^{(2)\top })\cdots \times _N(\textbf{U}^{(N)}{} \textbf{U}^{(N)\top }) \end{aligned} \end{aligned}$$
(12)
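The following NumPy sketch illustrates the structure of the THOSVD procedure just described, i.e., equations (10)–(12); it is our own simplified illustration (helper names included), not the authors' implementation of Algorithm 1.

```python
import numpy as np

def unfold(X, n):
    # Mode-n unfolding consistent with (4).
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def mode_n_product(X, A, n):
    # Y = X x_n A, computed via A @ X_(n) followed by refolding.
    shape = list(X.shape)
    shape[n] = A.shape[0]
    other = [s for i, s in enumerate(X.shape) if i != n]
    Y = np.reshape(A @ unfold(X, n), [shape[n]] + other, order="F")
    return np.moveaxis(Y, 0, n)

def thosvd(X, ranks):
    """Truncated HOSVD: factor U^(n) from the truncated SVD of X_(n), then form the core (11)."""
    U = []
    for n, r in enumerate(ranks):
        Un, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        U.append(Un[:, :r])                 # leading r_n left singular vectors
    G = X
    for n, Un in enumerate(U):
        G = mode_n_product(G, Un.T, n)      # core G = X x_1 U1^T x_2 U2^T ... x_N UN^T
    return G, U

def reconstruct(G, U):
    """Tucker approximation (12): X_hat = G x_1 U1 x_2 U2 ... x_N UN."""
    X_hat = G
    for n, Un in enumerate(U):
        X_hat = mode_n_product(X_hat, Un, n)
    return X_hat

X = np.random.randn(20, 25, 30)
G, U = thosvd(X, (5, 6, 7))
print(np.linalg.norm(X - reconstruct(G, U)))
```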

With the notation \(\Delta _{n}^{2}({\varvec{\mathcal {X}}}) \triangleq \sum _{i=r_{n}+1}^{I_{n}}\sigma _{i}^{2}(\textbf{X}_{(n)})\) and \(\Delta _{n}^2{({\varvec{\mathcal {X}}})\le \Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}_\textrm{opt}\Vert _F^2}\) [14], the error-bound for Algorithm 1 can be stated in the following Theorem 1.

Theorem 1

([11], Theorem 5.1) Let \(\hat{{\varvec{\mathcal {X}}}}={\varvec{\mathcal {G}}}\times _1\textbf{U}^{(1)}\times _2\textbf{U}^{(2)}\cdots \times _N\textbf{U}^{(N)}\) be the low multilinear rank-\((r_1,r_2,\ldots ,r_N)\) approximation of a tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\) by THOSVD. Then

$$\begin{aligned} \begin{aligned} \Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}\Vert _{F}^{2}&\le \sum _{n=1}^{N}\Vert {\varvec{\mathcal {X}}}\times _{n}(\textbf{I}_{I_n}-\textbf{U}^{(n)}{} \textbf{U}^{(n)\top })\Vert _{F}^{2}=\sum _{n=1}^{N}\sum _{i=r_{n}+1}^{I_{n}}\sigma _{i}^{2}(\textbf{X}_{(n)})\\&=\sum _{n=1}^{N}\Delta _{n}^{2}({\varvec{\mathcal {X}}})\le N\Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}_\textrm{opt}\Vert _F^2 \end{aligned} \end{aligned}$$
(13)

2.3 Sequentially Truncated Higher-Order SVD

Vannieuwenhoven et al. [11] proposed a more efficient and less computationally complex approach for computing an approximate Tucker decomposition of tensors, called STHOSVD. Unlike the THOSVD algorithm, STHOSVD updates the core tensor as soon as each factor matrix is computed.

Given the target rank \((r_1,r_2,\ldots ,r_N)\) and a processing order \(s_p: \{1,2,\ldots ,N\}\), the minimization problem (1) can be formulated as the following optimization problem:

$$\begin{aligned} \begin{aligned}&\min _{\textbf{U}^{(1)},\cdots ,\textbf{U}^{(N)}}\Vert {\varvec{\mathcal {X}}}-{\varvec{\mathcal {X}}}\times _1(\textbf{U}^{(1)}{} \textbf{U}^{(1)\top })\times _2(\textbf{U}^{(2)}{} \textbf{U}^{(2)\top })\cdots \times _N(\textbf{U}^{(N)}{} \textbf{U}^{(N)\top })\Vert _F^2\\&\quad = \min _{\textbf{U}^{(1)},\cdots ,\textbf{U}^{(N)}}(\Vert {\varvec{\mathcal {X}}}\times _1(\textbf{I}_{I_1}-\textbf{U}^{(1)}{} \textbf{U}^{(1)\top })\Vert _F^2+\Vert \hat{{\varvec{\mathcal {X}}}}^{(1)}\times _2(\textbf{I}_{I_2}-\textbf{U}^{(2)}{} \textbf{U}^{(2)\top })\Vert _F^2 + \\&\qquad \quad \cdots +\Vert \hat{{\varvec{\mathcal {X}}}}^{(N-1)}\times _N(\textbf{I}_{I_N}-\textbf{U}^{(N)}{} \textbf{U}^{(N)\top })\Vert _F^2)\\&\quad = \min _{\textbf{U}^{(1)}}(\Vert {\varvec{\mathcal {X}}}\times _1(\textbf{I}_{I_1}-\textbf{U}^{(1)}{} \textbf{U}^{(1)\top })\Vert _F^2 +\min _{\textbf{U}^{(2)}}(\Vert \hat{{\varvec{\mathcal {X}}}}^{(1)}\times _2(\textbf{I}_{I_2}-\textbf{U}^{(2)}{} \textbf{U}^{(2)\top })\Vert _F^2 + \\&\qquad \quad \min _{\textbf{U}^{(3)}}(\cdots +\min _{\textbf{U}^{(N)}}\Vert \hat{{\varvec{\mathcal {X}}}}^{(N-1)}\times _N(\textbf{I}_{I_N}-\textbf{U}^{(N)}{} \textbf{U}^{(N)\top })\Vert _F^2))) \end{aligned} \end{aligned}$$
(14)

where \(\hat{{\varvec{\mathcal {X}}}}^{(n)}={\varvec{\mathcal {X}}}\times _1(\textbf{U}^{(1)}{} \textbf{U}^{(1)\top })\times _2(\textbf{U}^{(2)}\textbf{U}^{(2)\top })\cdots \times _n(\textbf{U}^{(n)}{} \textbf{U}^{(n)\top }), n=1,2,\ldots ,N-1\), denote the intermediate approximation tensors.

figure b

In Algorithm 2, the solution \(\textbf{U}^{(n)}\) of problem (14) can be obtained via \( \texttt{truncatedSVD}(\textbf{G}_{(n)},r_n)\), where \(\textbf{G}_{(n)}\) is the mode-n unfolding matrix of the \((n-1)\)-th intermediate core tensor \({\varvec{\mathcal {G}}}={\varvec{\mathcal {X}}}\times _{i=1}^{n-1}\textbf{U}^{(i)\top }\in \mathbb {R}^{r_1\times r_2\times ...\times r_{n-1}\times I_{n}\times ...\times I_{N} }\), i.e.,

$$\begin{aligned} \textbf{G}_{(n)}=\begin{bmatrix} \textbf{U}^{(n)}&\tilde{\textbf{U}^{(n)}} \end{bmatrix}\begin{bmatrix} \textbf{S}^{(n)}&{} \\ &{} \tilde{\textbf{S}^{(n)}} \end{bmatrix}\begin{bmatrix} \textbf{V}^{(n)\top }\\ \tilde{\textbf{V}^{(n)\top }} \end{bmatrix}\cong \textbf{U}^{(n)} \textbf{S}^{(n)} \textbf{V}^{(n)\top } \end{aligned}$$
(15)

where the orthogonal matrix \(\textbf{U}^{(n)}\) is the mode-n factor matrix, and \(\textbf{S}^{(n)}{} \textbf{V}^{(n)\top }\in \mathbb {R}^{r_n\times r_1...r_{n-1}I_{n+1}...I_N}\) is used to update the n-th intermediate core tensor \({\varvec{\mathcal {G}}}\). The function \(\mathtt{fold_{n}}(\textbf{S}^{(n)}{} \textbf{V}^{(n)\top })\) tensorizes the matrix \(\textbf{S}^{(n)}{} \textbf{V}^{(n)\top }\) into the tensor \({\varvec{\mathcal {G}}}\in \mathbb {R}^{r_1\times r_2\times ...\times r_{n}\times I_{n+1}\times ...\times I_{N} }\). When the target rank \(r_n\) is much smaller than \(I_n\), the updated intermediate core tensor \({\varvec{\mathcal {G}}}\) is much smaller than the original tensor, which significantly improves computational performance. The STHOSVD algorithm possesses the error bound stated in Theorem 2 below.
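Before stating the error bound, we give a compact NumPy sketch of the sequential truncation loop just described; it is our own illustration of the structure of Algorithm 2 (helper names are ours), not the authors' code.

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def fold(M, n, shape):
    full = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(np.reshape(M, full, order="F"), 0, n)

def sthosvd(X, ranks, order=None):
    """Sequentially truncated HOSVD: after each mode the core is truncated immediately,
    so later SVDs operate on smaller and smaller intermediate core tensors."""
    order = range(X.ndim) if order is None else order
    G = X
    U = [None] * X.ndim
    for n in order:
        Gn = unfold(G, n)                                   # mode-n unfolding of the current core
        Un, s, Vt = np.linalg.svd(Gn, full_matrices=False)
        r = ranks[n]
        U[n] = Un[:, :r]                                    # mode-n factor matrix
        shape = list(G.shape)
        shape[n] = r
        G = fold(np.diag(s[:r]) @ Vt[:r, :], n, tuple(shape))  # refold S^(n) V^(n)^T as the new core
    return G, U

X = np.random.randn(20, 25, 30)
G, U = sthosvd(X, (5, 6, 7))
print(G.shape)  # (5, 6, 7)
```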

Theorem 2

([11], Theorem 6.5) Let \(\hat{{\varvec{\mathcal {X}}}}={\varvec{\mathcal {G}}}\times _1\textbf{U}^{(1)}\times _2\textbf{U}^{(2)}\ldots \times _N\textbf{U}^{(N)}\) be the low multilinear rank-\((r_1,r_2,\ldots ,r_N)\) approximation of a tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\) by STHOSVD with processing order \(s_p:\{1,2,\ldots ,N\}\). Then

$$\begin{aligned} \begin{aligned} \Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}\Vert _{F}^{2}&=\sum _{n=1}^{N}\Vert \hat{{\varvec{\mathcal {X}}}}^{(n-1)}-\hat{{\varvec{\mathcal {X}}}}^{(n)}\Vert _{F}^{2}\le \sum _{n=1}^{N}\Vert {\varvec{\mathcal {X}}}\times _{n}(\textbf{I}_{I_n}-\textbf{U}^{(n)}{} \textbf{U}^{(n)\top })\Vert _{F}^{2}\\&=\sum _{n=1}^{N}\Delta _{n}^{2}({\varvec{\mathcal {X}}})\le N\Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}_\textrm{opt}\Vert _F^2 \end{aligned} \end{aligned}$$
(16)

Although STHOSVD has the same error bound as THOSVD, it is less computationally complex and requires less storage. As shown by the numerical experiments in Sect. 5, the running (CPU) time of the STHOSVD algorithm is significantly lower, and its approximation error is slightly better than that of THOSVD in some cases.

2.4 Randomized STHOSVD

When the dimensions of a data tensor are enormous, the computational cost of the classical deterministic truncated SVD for finding a low-rank approximation of each mode-n unfolding matrix can be expensive. Randomized low-rank matrix algorithms replace the original large-scale matrix with a smaller one through a preprocessing step. The new matrix contains as much information as possible about the rows or columns of the original data matrix, but its size is smaller, allowing the data to be processed efficiently and thus reducing the memory requirements for computing a low-rank approximation of a large matrix.

figure c

Halko et al. [21] proposed a randomized SVD (R-SVD) for matrices. The preprocessing stage of the algorithm right-multiplies the original data matrix \(\textbf{A}\in \mathbb {R}^{m\times n}\) by a random Gaussian matrix \({\varvec{\Omega }}\in \mathbb {R}^{n\times r}\). Each column of the resulting matrix \(\textbf{Y}=\textbf{A}{\varvec{\Omega }}\in \mathbb {R}^{m\times r}\) is a linear combination of the columns of the original data matrix, and when \(r<n\) the matrix \(\textbf{Y}\) is smaller than \(\textbf{A}\). An oversampling technique can further improve the accuracy of the solution. The subsequent computations are summarised in Algorithm 3, where randn generates a Gaussian random matrix, thinQR produces an economy-size QR decomposition, and thinSVD computes the thin SVD. When \(\textbf{A}\) is dense, the arithmetic cost of Algorithm 3 is \(\mathcal {O}(2(r+p)mn+r^2(m+n))\) flops, where \(p>0\) is the oversampling parameter satisfying \(r+p\le \min \{m,n\}\).
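A minimal NumPy sketch of the R-SVD scheme just described follows; the function name rsvd and its default oversampling value are our own illustrative choices.

```python
import numpy as np

def rsvd(A, r, p=5):
    """Randomized SVD in the spirit of Halko et al. [21]: sketch with a Gaussian test
    matrix, orthonormalize with a thin QR, then recover an SVD from the small projected matrix."""
    m, n = A.shape
    Omega = np.random.randn(n, r + p)          # Gaussian test matrix (randn)
    Y = A @ Omega                              # sketch of the column space
    Q, _ = np.linalg.qr(Y)                     # thinQR: orthonormal basis of the sketch
    B = Q.T @ A                                # small (r+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)   # thinSVD
    U = Q @ Ub
    return U[:, :r], s[:r], Vt[:r, :]          # rank-r factors

A = np.random.randn(1000, 400)
U, s, Vt = rsvd(A, r=20)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))
```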

Algorithm 3 is an efficient randomized algorithm for computing rank-r approximations to matrices. Minster et al. [14] applied Algorithm 3 directly to the STHOSVD algorithm and then presented a randomized version of STHOSVD (i.e., R-STHOSVD), see Algorithm 4.

figure d
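For illustration, the following self-contained NumPy sketch mirrors the R-STHOSVD idea, replacing the truncated SVD of each intermediate unfolding in the sequential loop with a randomized SVD; it is our own simplified rendering, and Algorithm 4 may differ in details.

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def fold(M, n, shape):
    full = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(np.reshape(M, full, order="F"), 0, n)

def r_sthosvd(X, ranks, p=5):
    """STHOSVD with each truncated SVD replaced by a randomized SVD
    (Gaussian sketch + thin QR), mirroring the R-STHOSVD idea."""
    G = X
    U = [None] * X.ndim
    for n, r in enumerate(ranks):
        Gn = unfold(G, n)
        Omega = np.random.randn(Gn.shape[1], r + p)
        Q, _ = np.linalg.qr(Gn @ Omega)                 # basis for the range of G_(n)
        Ub, s, Vt = np.linalg.svd(Q.T @ Gn, full_matrices=False)
        U[n] = Q @ Ub[:, :r]                            # approximate mode-n factor matrix
        shape = list(G.shape)
        shape[n] = r
        G = fold(np.diag(s[:r]) @ Vt[:r, :], n, tuple(shape))  # update the core
    return G, U

X = np.random.randn(60, 70, 80)
G, U = r_sthosvd(X, (10, 10, 10))
print(G.shape)  # (10, 10, 10)
```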

3 Sketching Algorithm for STHOSVD

A drawback of the R-SVD algorithm is that when both dimensions of the intermediate matrices are enormous, the computational cost can still be high. To resolve this problem, we resort to the two-sided sketching algorithm for low-rank matrix approximation proposed by Tropp et al. [22]. The preprocessing of the sketching algorithm uses two sketch matrices to capture information about the rows and columns of the input matrix \(\textbf{A}\in \mathbb {R}^{m\times n}\). Thus we choose two sketch size parameters k and l such that \(r\le k\le \min \{l, n\}\) and \(0< l\le m\). The random matrices \({\varvec{\Omega }}\in \mathbb {R}^{n\times k}\) and \({\varvec{\Psi }} \in \mathbb {R}^{l\times m}\) are fixed independent standard normal matrices. Multiplying \(\textbf{A}\) on the right by \({\varvec{\Omega }}\) and on the left by \({\varvec{\Psi }}\) yields the random sketch matrices \(\textbf{Y}=\textbf{A}{\varvec{\Omega }}\in \mathbb {R}^{m\times k}\) and \(\textbf{W}={\varvec{\Psi }}{} \textbf{A}\in \mathbb {R}^{l\times n}\), which collect sufficient information about the input matrix to compute a low-rank approximation. The dimensionality and distribution of the random sketch matrices determine the approximation's potential accuracy, with larger values of k and l yielding better approximations at the price of more storage and computational cost.

figure e

The sketching algorithm for low-rank approximation is given in Algorithm 5. The function \(\texttt{orth}(\textbf{A})\) in Step 2 produces an orthonormal basis of \(\textbf{A}\). Using orthonormalized matrices achieves smaller errors and better numerical stability than using the randomly generated Gaussian matrices directly. When \(\textbf{A}\) is dense, the arithmetic cost of Algorithm 5 is \(\mathcal {O}((k+l)mn+kl(m+n))\) flops. Algorithm 5 is simple and practical, and it possesses the sub-optimal error bound stated in Theorem 3 below.
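Before stating the error bound, we give a minimal NumPy sketch in the spirit of the two-sided sketching reconstruction of Tropp et al. [22] (orthonormalize the range sketch, then solve a small least-squares problem); the function name and the use of a generic least-squares solver are our own choices, and Algorithm 5 may differ in details.

```python
import numpy as np

def two_sided_sketch_lowrank(A, k, l):
    """Two-sided sketching low-rank approximation: form a range sketch Y = A @ Omega and a
    co-range sketch W = Psi @ A, orthonormalize Y, and recover the approximation A ~ Q @ X."""
    m, n = A.shape
    Omega = np.random.randn(n, k)          # standard normal test matrices
    Psi = np.random.randn(l, m)
    Y = A @ Omega                          # m x k range sketch
    W = Psi @ A                            # l x n co-range sketch
    Q, _ = np.linalg.qr(Y)                 # orth(Y): orthonormal basis, m x k
    X, *_ = np.linalg.lstsq(Psi @ Q, W, rcond=None)   # X = (Psi Q)^+ W, of size k x n
    return Q, X

A = np.random.randn(500, 400)
Q, X = two_sided_sketch_lowrank(A, k=25, l=50)
print(np.linalg.norm(A - Q @ X) / np.linalg.norm(A))
```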

Theorem 3

([22], Theorem 4.3) Assume that the sketch size parameters satisfy \(l>k+1\), and draw random test matrices \({\varvec{\Omega }}\in \mathbb {R}^{n\times k}\) and \({\varvec{\Psi }}{\in \mathbb {R}}^{l\times m}\) independently from the standard normal distribution. Then the rank-k approximation \(\hat{\textbf{A}}\) obtained from Algorithm 5 satisfies:

$$\begin{aligned} \begin{aligned} \mathbb {E}_{\varvec{\Omega }}\parallel \textbf{A}-\hat{\textbf{A}}\parallel _F^2&\le (1+f(k,l))\cdot \min _{\varrho<k-1}(1+f(\varrho ,k))\cdot \tau _{\varrho +1}^2(\textbf{A})\\&=\frac{k}{l-k-1}\cdot \min _{\varrho <k-1}\frac{k}{k-\varrho -1}\cdot \tau _{\varrho +1}^2(\textbf{A}) \end{aligned} \end{aligned}$$
(17)

In Theorem 3, the function \(f(s,t):=s/(t-s-1)\) for \(t>s+1>1\). The minimum in Theorem 3 reveals that the low-rank approximation of a given matrix \(\textbf{A}\) automatically exploits the decay of the tail energy.

Using the two-sided sketching algorithm to accelerate the STHOSVD algorithm, we propose a practical sketching algorithm for STHOSVD named Sketch-STHOSVD. We summarize the procedure of the Sketch-STHOSVD algorithm in Algorithm 6, with its error analysis stated in Theorem 4.
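The following self-contained NumPy sketch illustrates how the two-sided sketching step can be embedded in the sequential truncation loop; it is our own simplified illustration of the idea behind Algorithm 6 (helper names and the choice of sketch sizes are ours), not the algorithm's exact pseudocode.

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def fold(M, n, shape):
    full = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(np.reshape(M, full, order="F"), 0, n)

def sketch_sthosvd(X, ranks, sketch_sizes):
    """Illustrative Sketch-STHOSVD loop: each intermediate unfolding G_(n) is compressed
    with a two-sided sketch (ranks[n] columns, sketch_sizes[n] rows) instead of a truncated SVD."""
    G = X
    U = [None] * X.ndim
    for n, (r, l) in enumerate(zip(ranks, sketch_sizes)):
        Gn = unfold(G, n)                                   # I_n x (product of the remaining dims)
        m, s = Gn.shape
        Omega, Psi = np.random.randn(s, r), np.random.randn(l, m)
        Q, _ = np.linalg.qr(Gn @ Omega)                     # orthonormal basis of the range sketch
        Xn, *_ = np.linalg.lstsq(Psi @ Q, Psi @ Gn, rcond=None)  # (Psi Q)^+ (Psi G_(n))
        U[n] = Q                                            # mode-n factor (I_n x r, orthonormal)
        shape = list(G.shape)
        shape[n] = r
        G = fold(Xn, n, tuple(shape))                       # update the core with the small factor
    return G, U

X = np.random.randn(40, 50, 60)
G, U = sketch_sthosvd(X, ranks=(8, 8, 8), sketch_sizes=(10, 10, 10))
print(G.shape)  # (8, 8, 8)
```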

Theorem 4

Let \(\hat{{\varvec{\mathcal {X}}}}={\varvec{\mathcal {G}}}\times _1\textbf{U}^{(1)}\times _2\textbf{U}^{(2)}\ldots \times _N\textbf{U}^{(N)}\) be the low multilinear rank-\((r_1,r_2,\ldots ,r_N)\) approximation of a tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\) by the Sketch-STHOSVD algorithm (i.e., Algorithm 6) with processing order \(s_p:\{1,2,\ldots ,N\}\) and sketch size parameters \(\{l_1,l_2,\ldots ,l_N\}\). Then

$$\begin{aligned} \begin{aligned} \mathbb {E}_{\{{\varvec{\Omega }}_j\}_{j = 1}^N}\Vert {\varvec{\mathcal {X}}} - \hat{{\varvec{\mathcal {X}}}}\Vert _F^2&\le {\sum \limits _{n = 1}^N \frac{r_n}{l_n-r_n-1}\min _{\varrho _n<r_n-1}\frac{r_n}{r_n-\varrho _n-1}\Delta _n^2({\varvec{\mathcal {X}}})}\\&\le \sum \limits _{n = 1}^N \frac{r_n}{l_n-r_n-1}\min _{\varrho _n<r_n-1}\frac{r_n}{r_n-\varrho _n-1}\Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}_\textrm{opt}\Vert _F^2 \end{aligned} \end{aligned}$$
(18)

Proof

Combining Theorems 2 and 3, we have

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{\{{\varvec{\Omega }}_j\}_{j = 1}^N}\Vert {\varvec{\mathcal {X}}} - \hat{{\varvec{\mathcal {X}}}}\Vert _F^2 \\&\quad = \sum \limits _{n = 1}^N\mathbb {E}_{\{{\varvec{\Omega }} _j\}_{j = 1}^N} \Vert \hat{{\varvec{\mathcal {X}}}}^{(n - 1)} - \hat{{\varvec{\mathcal {X}}}}^{(n)}\Vert _F^2\\&\quad = \sum \limits _{n = 1}^N \mathbb {E}_{\{{\varvec{\Omega }} _j\}_{j = 1}^{n-1}}\left\{ \mathbb {E}_{\varvec{\Omega }_n}\Vert \hat{{\varvec{\mathcal {X}}}}^{(n - 1)} - \hat{{\varvec{\mathcal {X}}}}^{(n)}\Vert _F^2 \right\} \\&\quad = \sum \limits _{n = 1}^N \mathbb {E}_{\{{\varvec{\Omega }} _j\}_{j = 1}^{n-1}}\left\{ \mathbb {E}_{\varvec{\Omega }_n}\Vert {\varvec{\mathcal {G}}}^{(n - 1)} \times _{i = 1}^{n - 1} {\textbf{U}^{(i)}}{ \times _n}(\textbf{I}_{I_n} - {\textbf{U}^{(n)}}{} \textbf{U}^{(n)\top }) \Vert _F^2 \right\} \\&\quad \le \sum \limits _{n = 1}^N \mathbb {E}_{\{{\varvec{\Omega }} _j\}_{j = 1}^{n-1}}\left\{ \mathbb {E}_{\varvec{\Omega }_n}\Vert (\textbf{I}_{I_n} - {\textbf{U}^{(n)}}{} \textbf{U}^{(n)\top })\textbf{G}_{(n)}^{(n-1)} \Vert _F^2 \right\} \\&\quad \le \sum \limits _{n = 1}^N \mathbb {E}_{\{{\varvec{\Omega }} _j\}_{j = 1}^{n-1}}\frac{r_n}{l_n-r_n-1}\min _{\varrho _n<r_n-1}\frac{r_n}{r_n-\varrho _n-1}\sum \limits _{i = r_n+1}^{I_n}\sigma _{i}^{2}(\textbf{G}_{(n)}^{(n-1)})\\&\quad \le \sum \limits _{n = 1}^N \mathbb {E}_{\{{\varvec{\Omega }} _j\}_{j = 1}^{n-1}} \frac{r_n}{l_n-r_n-1}\min _{\varrho _n<r_n-1}\frac{r_n}{r_n-\varrho _n-1} \Delta _n^2({\varvec{\mathcal {X}}})\\&\quad = \sum \limits _{n = 1}^N \frac{r_n}{l_n-r_n-1}\min _{\varrho _n<r_n-1}\frac{r_n}{r_n-\varrho _n-1}\Delta _n^2({\varvec{\mathcal {X}}})\\&\quad \le {\sum \limits _{n = 1}^N \frac{r_n}{l_n-r_n-1}\min _{\varrho _n<r_n-1}\frac{r_n}{r_n-\varrho _n-1}} \Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}_\textrm{opt}\Vert _F^2. \end{aligned} \end{aligned}$$

\(\square \)

figure f

We assume the processing order for STHOSVD, R-STHOSVD, and Sketch-STHOSVD algorithms is \(s_p:\{1,2,\ldots ,N\}\). Table 2 summarises the arithmetic cost of different algorithms for the cases related to the general higher-order tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_{2}\times ...\times I_N}\) with target rank \((r_1,r_2,\ldots ,r_N)\) and the special cubic tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I\times I\times ...\times I}\) with target rank \((r,r,\ldots ,r)\). Here the tensors are dense and the target ranks \(r_j\ll I_j, j = 1, 2, \ldots , N\).

Table 2 Arithmetic cost for the algorithms THOSVD, STHOSVD, R-STHOSVD, and the proposed Sketch-STHOSVD

4 Sketching Algorithm with Subspace Power Iteration

When the size of the original matrix is very large or its singular spectrum decays slowly, Algorithm 5 may produce a poor basis in many applications. Inspired by [23], we suggest using the power iteration technique to enhance the sketching algorithm by replacing \(\textbf{A}\) with \((\textbf{A}{} \textbf{A}^\top )^{q}{} \textbf{A}\), where q is a positive integer. From the SVD of \(\textbf{A}\), i.e., \(\textbf{A}=\textbf{U}{} \textbf{S}{} \textbf{V}^\top \), we have \((\textbf{A}{} \textbf{A}^\top )^{q}{} \textbf{A}=\textbf{U}{} \textbf{S}^{2q+1}{} \textbf{V}^\top \). Hence \(\textbf{A}\) and \((\textbf{A}{} \textbf{A}^\top )^{q}{} \textbf{A}\) have the same left and right singular vectors, but the latter has a much faster decay of singular values, making its tail energy much smaller.

Although power iteration can improve the accuracy of Algorithm 5 to some extent, it still suffers from a problem: during the execution of the power iteration, rounding errors will eliminate all information about the singular modes associated with the smaller singular values. To address this issue, we propose an improved sketching algorithm that orthonormalizes the columns of the sample matrix between each application of \(\textbf{A}\) and \(\textbf{A}^\top \), see Algorithm 7. When \(\textbf{A}\) is dense, the arithmetic cost of Algorithm 7 is \(\mathcal {O}((q+1)(k+l)mn+kl(m+n))\) flops. Numerical experiments show that a good approximation can be achieved with a choice of 1 or 2 for the subspace power iteration parameter q [21].
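A minimal NumPy sketch of a two-sided sketching step with q subspace power iterations and intermediate re-orthonormalization follows; it is our own illustration of the idea behind Algorithm 7, and the exact steps of Algorithm 7 may differ.

```python
import numpy as np

def sketch_lowrank_power(A, k, l, q=1):
    """Two-sided sketching with q steps of subspace power iteration: the range basis is
    refreshed through alternating applications of A and A^T, with a thin QR in between
    to avoid losing the modes associated with the smaller singular values."""
    m, n = A.shape
    Omega, Psi = np.random.randn(n, k), np.random.randn(l, m)
    Q, _ = np.linalg.qr(A @ Omega)                 # initial range basis
    for _ in range(q):
        Qtilde, _ = np.linalg.qr(A.T @ Q)          # re-orthonormalize after applying A^T
        Q, _ = np.linalg.qr(A @ Qtilde)            # ... and again after applying A
    X, *_ = np.linalg.lstsq(Psi @ Q, Psi @ A, rcond=None)   # (Psi Q)^+ (Psi A)
    return Q, X                                    # A ~ Q @ X

A = np.random.randn(600, 500)
Q, X = sketch_lowrank_power(A, k=20, l=40, q=1)
print(np.linalg.norm(A - Q @ X) / np.linalg.norm(A))
```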

Using Algorithm 7 to compute the low-rank approximations of intermediate matrices, we can obtain an improved sketching algorithm for STHOSVD, called sub-Sketch-STHOSVD, see Algorithm 8.

figure g
figure h

The error bound for Algorithm 8 is stated in the following Theorem 5. Its proof is deferred to the Appendix.

Theorem 5

Let \(\hat{{\varvec{\mathcal {X}}}}={\varvec{\mathcal {G}}}\times _1\textbf{U}^{(1)}\times _2\textbf{U}^{(2)}\ldots \times _N\textbf{U}^{(N)}\) be the low multilinear rank-\((r_1,r_2,\ldots ,r_N)\) approximation of a tensor \({\varvec{\mathcal {X}}}\in \mathbb {R}^{I_1\times I_2\times \ldots \times I_N}\) by the sub-Sketch-STHOSVD algorithm (i.e., Algorithm 8) with processing order \(s_p:\{1,2,\ldots ,N\}\) and sketch size parameters \(\{l_1,l_2,\ldots ,l_N\}\). Then

$$\begin{aligned} \begin{aligned}&\mathbb {E}_{\{\varvec{\Omega }_j\}_{j = 1}^N}\Vert {\varvec{\mathcal {X}}} - \hat{{\varvec{\mathcal {X}}}}\Vert _F^2\\ {}&\le {\sum \limits _{n = 1}^N (1+f(r_n,l_n))\cdot \min _{\varrho _n<r_n-1}(1+f(\varrho _n,r_n){\varpi _r}^{4q})\cdot \tau _{\varrho _n+1}^2({\textbf{X}_{(n)}})}\\&\le \sum \limits _{n = 1}^N (1+f(r_n,l_n))\cdot \min _{\varrho _n<r_n-1}(1+f(\varrho _n,r_n){\varpi _r}^{4q})\Vert {\varvec{\mathcal {X}}}-\hat{{\varvec{\mathcal {X}}}}_\textrm{opt}\Vert _F^2 \end{aligned} \end{aligned}$$
(19)

Proof

See Appendix. \(\square \)

5 Numerical Experiments

This section conducts numerical experiments with synthetic data and real-world data, including comparisons between the traditional THOSVD algorithm, STHOSVD algorithm, the R-STHOSVD algorithm proposed in [14], and our proposed algorithms Sketch-STHOSVD and sub-Sketch-STHOSVD. Regarding the numerical settings, the oversampling parameter \(p=5\) is used in Algorithm 3, the sketch parameters \(l_n = r_n+2, n = 1, 2, \ldots , N\), are used in Algorithms 6 and 8, and the power iteration parameter \(q=1\) is used in Algorithm 8.

5.1 Hilbert Tensor

The Hilbert tensor is a synthetic, supersymmetric tensor with entries defined as

$$\begin{aligned} {\varvec{\mathcal {X}}}_{i_1 i_2 \ldots i_N}= \frac{1}{i_1 + i_2 + \cdots + i_N}, \quad 1 \le i_n \le I_n, \ n = 1,2,\ldots ,N \end{aligned}$$
(20)
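A short NumPy snippet for generating this Hilbert tensor follows; the helper name hilbert_tensor is ours.

```python
from functools import reduce
import numpy as np

def hilbert_tensor(dims):
    """Hilbert tensor with entries 1 / (i_1 + ... + i_N) for 1-based indices, as in (20)."""
    idx_sum = reduce(np.add.outer, (np.arange(1, d + 1) for d in dims))
    return 1.0 / idx_sum

X = hilbert_tensor((25, 25, 25, 25, 25))
print(X.shape, X[0, 0, 0, 0, 0])  # (25, 25, 25, 25, 25) 0.2, i.e. 1/5
```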

In the first experiment, we set \(N=5\) and \(I_n = 25, n=1,2,\ldots ,N\). The target rank is chosen as \((r,r,r,r,r)\), where \(r\in [1,25]\). Due to the supersymmetry of the Hilbert tensor, the processing order does not affect the final results, so it is directly chosen as \(s_p:\{1,2,3,4,5\}\). The results of the different algorithms are given in Fig. 1. They show that our proposed algorithms (i.e., Sketch-STHOSVD and sub-Sketch-STHOSVD) and R-STHOSVD outperform THOSVD and STHOSVD. In particular, the error of the proposed Sketch-STHOSVD and sub-Sketch-STHOSVD algorithms is comparable to that of R-STHOSVD (see the left plot in Fig. 1), while both use less CPU time than R-STHOSVD (see the right plot in Fig. 1).

This result demonstrates the excellent performance of the proposed algorithms and indicates that the two-sided sketching method and the subspace power iteration used in our algorithms can indeed improve the performance of STHOSVD algorithm.

Fig. 1
figure 1

Results comparison on the Hilbert tensor with a size of \(25\times 25\times 25\times 25\times 25\) in terms of numerical error (left) and CPU time (right)

Table 3 Results comparison in terms of the CPU time (in seconds) on the Hilbert tensor with a size of \(500\times 500\times 500\) as the target rank increases
Table 4 Results comparison in terms of the relative error on the Hilbert tensor with a size of \(500\times 500\times 500\) as the target rank increases

For a large-scale test, we use a Hilbert tensor with a size of \(500\times 500\times 500\) and conduct experiments using ten different approximate multilinear ranks. We perform the tests ten times and report the algorithms’ average running time and relative error in Tables 3 and 4, respectively. The results show that the randomized algorithms can achieve higher accuracy than the deterministic algorithms. The proposed Sketch-STHOSVD algorithm is the fastest, and the sub-Sketch-STHOSVD algorithm achieves the highest accuracy efficiently.

5.2 Sparse Tensor

In this experiment, we test the performance of different algorithms on a sparse tensor \( {\varvec{\mathcal {X}}} \in \mathbb {R}^{200 \times 200 \times 200}\), i.e.,

$$\begin{aligned} {\varvec{\mathcal {X}}} = \sum \limits _{i = 1}^{10} \frac{\gamma }{i^2}\, \textbf{x}_i \circ \textbf{y}_i \circ \textbf{z}_i + \sum \limits _{i = 11}^{200} \frac{1}{i^2}\, \textbf{x}_i \circ \textbf{y}_i \circ \textbf{z}_i \end{aligned}$$
(21)

where \(\textbf{x}_i,\textbf{y}_i,\textbf{z}_i \in \mathbb {R}^{200}\) are sparse vectors, all generated using the sprand command in MATLAB with \(5\%\) nonzero entries each, and \(\gamma \) is a user-defined parameter that determines the strength of the gap between the first ten terms and the remaining terms. The target rank is chosen as \((r,r,r)\), where \(r\in [20,100]\).
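The following NumPy sketch generates a test tensor of this form; it is our own approximation of the construction (dense storage, a simple stand-in for MATLAB's sprand, and the helper name sparse_test_tensor are our choices).

```python
import numpy as np

def sparse_test_tensor(n=200, gamma=10.0, density=0.05, rng=np.random.default_rng(0)):
    """Rank-one-sum test tensor in the spirit of (21): the first 10 terms are scaled by
    gamma / i^2, the remaining ones by 1 / i^2; x_i, y_i, z_i are sparse random vectors
    with roughly 5% nonzeros, mimicking MATLAB's sprand."""
    def sparse_vec():
        v = np.zeros(n)
        support = rng.random(n) < density
        v[support] = rng.random(support.sum())
        return v
    X = np.zeros((n, n, n))
    for i in range(1, n + 1):
        coeff = (gamma if i <= 10 else 1.0) / i**2
        x, y, z = sparse_vec(), sparse_vec(), sparse_vec()
        X += coeff * np.einsum("a,b,c->abc", x, y, z)   # outer product x o y o z
    return X

X = sparse_test_tensor(n=50, gamma=10.0)   # smaller n here only to keep the demo quick
print(X.shape)
```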

Fig. 2
figure 2

Results comparison on a sparse tensor with a size of \(200\times 200\times 200\) in terms of numerical error (first row) and CPU time (second row)

The experimental results are shown in Fig. 2, in which three different values \(\gamma = 2,10,200\) are tested. Increasing the gap reduces the tail energy and therefore improves the accuracy of the algorithms, which our numerical experiments also confirm. Figure 2 demonstrates the superiority of the proposed sketching algorithms. In particular, the proposed Sketch-STHOSVD is the fastest algorithm, with an error comparable to R-STHOSVD; the proposed sub-Sketch-STHOSVD reaches the same accuracy as the STHOSVD algorithm in much less CPU time; and sub-Sketch-STHOSVD achieves a much better low-rank approximation than R-STHOSVD with similar CPU time.

Fig. 3
figure 3

Results comparison on a \(200\times 200\times 200\) sparse tensor with noise in terms of numerical error (first row) and CPU time (second row)

Now we consider the influence of noise on the algorithms' performance. Specifically, the sparse tensor \({\varvec{\mathcal {X}}}\) with noise is designed in the same manner as in [24], i.e.,

$$\begin{aligned} \hat{{\varvec{\mathcal {X}}}}={\varvec{\mathcal {X}}}+\delta {\varvec{\mathcal {K}}} \end{aligned}$$
(22)

where \({\varvec{\mathcal {K}}}\) is a standard Gaussian tensor and \(\delta \) is used to control the noise level. We set \(\delta =10^{-3}\) and keep the remaining parameters the same as in the previous experiment. The relative error and running time of the different algorithms are shown in Fig. 3. We see that noise indeed affects the accuracy of the low-rank approximation, especially when the gap is small. However, the noise does not change the conclusions obtained in the noise-free case. The accuracy of our sub-Sketch-STHOSVD algorithm is the highest among the randomized algorithms. As \(\gamma \) increases, sub-Sketch-STHOSVD achieves almost the same accuracy as THOSVD and STHOSVD in a CPU time comparable to that of R-STHOSVD.

5.3 Real-World Data Tensor

In this experiment, we test the performance of the different algorithms on a colour image, called the HDU picture,Footnote 1 with a size of \(1200\times 1800\times 3\). We also evaluate the proposed sketching algorithms on the widely used YUV Video Sequences.Footnote 2 Taking the ‘hall monitor’ video as an example and using its first 30 frames, a third-order tensor with a size of \(144\times 176\times 30\) is formed for this test.

Firstly, we conduct an experiment on the HDU picture with target rank (500, 500, 3), and compare the PSNR and CPU time of different algorithms.
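For reference, the PSNR values reported below can be computed as in the following snippet; the peak value of 255 for 8-bit images is our assumption, since the paper does not state its exact convention.

```python
import numpy as np

def psnr(X, X_hat, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak is the maximum possible pixel value
    (assumed 255 for 8-bit images), and the MSE is taken over all tensor entries."""
    mse = np.mean((X.astype(float) - X_hat.astype(float)) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

img = np.random.randint(0, 256, size=(1200, 1800, 3))
approx = np.clip(img + np.random.randn(*img.shape) * 2.0, 0, 255)
print(psnr(img, approx))
```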

Fig. 4
figure 4

Results comparison on a HDU picture with a size of \(1200\times 1800\times 3\) in terms of PSNR (i.e., peak signal-to-noise ratio) and CPU time. The target rank is (500,500,3). The two values in e.g. (2.62; 40.61) represent the CPU time and the PSNR, respectively

The experimental result is shown in Fig. 4, which shows that the PSNR values of sub-Sketch-STHOSVD, THOSVD and STHOSVD are very similar (i.e., \(\sim \) 40 dB) and that sub-Sketch-STHOSVD is more efficient in terms of CPU time. R-STHOSVD and Sketch-STHOSVD are also very efficient compared to sub-Sketch-STHOSVD; however, the PSNR they achieve is about 5 dB lower than that of sub-Sketch-STHOSVD.

Fig. 5
figure 5

Results comparison on a HDU picture with a size of \(1200\times 1800\times 3\) in terms of numerical error (left), CPU time (middle) and PSNR (right). The target rank is \((r,r,3)\), \(r\in [50,1000]\)

Fig. 6
figure 6

Results comparison on the ‘hall monitor’ grey video with a size of \(144\times 176\times 30\) in terms of numerical error (left), CPU time (middle) and PSNR (right). The target rank is \((r,r,10)\), \(r\in [5,100]\)

Then we conduct separate numerical experiments on the HDU picture and the ‘hall monitor’ video clip as the target rank increases, and compare these algorithms in terms of the relative error, CPU time and PSNR, see Figs. 5 and 6. These experimental results again demonstrate the superiority (i.e., low error and good approximation with high efficiency) of the proposed sub-Sketch-STHOSVD algorithm in computing the Tucker decomposition approximation.

Fig. 7
figure 7

Results comparison on LONDON picture with a size of \(4775\times 7155\times 3\) in terms of CPU time and PSNR. The target rank is (50,50,3)

Fig. 8
figure 8

Results comparison on LONDON picture with a size of \(4775\times 7155\times 3\) and white Gaussian noise in terms of CPU time and PSNR. The target rank is (50,50,3)

Table 5 Results comparison in terms of the relative error on the LONDON picture with a size of \(4775\times 7155\times 3\) as the target rank increases
Table 6 Results comparison in terms of the CPU time on the LONDON picture with a size of \(4775\times 7155\times 3\) as the target rank increases
Table 7 Results comparison in terms of the PSNR on the LONDON picture with a size of \(4775\times 7155\times 3\) as the target rank increases

In the last experiment, a larger-scale real-world data tensor is used. We choose a colour image (called the LONDON picture) with a size of \(4775\times 7155\times 3\) as the test image and consider the influence of noise. The LONDON picture with white Gaussian noise is generated using the awgn(X,SNR) built-in function in MATLAB. We set the target rank to (50,50,3) and the SNR to 20. The comparison results without and with white Gaussian noise are shown in Figs. 7 and 8, respectively, in terms of the CPU time and PSNR.

Moreover, we also test the algorithms on the LONDON picture as the target rank increases. The results regarding the relative error, the CPU time and the PSNR are reported in Tables 5, 6 and 7, respectively. On the whole, the results again show the consistent performance of the proposed methods.

In summary, the numerical results show the superiority of the sub-Sketch-STHOSVD algorithm for large-scale tensors with or without noise. Sub-Sketch-STHOSVD achieves approximations close to those of the deterministic algorithms in a runtime similar to that of the other randomized algorithms.

6 Conclusion

In this paper we proposed two efficient sketching algorithms, Sketch-STHOSVD and sub-Sketch-STHOSVD, for computing the low-rank Tucker approximation of tensors by combining the two-sided sketching technique with the STHOSVD algorithm and by employing subspace power iteration. A detailed error analysis was also conducted. Numerical results on both synthetic and real-world data tensors demonstrate the competitive performance of the proposed algorithms in comparison with state-of-the-art algorithms.