1 Introduction

The singular value decomposition (SVD) is arguably the most powerful tool of numerical linear algebra. It is not surprising that, compared to the matrix SVD, the tensor generalization is significantly more complicated, see e.g. [5, 7, 8, 16]. We study the SVD-like tensor decomposition in the Tucker format,

$$\begin{aligned} {{\mathcal {A}}}={\mathcal {S}}\times _1U_1\times _2U_2\cdots \times _dU_d, \end{aligned}$$
(1.1)

where \({\mathcal {A}}\) and \(\mathcal {S}\) are tensors of order d and \(U_1,U_2,\ldots ,U_d\) are orthogonal matrices. Here, the tensor \(\mathcal {S}\) mimics the diagonal matrix of singular values from the matrix SVD. It is well known that, in the tensor case, one cannot expect to obtain a diagonal core tensor \(\mathcal {S}\). Hence, our goal will be to get a decomposition (1.1) where \(\mathcal {S}\) is “as diagonal as possible”. This SVD-like tensor decomposition problem is closely related to the tensor diagonalization problem. It has many applications in signal processing, blind source separation, and independent component analysis [3, 4, 6].

Problem (1.1) for tensors of order \(d=3\) has been studied by Moravitz Martin and Van Loan [14]. In their paper the authors use a Jacobi-type method to solve the maximization problem stated in (1.2) below. Their numerical results suggest convergence, although a convergence proof is not provided. If the tensor \(\mathcal {S}\) from (1.1) is a diagonal tensor, then \({\mathcal {A}}\) can be diagonalized using orthogonal transformations. Since a general tensor cannot be diagonalized, we aim to achieve an approximate diagonalization. A similar problem for symmetric tensors has been studied in a series of papers by Comon, Li and Usevich [12, 13, 15], where a Jacobi-type method is also the method of choice.

In this paper we develop a Jacobi-type algorithm based on the same idea as in [14], maximizing the sum of squares of the diagonal entries, but the algorithm itself differs from the one in [14]. Moreover, we prove the convergence of our algorithm. Our convergence results are analogous to those for the symmetric case from [12, 13, 15]. We are concerned with general tensors, that is, we do not assume any tensor structure, except in Section 5, where we discuss several special cases.

One can view problem (1.1) either as a minimization problem where the goal is to minimize the off-diagonal norm of \(\mathcal {S}\),

$$\begin{aligned} {\text {off}}^2(\mathcal {S})=\Vert \mathcal {S}\Vert _F^2-\Vert {\text {diag}}(\mathcal {S})\Vert _F^2\rightarrow \min , \end{aligned}$$

or as a maximization problem,

$$\begin{aligned} \Vert {\text {diag}}(\mathcal {S})\Vert _F^2\rightarrow \max , \end{aligned}$$
(1.2)

where the square of the Frobenius norm of diagonal entries of \(\mathcal {S}\) is maximized. We are going to work with the formulation (1.2). We mainly focus on tensors of order \(d=3\) and develop a block coordinate descent Jacobi-type algorithm for finding the decomposition

$$\begin{aligned} {\mathcal {A}}=\mathcal {S}\times _1U\times _2V\times _3W, \end{aligned}$$

such that

$$\begin{aligned} \Vert {\text {diag}}(\mathcal {S})\Vert _F^2=\sum _{i=1}^n\mathcal {S}_{iii}^2 \end{aligned}$$

is maximized. We prove that the algorithm converges to a stationary point of the objective function. As will be explained later in the paper, the algorithm can easily be generalized to tensors of order \(d>3\).

Our algorithm for approximate tensor diagonalization can also be used for low-rank tensor approximation. We can approximate \({\mathcal {A}}\) by a rank-r tensor \({\tilde{{\mathcal {A}}}}\) in the following way. Starting from the decomposition (1.1) we form a diagonal \(r\times r\times \cdots \times r\) order-d tensor \({\mathcal {D}}\) such that the diagonal elements of \({\mathcal {D}}\) are the r diagonal elements of \(\mathcal {S}\) with the largest absolute values. Moreover, for \(i=1,\ldots ,d\), we take \(U_{i,r}\) to be the matrix whose columns are the columns of \(U_i\) corresponding to the selected diagonal elements. Then, the low-rank approximation is obtained as

$$\begin{aligned} {\tilde{{\mathcal {A}}}}={\mathcal {D}}\times _1U_{1,r}\times _2U_{2,r}\cdots \times _dU_{d,r}. \end{aligned}$$
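
For the third-order case, this construction can be written down directly. The following NumPy sketch (the function name and the use of einsum are ours, not part of the paper) assumes that a decomposition of the form (1.1) has already been computed.

```python
import numpy as np

def rank_r_approximation(S, U1, U2, U3, r):
    """Rank-r approximation built from a decomposition A = S x_1 U1 x_2 U2 x_3 U3
    of a third-order tensor: keep the r diagonal entries of S of largest
    absolute value and the corresponding columns of U1, U2, U3."""
    d = np.einsum('iii->i', S)                   # diagonal of the core tensor
    idx = np.argsort(-np.abs(d))[:r]             # indices of the r largest |S_iii|
    U1r, U2r, U3r = U1[:, idx], U2[:, idx], U3[:, idx]
    D = np.zeros((r, r, r))
    D[np.arange(r), np.arange(r), np.arange(r)] = d[idx]
    # A_tilde = D x_1 U1r x_2 U2r x_3 U3r
    return np.einsum('abc,ia,jb,kc->ijk', D, U1r, U2r, U3r)
```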

In Section 2 we describe the problem and construct the algorithm for solving the maximization problem (1.2). We prove the previously mentioned convergence results in Section 3, while in Section 4 we provide several numerical examples. Moreover, in Section 5 we study the special cases of symmetric and antisymmetric tensors.

2 Orthogonal tensor decomposition

2.1 Preliminaries and notation

We use the tensor notation from [10], which is commonly used in papers dealing with numerical algorithms for tensors. The notation from [11] is also commonly used in multilinear algebra, but somewhat less frequently in its numerical aspects.

Tensors of order three or higher are denoted by calligraphic letters, e.g. \({\mathcal {X}}\). Tensor fibers are vectors obtained from a tensor by fixing all indices but one. For a third-order tensor, its fibers are columns, rows, and tubes. The mode-m matricization of a tensor \({\mathcal {X}}\in \mathbb {R}^{n_1\times n_2\times \cdots \times n_d}\) is an \(n_m\times (n_1\cdots n_{m-1}n_{m+1}\cdots n_d)\) matrix \(X_{(m)}\) obtained by arranging the mode-m fibers of \({\mathcal {X}}\) as the columns of \(X_{(m)}\). In this paper we mainly work with third-order tensors, so we will have \(m=1,2,3\).

The mode-m product of a tensor \({\mathcal {X}}\in \mathbb {R}^{n_1\times n_2\times \cdots \times n_d}\) with a matrix \(A\in \mathbb {R}^{p\times n_m}\) is a tensor \(\mathcal {Y}\in \mathbb {R}^{n_1\times \cdots \times n_{m-1}\times p\times n_{m+1}\times \cdots \times n_d}\),

$$\begin{aligned} \mathcal {Y}={\mathcal {X}}\times _m A, \quad {\text {such that}} \quad Y_{(m)}=A X_{(m)}. \end{aligned}$$

Two important properties of the mode-m product are

$$\begin{aligned} {\mathcal {X}}\times _mA\times _nB&={\mathcal {X}}\times _nB\times _mA, \quad m\ne n, \end{aligned}$$
(2.1)
$$\begin{aligned} {\mathcal {X}}\times _nA\times _nB&={\mathcal {X}}\times _n(BA). \end{aligned}$$
(2.2)
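
For illustration, the matricization and the mode-m product can be realized in a few lines of NumPy. The sketch below is ours and not part of the paper; the helpers index modes from zero, and the column ordering of the unfolding may differ from the convention in [10], which does not affect the mode product. The sketch also checks properties (2.1) and (2.2) on a random example.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` matricization: arrange the mode-`mode` fibers of X as columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold` for a tensor of the given shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def mode_product(X, A, mode):
    """Mode-`mode` product Y = X x_mode A, defined by Y_(mode) = A X_(mode)."""
    shape = list(X.shape)
    shape[mode] = A.shape[0]
    return fold(A @ unfold(X, mode), mode, shape)

# check (2.1) and (2.2) on a random example (modes numbered from zero here)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
A, B, C = (rng.standard_normal((4, 4)), rng.standard_normal((5, 5)),
           rng.standard_normal((4, 4)))
assert np.allclose(mode_product(mode_product(X, A, 0), B, 1),
                   mode_product(mode_product(X, B, 1), A, 0))      # (2.1)
assert np.allclose(mode_product(mode_product(X, A, 0), C, 0),
                   mode_product(X, C @ A, 0))                      # (2.2)
```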

The norm of \({\mathcal {X}}\) is a generalization of the matrix Frobenius norm. It is given by

$$\begin{aligned} \Vert {\mathcal {X}}\Vert _F=\sqrt{\sum _{i_1=1}^{n_1}\sum _{i_2=1}^{n_2}\cdots \sum _{i_d=1}^{n_d} x_{i_1i_2\ldots i_d}^2}. \end{aligned}$$

To lighten the notation throughout the paper we are going to write this norm simply as \(\Vert {\mathcal {X}}\Vert\). The inner product of two tensors \({\mathcal {X}},\mathcal {Y}\in \mathbb {R}^{n_1\times n_2\times \cdots \times n_d}\) is given by

$$\begin{aligned} \langle {\mathcal {X}},\mathcal {Y}\rangle =\sum _{i_1=1}^{n_1}\sum _{i_2=1}^{n_2}\cdots \sum _{i_d=1}^{n_d} x_{i_1i_2\ldots i_d}y_{i_1i_2\ldots i_d}. \end{aligned}$$

It is straightforward to check that \(\langle {\mathcal {X}},{\mathcal {X}}\rangle =\Vert {\mathcal {X}}\Vert ^2\).

The Tucker decomposition is a decomposition of a tensor \({\mathcal {X}}\) into a core tensor \(\mathcal {S}\) multiplied by a matrix in each mode,

$$\begin{aligned} {\mathcal {X}}={\mathcal {S}}\times _1M_1\times _2M_2\times _3\cdots \times _dM_d. \end{aligned}$$
(2.3)

A tensor \({\mathcal {X}}\in {\mathbb {R}}^{n\times n\times n}\) is diagonal if \({\mathcal {X}}_{ijk}\ne 0\) only when \(i=j=k\), that is, if \({\text {off}}({\mathcal {X}})=0\).

2.2 Problem description

Let \({\mathcal {A}}\in {\mathbb {R}}^{n\times n\times n}\). We are looking for an orthogonal Tucker decomposition

$$\begin{aligned} {\mathcal {A}}=\mathcal {S}\times _1U\times _2V\times _3W, \end{aligned}$$
(2.4)

where \(U,V,W\in {\mathbb {R}}^{n\times n}\) are orthogonal matrices and \(\mathcal {S}\in {\mathbb {R}}^{n\times n\times n}\) is a core tensor such that

$$\begin{aligned} \Vert {\text {diag}}(\mathcal {S})\Vert ^2=\sum _{i=1}^n\mathcal {S}_{iii}^2\rightarrow \max . \end{aligned}$$
(2.5)

From relation (2.4) tensor \(\mathcal {S}\) can be expressed as

$$\begin{aligned} \mathcal {S}={\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T. \end{aligned}$$

Hence, in order to solve the problem defined by (2.4) and (2.5) for a given tensor \({\mathcal {A}}\), we need to find orthogonal matrices \(U,V,W\) that maximize the objective function

$$\begin{aligned} f(U,V,W)=\Vert {\text {diag}}({\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T)\Vert ^2\rightarrow \max . \end{aligned}$$
(2.6)

We do this using a Jacobi-type method with a block coordinate descent approach.
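
For illustration, the objective function (2.6) can be evaluated directly from its definition. The following sketch (the function name and the use of einsum are ours) is not optimized, but it makes the quantity being maximized explicit.

```python
import numpy as np

def objective(A, U, V, W):
    """f(U, V, W) = ||diag(A x_1 U^T x_2 V^T x_3 W^T)||^2 from (2.6)."""
    S = np.einsum('ijk,ia,jb,kc->abc', A, U, V, W)   # S = A x_1 U^T x_2 V^T x_3 W^T
    d = np.einsum('iii->i', S)                       # diagonal entries S_lll
    return float(np.sum(d ** 2))
```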

For the sake of simplicity, our analysis is restricted to equal-sized modes. However, with a few technical adjustments, the same algorithm can be constructed for \({\mathcal {A}}\in {\mathbb {R}}^{n_1\times n_2\times n_3}\). Then, in (2.4) we have \(U\in {\mathbb {R}}^{n_1\times n_1}\), \(V\in {\mathbb {R}}^{n_2\times n_2}\), \(W\in {\mathbb {R}}^{n_3\times n_3}\), and \(\mathcal {S}\in {\mathbb {R}}^{n_1\times n_2\times n_3}\).

2.3 Jacobi-type algorithm

We now describe the Jacobi-type algorithm for solving the maximization problem defined by (2.6). This is an iterative algorithm. Its kth iteration has the form

$$\begin{aligned} {\mathcal {A}}^{(k+1)}={\mathcal {A}}^{(k)}\times _1R_{U,k}^T\times _2R_{V,k}^T\times _3R_{W,k}^T, \quad k\ge 0, \quad {\mathcal {A}}^{(0)}={\mathcal {A}}, \end{aligned}$$
(2.7)

where \(R_{U,k},R_{V,k},R_{W,k}\) are plane rotations with the following structure,

$$\begin{aligned} R(i,j,\phi )=\left[ \begin{array}{ccccc} I &{} &{} &{} &{} \\ &{} \cos \phi &{} &{} -\sin \phi &{} \\ &{} &{} I &{} &{} \\ &{} \sin \phi &{} &{} \cos \phi &{} \\ &{} &{} &{} &{} I \\ \end{array} \right] \begin{array}{l} \\ i \\ \\ j \\ \\ \end{array}. \end{aligned}$$
(2.8)

The index pair \((i,j)\) in the rotation matrix (2.8) is called a pivot position. The set of all possible pivot positions is \(\{(i,j) \ : \ 1\le i<j\le n\}\). In the kth step, the matrices \(R_{U,k},R_{V,k},R_{W,k}\) have the same pivot position \((i_k,j_k)\), while the rotation angle \(\phi _k\) is, in general, different for each matrix.

Our algorithm uses a block coordinate descent approach. This means that each iteration consists of three microiterations where we hold two variables constant and vary the third one. We have

$$\begin{aligned} \mathcal {B}^{(k)}&={\mathcal {A}}^{(k)}\times _1R_{U,k}^T, \end{aligned}$$
(2.9)
$$\begin{aligned} \mathcal {C}^{(k)}&=\mathcal {B}^{(k)}\times _2R_{V,k}^T, \end{aligned}$$
(2.10)
$$\begin{aligned} {\mathcal {A}}^{(k+1)}&=\mathcal {C}^{(k)}\times _3R_{W,k}^T. \end{aligned}$$
(2.11)

Here, \(\mathcal {B}^{(k)}\) and \(\mathcal {C}^{(k)}\) denote the intermediate results. Of course, if we combine all three microiterations, using the properties (2.1) and (2.2) of the mode-m product, we get back to the iteration step (2.7).

Let us see how the rotation angles in the matrices \(R_{U,k},R_{V,k},R_{W,k}\) are computed. For a fixed iteration step k we consider a \(2\times 2\times 2\) subproblem. Assume that \((i_k,j_k)=(p,q)\), \(1\le p<q\le n\). The subtensor of \({\mathcal {A}}\) corresponding to the index pair \((p,q)\) is denoted by \({\hat{{\mathcal {A}}}}\) and we can write it as

$$\begin{aligned} {\hat{{\mathcal {A}}}}(:,:,1)=\left[ \begin{array}{cc} a_{ppp} &{} a_{pqp} \\ a_{qpp} &{} a_{qqp} \\ \end{array} \right] , \quad {\hat{{\mathcal {A}}}}(:,:,2)=\left[ \begin{array}{cc} a_{ppq} &{} a_{pqq} \\ a_{qpq} &{} a_{qqq} \\ \end{array} \right] . \end{aligned}$$

Then, the corresponding \(2\times 2\times 2\) subproblem is to find \(2\times 2\) rotations \({\hat{R}}_U,{\hat{R}}_V,{\hat{R}}_W\) such that

$$\begin{aligned} \Vert {\text {diag}}({\hat{\mathcal {S}}})\Vert ^2=\sigma _{ppp}^2+\sigma _{qqq}^2\rightarrow \max , \end{aligned}$$

where

$$\begin{aligned} {\hat{\mathcal {S}}}={\hat{{\mathcal {A}}}}\times _1{\hat{R}}_U^T\times _2{\hat{R}}_V^T\times _3{\hat{R}}_W^T, \end{aligned}$$

and

$$\begin{aligned} {\hat{\mathcal {S}}}(:,:,1)=\left[ \begin{array}{cc} \sigma _{ppp} &{} \sigma _{pqp} \\ \sigma _{qpp} &{} \sigma _{qqp} \\ \end{array} \right] , \quad {\hat{\mathcal {S}}}(:,:,2)=\left[ \begin{array}{cc} \sigma _{ppq} &{} \sigma _{pqq} \\ \sigma _{qpq} &{} \sigma _{qqq} \\ \end{array} \right] . \end{aligned}$$

Considering only microiteration (2.9), we calculate the rotation angle for the matrix \({\hat{R}}_U\). We have

$$\begin{aligned} \left[ \begin{array}{cccc} b_{ppp} &{} b_{pqp} &{} b_{ppq} &{} b_{pqq} \\ b_{qpp} &{} b_{qqp} &{} b_{qpq} &{} b_{qqq} \\ \end{array} \right] =\left[ \begin{array}{cc} \cos \phi &{} \sin \phi \\ -\sin \phi &{} \cos \phi \\ \end{array} \right] \left[ \begin{array}{cccc} a_{ppp} &{} a_{pqp} &{} a_{ppq} &{} a_{pqq} \\ a_{qpp} &{} a_{qqp} &{} a_{qpq} &{} a_{qqq} \\ \end{array} \right] . \end{aligned}$$

The rotation angle \(\phi\) is chosen to maximize the function

$$\begin{aligned} g_1(\phi )&= b_{ppp}^2+b_{qqq}^2 \nonumber \\&=(a_{ppp}\cos \phi +a_{qpp}\sin \phi )^2+(-a_{pqq}\sin \phi +a_{qqq}\cos \phi )^2. \end{aligned}$$
(2.12)

Such \(\phi\) must satisfy the relation \(g_1'(\phi )=0\). Taking the derivative of \(g_1\) we get

$$\begin{aligned} g_1'(\phi )&=2(\cos ^2\phi -\sin ^2\phi )(a_{ppp}a_{qpp}-a_{pqq}a_{qqq}) \\&\quad +2\cos \phi \sin \phi (a_{pqq}^2+a_{qpp}^2-a_{ppp}^2-a_{qqq}^2) \\&=2\cos (2\phi )(a_{ppp}a_{qpp}-a_{pqq}a_{qqq}) \\&\quad +\sin (2\phi )(a_{pqq}^2+a_{qpp}^2-a_{ppp}^2-a_{qqq}^2) \\&=0. \end{aligned}$$

Dividing this relation by \(\cos (2\phi )\) and rearranging, we obtain

$$\begin{aligned} \tan (2\phi )=\frac{2(a_{ppp}a_{qpp}-a_{pqq}a_{qqq})}{a_{ppp}^2+a_{qqq}^2-a_{pqq}^2-a_{qpp}^2}. \end{aligned}$$
(2.13)

Similarly, we find the rotation angles for matrices \({\hat{R}}_V\) and \({\hat{R}}_W\) as

$$\begin{aligned} \tan (2\phi )=\frac{2(b_{ppp}b_{pqp}-b_{qpq}b_{qqq})}{b_{ppp}^2+b_{qqq}^2-b_{qpq}^2-b_{pqp}^2} \end{aligned}$$
(2.14)

and

$$\begin{aligned} \tan (2\phi )=\frac{2(c_{ppp}c_{ppq}-c_{qqp}c_{qqq})}{c_{ppp}^2+c_{qqq}^2-c_{ppq}^2-c_{qqp}^2}, \end{aligned}$$
(2.15)

respectively.

In relations (2.13)–(2.15) it is possible that both the numerator and the denominator are equal to zero. If that happens for one of these relations, we skip the rotation in the corresponding mode and move on to the next one. If this is the case for all pivot pairs, the algorithm terminates and should be restarted with preconditioning. This will be explained in Section 5 for the case of antisymmetric tensors.

The rotation angles in \({\hat{R}}_U,{\hat{R}}_V,{\hat{R}}_W\) do not need to be calculated explicitly; we only need the sine and the cosine of the corresponding angles. However, once we have the formulas for computing \(\tan (2\phi )\), there is still the problem of computing \(\sin \phi\) and \(\cos \phi\) efficiently and accurately. We show how this is done for the rotation in the first mode; the procedure is the same in the other modes. We go back to relation (2.13). Denote

$$\begin{aligned} \lambda&= 2(a_{ppp}a_{qpp}-a_{pqq}a_{qqq}){\text {sign}}(a_{ppp}^2+a_{qqq}^2-a_{pqq}^2-a_{qpp}^2), \\ \mu & = \vert a_{ppp}^2+a_{qqq}^2-a_{pqq}^2-a_{qpp}^2\vert . \end{aligned}$$

Moreover, we denote \(t=\tan \phi\). Using the double-angle formula for tangent,

$$\begin{aligned} \tan (2\phi )=\frac{2t}{1-t^2}, \end{aligned}$$

relation (2.13) reads

$$\begin{aligned} \frac{2t}{1-t^2}=\frac{\lambda }{\mu }. \end{aligned}$$

This is a quadratic equation in t, \(\lambda t^2+2\mu t-\lambda =0,\) with solutions \(t_1=\frac{-\mu +\sqrt{\mu ^2+\lambda ^2}}{\lambda }\) and \(t_2=\frac{-\mu -\sqrt{\mu ^2+\lambda ^2}}{\lambda }.\) Note that the expression for \(t_1\) is numerically unstable because catastrophic cancellation may occur. Therefore, we multiply both the numerator and the denominator by \(\mu +\sqrt{\mu ^2+\lambda ^2}\). That way we obtain the numerically stable expression \(t_1=\frac{\lambda }{\mu +\sqrt{\mu ^2+\lambda ^2}}.\) Finally,

$$\begin{aligned} \cos \phi _i=\frac{1}{\sqrt{1+t_i^2}}, \quad \sin \phi _i=\frac{t_i}{\sqrt{1+t_i^2}}=t_i\cos \phi _i, \quad i=1,2. \end{aligned}$$

We calculate both solutions and use the one that gives the larger value of the function (2.12).
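
A small sketch of this computation for the first mode is given below (the function name is ours, not from the paper); the degenerate cases are handled as discussed above, and the rotation angles for the other two modes are obtained analogously from (2.14) and (2.15).

```python
import numpy as np

def mode1_angle(ahat):
    """Cosine and sine of the mode-1 rotation angle for the 2x2x2 subtensor
    `ahat` (index p mapped to 0, q mapped to 1), following (2.13) and the
    stabilized tangent computation described above."""
    num = 2.0 * (ahat[0, 0, 0] * ahat[1, 0, 0] - ahat[0, 1, 1] * ahat[1, 1, 1])
    den = ahat[0, 0, 0]**2 + ahat[1, 1, 1]**2 - ahat[0, 1, 1]**2 - ahat[1, 0, 0]**2
    lam = num * np.sign(den) if den != 0.0 else num
    mu = abs(den)
    if lam == 0.0 and mu == 0.0:
        return 1.0, 0.0                      # 0/0 case: skip this rotation

    def g1(c, s):                            # the function (2.12)
        return ((c * ahat[0, 0, 0] + s * ahat[1, 0, 0]) ** 2
                + (-s * ahat[0, 1, 1] + c * ahat[1, 1, 1]) ** 2)

    root = np.hypot(mu, lam)                 # sqrt(mu^2 + lambda^2)
    t1 = lam / (mu + root)                   # cancellation-free formula for t_1
    candidates = [(1.0 / np.sqrt(1.0 + t1**2), t1 / np.sqrt(1.0 + t1**2))]
    if lam != 0.0:
        t2 = (-mu - root) / lam
        candidates.append((1.0 / np.sqrt(1.0 + t2**2), t2 / np.sqrt(1.0 + t2**2)))
    else:
        candidates.append((0.0, 1.0))        # t_2 is infinite, i.e. phi = pi/2
    return max(candidates, key=lambda cs: g1(*cs))
```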

The order in which we choose the pivot pairs is called the pivot strategy. In our algorithm the pivot strategy is assumed to be cyclic. We choose an ordering of the pairs \((i,j)\), \(1\le i<j\le n\), which makes one cycle, and then repeat the same cycle of pivot pairs until the convergence criterion is satisfied. Common examples of cyclic pivot strategies are the row-wise and the column-wise strategy, with the corresponding orderings of pivot pairs defined by

$$\begin{aligned} {\mathcal {O}}_r=(1,2),(1,3),\ldots ,(1,n),(2,3),\ldots ,(2,n),\ldots ,(n-1,n) \end{aligned}$$
(2.16)

and

$$\begin{aligned} {\mathcal {O}}_c=(1,2),(1,3),(2,3),\ldots ,(1,n),(2,n),\ldots ,(n-1,n), \end{aligned}$$
(2.17)

respectively. The convergence results from Section 3 hold for any cyclic strategy. Nevertheless, to ensure convergence of the algorithm, the pivot pairs should satisfy an additional condition. We only take a pivot pair \((i,j)\) such that (at least) one of the following inequalities holds,

$$\begin{aligned} \vert \langle \nabla _Q f,Q{\dot{R}}(i,j,0)\rangle \vert \ge \eta \Vert \nabla _Q f\Vert _2, \quad {\text {for }} Q=U,V,W, \end{aligned}$$
(2.18)

where \(0<\eta \le \frac{2}{n}\), \({\dot{R}}(i,j,0)\) denotes \(\frac{\partial }{\partial \phi }R(i,j,\phi ){\vert }_{\phi =0}\), f is defined in (2.6), and the projected gradient \(\nabla _Q f\) will be defined in Subsection 3.1. If \((i,j)\) does not satisfy any of the conditions (2.18), it is skipped and we move to the next pair in the cycle. It will be shown in Lemma 3.2 that for each inequality in (2.18) it is always possible to find an appropriate pivot pair.

In the kth step of the algorithm, given \({\mathcal {A}}^{(k)}\) and \(U_k,V_k,W_k\), we first compute the sine and the cosine of the rotation angle in the rotation matrix \(R_{U,k}\). Then we form the auxiliary tensor \(\mathcal {B}^{(k)}\),

$$\begin{aligned} \mathcal {B}^{(k)} = {\mathcal {A}}^{(k)}\times _1R_{U,k}^T = {\mathcal {A}}\times _1(R_{U,k}^TU_k^T)\times _2V_k^T\times _3W_k^T, \end{aligned}$$

and

$$\begin{aligned} U_{k+1}=U_kR_{U,k}. \end{aligned}$$

Then we repeat this procedure in the other modes. This is summarized in Algorithm 2.1.

Algorithm 2.1 (Jacobi-type algorithm for the maximization problem (2.6))
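
The structure of one cycle of the method can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: the rotation angle is obtained here via arctan2 (which maximizes (2.12) directly), whereas the algorithm uses the stabilized tangent formulas from Section 2.3, and both the pivot test (2.18) and the stopping criterion are omitted.

```python
import numpy as np

def rot(n, i, j, phi):
    """Plane rotation R(i, j, phi) from (2.8)."""
    R = np.eye(n)
    c, s = np.cos(phi), np.sin(phi)
    R[i, i] = R[j, j] = c
    R[i, j], R[j, i] = -s, s
    return R

def best_angle(T, i, j):
    """Angle maximizing T_iii^2 + T_jjj^2 after the mode-1 rotation T x_1 R(i,j,phi)^T."""
    num = 2.0 * (T[i, i, i] * T[j, i, i] - T[i, j, j] * T[j, j, j])
    den = T[i, i, i]**2 + T[j, j, j]**2 - T[i, j, j]**2 - T[j, i, i]**2
    if num == 0.0 and den == 0.0:
        return None                            # 0/0 case: skip this rotation
    return 0.5 * np.arctan2(num, den)

def mode_prod(T, M, mode):
    """Mode product T x_mode M for a third-order tensor (modes numbered from zero)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def jacobi_cycle(A, U, V, W):
    """One sweep of the microiterations (2.9)-(2.11) over the row-cyclic ordering (2.16)."""
    n = A.shape[0]
    factors = [U, V, W]
    for i in range(n - 1):
        for j in range(i + 1, n):
            for mode in range(3):              # three microiterations per pivot pair
                T = np.moveaxis(A, mode, 0)    # reduces (2.14), (2.15) to the form (2.13)
                phi = best_angle(T, i, j)
                if phi is None:
                    continue
                R = rot(n, i, j, phi)
                A = mode_prod(A, R.T, mode)    # e.g. B^(k) = A^(k) x_1 R_{U,k}^T
                factors[mode] = factors[mode] @ R   # U_{k+1} = U_k R_{U,k}, etc.
    return A, factors[0], factors[1], factors[2]
```

Repeated sweeps of jacobi_cycle, combined with a stopping test, realize the overall iteration (2.7).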

We have several remarks regarding Algorithm 2.1.

  • Algorithm 2.1 employs the identity initialization \(U_0=V_0=W_0=I\). This is not the only possible choice; other initializations will be discussed with the numerical examples in Section 4, as well as in relation to antisymmetric tensors in Section 5.

  • It is not necessary to explicitly form the rotation matrices and the tensor matricizations in order to perform the mode-m multiplications in the algorithm.

  • The conditions (2.18) on the pivot pairs can be simplified to lower the computational effort. This will be shown after Lemma 3.2. Moreover, the coefficient \(0<\eta \le \frac{2}{n}\) can vary, which will be examined in Section 4.

This algorithm can be generalized to tensors of order \(d>3\). In that case we need to find orthogonal matrices \(U_1,U_2,\ldots ,U_d\) such that the maximization condition (1.2) holds. One iteration of the algorithm consists of d microiterations that are analogues of those in (2.9), (2.10), and (2.11).

3 Convergence results

3.1 Gradient of the objective function

Before we move on to the convergence of Algorithm 2.1, let us say something about the gradient of the objective function \(f:O_n\times O_n\times O_n\rightarrow {\mathbb {R}}\),

$$\begin{aligned} f(U,V,W)=\Vert {\text {diag}}({\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T)\Vert ^2, \end{aligned}$$
(3.1)

where \(O_n\) stands for the group of orthogonal matrices of order n. To calculate \(\nabla f\) we need an auxiliary function \({\tilde{f}}:{\mathbb {R}}^{n\times n}\times {\mathbb {R}}^{n\times n}\times {\mathbb {R}}^{n\times n}\rightarrow {\mathbb {R}}\) defined by the same formula (3.1) as f. In other words, \({\tilde{f}}\) is such that f is the restriction of \({\tilde{f}}\) to the set of triples of orthogonal matrices. Then \(\nabla f\) is the projection of \(\nabla {\tilde{f}}\) onto the tangent space to the manifold \(O_n\times O_n\times O_n\) at \((U,V,W)\). We have

$$\begin{aligned} \nabla f(U,V,W)&=\left[ \begin{array}{ccc} \nabla _U f(U,V,W) &{} \nabla _V f(U,V,W) &{} \nabla _W f(U,V,W) \\ \end{array} \right] \nonumber \\&={\text {Proj}}\left[ \begin{array}{ccc} \nabla _U {\tilde{f}}(U,V,W) &{} \nabla _V {\tilde{f}}(U,V,W) &{} \nabla _W {\tilde{f}}(U,V,W) \\ \end{array} \right] \nonumber \\&=\left[ \begin{array}{ccc} U\Lambda (U) &{} V\Lambda (V) &{} W\Lambda (W) \\ \end{array}\right] , \end{aligned}$$
(3.2)

where

$$\begin{aligned} \Lambda (Q):=\frac{Q^T\nabla _Q {\tilde{f}}-(\nabla _Q {\tilde{f}})^TQ}{2}. \end{aligned}$$
(3.3)

To calculate \(\nabla f(U,V,W)\) we write \({\tilde{f}}\) as

$$\begin{aligned} {\tilde{f}}(U,V,W)=\Vert {\text {diag}}({\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T)\Vert ^2=\sum _{l=1}^n\left( \sum _{i,j,k=1}^n a_{ijk}u_{il}v_{jl}w_{kl}\right) ^2. \end{aligned}$$

Element-wise, we get

$$\begin{aligned} \frac{\partial {\tilde{f}}}{\partial u_{ml}}&=2\left( \sum _{i,j,k=1}^n a_{ijk}u_{il}v_{jl}w_{kl}\right) \left( \sum _{j,k=1}^n a_{mjk}v_{jl}w_{kl}\right) \\&=2\left( {\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T\right) _{lll}\left( {\mathcal {A}}\times _2V^T\times _3W^T\right) _{mll}, \\ \frac{\partial {\tilde{f}}}{\partial v_{ml}}&=2\left( {\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T\right) _{lll}\left( {\mathcal {A}}\times _1U^T\times _3W^T\right) _{lml}, \\ \frac{\partial {\tilde{f}}}{\partial w_{ml}}&=2\left( {\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T\right) _{lll}\left( {\mathcal {A}}\times _1U^T\times _2V^T\right) _{llm}. \end{aligned}$$

Then, we can use the above relations together with (3.3) to get an explicit expression for \(\Lambda (U)\),

$$\begin{aligned} (\Lambda (U))_{lp}&= \frac{1}{2}\left( \sum _{m=1}^n u_{ml}\frac{\partial {\tilde{f}}}{\partial u_{mp}}-\sum _{m=1}^n\frac{\partial {\tilde{f}}}{\partial u_{ml}}u_{mp}\right) \\&= \left( {\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T\right) _{ppp}\left( {\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T\right) _{lpp} \\&\qquad -\left( {\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T\right) _{lll}\left( {\mathcal {A}}\times _1U^T\times _2V^T\times _3W^T\right) _{pll}. \end{aligned}$$

Similarly we get the expressions for \(\Lambda (V)\) and \(\Lambda (W)\).
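
The expression above translates directly into code. The sketch below (names are ours) evaluates \(\Lambda (U)\) for given matrices; \(\Lambda (V)\) and \(\Lambda (W)\) can be obtained in the same way.

```python
import numpy as np

def Lambda_U(A, U, V, W):
    """Skew-symmetric matrix Lambda(U) from (3.3), via the closed form
    (Lambda(U))_{lp} = S_ppp S_lpp - S_lll S_pll with S = A x_1 U^T x_2 V^T x_3 W^T."""
    S = np.einsum('ijk,ia,jb,kc->abc', A, U, V, W)
    d = np.einsum('iii->i', S)        # d[l] = S_lll
    M = np.einsum('lpp->lp', S)       # M[l, p] = S_lpp
    return d[None, :] * M - d[:, None] * M.T
```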

The gradient of the objective function will be needed in order to prove the following convergence result and also to check the pivot conditions (2.18).

3.2 Convergence theorem

The convergence of Algorithm 2.1 is given in Theorem 3.1.

Theorem 3.1

Every accumulation point \((U,V,W)\) of the sequence generated by Algorithm 2.1 is a stationary point of the function f defined by (3.1).

The proof follows the idea from [9], which was also used in [13], as well as in [1, 2]. The main obstacle is that here we have a function of three variables, while in the earlier papers this technique was applied to functions of a single variable. We prove Lemma 3.2, which is an adaptation of Lemma 3.1 from [13]. Then, using Lemma 3.2, we prove Lemma 3.4, which is an essential step in the proof of Theorem 3.1.

Lemma 3.2

For any differentiable function \(f:O_n\times O_n\times O_n\rightarrow {\mathbb {R}}\), any \(U,V,W\in O_n\), and any \(0<\eta \le \frac{2}{n}\) it is always possible to find index pairs \((i_U,j_U)\), \((i_V,j_V)\), \((i_W,j_W)\) such that the pivot condition (2.18) is satisfied (i) for \(Q=U\) with \((i,j)=(i_U,j_U)\), (ii) for \(Q=V\) with \((i,j)=(i_V,j_V)\), and (iii) for \(Q=W\) with \((i,j)=(i_W,j_W)\).

Proof

Observe that

$$\begin{aligned} {\dot{R}}(i,j,0)=e_je_i^T-e_ie_j^T. \end{aligned}$$

From the definition of the operator \(\Lambda\) we see that the matrix \(\Lambda (U)\) is skew-symmetric. Then, using the fact that the inner product is invariant under orthogonal transformations, together with relation (3.2), we have

$$\begin{aligned} \vert \langle \nabla _U f(U,V,W),U{\dot{R}}(i,j,0)\rangle \vert&=\vert \langle U\Lambda (U),U{\dot{R}}(i,j,0)\rangle \vert \nonumber \\&=\vert \langle \Lambda (U),{\dot{R}}(i,j,0)\rangle \vert =2\vert \Lambda (U)_{ij}\vert . \end{aligned}$$
(3.4)

We can always find an index pair \((i_U,j_U)\) such that

$$\begin{aligned} \vert \Lambda (U)_{i_Uj_U}\vert \ge \frac{1}{n}\Vert \Lambda (U)\Vert _2. \end{aligned}$$

Inserting this into equation (3.4) with \((i,j)=(i_U,j_U)\), we get

$$\begin{aligned} \vert \langle \nabla _U f(U,V,W),U{\dot{R}}(i_U,j_U,0)\rangle \vert \ge \frac{2}{n}\Vert \Lambda (U)\Vert _2\ge \eta \Vert \Lambda (U)\Vert _2=\eta \Vert \nabla _U f(U,V,W)\Vert _2, \end{aligned}$$

which proves assertion (i). Since the matrices \(\Lambda (V)\) and \(\Lambda (W)\) are also skew-symmetric, in the same way we obtain

$$\begin{aligned} \vert \langle \nabla _V f(U,V,W),V{\dot{R}}(i_V,j_V,0)\rangle \vert&\ge \frac{2}{n}\Vert \Lambda (V)\Vert _2\ge \eta \Vert \Lambda (V)\Vert _2=\eta \Vert \nabla _V f(U,V,W)\Vert _2, \\ \vert \langle \nabla _W f(U,V,W),W{\dot{R}}(i_W,j_W,0)\rangle \vert&\ge \frac{2}{n}\Vert \Lambda (W)\Vert _2\ge \eta \Vert \Lambda (W)\Vert _2=\eta \Vert \nabla _W f(U,V,W)\Vert _2. \end{aligned}$$

This proves assertions (ii) and (iii), respectively. \(\square\)

Remark 3.3

Conditions (2.18) are equivalent to

$$\begin{aligned} 2\vert \Lambda (Q)_{ij}\vert \ge \eta \Vert \Lambda (Q)\Vert _2, \quad {\text {for }} Q=U,V,W, \end{aligned}$$

where \(\Lambda (\cdot )\) is as in relation (3.3).
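
In an implementation this simplified test is what is checked in practice. A minimal sketch (the function name is ours; \(\Lambda (Q)\) is assumed to be available, for instance computed as in the sketch above, and \(\Vert \cdot \Vert _2\) is read as the matrix 2-norm):

```python
import numpy as np

def pivot_pair_admissible(Lmb, i, j, eta):
    """Simplified pivot test from Remark 3.3: 2 |Lambda(Q)_{ij}| >= eta ||Lambda(Q)||_2,
    where Lmb = Lambda(Q) for Q = U, V or W."""
    return 2.0 * abs(Lmb[i, j]) >= eta * np.linalg.norm(Lmb, 2)
```

A pivot pair \((i,j)\) is then used if the test passes for at least one of \(Q=U,V,W\).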

Lemma 3.4

Let \(U_k\), \(V_k\), \(W_k\), \(k\ge 0\) be the sequences generated by Algorithm 2.1. Let \({\overline{U}},{\overline{V}},{\overline{W}}\) be a triple of orthogonal matrices satisfying \(\nabla f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0\). Then there exist \(\epsilon >0\) and \(\delta >0\) such that

$$\begin{aligned} \Vert U_k-{\overline{U}}\Vert _2<\epsilon , \quad \Vert V_k-{\overline{V}}\Vert _2<\epsilon , \quad \Vert W_k-{\overline{W}}\Vert _2<\epsilon \end{aligned}$$

implies

$$\begin{aligned} f(U_{k+1},V_{k+1},W_{k+1})-f(U_k,V_k,W_k)\ge \delta . \end{aligned}$$
(3.5)

Proof

Let us fix the iteration step k. To shorten the notation set \(\phi _U=\phi _{U,k}\), \(\phi _V=\phi _{V,k}\), \(\phi _W=\phi _{W,k}\), and \(R_{U,k}=R(i_k,j_k,\phi _U)\), \(R_{V,k}=R(i_k,j_k,\phi _V)\), \(R_{W,k}=R(i_k,j_k,\phi _W)\). We define three functions \(h_k^{(1)},h_k^{(2)},h_k^{(3)}:{\mathbb {R}}\rightarrow {\mathbb {R}}\),

$$\begin{aligned} h_k^{(1)}(\phi _1)&=f(U_kR(i_k,j_k,\phi _1),V_k,W_k), \\ h_k^{(2)}(\phi _2)&=f(U_kR_{U,k},V_kR(i_k,j_k,\phi _2),W_k), \\ h_k^{(3)}(\phi _3)&=f(U_kR_{U,k},V_kR_{V,k},W_kR(i_k,j_k,\phi _3)). \end{aligned}$$

Further on, we define yet another function \(h_k:{\mathbb {R}}\times {\mathbb {R}}\times {\mathbb {R}}\rightarrow {\mathbb {R}}\),

$$\begin{aligned} h_k(\phi _1,\phi _2,\phi _3)=f(U_kR(i_k,j_k,\phi _1),V_kR(i_k,j_k,\phi _2),W_kR(i_k,j_k,\phi _3)). \end{aligned}$$

Since \(R(i_k,j_k,0)=I\),

$$\begin{aligned} h_k(0,0,0)=f(U_k,V_k,W_k). \end{aligned}$$

From the construction of Algorithm 2.1 we know that

$$\begin{aligned} \max _{\phi _1}h_k^{(1)}(\phi _1)&=h_k^{(1)}(\phi _U)=f(U_kR_{U,k},V_k,W_k), \\ \max _{\phi _2}h_k^{(2)}(\phi _2)&=h_k^{(2)}(\phi _V)=f(U_kR_{U,k},V_kR_{V,k},W_k), \\ \max _{\phi _3}h_k^{(3)}(\phi _3)&=h_k^{(3)}(\phi _W)=f(U_kR_{U,k},V_kR_{V,k},W_kR_{W,k}), \end{aligned}$$

and the kth step of the algorithm is represented by

$$\begin{aligned} h_k(\phi _U,\phi _V,\phi _W)=f(U_kR_{U,k},V_kR_{V,k},W_kR_{W,k})=f(U_{k+1},V_{k+1},W_{k+1}). \end{aligned}$$

Moreover, it is easy to see from the algorithm that

$$\begin{aligned} f(U_{k+1},V_{k+1},W_{k+1})&\ge f(U_{k+1},V_{k+1},W_k) \nonumber \\&\ge f(U_{k+1},V_k,W_k) \nonumber \\&\ge f(U_k,V_k,W_k). \end{aligned}$$
(3.6)

In order to attain inequality (3.5) we need at least one strict inequality in (3.6), with a gain of at least \(\delta\).

If \(\nabla f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0\), then at least one partial gradient of f is not zero, that is

$$\begin{aligned} \nabla _U f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0, \ \nabla _V f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0, {\text { or }} \nabla _W f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0. \end{aligned}$$

Let us assume that \(\nabla _U f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0\). Then there exists \(\epsilon >0\) such that

$$\begin{aligned} \mu _1:=\min \{\Vert \nabla _U f(U,V,W)\Vert _2 \ : \ \Vert U-{\overline{U}}\Vert _2<\epsilon \}>0. \end{aligned}$$
(3.7)

We use the Taylor expansion of the function \(h_k^{(1)}\) around 0,

$$\begin{aligned} h_k^{(1)}(\phi _1)=h_k^{(1)}(0)+(h_k^{(1)})'(0)\phi _1+\frac{1}{2}(h_k^{(1)})''(\xi )\phi _1^2, \quad 0<\xi <\phi _1. \end{aligned}$$

Set \(M_1=\max \vert (h_k^{(1)})''(\xi )\vert <\infty\). Then we have

$$\begin{aligned} h_k^{(1)}(\phi _1)-h_k^{(1)}(0) \ge (h_k^{(1)})'(0)\phi _1-\frac{1}{2}M_1\phi _1^2. \end{aligned}$$
(3.8)

The derivative of \(h_k^{(1)}\) is given by

$$\begin{aligned} (h_k^{(1)})'(\phi _1)= \langle \nabla _U f(U_kR(i_k,j_k,\phi _1),V_k,W_k), U_k{\dot{R}}(i_k,j_k,\phi _1)\rangle . \end{aligned}$$

In particular,

$$\begin{aligned} (h_k^{(1)})'(0)=\langle \nabla _U f(U_k,V_k,W_k), U_k{\dot{R}}(i_k,j_k,0)\rangle . \end{aligned}$$
(3.9)

It follows from Lemma 3.2(i) and relation (3.9) that

$$\begin{aligned} \vert (h_k^{(1)})'(0)\vert \ge \eta \Vert \nabla _U f(U_k,V_k,W_k)\Vert _2. \end{aligned}$$
(3.10)

Therefore, from (3.10) and (3.7) we get

$$\begin{aligned} \vert (h_k^{(1)})'(0)\vert \ge \eta \mu _1>0. \end{aligned}$$
(3.11)

We go back to the inequality (3.8). For \(\phi _1=\frac{1}{M_1}(h_k^{(1)})'(0)\), using the definition of the function \(h_k^{(1)}\), the fact that \(\phi _U\) maximizes \(h_k^{(1)}\), and relations (3.6) and (3.11), we obtain

$$\begin{aligned}&f(U_{k+1},V_{k+1},W_{k+1})-f(U_k,V_k,W_k) \\&\ge f(U_{k+1},V_k,W_k)-f(U_k,V_k,W_k) =h_k^{(1)}(\phi _U)-h_k^{(1)}(0) \ge h_k^{(1)}(\phi _1)-h_k^{(1)}(0) \\&\ge (h_k^{(1)})'(0)\phi _1-\frac{1}{2}M_1\phi _1^2 =\frac{1}{M_1}((h_k^{(1)})'(0))^2-\frac{1}{2M_1}((h_k^{(1)})'(0))^2 \\&\ge \frac{\eta ^2\mu _1^2}{2M_1}=\delta >0. \end{aligned}$$

Now we assume that \(\nabla _U f({\overline{U}},{\overline{V}},{\overline{W}})=0\) and \(\nabla _V f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0\). There is an \(\epsilon >0\) such that

$$\begin{aligned} \mu _2:=\min \{\Vert \nabla _V f(U,V,W)\Vert _2 \ : \ \Vert V-{\overline{V}}\Vert _2<\epsilon \}>0. \end{aligned}$$
(3.12)

In this case we use the Taylor expansion of the function \(h_k^{(2)}\) around 0. We have

$$\begin{aligned} h_k^{(2)}(\phi _2)-h_k^{(2)}(0) \ge (h_k^{(2)})'(0)\phi _2-\frac{1}{2}M_2\phi _2^2, \end{aligned}$$
(3.13)

for \(M_2=\max \vert (h_k^{(2)})''(\xi )\vert <\infty\), and

$$\begin{aligned} (h_k^{(2)})'(0)=\langle \nabla _V f(U_k,V_k,W_k), V_k{\dot{R}}(i_k,j_k,0)\rangle . \end{aligned}$$

Lemma 3.2(ii) and relation (3.12) imply

$$\begin{aligned} \vert (h_k^{(2)})'(0)\vert \ge \eta \Vert \nabla _V f(U_k,V_k,W_k)\Vert _2\ge \eta \mu _2>0. \end{aligned}$$
(3.14)

The assertion of the lemma follows from (3.13), (3.6), and (3.14) with \(\phi _2=\frac{1}{M_2}(h_k^{(2)})'(0)\).

Finally, if \(\nabla _U f({\overline{U}},{\overline{V}},{\overline{W}})=0\) and \(\nabla _V f({\overline{U}},{\overline{V}},{\overline{W}})=0\), then, since \(\nabla f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0\), it must be that \(\nabla _W f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0\). Then, there is an \(\epsilon >0\) such that

$$\begin{aligned} \mu _3:=\min \{\Vert \nabla _W f(U,V,W)\Vert _2 \ : \ \Vert W-{\overline{W}}\Vert _2<\epsilon \}>0. \end{aligned}$$
(3.15)

Here we need the Taylor expansion of \(h_k^{(3)}\) around 0,

$$\begin{aligned} h_k^{(3)}(\phi _3)-h_k^{(3)}(0) \ge (h_k^{(3)})'(0)\phi _3-\frac{1}{2}M_3\phi _3^2, \end{aligned}$$
(3.16)

for \(M_3=\max \vert (h_k^{(3)})''(\xi )\vert <\infty\). We repeat the same steps as in the preceding two cases. We have

$$\begin{aligned} (h_k^{(3)})'(0)=\langle \nabla _W f(U_k,V_k,W_k), W_k{\dot{R}}(i_k,j_k,0)\rangle , \end{aligned}$$

and, using Lemma 3.2(iii) and the relation (3.15), it follows that

$$\begin{aligned} \vert (h_k^{(3)})'(0)\vert \ge \eta \Vert \nabla _W f(U_k,V_k,W_k)\Vert _2\ge \eta \mu _3>0. \end{aligned}$$
(3.17)

We attain inequality (3.5) using (3.16), (3.6), and (3.17) with \(\phi _3=\frac{1}{M_3}(h_k^{(3)})'(0)\). \(\square\)

Using Lemma 3.4 we can now prove Theorem 3.1.

Proof of Theorem 3.1

Suppose that \({\overline{U}}\), \({\overline{V}}\), \({\overline{W}}\) are, respectively, accumulation points of the sequences \(\{U_j\}_{j\ge 1}\), \(\{V_j\}_{j\ge 1}\), \(\{W_j\}_{j\ge 1}\) generated by Algorithm 2.1. Then there are subsequences \(\{U_j\}_{j\in K_U}\), \(\{V_j\}_{j\in K_V}\), \(\{W_j\}_{j\in K_W}\) such that

$$\begin{aligned} \{U_j\}_{j\in K_U}\rightarrow {\overline{U}}, \quad \{V_j\}_{j\in K_V}\rightarrow {\overline{V}}, \quad \{W_j\}_{j\in K_W}\rightarrow {\overline{W}}, \end{aligned}$$

where \(K_U,K_V,K_W\subseteq {\mathbb {N}}\).

Assume that \(({\overline{U}},{\overline{V}},{\overline{W}})\) is not a stationary point of the function f, that is

$$\begin{aligned} \nabla f({\overline{U}},{\overline{V}},{\overline{W}})\ne 0. \end{aligned}$$
(3.18)

Then, for any \(\epsilon >0,\) there are \(k_0^{(U)}\in K_U\), \(k_0^{(V)}\in K_V\), \(k_0^{(W)}\in K_W\) such that

$$\begin{aligned} \Vert U_k-{\overline{U}}\Vert _2<\epsilon , \quad \Vert V_k-{\overline{V}}\Vert _2<\epsilon , \quad \Vert W_k-{\overline{W}}\Vert _2<\epsilon , \end{aligned}$$

for every \(k>k_0\), \(k_0=\max \{k_0^{(U)},k_0^{(V)},k_0^{(W)}\}\). Thus, Lemma 3.4 implies \(f(U_{k+1},V_{k+1},W_{k+1})-f(U_k,V_k,W_k)\ge \delta >0\). It follows that

$$\begin{aligned} f(U_k,V_k,W_k)\rightarrow \infty , \end{aligned}$$

when \(k\rightarrow \infty\). On the other hand, \(f\) is continuous on the compact set \(O_n\times O_n\times O_n\) and therefore bounded. This gives a contradiction. Therefore, assumption (3.18) cannot hold and \(({\overline{U}},{\overline{V}},{\overline{W}})\) is a stationary point of f. \(\square\)

Note that all results from this section can be generalized to order-d tensors, \(d>3\).

4 Numerical examples

We illustrate the convergence of Algorithm 2.1 through several numerical examples. We monitor the relative off-norm of a tensor \({\mathcal {A}}\), which is given as

$$\begin{aligned} \frac{{\text {off}}({\mathcal {A}})}{\Vert {\mathcal {A}}\Vert }. \end{aligned}$$
(4.1)

For a diagonal tensor, the value (4.1) is equal to zero, while for a random tensor it is typically close to one. Note that the off-norm is not a norm because it can be equal to zero for a nonzero tensor.

Figure 1 shows the change in the relative off-norm. We distinguish two situations: one where a tensor can be completely diagonalized using orthogonal transformations, and a more general one where orthogonal diagonalization is not possible. For the first set of tensors we get \(\frac{{\text {off}}({\mathcal {A}})}{\Vert {\mathcal {A}}\Vert }=0\); otherwise, we get convergence to some value between 0 and 1. A random diagonalizable tensor is constructed by taking a diagonal tensor with random entries on the diagonal and multiplying it in each mode by orthogonal matrices obtained from QR factorizations of three random matrices. The algorithm uses the row-wise cyclic pivot ordering (2.16) with different values of the parameter \(\eta\).
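
For reference, the relative off-norm (4.1) and the construction of a random orthogonally diagonalizable test tensor can be sketched as follows (function names and the random seed are ours).

```python
import numpy as np

def relative_off_norm(A):
    """Relative off-norm (4.1): off(A) / ||A||."""
    d = np.einsum('iii->i', A)
    off2 = np.sum(A ** 2) - np.sum(d ** 2)
    return np.sqrt(max(off2, 0.0)) / np.sqrt(np.sum(A ** 2))

def random_diagonalizable_tensor(n, seed=0):
    """Diagonal core with random diagonal entries, multiplied in each mode by a
    random orthogonal matrix (the Q factor of a random Gaussian matrix)."""
    rng = np.random.default_rng(seed)
    S = np.zeros((n, n, n))
    S[np.arange(n), np.arange(n), np.arange(n)] = rng.standard_normal(n)
    Q = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(3)]
    return np.einsum('abc,ia,jb,kc->ijk', S, Q[0], Q[1], Q[2])
```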

Fig. 1 Change in the relative off-norm for two \(30\times 30\times 30\) tensors with different values of \(\eta\). Left: Diagonalizable tensor. Right: Non-diagonalizable tensor

In Figure 2 we compare five different pivot orderings. In addition to the row-wise top to bottom (2.16) and the column-wise left to right (2.17) ordering we have the row-wise bottom to top

$$\begin{aligned} {\mathcal {O}}_r' & =(n-1,n),(n-2,n-1),(n-2,n),(n-3,n-2),\ldots \\ &\qquad \ldots,(2,n),(1,2),\ldots ,(1,n), \end{aligned}$$

the column-wise right to left

$$\begin{aligned} {\mathcal {O}}_c'=(1,n),\ldots ,(n-1,n),(1,n-1),\ldots ,(n-2,n-1),\ldots ,(1,2), \end{aligned}$$

and the diagonal ordering of pivot pairs

$$\begin{aligned} {\mathcal {O}}_d=(1,2),(2,3),\ldots ,(n-1,n),(1,3),(2,4),\ldots ,(n-2,n),(1,4),\ldots ,(1,n). \end{aligned}$$

We run the algorithm on six random tensors, four of which cannot be diagonalized by orthogonal transformations, all with \(\eta =\frac{1}{20n}\). As expected, different pivot strategies are faster or slower on different tensors. However, regardless of the chosen pivot strategy, the algorithm converges to the same point.

Fig. 2 Change in the relative off-norm for six \(10\times 10\times 10\) tensors with different pivot strategies

In Figure 3 we compare two different initializations of our Jacobi-type algorithm. The first one is the identity initialization used in Algorithm 2.1, where we have

$$\begin{aligned} {\mathcal {A}}^{(0)}={\mathcal {A}}, \quad U_0=V_0=W_0=I. \end{aligned}$$
(4.2)

The other initialization that can be used comes from the HOSVD of \({\mathcal {A}}\), see [5], \({\mathcal {A}}={\widetilde{\mathcal {S}}}\times _1{\widetilde{U}}\times _2{\widetilde{V}}\times _3{\widetilde{W}},\) where \({\widetilde{U}}\), \({\widetilde{V}}\), and \({\widetilde{W}}\) are the matrices of left singular vectors of the matricizations \(A_{(1)}\), \(A_{(2)}\), and \(A_{(3)}\), respectively, and \({\widetilde{\mathcal {S}}}={\mathcal {A}}\times _1{\widetilde{U}}^T\times _2{\widetilde{V}}^T\times _3{\widetilde{W}}^T.\) Then, instead of the initialization (4.2), we set \({\mathcal {A}}^{(0)}={\widetilde{\mathcal {S}}}\), \(U_0={\widetilde{U}}\), \(V_0={\widetilde{V}}\), \(W_0={\widetilde{W}}\). We run the algorithm with \(\eta =\frac{1}{20n}\) on two random tensors. We can see that the HOSVD initialization is superior in the first few cycles. This is because, compared to the starting tensor \({\mathcal {A}}\), the core tensor \({\widetilde{\mathcal {S}}}\) from the HOSVD of \({\mathcal {A}}\) is significantly closer to a diagonal tensor. Nevertheless, after those first cycles, both initializations perform equally well.
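
A sketch of the HOSVD initialization (the function name is ours; the left singular vectors do not depend on the column ordering used for the matricizations):

```python
import numpy as np

def hosvd_initialization(A):
    """U0, V0, W0 are left singular vectors of the matricizations A_(1), A_(2), A_(3);
    A0 = A x_1 U0^T x_2 V0^T x_3 W0^T is the corresponding core tensor."""
    n1, n2, n3 = A.shape
    U0 = np.linalg.svd(A.reshape(n1, -1))[0]
    V0 = np.linalg.svd(np.moveaxis(A, 1, 0).reshape(n2, -1))[0]
    W0 = np.linalg.svd(np.moveaxis(A, 2, 0).reshape(n3, -1))[0]
    A0 = np.einsum('ijk,ia,jb,kc->abc', A, U0, V0, W0)
    return A0, U0, V0, W0
```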

Fig. 3 Convergence of the algorithm with different initializations. Left: \(20\times 20\times 20\) tensor. Right: \(10\times 10\times 10\) tensor

5 Symmetric and antisymmetric tensors

We say that a tensor is symmetric if its entries do not change under any permutation of the indices. For a symmetric tensor \({\mathcal {X}}\in {\mathbb {R}}^{n\times n\times n}\) we have

$$\begin{aligned} x_{ijk}=x_{ikj}=x_{jik}=x_{jki}=x_{kij}=x_{kji}. \end{aligned}$$

Symmetric tensors were studied in detail in [13] and [12], where the authors also work with a Jacobi-type algorithm.

Our algorithm is not structure-preserving. In order to have a symmetry-preserving Jacobi-type algorithm, the rotation matrices would have to be the same in all modes. Since the rotations \(R_{U,k}\), \(R_{V,k}\), and \(R_{W,k}\) in the kth step are computed from different tensors, they are not all the same. Nevertheless, we have noticed in practice that for smaller values of \(\eta\) from (2.18), after the convergence criterion is satisfied, Algorithm 2.1 in most cases returns mutually equal matrices \(U,V,W\) and a symmetric tensor \(\mathcal {S}\). However, this is not the case for larger \(\eta\). We illustrate this behaviour with an example.

Departure from symmetry is measured as the distance in the Frobenius norm between the tensor \({\mathcal {A}}\) and its symmetrization \({\text {sym}}({\mathcal {A}})\),

$$\begin{aligned} \Vert {\mathcal {A}}-{\text {sym}}({\mathcal {A}})\Vert , \end{aligned}$$
(5.1)

where, for a 3rd order tensor \({\mathcal {X}}\), we have

$$\begin{aligned} ({\text {sym}}({\mathcal {X}}))_{ijk}={\frac{1}{6}}(x_{ijk}+x_{ikj}+x_{jik}+x_{jki}+x_{kij}+x_{kji}). \end{aligned}$$

It is easy to check that, if \({\mathcal {X}}\) is symmetric, then \({\text {sym}}({\mathcal {X}})={\mathcal {X}}\), and the expression (5.1) is equal to zero. We applied Algorithm 2.1 with \(\eta =\frac{1}{2000n}\) to a randomly generated symmetric \(20\times 20\times 20\) tensor \({\mathcal {A}}\). In the left picture in Figure 4 we can see that, although we start with a symmetric tensor, symmetry is lost already in the first cycle, as expected, but the tensor becomes more and more symmetric through the iterations and the sequence \(({\mathcal {A}}^{(k)})_k\) converges to a symmetric tensor. In the right picture we see that the distance between each pair of the matrices \(U_k,V_k,W_k\) converges to zero, that is, \(U_k,V_k,W_k\) converge to the same matrix. This does not happen for \(\eta =\frac{1}{20n}\).
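
Both the symmetrization and the distance (5.1) are straightforward to compute; a small sketch (function names are ours):

```python
import numpy as np
from itertools import permutations

def sym(X):
    """Symmetrization of a third-order tensor: average over all six index permutations."""
    return sum(np.transpose(X, p) for p in permutations(range(3))) / 6.0

def departure_from_symmetry(X):
    """Distance (5.1) between X and sym(X) in the Frobenius norm."""
    return np.sqrt(np.sum((X - sym(X)) ** 2))
```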

Fig. 4 Departure from symmetry for a random symmetric \(20\times 20\times 20\) tensor

One should keep in mind that for a symmetric starting tensor the solution of the maximization problem (2.5) is not necessarily a symmetric tensor. One such example is the tensor \({\mathcal {T}}\) from [12, Example 5.5], which can be given by its matricization

$$\begin{aligned} T_{(1)}={ \left[ \begin{array}{ccccccccc} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ \end{array} \right] }. \end{aligned}$$
(5.2)

First, we notice that neither the identity nor the HOSVD initialization works on this tensor. In both cases the diagonal elements of \({\mathcal {T}}^{(0)}\) are zero and all rotation angles are zero, so the tensor remains unchanged. Thus, here we do preconditioning with a random orthogonal matrix Q by setting \({{\mathcal {T}}}^{(0)}={{\mathcal {T}}}\times _1Q\times _2Q\times _3Q.\) For the vast majority of the choices of Q, Algorithm 2.1 with \(\eta =\frac{1}{2000n}\) converges to the symmetric tensor \({\mathcal {S}}\),

$$\begin{aligned} S_{(1)}={ \left[ \begin{array}{ccccccccc} 0.8889 &{} -0.4444 &{} -0.4444 &{} -0.4444 &{} -0.4444 &{} -0.1111 &{} -0.4444 &{} -0.1111 &{} -0.4444 \\ -0.4444 &{} -0.4444 &{} -0.1111 &{} -0.4444 &{} 0.8889 &{} -0.4444 &{} -0.1111 &{} -0.4444 &{} -0.4444 \\ -0.4444 &{} -0.1111 &{} -0.4444 &{} -0.1111 &{} -0.4444 &{} -0.4444 &{} -0.4444 &{} -0.4444 &{} 0.8889 \\ \end{array} \right] }, \end{aligned}$$

with the transformation matrices \(U=V=W\) depending on Q. This is a stationary point of the objective function (2.6), but not a point of its global maximum. In the other, rare, cases the algorithm converged to one of the better, but nonsymmetric, solutions of the form \(\bar{{\mathcal {S}}}\),

$$\begin{aligned} {\bar{S}}_{(1)}={ \left[ \begin{array}{ccccccccc} \pm 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} \pm 1 &{} 0 \\ 0 &{} 0 &{} \pm 1 &{} 0 &{} \pm 1 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} \pm 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} \pm 1 \\ \end{array} \right] }. \end{aligned}$$

On the other hand, a tensor is antisymmetric if its entries change sign whenever two indices are interchanged. For an antisymmetric tensor \({{\mathcal {X}}}\in {\mathbb {R}}^{n\times n\times n}\) we have

$$\begin{aligned} x_{ijk}=x_{jki}=x_{kij}=-x_{ikj}=-x_{jik}=-x_{kji}. \end{aligned}$$

In every antisymmetric tensor, the entries at positions where two or more indices coincide are equal to zero. Hence, all elements on the diagonal of an antisymmetric tensor are zero. This is the reason why, contrary to the symmetric case, where one may be interested in a structure-preserving algorithm, we are not interested in preserving the antisymmetry. Here, with each iteration the tensor moves further away from the antisymmetric structure. Still, antisymmetric tensors need some special attention. If we apply the algorithm directly in the form given in Algorithm 2.1, with the identity initialization, it cannot perform any rotation: for an antisymmetric tensor, when computing the tangent of the double angle in (2.13), (2.14), and (2.15), both the numerator and the denominator are equal to zero for every pivot pair. We can overcome this problem with a preconditioning step: instead of the identity initialization (4.2) we use the HOSVD initialization as described in Section 4.