1 Introduction

We consider the problem of decomposing a real-valued symmetric tensor as the sum of outer products of real-valued vectors. Let \({\mathcal {A}}\) represent an \(m\)-way, \(n\)-dimensional symmetric tensor. Given a real-valued vector \(\mathbf {x}\) of length \(n\), we let \(\mathbf {x}^{\circ m}\) denote the \(m\)-way, \(n\)-dimensional symmetric outer product tensor such that \((\mathbf {x}^{\circ m})_{i_1 i_2 \cdots i_m} = x_{i_1} x_{i_2} \cdots x_{i_m}\). Comon et al. [15] showed that any real-valued symmetric tensor can be decomposed as

$$\begin{aligned} {\mathcal {A}}= \sum _{k=1}^{p} \lambda _k \, \mathbf {x}_k^{\circ m} \end{aligned}$$
(1)

with \(\lambda _k \in {\mathbb {R}}\) and \(\mathbf {x}_k \in {\mathbb {R}}^{n}\) for \(k\,=\,1,\dots ,p\); see the illustration in Fig. 1. We assume that the tensor is low-rank, i.e., \(p\) is small relative to the typical rank of a random tensor. We survey the methods that have been proposed for related problems and discuss several optimization formulations, including a surprisingly effective method that ignores the symmetry.
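As a small illustration of (1), the following plain-MATLAB sketch forms a rank-\(p\) symmetric tensor for \(m\,=\,3\); the sizes and values are hypothetical, and the snippet is not taken from the software used later in the paper.

```matlab
% Sketch: form a rank-p symmetric tensor per (1) for m = 3 (hypothetical sizes/values).
n = 4;  p = 2;
lambda = [1; -1];                 % weights lambda_k
X = randn(n, p);                  % columns are the vectors x_k
A = zeros(n, n, n);
for k = 1:p
    xk = X(:, k);
    A = A + lambda(k) * reshape(kron(xk, kron(xk, xk)), [n n n]);   % lambda_k * x_k o x_k o x_k
end
```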

Fig. 1 Symmetric tensor factorization for \(m\,=\,3\)

We also consider the related problem of decomposing a real-valued nonnegative symmetric tensor as the sum of outer products of real-valued nonnegative vectors. Let \({\mathcal {A}}\) represent an \(m\)-way, \(n\)-dimensional nonnegative symmetric tensor. In this case, the goal is a factorization of the form

$$\begin{aligned} {\mathcal {A}}= \sum _{k=1}^{p} \mathbf {x}_k^{\circ m} \quad \text {with}\quad \mathbf {x}_k \ge 0 \text { for } k = 1,\dots ,p. \end{aligned}$$
(2)

If such a factorization exists, we say that \({\mathcal {A}}\) is completely positive [39]. If such a factorization does not exist, then we propose to solve a “best fit” problem instead.

The paper is structured as follows. Section 2 provides notation and background material. Related decompositions, including the best symmetric rank-1 approximation, the symmetric Tucker decomposition, partially symmetric decompositions, and the complex-valued canonical decomposition, are discussed in Sect. 3. We describe two optimization formulations for symmetric decomposition in Sect. 4, and a mathematical program for the nonnegative problem in Sect. 5. Numerical results, including the methodology for generating challenging problems, are presented in Sect. 6. Finally, Sect. 7 discusses our findings and future challenges.

2 Background

2.1 Notation and preliminaries

A tensor is a multidimensional array. The number of ways or modes is called the order of a tensor. For example, a matrix is a tensor of order two. Tensors of order three or greater are called higher-order tensors.

Let \(n_1 \times n_2 \times \cdots \times n_m\) denote the size of an \(m\)-way tensor. We say that the tensor is cubic if all the modes have the same size, i.e., \(n\,=\,n_1\,=\,n_2\,=\,\cdots \,=\,n_m\). In other words, “cubic” is the tensor generalization of “square.” In this case, we refer to \(n\) as the dimension of the tensor. We let \({\mathbb {R}}^{[m,n]}\) denote the space of all cubic real-valued tensors of order \(m\) and dimension \(n\). As appropriate, we use multiindex notation to compactly index tensors, so that \(\mathbf {i}\,=\,(i_1, i_2, \dots , i_m)\). Thus, \(a_{\mathbf {i}}\) denotes \(a_{i_1 i_2 \cdots i_m}\).

The norm of a tensor \({\mathcal {A}}\) is the square root of the sum of squares of its elements, i.e.,

$$\begin{aligned} \Vert {\mathcal {A}}\Vert = \sqrt{ \sum _{i_1=1}^{n} \sum _{i_2=1}^{n} \cdots \sum _{i_m=1}^{n} a_{i_1 i_2 \cdots i_m}^{2} }. \end{aligned}$$

Unless otherwise noted, all norms are the (elementwise) \(\ell _2\)-norm.

2.2 Symmetric tensors

A tensor is symmetric if its entries do not change under permutation of the indices. Formally, we let \(\pi (m)\) denote the set of permutations of length \(m\). For instance,

$$\begin{aligned} \pi (3) = \bigl \{\, (1,2,3),\ (1,3,2),\ (2,1,3),\ (2,3,1),\ (3,1,2),\ (3,2,1) \,\bigr \}. \end{aligned}$$

It is well known that \(|\pi (m)| \,=\, m!\). We say a real-valued \(m\)-way \(n\)-dimensional tensor \({\mathcal {A}}\) is symmetric [15] if

$$\begin{aligned} a_{i_{\sigma (1)} i_{\sigma (2)} \cdots i_{\sigma (m)}} = a_{i_1 i_2 \cdots i_m} \quad \text {for all}\quad (i_1,\dots ,i_m) \in \{1,\dots ,n\}^m \text { and } \sigma \in \pi (m). \end{aligned}$$

Such tensors are also sometimes referred to as supersymmetric. For a 3-way tensor \({\mathcal {A}}\) of dimension \(n\), symmetry means

$$\begin{aligned} a_{ijk} = a_{ikj} = a_{jik} = a_{jki} = a_{kij} = a_{kji} \quad \text {for all}\quad i, j, k \in \{1,\dots ,n\}. \end{aligned}$$

We let \({\mathbb {S}}^{[m,n]}\subset {\mathbb {R}}^{[m,n]}\) denote the subspace of all symmetric tensors.
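As a concrete check of this definition, the following plain-MATLAB sketch (a hypothetical helper, not part of the Tensor Toolbox) tests whether a cubic tensor is unchanged by every permutation of its modes; save it as is_symmetric.m and call, e.g., is_symmetric(A, 1e-12).

```matlab
function tf = is_symmetric(A, tol)
% Sketch: return true if the cubic tensor A is (numerically) symmetric, i.e.,
% unchanged by every permutation of its modes.
m = ndims(A);
P = perms(1:m);                    % all m! mode orderings
tf = true;
for r = 1:size(P, 1)
    if norm(A(:) - reshape(permute(A, P(r, :)), [], 1)) > tol * norm(A(:))
        tf = false;
        return;
    end
end
end
```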

2.3 Symmetric outer product tensors

A tensor in \({\mathbb {S}}^{[m,n]}\) is called rank one if it has the form \(\lambda \, \mathbf {x}^{\circ m}\) where \(\lambda \in {\mathbb {R}}\) and \(\mathbf {x}\in {\mathbb {R}}^{n}\). If \(m\) is odd or \(\lambda > 0\), then the \(m\)th real root of \(\lambda \) always exists, so we can rewrite the tensor as

$$\begin{aligned} \lambda \, \mathbf {x}^{\circ m} = \bigl ( \lambda ^{1/m}\, \mathbf {x}\bigr )^{\circ m}. \end{aligned}$$
(3)

If \(m\) is even, however, the \(m\)th real root does not exist if \(\lambda < 0\), so the scalar cannot be absorbed as in (3).

2.4 Model parameters

For the symmetric decomposition, we let \(\varvec{\lambda }\in {\mathbb {R}}^{p}\) denote the vector of weights and \(\mathbf {X}\in {\mathbb {R}}^{n \times p}\) denote the matrix of component vectors, i.e.,

$$\begin{aligned} \varvec{\lambda }= \begin{bmatrix} \lambda _1&\lambda _2&\cdots&\lambda _p \end{bmatrix}^{T} \quad \text {and}\quad \mathbf {X}= \begin{bmatrix} \mathbf {x}_1&\mathbf {x}_2&\cdots&\mathbf {x}_p \end{bmatrix}. \end{aligned}$$

The notation \(x_{ik}\) refers to the \(i\)th entry in the \(k\)th column of \(\mathbf {X}\), so recalling the multiindex notation \(\mathbf {i}\,=\,(i_1, i_2, \dots , i_m)\), we have

$$\begin{aligned} a_{\mathbf {i}} = \sum _{k=1}^{p} \lambda _k \, x_{i_1 k}\, x_{i_2 k} \cdots x_{i_m k}. \end{aligned}$$

3 Related problems

3.1 Canonical polyadic tensor decomposition

Canonical polyadic (CP) tensor decomposition has been known since 1927 [25, 26]. It is known under several names, two of the most prominent being CANDECOMP as proposed by Carroll and Chang [14] and PARAFAC by Harshman [24]. Originally, the term CP was proposed as a combination of these two names [29], but more recently it has been re-purposed to mean “canonical polyadic.” For details on CP, we refer the reader to the survey [31]. Here, we describe the problem in the case of a cubic tensor \({\mathcal {A}}\in {\mathbb {R}}^{[m,n]}\). Our goal is to discover a decomposition of the form

$$\begin{aligned} {\mathcal {A}}= \sum _{k=1}^{p} \lambda _k \, \mathbf {x}^{(1)}_k \circ \mathbf {x}^{(2)}_k \circ \cdots \circ \mathbf {x}^{(m)}_k. \end{aligned}$$
(4)

The circle denotes the vector outer product, so the entry at index \(\mathbf {i}\,=\,(i_1, i_2, \dots , i_m)\) is

$$\begin{aligned} a_{\mathbf {i}} = \sum _{k=1}^{p} \lambda _k \, x^{(1)}_{i_1 k}\, x^{(2)}_{i_2 k} \cdots x^{(m)}_{i_m k}. \end{aligned}$$

Each summand is called a component. An illustration is shown in Fig. 2. One of the most effective methods for this problem is alternating least squares (ALS). We solve for each factor matrix

$$\begin{aligned} \mathbf {X}^{(j)} = \begin{bmatrix} \mathbf {x}^{(j)}_1&\mathbf {x}^{(j)}_2&\cdots&\mathbf {x}^{(j)}_p \end{bmatrix} \in {\mathbb {R}}^{n \times p} \end{aligned}$$

in turn by solving a linear least squares problem, cycling through all modes (i.e., \(j\,=\,1,\dots ,m\)) repeatedly until convergence. See, e.g., [31, Figure 3.3] for details.
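For reference, here is a minimal plain-MATLAB sketch of the ALS sweep for a cubic third-order tensor (\(m\,=\,3\)); the weights \(\lambda _k\) are absorbed into the factors, the data and sizes are hypothetical, and production codes such as cp_als in the Tensor Toolbox additionally handle general \(m\), normalization, and convergence checks.

```matlab
% Sketch of CP-ALS for a cubic tensor with m = 3 (illustration only).
n = 10;  p = 3;
A = randn(n, n, n);                               % hypothetical data tensor
X1 = randn(n, p);  X2 = randn(n, p);  X3 = randn(n, p);
% Khatri-Rao (columnwise Kronecker) product: column r is kron(C(:,r), B(:,r)).
kr = @(C, B) reshape(bsxfun(@times, reshape(B, n, 1, p), reshape(C, 1, n, p)), n*n, p);
for sweep = 1:50
    X1 = reshape(A, n, n*n)                   / kr(X3, X2)';   % mode-1 unfolding: A_(1) = X1 * kr(X3,X2)'
    X2 = reshape(permute(A, [2 1 3]), n, n*n) / kr(X3, X1)';   % mode-2 unfolding
    X3 = reshape(permute(A, [3 1 2]), n, n*n) / kr(X2, X1)';   % mode-3 unfolding
end
```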

Fig. 2 CP tensor factorization for \(m\,=\,3\)

3.2 Canonical decomposition with partial symmetry

Partial symmetry has been considered since the work of Carroll and Chang [14]. At the same time Carroll and Chang [14] introduced CANDECOMP, they also defined INDSCAL, which assumes two modes are symmetric. For simplicity of discussion, we assume a cubic tensor \({\mathcal {A}}\in {\mathbb {R}}^{[3,n]}\). For \(m\,=\,3\) and the last two dimensions being symmetric, this means

$$\begin{aligned} a_{ijk} = a_{ikj} \quad \text {for all}\quad i, j, k \in \{1,\dots ,n\}, \end{aligned}$$

and the factorization should be of the form

$$\begin{aligned} {\mathcal {A}}= \sum _{k=1}^{p} \mathbf {y}_k \circ \mathbf {x}_k \circ \mathbf {x}_k. \end{aligned}$$

In other words, the last two vectors in each component are equal. An illustration is provided in Fig. 3.

Fig. 3 INDSCAL tensor factorization for \(m\,=\,3\)

Carroll and Chang [14] proposed to use an alternating method that ignores symmetry, with the idea that it will often converge to a symmetric solution (up to diagonal scaling). Later work showed that not all KKT points satisfy this condition [18]. In §4.7, we show how a generalization of this method can be surprisingly effective for symmetric tensor decomposition and provide some motivation for why this might be the case.

We also note that the methods proposed in this manuscript can be extended to partial symmetries.

3.3 Best symmetric rank-1 approximation

The best symmetric rank-1 approximation problem is

$$\begin{aligned} \min _{\lambda \in {\mathbb {R}},\, \mathbf {x}\in {\mathbb {R}}^{n}} \; \bigl \Vert {\mathcal {A}}- \lambda \, \mathbf {x}^{\circ m} \bigr \Vert ^{2} \quad \text {subject to}\quad \Vert \mathbf {x}\Vert = 1. \end{aligned}$$
(5)

An illustration is shown in Fig. 4. This problem was first considered by De Lathauwer et al. [17], but their proposed symmetric higher-order power method was not guaranteed to converge. Subsequent work modified the power method so that it is convergent [30, 32, 33, 41].

Fig. 4 Best symmetric rank-1 decomposition for \(m\,=\,3\)

This problem is directly related to the problem of computing tensor Z-eigenpairs. A pair \((\lambda , \mathbf {x}) \in {\mathbb {R}}\times {\mathbb {R}}^{n}\) is a Z-eigenpair [34, 38] of a tensor \({\mathcal {A}}\in {\mathbb {S}}^{[m,n]}\) if

$$\begin{aligned} {\mathcal {A}}\mathbf {x}^{m-1} = \lambda \, \mathbf {x} \quad \text {and}\quad \Vert \mathbf {x}\Vert = 1, \end{aligned}$$

where \({\mathcal {A}}\mathbf {x}^{m-1}\) denotes a vector in \({\mathbb {R}}^n\) such that

$$\begin{aligned} \bigl ( {\mathcal {A}}\mathbf {x}^{m-1} \bigr )_{i_1} = \sum _{i_2=1}^{n} \cdots \sum _{i_m=1}^{n} a_{i_1 i_2 \cdots i_m}\, x_{i_2} \cdots x_{i_m}. \end{aligned}$$

The problems are related because any Karush-Kuhn-Tucker (KKT) point of (5) is a Z-eigenpair of \({\mathcal {A}}\); see, e.g., [32].

Han [23] has considered an unconstrained optimization formulation of the problem (5). Cui et al. [16] use Jacobian SDP relaxations in polynomial optimization to find all real eigenvalues sequentially, from the largest to the smallest. Nie and Wang [36] consider semidefinite relaxations.
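To make the connection concrete, here is a minimal plain-MATLAB sketch of a shifted symmetric higher-order power iteration for \(m\,=\,3\), in the spirit of the convergent variants cited above; the tensor A (assumed symmetric), the shift value, and the iteration count are hypothetical choices.

```matlab
% Sketch: shifted symmetric higher-order power iteration for m = 3.
% Assumes A is a symmetric n x n x n array; alpha is a hypothetical positive shift.
n = size(A, 1);
alpha = 2;
x = randn(n, 1);  x = x / norm(x);
for iter = 1:200
    Ax2 = reshape(A, n, n*n) * kron(x, x);   % the vector A x^{m-1}
    x = Ax2 + alpha * x;                     % shifted update
    x = x / norm(x);
end
lambda = x' * (reshape(A, n, n*n) * kron(x, x));   % eigenvalue estimate A x^m
```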

3.4 Symmetric Tucker decomposition

A related problem is symmetric Tucker decomposition. Here the goal is to find a matrix \(\mathbf {U}\in {\mathbb {R}}^{n \times r}\) with orthonormal columns and a symmetric core tensor \({\mathcal {G}}\in {\mathbb {S}}^{[m,r]}\) that solve

$$\begin{aligned} \min _{\mathbf {U},\, {\mathcal {G}}} \; \bigl \Vert {\mathcal {A}}- {\mathcal {G}}\times _1 \mathbf {U}\times _2 \mathbf {U}\cdots \times _m \mathbf {U}\bigr \Vert , \end{aligned}$$

where \(\times _j\) denotes the mode-\(j\) tensor-matrix product.

An illustration is shown in Fig. 5. This topic has been considered in [13, 27, 40] and is useful for compression and signal processing applications. Alas, the computational techniques are quite different, so we do not consider them further in this work.

Fig. 5 Symmetric Tucker decomposition for \(m\,=\,3\)

3.5 Complex-valued symmetric tensor decomposition

An alternative version of the problem allows a complex decomposition, i.e.,

$$\begin{aligned} {\mathcal {A}}= \sum _{k=1}^{p} \lambda _k \, \mathbf {x}_k^{\circ m} \quad \text {with}\quad \lambda _k \in {\mathbb {C}} \text { and } \mathbf {x}_k \in {\mathbb {C}}^{n}. \end{aligned}$$
(6)

Techniques from algebraic geometry have been proposed to solve (6) in [10–12, 37]. More recently, Nie [35] has devised a combination of algebraic and numerical approaches for solving this problem. Generally, these approaches do not scale to large \(n\), though Nie’s numerical method scales much better than previous approaches.

In the complex case, the typical rank (i.e., with probability one) is given by the theorem below. To the best of our knowledge, for the real case, no analogous results are known [15].

Theorem 1

(Alexander-Hirschowitz [4, 15]) For \(m > 2\), the typical symmetric rank (over \({\mathbb {C}}\)) of an order-\(m\) symmetric tensor of dimension \(n\) is

$$\begin{aligned} \left\lceil \frac{1}{n} \left( {\begin{array}{c}n+m-1\\ m\end{array}}\right) \right\rceil \end{aligned}$$

except for \((m,n) \in \{ (3,5),\, (4,3),\, (4,4),\, (4,5) \}\), where it should be increased by one.
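For example, for \(m\,=\,3\) and \(n\,=\,4\) the formula gives \(\lceil 20/4 \rceil = 5\); the one-line computation below evaluates the generic formula (it does not handle the exceptional pairs).

```matlab
% Generic typical symmetric rank over C from Theorem 1 (exceptional pairs not handled).
m = 3;  n = 4;
typical_rank = ceil(nchoosek(n + m - 1, m) / n)   % = 5 for this example
```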

4 Optimization formulations for symmetric tensor decomposition

4.1 Index multiplicities

A tensor \({\mathcal {A}}\in {\mathbb {S}}^{[m,n]}\) has \(n^m\) entries, but not all are distinct. Let the set of all possible indices be denoted by

$$\begin{aligned} {\mathcal {R}}= \bigl \{\, (i_1, i_2, \dots , i_m) : i_j \in \{1,\dots ,n\} \text { for } j = 1,\dots ,m \,\bigr \}. \end{aligned}$$

Clearly, \(|{\mathcal {R}}| = n^m\).

Following [9], we define an index class as a set of tensor indices such that the corresponding tensor entries all share a value due to symmetry. For example, for \(m\,=\,3\) and \(n\,=\,2\), the tensor indices \((1,1,2)\) and \((1,2,1)\) are in the same index class since \(a_{112}\,=\,a_{121}\). For each index class, we specify an index representation which is an index such that the entries are in nondecreasing order. For instance, \((1,1,2)\) is the index representation for the index class that includes \(a_{121}\). The set

$$\begin{aligned} {\mathcal {I}}= \bigl \{\, (i_1, i_2, \dots , i_m) \in {\mathcal {R}}: i_1 \le i_2 \le \cdots \le i_m \,\bigr \} \end{aligned}$$

denotes all possible index representations.

Each index class also has a monomial representation [9]. For each index representation \(\mathbf {i}\in {\mathcal {I}}\) there is a corresponding monomial representation \(\mathbf {c}\,=\,(c_1, c_2, \dots , c_n)\) such that

$$\begin{aligned} x_{i_1} x_{i_2} \cdots x_{i_m} = x_1^{c_1} x_2^{c_2} \cdots x_n^{c_n}. \end{aligned}$$

Specifically, \(c_j\) is the number of occurrences of index \(j\) in \(\mathbf {i}\) for \(j\,=\,1,\dots ,n\). Clearly, \(\sum _j c_j \,=\, m\). Conversely, for a given \(\mathbf {c}\), we build an index with \(c_1\) copies of 1, \(c_2\) copies of 2, etc. This results in an \(m\)-long index representation. The set of monomial representations is denoted by

$$\begin{aligned} {\mathcal {C}}= \bigl \{\, (c_1, c_2, \dots , c_n) : c_j \ge 0 \text { for } j = 1,\dots ,n \text { and } \textstyle \sum _j c_j = m \,\bigr \}. \end{aligned}$$

From [9], we have that the number of distinct entries of \({\mathcal {A}}\in {\mathbb {S}}^{[m,n]}\) is given by

$$\begin{aligned} |{\mathcal {I}}| = |{\mathcal {C}}| = \left( {\begin{array}{c}m+n-1\\ m\end{array}}\right) = \frac{n^m}{m!} + O(n^{m-1}). \end{aligned}$$

It can be shown [9] that the multiplicity of the entry corresponding to a monomial representation \(\mathbf {c}\) is

$$\begin{aligned} \left( {\begin{array}{c}m\\ c_1 \; c_2 \; \cdots \; c_n\end{array}}\right) = \frac{m!}{c_1! \, c_2! \cdots c_n!}. \end{aligned}$$
(7)

Table 1 shows an example of index and monomial representations for \({\mathbb {S}}^{[3,2]}\), including the multiplicities of each element.

Table 1 Index and monomial representations for \({\mathbb {S}}^{[3,2]}\)

Without loss of generality, we exploit the one-to-one correspondence between index and monomial representations to change between them. For example, for \(m\,=\,3\) and \(n\,=\,2\), the index representation \(\mathbf {i}\,=\,(1,1,2)\) corresponds to the monomial representation \(\mathbf {c}\,=\,(2,1)\), since \(x_1 x_1 x_2 = x_1^2 x_2^1\), and conversely \(\mathbf {c}\,=\,(2,1)\) maps back to \(\mathbf {i}\,=\,(1,1,2)\).
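A small sketch of these constructions (illustration only): it enumerates the index representations, the corresponding monomial representations, and the multiplicities (7) for \({\mathbb {S}}^{[3,2]}\), reproducing the content of Table 1.

```matlab
% Sketch: enumerate index representations, monomial representations, and
% multiplicities for S^[m,n]; m = 3, n = 2 reproduces Table 1.
m = 3;  n = 2;
g = cell(1, m);
[g{:}] = ndgrid(1:n);                         % all n^m indices in R
R = zeros(n^m, m);
for j = 1:m,  R(:, j) = g{j}(:);  end
I = unique(sort(R, 2), 'rows');               % index representations (nondecreasing order)
for r = 1:size(I, 1)
    c = histc(I(r, :), 1:n);                  % monomial representation c
    mult = factorial(m) / prod(factorial(c)); % multiplicity per (7)
    fprintf('i = (%s)   c = (%s)   multiplicity = %d\n', ...
        strtrim(sprintf('%d ', I(r, :))), strtrim(sprintf('%d ', c)), mult);
end
```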

4.2 Two formulations

For a given \({\mathcal {A}}\in {\mathbb {S}}^{[m,n]}\) and \(p\), our goal is to find \(\varvec{\lambda }\) and \(\mathbf {X}\) such that (1) is satisfied in a minimization sense. We consider two optimization formulations. The first formulation is the standard least squares formulation, i.e.,

$$\begin{aligned} f_1(\varvec{\lambda }, \mathbf {X}) = \Bigl \Vert {\mathcal {A}}- \sum _{k=1}^{p} \lambda _k \, \mathbf {x}_k^{\circ m} \Bigr \Vert ^{2} = \sum _{\mathbf {i}\in {\mathcal {R}}} \Bigl ( \sum _{k=1}^{p} \lambda _k \, x_{i_1 k}\, x_{i_2 k} \cdots x_{i_m k} - a_{\mathbf {i}} \Bigr )^{2}. \end{aligned}$$
(8)

Observe that this counts each unique entry multiple times, according to its multiplicity. The second formulation counts each unique entry only once, i.e.,

$$\begin{aligned} f_2(\varvec{\lambda }, \mathbf {X}) = \sum _{\mathbf {i}\in {\mathcal {I}}} \Bigl ( \sum _{k=1}^{p} \lambda _k \, x_{i_1 k}\, x_{i_2 k} \cdots x_{i_m k} - a_{\mathbf {i}} \Bigr )^{2}. \end{aligned}$$
(9)

Either formulation can be expressed generically as

$$\begin{aligned} f(\varvec{\lambda }, \mathbf {X}) = \sum _{\mathbf {i}\in {\mathcal {I}}} \omega _{\mathbf {i}}\, \epsilon _{\mathbf {i}}^{2} \quad \text {with}\quad \epsilon _{\mathbf {i}} = \sum _{k=1}^{p} \lambda _k \, x_{i_1 k}\, x_{i_2 k} \cdots x_{i_m k} - a_{\mathbf {i}}. \end{aligned}$$

Choosing \(\omega _{\mathbf {i}}\) to be the multiplicity of entry \(\mathbf {i}\) from (7) yields \(f_1\), whereas \(\omega _{\mathbf {i}}\,=\,1\) yields \(f_2\). The value \(\epsilon _{\mathbf {i}}\) denotes the difference between the model and the tensor at entry \(\mathbf {i}\). Note that this formulation easily adapts to the case of missing data, i.e., missing data should have a weight of zero in the optimization formulation [2, 3].
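The following sketch evaluates both formulations over the distinct entries only; the data tensor A, the model parameters lambda and X, the sizes m, n, p, and the index representation matrix I (one representation per row, as in the sketch of Sect. 4.1) are assumed to be available.

```matlab
% Sketch: evaluate the weighted (f_1) and unweighted (f_2) objectives over the
% distinct entries. Assumes m, n, p, A, lambda, X, and the index representation
% matrix I (one representation per row) are already defined.
f1 = 0;  f2 = 0;
for r = 1:size(I, 1)
    i = I(r, :);
    w = factorial(m) / prod(factorial(histc(i, 1:n)));   % multiplicity of entry i, per (7)
    model = 0;
    for k = 1:p
        model = model + lambda(k) * prod(X(i, k));       % sum_k lambda_k * x_{i1,k} ... x_{im,k}
    end
    idx = num2cell(i);
    eps_r = model - A(idx{:});                           % residual at entry i
    f1 = f1 + w * eps_r^2;                               % weighted formulation (8)
    f2 = f2 +     eps_r^2;                               % unweighted formulation (9)
end
```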

4.3 Gradients

Using the generic formulation, the gradients are given by

$$\begin{aligned} \frac{\partial f}{\partial \lambda _k} = 2 \sum _{\mathbf {i}\in {\mathcal {I}}} \omega _{\mathbf {i}}\, \epsilon _{\mathbf {i}} \prod _{j=1}^{m} x_{i_j k} \quad \text {and}\quad \frac{\partial f}{\partial x_{\ell k}} = 2 \lambda _k \sum _{\mathbf {i}\in {\mathcal {I}}} \omega _{\mathbf {i}}\, \epsilon _{\mathbf {i}}\, c_\ell \, x_{\ell k}^{c_\ell - 1} \prod _{q \ne \ell } x_{q k}^{c_q}, \end{aligned}$$
(10)

where \(\mathbf {c}\,=\,(c_1,\dots ,c_n)\) is the monomial representation of the index \(\mathbf {i}\).

For \(f_1\), we mention an alternate gradient expression because it is more efficient to compute for larger values of \(n\) and \(m\). The derivation follows [1], and the gradients are given by

$$\begin{aligned} \frac{\partial f_1}{\partial \lambda _k} = -2\, {\mathcal {A}}\mathbf {x}_k^{m} + 2 \sum _{\ell =1}^{p} \lambda _\ell \, (\mathbf {x}_k^{T}\mathbf {x}_\ell )^{m} \quad \text {and}\quad \frac{\partial f_1}{\partial \mathbf {x}_k} = -2 m \lambda _k \, {\mathcal {A}}\mathbf {x}_k^{m-1} + 2 m \lambda _k \sum _{\ell =1}^{p} \lambda _\ell \, (\mathbf {x}_k^{T}\mathbf {x}_\ell )^{m-1}\, \mathbf {x}_\ell , \end{aligned}$$
(11)

where \({\mathcal {A}}\mathbf {x}^{m} \,=\, \mathbf {x}^{T} ({\mathcal {A}}\mathbf {x}^{m-1})\) is a scalar.

This formulation does not easily accommodate missing data since the weighting of the entries is implicit.
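For concreteness, the following sketch evaluates the expanded-form gradient of \(f_1\) for \(m\,=\,3\), touching the full tensor only through the products \({\mathcal {A}}\mathbf {x}^{2}\) and \({\mathcal {A}}\mathbf {x}^{3}\), which is the source of the speedup for larger \(n\). It assumes the data tensor A and the model parameters lambda (a \(p\)-vector) and X (an \(n \times p\) matrix) are already in the workspace, and it follows the expressions as reconstructed in (11), so it is an illustration rather than the paper's implementation.

```matlab
% Sketch of the expanded-form gradient of f_1 for m = 3 (illustration only).
n = size(X, 1);  p = size(X, 2);
G = X' * X;                                   % Gram matrix of inner products x_k' * x_l
gradLambda = zeros(p, 1);
gradX = zeros(n, p);
for k = 1:p
    xk  = X(:, k);
    Ax2 = reshape(A, n, n*n) * kron(xk, xk);  % vector A x_k^{m-1}
    Ax3 = xk' * Ax2;                          % scalar A x_k^{m}
    gradLambda(k) = -2 * Ax3 + 2 * (G(k, :).^3) * lambda;
    gradX(:, k)   = -6 * lambda(k) * Ax2 + 6 * lambda(k) * X * (lambda .* (G(:, k).^2));
end
```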

4.4 Scaling ambiguity

Observe that either objective function suffers from scaling ambiguity. Suppose we have two equivalent models defined by \((\varvec{\lambda }, \mathbf {X})\) and \((\tilde{\varvec{\lambda }}, \tilde{\mathbf {X}})\), related by a positive scaling vector \(\mathbf {s}\in {\mathbb {R}}^{p}\) with \(s_k > 0\) such that

$$\begin{aligned} \tilde{\lambda }_k = s_k^{m}\, \lambda _k \quad \text {and}\quad \tilde{\mathbf {x}}_k = \mathbf {x}_k / s_k \quad \text {for}\quad k = 1,\dots ,p, \end{aligned}$$

so that \(\tilde{\lambda }_k\, \tilde{\mathbf {x}}_k^{\circ m} = \lambda _k\, \mathbf {x}_k^{\circ m}\) for every \(k\). To avoid this ambiguity, it is convenient to require \(\Vert \mathbf {x}_k\Vert \,=\,1\) for all \(k\). We could enforce this condition as an equality constraint, but instead we treat it as an exact penalty, i.e.,

(12)

It is straightforward to derive the gradient of this penalty term.

In the experimental results, we see that choosing \(\gamma \,=\,0.1\) appears to be adequate for enforcing the normalization.

4.5 Sparse component weights

So far we have assumed that \(p\) is known, but this is not always the case. One technique to get around this problem is to guess a large value for \(p\) and then add a sparsity penalty on \(\varvec{\lambda }\), the weight vector. Specifically, we can use an approximate \(\ell _1\) penalty of the form suggested by [42]:

$$\begin{aligned} p_{\alpha ,\beta }(\varvec{\lambda }) = \frac{\beta }{\alpha } \sum _{k=1}^{p} \Bigl [ \log \bigl (1+\exp (-\alpha \lambda _k)\bigr ) + \log \bigl (1+\exp (\alpha \lambda _k)\bigr ) \Bigr ]. \end{aligned}$$

In this case, the gradient is

$$\begin{aligned} \frac{\partial p_{\alpha ,\beta } }{\partial \lambda _k} = {\beta } \left[ (1+\exp (-\alpha \lambda _k))^{-1} - (1 + \exp (\alpha \lambda _k))^{-1} \right] . \end{aligned}$$

Note that the \(\beta \) term is not part of the approximation but rather the weight of the penalization. In our experiments, the results are insensitive to the precise choices of \(\alpha \) and \(\beta \).
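As a quick numerical sanity check of the smoothed penalty written above (using the values \(\alpha \,=\,10\) and \(\beta \,=\,0.1\) from the experiments; the penalty form itself is the reconstruction given earlier, so this is an illustration only):

```matlab
% Sketch: the smoothed penalty closely tracks beta*|lambda| away from zero.
alpha = 10;  beta = 0.1;
lam = linspace(-2, 2, 401);
p_approx = (beta / alpha) * (log(1 + exp(-alpha * lam)) + log(1 + exp(alpha * lam)));
max_gap = max(abs(p_approx - beta * abs(lam)))   % worst-case gap, attained at lambda = 0
```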

4.6 Putting it all together

The final function to be optimized combines the generic objective of Sect. 4.2, the column normalization penalty (12), and the sparsity penalty \(p_{\alpha ,\beta }\) of Sect. 4.5.

The choice of \(\omega _{\mathbf {i}}\) determines the choice of objective function. We can also set \(\omega _{\mathbf {i}}\,=\,0\) for any missing values. The choice of \(\gamma \) determines the weight of the penalty on the norms of the columns of \(\mathbf {X}\). Since this constraint is easy to satisfy and mostly a convenience, the exact choice of \(\gamma \) is not critical. We later show experiments with \(\gamma \,=\,0\) and \(\gamma \,=\, 0.1\), to contrast the difference between no penalty and a small penalty. (Increasing \(\gamma \) beyond \(0.1\) did not have any impact on the experiments.) The parameter \(\alpha \) determines the “steepness” of the approximate \(\ell _1\) penalty function, and the choice of \(\beta \) determines the weight of the sparsity-encouraging penalty. In [42], they start with a small value of \(\alpha \) and gradually increase it. In our experiments, we use the fixed value \(\alpha \,=\,10\). The \(\beta \) term is the weight given to the penalty, which is usually determined heuristically; we use \(\beta \,=\,0.1\) in our experiments.

4.7 Ignoring symmetry

Another approach to symmetric decomposition is to ignore the symmetry altogether and use a standard CP tensor decomposition method such as ALS [19, 31]; surprisingly, there are situations under which this non-symmetric method yields a symmetric solution.

Under mild conditions, the CP decomposition (4) is unique up to permutation and scaling of the components, i.e., essentially unique. Sidiropoulos and Bro [43, Theorem 3] give a general a posteriori result on the essential uniqueness of CP decompositions. If we specialize this result to the symmetric case by assuming \(\mathbf {X}^{(j)} \,=\, \mathbf {X}\) for \(j\,=\,1,\dots ,m\), the result says that a sufficient condition for the uniqueness of (4) is

$$\begin{aligned} m \cdot \text {krank}(\mathbf {X}) \ge 2p + (m-1). \end{aligned}$$
(13)

Here, the k-rank of the matrix \(\mathbf {X}\), denoted \(\text {krank}(\mathbf {X})\), is the largest number \(k\) such that every subset of \(k\) columns of \(\mathbf {X}\) is linearly independent. Table 2 shows the sufficient k-ranks for various values of \(m\) and \(p\). For instance, if \(m\,=\,3\) and \(p\,=\,25\), then \(\text {krank}(\mathbf {X}) \ge 18\) is sufficient for uniqueness. The table does not directly depend on \(n\); however, recall that \(\mathbf {X}\) is an \(n \times p\) matrix, so \(\text {krank}(\mathbf {X}) \le \min (n, p)\).
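The minimal sufficient k-rank follows directly from (13); for instance, the following one-liner reproduces the \(m\,=\,3\), \(p\,=\,25\) example.

```matlab
% Smallest integer k with m*k >= 2p + m - 1, per (13).
m = 3;  p = 25;
kmin = ceil((2*p + m - 1) / m)   % = 18 for this example
```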

Table 2 Minimal sufficient k-rank of \(\mathbf {X}\) for uniqueness of the symmetric outer product factorization

The importance of essential uniqueness is that the global solution of the unconstrained problem (4) is the same as for the symmetric problem (1) so long as \(\mathbf {X}\) satisfies (13). If we normalize the factors in (4) and, without loss of generality, ignore the permutation ambiguity, then uniqueness implies, for \(k\,=\,1,\dots ,p\),

$$\begin{aligned} \mathbf {x}^{(j)}_k = \pm \, \mathbf {x}^{(1)}_k \quad \text {for}\quad j = 2,\dots ,m. \end{aligned}$$

A bit of care must be taken to convert from a solution that ignores symmetry since it could be the case, e.g., that \(\mathbf {x}^{(2)}_k \,=\, -\mathbf {x}^{(1)}_k\). Algorithm 1 gives a simple procedure to “symmetrize” a solution so that the signs align. It also averages the final sign-aligned factor matrices in case they are not exactly equal.

The benefit of ignoring symmetry is that we can use existing software for the CP decomposition. The disadvantage is that it requires \(m\) times as much storage, i.e., it must store the matrices \(\mathbf {X}^{(1)}\) through \(\mathbf {X}^{(m)}\) rather than just \(\mathbf {X}\). Moreover, there is no guarantee that the optimization algorithm will find the global minimum.

Algorithm 1 Symmetrizing a solution computed without enforcing symmetry
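A minimal sketch of the same idea (not the paper's Algorithm 1 verbatim) for \(m\,=\,3\): it assumes a computed CP solution with unit-norm factor columns X1, X2, X3 and weights lambda, aligns the signs of the later factors against the first one, folds the sign flips into the weights, and averages.

```matlab
% Sketch of a symmetrization step for m = 3 (assumes unit-norm factor columns).
[n, p] = size(X1);
Xs = zeros(n, p);
for k = 1:p
    s2 = sign(X1(:, k)' * X2(:, k));              % sign that aligns X2(:,k) with X1(:,k)
    s3 = sign(X1(:, k)' * X3(:, k));              % sign that aligns X3(:,k) with X1(:,k)
    lambda(k) = lambda(k) * s2 * s3;              % absorb the sign flips into the weight
    Xs(:, k) = (X1(:, k) + s2 * X2(:, k) + s3 * X3(:, k)) / 3;   % average aligned factors
end
```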

5 Optimization formulation for nonnegative symmetric factorization

The notion of completely positive tensors was introduced by Qi et al. [39]. It is a natural extension of the notion of completely positive matrices. A nonnegative tensor \({\mathcal {A}}\) is called completely positive if it has a decomposition of the form in (2).

The formulation is analogous to the unconstrained case, except that there is no \(\varvec{\lambda }\) (or equivalently, we constrain \(\lambda _k\,=\,1\) for all \(k\)) and we add nonnegativity constraints. For given \({\mathcal {A}}\) and \(p\), our goal is to find \(\mathbf {X}\ge 0\) such that (2) is satisfied. We again assume \(p\) is known. The mathematical program is given by

$$\begin{aligned} \min _{\mathbf {X}\in {\mathbb {R}}^{n \times p}} \; \sum _{\mathbf {i}\in {\mathcal {I}}} \omega _{\mathbf {i}}\, \epsilon _{\mathbf {i}}^{2} \quad \text {subject to}\quad \mathbf {X}\ge 0, \quad \text {where}\quad \epsilon _{\mathbf {i}} = \sum _{k=1}^{p} x_{i_1 k}\, x_{i_2 k} \cdots x_{i_m k} - a_{\mathbf {i}}. \end{aligned}$$

Choosing \(\omega _{\mathbf {i}}\) to be the multiplicity of entry \(\mathbf {i}\) yields the analogue of \(f_1\), whereas \(\omega _{\mathbf {i}}\,=\,1\) yields the analogue of \(f_2\). The value \(\epsilon _{\mathbf {i}}\) is the difference between the model and the tensor at entry \(\mathbf {i}\).

Using the generic formulation and following (10) without \(\lambda _k\), the gradients are given by

$$\begin{aligned} \frac{\partial f}{\partial x_{\ell k}} = 2 \sum _{\mathbf {i}\in {\mathcal {I}}} \omega _{\mathbf {i}}\, \epsilon _{\mathbf {i}}\, c_\ell \, x_{\ell k}^{c_\ell - 1} \prod _{q \ne \ell } x_{q k}^{c_q}, \end{aligned}$$

where \(\mathbf {c}\) is the monomial representation of \(\mathbf {i}\).

Our formulation finds the best nonnegative factorization. Fan and Zhou [20] consider the problem of verifying that a tensor is completely positive.

6 Numerical results

For our numerical results, we assume the tensor has underlying low-rank structure, as is typical in comparisons of numerical methods for tensor factorization (see, e.g., [44]). Hence, we assume there is some underlying \(\varvec{\lambda }^*\) and \(\mathbf {X}^*\) to be recovered, where \(p\) is lower than the typical rank. The noise-free data tensor is given by

$$\begin{aligned} {\mathcal {A}}^{*} = \sum _{k=1}^{p} \lambda _k^{*} \, (\mathbf {x}_k^{*})^{\circ m}. \end{aligned}$$
(14)

The data tensor may also be contaminated by noise as controlled by the parameter \(\eta \ge 0\), i.e.,

$$\begin{aligned} {\mathcal {A}}= {\mathcal {A}}^{*} + \eta \, \frac{\Vert {\mathcal {A}}^{*}\Vert }{\Vert {\mathcal {E}}\Vert }\, {\mathcal {E}}. \end{aligned}$$
(15)

Here \({\mathcal {E}}\) is a noise tensor such that each element is drawn from a normal distribution, i.e., \(e_{\mathbf {i}} \in \mathcal {N}(0,1)\). The parameters \(m\), \(n\), \(p\) control the size of the problem. If the vectors in \(\mathbf {X}^*\) are collinear, then the problem is generally more difficult [28, 44].

For the \(f_1\) objective function in (8), we calculate the gradients as specified in (11). For small problems this may not be as fast as (10), but for larger problems it makes a significant difference in speed, as shown in the results. For \(f_2\), we precompute the index set \({\mathcal {I}}\) as well as the corresponding monomial representations \({\mathcal {C}}\) and multiplicities. This means that these values need not be computed each time the objective function and gradient are evaluated. The time for this preprocessing is included in the reported runtimes.

All tests were conducted on a laptop with an Intel Dual Core i7-3667U CPU and 8 GB of RAM, using MATLAB R2013a. For the optimization, unless otherwise noted, all tests are based on SNOPT, Version 7.2–9 [21, 22], using the MATLAB MEX interface. SNOPT default parameters were used except for the following: Major iteration limit \(=\) 10,000, New superbasics limit/Superbasics limit \(=\) 999, Major optimality tolerance \(=\) 1e-8. All tensor computations use the Tensor Toolbox for MATLAB, Version 2.5 [6–8], as well as additional codes for symmetric tensors (e.g., to calculate the index sets) that will be included in the next release.

6.1 Numerical results on a collection of test problems

We consider the impact of the problem formulation resulting from the choice of objective function and column normalization penalty. The objective function can be weighted, based on the standard least squares formulation denoted by \(f_1\) in (8), or unweighted, counting each unique entry only once, denoted by \(f_2\) in (9). The column normalization penalty is either \(\gamma \,=\, 0\) (no penalty) or \(\gamma \,=\,0.1\). Higher values of \(\gamma \) did not change the results.

We test the choices for several test problems as follows. We consider four sizes:

  • \(m\,=\,3,n\,=\,4,p\,=\,2\);

  • \(m\,=\,4,n\,=\,3,p\,=\,5\);

  • \(m\,=\,4,n\,=\,25,p\,=\,3\); and

  • \(m\,=\,6,n\,=\,6,p\,=\,4\).

In the first case, since \(m\) is odd, we have the option to exclude \(\varvec{\lambda }\) from the optimization, but we include it here for consistency in this set of experiments. For each size, we also consider three noise levels: \(\eta \in \{ 0,\ 0.01,\ 0.1 \}\).

A random instance is created as follows. We generate a true solution defined by \(\varvec{\lambda }^*\) and \(\mathbf {X}^*\). The weight vector \(\varvec{\lambda }^*\) has entries selected uniformly at random from \(\{-1,+1\}\).

The factor matrix \(\mathbf {X}^*\) is computed by first generating a matrix with random values from the normal distribution, i.e.,

$$\begin{aligned} \hat{\mathbf{X}}^* \in {\mathbb {R}}^{n \times p} \quad \text {such that}\quad \hat{x}_{ik}^* \in \mathcal {N}(0,1), \end{aligned}$$

and then normalizing each column to length one, i.e., \(\mathbf {x}_k^* \,=\, \hat{\mathbf {x}}_k^* / \Vert \hat{\mathbf {x}}_k^*\Vert \) for \(k\,=\,1,\dots ,p\).

Finally, given \(\varvec{\lambda }^*\) and \(\mathbf {X}^*\), we can compute the tensor from (14) and add noise at the level specified by \(\eta \) per (15). For each problem size and noise level, we generate ten instances.
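A sketch of this test-problem generation for \(m\,=\,3\) (illustration only; the noise scaling follows the form of (15) as reconstructed above, and whether the noise tensor is symmetrized is not specified here).

```matlab
% Sketch: generate a random test instance per (14)-(15) for m = 3.
m = 3;  n = 4;  p = 2;  eta = 0.01;
lambdaTrue = sign(randn(p, 1));                            % weights uniformly from {-1,+1}
Xtrue = randn(n, p);
Xtrue = bsxfun(@rdivide, Xtrue, sqrt(sum(Xtrue.^2, 1)));   % unit-norm columns
Atrue = zeros(n, n, n);
for k = 1:p
    xk = Xtrue(:, k);
    Atrue = Atrue + lambdaTrue(k) * reshape(kron(xk, kron(xk, xk)), [n n n]);
end
E = randn(n, n, n);                                        % noise tensor with N(0,1) entries
A = Atrue + eta * (norm(Atrue(:)) / norm(E(:))) * E;       % assumed scaling: ||A - Atrue|| / ||Atrue|| = eta
```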

For each problem size, we generate five random starting points by choosing the entries of \(\mathbf {X}\) from a Gaussian distribution (no column normalization) and the entries of \(\varvec{\lambda }\) uniformly at random. The same five starting points are used for all problems of that size.

For each problem formulation corresponding to a choice for objective function and for normalization penalty, we do fifty runs, i.e., ten instances with five random starts each. The same instances and starting points are used across all formulations. The output of each run is a weight vector \(\varvec{\lambda }\) and a matrix \(\mathbf {X}\). Table 3a compares the relative error, which measures the proportion of the observed data that is explained by the model, i.e.,

$$\begin{aligned} \text {relative error} = \frac{ \bigl \Vert {\mathcal {A}}- \sum _{k=1}^{p} \lambda _k \, \mathbf {x}_k^{\circ m} \bigr \Vert }{ \Vert {\mathcal {A}}\Vert }. \end{aligned}$$

In the case of no noise, the ideal relative error is zero; otherwise, we hope for something near the noise level, i.e., \(\eta \). In our experiments, we say a run or instance is successful if the relative error is \(\le \)0.1. For each formulation, three values are reported. The first value is the number of successful runs. Since we are using five starting points per instance, the second value is the number of instances such that at least one starting point is successful. Finally, the last value is the median relative error across all fifty runs. Summary totals are provided in the last line for the 600 runs and 150 instances. Clearly, \(\gamma \,=\,0.1\) is superior to \(\gamma \,=\,0\) in terms of number of successful runs and instances. The comparison of unweighted (\(f_2\)) and weighted (\(f_1\)) is less clear cut—the unweighted formulation is successful for many more runs overall, but the weighted formulation is successful for more instances overall.

Table 3 Results of different formulations for a set of test problems

Table 3b compares the solution scores, which measure how accurately the computed \(\varvec{\lambda }\) and \(\mathbf {X}\) match the true \(\varvec{\lambda }^*\) and \(\mathbf {X}^*\). Without loss of generality, we assume both \(\mathbf {X}\) and \(\mathbf {X}^*\) have normalized columns. (If not, we rescale the columns and adjust the weights accordingly.) There is a permutation ambiguity, but we permute the computed solution so as to maximize the solution score, a value between 0 and 1 that compares the computed weights and vectors to the true ones component by component.

A solution score of 1 indicates a perfect match, and we say a run or instance is successful if its solution score is \(\ge \)0.9. As with the relative error, we report three values. The first value is the number of runs out of fifty that are successful, the second value is the number of instances out of ten that are successful (i.e., at least one starting point was successful), and the third value is the median solution score. We also report totals for each formulation across the 600 runs and 150 instances. Consistent with Table 3a, using \(\gamma \,=\,0.1\) is more successful than \(\gamma \,=\,0\). The unweighted formulation is once again successful for more runs, but the two methods are nearly tied in terms of number of instances.

Observe in Table 3b that the second size (\(m\,=\,4,n\,=\,3,p\,=\,5\)) has very low solution scores despite having good performance in terms of relative error. This is because the solution may not be unique, i.e., the k-rank of \(\mathbf {X}^*\) is no more than three, but the minimum k-rank that is sufficient for uniqueness is four per Table 2. If the solution is not unique, then multiple solutions exist and there is no reason to expect that the particular solution we find will be the true one. For example, for one instance of size \(m\,=\,4,n\,=\,3,p\,=\,5\) with \(\eta \,=\,0\), we computed an alternate model with relative error \(<10^{-6}\): its last two columns generally agree with the true factors, but the first three do not, and the solution score is only 0.65. It may be interesting to note that in the matrix case (\(m\,=\,2\)), we would never compare the computed factors to the true ones without imposing additional constraints such as orthogonality.

Table 3c compares the total runtimes for each method. As with any nonconvex optimization problem, there is significant variation from run to run, but we can gain a sense of the general expense for each method. As a reminder, we computed the gradient in the weighted case as shown in (11). If we compute it instead using (10), the runtimes for the weighted and unweighted methods are roughly the same. For size \(m\,=\,4,n\,=\,25,p\,=\,3\), the computation in (11) yields a 5-15X speed improvement because \(n\) is large; otherwise for smaller \(n\), the computation in (10) will generally be faster.

Finally, we briefly consider the impact of \(\gamma \) on the constraint violation from (12), i.e., the extent to which the columns of the computed \(\mathbf {X}\) deviate from unit norm. In Table 4, we report the number of runs where the constraint violation is \(\le \)0.01 and the mean value. Recall that the penalty is mainly a convenience, but it does improve the formulation by eliminating a manifold of equivalent solutions.

Table 4 Number of runs with constraint violation \(\le \) 0.01, and the mean constraint violation

Table 5 shows results for more difficult test problems where \(\mathbf {X}^*\) has collinear normalized columns, i.e., the inner products \(\mathbf {x}_k^{*T} \mathbf {x}_\ell ^*\) are equal to a prescribed constant for all \(k \ne \ell \) with \(k,\ell \in \{1,\dots ,p\}\). The procedure for generating the collinear columns is described by Tomasi and Bro [44]. The setup is the same as in the previous subsection except for the change in how we generate \(\mathbf {X}^*\) and the omission of size \(m\,=\,4,n\,=\,3,p\,=\,5\) (since the procedure we are using does not allow \(p>n\)). The results in Table 5 are analogous to those in Table 3. We omit the runtimes since they are similar. Although fewer runs are successful, the number of instances solved is similar.

Table 5 Results of different formulations for “collinear” test problems

From these results, we have a sense that the symmetric factorization problem can be solved using standard optimization techniques. Because the problems are nonconvex, multiple starting points are needed to improve the odds of finding a global minimizer. Our results also indicate that it is helpful to add a penalty to remove the scaling ambiguity; otherwise, with no penalty, the Jacobian at the solution is singular which seems to have a negative impact on the solution quality.

6.2 Ignoring symmetry

As noted previously, Carroll and Chang [14] ignored symmetry with the idea that it may not be required. Ideally, the solution that is computed by a standard method, like CP-ALS [19, 31] or CP-OPT [1], will be symmetric up to scaling.

Using the same problems from Table 3, we apply CP-ALS (as implemented in the Tensor Toolbox), followed by Algorithm 1 to symmetrize the solution. Three of the four sizes generically satisfy the sufficient uniqueness condition in (13).

  • For \(m\,=\,3\) and \(p\,=\,2\), we require \(\text {krank}(\mathbf {X}) \ge 2\). Since \(\mathbf {X}\) is an \(n \times p\) matrix with \(n\,=\,4\) whose columns are randomly generated, \(\text {krank}(\mathbf {X})\,=\,2\) with probability 1.

  • For \(m\,=\,4\) and \(p\,=\,5\), we require \(\text {krank}(\mathbf {X}) \ge 4\). Since \(\mathbf {X}\) is an \(n \times p\) matrix with \(n\,=\,3\), it cannot satisfy the condition because \(\text {krank}(\mathbf {X}) \le \min (n,p)\,=\,3\). Hence, the solutions may not be unique, and an example of a non-unique solution is provided in the previous subsection.

  • For \(m\,=\,4\) and \(p\,=\,3\), we require \(\text {krank}(\mathbf {X}) \ge 3\). Since \(\mathbf {X}\) is an \(n \times p\) matrix with \(n\,=\,25\) whose columns are randomly generated, \(\text {krank}(\mathbf {X})\,=\,3\) with probability 1.

  • For \(m\,=\,6\) and \(p\,=\,4\), we require \(\text {krank}(\mathbf {X}) \ge 3\). Since \(\mathbf {X}\) is an \(n \times p\) matrix with \(n\,=\,6\) whose columns are randomly generated, \(\text {krank}(\mathbf {X})\,=\,4\) with probability 1.

Table 6 shows the results, which are analogous to those in Table 3. CP-ALS with symmetrization is highly competitive. In terms of the relative error, its total of 442 successful runs is near the high of 472 for the symmetric optimization methods; likewise, it has 116 successful instances versus 118 for symmetric optimization. It is less impressive in terms of the solution score, though this is mainly a problem for the size \(m\,=\,4,n\,=\,3,p\,=\,5\), as expected due to the lack of uniqueness. The major advantage of CP-ALS is runtime, where it is typically ten times faster or more. Despite the fact that CP-ALS may not find a symmetric solution, using a standard CP solution procedure followed by symmetrization is indeed an effective approach in many situations.

Table 6 Results of CP-ALS plus symmetrization on test problems from Table 3

6.3 Sparsity penalty for rank determination

In Example 5.5(i) of [35], Nie considers a method for determining the rank of a tensor. The example tensor is of order \(m\,=\,4\); we refer the reader to [35] for its definition.

Using our optimization approach with \(f_1\) and \(\gamma \,=\,0.1\), we impose the approximate \(\ell _1\) penalty of the form suggested by [42], using \(\alpha \,=\,10\) and \(\beta \,=\,0.1\); the computed solution has only two components with nonnegligible weight.

We calculate the similarity score as described previously, selecting the two components that yield the best match for a score of 0.999865. The calculation takes approximately 2 sec. Using \(\alpha \,=\,1000\) causes numerical blow-up, but \(\alpha \,=\,100\) or \(\alpha \,=\,1\) work nearly as well as \(\alpha \,=\,10\), i.e., the solution score is 0.9998 (with \(\beta \,=\,0.1\)). Likewise, varying \(\beta \) has little impact on the solution quality (with \(\alpha \,=\,10\)).

Using the same penalty parameters (\(\alpha \,=\,10\) and \(\beta \,=\,0.1\)), we construct ten instances of problems of size \(m\,=\,4\), \(n\,=\,3\), and \(p\,=\,2\) for each noise level \(\eta \in \{0, 0.01, 0.1\}\). We fit a model with three components but once again apply the sparsity penalty, using the same parameters as above. We use five random starts per instance. The results are shown in Table 7. The second column shows the number of instances (out of 10) where the solution score was \(\ge 0.9\), and the third column is the total number of runs (out of 50) for which this condition was satisfied. The fourth column shows the median relative error, and the last column shows the mean and standard deviation of the runtime. In the noise-free case, the correct solution is found in every run. For \(\eta \,=\, 0.01\), the correct solution is obtained for 9 out of 10 instances. For \(\eta \,=\,0.1\), the problem is only solved to the desired accuracy in 4 out of 10 instances.

Table 7 Impact of the sparsity penalty for problems of size \(m\,=\,4\), \(n\,=\,3\), and \(p\,=\,2\), solved using a model with \(p\,=\,3\) components

Alas, the penalty approach is a heuristic; forthcoming work [5] will use statistical validation to select the rank.

6.4 Nonnegative factorization

Lastly, we consider the problem of nonnegative factorization. We use the same problem setup as in Sect. 6.1 with the exception that we set all entries of \(\varvec{\lambda }^*\) equal to one and choose the entries of \(\mathbf {X}^*\) to be uniform on \([0,1]\), i.e., \(x_{ij}^* \in \mathcal {U}[0,1]\). The optimization formulation excludes \(\varvec{\lambda }\), so there is no penalty on the column norms of \(\mathbf {X}\) (\(\gamma \,=\, 0\)). We add bound constraints so that all entries of \(\mathbf {X}\) are nonnegative. We compare only the weighted and unweighted formulations. Table 8 shows the results, which are analogous to Table 3. There is little difference between the two formulations, except in the runtimes, as discussed previously.

Table 8 Results of nonnegative optimization on test problems

7 Conclusions and future challenges

We consider straightforward optimization formulations for real-valued symmetric and nonnegative symmetric tensor decompositions. These methods can be used as baselines for comparison as new methods are developed. In particular, these methods should be useful for larger problems with inherent low-rank structure. For instance, the size \(m\,=\,4\) and \(n\,=\,25\) is larger in terms of dimension than most other symmetric tensor decomposition problems in the literature, though other works consider larger values of \(p\) [35]. Furthermore, we consider noise-contaminated problems, which may be problematic for algebraic methods.

Although the symmetric and nonnegative symmetric tensor decomposition problems are nonconvex, these numerical optimization approaches are effective at recovering the known solution in our experiments, especially when we use multiple random starting points. These optimization formulations can be adapted to the case of partial symmetries. Moreover, we show that if the solution is essentially unique (and the optimization method finds a global minimum), then symmetry need not be directly enforced by the optimization method. In this case, efficient tools for the nonsymmetric CP problem may be employed directly.

We expect many further improvements, including different optimization formulations that exploit structure and consideration of other optimization methods.