1 Introduction

This paper is concerned with low-rank completion for tensors in the sense of multi-dimensional arrays. To be more specific, we aim to solve the tensor completion problem

$$ \min_{\mathbf {X}}\ f(\mathbf {X}) := \frac{1}{2} \|\mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}\|^2 \quad \mathrm{subject\ to}\quad \mathbf {X}\in \mathcal{M}_{\mathbf{r}} := \bigl\{ \mathbf {X}\in \mathbb {R}^{n_1 \times \cdots \times n_d} : \operatorname{rank}(\mathbf {X}) = \mathbf{r}\bigr\}. $$
(1.1)

Here, \(\operatorname{rank}(\mathbf {X})\) denotes the multilinear rank [15] of the tensor X, a tuple of d integers defined via the ranks of the matricizations of X (see Sect. 2.1 for details), and \(\mathrm {P}_{\varOmega}: \mathbb{R}^{n_{1} \times \cdots \times n_{d}} \to \mathbb{R}^{n_{1} \times \cdots \times n_{d}}\) is a linear operator. A typical choice for \(\mathrm {P}_{\varOmega}\), frequently encountered in applications, is

$$\mathrm {P}_\varOmega \mathbf {X}:= \begin{cases} \mathbf {X}_{i_1 i_2 \ldots i_d} &\mathrm{if}\ (i_1, i_2,\ldots, i_d) \in \varOmega,\\ 0 & \mathrm{otherwise}, \end{cases} $$

where \(\varOmega \subset [1,n_1]\times\cdots\times[1,n_d]\) denotes the so-called sampling set. In this case, the objective function \(\|\mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}\|^2/2\) measures the ability of X to match the entries of the partially known tensor A in Ω.

The tensor completion problem (1.1) and variants thereof have been discussed a number of times in the literature. Most of this work builds upon results for the special case d=2, also known as matrix completion; see [18] for a comprehensive overview. One of the first approaches to tensor completion has been discussed by Liu et al. [17]. It is based on extending the notion of nuclear norm to tensors by defining \(\|\mathbf {X}\|_*\) as the (weighted) sum of the nuclear norms of the matricizations of X. This leads to the convex optimization problem

$$\begin{aligned} \min_{\mathbf {X}}\ \|\mathbf {X}\|_* \quad \mathrm{subject\ to}\quad \mathrm {P}_\varOmega \mathbf {X}= \mathrm {P}_\varOmega \mathbf {A}, \end{aligned}$$
(1.2)

which can be addressed by established techniques such as block coordinate descent. Allowing noise in the sampling data and thus somewhat closer to our formulation (1.1) of tensor completion, Signoretto et al. [25] and Gandy et al. [11] consider the unconstrained optimization problem

$$\begin{aligned} \min_{\mathbf {X}}\ \|\mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}\|^2 + \mu \|\mathbf {X}\|_* \end{aligned}$$

and propose the use of ADMM (alternating direction method of multipliers) and other splitting methods. This approach has been shown to yield good recovery results when applied to tensors from various fields, such as medical imaging, hyperspectral imaging and seismic data. However, nuclear norm minimization approaches are usually quite costly and involve singular value decompositions of potentially very large matrices. Liu and Shang [16] recently proposed the use of the economy-sized QR decomposition, reducing the cost per iteration step considerably.

Besides the two approaches described above, a number of variations [20] and alternatives have been discussed in the literature. For example, [17] proposes a block coordinate descent method, while [23] proposes an iterative hard thresholding method for fitting the factors of a Tucker decomposition. In [3], gradient optimization techniques have been proposed for fitting the factors of a CP decomposition. Closely related to the approach considered in this paper, Da Silva and Herrmann [8] have recently proposed to perform tensor completion in the hierarchical Tucker format via Riemannian optimization.

The approach proposed in this paper is based on the observation that the set of tensors of fixed multilinear rank r, denoted by \(\mathcal{M}_{\mathbf{r}}\), forms a smooth manifold [14, 28]. This manifold structure of low-rank tensors has recently been exploited in a number of works targeting applications in numerical analysis and computational physics; see [12] for an overview. We will make use of the manifold structure by viewing (1.1) as an unconstrained optimization problem on \(\mathcal{M}_{\mathbf{r}}\). This view allows for the use of Riemannian optimization techniques [2]. A similar approach has been considered in [19, 21, 30] for the matrix case, where it was shown to be competitive with other state-of-the-art approaches to matrix completion. Note that it is not entirely trivial to extend such Riemannian optimization techniques from the matrix to the tensor case, due to the lack of a simple characterization of the metric projection onto \(\mathcal{M}_{\mathbf{r}}\) [15].

The rest of this paper is organized as follows. In Sect. 2, we recall differential geometric properties of tensors having fixed multilinear rank and propose a suitable retraction map for \(\mathcal{M}_{\mathbf{r}}\). Section 3 proposes the use of the nonlinear CG algorithm on \(\mathcal{M}_{\mathbf{r}}\) for solving (1.1), for which several algorithmic details as well as convergence properties are discussed. Finally, in Sect. 4, we investigate the effectiveness of our algorithm for various test cases, including synthetic data, hyperspectral images, and function-related tensors.

2 Differential geometry for low-rank tensor manifolds

To adapt the nonlinear CG algorithm on manifolds to the tensor completion problem (1.1), we derive a number of basic tools from differential geometry for tensors of low multilinear rank.

2.1 Preliminaries on tensors

Throughout this paper, we will follow the notation in the survey paper by Kolda and Bader [15]. In the following, we give a brief summary.

The ith mode matricization

$$X_{(i)} \in \mathbb {R}^{n_i \times \prod_{j \neq i} n_j } $$

of a tensor \(\mathbf {X}\in \mathbb {R}^{n_{1} \times \cdots \times n_{d}}\) is a rearrangement of the entries of X into the matrix \(X_{(i)}\), such that the ith mode becomes the row index and all other (d−1) modes become column indices, in lexicographical order. Similarly, the vectorization \(\mathrm {vec}(\mathbf {X}) \in \mathbb {R}^{\prod_{i=1}^{d} n_{i} }\) stacks all entries of X into one long vector. The ranks of all the matricizations yield the multilinear rank tuple r of X:

$$\operatorname{rank}(\mathbf {X}) = \bigl( \operatorname{rank}(X_{(1)}), \operatorname{rank}(X_{(2)}),\ldots,\operatorname{rank}(X_{(d)}) \bigr). $$
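
As an illustration, the following minimal NumPy sketch (not the authors' MATLAB implementation) realizes matricization, vectorization, and the multilinear rank tuple. The helper names `unfold` and `fold` are our own, and the column ordering of `unfold` differs from the lexicographical convention above, which is immaterial for ranks and mode products as long as it is used consistently.

```python
import numpy as np

def unfold(X, i):
    """Mode-i matricization: mode i becomes the row index."""
    return np.moveaxis(X, i, 0).reshape(X.shape[i], -1)

def fold(M, i, shape):
    """Inverse of unfold(..., i) for a tensor of the given full shape."""
    rest = [s for j, s in enumerate(shape) if j != i]
    return np.moveaxis(M.reshape([shape[i]] + rest), 0, i)

def multilinear_rank(X, tol=None):
    return tuple(np.linalg.matrix_rank(unfold(X, i), tol=tol) for i in range(X.ndim))

X = np.random.rand(4, 5, 6)
vec_X = X.reshape(-1)                # vectorization vec(X)
print(multilinear_rank(X))           # (4, 5, 6) for a generic random tensor
```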

The ith mode product of X with a matrix \(M \in \mathbb {R}^{m \times n_{i}}\) is defined as

$$\mathbf {Y}= \mathbf {X}\times_i M \quad \Leftrightarrow \quad Y_{(i)} = M X_{(i)}, \quad \mathbf {Y}\in \mathbb {R}^{n_1 \times \cdots \times n_{i-1} \times m \times n_{i+1} \times \cdots \times n_d}. $$
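
A short sketch of the mode product in the same spirit, reusing `unfold` and `fold` from the previous sketch; `modeprod` is again an illustrative helper, not part of any library.

```python
import numpy as np

def modeprod(X, M, i):
    """Mode-i product Y = X x_i M, i.e. Y_(i) = M @ X_(i)."""
    shape = list(X.shape)
    shape[i] = M.shape[0]
    return fold(M @ unfold(X, i), i, tuple(shape))

X = np.random.rand(4, 5, 6)
M = np.random.rand(3, 5)
print(modeprod(X, M, 1).shape)   # (4, 3, 6)
```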

The inner product of two tensors X and Y is given by

$$ \langle \mathbf {X}, \mathbf {Y}\rangle = \bigl\langle \mathrm {vec}(\mathbf {X}), \mathrm {vec}( \mathbf {Y}) \bigr\rangle = \mathrm {vec}(\mathbf {X})^T \mathrm {vec}(\mathbf {Y}) = \operatorname{trace}\bigl(X_{(1)}^T Y_{(1)}\bigr) . $$
(2.1)

This induces the norm \(\|\mathbf {X}\| := \sqrt{\langle \mathbf {X}, \mathbf {X}\rangle}\).

Any tensor of multilinear rank \(\mathbf{r}=(r_1,r_2,\ldots,r_d)\) can be represented in the so-called Tucker decomposition

$$ \mathbf {X}= \mathbf {C}\times_1 U_1 \times_2 U_2 \dots \times_d U_d = \mathbf {C}\mathop {\text {\huge $\times $}}\limits _{i=1}^d U_i, $$
(2.2)

with the core tensor \(\mathbf {C}\in \mathbb{R}^{r_{1} \times \cdots \times r_{d}}\) and the basis matrices \(U_{i} \in \mathbb{R}^{n_{i} \times r_{i}}\). Without loss of generality, all \(U_i\) have orthonormal columns, \(U_{i}^{T} U_{i} = I_{r_{i}}\), which will be assumed for the rest of the paper.
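
Assembling a tensor from a given Tucker decomposition (2.2) then amounts to d successive mode products. The following sketch, reusing `modeprod` and `multilinear_rank` from the sketches above, is one possible realization under those assumptions.

```python
import numpy as np

def tucker(C, Us):
    """Assemble X = C x_1 U_1 x_2 U_2 ... x_d U_d."""
    X = C
    for i, U in enumerate(Us):
        X = modeprod(X, U, i)
    return X

r, n = (2, 3, 4), (10, 11, 12)
C  = np.random.rand(*r)
Us = [np.linalg.qr(np.random.rand(n[i], r[i]))[0] for i in range(3)]  # orthonormal columns
X  = tucker(C, Us)
print(X.shape, multilinear_rank(X))   # (10, 11, 12) (2, 3, 4) for a generic core
```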

Let us denote the truncation of a tensor X to multilinear rank r using the higher order singular value decomposition (HOSVD) [9] by \(\mathrm {P}^{\mathrm{HO}}_{\mathbf{r}}\). The HOSVD procedure can be described by the successive application of best rank-\(r_i\) approximations \(\mathrm {P}^{i}_{r_{i}}\) in each mode i=1,…,d:

$$ \mathrm {P}^{\mathrm{HO}}_{\mathbf{r}} = \mathrm {P}^{d}_{r_d} \circ \cdots \circ \mathrm {P}^{2}_{r_2} \circ \mathrm {P}^{1}_{r_1}. $$

Each individual projection can be computed by a truncated SVD as follows. Let \(U_Y\) contain the \(r_i\) dominant left singular vectors of the ith matricization \(Y_{(i)}\) of a given tensor Y. Then the tensor resulting from the projection \(\widetilde{\mathbf {Y}}= \mathrm {P}^{i}_{r_{i}} \mathbf {Y}\) is given in terms of its matricization as \(\widetilde{Y}_{(i)} = U_{Y}U_{Y}^{T} Y_{(i)}\).
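
A possible NumPy realization of this truncation is sketched below, reusing `unfold` and `modeprod` from the earlier sketches. For efficiency the core is compressed after each mode (rather than only projected), which yields the same dominant subspaces because the already processed factors have orthonormal columns; `hosvd_truncate` is our own helper name.

```python
import numpy as np

def hosvd_truncate(X, r):
    """Rank-r truncation of X by successive mode-wise projections; returns the
    core C and the orthonormal basis matrices U_i of the truncated tensor."""
    C, Us = X, []
    for i, ri in enumerate(r):
        U, _, _ = np.linalg.svd(unfold(C, i), full_matrices=False)
        Us.append(U[:, :ri])
        C = modeprod(C, Us[i].T, i)    # compress mode i onto its dominant subspace
    return C, Us

C, Us = hosvd_truncate(np.random.rand(10, 11, 12), (3, 3, 3))
print(C.shape, [U.shape for U in Us])  # (3, 3, 3) [(10, 3), (11, 3), (12, 3)]
```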

In contrast to the matrix case, the HOSVD does not, in general, yield the best rank-r approximation. Instead, the following quasi-best approximation property [9] holds:

$$ \big\| \mathbf {X}- \mathrm {P}^{\mathrm{HO}}_{\mathbf{r}} \mathbf {X}\big\| \leq \sqrt{d}\ \big\| \mathbf {X}- \mathbf {X}_{\mathrm{best}} \big\|, $$
(2.3)

where \(\mathbf {X}_{\mathrm{best}}\) denotes any best approximation of \(\mathbf {X}\in \mathbb{R}^{n_{1} \times \cdots \times n_{d}}\) within \(\mathcal{M}_{\mathbf{r}}\) in the norm ∥⋅∥.

The HOSVD does, however, inherit the smoothness of low-rank matrix approximations.

Proposition 2.1

(Smoothness of truncated HOSVD)

Let \(\mathbf {X}\in \mathcal{M}_{\mathbf{r}}\). Then there exists a neighborhood \(D \subset \mathbb{R}^{n_{1} \times \cdots \times n_{d}}\) of X such that \(\mathrm {P}^{\mathrm{HO}}_{\mathbf{r}}: D \to \mathcal{M}_{\mathbf{r}}\) is \(C^\infty\) smooth.

Proof

Let \(D_i\) denote the open set of tensors whose ith mode matricization has a nonzero gap between the \(r_i\)th and the \((r_i+1)\)th singular values. From standard results in matrix perturbation theory, it then follows [7] that each projector \(\mathrm {P}^{i}_{r_{i}}\) is smooth and well-defined on \(D_i\). Since \(\mathbf {X}\in \mathcal{M}_{\mathbf{r}}\) is contained in all \(D_i\) and is a fixed point of every \(\mathrm {P}^{i}_{r_{i}}\), it is possible to construct an open neighborhood \(D \subset \mathbb {R}^{n_{1} \times \cdots \times n_{d}}\) of X such that \(\mathrm {P}^{i}_{r_{i}} \circ \cdots \circ \mathrm {P}^{1}_{r_{1}}D \subseteq D_{i}\) for all i. Hence, the chain rule yields the smoothness of the operator \(\mathrm {P}^{\mathrm{HO}}_{\mathbf{r}}\) on D. □

2.2 Manifold setting

The set \(\mathcal{M}_{\mathbf{r}}\) of tensors of fixed multilinear rank \(\mathbf{r}=(r_1,r_2,\ldots,r_d)\) forms a smooth embedded submanifold of \(\mathbb {R}^{n_{1}\times \cdots \times n_{d}}\) [28, 29]. By counting the degrees of freedom in (2.2), it follows that the dimension of \(\mathcal{M}_{\mathbf{r}}\) is given by

$$ \dim \mathcal{M}_{\mathbf{r}} = \prod_{j=1}^d r_j + \sum_{i=1}^d \bigl( r_i n_i - r_i^2 \bigr). $$

Observe that \(\dim \mathcal{M}_{\mathbf{r}}\) is much smaller than the dimension \(n_1 n_2 \cdots n_d\) of \(\mathbb {R}^{n_{1}\times \cdots \times n_{d}}\) when \(r_i \ll n_i\). The Tucker decomposition (2.2) allows for the efficient representation and manipulation of tensors in \(\mathcal{M}_{\mathbf{r}}\).

According to [14], the tangent space \(T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}\) of \(\mathcal{M}_{\mathbf{r}}\) at \(\mathbf {X}= \mathbf {C}\times_1 U_1 \cdots \times_d U_d\) can be parametrized as

$$ T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}} = \Biggl\{ \mathbf {G}\mathop {\text {\huge $\times $}}\limits _{j=1}^d U_j + \sum_{i=1}^d \mathbf {C}\times_i V_i \mathop {\text {\huge $\times $}}\limits _{j \neq i} U_j \ :\ V_i^T U_i = 0 \Biggr\}, $$
(2.4)

where \(\mathbf {G}\in \mathbb{R}^{r_{1} \times \cdots \times r_{d}}\) and \(V_{i} \in \mathbb{R}^{n_{i} \times r_{i}}\) are the free parameters. Furthermore, the orthogonal projection \(\mathrm {P}_{T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}}\) of a tensor \(\mathbf {A}\in \mathbb{R}^{n_{1} \times \dots \times n_{d}}\) onto \(T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}\) is given by

$$ \mathbf {A}\mapsto \biggl( \mathbf {A}\mathop {\text {\huge $\times $}}\limits _{j=1}^d U_j^T \biggr) \mathop {\text {\huge $\times $}}\limits _{i=1}^d U_i + \sum_{i=1}^d \mathbf {C}\times_i \biggl( \mathrm {P}_{U_i}^\bot \biggl[ \mathbf {A}\mathop {\text {\huge $\times $}}\limits _{j \neq i} U_j^T \biggr]_{(i)} C_{(i)}^\dagger \biggr) \mathop {\text {\huge $\times $}}\limits _{k \neq i} U_k. $$
(2.5)

Here, \(C_{(i)}^{\dagger}\) denotes the pseudo-inverse of \(C_{(i)}\). Note that \(C_{(i)}\) has full row rank and hence \(C_{(i)}^{\dagger}= C_{(i)}^{T} (C_{(i)} C_{(i)}^{T})^{-1}\). We use \(\mathrm {P}_{U_{i}}^{\bot}:= I_{n_{i}} - U_{i} U_{i}^{T}\) to denote the orthogonal projection onto the orthogonal complement of \(\operatorname{span}(U_i)\).
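
The following sketch evaluates (2.5) with dense arithmetic, returning the parameters G and \(V_i\) of the tangent vector in the parametrization (2.4); `project_tangent` and `tangent_to_full` are illustrative helpers building on `unfold`, `modeprod`, and `tucker` from Sect. 2.1, not the authors' implementation.

```python
import numpy as np

def project_tangent(C, Us, A):
    """Parameters (G, V_i) of P_{T_X M_r} A for X = tucker(C, Us), cf. (2.5)."""
    d = len(Us)
    G = A
    for j in range(d):
        G = modeprod(G, Us[j].T, j)          # G = A x_j U_j^T over all modes j
    Vs = []
    for i in range(d):
        B = A
        for j in range(d):
            if j != i:
                B = modeprod(B, Us[j].T, j)  # B = A x_{j != i} U_j^T
        Vi = unfold(B, i) @ np.linalg.pinv(unfold(C, i))
        Vi -= Us[i] @ (Us[i].T @ Vi)         # apply P_{U_i}^perp
        Vs.append(Vi)
    return G, Vs

def tangent_to_full(C, Us, G, Vs):
    """Expand the factorized tangent vector back into a full tensor."""
    xi = tucker(G, Us)
    for i, Vi in enumerate(Vs):
        facs = list(Us)
        facs[i] = Vi
        xi = xi + tucker(C, facs)
    return xi
```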

2.3 Riemannian metric and gradient

As a metric on \(\mathcal{M}_{\mathbf{r}}\), we will use the Euclidean metric of the surrounding space \(\mathbb{R}^{n_1 \times \cdots \times n_d}\) induced by the inner product (2.1). Equipped with this metric, \(\mathcal{M}_{\mathbf{r}}\) becomes a Riemannian manifold. This in turn allows us to define the Riemannian gradient of an objective function, which is obtained as the orthogonal projection of the Euclidean gradient onto the tangent space.

Proposition 2.2

([2, Chap. 3.6])

Let \(f: \mathbb{R}^{n_{1} \times \cdots \times n_{d}} \rightarrow \mathbb{R}\) be a cost function with Euclidean gradient \(\nabla f_{\mathbf {X}}\) at the point \(\mathbf {X}\in \mathcal{M}_{\mathbf{r}}\). Then the Riemannian gradient of \(f|_{\mathcal{M}_{\mathbf{r}}}\) is given by \(\mathrm{grad}\, f(\mathbf {X}) = \mathrm {P}_{T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}} (\nabla f_{\mathbf {X}})\).

By Proposition 2.2, the Riemannian gradient of the objective function \(f(\mathbf {X}) = \|\mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}\|^2/2\) is given by

$$ \mathrm{grad}\, f(\mathbf {X}) = \mathrm {P}_{T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}} ( \mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}). $$
(2.6)

2.4 Retraction

A retraction maps an element ξ of the tangent space \(T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}\) at \(\mathbf {X}\in \mathcal{M}_{\mathbf{r}}\) back to the manifold \(\mathcal{M}_{\mathbf{r}}\). The choice of this map is not unique. A popular theoretical choice is the so-called exponential map, which, however, is usually too expensive to compute. According to [2], it is often sufficient for the purpose of optimization algorithms to approximate the exponential map to first order. A graphical depiction of this concept is shown on the left of Fig. 1. More specifically, a retraction fulfills the following properties.

Fig. 1
figure 1

Graphical representation of the concept of retraction and vector transport within the framework of Riemannian optimization techniques

Definition 2.1

(Retraction, [1, Def. 1])

Let \(\mathcal{M}\) be a smooth submanifold of \(\mathbb{R}^{n_{1}\times \cdots \times n_{d}}\). Let \(0_x\) denote the zero element of \(T_x \mathcal{M}\). A mapping R from the tangent bundle \(T\mathcal{M}\) into \(\mathcal{M}\) is said to be a retraction on \(\mathcal{M}\) around \(x \in \mathcal{M}\) if there exists a neighborhood \(\mathbf {U}\) of \((x, 0_x)\) in \(T\mathcal{M}\) such that the following properties hold:

  (a) We have \(\mathbf {U}\subseteq \operatorname{dom}(R)\) and the restriction \(R|_{\mathbf {U}}: \mathbf {U}\to \mathcal{M}\) is smooth.

  (b) \(R(y, 0_y) = y\) for all \((y, 0_y) \in \mathbf {U}\).

  (c) With the canonical identification \(T_{0_x} T_x \mathcal{M}\simeq T_x \mathcal{M}\), R satisfies the local rigidity condition

    $$ \mathrm{D} R(x, \cdot) (0_x) = \mathrm{id}_{T_x \mathcal{M}} \quad \text{for all}\ (x, 0_x) \in \mathbf {U}, $$

    where \(\mathrm{id}_{T_x \mathcal{M}}\) denotes the identity mapping on \(T_x \mathcal{M}\).

If \(\mathcal{M}\) is an embedded submanifold of \(\mathbb{R}^{n_1 \times \cdots \times n_d}\), then the orthogonal projection onto \(\mathcal{M}\),

$$ \mathrm {P}_{\mathcal{M}}(x + \xi) := \mathop {\mathrm {argmin}}\limits_{y \in \mathcal{M}} \| x + \xi - y \|, $$
(2.7)

induces the so-called projective retraction

$$ R: (x, \xi) \mapsto \mathrm {P}_{\mathcal{M}}(x + \xi), $$

which satisfies the properties of Definition 2.1 [1, Prop. 5].

Since the HOSVD procedure only satisfies the quasi-best approximation property (2.3), it does not yield the projective retraction. Nevertheless, it still possesses all the necessary properties of a retraction in the sense of Definition 2.1.

Proposition 2.3

(HOSVD as Retraction)

The map

$$ R: T\mathcal{M}_{\mathbf{r}} \to \mathcal{M}_{\mathbf{r}}, \quad (\mathbf {X}, \xi) \mapsto \mathrm {P}^{\mathrm{HO}}_{\mathbf{r}} (\mathbf {X}+ \xi) $$
(2.8)

is a retraction on \(\mathcal{M}_{\mathbf{r}}\) around X.

Proof

The map R defined in (2.8) can be written as the composition

$$ R = \mathrm {P}^{\mathrm{HO}}_{\mathbf{r}} \circ F, $$

where the smooth map \(F: T\mathcal{M}_{\mathbf{r}} \to \mathbb{R}^{n_1 \times \cdots \times n_d}\) is defined as F(X,ξ):=X+ξ. By Proposition 2.1, \(\mathrm {P}^{\mathrm{HO}}_{\mathbf{r}}\) is smooth in a neighborhood of X+ξ for all \(\mathbf {X}\in \mathcal{M}_{\mathbf{r}}\) and sufficiently small ξ. Hence, R defines a locally smooth map in a neighborhood around \((\mathbf {X}, 0_{\mathbf {X}})\).

Definition 2.1(b) follows from the fact that the application of the HOSVD to elements of \(\mathcal{M}_{\mathbf{r}}\) leaves them unchanged.

It remains to check Definition 2.1(c), the local rigidity condition. Because the tangent space \(T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}\) is a first-order approximation of \(\mathcal{M}_{\mathbf{r}}\) around X, we have \(\operatorname{dist}(\mathbf {X}+ t\xi, \mathcal{M}_{\mathbf{r}}) = O(t^2)\) for t→0. Thus, using (2.3):

$$ \big\| (\mathbf {X}+ t\xi) - R(\mathbf {X}, t\xi) \big\| \leq \sqrt{d}\ \operatorname{dist}(\mathbf {X}+ t\xi, \mathcal{M}_{\mathbf{r}}) = O\bigl(t^2\bigr). $$

Hence, \(R(\mathbf {X}, t\xi) = (\mathbf {X}+ t\xi) + O(t^2)\), which gives \(\frac{d}{dt} R(\mathbf {X}, t \xi) |_{t=0} = \xi\). In other words, \(\mathrm{D} R(\mathbf {X}, \cdot)(0_{\mathbf {X}}) = \mathrm{id}_{T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}}\), which completes the proof. □

2.5 Vector transport

A vector transport \(\mathrm {T}_{x \to y}\), as introduced in [2], allows us to map tangent vectors from \(T_x \mathcal{M}_{\mathbf{r}}\) to \(T_y \mathcal{M}_{\mathbf{r}}\). Because \(\mathcal{M}_{\mathbf{r}}\) is an embedded submanifold of \(\mathbb{R}^{n_{1} \times \cdots \times n_{d}}\), the orthogonal projection onto the tangent space at y constitutes a vector transport, see [2, Sect. 8.1.3]:

$$ \mathrm {T}_{x \to y}\, \xi := \mathrm {P}_{T_{y} \mathcal{M}_{\mathbf{r}}} (\xi). $$

A visualization of the concept of vector transport is shown on the right of Fig. 1.

3 Nonlinear Riemannian CG

With the concepts introduced in Sect. 2, we have all the necessary geometric ingredients for performing Riemannian optimization on the manifold \(\mathcal{M}_{\mathbf{r}}\) of low-rank tensors. In particular, the nonlinear CG algorithm discussed in [2, Sect. 8.3] yields Algorithm 1. This can be seen as an extension of the standard nonlinear CG algorithm [22], with the Euclidean gradient replaced by the Riemannian gradient. Applying the retraction after each optimization step ensures that we stay on the manifold. Finally, the use of vector transport allows us to calculate conjugate directions using the Polak-Ribière+ (PR+) update rule. If the search directions become insufficiently gradient-related during the iteration, the algorithm should revert to steepest descent; see [5]. A standard Armijo backtracking scheme is added to control the step sizes, using the result of a linearized line search procedure as an initial guess.

Algorithm 1
figure 2

Geometric nonlinear CG for Tensor Completion

In the following sections, we will provide algorithmic details on the individual steps of Algorithm 1 and discuss their computational complexity. To simplify the complexity expressions, we assume that \(n := n_1 = \cdots = n_d\) and \(r := r_1 = \cdots = r_d\).

3.1 Calculation of the gradient

The calculation of the Riemannian gradient (2.6) requires the explicit computation of individual entries of a tensor X from its Tucker decomposition:

$$\mathbf {X}_{i_1 i_2 \ldots i_d} = \sum_{j_1=1}^{r_1} \sum _{j_2=1}^{r_2} \cdots \sum _{j_d=1}^{r_d} \mathbf {C}_{j_1 j_2 \ldots j_d} (U_1)_{i_1 j_1} (U_2)_{i_2 j_2} \cdots (U_d)_{i_d j_d}. $$

In total, this requires \(|\varOmega|(d+1)r^d\) operations for computing \(\mathrm {P}_\varOmega \mathbf {X}\).
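
For d=3, the entry-wise evaluation on Ω can be written as a single einsum; this is only a plain NumPy sketch of the operation that the authors implement in C via a mex interface, and `sample_tucker3` is our own name.

```python
import numpy as np

def sample_tucker3(C, U1, U2, U3, Omega):
    """Entries X_{i1 i2 i3} of X = C x_1 U1 x_2 U2 x_3 U3 at the multi-indices in Omega."""
    I1, I2, I3 = Omega[:, 0], Omega[:, 1], Omega[:, 2]
    # sum_{j1 j2 j3} C_{j1 j2 j3} U1[i1,j1] U2[i2,j2] U3[i3,j3] for each sample k
    return np.einsum('abc,ka,kb,kc->k', C, U1[I1], U2[I2], U3[I3])
```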

The projection of \(\mathbf {E}:= \mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}\) onto the tangent space gives the gradient ξ, which will be stored in factorized form as

$$ \xi = \mathbf {G}\mathop {\text {\huge $\times $}}\limits _{j=1}^d U_j + \sum_{i=1}^d \mathbf {C}\times_i V_i \mathop {\text {\huge $\times $}}\limits _{j \neq i} U_j , $$
(3.1)

where

$$\mathbf {G}:= \mathbf {E}\mathop {\text {\huge $\times $}}\limits _{j=1}^d U_j^T, \qquad V_i := \mathrm {P}_{U_i}^\bot \biggl[ \mathbf {E}\mathop {\text {\huge $\times $}}\limits _{j \neq i} U_j^T \biggr]_{(i)} C_{(i)}^\dagger, $$

see (2.5). By exploiting the sparsity of E, the computation of G and \(V_i\), i=1,…,d, requires \(O(r^d(|\varOmega|+n)+r^{d+1})\) operations. This makes the calculation of the gradient the most time-consuming part of our optimization scheme.

3.2 Vector transport and new search direction

To calculate the new search direction, we use the Polak-Ribière+ update formula adapted to Riemannian optimization, see [2, 22]:

$$ \beta_k = \max \biggl( 0,\ \frac{\bigl\langle \mathrm{grad}f(\mathbf {X}_k),\ \mathrm{grad}f(\mathbf {X}_k) - \mathrm {T}_{\mathbf {X}_{k-1} \to \mathbf {X}_k} \mathrm{grad}f(\mathbf {X}_{k-1}) \bigr\rangle}{\big\| \mathrm{grad}f(\mathbf {X}_{k-1}) \big\|^2} \biggr). $$
(3.2)

The calculation of \(\beta_k\) requires the evaluation of the vector transport

$$ \mathrm {T}_{\mathbf {X}_{k-1} \to \mathbf {X}_k}\, \xi_{k-1} = \mathrm {P}_{T_{\mathbf {X}_k} \mathcal{M}_{\mathbf{r}}} (\xi_{k-1}), $$

where \(\xi_{k-1} = \mathrm{grad}\, f(\mathbf {X}_{k-1})\) is assumed to be in the factorized form (3.1). Moreover, \(\mathbf {X}_k \in \mathcal{M}_{\mathbf{r}}\) is given in terms of a Tucker decomposition \(\mathbf {X}_{k} = \widetilde{\mathbf {C}}\mathop {\text {\huge $\times $}}\nolimits _{i=1}^{d} \widetilde{U}_{i} \). As in the previous section, we obtain

$$ \mathrm {P}_{T_{\mathbf {X}_k} \mathcal{M}_{\mathbf{r}}}\, \xi_{k-1} = \widetilde{\mathbf {G}}\mathop {\text {\huge $\times $}}\limits _{j=1}^d \widetilde{U}_j + \sum_{i=1}^d \widetilde{\mathbf {C}}\times_i \widetilde{V}_i \mathop {\text {\huge $\times $}}\limits _{j \neq i} \widetilde{U}_j, $$
(3.3)

where

$$\widetilde{\mathbf {G}}:= \xi_{k-1} \mathop {\text {\huge $\times $}}\limits _{j=1}^d \widetilde{U}_j^T, \qquad \widetilde{V}_i := \mathrm {P}_{\widetilde{U}_i}^\bot \biggl[ \xi_{k-1} \mathop {\text {\huge $\times $}}\limits _{j \neq i} \widetilde{U}_j^T \biggr]_{(i)} \widetilde{C}_{(i)}^\dagger. $$

To compute \(\widetilde{\mathbf {G}}\) and \(\widetilde{V}_{i}\), we make use of the linearity in \(\xi_{k-1}\) and process each summand in the representation (3.1) of \(\xi_{k-1}\) separately. By exploiting the tensor product structure of each summand, we then arrive at a total cost of \(O(nr^d)\) operations.

Further, the evaluation of (3.2) requires the inner product between the tensor in (3.3) and \(\xi_k = \mathrm{grad}\, f(\mathbf {X}_k)\), which is also given in factorized form:

$$\xi_k = \widehat{\mathbf {G}}\mathop {\text {\huge $\times $}}\limits _{j=1}^d \widetilde{U}_j + \sum_{i=1}^d \widetilde{\mathbf {C}}\times_i \widehat{V}_i \mathop {\text {\huge $\times $}}\limits _{j \neq i} \widetilde{U}_j. $$

Utilizing the orthogonality of \(\widetilde{U}_{i}\) and the uniqueness condition \(\widetilde{U}_{i}^{T} \widetilde{V}_{i} = \widetilde{U}_{i}^{T} \widehat {V}_{i} = 0\) for the tangent space, see (2.4), we obtain

$$ \bigl\langle \mathrm {P}_{T_{\mathbf {X}_k} \mathcal{M}_{\mathbf{r}}}\, \xi_{k-1},\ \xi_k \bigr\rangle = \langle \widetilde{\mathbf {G}}, \widehat{\mathbf {G}}\rangle + \sum_{i=1}^d \bigl\langle \widetilde{\mathbf {C}}\times_i \bigl( \widehat{V}_i^T \widetilde{V}_i \bigr),\ \widetilde{\mathbf {C}}\bigr\rangle. $$

The evaluation of the (smaller) inner products requires \(O(nr^2+r^{d+1})\) operations. The norm of \(\xi_{k-1}\) appearing in the denominator of (3.2) is computed analogously. Hence, the total cost for computing \(\beta_k\) is \(O(nr^d)\) operations.

Once \(\beta_k\) has been determined, the new conjugate direction is computed by

$$ \eta_k = -\xi_k + \beta_k\, \mathrm {T}_{\mathbf {X}_{k-1} \to \mathbf {X}_k}\, \eta_{k-1}, $$

where \(\eta_{k-1}\) is the previous conjugate direction. The vector transport of \(\eta_{k-1}\) is performed in exactly the same way as above. The resulting tensor in \(T_{\mathbf {X}_k} \mathcal{M}_{\mathbf{r}}\) is multiplied by \(\beta_k\) and added to \(-\xi_k\). Due to linearity, the addition of two tensors in the same tangent space is performed by simply adding the corresponding coefficients G and \(V_i\).
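
The following sketch performs one such PR+ update with dense arithmetic: the previous gradient and search direction are transported by orthogonal projection onto \(T_{\mathbf {X}_k} \mathcal{M}_{\mathbf{r}}\), then (3.2) and the new direction are formed. It reuses `project_tangent` and `tangent_to_full` from Sect. 2.2 and works on full tensors for clarity, whereas the actual implementation keeps everything in the factorized form (3.1).

```python
import numpy as np

def transport(Ck, Usk, full_vec):
    """Orthogonal projection of a full tensor onto T_{X_k} M_r, used as vector transport."""
    G, Vs = project_tangent(Ck, Usk, full_vec)
    return tangent_to_full(Ck, Usk, G, Vs)

def pr_plus_direction(Ck, Usk, xi_k, xi_prev, eta_prev):
    """One PR+ update (3.2): returns the new search direction eta_k and beta_k."""
    xi_prev_t  = transport(Ck, Usk, xi_prev)     # T_{X_{k-1} -> X_k} grad f(X_{k-1})
    eta_prev_t = transport(Ck, Usk, eta_prev)    # T_{X_{k-1} -> X_k} eta_{k-1}
    beta = max(0.0, np.vdot(xi_k, xi_k - xi_prev_t) / np.vdot(xi_prev, xi_prev))
    return -xi_k + beta * eta_prev_t, beta
```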

3.3 Calculation of the retraction

To obtain the next iterate, Algorithm 1 retracts the updated tensor X+αη back to the manifold by means of the HOSVD. When performing this retraction, we exploit the fact that \(\mathbf {X}\in \mathcal{M}_{\mathbf{r}}\) is given in Tucker decomposition and \(\eta \in T_{\mathbf {X}} \mathcal{M}_{\mathbf{r}}\) is represented in the factorized form (3.1):

$$\begin{aligned} \mathbf {X}+ \alpha \eta &= \mathbf {C}\mathop {\text {\huge $\times $}}\limits _{i=1}^d U_i + \alpha \Biggl( \mathbf {G}\mathop {\text {\huge $\times $}}\limits _{i=1}^d U_i + \sum_{i=1}^d \mathbf {C}\times_i V_i \mathop {\text {\huge $\times $}}\limits _{j \neq i} U_j \Biggr) \\ &= (\mathbf {C}+ \alpha \mathbf {G}) \mathop {\text {\huge $\times $}}\limits _{i=1}^d U_i + \alpha \sum_{i=1}^d \mathbf {C}\times_i V_i \mathop {\text {\huge $\times $}}\limits _{j \neq i} U_j \\ &= \mathbf {S}\mathop {\text {\huge $\times $}}\limits _{i=1}^d [ U_i, V_i ] \end{aligned}$$

where \(\mathbf {S}\in \mathbb {R}^{2r_{1} \times \cdots\times 2 r_{d}}\) has the special structure depicted in Fig. 2. After orthogonalizing the combined basis matrices \([U_i, V_i]\) and updating S correspondingly, we can restrict the application of the HOSVD to the smaller tensor S, which requires only \(O(r^{d+1})\) operations. The retraction is completed by multiplying the basis matrices obtained from the HOSVD of S with the combined basis factors. In total, the retraction requires \(O(nr^2+r^{d+1})\) operations.
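
A sketch of this structured retraction (cf. Fig. 2), under the assumption \(n_i \geq 2 r_i\) and reusing `modeprod` and `hosvd_truncate` from Sect. 2.1; the function name `retract` is ours.

```python
import numpy as np

def retract(C, Us, G, Vs, alpha):
    """HOSVD retraction of X + alpha*eta, with X = (C, Us) and eta = (G, Vs) as in (3.1)."""
    d, r = len(Us), C.shape
    # assemble the structured core S of size 2r_1 x ... x 2r_d
    S = np.zeros(tuple(2 * ri for ri in r))
    first = tuple(slice(0, ri) for ri in r)
    S[first] = C + alpha * G
    for i in range(d):
        blk = list(first)
        blk[i] = slice(r[i], 2 * r[i])
        S[tuple(blk)] = alpha * C
    # orthogonalize the combined basis factors [U_i, V_i] and push the R factors into S
    Qs = []
    for i in range(d):
        Q, R = np.linalg.qr(np.hstack([Us[i], Vs[i]]))
        Qs.append(Q)
        S = modeprod(S, R, i)
    # truncate only the small core, then recombine with the orthonormal Q factors
    C_new, Ws = hosvd_truncate(S, r)
    return C_new, [Qs[i] @ Ws[i] for i in range(d)]
```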

Fig. 2
figure 3

Structure of the combined core tensor S for the case d=3

3.4 Line search

Following [30], we obtain an initial guess for the step size α in Algorithm 1 by performing exact line search along the tangent space. This leads to the optimization problem

$$\alpha^* = \mathop {\mathrm {argmin}}\limits_\alpha \big\| \mathrm {P}_\varOmega(\mathbf {X}+ \alpha\xi) - \mathrm {P}_\varOmega \mathbf {A}\big\| ^2, $$

which can be solved analytically:

$$ \alpha^* = \frac{\langle \mathrm {P}_\varOmega \xi, \mathrm {P}_\varOmega (\mathbf {A}- \mathbf {X}) \rangle}{\langle \mathrm {P}_\varOmega \xi, \mathrm {P}_\varOmega \xi \rangle}. $$
(3.4)

The computation of the nonzero entries of \(\mathrm {P}_\varOmega \xi\) requires \(2|\varOmega|(d+1)r^d\) operations (see Sect. 3.1), while each of the inner products requires \(2|\varOmega|\) operations.

This linearized exact line search procedure is combined with an Armijo backtracking scheme. In our numerical experiments, we have observed that backtracking is almost never necessary; \(\alpha^*\) is often very close to the optimal step size.
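
A minimal sketch of this step-size selection is given below; `linearized_step` evaluates (3.4) from the sampled entries, and `armijo_step` is a generic backtracking loop whose constants (sufficient-decrease parameter \(10^{-4}\), halving of the step) are illustrative choices, not taken from the paper.

```python
import numpy as np

def linearized_step(xi_omega, res_omega):
    """alpha* from (3.4), given the entries of P_Omega xi and of P_Omega (A - X)."""
    return np.dot(xi_omega, res_omega) / np.dot(xi_omega, xi_omega)

def armijo_step(alpha0, f_x, f_retract, slope, c=1e-4, max_backtracks=25):
    """Backtrack from the linearized guess alpha0 until the Armijo condition
    f(R(X, alpha*eta)) <= f(X) + c * alpha * <grad f(X), eta> holds.
    f_retract(alpha) evaluates the cost after retracting alpha*eta;
    slope = <grad f(X), eta> (negative for a descent direction)."""
    alpha = alpha0
    for _ in range(max_backtracks):
        if f_retract(alpha) <= f_x + c * alpha * slope:
            break
        alpha *= 0.5
    return alpha
```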

3.5 Summary of computational complexity

By exploiting the low-rank structure of the iterates, we arrive at a total cost per iteration of

$$ O \bigl( r^d \bigl(n + |\varOmega|\bigr) + r^{d+1} \bigr) $$
(3.5)

operations. In particular, Algorithm 1 scales linearly with n when r is kept fixed. This makes the algorithm suitable for large tensor sizes n and moderate values of d, say d=3,4.

3.6 Convergence of Algorithm 1

To investigate the convergence of our proposed algorithm, we proceed similarly to the matrix case [30, Sect. 4.1] by applying the general convergence theory for Riemannian optimization. In particular, [2, Theorem 4.3.1] yields the following proposition, which shows that every accumulation point of Algorithm 1 matches the prescribed entries \(\mathrm {P}_\varOmega \mathbf {A}\) within its tangent space.

Proposition 3.1

Let \(\mathbf {X}_k\) be an infinite sequence of iterates generated by Algorithm 1. Then, every accumulation point \(\mathbf {X}_*\) of \(\mathbf {X}_k\) satisfies \(\mathrm {P}_{T_{\mathbf {X}_*} \mathcal{M}_{\mathbf{r}}} (\mathrm {P}_\varOmega \mathbf {X}_*) = \mathrm {P}_{T_{\mathbf {X}_*} \mathcal{M}_{\mathbf{r}}} (\mathrm {P}_\varOmega \mathbf {A})\).

A more detailed convergence analysis is complicated by the fact that \(\mathcal{M}_{\mathbf{r}}\) is not closed; for example, a sequence of tensors in \(\mathcal{M}_{\mathbf{r}}\) may approach a tensor for which the ith matricization has rank less than \(r_i\). To avoid this effect, we first discuss the convergence for a modification of the original cost function \(f(\mathbf {X}) = \|\mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}\|^2/2\):

$$ g(\mathbf {X}) := f(\mathbf {X}) + \mu^2 \sum_{i=1}^d \bigl( \| X_{(i)} \|^2 + \big\| X_{(i)}^\dagger \big\|^2 \bigr), \quad \mu > 0, $$
(3.6)

where ∥⋅∥ denotes the Frobenius norm for matrices.

Proposition 3.2

Let \(\mathbf {X}_k\) be an infinite sequence of iterates generated by Algorithm 1, but with the modified cost function g defined in (3.6). Then

$$\lim_{k\to\infty} \big\| \mathrm {grad}g(\mathbf {X}_k)\big\| = 0. $$

Proof

By construction of the line search, all iterates \(\mathbf {X}_k\) fulfill \(g(\mathbf {X}_k) \leq g(\mathbf {X}_0)\) and therefore

$$\frac{1}{2} \| \mathrm {P}_\varOmega \mathbf {X}_k - \mathrm {P}_\varOmega \mathbf {A}\|^2 +\mu^2 \sum _{i=1}^d \bigl( \| X_{k,(i)} \|^2 + \big\| X_{k,(i)}^\dagger \big\| ^2 \bigr) \leq g(\mathbf {X}_0) =: C_0^2, $$

where \(X_{k,(i)}\) denotes the ith matricization of \(\mathbf {X}_k\). In particular,

$$\mu^2 \sum_{i=1}^d \| X_{k,(i)} \|^2 \leq C_0^2, \qquad \mu^2 \sum_{i=1}^d \big\| X^\dagger_{k,(i)} \big\| ^2 \leq C_0^2, $$

yielding upper and lower bounds for the largest and smallest singular values, respectively:

$$\sigma_{\max}(X_{k,(i)}) \le \| X_{k,(i)} \| \le C_0 / \mu, \qquad \sigma^{-1}_{\min}(X_{k,(i)}) \le \big\| X^\dagger_{k,(i)} \big\| \le C_0 / \mu. $$

Hence, all iterates \(\mathbf {X}_k\) stay inside the set

$$ B := \bigl\{ \mathbf {X}\in \mathcal{M}_{\mathbf{r}} :\ \sigma_{\max}(X_{(i)}) \leq C_0/\mu\ \text{and}\ \sigma_{\min}(X_{(i)}) \geq \mu/C_0\ \text{for}\ i = 1,\ldots,d \bigr\}, $$

which is a compact subset of \(\mathcal{M}_{\mathbf{r}}\). Now suppose, contrary to the statement of the proposition, that \(\|\mathrm {grad}g(\mathbf {X}_k)\|\) does not converge to zero. Then there is δ>0 and a subsequence of \(\{\mathbf {X}_k\}\) such that \(\|\mathrm {grad}g(\mathbf {X}_k)\| > \delta\) for all elements of the subsequence. Since \(\mathbf {X}_k \in B\), this subsequence has an accumulation point \(\mathbf {X}_*\) for which also \(\|\mathrm {grad}g(\mathbf {X}_*)\| > \delta\). However, this contradicts [2, Theorem 4.3.1], which states that every accumulation point is a critical point of g. □

It is instructive to compare the gradient of g with the gradient of f at \(\mathbf {X}\in \mathcal{M}_{\mathbf{r}}\). For this purpose, we use the fact that \(X_{(i)}\) has full row rank and thus the derivative of its pseudo-inverse \(X_{(i)}^{\dagger}= X_{(i)}^{T} ( X_{(i)}X_{(i)}^{T} )^{-1}\) can be written as

$$\partial X_{(i)}^\dagger = - X_{(i)}^\dagger ( \partial X_{(i)}) X_{(i)}^\dagger + \bigl(I - X_{(i)}^\dagger X_{(i)}\bigr) (\partial X_{(i)})^T \bigl( X_{(i)}X_{(i)}^T \bigr)^{-1}. $$

Thus, the Euclidean gradient of g at X is given by

$$ \nabla g(\mathbf {X}) = \nabla f(\mathbf {X}) + 2 \mu^2 \sum _{i=1}^d \bigl[ U_i \bigl( \varSigma_i - \bigl(\varSigma_i^\dagger \bigr)^3 \bigr)V_i^T \bigr]^{(i)}, $$
(3.7)

in terms of the singular value decomposition \(X_{(i)} = U_{i} \varSigma_{i} V_{i}^{T}\). The operation \([\,\cdot\,]^{(i)}\) reverses the matricization, that is, \([X_{(i)}]^{(i)} = \mathbf {X}\).

The statement of Proposition 3.2 holds for arbitrarily small μ>0. If the smallest singular values of the matricizations stay bounded from below as μ→0, that is, the accumulation points \(\mathbf {X}_*\) of \(\{\mathbf {X}_k\}\) do not approach the boundary of \(\mathcal{M}_{\mathbf{r}}\) as μ→0, then (3.7) shows that \(\mathrm{grad}\, f(\mathbf {X}_*) \to 0\) as μ→0. Thus, the regularization term becomes negligible in such a situation. For more details, we refer to the discussion in [30, Sect. 4.1].

4 Numerical experiments

Algorithm 1 (geomCG) was implemented in Matlab version 2012a, using the Tensor Toolbox version 2.5 [4] for handling some of the tensor operations. However, to attain reasonable performance, it was important to implement operations with sparse tensors in C and call them via mex interfaces. In particular, this was done for the evaluation of the objective function (1.1), the computation of the Euclidean gradient and its projection onto the tangent space (3.1), as well as for the linearized line search (3.4). For simplicity, we restricted the implementation to the case d=3. The source code is freely available under a BSD license and can be downloaded from http://anchp.epfl.ch.

To measure the convergence during the iteration, Algorithm 1 computes the relative residual

$$\frac{\| \mathrm {P}_\varOmega \mathbf {X}- \mathrm {P}_\varOmega \mathbf {A}\| }{ \|\mathrm {P}_\varOmega \mathbf {A}\|}. $$

However, to investigate the reconstruction quality of the algorithm, measuring the relative residual on the sampling set Ω is not sufficient. For this purpose, we also measure the relative error \(\|\mathrm {P}_\varGamma \mathbf {X}- \mathrm {P}_\varGamma \mathbf {A}\| / \|\mathrm {P}_\varGamma \mathbf {A}\|\) on a random test set Γ of the same cardinality as Ω.

Unless stated otherwise, we assume that the tensor has equal size in all modes, \(n := n_1 = n_2 = n_3\), and similarly for the ranks, \(r := r_1 = r_2 = r_3\). All tests were performed on a quad-core Intel Xeon E31225, 3.10 GHz, with 8 GB of RAM running 64-bit Debian 7.0 Linux. Stated calculation times are wall-clock times, excluding the set-up time of the problem.

4.1 Computational complexity for synthetic data sets

A synthetic data tensor A of exact multilinear rank r is created by choosing the entries of the core tensor C and the basis matrices \(U_1, U_2, U_3\) as pseudo-random numbers from a uniform distribution on [0,1].
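
For concreteness, such a test problem can be set up as follows, reusing `tucker` from Sect. 2.1; the seed and the 10 % sampling rate are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, (10, 10, 10)
C  = rng.random(r)                          # core entries uniform on [0, 1]
Us = [rng.random((n, ri)) for ri in r]      # basis matrices uniform on [0, 1]
A  = tucker(C, Us)                          # exact multilinear rank (10, 10, 10)

mask    = rng.random(A.shape) < 0.1         # sample 10 % of the entries
Omega   = np.argwhere(mask)                 # |Omega| x 3 multi-indices
A_Omega = A[mask]                           # the known entries P_Omega A
```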

As a first test, we check that the implementation of Algorithm 1 exhibits the same scaling behaviour per iteration as predicted by the theoretical discussion in Sect. 3.5. To measure the scaling with regard to the tensor size n, we fix the multilinear rank to r=(10,10,10) and scale the size of the sampling set linearly with the tensor size, |Ω|=10n. We perform 10 iterations of our algorithm and repeat the process 10 times for different randomly chosen datasets. Analogously, we measure the dependence on the tensor rank by setting the tensor size to n=300 and fixing the sampling set to 0.1 % of the full tensor.

The results are shown in Fig. 3. We observe that our algorithm indeed scales linearly in the tensor size over the large range \(n \in [100, 3000]\). Even for such large tensors, the time per iteration step is very low. Plotting the results for the scaling with respect to the tensor rank, we observe an \(O(r^3)\) dependence, in agreement with (3.5).

Fig. 3
figure 4

Time needed per iteration step for various problem sizes. Left: Runtime with fixed rank r=(10,10,10) and varying tensor size \(n=n_1=n_2=n_3 \in \{100,150,\ldots,3000\}\). The size of the sampling set scales linearly with n, \(|\varOmega| = 10n\). Right: Runtime with fixed tensor size n=(300,300,300) and varying tensor rank \(r=r_1=r_2=r_3 \in \{20,25,\ldots,100\}\). Size of sampling set: 0.1 % of the full tensor. The dashed line shows a scaling behaviour of \(O(r^3)\)

4.2 Reconstruction of synthetic data sets

We compare the reconstruction performance of our algorithm with the hard completion algorithm by Signoretto et al. [26, Alg. 3], which is based on the so-called inexact splitting method applied to (1.2). We selected this algorithm because it is publicly available on the authors’ website. The algorithm depends on certain parameters discussed in [26]. We have chosen the proximity parameter τ=10 and the nuclear norm weights \(\lambda_1=\lambda_2=\lambda_3=1\), which corresponds to the settings used in the supplied test routine.

In Fig. 4 we present the convergence behaviour of both algorithms for varying sizes of the sampling set, in terms of the error on the test set Γ. The size of the sampling set is given as a percentage p of the full tensor, \(|\varOmega| = p\, n^3\). We use a relative residual of \(10^{-12}\) and a maximum of 300 iterations as stopping criteria. Both algorithms need more iterations as the number of missing entries increases, but the effect is much more pronounced for hard completion. Our algorithm performs better, both with respect to runtime and to the number of iterations.

Fig. 4
figure 5

Convergence curves for different sampling set sizes as functions of iterations and time for our proposed algorithm (geomCG) and the hard completion algorithm by Signoretto et al. [26]. Tensor size and multilinear rank fixed to n=100 and r=(5,5,5), respectively (Color figure online)

4.2.1 Reconstruction of noisy data

In this part, we investigate the convergence properties of Algorithm 1 in the presence of noise. The known entries of A are perturbed by rescaled Gaussian noise E, such that \(\|\mathrm {P}_\varOmega \mathbf {E}\| = \varepsilon_0 \|\mathrm {P}_\varOmega \mathbf {A}\|\) for a prescribed noise level \(\varepsilon_0\). Ideally, Algorithm 1 should return an approximation \(\mathbf {X}^*\) at the level of \(\varepsilon_0\), that is,

$$\big\| \mathrm {P}_\varOmega \mathbf {X}^* - \mathrm {P}_\varOmega (\mathbf {A}+ \mathbf {E})\big\| / \| \mathrm {P}_\varOmega \mathbf {A}\| \approx \big\| \mathrm {P}_\varOmega \mathbf {A}- \mathrm {P}_\varOmega (\mathbf {A}+ \mathbf {E})\big\| / \|\mathrm {P}_\varOmega \mathbf {A}\| = \varepsilon_0. $$

To verify that the noise does not lead to a misidentification of the rank of the underlying problem, we compare the case where the initial guess is taken on the correct manifold with the case of an uninformed rank-(1,1,1) initial guess. In the latter case, we employ the heuristic rank adaptation strategy discussed in Sect. 4.3. Figure 5 shows that in both cases we can indeed recover the original data up to the given noise level.
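
The rescaling of the noise can be sketched as follows; `add_noise` operates directly on the vector of known entries and is an illustrative helper, not part of the released code.

```python
import numpy as np

def add_noise(A_Omega, eps0, rng=None):
    """Return the perturbed samples P_Omega(A + E) with ||P_Omega E|| = eps0 * ||P_Omega A||."""
    rng = np.random.default_rng() if rng is None else rng
    E = rng.standard_normal(A_Omega.shape)
    E *= eps0 * np.linalg.norm(A_Omega) / np.linalg.norm(E)
    return A_Omega + E
```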

Fig. 5
figure 6

Tensor completion from noisy measurements with n=100, r=(6,6,6). The relative size of the sampling set was fixed to 10 %. The black line corresponds to the noise-free case. The different colors correspond to the noise levels \(\varepsilon_0 \in \{10^{-4}, 10^{-6}, \ldots, 10^{-12}\}\). Left: Results when the underlying rank r is known. Right: Results for the case of unknown rank of the underlying problem. Due to the rank adaptation procedure, more iterations are necessary

4.2.2 Size of the sampling set

It is well known that in the matrix case, the number of random samples needed to exactly recover the original matrix is at least \(O(nr\log n)\) under standard assumptions; see e.g. [6, 13]. In the left plot of Fig. 6, we present numerical experiments suggesting that a similar statement may hold for the three-dimensional tensor case. The algorithm is declared converged (and hence yields perfect reconstruction) if the relative residual drops below \(10^{-6}\) within 100 iterations.

Fig. 6
figure 7

Scaling of the sampling set size needed to reconstruct the original tensor of fixed multilinear rank (10,10,10). Left: Minimum size of the sampling set needed to attain convergence vs. tensor size \(n=n_1=n_2=n_3\). Right: Phase transition of the convergence speed (4.1). White means fast convergence, black means no convergence. The line corresponds to \(O(n\log n)\)

The right plot of Fig. 6 displays a phase transition of the measured convergence speed of Algorithm 1, computed from

$$ \rho = \biggl(\frac{\|\mathrm {P}_\varGamma \mathbf {X}_{k_\mathrm{end}} - \mathrm {P}_\varGamma \mathbf {A}\|}{ \|\mathrm {P}_\varGamma \mathbf {X}_{k_{\mathrm{end}} - 10} - \mathrm {P}_\varGamma \mathbf {A}\|} \biggr)^{\frac{1}{10}} \in [0,1], $$
(4.1)

where \(k_{\mathrm{end}}\) is the final iteration.

4.3 Applications

In the following, we assess the performance of our algorithm on tensors derived from applications. In contrast to synthetic data sets, tensors from applications usually do not possess a clear, well-defined multilinear rank. Often, they exhibit a rather smooth decay of the singular values in each matricization. In such a setting, Algorithm 1 requires a good initial guess, as directly applying it with a (large) fixed rank r usually results in severe overfitting. We propose the following heuristic to address this problem: Starting from a multilinear rank-(1,1,1) approximation, we iteratively increase the multilinear rank in each mode and rerun our algorithm with the previous result as initial guess. This procedure is repeated until the prescribed final multilinear rank r is reached. We increase the multilinear rank every time the current relative change in the square root of the cost function falls below a tolerance δ:

$$ \bigl\vert \sqrt{f(\mathbf {X}_{i-1})} - \sqrt{f( \mathbf {X}_i)} \bigr\vert < \delta \sqrt{f(\mathbf {X}_i)}. $$
(4.2)

Initially, we use a large value for δ, say δ=1. We have observed this approach to be effective at steering the algorithm into the direction of the optimal solution. Once we arrive at the final rank r, we can also use (4.2) as a stopping criterion, now with a much smaller value for δ, say δ=0.001. In case of convergence problems, the initial value of δ should be chosen smaller, at the cost of additional iterations. In the following numerical experiments, we always include this initialization procedure in the reported computation times.
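
A skeleton of this outer loop is sketched below. The function `geomcg` is a placeholder for a run of Algorithm 1 at fixed rank that stops once (4.2) holds with the given tolerance; its name and signature are assumptions made for illustration and do not correspond to the released code.

```python
def complete_with_rank_adaptation(A_Omega, Omega, shape, r_final, geomcg,
                                  delta_adapt=1.0, delta_stop=1e-3):
    """Rank-increase heuristic: run at fixed rank until (4.2) triggers, then
    increase the rank by one in every mode until r_final is reached."""
    r = (1,) * len(r_final)
    X = None
    while True:
        delta = delta_stop if r == r_final else delta_adapt
        # `geomcg` runs Algorithm 1 at rank r, warm-started from X, and returns
        # the iterate obtained when criterion (4.2) with tolerance delta holds.
        X = geomcg(A_Omega, Omega, shape, r, X0=X, delta=delta)
        if r == r_final:
            return X
        r = tuple(min(ri + 1, rf) for ri, rf in zip(r, r_final))
```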

4.3.1 Hyperspectral image

As a first application of our algorithm to real-world data, we consider the hyperspectral image “Ribeira” [10] discussed in [27].

The data set is a tensor of size 1017×1340×33, where each slice corresponds to an image of the same scene measured at a different wavelength. To provide a faithful comparison, we proceed in the same way as in [27] and resize the tensor to 203×268×33 by bilinear interpolation before applying our algorithm. The results are shown in Table 1. The reconstruction quality is assessed in terms of the normalized root mean squared error:

$$\mathrm{NRMSE}(\mathbf {X},\mathbf {A}) := \frac{\| \mathrm {P}_{\varOmega^c} \mathbf {A}- \mathrm {P}_{\varOmega^c} \mathbf {X}\|}{ (\max(\mathrm {P}_{\varOmega^c} \mathbf {A}) - \min(\mathrm {P}_{\varOmega^c} \mathbf {A})) \sqrt{|\varOmega^c|}}, $$

where \(\varOmega^c\) is the complement of the sampling set, that is, the set of unknown entries. We compare with the results reported in [27] for the tensor completion algorithm tensor, the frame-wise matrix completion approach frame, and matrix completion applied to the mode-3 matricization only. As shown in Fig. 7, the singular values of the matricizations decay at different rates. We take this into account in our algorithm by choosing the final mode-1 and mode-2 ranks of the approximation significantly larger than the mode-3 rank. It can be observed that our algorithm (geomCG) yields very competitive results, especially when the sampling set is small. There is one case of overfitting for geomCG(55,55,5), marked by a star.

Fig. 7
figure 8

Full hyperspectral image data set “Ribeira” scaled to size (203,268,33). Top left: Singular value decay of each matricization. Top right: The sampled tensor \(\mathrm {P}_\varOmega \mathbf {A}\) with 10 % known entries. Unknown entries are marked in black. Bottom left: Result of our algorithm with iterative increase of ranks up to a final rank of r=(15,15,6), corresponding to entry geomCG(15,15,6) in Table 1. Bottom right: Result of our algorithm with iterative increase of ranks up to a final rank of r=(65,65,7), corresponding to entry geomCG(65,65,7) in Table 1

Table 1 Reconstruction results for the “Ribeira” hyperspectral image. The results for frame, mode-3 and tensor are taken from [27]. geomCG(\(r_1,r_2,r_3\)) denotes the result of Algorithm 1 using a prescribed final multilinear rank \((r_1,r_2,r_3)\)

4.3.2 Reconstruction of function data

To investigate the applicability of our algorithm to compress tensors related to functions with singularities, we consider

$$ f: [-1,1]^3 \to \mathbb {R}, \quad x \mapsto \mathrm{e}^{-\|x\|_2} $$
(4.3)

discretized on a uniform tensor grid with mesh width h=1/100. The function values are collected in a tensor \(\mathbf {A}\in \mathbb {R}^{201 \times 201 \times 201}\). In this setting, we assume that the location of the singularity is known a priori. As f has a cusp at the origin, the information in A is strongly localized around this point, and tensor completion applied naively to A would not lead to reasonable compression. To avoid this effect, we therefore cut out a small hypercube \([-0.1,0.1]^3\), corresponding to the central 21×21×21 part of the discretized tensor. The idea is to not include this region in the sampling set Ω. The entries corresponding to this region are stored separately and reconstructed exactly after performing low-rank tensor completion on the remaining region. Consequently, we also exclude the central part from the test set Γ when verifying the accuracy of the completed tensor. The obtained results are shown in Fig. 8. Already sampling 5 % of the entries gives an accuracy of \(10^{-5}\). This would yield a compression ratio of 5.1 % if we simply stored the sampled entries. Storing the rank-(5,5,5) approximation along with the central part instead yields the significantly lower compression ratio of 0.15 %.
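
The setup of this experiment can be sketched as follows; the seed and the small tolerance used in the comparison with 0.1 are implementation details of the sketch, not of the paper.

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 201)                      # mesh width h = 1/100
x, y, z = np.meshgrid(t, t, t, indexing='ij')
A = np.exp(-np.sqrt(x**2 + y**2 + z**2))             # A in R^{201 x 201 x 201}

tol = 1e-12                                          # guard against rounding in linspace
central = (np.abs(x) <= 0.1 + tol) & (np.abs(y) <= 0.1 + tol) & (np.abs(z) <= 0.1 + tol)
print(central.sum())                                 # 21^3 = 9261 entries stored exactly

rng = np.random.default_rng(0)
mask = (rng.random(A.shape) < 0.05) & ~central       # 5 % sampling outside the cube
Omega = np.argwhere(mask)
```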

Fig. 8
figure 9

Convergence of the algorithm for the discretization of a function with cusp (4.3). The final rank of the approximation is r=(10,10,10). The part corresponding to [−0.1,0.1]3 is excluded from the sampling set Ω and the test set Γ

4.3.3 Stochastic elliptic PDE with Karhunen-Loève expansion

Finally, we consider an elliptic PDE with stochastic coefficients:

$$\begin{aligned} -\nabla \cdot \bigl( a(x,y) \nabla u(x,y) \bigr) &= f(x),\quad(x,y) \in D \times \varTheta, \\ u(x,y) &= 0,\quad\quad\ \ (x,y) \in \partial D \times \varTheta, \end{aligned}$$

where \(y \in \varTheta\) is a random variable and D=[−1,1] is the computational domain. We represent the stochastic variable y by an infinite sequence of parameters \(\alpha_\mu \in [-1,1]\) and write a(x,α) in terms of its Karhunen-Loève expansion

$$a(x,\alpha) = a_0 + \sum_{\mu=1}^\infty \sqrt{\lambda_\mu} a_\mu(x) \alpha_\mu, $$

where \(a_\mu(x)\), μ=1,2,…, are normalized \(L^2(D)\)-functions and the so-called Karhunen-Loève eigenvalues \(\lambda_\mu \geq 0\) decrease monotonically. We truncate the Karhunen-Loève expansion after μ=3 and employ a standard piecewise linear finite element (FE) discretization. This yields a parameter-dependent linear system of equations,

$$ (A_0 + \alpha_1 A_1 + \alpha_2 A_2 + \alpha_3 A_3)x = f, $$
(4.4)

where each \(A_{\mu}\in \mathbb {R}^{m\times m}\) is the FE stiffness matrix corresponding to the coefficient \(a_\mu\). We refer to, e.g., [24] for a detailed discussion of this procedure. In our examples, we choose

$$a_0(x) = 1, \qquad a_\mu(x) = \sin(\mu x). $$

The parameters α are then sampled uniformly on a tensor grid on [−1,1]×[−1,1]×[−1,1]. Assuming that we are only interested in the mean of the solution for a specific set of parameters, this results in a solution tensor \(\mathbf {X}\in \mathbb {R}^{n \times n \times n}\), where each entry requires the solution of the discretized PDE (4.4) for one combination of the discretized parameters \((\alpha_1, \alpha_2, \alpha_3)\). Hence, evaluating the full tensor is fairly expensive. Using tensor completion, we sample X at (few) randomly chosen points and try to approximate the missing entries.
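
A sketch of how such a solution tensor could be sampled entry by entry is given below; the assembled matrices A0,…,A3 and the load vector f are assumed to be provided by a 1D finite element code (with the \(\sqrt{\lambda_\mu}\) scaling absorbed into the \(A_\mu\)), and a dense solver is used purely for simplicity.

```python
import numpy as np

def sample_solution_tensor(A0, A1, A2, A3, f, alphas, Omega):
    """alphas: the n parameter values on [-1, 1] (the same grid in each mode);
    Omega: |Omega| x 3 integer array of requested multi-indices."""
    vals = np.empty(len(Omega))
    for k, (i1, i2, i3) in enumerate(Omega):
        A_alpha = A0 + alphas[i1] * A1 + alphas[i2] * A2 + alphas[i3] * A3
        vals[k] = np.linalg.solve(A_alpha, f).mean()   # mean of the FE solution
    return vals

# e.g. alphas = np.linspace(-1.0, 1.0, 100) for the uniform parameter grid used above
```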

In Fig. 9 we show the results of this approach for m=50, n=100, and two different choices of \(\lambda_\mu\). We used the Karhunen-Loève eigenvalues \(\sqrt{\lambda_{\mu}} = 5 \exp(-2 \mu)\) and \(\sqrt{\lambda_{\mu}} = (1+\mu)^{-2}\), respectively. As the second choice results in a slower singular value decay, our algorithm requires more iterations to attain the same accuracy. Using 5 % of the tensor as the sampling set is in both cases sufficient to recover the original tensor to good precision. As the sampling set gets smaller, overfitting of the sampling data is more likely to occur, especially for the second choice of \(\lambda_\mu\).

Fig. 9
figure 10

Convergence for the solution tensor of size n=100 of a parametrized linear system obtained from a discretized stochastic PDE. Left: Result for Karhunen-Loève eigenvalues \(\sqrt{\lambda_{\mu}} = 5 \exp(-2 \mu)\). Right: Result for Karhunen-Loève eigenvalues \(\sqrt{\lambda_{\mu}} = (1+\mu)^{-2}\). The final rank of the solution is r=(4,4,4)

5 Conclusions

We have shown that the framework of Riemannian optimization yields a very effective nonlinear CG method for performing tensor completion. Such a method has also been suggested in [8]. One of the main contributions of this paper is a careful discussion of the algorithmic and implementation details, showing that the method scales well to large data sets and is competitive with existing methods for tensor completion. On the theoretical side, we have proven that the HOSVD satisfies the properties of a retraction and discussed the convergence properties of the nonlinear CG method.

The numerical experiments indicate the usefulness of tensor completion not only for data-related but also for function-related tensors. We feel that this aspect merits further exploration. To handle high-dimensional applications, the approach considered in this paper needs to be extended to other SVD-based low-rank tensor formats, such as the tensor train and the hierarchical Tucker formats, see also [8].