1 Introduction

The numerical solution of high-dimensional partial differential equations remains one of the most challenging tasks in numerical mathematics. A naive discretisation based on well-established methods for solving PDEs numerically, such as finite differences, finite elements or spectral elements, suffers severely from the so-called curse of dimensionality. This notion refers to the exponential scaling \( \mathscr {O} ( n^d)\) of the computational complexity with respect to the dimension d of the discretised domain. For example, if \(d=10\) and we consider \(n=100\) basis functions in each coordinate direction, this leads to a discretisation space of dimension \(100^{10}\). Even for low-dimensional univariate spaces, e.g. \(n=2\), but with \(d=500\), one has to deal with a space of dimension \(2^{500}\). It is therefore clear that one needs to find additional structures to design tractable methods for such large-scale problems.

Many established methods for large-scale problems rely on the framework of sparse and nonlinear approximation theory in certain dictionaries [40]. These dictionaries are fixed in advance, and their appropriate choice is crucial. Low-rank approximation can be regarded as a related approach, but with the dictionary consisting of general separable functions—going back to one of the oldest ideas in applied mathematics, namely separation of variables. As this dictionary is uncountably infinite, the actual basis functions used for a given problem have to be computed adaptively.

On the level of matrix or bivariate approximation, the singular value decomposition (SVD) provides a tool to find such problem-adapted, separable basis functions. Related concepts underlie model order reduction techniques such as proper orthogonal decompositions and reduced bases [119]. In fact, one might say that low-rank matrix approximation is one of the most versatile concepts in computational sciences. Generalising these principles to higher-order tensors has proven to be a promising, yet nontrivial, way to tackle high-dimensional problems and multivariate functions [17, 65, 87]. This article presents a survey of low-rank tensor techniques from the perspective of hierarchical tensors, and complements former review articles [63, 67, 70, 87] with novel aspects. A more detailed review of tensor networks for signal processing and big data applications, with detailed explanations and visualisations for all prominent low-rank tensor formats can be found in [27]. For an exhaustive treatment, we also recommend the monograph [65].

Regarding low-rank decomposition, the transition from linear to multilinear algebra is not as straightforward and harmless as one might expect. The canonical polyadic format [76] represents a tensor \({\mathbf {u}}\) of order d as a sum of elementary tensor products, or rank-one tensors,

$$\begin{aligned} {\mathbf {u}} ( i_1 , \ldots , i_d ) = \sum _{k=1}^r {\mathbf {C}}^1_k (i_1) \cdots {\mathbf {C}}^d_k (i_d ), \quad i_\mu = 1, \ldots , n_\mu , \quad \mu =1 , \ldots , d, \end{aligned}$$
(1.1)

with \( {\mathbf {C}}^\mu _k \in {\mathbb {R}}^{n_\mu }\). For tensors of order two, the CP format simply represents the factorisation of a rank-r matrix, and therefore is a natural representation for higher-order tensors as well. Correspondingly, the minimal r required in (1.1) is called the canonical rank of \({\mathbf {u}}\).

If r is small, the CP representation (1.1) is extremely data-sparse. From the perspective of numerical analysis, however, it turns out to have several disadvantages in case \(d > 2\). For example, the set of tensors of canonical rank at most r is not closed [127]. This is reflected by the fact that for most optimisation problems involving tensors of low CP rank, no robust methods exist. For further results concerning difficulties with the CP representation and rank of higher-order tensors, we refer to [65, 75, 127, 136], and highlight the concise overview [100]. Many of these issues have also been investigated from the perspective of algebraic geometry, see the monograph [95].

The present article is intended to provide an introduction to and a survey of a somewhat different route. Instead of directly extending matrix techniques to analogous notions for tensors, the strategy here is to reduce questions of tensor approximation to matrix analysis. This can be accomplished by the hierarchical tensor (HT) format, introduced by Hackbusch and Kühn [69], and the tensor train (TT) format, developed by Oseledets and Tyrtyshnikov [110, 111, 113, 114]. They provide alternative data-sparse tensor decompositions with stability properties comparable to the SVD in the matrix case, and can be regarded as multi-level versions of the Tucker format [87, 133]. Whereas the data complexity of the Tucker format intrinsically suffers from an exponential scaling with respect to dimensionality, the HT and TT formats have the potential of bringing this down to a linear scaling, as long as the ranks are moderate. This compromise between numerical stability and potential data sparsity makes the HT and TT formats promising model classes for representing and approximating tensors.

However, circumventing the curse of dimensionality by introducing a nonlinear (here: multilinear) parameterisation comes at the price of introducing a curse of nonlinearity, or more precisely, a curse of non-convexity. Our model class of low-rank hierarchical tensors is neither a linear space nor a convex set. Therefore, it becomes notoriously difficult to find globally optimal solutions to approximation problems, and first-order optimality conditions remain local. In principle, the explicit multilinear representation of hierarchical tensors is amenable to block optimisation techniques like variants of the celebrated alternating least squares method, e.g. [17, 25, 34, 38, 44, 72, 78, 87, 91, 112, 132, 146], but their convergence analysis is typically a challenging task as the multilinear structure does not meet classical textbook assumptions on block optimisation. Another class of local optimisation algorithms can be designed using the fact that, at least for fixed rank parameters, the model class is a smooth embedded manifold in tensor space, and explicit descriptions of its tangent space are available [4, 35, 71, 79, 86, 89, 104, 105, 138, 139]. However, here one is facing the technical difficulty that this manifold is not closed: its closure only constitutes an algebraic variety [123].

An important tool available for hierarchical tensor representation is the hierarchical singular value decomposition (HSVD) [60], as it can be used to find a quasi-best low-rank approximation using only matrix procedures with full error control. The HSVD is an extension of the higher-order singular value decomposition [39] to different types of hierarchical tensor models, including TT [65, 109, 113]. This enables the construction of iterative methods based on low-rank truncations of iterates, such as tensor variants of iterative singular value thresholding algorithms [7, 10, 18, 68, 85, 90].

Historically, the parameterisation in a hierarchical tensor framework has evolved independently in the quantum physics community, in the form of renormalisation group ideas [55, 142], and more explicitly in the framework of matrix product and tensor network states [124], including the HSVD for matrix product states [140]. A further independent source of such developments can also be found in quantum dynamics, with the multilayer multi-configurational time-dependent Hartree (MCTDH) method [15, 102, 141]. We refer the interested reader to the survey articles [63, 99, 130] and to the monograph [65].

Although the resulting tensor representations have been used in different contexts, the perspective of hierarchical subspace approximation in [69] and [65] seems to be completely new. Here, we would like to outline how this concept enables one to overcome most of the difficulties with the parameterisation by the canonical format. Most of the important properties of hierarchical tensors can easily be deduced from the underlying very basic definitions. For a more detailed analysis, we refer to the respective original papers. We do not aim to give a complete treatment, but rather to demonstrate the potential of hierarchical low-rank tensor representations from their basic principles. They provide a universal and versatile tool, with basic algorithms that are relatively simple to implement (requiring only basic linear algebra operations) and easily adaptable to various different settings.

An application of hierarchical tensors of particular interest, on which we focus here, is the treatment of high-dimensional partial differential equations. In this article, we will consider three major examples in further detail: PDEs depending on countably many parameters, which arise in particular in deterministic formulations of stochastic problems; the many-particle Schrödinger equation in quantum physics; and the Fokker–Planck equation describing a mechanical system in a stochastic environment. A further example of practical importance is the chemical master equation, for which we refer to [41, 42].

This article is arranged as follows: Sect. 2 covers basic notions of low-rank expansions and tensor networks. In Sect. 3 we consider subspace-based representations and basic properties of hierarchical tensor representations, which play a role in the algorithms using fixed hierarchical ranks discussed in Sect. 4. In Sect. 5, we turn to questions of convergence of hierarchical tensor approximations with respect to the ranks, and consider thresholding algorithms operating on representations of variable ranks in Sect. 6. Finally, in Sect. 7, we describe in more detail the mentioned applications to high-dimensional PDEs.

2 Tensor Product Parameterisation

In this section, we consider basic notions of low-rank tensor formats and tensor networks and how linear algebra operations can be carried out on such representations.

2.1 Tensor Product Spaces and Multivariate Functions

We start with some preliminaries. In this paper, we consider the d-fold topological tensor product

$$\begin{aligned} {\mathscr {V}} = \bigotimes _{\mu =1}^d {\mathscr {V}}_\mu \end{aligned}$$
(2.1)

of separable \({\mathbb {K}}\)-Hilbert spaces \({\mathscr {V}}_1,\dots ,{\mathscr {V}}_d\). For concreteness, we will focus on the real field \({\mathbb {K}} = {\mathbb {R}}\), although many parts are easy to extend to the complex field \({\mathbb {K}} = {\mathbb {C}}\). The confinement to Hilbert spaces constitutes a certain restriction, but still covers a broad range of applications. The topological difficulties that arise in a general Banach space setting are beyond the scope of the present paper, see [52, 65]. Avoiding them allows us to put clearer focus on the numerical aspect of tensor product approximation.

We do not give the precise definition of the topological tensor product of Hilbert spaces in (2.1) (see, e.g. [65]), but only recall the properties necessary for our later purposes. Let \(n_\mu \in {\mathbb {N}} \cup \{\infty \}\) be the dimension of \({\mathscr {V}}_\mu \). We set

$$\begin{aligned} {\mathscr {I}}_\mu = {\left\{ \begin{array}{ll} \{1,\dots ,n_\mu \}, \quad &{}\text {if } n_\mu < \infty , \\ {\mathbb {N}}, \quad &{}\text {else,} \end{array}\right. } \end{aligned}$$
(2.2)

and \({\mathscr {I}} = {\mathscr {I}}_1 \times \cdots \times {\mathscr {I}}_d\). Fixing an orthonormal basis \(\{ {\mathbf {e}}^\mu _{i_\mu } {:} i_\mu \in {\mathscr {I}}_\mu \}\) for each \({\mathscr {V}}_\mu \), we obtain a unitary isomorphism \(\varphi ^\mu :\ell ^2({\mathscr {I}}_\mu ) \rightarrow {\mathscr {V}}_\mu \) by

$$\begin{aligned} \varphi ^\mu ({\mathbf {c}}) {:=} \sum _{i\in {\mathscr {I}}_\mu } {\mathbf {c}}(i)\, {\mathbf {e}}^\mu _i ,\quad {\mathbf {c}} \in \ell ^2({\mathscr {I}}_\mu ). \end{aligned}$$

Then \(\{ {\mathbf {e}}^1_{i_1} \otimes \dots \otimes {\mathbf {e}}^d_{i_d}:(i_1, \ldots ,i_d) \in {\mathscr {I}} \} \) is an orthonormal basis of \({\mathscr {V}}\), and

$$\begin{aligned} \varPhi {:=} \varphi ^1 \otimes \cdots \otimes \varphi ^d \end{aligned}$$

is a unitary isomorphism from \(\ell ^2({\mathscr {I}})\) to \({\mathscr {V}}\).

Such a fixed choice of orthonormal basis allows us to identify the elements of \({\mathscr {V}}\) with their coefficient tensors \({\mathbf {u}} \in \ell ^2({\mathscr {I}})\),

$$\begin{aligned} (i_1, \ldots , i_d ) \mapsto {\mathbf {u}} (i_1, \ldots ,i_d ) \in {\mathbb {R}} \quad i_\mu \in {\mathscr {I}}_\mu ; \ \ \mu =1, \ldots , d, \end{aligned}$$
(2.3)

often called hypermatrices, depending on discrete variables \(i_\mu \), usually called indices.

In conclusion, we will focus in the following on the space \(\ell ^2({\mathscr {I}})\), which is itself a tensor product of Hilbert spaces, namely,

$$\begin{aligned} \ell ^2({\mathscr {I}}) = \ell ^2({\mathscr {I}}_1) \otimes \dots \otimes \ell ^2({\mathscr {I}}_d). \end{aligned}$$
(2.4)

The corresponding multilinear tensor product map of d univariate \(\ell ^2\)-functions is defined pointwise as \( ( {\mathbf {u}}^1 \otimes \dots \otimes {\mathbf {u}}^d )(i_1,\dots ,i_d) = {\mathbf {u}}^1(i_1) \cdots {\mathbf {u}}^d(i_d)\). Tensors of this form are called elementary tensors or rank-one tensors. Also the terminology decomposable tensors is used in differential geometry.
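To make this concrete, the following NumPy sketch (with purely illustrative dimensions and variable names) assembles an elementary tensor of order four by successive outer products; already its full array of values has \(n^d\) entries.

```python
import numpy as np

# Minimal sketch: an elementary (rank-one) tensor of order d = 4,
# (u^1 ⊗ ... ⊗ u^d)(i_1,...,i_d) = u^1(i_1) ... u^d(i_d).
d, n = 4, 10
factors = [np.random.rand(n) for _ in range(d)]

u = factors[0]
for f in factors[1:]:
    u = np.multiply.outer(u, f)          # pointwise product definition of the tensor product

print(u.shape)                           # (10, 10, 10, 10): full storage scales as n**d
i = (1, 2, 3, 4)
print(np.isclose(u[i], np.prod([f[j] for f, j in zip(factors, i)])))   # True
```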

Let \( n= \max \{n_\mu {:} \mu =1, \ldots , d \} \) be the maximum dimension among the \({\mathscr {V}}_\mu \). Then the number of possibly nonzero entries in a pointwise representation (2.3) of \(\mathbf {u} \) is \( n_1 \cdots n_d = {\mathscr {O}} (n^d) \). This exponential scaling with respect to d is one aspect of what is referred to as curse of dimensionality, and poses a common challenge for the discretisation of the previously mentioned examples of high-dimensional PDEs. In the present paper we consider methods that aim to circumvent this core issue of high-dimensional problems using low-rank tensor decomposition. In very abstract terms, all low-rank tensor decompositions considered below ultimately decompose the tensor \({\mathbf {u}} \in \ell ^2({\mathscr {I}})\) such that

$$\begin{aligned} {\mathbf {u}}(i_1,\dots ,i_d) = \tau ({\mathbf {C}}^1(i_1),\dots ,{\mathbf {C}}^d(i_d),{\mathbf {C}}^{d+1},\dots ,{\mathbf {C}}^D), \end{aligned}$$
(2.5)

where \(\tau {:} {\mathscr {W}} {:=} {\mathscr {W}}_1 \times \dots \times {\mathscr {W}}_d \times {\mathscr {W}}_{d+1} \times \dots \times {\mathscr {W}}_D \rightarrow {\mathbb {R}}\) is multilinear on a Cartesian product of vector spaces \({\mathscr {W}}_\nu \), \(\nu =1,\dots ,D\). The choice of these vector spaces and the map \(\tau \) determines the format, and the tensors in its range are considered as “low-rank.” An example is the CP representation (1.1).

Remark 2.1

Since \(\varPhi \) is multilinear as well, we obtain representations of the very same multilinear structure (2.5) for the corresponding elements of \({\mathscr {V}}\),

$$\begin{aligned} \varPhi ({\mathbf {u}}) = (\varphi ^1\otimes \cdots \otimes \varphi ^d) \bigl ( (i_1,\ldots ,i_d)\mapsto \tau ({\mathbf {C}}^1(i_1),\dots ,{\mathbf {C}}^d(i_d),{\mathbf {C}}^{d+1}, \dots ,{\mathbf {C}}^D) \bigr ). \end{aligned}$$

For instance, if \({\mathscr {V}}\) is a function space on a tensor product domain \(\varOmega =\varOmega _1\times \cdots \times \varOmega _d\) on which point evaluation is defined, and \(({\mathbf {e}}^1 \otimes \cdots \otimes {\mathbf {e}}^d)(x_1,\dots ,x_d) = {\mathbf {e}}^1(x_1) \cdots {\mathbf {e}}^d(x_d)\) for \(x \in \varOmega \), then formally (dispensing for the moment with possible convergence issues), exploiting the multilinearity properties, we obtain

$$\begin{aligned} \varPhi ({\mathbf {u}})(x_1,\ldots ,x_d)&= \tau \left( \sum _{i_1 =1}^{n_1} {\mathbf {e}}^1_{i_1}(x_1) {\mathbf {C}}^1(i_1), \dots , \sum _{i_d =1}^{n_d} {\mathbf {e}}^d_{i_d}(x_d) {\mathbf {C}}^d(i_d),{\mathbf {C}}^{d+1},\dots ,{\mathbf {C}}^D \right) \\&= \tau (\varphi ^1({\mathbf {C}}^1)(x_1), \dots ,\varphi ^d({{\mathbf {C}}}^d)(x_d),{{\mathbf {C}}}^{d+1},\dots ,{{\mathbf {C}}}^D), \end{aligned}$$

and the same applies to other tensor product functionals on \({\mathscr {V}}\). Since in the present case of Hilbert spaces (2.1), the identification with \(\ell ^2({\mathscr {I}})\) via \(\varPhi \) thus also preserves the considered low-rank structures, we exclusively work on basis representations in \(\ell ^2({\mathscr {I}})\) in what follows.

2.2 The Canonical Tensor Format

The canonical tensor format, also called CP (canonical polyadic) decomposition, CANDECOMP or PARAFAC, represents a tensor of order d as a sum of elementary tensor products \( {\mathbf {u}} = \sum _{k=1}^{r} \mathbf{c}_k^1 \otimes \cdots \otimes \mathbf{c}^d_k, \) that is

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d) = \sum _{k=1}^{r} \mathbf {C}^1(i_1,k) \cdots \mathbf {C}^d(i_d,k), \end{aligned}$$
(2.6)

with \({\mathbf {c}}^\mu _k = \mathbf {C}^\mu (\cdot ,k) \in \ell ^2 ({\mathscr {I}}_\mu ) \) [25, 72, 76]. The minimal r such that such a decomposition exists is called the canonical rank (or simply rank) of the tensor \(\mathbf {u}\). It can be infinite.

Depending on the rank, the representation in the canonical tensor format has a potentially extremely low complexity. Namely, it requires at most r d n nonzero entries, where \(n = \max \left|{\mathscr {I}}_\mu \right|\). Another key feature (in the case \(d >2\)) is that the decomposition (2.6) is essentially unique under relatively mild conditions (assuming that r equals the rank). This property is a main reason for the prominent role that the canonical tensor decomposition plays in signal processing and data analysis, see [28, 87, 93] and references therein.
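As an illustration of this data sparsity, the following sketch (dimensions, names and the use of NumPy are our own choices, not part of the format itself) stores only the factor matrices of a CP representation and evaluates single entries from them.

```python
import numpy as np

# Sketch of the CP representation (2.6): u(i_1,...,i_d) = sum_k C^1(i_1,k) ... C^d(i_d,k).
d, n, r = 4, 10, 3
C = [np.random.rand(n, r) for _ in range(d)]        # factor matrices C^mu of size n_mu x r

def cp_entry(C, idx):
    """Evaluate u(i_1,...,i_d) in O(d*r) operations from the factors alone."""
    prod = np.ones(C[0].shape[1])
    for Cmu, i in zip(C, idx):
        prod *= Cmu[i, :]
    return prod.sum()

u_full = np.einsum('ak,bk,ck,dk->abcd', *C)         # full tensor, only feasible for small d, n
print(np.isclose(u_full[1, 2, 3, 4], cp_entry(C, (1, 2, 3, 4))))      # True
print(d * n * r, "stored parameters instead of", n**d)
```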

In view of the applications to high-dimensional partial differential equations, one can observe that the involved operator can usually be represented in the form of a canonical tensor operator, and the right-hand side is also very often of the form above. This implies that the operator and the right-hand sides can be stored in this data-sparse representation. This motivates the basic assumption in numerical tensor calculus that all input data can be represented in a sparse tensor form. Then there is a reasonable hope that the solution of such a high-dimensional PDE might also be approximated by a tensor in the canonical tensor format with moderate r. The precise justification for this hope is subject to ongoing research (see [36] for a recent approach and further references), but many known numerical solutions obtained by tensor product ansatz functions such as (trigonometric) polynomials, sparse grids, or Gaussian kernels are in fact low-rank approximations, mostly in the canonical format. However, the key idea in nonlinear low-rank approximation is to not fix possible basis functions in (2.6) in advance. Then we have an extremely large library of functions at our disposal. Motivated by the seminal papers [16, 17], we will pursue this idea throughout the present article.

From a theoretical viewpoint, the canonical tensor representation (2.6) may be seen as a straightforward generalisation of low-rank matrix representation, and coincides with it when \(d=2\). As it turns out, however, the parameterisation of tensors via the canonical representation (2.6) is not as harmless as it seems to be. For example, for \(d>2\), the following difficulties appear:

  • The canonical tensor rank is (in case of finite-dimensional spaces) NP-hard to compute [75].

  • The set of tensors of the above form with canonical rank at most r is not closed when \(r \ge 2\) (border rank problem). As a consequence, a best approximation of a tensor by one of smaller canonical rank may not exist; see [127]. This is in strong contrast to the matrix case \(d=2\), see Sect. 3.1.

  • In fact, the set of tensors of rank at most r does not form an algebraic variety [95].

Further surprising and fascinating difficulties with the canonical tensor rank in case \(d>2\) with references are listed in [87, 94, 100, 136]. Deep results of algebraic geometry have been invoked for the investigation of these problems, see the monograph [95] for the state of the art. The problem of non-closedness can often be mitigated by imposing further conditions such as symmetry [95], nonnegativity [101] or norm bounds on factors [127].

In this paper we show a way to avoid all these difficulties by considering another type of low-rank representation, namely the hierarchical tensor representation [65], but at the price of a slightly higher computational and conceptual complexity. Roughly speaking, the principle of hierarchical tensor representations is to reduce the treatment of higher-order tensors to matrix analysis.

2.3 Tensor Networks

For fixed r, the canonical tensor format (2.6) is multilinear with respect to every matrix \({\mathbf {C}}^\mu {:=} \bigl ({\mathbf {C}}^\mu (i,k)\bigr )_{{i\in {\mathscr {I}}_\mu ,\,k=1,\ldots ,r}}\). A generalised concept of low-rank tensor formats is obtained by considering classes of tensors which are images of more general multilinear parameterisations. A very general form is

$$\begin{aligned} {\mathbf {u}} (i_1,\dots ,i_d) = \sum _{k_1 =1}^{r_1} \cdots \sum _{k_E =1}^{r_E} \prod _{\nu =1}^D {\mathbf {C}}^\nu (i_1,\dots ,i_d, k_1, \ldots , k_E ) \end{aligned}$$
(2.7)

with an arbitrary number D of components \({\mathbf {C}}^\nu (i_1,\dots ,i_d, k_1, \ldots , k_E)\) that potentially depend on all variables \(i_1,\dots ,i_d\) and additional contraction variables \(k_1,\dots ,k_E\). For clarity, we will call the indices \(i_\mu \) physical variables. Again, we can regard \({\mathbf {u}}\) as an element of the image of a multilinear map \(\tau _{\mathfrak {r}}\),

$$\begin{aligned} {\mathbf {u}} = \tau _{\mathfrak {r}}({\mathbf {C}}^1,\dots ,{\mathbf {C}}^D), \end{aligned}$$
(2.8)

parametrising a certain class of “low-rank” tensors. By \(\mathfrak {r} = (r_1,\dots ,r_E)\) we indicate that this map \(\tau \) depends on the representation ranks \(r_1,\dots ,r_E\).

The disadvantage compared to the canonical format is that the component tensors have order \(e_\nu \) instead of 2, where \(1 \le e_\nu \le d+E\) is the number of (contraction and physical) variables in \({\mathbf {C}}^\nu \) which are actually active.Footnote 1 In cases of interest introduced below (like the HT or TT format), this number is small, say at most three. More precisely, let p be a bound for the number of active contraction variables and q a bound for the number of active physical variables per component. If also \(n_\mu \le n\) for all \(\mu \) and \(r_\eta \le r\) for all \(\eta \), then the data complexity of the format (2.7) is bounded by \( D n^q r^{p}\). Computationally efficient representations of multivariate functions arise by bounding p, q and r.

Without further restrictions, the low-rank formats (2.7) form too general a class. An extremely useful subclass is given by tensor networks.

Definition 2.2

We call the multilinear parameterization (2.7) a tensor network, if

  1. (i)

    each physical variable \(i_\mu \), \(\mu =1, \ldots , d\), is active in exactly one component \({\mathbf {C}}^\nu \);

  2. (ii)

    each contraction variable \(k_\eta \), \(\eta = 1,\dots ,E\), is active in precisely two components \({\mathbf {C}}^\nu \), \(1 \le \nu \le D\),

see also [49, 124] and the references given there.

We note that the canonical tensor format (2.6) is not a tensor network, since the contraction variable k relates to all physical variables \(i_\mu \).

The main feature of a tensor network is that it can be visualised as a graph with D nodes, representing the components \({\mathbf {C}}^\nu \), \(\nu = 1,\dots ,D\), and E edges connecting those nodes with common active contraction variable \(k_\eta \), \(\eta = 1,\dots ,E\). In this way, the edges represent the summations over the corresponding contraction variables in (2.7). Among all nodes, the ones in which a physical variable \(i_\mu \), \(\mu = 1,\dots ,d\), is active play a special role and get an additional label, which in our pictures will be depicted by an additional open edge connected to the node. The graphical visualisation of tensor networks is a useful and versatile tool for describing decompositions of multivariate functions (tensors) into nested summations over contraction variables. This will be illustrated in the remainder of this section.

Plain vectors, matrices and higher-order tensors are trivial examples of tensor networks, since they contain no contraction variable at all. A vector \(i \mapsto \mathbf {u}(i)\) is a node with single edge i, a matrix \((i_1,i_2) \mapsto \mathbf {u}(i_1,i_2)\) is a node with two edges \(i_1,i_2\), and a d-th-order tensor is a node with d edges connected to it:

[Diagram: a vector as a node with one open edge, a matrix as a node with two open edges, and a d-th-order tensor as a node with d open edges]

The simplest nontrivial examples of tensor networks are given by low-rank matrix decompositions like \(\mathbf {A} = \mathbf {U} \mathbf {V}^T\) or \(\mathbf {A} = \mathbf {U} \varvec{\varSigma } \mathbf {V}^T\), the latter containing a node with no physical variable:

[Diagram: the two-node network for \(\mathbf {A} = \mathbf {U} \mathbf {V}^T\) and the three-node network for \(\mathbf {A} = \mathbf {U} \varvec{\varSigma } \mathbf {V}^T\)]

Note that these graphs do not show which physical variables belong to which open edge. To emphasise a concrete choice one has to attach the labels \(i_\mu \) explicitly. Let us consider a more detailed example, in which we also attach contraction variables to the edges:

[Diagram: a tensor network on four nodes \(\mathbf {C}^1,\dots ,\mathbf {C}^4\) with open edges labelled by the physical variables \(i_1,\dots ,i_4\) and internal edges labelled by the contraction variables \(k_1,\dots ,k_4\)]

The graph on the right side represents the tensor network

$$\begin{aligned} \mathbf {u}(i_1,i_2,i_3,i_4) = \sum _{k_1=1}^{r_1} \sum _{k_2 = 1}^{r_2} \mathbf {C}^1(i_3,k_1,k_2) \sum _{k_3 = 1}^{r_3} \mathbf {C}^2(i_1,k_2,k_3) \sum _{k_4 = 1}^{r_4} \mathbf {C}^3(k_1,k_3,k_4)\, \mathbf {C}^4(i_2,i_4,k_4). \end{aligned}$$

Note that node \(\mathbf {C}^3\) depends on no physical variable, while \(\mathbf {C}^4\) depends on two. The sums have been nested to illustrate how the contractions are performed efficiently in practice by following a path in the graph.
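The same contraction can be carried out numerically, for instance with np.einsum, which takes over the bookkeeping of the nested sums; the dimensions below are chosen only for illustration.

```python
import numpy as np

n1, n2, n3, n4 = 4, 5, 6, 7
r1, r2, r3, r4 = 2, 3, 2, 3
C1 = np.random.rand(n3, r1, r2)      # C^1(i_3, k_1, k_2)
C2 = np.random.rand(n1, r2, r3)      # C^2(i_1, k_2, k_3)
C3 = np.random.rand(r1, r3, r4)      # C^3(k_1, k_3, k_4): no physical variable
C4 = np.random.rand(n2, n4, r4)      # C^4(i_2, i_4, k_4): two physical variables

# einsum performs the nested summations over k_1,...,k_4; an optimized contraction
# order corresponds to following a path through the graph.
u = np.einsum('yab,wbc,ace,xze->wxyz', C1, C2, C3, C4, optimize=True)
print(u.shape)                       # (n1, n2, n3, n4)
```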

As a further example, we illustrate how to contract and decontract a tensor of order \(d=4\) by a rank-r matrix decomposition that separates physical variables \((i_1,i_2)\) from \((i_3,i_4)\), using, e.g. SVD:

[Diagram: an order-four tensor split by an SVD into two nodes \(\mathbf {C}^1\) and \(\mathbf {C}^2\), carrying the physical variables \((i_1,i_2)\) and \((i_3,i_4)\) and connected by one contraction edge k]
$$\begin{aligned} {\mathbf {u}}(i_1,i_2,i_3,i_4) = \sum _{k=1}^r {\mathbf {C}}^1 (i_1,i_2,k)\, {\mathbf {C}}^2 (i_3,i_4,k). \end{aligned}$$
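In NumPy terms, this contraction amounts to a reshape followed by an SVD of the resulting matrix; the following sketch uses illustrative sizes and keeps the full numerical rank, so that the reconstruction is exact.

```python
import numpy as np

n = 5
u = np.random.rand(n, n, n, n)

M = u.reshape(n * n, n * n)                  # rows indexed by (i_1,i_2), columns by (i_3,i_4)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = int(np.sum(s > 1e-12))                   # numerical rank of this matricisation

C1 = (U[:, :r] * s[:r]).reshape(n, n, r)     # C^1(i_1,i_2,k), singular values absorbed here
C2 = Vt[:r, :].T.reshape(n, n, r)            # C^2(i_3,i_4,k)

print(np.allclose(u, np.einsum('abk,cdk->abcd', C1, C2)))   # True
```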

Diagrammatic representations of a similar kind are also common in quantum physics for keeping track of summations, for instance Feynman and Goldstone diagrams.

Remark 2.3

The representation of tensor networks by graphs can be formalised using the following definition as an alternative to Definition 2.2. A tensor network is a tuple \((G,E,\iota ,\mathfrak {r})\), where \((G,E)\) is a connected graph with nodes G and edges E, \(\iota {:} \{1,\dots ,d\} \rightarrow G\) is an assignment of the physical variables, and \(\mathfrak {r} {:} E \rightarrow {\mathbb {N}}\) are weights on the edges indicating the representation ranks for the contraction variables.

Remark 2.4

Tensor networks are related to statistical networks such as hidden Markov models, Bayesian belief networks, latent tree models, and sum-product networks. However, due to the probabilistic interpretation the components need to satisfy further constraints to ensure nonnegativity and appropriate normalisation. For further details on latent tree networks, we refer the reader to the recent monograph [149].

2.4 Tree Tensor Networks

The main subject of this article are tensor networks of a special type, namely those with a tree structure.

Definition 2.5

A tensor network is called a tree tensor network if its graph is a tree, that is, contains no loops.

Among the general tensor networks, the tree tensor networks have favourable topological properties that make them more amenable to numerical utilisation. For instance, tensors representable in tree networks with fixed rank bounds \(r_\nu \) form closed sets, and the ranks have a clear interpretation as matrix ranks, as will be explained in Sect. 3. In contrast, it has been shown that the set of tensors represented by a tensor network whose graph has closed loops is not closed in the Zariski sense [96]. In fact, there is no evidence that the more general tensor network parameterisations with loops do not suffer from similar problems as the canonical tensor format, which is not even a tensor network. In the following, we therefore restrict ourselves to tree tensor networks.

Some of the most frequent examples of tree tensor networks are the Tucker, hierarchical Tucker (HT), and tensor train (TT) formats. In case \(d=4\), they are represented by the tensor networks depicted in Fig. 1 and will be treated in detail in Sect. 3. By allowing large enough representation ranks, it is always possible to represent a d-th-order tensor in any of these formats, but the required values \(r = \max r_\eta \) can differ substantially depending on the choice of format. Let again p be a bound for the number of connected (active) contraction variables for a node, and \(n = \max n_\mu \). The three mentioned formats have storage complexity \(O(dnr^p)\). A potential disadvantage of the Tucker format is that \(p = d\), which implies a curse of dimension for large d. In contrast, \(p = 3\) in HT, and \(p=2\) in TT.

Fig. 1 Important examples of tree tensor networks. a Tucker format. b Hierarchical Tucker (HT) format. c Tensor train (TT) format

2.5 Linear Algebra Operations

For two tensors given in the same tree tensor network representation it is easy to perform standard linear algebra operations, such as summation, Hadamard (pointwise) product, and inner products. Also the application of a linear operator to such a tensor can be performed in this representation if the operator is in a “compatible” form.

For instance, a matrix-vector product \({\mathbf {b}} = {\mathbf {A}} {\mathbf {u}}\) results in a vector, and is obtained as a single contraction \({\mathbf {b}}(i) = \sum _{k=1}^n {\mathbf {A}}(i,k) {\mathbf {u}}(k)\). Hence it has the following tensor network representation.

[Diagram: the matrix \(\mathbf {A}\) and the vector \(\mathbf {u}\) as two nodes joined by the contraction edge k, with one open edge i for the result \(\mathbf {b}\)]

As a next example, consider a fourth-order tensor represented in the TT format (see Fig. 1; Sect. 3.4):

[Diagram: a fourth-order tensor in TT format, a chain of four nodes \(\mathbf {G}^1,\dots ,\mathbf {G}^4\) with open edges \(i_1,\dots ,i_4\)]
$$\begin{aligned} {\mathbf {u}} (i_1, i_2,i_3, i_4) = \sum _{k_1=1}^{r_1} \sum _{k_2 = 1}^{r_2} \sum _{k_3=1}^{r_3} \mathbf {G}^1 (i_1 ,k_1 ) \mathbf {G}^2 (k_1 , i_2 , k_2) \mathbf {G}^3 (k_2 , i_3 , k_3) \mathbf {G}^4 ( k_3, i_4 ) \end{aligned}$$

Note that the ordering of physical and contraction variables in the components \(\mathbf {G}^\mu \) was adjusted to follow the linear structure of the tree. A linear operator \(\mathbf {A}\) in TT format has the following form:

[Diagram: a linear operator \(\mathbf {A}\) in TT format, a chain of four nodes each carrying two open physical edges]

The application of \(\mathbf {A}\) to \(\mathbf {u}\) is illustrated by:

[Diagram: the network of \(\mathbf {A}\) stacked on top of the network of \(\mathbf {u}\), joined along the physical edges \(i_1,\dots ,i_4\)]

Summing over connected edges \(i_\mu \) related to physical variables results again in a TT tensor \(\tilde{\mathbf {u}} = \mathbf {A} \mathbf {u}\):

[Diagram: the contracted result, again a chain of four nodes, i.e. a TT tensor with open edges \(j_1,\dots ,j_4\)]
$$\begin{aligned} \tilde{\mathbf {u}} (j_1, j_2,j_3, j_4) = \sum _{\tilde{k}_1=1}^{\tilde{r}_1} \sum _{\tilde{k}_2 = 1}^{\tilde{r}_2} \sum _{\tilde{k}_3=1}^{\tilde{r}_3} \tilde{\mathbf {G}}^1 (j_1 ,\tilde{k}_1 ) \tilde{\mathbf {G}}^2 (\tilde{k}_1 , j_2 , \tilde{k}_2) \tilde{\mathbf {G}}^3 (\tilde{k}_2 , j_3 , \tilde{k}_3) \tilde{\mathbf {G}}^4 ( \tilde{k}_3, j_4 ) . \end{aligned}$$

The new ranges \(\tilde{r}_\eta \) for \(\tilde{k}_\eta \) are bounded by \(\tilde{r}_\eta \le r_\eta s_\eta \), where \(s_\eta \) denote the representation ranks of the operator \(\mathbf {A}\) in TT format and \(r_\eta \) are the initial ranks of \(\mathbf {u}\). It can be seen that the overall complexity of computing \(\mathbf {A} \mathbf {u}\) is linear in d, quadratic in n and polynomial in the ranks.
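A core-wise implementation of this operation is short. The following sketch assumes a common indexing convention with boundary ranks of size one (so all cores are stored as three- respectively four-dimensional arrays), which differs slightly from the notation of Sect. 3.4 but realises the same contraction.

```python
import numpy as np

# Apply an operator given in TT form to a tensor in TT form, core by core.
# Cores of u: G^mu of shape (r_{mu-1}, n_mu, r_mu);
# operator cores: A^mu of shape (s_{mu-1}, n_mu, n_mu, s_mu).
def tt_apply(A_cores, G_cores):
    new_cores = []
    for A, G in zip(A_cores, G_cores):
        s0, m, n, s1 = A.shape
        r0, _, r1 = G.shape
        core = np.einsum('lmis,kir->lkmsr', A, G)             # sum over the physical index i_mu
        new_cores.append(core.reshape(s0 * r0, m, s1 * r1))   # new ranks r_eta * s_eta
    return new_cores

d, n, r, s = 4, 3, 2, 2
G = [np.random.rand(1 if mu == 0 else r, n, 1 if mu == d - 1 else r) for mu in range(d)]
A = [np.random.rand(1 if mu == 0 else s, n, n, 1 if mu == d - 1 else s) for mu in range(d)]
print([c.shape for c in tt_apply(A, G)])                      # intermediate ranks multiply to r*s
```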

To estimate the complexity of standard linear algebra operations, one observes that summing tensors in tree network representations leads to summation of ranks, while multiplicative operations like matrix-vector products or Hadamard products lead to multiplication of ranks. Fortunately, this is only an upper bound for the ranks. How to recompress the resulting parameterisation with and without loss of accuracy will be shown later. Details on linear algebra operations are beyond the scope of this paper, but can be found in [27, 65, 110].

3 Tree Tensor Networks as Nested Subspace Representations

In this section we explain how the deficiencies of the canonical format are cured using tree tensor network parameterisations. Tree tensor networks have the fundamental property that if one edge of the tree is removed, exactly two subtrees are obtained. This property enables the application of matrix techniques to tree tensor networks and constitutes the main difference to the canonical tensor format.

3.1 The Matrix Case \(d=2\) Revisited

An \(m \times n\) matrix can be either seen as an element of \({\mathbb {R}}^m \otimes {\mathbb {R}}^n\), as a bivariate function, or as a linear operator from \({\mathbb {R}}^n\) to \({\mathbb {R}}^m\). In the general, possibly infinite-dimensional case this corresponds to the fact that the topological tensor product \({\mathscr {H}} = {\mathscr {H}}_1 \otimes {\mathscr {H}}_2\) is isometrically isomorphic to the Hilbert space \(\text {HS}({\mathscr {H}}_2,{\mathscr {H}}_1)\) of Hilbert–Schmidt operators from \({\mathscr {H}}_2\) to \({\mathscr {H}}_1\). This space consists of bounded linear operators \(T {:} {\mathscr {H}}_2 \rightarrow {\mathscr {H}}_1\) for which \( \Vert T \Vert _{\text {HS}}^2 = \langle T, T \rangle _{\text {HS}} < \infty , \) where the inner product is defined as \( \langle S, T \rangle _{\text {HS}} = \sum _{i_2=1}^{n_2} \langle S \mathbf {e}_{i_2}^2, T \mathbf {e}_{i_2}^2 \rangle . \) Here \(\{ {\mathbf {e}}^2_{i_2} {:} i_2 \in {\mathscr {I}}_2 \}\) is any orthonormal basis of \({\mathscr {H}}_2\). It is an easy exercise to convince oneself that the choice of basis is irrelevant. The isometric isomorphism \(\mathbf {u} \mapsto T_{\mathbf {u}}\) between \({\mathscr {H}}\) and \(\text {HS}({\mathscr {H}}_2,{\mathscr {H}}_1)\) which we then consider is constructed by identifying

$$\begin{aligned} \mathbf {u}^1 \otimes \mathbf {u}^2 \in {\mathscr {H}}_1\otimes {\mathscr {H}}_2 \quad \leftrightarrow \quad \langle \cdot , \mathbf {u}^2 \rangle \mathbf {u}^1 \in \text {HS}({\mathscr {H}}_2,{\mathscr {H}}_1) \end{aligned}$$
(3.1)

and linear expansion.

The relation to compact operators makes the case \(d=2\) unique, as it enables spectral theory for obtaining tensor decompositions and low-rank approximations. The nuclear decomposition of compact operators plays the decisive role. It was first obtained by Schmidt for integral operators [121]. A proof can be found in most textbooks on linear functional analysis or spectral theory. For matrices, the decomposition (3.2) below is called the singular value decomposition (SVD), and can be traced back even further, see [128] for the history. We will use the same terminology. The best approximation property stated below was also obtained by Schmidt, and was later attributed to Eckart and Young [45]. We state the result in \(\ell ^2({\mathscr {I}}_1 \times {\mathscr {I}}_2)\); see [65] for a self-contained treatment from a more general tensor perspective.

Theorem 3.1

(E. Schmidt, 1907) Let \(\mathbf {u} \in \ell ^2({\mathscr {I}}_1 \times {\mathscr {I}}_2)\), then there exist orthonormal systems \(\{\mathbf {U}^1(\cdot ,k) {:} k \in {\mathscr {I}}_1 \}\) in \(\ell ^2({\mathscr {I}}_1)\) and \(\{\mathbf {U}^2(\cdot , k) {:} k \in {\mathscr {I}}_2 \}\) in \(\ell ^2({\mathscr {I}}_2)\), and \(\sigma _1 \ge \sigma _2 \ge \dots \ge 0\), such that

$$\begin{aligned} \mathbf {u}(i_1,i_2) = \sum _{k=1}^{\min (n_1,n_2)} \sigma _k \ \mathbf {U}^1(i_1,k) \mathbf {U}^2(i_2,k), \end{aligned}$$
(3.2)

with convergence in \(\ell ^2({\mathscr {I}}_1 \times {\mathscr {I}}_2)\). A best approximation of \({\mathbf {u}} \) by a tensor of rank \(r \le \min (n_1,n_2)\) in the norm of \(\ell ^2({\mathscr {I}}_1 \times {\mathscr {I}}_2)\) is provided by

$$\begin{aligned} \mathbf {u}_r(i_1,i_2) = \sum _{k=1}^{r} \sigma _k \ \mathbf {U}^1(i_1,k) \mathbf {U}^2(i_2,k), \end{aligned}$$

and the approximation error satisfies

$$\begin{aligned} \Vert \mathbf {u} - \mathbf {u}_r \Vert ^2 = \sum _{k = r+1}^{\min (n_1,n_2)} \sigma _k^2. \end{aligned}$$

The best approximation is unique in case \(\sigma _r > \sigma _{r+1}\).

The numbers \(\sigma _k\) are called singular values, the basis elements \(\mathbf {U}^1(\cdot ,k)\) and \(\mathbf {U}^2(\cdot ,k)\) are called corresponding left and right singular vectors. They are the eigenvectors of \(T_{\mathbf {u}}^{} T_{\mathbf {u}}^*\) and \(T_{\mathbf {u}}^* T_{\mathbf {u}}^{}\), respectively.

In matrix notation, let \(\mathbf {U}\) be the (possibly infinite) matrix with entries \(\mathbf {u}(i_1,i_2)\). Then, using (3.1), the singular value decomposition (3.2) takes the familiar form

$$\begin{aligned} \mathbf {U} = \mathbf {U}_1^{} \varvec{\varSigma } \mathbf {U}_2^T, \end{aligned}$$

where \(\mathbf {U}_\mu ^{} = [\mathbf {u}^\mu _1,\mathbf {u}^\mu _2,\dots ]\) have columns \(\mathbf {u}^\mu _k = \mathbf {U}^\mu (\cdot ,k)\), \(\mu = 1,2\), and \(\varvec{\varSigma } = {{\mathrm{diag}}}(\sigma _1,\sigma _2,\dots )\).
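In the finite-dimensional case, Theorem 3.1 can be checked directly with a numerical SVD. The sketch below (sizes chosen arbitrarily) truncates after r terms and verifies that the squared error equals the sum of the discarded squared singular values.

```python
import numpy as np

n1, n2, r = 30, 20, 5
U = np.random.rand(n1, n2)

U1, sigma, U2t = np.linalg.svd(U, full_matrices=False)
U_r = U1[:, :r] @ np.diag(sigma[:r]) @ U2t[:r, :]        # best rank-r approximation

err2 = np.linalg.norm(U - U_r)**2                        # Frobenius norm = ell^2 norm of u
print(np.isclose(err2, np.sum(sigma[r:]**2)))            # True
```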

3.2 Subspace Approximation

The problem of finding the best rank-r approximation to a tensor of order two (a matrix) can be interpreted as a subspace approximation problem, and Schmidt’s theorem 3.1 provides a solution.

The problem is as follows: Find subspaces \({\mathscr {U}}_1 \subseteq \ell ^2({\mathscr {I}}_1)\) and \({\mathscr {U}}_2 \subseteq \ell ^2({\mathscr {I}}_2)\) of dimension r such that

$$\begin{aligned} {{\mathrm{dist}}}(\mathbf {u}, {\mathscr {U}}_1 \otimes {\mathscr {U}}_2) = \Vert \mathbf {u} - \varPi _{{\mathscr {U}}_1 \otimes {\mathscr {U}}_2} \mathbf {u} \Vert = \text { min!} \end{aligned}$$
(3.3)

Here \(\varPi _{{\mathscr {U}}_1 \otimes {\mathscr {U}}_2}\) denotes the orthogonal projection onto \({\mathscr {U}}_1 \otimes {\mathscr {U}}_2\). The truncated singular value decomposition \(\mathbf {u}_r\) solves this problem: more precisely, the optimal subspaces are those spanned by the r dominant left and right singular vectors, respectively, since a tensor of order two has rank at most r if and only if it is contained in such a subspaceFootnote 2 \({\mathscr {U}}_1 \otimes {\mathscr {U}}_2\). We highlight that the admissible set over which we minimise the distance \({{\mathrm{dist}}}(\mathbf {u}, {\mathscr {U}}_1 \otimes {\mathscr {U}}_2)\) is the closure of a Cartesian product of Grassmannians [1, 46, 95]. Note that the rank of \(\mathbf {u}\) can now equivalently be defined as the minimal r such that the minimal distance in (3.3) is zero.

In contrast to the representability in canonical tensor format, the interpretation of low-rank approximation as subspace approximation, which is possible in case \(d=2\), provides a different concept which offers advantageous mathematical properties also in the higher-order case. In the sequel we will pursue this concept. A direct generalisation of (3.3) to higher-order tensors leads to the by now classical Tucker format [65, 87, 133]. Given a tensor \(\mathbf {u} \in \ell ^2({\mathscr {I}})\) and dimensions \(r_1,\dots ,r_d\) one is searching for optimal subspaces \( {\mathscr {U}}_\mu \subseteq \ell ^2({\mathscr {I}}_\mu )\) of dimension \(r_\mu \) such that

$$\begin{aligned} {{\mathrm{dist}}}(\mathbf {u}, {\mathscr {U}}_1 \otimes \dots \otimes {\mathscr {U}}_d) = \Vert \mathbf {u} - \varPi _{{\mathscr {U}}_1 \otimes \dots \otimes {\mathscr {U}}_d} \mathbf {u} \Vert = \text { min!} \end{aligned}$$
(3.4)

The (elementwise) minimal tuple \((r_1,\dots ,r_d)\) for which the minimal distance in (3.4) is zero is called the Tucker rank of \(\mathbf {u}\). It follows from this definition that a tensor has Tucker rank at most \((r_1,\dots ,r_d)\) if and only if \(\mathbf {u} \in {\mathscr {U}}_1 \otimes \dots \otimes {\mathscr {U}}_d\) with \(\dim ({\mathscr {U}}_\mu ) \le r_\mu \). Note that this in turn is the case if and only if \(\mathbf {u}\) can be written as

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d) = \sum _{k_1 = 1}^{r_1} \cdots \sum _{k_d=1}^{r_d} \mathbf {C}(k_1,\dots ,k_d) \mathbf {U}^1(i_1,k_1) \cdots \mathbf {U}^d(i_d,k_d). \end{aligned}$$
(3.5)

For instance, one can choose \(\{ \mathbf {U}^\mu (\cdot ,1), \dots ,\mathbf {U}^\mu (\cdot ,r_\mu ) \}\) to be a basis of \({\mathscr {U}}_\mu \). The multilinear representation (3.5) of tensors is called the Tucker format [77, 133]. Its tensor network representation is given in Fig. 1a.
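For \(d=3\), a Tucker representation can be assembled from a core tensor and orthonormal factor matrices as in the following sketch (all names and sizes are illustrative); the stored data consists of the core and the d bases only.

```python
import numpy as np

# u(i_1,i_2,i_3) = sum_{k_1,k_2,k_3} C(k_1,k_2,k_3) U^1(i_1,k_1) U^2(i_2,k_2) U^3(i_3,k_3)
n, r = 8, 3
C = np.random.rand(r, r, r)                                       # core tensor: r^d entries
Us = [np.linalg.qr(np.random.rand(n, r))[0] for _ in range(3)]    # orthonormal bases of U_mu

u = np.einsum('abc,ia,jb,kc->ijk', C, *Us)
print(u.shape, "from", C.size + 3 * n * r, "parameters instead of", n**3)
```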

The minimal \(r_\mu \) appearing in the Tucker rank of \({\mathbf {u}}\), as well as the corresponding subspaces \({\mathscr {U}}_\mu \), can be found constructively and independently from each other as follows. For \(\mu = 1,\dots ,d\), let \({\mathscr {I}}_\mu ^c = {\mathscr {I}}_1 \times \dots \times {\mathscr {I}}_{\mu - 1} \times {\mathscr {I}}_{\mu +1} \times \dots \times {\mathscr {I}}_{d}\). Then the spaces \(\ell ^2({\mathscr {I}}_\mu \times {\mathscr {I}}_\mu ^c) = \ell ^2({\mathscr {I}}_\mu ) \otimes \ell ^2({\mathscr {I}}_{\mu }^c)\), which are tensor product spaces of order two, are all isometrically isomorphic to \(\ell ^2({\mathscr {I}})\). The corresponding isomorphisms \(\mathbf {u} \mapsto \mathbf {M}_\mu (\mathbf {u})\) are called matricisations. The SVDs of \(\mathbf {M}_\mu (\mathbf {u})\) provide us with subspaces \({\mathscr {U}}_\mu \) of minimal dimension \(r_\mu \) such that \(\mathbf {M}_\mu (\mathbf {u}) \in {\mathscr {U}}_\mu \otimes \ell ^2({\mathscr {I}}_\mu ^c)\), that is,

$$\begin{aligned} \mathbf {u} \in \ell ^2({\mathscr {I}}_1) \otimes \dots \otimes \ell ^2({\mathscr {I}}_{\mu -1}) \otimes {\mathscr {U}}_\mu \otimes \ell ^2({\mathscr {I}}_{\mu +1}) \otimes \dots \otimes \ell ^2({\mathscr {I}}_d). \end{aligned}$$
(3.6)

Comparing with (3.4), this shows that this \(r_\mu \) cannot be larger than the corresponding Tucker rank. On the other hand, since (3.6) holds for \(\mu =1,\dots ,d\) simultaneously, we get (see, e.g. [65, Lemma 6.28])

$$\begin{aligned} \mathbf {u} \in {\mathscr {U}}_1 \otimes \dots \otimes {\mathscr {U}}_d, \end{aligned}$$
(3.7)

which in combination yields that the \({\mathscr {U}}_\mu \) found in this way solve (3.4). These considerations will be generalised to general tree tensor networks in Sect. 3.5.

Similar to the matrix case one may pose the problem of finding the best approximation of a tensor \(\mathbf {u}\) by one of lower Tucker rank. This problem always has a solution [52, 134], but no normal form providing a solution similar to the SVD is currently available. The higher-order SVD [39] uses the dominant left singular subspaces from the SVDs of the matricisations \(\mathbf {M}_\mu (\mathbf {u})\), but these only provide quasi-optimal approximations. This will be explained in more detail in Sect. 3.5, albeit somewhat more abstractly than required for the Tucker format.
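The following sketch of such a higher-order SVD truncation is included for orientation (it is one possible implementation, with illustrative sizes and ranks): each mode is projected onto the dominant left singular subspace of its matricisation.

```python
import numpy as np

def hosvd_truncate(u, ranks):
    """Project u onto the tensor product of the dominant left singular subspaces."""
    d = u.ndim
    Us = []
    for mu in range(d):
        M = np.moveaxis(u, mu, 0).reshape(u.shape[mu], -1)   # matricisation M_mu(u)
        Us.append(np.linalg.svd(M, full_matrices=False)[0][:, :ranks[mu]])
    core = u
    for U_mu in Us:                                          # core = u contracted with U_mu^T
        core = np.tensordot(core, U_mu, axes=([0], [0]))     # mode order rotates cyclically
    u_t = core
    for U_mu in Us:                                          # multiply the bases back in
        u_t = np.tensordot(u_t, U_mu.T, axes=([0], [0]))
    return u_t

u = np.random.rand(6, 6, 6)
u_t = hosvd_truncate(u, (3, 3, 3))
print(np.linalg.norm(u - u_t) / np.linalg.norm(u))           # quasi-optimal truncation error
```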

There is a major drawback of the Tucker format, which motivates us to go beyond it: unless the core tensor \(\mathbf {C}\) in (3.5) is sparse, the low-rank Tucker representation is still affected by the curse of dimensionality. Since, in general, the core tensor contains \(r_1 \cdots r_d \) possibly nonzero entries, its storage complexity scales exponentially with the order as \({\mathscr {O}}(r^d)\), where \(r = \max \{ r_\mu {:} \mu =1,\dots ,d\}\). With \( n = \max \{ n_\mu {:} \mu =1, \ldots , d\}\), the overall complexity for storing the required data, including the basis vectors \(\mathbf {U}^\mu ( \cdot , k)\), is bounded by \({\mathscr {O}} ( ndr + r^d)\). Without further sparsity of the core tensor, the pure Tucker format is appropriate only for tensors of low order, say \(d \le 4\). Nonetheless, subspace-based tensor representation and approximation as in the Tucker format is not a dead end. We will describe how it can be used in a hierarchical fashion to circumvent the curse of dimensionality, at least for a large class of tensors.

3.3 Hierarchical Tensor Representation

The hierarchical Tucker format or hierarchical tensor format (HT) was introduced by Hackbusch and Kühn [69], and extends the idea of subspace approximation to a hierarchical or multi-level framework. It is a tree tensor network corresponding to the diagram in Fig. 1b. Here we derive it from a geometric viewpoint.

Let us reconsider the subspace relation (3.7) with subspaces \({\mathscr {U}}_\mu \) of dimension \(r_\mu \). There exists a subspace \({\mathscr {U}}_{\{ 1,2 \} } \subseteq {\mathscr {U}}_1 \otimes {\mathscr {U}}_2\), of possibly lower dimension \(r_{\{1,2\}} \le r_1 r_2\), such that we actually have \(\mathbf {u} \in {\mathscr {U}}_{\{ 1,2 \} } \otimes {\mathscr {U}}_3 \otimes \dots \otimes {\mathscr {U}}_d\). Then \( {\mathscr {U}}_{\{ 1,2\}} \) is a space of “matrices”, and has a basis \(\{ \mathbf {U}^{\{1,2\}}(\cdot ,\cdot ,k_{\{1,2\}}) {:} k_{\{1,2\}} = 1,\dots ,r_{\{1,2\}} \}\), whose elements can be represented in the basis of \( {\mathscr {U}}_1 \otimes {\mathscr {U}}_2\):

$$\begin{aligned} \mathbf {U}^{\{1,2\}}(i_1,i_2,k_{\{1,2\}}) = \sum _{{k}_1 =1 }^{r_1} \sum _{{k}_2 =1}^{r_2} \mathbf {B}^{\{1,2\}}(k_1,k_2,k_{\{1,2\}}) \mathbf {U}^1(i_1,k_1) \mathbf {U}^2(i_2,k_2). \end{aligned}$$

One can now continue in several ways, e.g. by choosing a subspace \( {\mathscr {U}}_{\{ 1,2,3\} } \subseteq {\mathscr {U}}_{\{ 1,2\} } \otimes {\mathscr {U}}_3 \subseteq {\mathscr {U}}_1 \otimes {\mathscr {U}}_2 \otimes {\mathscr {U}}_3\). Another option is to find a subspace \( {\mathscr {U}}_{\{ 1,2,3,4\} } \subseteq {\mathscr {U}}_{\{ 1,2\} } \otimes {\mathscr {U}}_{\{3,4\}}\), where \( {\mathscr {U}}_{\{3,4\}}\) is defined analogously to \( {\mathscr {U}}_{\{1,2\}}\), and so on.

For a systematic treatment, this approach is cast into the framework of a partition tree \({\mathbb {T}}\) (also called dimension tree) containing subsets of \(\{1,\dots ,d\}\) such that

  1. (i)

    \(\alpha ^* {:}= \{1,\dots ,d\} \in {\mathbb {T}}\), and

  2. (ii)

    for every \(\alpha \in {\mathbb {T}}\) with \(\left|\alpha \right| > 1\) there exist \(\alpha _1,\alpha _2 \in {\mathbb {T}}\) such that \( \alpha = \alpha _1 \cup \alpha _2 \) and \( \alpha _1 \cap \alpha _2 = \emptyset \).

Such a set \({\mathbb {T}}\) forms a binary tree by introducing edges between fathers and sons. The vertex \(\alpha ^*\) is then the root of this tree, while the singletons \(\{\mu \}\), \(\mu =1,\dots ,d\), are the leaves. By agreeing that \(\alpha _1\) should be the left son of \(\alpha \) and \(\alpha _2\) the right son, a pre-order traversal of the tree yields the leaves \(\{\mu \}\) appearing according to a certain permutation \(\varPi _{{\mathbb {T}}}\) of \(\{1,\dots ,d\}\).

By \(\hat{{\mathbb {T}}}\) we denote the subset of those \(\alpha \in {\mathbb {T}}\) which are neither the root nor a leaf (inner vertices). In the HT format, to every \(\alpha \in {\mathbb {T}} {\setminus } \{\alpha ^*\}\) with sons \(\alpha _1,\alpha _2\) a subspace \({\mathscr {U}}_\alpha \subseteq \bigotimes _{j \in \alpha } \ell ^2({\mathscr {I}}_{j})\) of dimension \(r_\alpha \) is attached such that the nestedness properties

$$\begin{aligned} {\mathscr {U}}_{\alpha } \subseteq {\mathscr {U}}_{\alpha _1} \otimes {\mathscr {U}}_{\alpha _2}, \quad \alpha \in \hat{{\mathbb {T}}}, \end{aligned}$$

and \(\pi _{{\mathbb {T}}}(\mathbf {u}) \in {\mathscr {U}}_{\alpha _1^*} \otimes {\mathscr {U}}_{\alpha _2^*}\) hold true. Here \(\pi _{{\mathbb {T}}}\) denotes the natural isomorphismFootnote 3 between \(\bigotimes _{\mu =1}^d \ell ^2({\mathscr {I}}_\mu )\) and \(\bigotimes _{\mu =1}^d \ell ^2({\mathscr {I}}_{\varPi _{{\mathbb {T}}}(\mu )})\).

Corresponding bases \(\{ \mathbf {U}^\alpha (\cdot ,\ldots ,\cdot ,k_\alpha ) {:} k_\alpha = 1,\dots ,r_\alpha \}\) of \({\mathscr {U}}_\alpha \) are then recursively expressed as

$$\begin{aligned} \mathbf {U}^\alpha ({\mathbf {i}}_\alpha ,k_\alpha ) = \sum _{{k}_1 =1 }^{r_{\alpha _1}} \sum _{{k}_2 =1}^{r_{\alpha _2}} \mathbf {B}^{\alpha }(k_1,k_2,k_\alpha ) \mathbf {U}^{\alpha _1}({\mathbf {i}}_{\alpha _1},k_1) \mathbf {U}^{\alpha _2}({\mathbf {i}}_{\alpha _2},k_2), \quad \alpha \in \hat{{\mathbb {T}}}, \end{aligned}$$
(3.8)

where \({\mathbf {i}}_\alpha = \times _{\mu \in \alpha } \{i_\mu \}\) denotes, with a slight abuse of notation, the tuple of physical variables represented by \(\alpha \). Finally, \(\mathbf {u}\) is recovered as

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d) = \sum _{{k}_1 =1 }^{r_{\alpha _1^*}} \sum _{{k}_2 =1}^{r_{\alpha _2^*}} \mathbf {B}^{\alpha ^*}(k_1,k_2) \mathbf {U}^{\alpha _1^*}({\mathbf {i}}_{\alpha _1^*},k_1) \mathbf {U}^{\alpha _2^*}({\mathbf {i}}_{\alpha _2^*},k_2). \end{aligned}$$
(3.9)

It will be notationally convenient to set \(\mathbf {B}^\alpha = \mathbf {U}^\alpha \) for leaves \(\alpha = \{\mu \}\). If equation (3.9) is recursively expanded using (3.8), we obtain a multilinear low-rank format of the form (2.7) with \(E = \left|{\mathbb {T}}\right|-1\), \(D = \left|{\mathbb {T}}\right|\), \(r_\eta = r_\alpha \), and \(\mathbf {C}^\nu = \mathbf {B}^\alpha \) (in some ordering), that satisfies Definition 2.2. Its graphical representation takes the form of the tree in Fig. 1b, and has the same topology as the tree \({\mathbb {T}}\) itself, ignoring the edges with open ends, which can be seen as labels indicating physical variables.
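For \(d=4\) and the balanced tree with inner vertices \(\{1,2\}\) and \(\{3,4\}\), the recursion (3.8)–(3.9) can be written out explicitly. The following sketch (all names and sizes chosen for illustration) assembles the full tensor from the component tensors.

```python
import numpy as np

n, r_leaf, r_int = 5, 3, 2
B1, B2, B3, B4 = (np.random.rand(n, r_leaf) for _ in range(4))   # leaf bases U^{mu}
B12 = np.random.rand(r_leaf, r_leaf, r_int)                      # B^{{1,2}}(k_1,k_2,k_{12})
B34 = np.random.rand(r_leaf, r_leaf, r_int)                      # B^{{3,4}}(k_3,k_4,k_{34})
Broot = np.random.rand(r_int, r_int)                             # B^{alpha*}(k_{12},k_{34})

# nested bases as in (3.8)
U12 = np.einsum('abk,ia,jb->ijk', B12, B1, B2)                   # U^{{1,2}}(i_1,i_2,k_{12})
U34 = np.einsum('abk,ia,jb->ijk', B34, B3, B4)                   # U^{{3,4}}(i_3,i_4,k_{34})

# root coupling as in (3.9)
u = np.einsum('pq,abp,cdq->abcd', Broot, U12, U34)
print(u.shape)                                                   # (n, n, n, n)
```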

The tensors \(\mathbf {B}^{\alpha }\) will be called component tensors; the terminology transfer tensors is also common in the literature. In line with (2.8), the tensors which are representable in the HT format with fixed \(\mathfrak {r} = (r_\alpha )\) are the images

$$\begin{aligned} \mathbf {u} = \pi _{{\mathbb {T}}}^{-1}(\tau _{{\mathbb {T}},\mathfrak {r}}((\mathbf {B}^\alpha )_{\alpha \in {\mathbb {T}}})) \end{aligned}$$

of a multilinear map \(\pi _{{\mathbb {T}}}^{-1} \circ \tau _{{\mathbb {T}},\mathfrak {r}}\).

For fixed \(\mathbf {u}\) and \({\mathbb {T}}\), the minimal possible \(r_\alpha \) to represent \(\mathbf {u}\) as image of \(\tau _{{\mathbb {T}},\mathfrak {r}}\) are, as for the Tucker format, given by ranks of certain matricisations of \(\mathbf {u}\). This will be explained in Sect. 3.5 for general tree tensor networks.

Depending on the contraction lengths \(r_\alpha \), the HT format can be efficient, as it only requires storing the tuple \((\mathbf {B}^\alpha )_{\alpha \in {\mathbb {T}}}\). Every \(\mathbf {B}^\alpha \) is a tensor of order at most three. The number of nodes in the tree \({\mathbb {T}} \) is bounded by \( 2 d -1 = {\mathscr {O}} (d) \), including the root node. Therefore, the data complexity for representing \(\mathbf {u}\) is \({\mathscr {O}} (ndr + d r^3) \), where \( n = \max \{n_\mu {:} \mu =1 , \ldots , d\}\), \(r = \max \{ r_{\alpha } {:} \alpha \in {\mathbb {T}}{\setminus } \{\alpha ^* \}\}\). In contrast to the classical Tucker format, the complexity formally no longer scales exponentially in d.

It is straightforward to extend the concept to partition trees such that vertices are allowed to have more than two sons, but the binary trees are the most common. Note that the Tucker format itself represents an extreme case where the root decomposes immediately into d leaves, as illustrated in Fig. 1a.

3.4 Tensor Trains and Matrix Product Representation

As a third example, we consider the tensor train (TT) format as introduced in [111, 113]. As it later turned out, this format plays an important role in physics, where it is known as matrix product states (MPS). The unlabelled tree tensor network of this format can be seen in Fig. 1c. When attaching the physical variables \(i_\mu \) in natural order from left to right, the pointwise multilinear representation is

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d) = \sum _{k_1=1}^{r_1} \dots \sum _{k_{d-1}=1}^{r_{d-1}} \mathbf {G}^1 (i_1 ,k_1 ) \mathbf {G}^2 (k_1 , i_2 , k_2) \cdots \mathbf {G}^d ( k_{d-1}, i_d ) \end{aligned}$$
(3.10)

The TT format is hence of the form (2.7) with \(D=d\) and \(E = d-1\), and satisfies Definition 2.2.

Introducing the matrices \( \mathbf {G}^\mu (i_\mu ) = [\mathbf {G}^\mu (k_{\mu -1},i_\mu ,k_\mu )] \in {\mathbb {R}}^{r_{\mu -1} \times r_{\mu } }\), with the convention \(r_0 =r_d =1\), \(\mathbf {G}^1(1,i_1,k_1) = \mathbf {G}^1(i_1,k_1)\), and \(\mathbf {G}^d(k_{d-1},i_d,1) = \mathbf {G}^d(k_{d-1},i_d)\), formula (3.10) becomes a matrix product,

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d) = \mathbf {G}^1 (i_1) \mathbf {G}^2 (i_2) \cdots \mathbf {G}^d (i_d ), \end{aligned}$$
(3.11)

which explains the name matrix product states used in physics. In particular, the multilinear dependence on the components \(\mathbf {G}^\mu \) is evident, and may be expressed as \(\mathbf {u} = \tau _{\text {TT}} (\mathbf {G}^1,\dots ,\mathbf {G}^d)\).
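A point evaluation via (3.11) is just a sequence of small matrix products. The following sketch assumes the common convention of boundary ranks equal to one, so that all cores are stored as three-dimensional arrays; sizes are illustrative.

```python
import numpy as np

d, n, r = 6, 4, 3
ranks = [1] + [r] * (d - 1) + [1]
G = [np.random.rand(ranks[mu], n, ranks[mu + 1]) for mu in range(d)]

def tt_entry(G, idx):
    """u(i_1,...,i_d) = G^1(i_1) G^2(i_2) ... G^d(i_d) in O(d r^2) operations."""
    v = G[0][:, idx[0], :]                # 1 x r_1 matrix
    for mu in range(1, len(G)):
        v = v @ G[mu][:, idx[mu], :]      # multiply by the r_{mu-1} x r_mu matrix G^mu(i_mu)
    return v[0, 0]

print(tt_entry(G, (0, 1, 2, 3, 0, 1)))
```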

From the viewpoint of subspace representation, the minimal \(r_\eta \), \(\eta = 1,\dots ,d-1\), required for representing \(\mathbf {u}\) in the TT format are the minimal dimensions of subspaces \({\mathscr {U}}_{\{1,\dots ,\eta \}} \subseteq \bigotimes _{\mu = 1}^\eta \ell ^2({\mathscr {I}}_\mu )\) such that the relations

$$\begin{aligned} \mathbf {u} \in {\mathscr {U}}_{\{1,\dots ,\eta \}} \otimes \Bigl ( \bigotimes _{\mu = \eta +1}^d \ell ^2({\mathscr {I}}_\mu ) \Bigr ), \quad \eta = 1,\dots ,d-1 \end{aligned}$$

hold simultaneously. Again, these subspaces can be obtained as ranges of corresponding matricisations, as will be explained in the next subsection. Regarding nestedness, we will see that one even has

$$\begin{aligned} {\mathscr {U}}_{\{1,\dots ,\eta \}} \subseteq {\mathscr {U}}_{\{1,\dots ,\eta -1\}} \otimes \ell ^2({\mathscr {I}}_\eta ), \quad \eta = 2,\dots ,d-1. \end{aligned}$$

A tensor in canonical format

$$\begin{aligned} \mathbf {u} (i_1,\dots ,i_d) = \sum _{k=1}^{r_c} \mathbf {C}^1 (i_1,k) \cdots \mathbf {C}^d (i_d,k) \end{aligned}$$

can be easily written in TT form, by setting all \(r_\eta \) to \(r_c\), \(\mathbf {G}^1 = \mathbf {C}^1\), \(\mathbf {C}^d = (\mathbf {G}^d)^T\), and

$$\begin{aligned} \mathbf {G}^\mu ( k_{\mu -1} , i_\mu , k_\mu ) = {\left\{ \begin{array}{ll} \mathbf {C}^\mu (i_\mu ,k),&{} \quad \text {if }k_{\mu -1} = k_\mu = k,\\ 0, &{}\quad \text {else,} \end{array}\right. } \end{aligned}$$

for \( \mu =2,\dots ,d-1\). From (3.11) we conclude immediately that a single point evaluation \(\mathbf {u}(i_1,\dots ,i_d)\) can be computed easily by matrix multiplication using \({\mathscr {O}} ( d r^2 ) \) arithmetic operations, where \(r = \max \{ r_\eta {:} \eta = 1,\dots ,d-1\}\). With \(n = \max \{n_\mu {:} \mu =1 , \ldots , d\}\), the data required for the TT representation is \({\mathscr {O}} (d n r^2)\), as the d component tensors \(\mathbf {G}^\mu \) need to be stored. Depending on r, the TT format hence offers the possibility to circumvent the curse of dimensionality.
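The conversion just described can be written down in a few lines. The following sketch (with illustrative sizes) builds the diagonal interior cores and checks one entry against the CP evaluation.

```python
import numpy as np

d, n, rc = 4, 5, 3
C = [np.random.rand(n, rc) for _ in range(d)]             # CP factors C^mu(i_mu, k)

G = [C[0]]                                                # G^1(i_1, k_1) = C^1(i_1, k_1)
for mu in range(1, d - 1):
    core = np.zeros((rc, n, rc))
    for k in range(rc):
        core[k, :, k] = C[mu][:, k]                       # nonzero only for k_{mu-1} = k_mu = k
    G.append(core)
G.append(C[d - 1].T)                                      # G^d(k_{d-1}, i_d) = C^d(i_d, k_{d-1})

idx = (1, 2, 3, 4)
v = G[0][idx[0], :]
for mu in range(1, d - 1):
    v = v @ G[mu][:, idx[mu], :]
tt_val = v @ G[-1][:, idx[-1]]
cp_val = np.sum(np.prod([C[mu][idx[mu], :] for mu in range(d)], axis=0))
print(np.isclose(tt_val, cp_val))                         # True
```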

Due to its convenient explicit representation (3.11) we will use the TT format frequently as a model case for explanation.

3.5 Matricisations and Tree Rank

After having discussed the most prominent examples of tree tensor networks in the previous sections, we return to the consideration of a general tree tensor network representation \(\mathbf {u} = \tau _{\mathfrak {r}}(\mathbf {C}^1,\dots ,\mathbf {C}^D)\) encoding a representation (2.7) and obeying Definitions 2.2 and 2.5. The nodes have indices \(1,\dots ,D\), and the distribution of physical variables \(i_\mu \) is fixed (nodes are allowed to carry more than one physical index).

The topology of the network is described by the set of its edges. Following [9], we now introduce a notion of effective edges, which may in fact comprise several lines in a graphical representation such as Fig. 1, and correspond precisely to the matricisations arising in the tensor format. The set of such edges will be denoted by \({\mathbb {E}}\). In slight deviation from (2.7), the contraction variables \((k_\eta )_{\eta \in {\mathbb {E}}}\) and the representation ranks \(\mathfrak {r} = (r_\eta )_{\eta \in {\mathbb {E}}}\) will now be indexed by the set \({\mathbb {E}}\).

Since we are dealing with a tree tensor network, along every contraction index we may split the tree into two disjoint subtrees. Both subtrees must contain vertices carrying physical variables. Hence such a splitting induces a partition

$$\begin{aligned} \alpha ^* = \{1,\dots ,d\} = \alpha \cup \alpha ^c \end{aligned}$$

by gathering the \(\mu \) for which the physical index \(i_\mu \) is in the respective subtree. We then call the unordered pair \(\{ \alpha , \alpha ^c\}\) an edge.

For instance, for a given partition tree \({\mathbb {T}}\) in the HT case, we have

$$\begin{aligned} {\mathbb {E}} = \bigl \{ \{ \alpha , \alpha ^c \} :\alpha \in {\mathbb {T}} {\setminus } \{ \alpha ^*\} \bigr \} \end{aligned}$$
(3.12)

as used in [9], with each element of \({\mathbb {E}}\) corresponding to precisely one matricisation arising in the format. As a consequence of the definition, for each \(\eta \in {\mathbb {E}}\) we may pick a representative \([ \eta ] \in {\mathbb {T}}\). Note that in (3.12), the set \(\{ \alpha ^*_1,\alpha ^*_2\}\) appears twice as \(\alpha \) runs over \({\mathbb {T}} {\setminus } \{ \alpha ^*\}\), which is a consequence of the two children of \(\alpha ^*\) corresponding to the same matricisation; hence \(|{\mathbb {E}}| = 2d-3\).

In order to introduce the same notion for tree tensor networks, we first give a construction of a corresponding generalised partition tree \({\mathbb {T}}\) by assigning labels to the nodes in the tensor network as follows. Pick any node \(\nu ^*\) to be the root of the tree, for which we add \(\alpha ^*=\{1,\ldots ,d\}\) to \({\mathbb {T}}\). This induces a top–down (father–son) ordering in the whole tree. For all nodes \(\nu \), we have a partition of the physical variables in the respective subtree of the form

$$\begin{aligned} \alpha _{\nu } = \Bigl (\bigcup _{\nu ' \in {{\mathrm{sons}}}(\nu )} \alpha _{\nu '} \Bigr ) \cup \beta _\nu , \end{aligned}$$
(3.13)

where \(\beta _\nu \) is the set of physical variables attached to \(\nu \) (of course allowing \(\beta _\nu =\emptyset \)). We now add all \(\alpha _\nu \) that are obtained recursively in this manner to the set \({\mathbb {T}}\). It is easy to see that such a labelling is possible for any choice of \(\nu ^*\).

For such a generalised partition tree \({\mathbb {T}}\) of a tree tensor network, we again obtain a set of effective edges \({\mathbb {E}}\) exactly as in (3.12), and again have a representative \([\eta ]\in {\mathbb {T}}\) for each \(\eta \in {\mathbb {E}}\).

The difference between this general construction and the particular case (3.12) of the HT format is that we now allow incomplete partitions (complemented by \(\beta _\nu \)), and in principle also further nodes with the same label. In the case of the TT format (3.10), which corresponds to the network considered in Fig. 1c with linearly arranged \(i_\mu \), starting from the rightmost node \(\nu ^*=\{d\}\), one obtains the \(d-1\) edges

$$\begin{aligned} \bigl \{ \{1\}, \{2,\ldots ,d\} \bigr \} , \bigl \{ \{1,2\}, \{ 3,\ldots ,d\} \bigr \} , \dots , \bigl \{ \{1,\dots ,d-1\} , \{d \} \bigr \} \end{aligned}$$

which in this case comprise the set \({\mathbb {E}}\).

The main purpose of this section is to show how the minimal representation ranks \((r_{\eta })_{\eta \in {\mathbb {E}}}\) are obtained from matrix ranks. For every edge \(\eta = \{\alpha ,\alpha ^c\} \in {\mathbb {E}}\), we have index sets \({\mathscr {I}}_\eta \) and \({\mathscr {I}}_\eta ^c\), obtained by restricting the multi-indices in \({\mathscr {I}}\) to the modes in \(\alpha \) and in \(\alpha ^c\), respectively, and, by (2.4), a natural isometric isomorphism

$$\begin{aligned} \mathbf {M}_\eta {:} \ell ^{2}({\mathscr {I}}) \rightarrow \ell ^2({\mathscr {I}}_\eta ) \otimes \ell ^2({\mathscr {I}}_\eta ^c), \end{aligned}$$

called the \(\eta \)-matricisation or simply matricisation. The second-order tensor \(\mathbf {M}_\eta (\mathbf {u})\) represents a reshape of the hyper-matrix (array) \(\mathbf {u}\) into a matrix in which the rows are indexed by \({\mathscr {I}}_\eta \) and the columns by \({\mathscr {I}}_\eta ^c\). The order in which these index sets are traversed is unimportant for what follows.
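For finite index sets, the \(\eta \)-matricisation is nothing but a transpose-and-reshape of the array \(\mathbf {u}\). A minimal NumPy sketch (the names matricisation and eta_rank are ours):

```python
# Minimal sketch, assuming NumPy and finite index sets; alpha is a tuple of axes.
import numpy as np

def matricisation(u, alpha):
    """Reshape u so that the modes in alpha index the rows and the
    complementary modes index the columns."""
    d = u.ndim
    alpha_c = [mu for mu in range(d) if mu not in alpha]
    rows = int(np.prod([u.shape[mu] for mu in alpha]))
    return np.transpose(u, list(alpha) + alpha_c).reshape(rows, -1)

def eta_rank(u, alpha):
    return np.linalg.matrix_rank(matricisation(u, alpha))

# Example: an elementary (rank-one) tensor has eta-rank 1 for every edge.
u = np.einsum('i,j,k,l->ijkl', *(np.random.randn(3) for _ in range(4)))
print(eta_rank(u, (0, 1)), eta_rank(u, (0, 2)), eta_rank(u, (3,)))   # 1 1 1
```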

Definition 3.2

The rank of \(\mathbf {M}_\eta (\mathbf {u})\) is called the \(\eta \)-rank of \(\mathbf {u}\), and denoted by \({{\mathrm{rank}}}_\eta (\mathbf {u})\). The tuple \({{\mathrm{rank}}}_{{\mathbb {E}}}(\mathbf {u}) = ({{\mathrm{rank}}}_\eta (\mathbf {u}))_{\eta \in {\mathbb {E}}}\) is called the tree rank of \(\mathbf {u}\) for the given tree tensor network.

Theorem 3.3

A tensor \(\mathbf {u} \in \ell ^2({\mathscr {I}})\) is representable in a tree tensor network \(\tau _{\mathfrak {r}}\) with edges \({\mathbb {E}}\) if and only if \({{\mathrm{rank}}}_\eta (\mathbf {u}) \le r_\eta \) for all \(\eta \in {\mathbb {E}}\).

Proof

Assume \(\mathbf {u}\) is representable in the form (2.7). Extracting the edge \(\eta \) corresponding to (without loss of generality, only one) contraction index \(k_\eta \) from the tree, we obtain two disjoint subtrees on both sides of \(\eta \), with corresponding contraction variables relabelled as \(k_1,\dots ,k_s\) and \(k_{s+1},\dots ,k_{E-1}\), respectively, where \(E = |{\mathbb {E}}|\); the set of nodes for the components is partitioned into \(\{1,\dots ,D\} = \gamma ' \cup \gamma ''\). Since in every component \(\mathbf {C}^\nu \) at most two contraction variables are active, it follows that

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d)= & {} \sum _{k_\eta = 1}^{r_\eta } \left( \sum _{k_1=1}^{r_1} \cdots \sum _{k_s = 1}^{r_s} \prod _{\nu ' \in \gamma '} \mathbf {C}^{\nu '}(i_1,\dots ,i_d,k_1,\dots ,k_E)\right) {}\qquad \\&\times \left( \sum _{k_{s+1}=1}^{r_{s+1}} \cdots \sum _{k_{E-1} = 1}^{r_{E-1}} \prod _{\nu '' \in \gamma ''} \mathbf {C}^{\nu ''}(i_1,\dots ,i_d,k_1,\dots ,k_E) \right) .\nonumber \end{aligned}$$
(3.14)

The edge \(\eta \) is of the form \(\eta =\{\alpha ,\alpha ^c\}\), where all physical variables \(i_\mu \) with \(\mu \in \alpha \) are active in some \(\mathbf {C}^{\nu '}\) with \(\nu ' \in \gamma '\), and those in \(\alpha ^c\) are active in some \(\mathbf {C}^{\nu ''}\) with \(\nu '' \in \gamma ''\). Thus (3.14) implies \({{\mathrm{rank}}}_\eta (\mathbf {u}) \le r_\eta \).

To prove the converse statement it suffices to show that we can choose \(r_\eta = {{\mathrm{rank}}}_\eta (\mathbf {u})\). We assume a proper labelling with distinguished node \(\nu ^*\). To every edge \(\eta \) belongs a subspace \({\mathscr {U}}_\eta \subseteq \ell ^2({\mathscr {I}}_{[\eta ]})\), namely the closed span of the left singular vectors of \(\mathbf {M}_\eta (\mathbf {u})\) belonging to positive singular values. Its dimension is \(r_\eta \). In a slight abuse of notation (one has to involve an isomorphism correcting the permutation of factors in the tensor product) we note that

$$\begin{aligned} \mathbf {u} \in {\mathscr {U}}_\eta \otimes \ell ^2({\mathscr {I}}_\eta ^c) \end{aligned}$$
(3.15)

for every \(\eta \). Here our argumentation will be rather informal to avoid notational technicalities. One can show that (3.15) in combination with (3.13) yields (in intuitive notation)

$$\begin{aligned} {\mathscr {U}}_{\alpha _\nu } \subseteq \Bigl ( \bigotimes _{\nu ' \in {{\mathrm{sons}}}(\nu )} {\mathscr {U}}_{\alpha _{\nu '}} \Bigr ) \otimes \ell ^2({\mathscr {I}}_{\beta _\nu }), \end{aligned}$$
(3.16)

and

$$\begin{aligned} \mathbf {u} \in \Bigl ( \bigotimes _{\nu \in {{\mathrm{sons}}}(\nu ^*)} {\mathscr {U}}_{\alpha _\nu } \Bigr ) \otimes \ell ^2({\mathscr {I}}_{\beta _{\nu ^*}}), \end{aligned}$$
(3.17)

by [65, Lemma 6.28]. Let \(\{ \mathbf {U}^\nu (\cdot ,\dots ,\cdot ,k_{\eta (\nu )}) {:} k_{\eta (\nu )} = 1,\dots , r_{\eta (\nu )}\}\) be a basis of \({\mathscr {U}}_{\alpha _\nu }\), with \(\eta (\nu )=\{\alpha _\nu ,\alpha _\nu ^c\}\). We also set \(\mathbf {U}^{\nu ^*} = \mathbf {u}\). Now if a node \(\nu \) has no sons, we choose \(\mathbf {C}^\nu = \mathbf {U}^{\nu }\). For the remaining nodes \(\nu \), including \(\nu ^*\), the component tensor \(\mathbf {C}^\nu \) is obtained recursively as the coefficient tensor of the expansion of \(\mathbf {U}^\nu \) in the product basis provided by (3.16), respectively (3.17). By construction, the final representation of \(\mathbf {u}\) is a decomposition according to the tree network. \(\square \)

3.6 Existence of Best Approximations

We can state the result of Theorem 3.3 differently. Let \({\mathscr {H}}_{\le \mathfrak {r}} = {\mathscr {H}}_{\le \mathfrak {r}}({\mathbb {E}})\) denote the set of all tensors representable in a given tensor tree network with edges \({\mathbb {E}}\). For every \(\eta \in {\mathbb {E}}\) let

$$\begin{aligned} {\mathscr {M}}_{\le r_\eta }^{\eta } = \{ \mathbf {u} \in \ell ^2({\mathscr {I}}) {:} {{\mathrm{rank}}}_\eta (\mathbf {u}) \le r_\eta \}. \end{aligned}$$

Then Theorem 3.3 states that

$$\begin{aligned} {\mathscr {H}}_{\le \mathfrak {r}} = \{ \mathbf {u} {:} {{\mathrm{rank}}}_{{\mathbb {E}}} (\mathbf {u}) \le \mathfrak {r} \} = \bigcap _{\eta \in {\mathbb {E}}} {\mathscr {M}}^\eta _{\le r_\eta }. \end{aligned}$$
(3.18)

Using the singular value decomposition, it is relatively easy to show that for any finite \(r_\eta \), the set \({\mathscr {M}}_{\le r_\eta }^\eta \) is weakly sequentially closed [52, 65, 134, 136], and for \(r_\eta = \infty \), we have \({\mathscr {M}}_{\le r_\eta }^{\eta } = \ell ^2{({\mathscr {I}})}\). Hence the set \({\mathscr {H}}_{\le \mathfrak {r}}\) is weakly sequentially closed. Depending on the chosen norm, this is even true in tensor products of Banach spaces [52]. A standard consequence in reflexive Banach spaces like \(\ell ^2({\mathscr {I}})\) (see, e.g. [147]) is the following.

Theorem 3.4

Every \(\mathbf {u} \in \ell ^2({\mathscr {I}})\) admits a best approximation in \({\mathscr {H}}_{\le \mathfrak {r}}\).

For matrices we know that truncation of the singular value decomposition to rank r yields the best rank-r approximation of that matrix. The analogous problem of finding a best approximation of tree rank at most \(\mathfrak {r}\) for a tensor \(\mathbf {u}\), that is, a best approximation in \({\mathscr {H}}_{\le \mathfrak {r}}\), has no such clear solution and can be NP-hard [75]. As we are able to project onto every set \({\mathscr {M}}^\eta _{\le r_\eta }\) via the SVD, the characterisation (3.18) suggests applying successive projections onto these sets to obtain an approximation in \({\mathscr {H}}_{\le \mathfrak {r}}\). This works provided the projections are applied in a suitable order, and the resulting procedure is called hierarchical singular value truncation.

3.7 Hierarchical Singular Value Decomposition and Truncation

The bases of subspaces considered in the explicit construction used to prove Theorem 3.3 can be chosen arbitrarily. When the left singular vectors of \(\mathbf {M}_\eta (\mathbf {u})\) are chosen, the corresponding decomposition \(\mathbf {u} = \tau _{\mathfrak {r}}(\mathbf {C}^1,\dots ,\mathbf {C}^D)\) is called the hierarchical singular value decomposition (HSVD) with respect to the tree network with effective edges \({\mathbb {E}}\). It was first considered in [39] for the Tucker format, later in [60] for the HT and in [109, 111] for the TT format. It was also introduced before in physics for the matrix product representation [140]. The HSVD can be used to obtain low-rank approximations in the tree network. This procedure is called HSVD truncation.

Most technical details will be omitted. In particular, we do not describe how to practically compute an exact HSVD representation; see, e.g. [60]. For an arbitrary tensor given in full format this is typically prohibitively expensive. However, for \(\mathbf {u} = \tau _{\mathfrak {r}}(\tilde{\mathbf {C}}^1,\dots ,\tilde{\mathbf {C}}^D)\) already given in the tree tensor network format, the procedure is quite efficient. The basic idea is as follows. One changes the components from leaves to root to encode orthonormal bases in every node except \(\nu ^*\), using e.g. QR decompositions that operate only on (matrix reshapes of) the component tensors. Afterwards, it is possible to install HSVD bases from root to leaves using only SVDs on component tensors. Many details are provided in [65].
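For the TT format, the two sweeps just outlined can be sketched in a few lines of NumPy (the names tt_truncate and tt_full are ours, and the tolerance and rank handling are deliberately simplified): a QR sweep from the first towards the last core encodes orthonormal bases in all nodes except the last, after which a sweep of truncated SVDs in the reverse direction yields the HSVD truncation.

```python
# Minimal sketch, assuming NumPy; cores G[mu] have shape (r_{mu-1}, n_mu, r_mu).
import numpy as np

def tt_full(G):
    """Assemble the full array from TT cores (only for small examples)."""
    u = G[0][0]
    for g in G[1:]:
        u = np.tensordot(u, g, axes=([-1], [0]))
    return u[..., 0]

def tt_truncate(G, rmax):
    """HSVD truncation of a TT tensor to ranks <= rmax (TT rounding)."""
    G, d = [g.copy() for g in G], len(G)
    # Sweep 1: QR from the leaves towards the root (here: the last core).
    for mu in range(d - 1):
        r0, n, r1 = G[mu].shape
        Q, R = np.linalg.qr(G[mu].reshape(r0 * n, r1))
        G[mu] = Q.reshape(r0, n, -1)
        G[mu + 1] = np.einsum('ij,jkl->ikl', R, G[mu + 1])
    # Sweep 2: truncated SVDs from the root back towards the leaves.
    for mu in range(d - 1, 0, -1):
        r0, n, r1 = G[mu].shape
        U, s, Vt = np.linalg.svd(G[mu].reshape(r0, n * r1), full_matrices=False)
        r = min(rmax, np.sum(s > 1e-14))
        G[mu] = Vt[:r].reshape(r, n, r1)
        G[mu - 1] = np.einsum('ikl,lj->ikj', G[mu - 1], U[:, :r] * s[:r])
    return G

# Usage: truncate a random rank-5 TT tensor to ranks <= 3 and check the error.
d, n = 6, 4
G = [np.random.randn(1 if mu == 0 else 5, n, 1 if mu == d - 1 else 5) for mu in range(d)]
G3 = tt_truncate(G, rmax=3)
print(np.linalg.norm(tt_full(G) - tt_full(G3)) / np.linalg.norm(tt_full(G)))
```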

In the following we assume that \(\mathbf {u}\) has tree rank \(\mathfrak {s}\) and \(\mathbf {u} = \tau _{\mathfrak {s}}(\mathbf {C}^1,\dots ,\mathbf {C}^D) \in {\mathscr {H}}_{\le \mathfrak {s}}({\mathbb {E}})\) is an HSVD representation. Let \(\mathfrak {r} \le \mathfrak {s}\) be given. We consider here the case that all \(r_\eta \) are finite. An HSVD truncation of \(\mathbf {u}\) to \({\mathscr {H}}_{\le \mathfrak {r}}\) can be derived as follows. Let

$$\begin{aligned} \mathbf {M}_\eta (\mathbf {u}) = \mathbf {U}^\eta \varSigma ^\eta (\mathbf {V}^\eta )^T \end{aligned}$$

be the SVD of \(\mathbf {M}_\eta (\mathbf {u})\), with \(\varSigma ^\eta = {{\mathrm{diag}}}(\sigma ^\eta _1(\mathbf {u}),\sigma ^\eta _2(\mathbf {u}),\dots )\) such that \(\sigma _1^\eta (\mathbf {u}) \ge \sigma _2^\eta (\mathbf {u}) \ge \dots \ge 0\). The truncation of a single \(\mathbf {M}_\eta (\mathbf {u})\) to rank \(r_\eta \) can be achieved by applying the orthogonal projection

$$\begin{aligned} \mathbf {P}_{\eta ,r_\eta } = \tilde{\mathbf {P}}_{[\eta ],r_\eta } \otimes {\mathrm {Id}}_{[\eta ]^c} {:} \ell ^2({\mathscr {I}}) \rightarrow {\mathscr {M}}^\eta _{\le r_\eta }, \end{aligned}$$
(3.19)

where \(\tilde{\mathbf {P}}_{[\eta ],r_\eta }\) is the orthogonal projection onto the span of the \(r_\eta \) dominant left singular vectors of \(\mathbf {M}_\eta (\mathbf {u})\). Then \(\mathbf {P}_{\eta ,r_\eta }(\mathbf {u})\) is the best approximation of \(\mathbf {u}\) in the set \({\mathscr {M}}_{\le r_\eta }^\eta \). Note that \(\mathbf {P}_{\eta ,r_\eta } = \mathbf {P}_{\eta ,r_\eta ,\mathbf {u}}\) itself depends on \(\mathbf {u}\).

The projections \((\mathbf {P}_{\eta ,r_\eta })_{\eta \in {\mathbb {E}}}\) are now applied consecutively. However, to obtain a result in \({\mathscr {H}}_{\le \mathfrak {r}}\), one has to take the ordering into account. Let \({\mathbb {T}}\) be a generalised partition tree of the tensor network. Considering an \(\alpha \in {\mathbb {T}}\) with son \(\alpha '\), we observe the following:

  1. (i)

    Applying \(\mathbf {P}_{\eta ,r_\eta }\) with \(\eta =\{\alpha ,\alpha ^c\}\) does not destroy the nestedness property (3.15) at \(\alpha \), simply because the span of only the dominant \(r_\eta \) left singular vectors is a subset of the full span.

  2. (ii)

    Applying \(\mathbf {P}_{\eta ',r_{\eta '}}\) with \(\eta '=\{\alpha ',\alpha '^c\}\) does not increase the rank of \(\mathbf {M}_\eta (\mathbf {u})\). This holds because \([\eta '] \subseteq [\eta ]\), so that with \(\beta = [\eta ] {\setminus } [\eta ']\) one has \({\mathrm {Id}}_{[\eta ']^c} = {\mathrm {Id}}_{[\eta ]^c} \otimes {\mathrm {Id}}_{\beta }\). Thus, since \(\mathbf {P}_{\eta ',r_{\eta '}}\) is of the form (3.19), it acts only as a left multiplication on \(\mathbf {M}_\eta (\mathbf {u})\).

Property (ii) by itself implies that the top-to-bottom application of the projections \(\mathbf {P}_{\eta ,r_\eta }\) will result in a tensor in \({\mathscr {H}}_{\le \mathfrak {r}}\). Property (i) implies that the procedure can be performed, starting at the root element, by simply setting to zero all entries in the components that relate to deleted basis elements in the current node or its sons, and resizing the tensors accordingly.

Let \({{\mathrm{level}}}\eta \) denote the distance of \([\eta ]\) in the tree to \(\alpha ^*\), and let L be the maximum such level. The procedure just described defines an operator

$$\begin{aligned} \mathbf {H}_{\mathfrak {r}} {:} \ell ^2({\mathscr {I}}) \rightarrow {\mathscr {H}}_{\le \mathfrak {r}}, \quad \mathbf {u} \mapsto \Bigl ( \prod _{{{\mathrm{level}}}{\eta } = L} \mathbf {P}_{\eta ,r_\eta ,\mathbf {u}} \,\cdots \prod _{{{\mathrm{level}}}{\eta } = 1} \mathbf {P}_{\eta ,r_\eta ,\mathbf {u}} \Bigr ) (\mathbf {u}), \end{aligned}$$
(3.20)

called the hard thresholding operator. Remarkably, as the following result shows, it provides a quasi-optimal projection. Recall that a best approximation in \({\mathscr {H}}_{\le \mathfrak {r}}\) exists by Theorem 3.4.

Theorem 3.5

For any \(\mathbf {u} \in \ell ^2({\mathscr {I}})\), one has

$$\begin{aligned} \min _{\mathbf {v} \in {\mathscr {H}}_{\le \mathfrak {r}}} \Vert \mathbf {u} - \mathbf {v} \Vert \le \Vert \mathbf {u} - \mathbf {H}_{\mathfrak {r}}(\mathbf {u}) \Vert \le \sqrt{\sum _{\eta \in {\mathbb {E}}} \sum _{k > r_\eta } \bigl (\sigma ^\eta _k({\mathbf {u}})\bigr )^2} \le \sqrt{\left|{\mathbb {E}}\right|} \min _{\mathbf {v} \in {\mathscr {H}}_{\le \mathfrak {r}}} \Vert \mathbf {u} - \mathbf {v} \Vert . \end{aligned}$$

The proof follows more or less immediately along the lines of [60], using the properties \(\Vert \mathbf {u} - P_1 P_2 \mathbf {u} \Vert ^2 \le \Vert \mathbf {u} - P_1 \mathbf {u} \Vert ^2 + \Vert \mathbf {u} - P_2 \mathbf {u} \Vert ^2\), which holds for any orthogonal projections \(P_1,P_2\), and \(\min _{\mathbf {v} \in {\mathscr {M}}^\eta _{\le r_\eta }} \Vert \mathbf {u} - \mathbf {v} \Vert \le \min _{\mathbf {v} \in {\mathscr {H}}_{\le \mathfrak {r}}} \Vert \mathbf {u} - \mathbf {v} \Vert \), which follows from (3.18).
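The middle inequality of Theorem 3.5 can also be checked numerically on a small full tensor. In the following NumPy sketch, \(\mathbf {H}_{\mathfrak {r}}\) is realised by a sequential TT-SVD of the full array (the name hsvd_truncate_full is ours), which satisfies an error bound of the same form.

```python
# Minimal numerical check of the singular-value tail bound for the TT format.
import numpy as np

def hsvd_truncate_full(v, rmax):
    """Hard thresholding of a small full tensor via a sequential TT-SVD."""
    d, shape = v.ndim, v.shape
    cores, r_prev, w = [], 1, v.reshape(shape[0], -1)
    for mu in range(d - 1):
        U, s, Vt = np.linalg.svd(w.reshape(r_prev * shape[mu], -1), full_matrices=False)
        r = min(rmax, np.sum(s > 1e-14))
        cores.append(U[:, :r].reshape(r_prev, shape[mu], r))
        w, r_prev = s[:r, None] * Vt[:r], r
    cores.append(w.reshape(r_prev, shape[-1], 1))
    u = cores[0][0]
    for g in cores[1:]:
        u = np.tensordot(u, g, axes=([-1], [0]))
    return u[..., 0]

d, n, rmax = 4, 5, 2
u = np.random.rand(*(n,) * d)
err = np.linalg.norm(u - hsvd_truncate_full(u, rmax))
tail = sum((np.linalg.svd(u.reshape(n ** m, -1), compute_uv=False)[rmax:] ** 2).sum()
           for m in range(1, d))
print(err, np.sqrt(tail))        # the truncation error does not exceed sqrt(tail)
```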

There are sequential versions of hard thresholding operators which traverse the tree in a different ordering, and compute at edge \(\eta \) the best \(\eta \)-rank-\(r_\eta \) approximation of the current iterate by recomputing an SVD. These techniques can be computationally beneficial, but the error cannot easily be related to that of the direct HSVD truncation; see [65, Sect. 11.4.2] for corresponding bounds similar to Theorem 3.5.

3.8 Hierarchical Tensors as Differentiable Manifolds

We now consider geometric properties of \({\mathscr {H}}_{\mathfrak {r}} = {\mathscr {H}}_{\mathfrak {r}}({\mathbb {E}}) = \{ \mathbf {u} {:} {{\mathrm{rank}}}_{{\mathbb {E}}} (\mathbf {u}) = \mathfrak {r} \}\), that is, \({\mathscr {H}}_{\mathfrak {r}}= \bigcap _{\eta \in {\mathbb {E}}} {\mathscr {M}}^\eta _{r_\eta }\), where \({\mathscr {M}}^\eta _{r_\eta }\) is the set of tensors with \(\eta \)-rank exactly \(r_\eta \). We assume that \(\mathfrak {r}\) is such that \({\mathscr {H}}_{\mathfrak {r}}\) is not empty. In contrast to the set \({\mathscr {H}}_{\le \mathfrak {r}}\), it can be shown that \({\mathscr {H}}_{\mathfrak {r}}\) is a smooth embedded submanifold if all ranks \(r_\eta \) are finite [79, 138], which enables Riemannian optimisation methods on it as discussed later. This generalises the fact that matrices of fixed rank form smooth manifolds [74].

The cited references consider finite-dimensional tensor product spaces, but the arguments can be transferred to the present separable Hilbert space setting [136], since the concept of submanifolds itself generalises quite straightforwardly, see, e.g. [97, 148]. The case of infinite ranks \(r_\eta \), however, is more subtle and needs to be treated with care [52, 54].

We will demonstrate some essential features using the example of the TT format. Let \(\mathfrak {r} = (r_1,\dots ,r_{d-1})\) denote finite TT representation ranks. Repeating (3.11), the set \({\mathscr {H}}_{\le \mathfrak {r}}\) is then the image of the multilinear map

$$\begin{aligned} \tau _{\text {TT}} {:} {\mathscr {W}} {:=} {\mathscr {W}}_1 \times \dots \times {\mathscr {W}}_d \rightarrow \ell ^2({\mathscr {I}}), \end{aligned}$$

where \({\mathscr {W}}_\nu = \mathbb {R}^{r_{\nu -1}} \otimes \ell ^2({\mathscr {I}}_\nu ) \otimes \mathbb {R}^{r_{\nu }}\) (with \(r_0 = r_d = 1\)), and \(\mathbf {u} = \tau _{\text {TT}}(\mathbf {G}^1,\dots ,\mathbf {G}^d)\) is defined via

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d) = \mathbf {G}^1 (i_1) \mathbf {G}^2 (i_2) \cdots \mathbf {G}^d (i_d ). \end{aligned}$$
(3.21)

The set \({\mathscr {W}}\) is called the parameter space for the TT format with representation rank \(\mathfrak {r}\). It is not difficult to deduce from (3.21) that in this case \({\mathscr {H}}_{\mathfrak {r}} = \tau _{\text {TT}}({\mathscr {W}}^*)\), where \({\mathscr {W}}^*\) is the open and dense subset of parameters \((\mathbf {G}^1,\dots ,\mathbf {G}^d)\) for which the embeddings (reshapes) of every \(\mathbf {G}^\nu \) into the matrix spaces \(\mathbb {R}^{r_{\nu -1}} \otimes (\ell ^2({\mathscr {I}}_\nu ) \otimes \mathbb {R}^{r_{\nu }})\), respectively \((\mathbb {R}^{r_{\nu -1}} \otimes \ell ^2({\mathscr {I}}_\nu )) \otimes \mathbb {R}^{r_{\nu }}\), have full possible rank \(r_{\nu -1}\), respectively \(r_\nu \). Since \(\tau _{\text {TT}}\) is continuous, this also shows that \({\mathscr {H}}_{\le \mathfrak {r}}\) is the closure of \({\mathscr {H}}_{\mathfrak {r}}\) in \(\ell ^2({\mathscr {I}})\).

A key point that has not been emphasised so far is that the representation (3.21) is by no means unique. We can replace it with

$$\begin{aligned} \mathbf {u}(i_1,\dots ,i_d) = \left[ \mathbf {G}^1 (i_1) \mathbf {A}^1 \right] \left[ (\mathbf {A}^1)^{-1} \mathbf {G}^2 (i_2) \mathbf {A}^2 \right] \cdots \left[ (\mathbf {A}^{d-1})^{-1} \mathbf {G}^d (i_d ) \right] , \end{aligned}$$
(3.22)

with invertible matrices \(\mathbf {A}^\nu \), which yields new components \(\tilde{\mathbf {G}}^\nu \) representing the same tensor. This kind of non-uniqueness occurs in all tree tensor networks and reflects the fact that in all except one node only the subspaces are important, not the concrete choice of basis. A central issue in understanding the geometry of tree representations is to remove these redundancies. A classical approach, pursued in [136, 138], is the introduction of equivalence classes in the parameter space. To this end, we interpret the transformation (3.22) as a left action of the Lie group \({\mathscr {G}}\) of regular matrix tuples \(({\mathbf {A}}^1,\dots ,{\mathbf {A}}^{d-1})\) on the regular parameter space \({\mathscr {W}}^*\). The parameters in an orbit \({\mathscr {G}} \circ (\mathbf {G}^1,\dots ,\mathbf {G}^d)\) lead to the same tensor and are called equivalent. Using simple matrix techniques one can show that this is the only kind of non-uniqueness that occurs. Hence we can identify \({\mathscr {H}}_{\mathfrak {r}}\) with the quotient \({\mathscr {W}}^* / {\mathscr {G}}\). Since \({\mathscr {G}}\) acts freely and properly on \({\mathscr {W}}^*\), the quotient admits a unique manifold structure such that the canonical mapping \({\mathscr {W}}^* \rightarrow {\mathscr {W}}^* / {\mathscr {G}}\) is a submersion. One now has to show that the induced mapping \(\hat{\tau }_{\text {TT}}\) from \({\mathscr {W}}^* / {\mathscr {G}}\) to \(\ell ^2({\mathscr {I}})\) is an embedding to conclude that its image \({\mathscr {H}}_{\mathfrak {r}}\) is an embedded submanifold. The construction can be extended to general tree tensor networks.
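The non-uniqueness (3.22) is easily observed numerically: inserting any invertible matrix and its inverse between two neighbouring TT components leaves the represented tensor unchanged. A minimal NumPy sketch (tt_full is our own helper):

```python
# Minimal sketch, assuming NumPy: the gauge transformation (3.22) on two cores.
import numpy as np

def tt_full(G):
    u = G[0][0]
    for g in G[1:]:
        u = np.tensordot(u, g, axes=([-1], [0]))
    return u[..., 0]

d, n, r = 4, 3, 2
G = [np.random.randn(1 if mu == 0 else r, n, 1 if mu == d - 1 else r) for mu in range(d)]

A = np.random.randn(r, r) + 3 * np.eye(r)                  # well-conditioned, invertible
H = [g.copy() for g in G]
H[0] = np.einsum('ijk,kl->ijl', H[0], A)                   # G^1(i_1) A
H[1] = np.einsum('kl,ljm->kjm', np.linalg.inv(A), H[1])    # A^{-1} G^2(i_2)

print(np.linalg.norm(tt_full(G) - tt_full(H)))             # ~ 1e-14: same tensor
```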

The tangent space \({\mathscr {T}}_{\mathbf {u}} {\mathscr {H}}_{\mathfrak {r}}\) at \(\mathbf {u}\), abbreviated by \({\mathscr {T}}_{\mathbf {u}}\), is of particular importance for optimisation on \({\mathscr {H}}_{\mathfrak {r}}\). The previous considerations imply that the multilinear map \(\tau _{\text {TT}}\) is a submersion from \({\mathscr {W^*}}\) to \({\mathscr {H}}_{\mathfrak {r}}\). Hence the tangent space at \(\mathbf {u} = \tau _{\text {TT}}(\mathbf {G}^1,\dots ,\mathbf {G}^d)\) is the range of \(\tau '_{\text {TT}}(\mathbf {G}^1,\dots ,\mathbf {G}^d)\), and by multilinearity, tangent vectors at \(\mathbf {u}\) are therefore of the generic form

$$\begin{aligned} \delta \mathbf {u}(i_1,\dots ,i_d)= & {} \delta \mathbf {G}^1(i_1) \mathbf {G}^2(i_2) \cdots \mathbf {G}^d (i_d) \nonumber \\&+ \;\cdots \; + \,\mathbf {G}^1(i_1) \cdots \mathbf {G}^{d-1}(i_{d-1}) \delta \mathbf {G}^d(i_d). \end{aligned}$$
(3.23)

As a consequence, a tangent vector \(\delta \mathbf {u}\) has TT-rank \(\mathfrak {s}\) with \(s_\nu \le 2r_\nu \).

Since \(\tau '_{\text {TT}}(\mathbf {G}^1,\dots ,\mathbf {G}^d)\) is not injective (tangentially to the orbit \({\mathscr {G}} \circ (\mathbf {G}^1,\dots ,\mathbf {G}^d)\), the derivative vanishes), the representation (3.23) of tangent vectors cannot be unique. One has to impose gauge conditions in the form of a horizontal space. Typical choices for the TT format are the spaces \(\hat{{\mathscr {W}}}_\nu = \hat{{\mathscr {W}}}_\nu (\mathbf {G}^\nu )\), \(\nu = 1,\dots ,d-1\), comprised of \(\delta \mathbf {G}^\nu \) satisfying

$$\begin{aligned} \sum _{k_{\nu -1}=1}^{r_{\nu - 1}} \sum _{i_\nu = 1}^{n_\nu } \mathbf {G}^\nu (k_{\nu -1},i_\nu ,j_\nu )\, \delta \mathbf {G}^\nu (k_{\nu -1},i_\nu ,k_\nu ) = 0, \quad j_\nu , k_\nu = 1,\dots ,r_\nu . \end{aligned}$$

These \(\hat{{\mathscr {W}}}_\nu \) are the orthogonal complements in \({\mathscr {W}}_\nu \) of the space of \(\delta \mathbf {G}^\nu \) for which there exists a matrix \(\mathbf {A}^\nu \) such that \(\delta \mathbf {G}^\nu (i_\nu ) = \mathbf {G}^\nu (i_\nu ) \mathbf {A}^\nu \) for all \(i_\nu \). This can be used to conclude that every tangent vector of the generic form (3.23) can be uniquely represented such that

$$\begin{aligned} \delta \mathbf {G}^\nu \in \hat{{\mathscr {W}}}_\nu , \quad \nu =1,\dots ,d-1. \end{aligned}$$
(3.24)

In fact, the different contributions in (3.23) then belong to linearly independent subspaces, see [79] for details. It follows that the derivative \(\tau '_{\text {TT}}(\mathbf {G}^1,\dots ,\mathbf {G}^d)\) maps the subspace \(\hat{{\mathscr {W}}}_1 (\mathbf {G}^{1}) \times \dots \times \hat{{\mathscr {W}}}_{d-1}(\mathbf {G}^{d-1}) \times {\mathscr {W}}_d\) of \({\mathscr {W}}\) bijectively onto \({\mathscr {T}}_{\mathbf {u}}\). In our example, there is no gauge on the component \(\mathbf {G}^d\), but with modified gauge spaces, any component could play this role.
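For a single left-orthonormal TT component, the gauge space \(\hat{{\mathscr {W}}}_\nu \) and the orthogonal projection onto it take a particularly simple form in terms of the left unfolding of the component tensor. A minimal NumPy sketch (assuming the component has already been orthonormalised):

```python
# Minimal sketch, assuming NumPy: the gauge condition and the projection onto
# the gauge space for a left-orthonormal TT core.
import numpy as np

r0, n, r1 = 3, 5, 4
L = np.linalg.qr(np.random.randn(r0 * n, r1))[0]      # left-orthonormal core, unfolded
G = L.reshape(r0, n, r1)

dG = np.random.randn(r0, n, r1)                       # an arbitrary variation
dL = dG.reshape(r0 * n, r1)
dL_gauged = dL - L @ (L.T @ dL)                       # projection onto the gauge space

# Gauge condition: the left unfolding of G is orthogonal to that of the variation.
print(np.linalg.norm(L.T @ dL_gauged))                # ~ 1e-15
```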

The orthogonal projection \(\varPi _{{\mathscr {T}}_{\mathbf {u}}}\) onto the tangent space \({\mathscr {T}}_{\mathbf {u}}\) is computable in a straightforward way if the basis vectors implicitly encoded at nodes \(\nu =1,\dots ,d-1\) are orthonormal, which in turn is not difficult to achieve (using QR decompositions from left to right). Then the decomposition of the tangent space induced by (3.23) and the gauge conditions (3.24) is actually orthogonal. Hence the projection onto \({\mathscr {T}}_{\mathbf {u}}\) can be computed by projecting onto the different parts. To this end, let \(\mathbf {E}_\nu = \mathbf {E}_\nu (\mathbf {G}^1,\dots ,\mathbf {G}^d)\) be the linear map \(\delta \mathbf {G}^\nu \mapsto \tau _{\text {TT}}(\mathbf {G}^1,\dots ,\delta \mathbf {G}^\nu , \dots ,\mathbf {G}^d)\). Then the components \(\delta \mathbf {G}^\nu \) representing the orthogonal projection of \(\mathbf {v} \in \ell ^2({\mathscr {I}})\) onto \({\mathscr {T}}_{\mathbf {u}}\) in the form (3.23) are given by

$$\begin{aligned} \delta \mathbf {G}^\nu = {\left\{ \begin{array}{ll} P_{\hat{{\mathscr {W}}}_\nu } \mathbf {E}_\nu ^+ \mathbf {v}, &{}\quad \nu = 1,\dots ,d-1,\\ \mathbf {E}_\nu ^T \mathbf {v}, &{}\quad \nu = d.\end{array}\right. } \end{aligned}$$

Here \(P_{\hat{{\mathscr {W}}}_\nu }\) denotes the orthogonal projection onto the gauge space \(\hat{{\mathscr {W}}}_\nu \), and \(\mathbf {E}_\nu ^+ = (\mathbf {E}_\nu ^T \mathbf {E}_\nu ^{})^{-1} \mathbf {E}_\nu ^T\) is the Moore–Penrose inverse of \(\mathbf {E}_\nu \). Indeed, the assumption that \(\mathbf {u}\) has TT-rank \(\mathfrak {r}\) implies that the matrix \(\mathbf {E}_\nu ^T \mathbf {E}_\nu ^{}\) is invertible. At \(\nu = d\) it is actually the identity by our assumption that orthonormal bases are encoded in the other nodes. In operator form, the projector \(\varPi _{{\mathscr {T}}_{\mathbf {u}}}\) can then be written as

$$\begin{aligned} \varPi _{{\mathscr {T}}_{\mathbf {u}}} \mathbf {v} = \sum _{\nu = 1}^{d-1} \mathbf {E}_\nu ^{} P_{\hat{{\mathscr {W}}}_\nu } \mathbf {E}_\nu ^+ \mathbf {v} + \mathbf {E}^{}_d \mathbf {E}_d ^T \mathbf {v}. \end{aligned}$$

The operators \(\mathbf {E}_\nu ^T\) are easy to implement, since they require only the computation of scalar products of tensors. Furthermore, the inverses \((\mathbf {E}_\nu ^T \mathbf {E}_\nu ^{})^{-1}\) are applied only to the small component spaces \({\mathscr {W}}_\nu \). This makes the projection onto the tangent space a flexible and efficient numerical tool for the application of geometric optimisation, see Sect. 4.2. Estimates of the Lipschitz continuity of \(\mathbf {u} \mapsto \varPi _{{\mathscr {T}}_{\mathbf {u}}}\) (curvature bounds) are of interest in this context, with upper bounds given in [4, 105].

The generalisation of these considerations to arbitrary tree networks is essentially straightforward, but can become notationally quite intricate, see [138] for the HT format.

4 Optimisation with Tensor Networks and Hierarchical Tensors and the Dirac–Frenkel Variational Principle

In this section, our starting point is the abstract optimisation problem of finding

$$\begin{aligned} \mathbf {u}^* = \hbox {argmin} \, J( {\mathbf {u}} ), \quad {\mathbf {u}} \in \mathscr {A}, \end{aligned}$$

for a given cost functional \( J: \ell ^2({\mathscr {I}})\rightarrow {\mathbb {R}} \) and an admissible set \( {\mathscr {A}} \subseteq \ell ^2({\mathscr {I}})\).

In general, a minimiser \({\mathbf {u}}^*\) will not have low hierarchical ranks in any tree tensor network, but we are interested in finding good low-rank approximations to \(\mathbf {u}^*\). Let \({\mathscr {H}}_{\le \mathfrak {r}}\) denote again a set of tensors representable in a given tree tensor network with corresponding tree ranks at most \(\mathfrak {r}\). Then we wish to solve the tensor product optimisation problem

$$\begin{aligned} {\mathbf {u}}_{\mathfrak {r}} = \hbox {argmin} \{ J({\mathbf {u}} ) {:} {\mathbf {u}} \in {\mathscr {C}} = {\mathscr {A}} \cap {{\mathscr {H}}_{\le \mathfrak { r}}} \} . \end{aligned}$$
(4.1)

By fixing the ranks, one fixes the representation complexity of the approximate solution. It needs to be noted that the methods discussed in this section yield approximations to \(\mathbf {u}_{\mathfrak {r}}\), but no information on the error with respect to \(\mathbf {u}^*\). Typically, one will not aim to approximate \(\mathbf {u}_{\mathfrak {r}}\) with an accuracy better than \(\Vert {\mathbf {u}_{\mathfrak {r}} - \mathbf {u}^*}\Vert \). In fact, finding an accurate approximation of the global minimiser \(\mathbf {u}_{\mathfrak {r}}\) of (4.1) is a difficult problem: The results in [75] show that even finding the best rank-one approximation of a given tensor, with finite \(\mathscr {I}\), up to a prescribed accuracy is generally NP-hard if \(d\ge 3\). This is related to the observation that (4.1) typically has multiple local minima.

In general, one thus cannot ensure a prescribed error in approximating a global minimiser \(\mathbf {u}_{\mathfrak {r}}\) of (4.1). To achieve a desired accuracy with respect to \(\mathbf {u}^*\), one needs in addition some means to systematically enrich \({\mathscr {C}}\) by increasing the ranks \(\mathfrak {r}\). Subject to these limitations, the methods considered in this section provide numerically inexpensive ways of finding low-rank approximations by hierarchical tensors.

Note that in what follows, the index set \({\mathscr {I}}\) as in (2.2) may be finite or countably infinite. In the numerical treatment of differential equations, discretisations and corresponding finite index sets need to be selected. This aspect is not covered by the methods considered in this section, which operate on fixed \({\mathscr {I}}\), but we return to this point in Sect. 5.

Typical examples of optimisation tasks (4.1) that we have in mind are the following, see also [49, 51].

  1. (a)

    Best rank-\(\mathfrak {r}\) approximation in \(\ell ^2({\mathscr {I}})\): for given \( \mathbf {v} \in \ell ^2({\mathscr {I}})\) minimise

    $$\begin{aligned} J( \mathbf {u} ) {:=} \Vert \mathbf {u} - \mathbf {v} \Vert ^2 \end{aligned}$$

    over \({\mathscr {A}} = \ell ^2({\mathscr {I}})\). This is the most basic task we encounter in low-rank tensor approximation.

  2. (b)

    Solving linear operator equations: for elliptic self-adjoint \( {\mathbf {A}} : \ell ^2({\mathscr {I}})\rightarrow \ell ^2({\mathscr {I}})\) and \({\mathbf {b}} \in \ell ^2({\mathscr {I}})\), we consider \({\mathscr {A}} {:=} \ell ^2({\mathscr {I}})\) and

    $$\begin{aligned} J( {\mathbf {u}} ) {:=} \frac{1}{2} \langle {\mathbf {A}} {\mathbf {u}},{\mathbf {u}} \rangle - \langle {\mathbf {b}} , {\mathbf {u}} \rangle \end{aligned}$$
    (4.2)

    to solve \(\mathbf {A} \mathbf {u} = \mathbf {b}\). For nonsymmetric isomorphisms \({\mathbf {A}}\), one may resort to a least squares formulation

    $$\begin{aligned} J({\mathbf {u}}) {:=} \Vert {\mathbf {A}} {\mathbf {u}}- {\mathbf {b}} \Vert ^2. \end{aligned}$$
    (4.3)

    The latter approach of minimisation of the norm of a residual also carries over to nonlinear problems.

  3. (c)

    Computing the lowest eigenvalue of symmetric \({\mathbf {A}}\) by minimising the Rayleigh quotient

    $$\begin{aligned} {\mathbf {u}}^* = \hbox {argmin} \{ J({\mathbf {u}})= \langle {\mathbf {A}} {\mathbf {u}} , {\mathbf {u}} \rangle {:} \Vert {\mathbf {u}} \Vert ^2 =1\} . \end{aligned}$$

    This approach can be easily extended if one wants to approximate the N lowest eigenvalues and corresponding eigenfunctions simultaneously, see, e.g. [43, 88, 108].

For the existence of a minimiser, the weak sequential closedness of the sets \({\mathscr {H}}_{\le \mathfrak { r}}\) is crucial. As mentioned before, this property can be violated for tensors described by the canonical format [65, 127], in which case a minimiser need not exist. However, it does hold for hierarchical tensors \( {\mathscr {H}}_{\le \mathfrak {r}} \), as was explained in Sect. 3.6. A generalised version of Theorem 3.4 reads as follows.

Theorem 4.1

Let \(J\) be strongly convex over \( \ell ^2({\mathscr {I}})\), and let \({\mathscr {A}} \subseteq \ell ^2({\mathscr {I}})\) be weakly sequentially closed. Then \(J\) attains its minimum on \({\mathscr {C}}={\mathscr {A}} \cap {{\mathscr {H}}_{\le \mathfrak { r}}}\).

Under the assumptions of example (b), due to ellipticity of \(\mathbf {A}\) in (4.2) or \(\mathbf {A}^* \mathbf {A}\) in (4.3), the functional \(J\) is strongly convex, and one obtains well-posedness of these minimisation problems with (a) as a special case.

Since in case (c) the corresponding set \({\mathscr {A}}\) (the unit sphere) is not weakly closed, such simple arguments do not apply there.

4.1 Alternating Linear Scheme

We are interested in finding a minimiser or, less ambitiously, in decreasing the cost functional over our model class, in the case that the admissible set is \( {\mathscr {A}}= \ell ^2({\mathscr {I}})\).

A straightforward approach which suggests itself in view of the multilinearity of \(\tau _{\text {TT}}(\mathbf {C}^1,\dots ,\mathbf {C}^D)\) is block coordinate descent (BCD). For the task of finding the best rank-\(\mathfrak {r}\) approximation this approach is classical and called alternating least squares (ALS), because the optimal choice of a single block is obtained from a least squares problem. For more general quadratic optimisation problems, we refer to BCD methods as alternating linear schemes.

The idea is to iteratively fix all components \(\mathbf {C}^\nu \) except one. The restriction \(\mathbf {C}^\nu \mapsto \tau _{\mathfrak {r}}(\mathbf {C}^1,\dots ,\mathbf {C}^D)\) is linear. Thus for quadratic \(J\) we obtain again a quadratic optimisation problem for the unknown component \(\mathbf {C}^\nu \), which is of much smaller dimension than the ambient space \(\ell ^2({\mathscr {I}})\). Generically, there exist unique solutions of the restricted problems.

Algorithm 1 Alternating linear scheme (block coordinate descent over the components \(\mathbf {C}^\nu \))
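For concreteness, the following is a minimal NumPy sketch of such an alternating scheme for problem (a), the best approximation of a given (small, full) tensor in the TT format; the local linear least squares problems are solved here with pseudoinverses rather than with the orthogonalised reformulation described below, and all function names are ours.

```python
# Minimal ALS sketch, assuming NumPy; intended only for small full tensors.
import numpy as np

def tt_full(G):
    u = G[0][0]
    for g in G[1:]:
        u = np.tensordot(u, g, axes=([-1], [0]))
    return u[..., 0]

def als_best_approx(v, ranks, sweeps=10):
    d, n = v.ndim, v.shape
    r = [1] + list(ranks) + [1]
    G = [np.random.randn(r[mu], n[mu], r[mu + 1]) for mu in range(d)]
    for _ in range(sweeps):
        for mu in range(d):
            # Interface matrices built from the currently fixed cores.
            L = np.ones((1, 1))
            for nu in range(mu):
                L = np.tensordot(L, G[nu], axes=([-1], [0])).reshape(-1, r[nu + 1])
            R = np.ones((1, 1))
            for nu in range(d - 1, mu, -1):
                R = np.tensordot(G[nu], R, axes=([-1], [0])).reshape(r[nu], -1)
            V = v.reshape(L.shape[0], n[mu], R.shape[1])
            # Optimal core: solution of the local linear least squares problem.
            G[mu] = np.einsum('ka,aib,bl->kil',
                              np.linalg.pinv(L), V, np.linalg.pinv(R))
        err = np.linalg.norm(v - tt_full(G)) / np.linalg.norm(v)
    return G, err

v = np.random.rand(4, 4, 4, 4)
G, err = als_best_approx(v, ranks=(3, 3, 3))
print(err)
```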

In this way, the nonlinearity imposed by the model class is circumvented at the price of possibly making very little progress in each step or encountering accumulation points which are not critical points of the problem. Also the convergence analysis is challenging, as textbook assumptions on BCD methods are typically not met, see [50, 106, 118, 135] for partial results. However, regularisation can cure most convergence issues [137, 146].

In practical computations the abstract description in Algorithm 1 is modified to reorder the tree during the process in such a way that the component to be optimised becomes the root, and all bases encoded in the other nodes are orthonormalised accordingly. This does not affect the generated sequence of tensors [118], but permits a much more efficient solution of the local least squares problems. In particular, the condition number of the restricted problems is bounded by that of the original problem [78]. All contractions required to set up the local linear system for a single component scale only polynomially in r, n, and are hence computable at acceptable cost.

This optimisation procedure for tensor networks is known as the single-site density matrix renormalisation group (DMRG) algorithm in physics [143]. The two-site DMRG algorithm (modified ALS [78]) was developed by S. White [142] for spin chain models. It is a substantial modification of the scheme above, combining neighbouring components \( \mathbf {C}^\nu \) and \( \mathbf {C}^{\nu +1}\) into one, which is subsequently optimised. The result is then separated again by an appropriately truncated SVD. This allows an adjustment of representation ranks, but comes at a higher numerical cost. In the numerical analysis community such algorithms have been used in [43, 78, 84, 88, 91, 112].

4.2 Riemannian Gradient Descent

The Riemannian optimisation framework [1] assumes that the minimiser \(\mathbf {u}_{\mathfrak {r}} \in {\mathscr {H}}_{\le \mathfrak {r}}\) of the problem constrained to \({\mathscr {H}}_{\le \mathfrak {r}}\) actually belongs to the smooth manifold \({\mathscr {H}}_{\mathfrak {r}}\) (cf. Sect. 3.8). For matrix manifolds this is the case if the global minimiser \(\mathbf {u}^*\) does not belong to the singular points \({\mathscr {H}}_{\le \mathfrak {r}} {\setminus } {\mathscr {H}}_{\mathfrak {r}}\), see [123].

Assuming \(\mathbf {u}_{\mathfrak {r}} \in {\mathscr {H}}_{\mathfrak {r}}\), the first-order necessary optimality condition is that the gradient of \(J\) at \(\mathbf {u}_{\mathfrak {r}}\) is perpendicular to the tangent space \( {\mathscr {T}}_{{\mathbf {u}}_{\mathfrak { r}} } = {\mathscr {T}}_{{\mathbf {u}}_{\mathfrak { r}} } {\mathscr {H}}_{ \mathfrak {r} } \). Hence a relaxed problem compared to (4.1) consists in finding \(\mathbf {u} \in {\mathscr {H}}_{\mathfrak {r}}\) such that

$$\begin{aligned} \langle \nabla J( {\mathbf {u}} ) , \delta {\mathbf {u}} \rangle = 0 \quad \text {for all }\delta {\mathbf {u}} \in {\mathscr {T}}_{{\mathbf {u}}}, \end{aligned}$$
(4.4)

where \(\nabla J\) is the gradient of \(J\). Since \({\mathscr {H}}_{\mathfrak {r}}\) is an embedded submanifold, a trivial Riemannian metric is inherited from the ambient space \(\ell ^2({\mathscr {I}})\), and for the Riemannian gradient one has \( \hbox {Grad } J( {\mathbf {u}}) = P_{{\mathscr {T}}_{\mathbf {u}}} \nabla J( {\mathbf {u}} )\), which by (4.4) should be driven to zero.

As a relatively general way of treating the above problems, we will consider projected gradient methods. In these methods, one performs gradient steps \( {\mathbf {y}}^{n+1} {:=} {\mathbf {u}}^n - \alpha _m \nabla J({\mathbf {u}}^n)\) in the ambient space \(\ell ^2({\mathscr {I}})\). More generally, one may take preconditioned gradient steps, which is not considered for brevity. For the problems derived from linear operator equations considered above, \({\mathbf {y}}^{n+1} \) is in principle computable whenever \({\mathbf {A}}\) and \({\mathbf {b}}\) have a suitable low-rank structure. The gradient step is followed by a mapping \( {\mathbf {R}} : \ell ^2({\mathscr {I}})\rightarrow {\mathscr {H}}_{\le \mathfrak {r}}\) to get back on the admissible set. The iteration is summarised as follows:

$$\begin{aligned} {\mathbf {y}}^{n+1}&: = {\mathbf {u}}^{n} - \alpha _n \nabla J( \mathbf { u}^n )&\text {(gradient step)},\\ {\mathbf {u}}^{n+1}&: = {\mathbf {R}} ({{\mathbf {y}}}^{n+1})&\text {(projection step)}. \end{aligned}$$

The specification of the above algorithm depends on the step size selection \(\alpha _n\) and on the choice of the projection operator \({\mathbf {R}} {:} \ell ^2({\mathscr {I}})\rightarrow {\mathscr {H}}_{\le \mathfrak {r}}\).

Let us remark that taking the best approximation

$$\begin{aligned} {\mathbf {R}}( {{\mathbf {y}}}^{n+1} ) {:=} \hbox {argmin} \{ \Vert {{\mathbf {y}}}^{n+1} - {\mathbf {z}}\Vert : {\mathbf {z}} \in {\mathscr {H}}_{\le \mathfrak {r}} \} \end{aligned}$$

is generally not numerically realisable [75]. A practically feasible choice for the nonlinear projection \(\mathbf {R}\) would be the HSVD truncation \({\mathbf {H}}_\mathfrak {r}\) defined in (3.20), which will be considered in Sect. 6.1.
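The resulting truncated iteration \({\mathbf {u}}^{n+1} = {\mathbf {H}}_{\mathfrak {r}}({\mathbf {u}}^{n} - \alpha _n \nabla J({\mathbf {u}}^n))\) can be sketched as follows for the quadratic functional (4.2), here on small full tensors with a diagonal operator so that \({\mathbf {H}}_{\mathfrak {r}}\) can be realised by a plain sequential SVD; all names are ours, and in practice every quantity would of course be kept in low-rank form throughout.

```python
# Minimal sketch, assuming NumPy: gradient step plus HSVD truncation.
import numpy as np

def hsvd_truncate_full(v, rmax):
    """Hard thresholding of a small full tensor via a sequential TT-SVD."""
    d, shape = v.ndim, v.shape
    cores, r_prev, w = [], 1, v.reshape(shape[0], -1)
    for mu in range(d - 1):
        U, s, Vt = np.linalg.svd(w.reshape(r_prev * shape[mu], -1), full_matrices=False)
        r = min(rmax, np.sum(s > 1e-14))
        cores.append(U[:, :r].reshape(r_prev, shape[mu], r))
        w, r_prev = s[:r, None] * Vt[:r], r
    cores.append(w.reshape(r_prev, shape[-1], 1))
    u = cores[0][0]
    for g in cores[1:]:
        u = np.tensordot(u, g, axes=([-1], [0]))
    return u[..., 0]

d, n, rmax, alpha = 4, 4, 3, 1.8
sq = np.arange(1, n + 1) ** 2
lam = np.add.outer(np.add.outer(sq, sq), np.add.outer(sq, sq))   # nu_1^2+...+nu_4^2
A = lambda u: lam * u / lam.max()                                # diagonal, eigenvalues in (0,1]
b = hsvd_truncate_full(np.random.rand(*(n,) * d), 2)             # a low-rank right-hand side

u = np.zeros((n,) * d)
for _ in range(100):
    y = u - alpha * (A(u) - b)          # gradient step for J(u) = 0.5<Au,u> - <b,u>
    u = hsvd_truncate_full(y, rmax)     # projection step via HSVD truncation
print(np.linalg.norm(A(u) - b) / np.linalg.norm(b))              # relative residual
```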

Supposing that a retraction (defined below) is available on the tangent space, a nonlinear projection \({\mathbf {R}}\) can also be realised in two steps, by first projecting (linearly) onto the tangent space \({\mathscr {T}}_{{{\mathbf {u}}}^n}\) at \({{\mathbf {u}}}^n \), and subsequently applying the retraction R:

$$\begin{aligned} {\mathbf { z}}^{n+1}&: = P_{{\mathscr {T}}_{{{\mathbf {u}}}^n}} \big ( {{\mathbf {u}}}^{n} - \alpha _n \nabla J({{\mathbf {u}}}^n ) \big ) = {{\mathbf {u}}}^{n} - \alpha _n P_{{\mathscr {T}}_{{{\mathbf {u}}}^n}} \nabla J({{\mathbf {u}}}^n )&\text {(projected gradient step)} \\&= : {{\mathbf {u}}}^{n} + \varvec{\xi }^n, \quad \varvec{\xi }^n \in {\mathscr {T}}_{{{\mathbf {u}}}^n},&\\ {\mathbf {u}}^{n+1}&: = {R} ({{\mathbf {u}}}^n , {\mathbf {z}}^{n+1} - {{\mathbf {u}}}^n )= {R} ({{\mathbf {u}}}^n , \varvec{\xi }^n )&\text {(retraction step).} \end{aligned}$$

In the first line we used that \(\mathbf {u}^n \in {\mathscr {T}}_{\mathbf {u}^n}\) (since \({\mathscr {H}}_{\mathfrak {r}}\) is a cone). This algorithm is called the Riemannian gradient iteration.

Retractions and Riemannian gradient iteration have been introduced in [126]. We follow the treatment in the monograph [1]. A retraction maps \({\mathbf {u}} + \varvec{\xi } \), where \({\mathbf {u}} \in {\mathscr {H}}_{ \mathfrak {r}} \) and \( \varvec{\xi } \in {\mathscr {T}}_{{{\mathbf {u}}}}\), smoothly to a point \(R ( {{\mathbf {u}}} , \varvec{\xi } )\) on the manifold such that

$$\begin{aligned} \Vert {{\mathbf {u}}} + \varvec{\xi } - R ( {\mathbf {u}} , \varvec{\xi } ) \Vert = {\mathscr {O}} ( \Vert \varvec{\xi } \Vert ^2). \end{aligned}$$

Roughly speaking, a retraction is an approximate exponential map on the manifold. The exponential map itself satisfies the definition of a retraction, but is in general too expensive to evaluate. Several examples of retractions for hierarchical tensors are known [89, 104, 105].

Let us note that in principle, it can occur that an iterate \( {{\mathbf {u}}}^n \) is of lower rank, that is, \({{\mathbf {u}}}^n \in {\mathscr {H}}_{\mathfrak {s}}\), where \(s_\eta < r_\eta \) for at least one \(\eta \in {\mathbb {E}}\). In this case \({{\mathbf {u}}}^n \in {\mathscr {H}}_{\le \mathfrak {r}}\) is a singular point, and no longer on the manifold \({\mathscr {H}}_{\mathfrak {r}}\), so the Riemannian gradient algorithm breaks down. Since \({\mathscr {H}}_{\mathfrak {r}}\) is dense in \({\mathscr {H}}_{\le \mathfrak {r}} \), for any \(\varepsilon > 0\) there exists a tensor \( {{\mathbf {u}}}_{\varepsilon }^n \in {\mathscr {H}}_{\mathfrak {r}}\) with \(\Vert {{\mathbf {u}}}^n -{{\mathbf {u}}}_{\varepsilon }^n\Vert < \varepsilon \). Practically such a regularised \( {{\mathbf {u}}}^n_{\varepsilon } \) is not hard to obtain for a chosen \( \varepsilon \sim \Vert \nabla J( {{\mathbf {u}}}^n )\Vert \). Alternatively, the algorithm described above can be regularised, in order to automatically avoid arriving at a singular point [89].

In [123], the Riemannian gradient iteration was extended to closures of matrix manifolds, and convergence results were deduced from the Łojasiewicz inequality. We expect that these results can be extended to general tensor manifolds of fixed tree rank.

4.3 Dirac–Frenkel Variational Principle

The first-order optimality condition can be considered as the stationary case of a more general time-dependent formulation in the framework of the Dirac–Frenkel variational principle [102]. We consider an initial value problem

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} {{{\mathbf {u}}}}=\mathbf {F}({{\mathbf {u}}}),\quad {{{\mathbf {u}}}}(0)={{\mathbf {u}}}_{0}\in {\mathscr {H}}_{\mathfrak {r}}. \end{aligned}$$
(4.5)

The goal is to approximate the trajectory \({{\mathbf {u}}}(t)\) of (4.5), which might not be exactly of low rank, by a curve \({{\mathbf {u}}} _{\mathfrak {r}}(t)\) in \({\mathscr {H}}_{\mathfrak {r}}\). However, the pointwise best approximation \({{\mathbf {u}}}_{\mathfrak {r}}(t){:=}\hbox {argmin}_{{\mathbf {v}}\in {\mathscr {H}} {_{\mathfrak {r}}}}\Vert {{\mathbf {u}}}(t)- {\mathbf {v}}(t)\Vert \) provides in general no practical solution to the problem, since first, the computation of the exact trajectory is typically infeasible in high-dimensional problems, and second, it requires the solution of too many best approximation problems.

The Dirac–Frenkel variational principle [102] determines an approximate trajectory \({{\mathbf {u}}}_{\mathfrak {r} }(t)\in {\mathscr {H}}_{\mathfrak {r}}\) that minimises

$$\begin{aligned} \left\| \frac{\mathrm {d}}{\mathrm {d}t} {{\mathbf {u}}}(t)-\frac{\mathrm {d}}{\mathrm {d}t} {{\mathbf {u}}}_{\mathfrak {r}}(t)\right\| \rightarrow \min ,\quad {{\mathbf {u}}}_{\mathfrak {r}}(0)={{\mathbf {u}}}(0), \end{aligned}$$

corresponding to the weak formulation \(\langle \tfrac{\mathrm {d}}{\mathrm {d}t}{{\mathbf {u}}}_{\mathfrak {r}}-\mathbf {F} ({{\mathbf {u}}}_{\mathfrak {r}}),\delta {{\mathbf {u}}}\rangle =0\) for all \(\delta {{\mathbf {u}}}\in {\mathscr {T}}_{{{\mathbf {u}}}_{\mathfrak {r}}}\).

If the manifold were a closed linear space, the equations above would reduce to the corresponding Galerkin equations. Note also that for a gradient flow \(\mathbf {F} = - \nabla J\), stationary points with \(\frac{\mathrm {d}}{\mathrm {d}t} {{\mathbf {u}}}_{\mathfrak {r}}=0\) satisfy the first-order condition (4.4). However, this instationary approach applies also to nonsymmetric operators \( {\mathbf {A}} : \ell ^2({\mathscr {I}})\rightarrow \ell ^2({\mathscr {I}})\).

Even for the simple differential equation of the form \( \tfrac{\mathrm {d}}{\mathrm {d}t} {{\mathbf {u}}} (t) = \mathbf {F} (t)\), with solution \( {{\mathbf {u}}}(t) = {{\mathbf {u}}}(0) + \int _0^t \mathbf {F}(s) ds \), the Dirac–Frenkel principle leads to a coupled nonlinear system of ODEs, which is not always easy to solve. This motivated the development of splitting schemes that integrate the components successively, similarly to ALS [103, 104]. In particular, the splitting is easy to realise for linear differential equations.

When \({\mathbf {F}}\) is a partial differential operator, the Dirac–Frenkel principle leads to methods for approximating the solutions of instationary PDEs in high dimension by solving nonlinear systems of low-dimensional differential equations on the tensor manifold \({\mathscr {H}}_{\mathfrak {r}}\). This shares some similarities with the derivations of Hartree–Fock and time-dependent Hartree–Fock equations for fermions and the Gross–Pitaevskii equation for bosons. The Dirac–Frenkel principle is well-known in molecular quantum dynamics as the multi-configuration time-dependent Hartree method (MCTDH) [15, 102] for the Tucker format. For hierarchical tensors such a method has been formulated in [102, 141]. First convergence results have been obtained in [4, 105]. The more involved case of reflexive Banach spaces has been considered in [54]. Time evolution of matrix product states (TT format) for spin systems has been considered in detail in [71].
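The simplest nontrivial instance of this principle is the matrix case \(d=2\). The following NumPy sketch combines the tangent space projection for fixed-rank matrices with an explicit Euler step and an SVD retraction; it is only meant to illustrate the projected dynamics and is not one of the splitting integrators of [103, 104].

```python
# Minimal sketch, assuming NumPy: projected Euler steps on the rank-r matrix manifold.
import numpy as np

def tangent_project(Y, Z, r):
    """Orthogonal projection of Z onto the tangent space at the rank-r matrix Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    return U @ (U.T @ Z) + (Z @ V) @ V.T - U @ (U.T @ Z @ V) @ V.T

def retract(Y, r):
    """Retraction onto the rank-r manifold via truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

m, r, h = 30, 4, 1e-2
A = np.random.randn(m, m) / np.sqrt(m)
F = lambda Y: A @ Y - Y @ A.T                      # a simple linear vector field
Y = retract(np.random.randn(m, m), r)              # rank-r initial value
for _ in range(100):
    Y = retract(Y + h * tangent_project(Y, F(Y), r), r)
print(np.linalg.matrix_rank(Y))                    # the iterate stays of rank r
```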

5 Convergence of Low-Rank Approximations

For a tensor of order d with mode sizes n and all hierarchical ranks bounded by r, the hierarchical format has storage complexity \({\mathscr {O}} ( drn + dr^3 )\); in the case of the tensor train format, one obtains \( {\mathscr {O}} ( d n r^2 ) \). Similar results hold for operations on these formats: the HSVD, for instance, requires \({\mathscr {O}} ( dr^2n + dr^4 )\) or \( {\mathscr {O}} ( d n r^3 ) \) operations, respectively. For small r, one can thus obtain a very strong improvement over the data complexity \(n^d\) of the full tensor.

In the numerical treatment of PDEs, however, the underlying function spaces require discretisation. In this context, the above complexity considerations are thus only formal, since d, n, and r cannot be considered as independent parameters.

In the context of PDEs, the appropriate question becomes: what is the total complexity, in terms of the required number of parameters or arithmetic operations, for achieving a prescribed accuracy \(\varepsilon >0 \) in the relevant function space norm? In this setting, not only the ranks, but also the dimension n of the univariate trial spaces—and in the example of Sect. 7.1 even the tensor order d—need to be considered as functions of \(\varepsilon >0\). This leads to the fundamental question of appropriate notions of approximability in terms of which one can quantify the dependencies of \( d (\varepsilon )\), \(n (\varepsilon )\), \(r(\varepsilon ) \) on \(\varepsilon \).

5.1 Function Spaces and Preconditioning

In order to treat such approximability questions, we need to consider hierarchical tensors in infinite-dimensional spaces. Let \(\varOmega \subset {\mathbb {R}}^d\) be a tensor product domain, e.g.

$$\begin{aligned} \varOmega =I_1\times \cdots \times I_d, \qquad I_1,\ldots ,I_d\subseteq {\mathbb {R}}. \end{aligned}$$
(5.1)

As we have noted, Hilbert function spaces such as \(L^2(\varOmega ) = \bigotimes \nolimits _{\mu =1}^{d}L^{2}(I_{\mu })\) and \(H^1(\varOmega )\) are, by an appropriate choice of basis, isomorphic to \(\ell ^2({\mathbb {N}}^d) = \bigotimes \nolimits _{\mu =1}^{d} \ell ^2({\mathbb {N}})\).

So far \({\mathscr {V}} = \bigotimes _{\mu =1}^d {\mathscr {V}}_\mu \), as in (2.1), has been assumed to be a Hilbert space with a cross-norm, that is, \( \Vert {{\mathbf {u}}} \Vert _{{\mathscr {V}}} = \Vert {{\mathbf {u}}}_1\Vert _{{\mathscr {V}}_1} \cdots \Vert {{\mathbf {u}}}_d \Vert _{{\mathscr {V}}_d} \) for \({{\mathbf {u}}} = {{\mathbf {u}}}_1 \otimes \cdots \otimes {{\mathbf {u}}}_d \in {\mathscr {V}}\). Examples of such spaces are \(L^2\)-spaces, as well as certain mixed Sobolev spaces, over tensor product domains (5.1). Indeed, if \({\mathscr {V}}\) is endowed with a cross-norm, by choice of suitable bases for the \({\mathscr {V}}_\mu \) one obtains an isomorphism \(\ell ^2(\mathbb {N}^d) \rightarrow {\mathscr {V}}\) of Kronecker rank one (that is, mapping elementary tensors to elementary tensors).

Standard Sobolev norms do not have this property. For instance, in the important case of the standard \(H^1_0(\varOmega )\)-norm \( \Vert v \Vert ^2_{H^1_0 ( \varOmega ) } = \sum _{\mu =1}^d \Vert \partial _{x_\mu } v \Vert ^2\) on \(\varOmega \) as in (5.1) with homogeneous Dirichlet boundary data, this is related to the fact that the Laplacian is not a rank-one operator. Applying the inverse of the homogeneous Dirichlet Laplacian on \(\varOmega =(0,1)^d\) in the corresponding eigenfunction basis representation amounts to multiplication by the diagonal operator with entries

$$\begin{aligned} L_{\nu } {:=} \pi ^{-2} \bigl ( \nu _1^2 + \cdots + \nu _d^2 \bigr )^{-1} ,\quad \nu \in {\mathbb {N}}^d . \end{aligned}$$
(5.2)

The eigenfunctions are separable, but the tensor of corresponding eigenvalues \((L_\nu )_{\nu \in \mathbb {N}^d}\) does not have a finite-rank representation; consequently, the inverse of the Laplacian does not have a representation of finite rank either.

It does, however, have efficient low-rank approximations: as a consequence of the results in [20], for each \(r \in \mathbb {N}\) there exist \(\omega _{r,k}, \alpha _{r,k} >0\) such that the exponential sum

$$\begin{aligned} {\mathscr {E}}_r(L_\nu ^{-1}) {:=} \sum _{k=1}^r \omega _{r,k} e^{-\alpha _{r,k} L_\nu ^{-1}} = \sum _{k=1}^r \omega _{r,k} \prod _{\mu =1}^d e^{-\pi ^{2} \alpha _{r,k} \nu _\mu ^2} , \quad \nu \in \mathbb {N}^d, \end{aligned}$$
(5.3)

which is a sum of r separable terms, satisfies

$$\begin{aligned} \sup _{\nu \in \mathbb {N}^d}\left|L_\nu - {\mathscr {E}}_r(L_\nu ^{-1})\right| \le \frac{16}{\pi ^2 d} \exp (-\pi \sqrt{r}). \end{aligned}$$
(5.4)
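A crude construction of such an exponential sum is obtained from sinc (trapezoidal) quadrature applied to \(1/x = \int _{-\infty }^{\infty } e^{t - x e^t}\, \mathrm {d}t\); the following NumPy sketch uses this heuristic choice of \(\omega _{r,k}, \alpha _{r,k}\) rather than the optimised coefficients of [20], but already exhibits the \(\exp (-c\sqrt{r})\)-type decay of the error.

```python
# Minimal sketch, assuming NumPy; heuristic sinc-quadrature exponential sum.
import numpy as np

def inv_exp_sum(x, r, a=1.0):
    """Approximate 1/x for x >= a by r exponential terms (heuristic step size)."""
    h = np.pi * np.sqrt(2.0 / r)
    t = h * (np.arange(r) - r // 2)
    omega, alpha = h * np.exp(t) / a, np.exp(t) / a
    return np.exp(-np.outer(x, alpha)) @ omega

d = 10
a = np.pi ** 2 * d                       # x = 1/L_nu = pi^2(nu_1^2+...+nu_d^2) >= pi^2 d
x = np.geomspace(a, 1e9, 2000)
for r in (16, 25, 36, 49):
    print(r, np.abs(1.0 / x - inv_exp_sum(x, r, a)).max())   # decays roughly like exp(-c sqrt(r))
```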

Approximations of the type (5.3) can also be used for preconditioning. They are particularly useful in the context of diagonal preconditioners for wavelet representations of elliptic operators, where the diagonal elements have a form analogous to (5.2). In this case, the operator exponentials in an approximation of the form (5.3) reduce to the exponentials of the diagonal entries corresponding to each tensor mode. In contrast to (5.4), however, in this case sequences of the form (5.2) need to be approximated up to a certain relative accuracy. As a consequence, the required rank of the exponential sum then also depends on the considered discretisation subspace. This is analysed in detail in [6, 8].

On finite-dimensional subspaces, one can also use multilevel preconditioners such as BPX with tensor structure. This has been considered for space-time formulations of parabolic problems in [2]; in the elliptic case, the analysis of BPX—including the question of d-dependence—is still open in this context.

Also when \({\mathscr {V}}\) is a standard Sobolev space (such as \({\mathscr {V}} = H^1_0(\varOmega )\) as above), we can still obtain an isomorphism \(\ell ^2(\mathbb {N}^d) \rightarrow {\mathscr {V}}\) by choice of an appropriate basis. Such an isomorphism, however, then does not have bounded Kronecker rank; as a consequence, corresponding representations of bounded elliptic operators on \({\mathscr {V}}\) as isomorphisms \({\mathbf {A}}:\ell ^2(\mathbb {N}^d)\rightarrow \ell ^2(\mathbb {N}^d)\) generally also have unbounded ranks and thus need to be approximated. In other words, in the case of operators on standard Sobolev spaces, with the Dirichlet Laplacian \(-\varDelta :H^1_0(\varOmega )\rightarrow H^{-1}(\varOmega )\) as a prototypical example, the price to pay for such well-conditioned representations on an \(\ell ^2\)-sequence space with cross-norm is that simple formal low-rank structures of the operator (as present in \(-\varDelta \)) are lost.

In summary, however, for our present purposes we may restrict ourselves to approximation of tensors in the high-dimensional sequence space \(\ell ^2({\mathscr {I}})\), with \({\mathscr {I}} = \mathbb {N}^d\).

5.2 Computational Complexity

An important question is to what extent one can profit from low-rank approximability of problems, in the sense that approximate solutions for any given \(\varepsilon \) can actually be found at reasonable cost. This includes in particular the identification of a suitable discretisation and a corresponding subset of \(\mathbb {N}^d\) to achieve the prescribed target error.

One option is a choice based on a priori estimates. In the case of tensor product finite difference or finite element discretisations, one has such estimates in terms of norms of higher derivatives of the exact solution. The dependence of these norms and further constants appearing in the estimates on d, however, is typically not easy to quantify; see [6, Sect. 4.3] for an illustration by a simple Poisson problem for large d.

These difficulties can be avoided by explicitly computable a posteriori bounds. Such bounds are provided, for linear operator equations \({\mathbf {A}}{{\mathbf {u}}}={\mathbf {b}}\) on \(\ell ^2({\mathscr {I}})\), by the adaptive low-rank method in [7]. This adaptive scheme is based on iterative thresholding, see also Sect. 6.1. Assume that \({{\mathbf {u}}} \in \ell ^2({\mathscr {I}})\) belongs to a subset for which accuracy \(\varepsilon \) requires at most the maximum hierarchical rank \(r(\varepsilon )\) and the maximum mode size \(n(\varepsilon )\). For given \(\varepsilon \), the adaptive low-rank method then finds \({{\mathbf {u}}}_\varepsilon \) in hierarchical format with \(\Vert {{\mathbf {u}}}-{{\mathbf {u}}}_\varepsilon \Vert _{\ell ^2({\mathscr {I}})} \le \varepsilon \), with ranks and mode sizes bounded up to fixed constants by \(r(\varepsilon )\) and \(n(\varepsilon )\), respectively. In addition, if for instance \({\mathbf {A}}\) has finite rank and can be applied efficiently to each tensor mode, then the total number of operations required can be bounded by \(C(d)\, \bigl ( d \, r^4(\varepsilon ) + d\, r^2(\varepsilon )\, n(\varepsilon ) \bigr )\), with C(d) polynomial in d—in other words, up to C(d) one has the operation complexity of performing the HSVD on the best low-rank approximation of accuracy \(\varepsilon \). This property is shown in [7] for \(n(\varepsilon )\) algebraic and \(r(\varepsilon )\) polylogarithmic in \(\varepsilon \), but analogous results can be derived for algebraically growing \(r(\varepsilon )\) as well. Similar estimates, with additional logarithmic factors, are obtained in [6, 8] for this method applied to problems on Sobolev spaces, where \({\mathbf {A}}\) is not of finite rank as discussed in Sect. 5.1.

5.3 Low-Rank Approximability

Since \(n(\varepsilon )\) is strongly tied to the underlying univariate discretisations, let us now consider in more detail when one can expect to have efficient low-rank approximations of solutions, that is, slow growth of \(r(\varepsilon )\) as \(\varepsilon \rightarrow 0\). The HSVD of tensors yields information on the approximation error in \(\ell ^2\) with respect to the hierarchical ranks: as a consequence of Theorem 3.5, the error of best low-rank approximation of \({{\mathbf {u}}}\) is controlled by the decay of its hierarchical singular values.

To quantify the sparsity of sequences, we use weak-\(\ell ^p\)-norms. For a given sequence \(a = (a_k)_{k\in \mathbb {N}} \in \ell ^2 ({\mathbb {N}}) \), let \(a^*_n\) denote the n-th largest of the values \(| {a_k} | \). Then for \(p>0\), the space \(\ell ^{w,p}({\mathbb {N}})\) is defined as the collection of sequences for which \(\Vert a \Vert _{\ell ^{w,p}} {:=} \sup _{n \in {\mathbb {N}}} n^{1/p} a^*_n\) is finite, and this quantity defines a quasi-norm on \(\ell ^{w,p}\) for \(0<p<1\), and a norm for \(p \ge 1\). It is closely related to the \(\ell ^p \)-spaces, since for \( p < p'\), one has \(\ell ^p \subset \ell ^{w,p} \subset \ell ^{p'}\).
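A small NumPy illustration of the quantity just defined (the name weak_lp is ours):

```python
# Minimal sketch, assuming NumPy: the weak-l^p quantity for a finitely supported sequence.
import numpy as np

def weak_lp(a, p):
    """sup_n n^{1/p} a*_n for a finitely supported sequence a."""
    a_star = np.sort(np.abs(a))[::-1]          # non-increasing rearrangement
    n = np.arange(1, a_star.size + 1)
    return (n ** (1.0 / p) * a_star).max()

a = 1.0 / np.arange(1, 10001) ** 2             # a_k = k^{-2}
print(weak_lp(a, 0.5), weak_lp(a, 1.0))        # both equal 1 here; the infinite
                                               # sequence lies in l^{w,p} exactly for p >= 1/2
```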

Algebraic decay of the hierarchical singular values can be quantified in terms of the classes

$$\begin{aligned} {\mathscr {A}}^{w,p} {:=} \bigl \{ \mathbf {u} \in \ell ^2({\mathscr {I}}) {:} \sigma ^\eta (\mathbf {u}) {:=} \bigl ( \sigma ^\eta _k(\mathbf {u}) \bigr )_{k \in {\mathbb {N}}} \in \ell ^{w,p} \text { for all } \eta \in {\mathbb {E}} \bigr \}. \end{aligned}$$
(5.5)

Note that the p-th Schatten class, which one obtains by replacing \(\ell ^{w,p}\) in (5.5) by \(\ell ^p\), is contained in \({\mathscr {A}}^{w,p}\). For these spaces, from Theorem 3.5 we obtain the following low-rank approximation error estimate.

Proposition 5.1

Let \({{\mathbf {u}}} \in {\mathrm {w}}\ell ^p_*({\mathscr {I}})\) for \(0<p<2\), and set \(s {:=} \frac{1}{p} - \frac{1}{2}\). Then for every \(r\in {\mathbb {N}}\), there exists a tensor \({\hat{\mathbf {u}}}\) with \(\max _{\eta \in {\mathbb {E}}} \hbox {rank}_\eta ({\hat{\mathbf {u}}}) \le r\) such that

$$\begin{aligned} \Vert {{\mathbf {u}}} - {\hat{\mathbf {u}}} \Vert \le C_p \sqrt{2d-3}\; r^{-s}\, \Vert {{\mathbf {u}}} \Vert _{{\mathrm {w}}\ell ^p_*} , \end{aligned}$$

with a constant \(C_p>0\) depending only on p.

It has been shown in [122] that, for instance, mixed Sobolev spaces are contained in the Schatten classes; we refer to [122] also for a more precise formulation and a discussion of the resulting data complexity. The results in [13] for approximation of functions with mixed smoothness in the canonical format also have implications for approximability in tensor networks. However, classical notions of regularity in Sobolev and Besov spaces provide only a partial answer, since one can easily construct functions of very low regularity that still have finite-rank representations.

A central question is therefore for which problems one can obtain low-rank approximability beyond that guaranteed by regularity. In particular, under which conditions do assumptions on the low-rank approximability of input data imply that the solution is again of comparable low-rank approximability?

Instead of using regularity as in [122], one can also exploit structural features of the considered problems to infer low-rank approximability of corresponding solutions. For linear operator equations, as shown in [92], if an operator on a space endowed with a cross-norm is well-conditioned and has a finite-rank representation, with finite-rank right-hand side, one obtains error bounds for the solution that decay algebraically with respect to the ranks; similar results are shown for eigenvalue problems. As noted in Sect. 5.1, however, when problems on standard Sobolev spaces are represented on spaces with cross-norm such as \(\ell ^2(\mathbb {N}^d)\), the conditions of bounded condition number and finite representation ranks are in general mutually exclusive, and in such cases the results of [92] are therefore restricted to finite discretisations.

In the particular case of the inverse Laplacian, using exponential sums one can obtain low-rank approximations which converge almost exponentially. Also in these results, the particular norms in which the problem is considered have a strong influence. For instance, the following is shown in [36]: If \(f\in H^{-1+\delta }(\varOmega )\) for \(\delta >0\), then for \(A{:=}-\varDelta \),

$$\begin{aligned} \bigl \Vert A^{-1} f - {\mathscr {E}}_r(A) f \bigr \Vert _{H^1} \le C \exp \Bigl (-\frac{\delta \pi }{2} \sqrt{r}\Bigr ) \Vert f\Vert _{H^{-1+\delta }} \,, \end{aligned}$$
(5.6)

where \(C>0\) and again

$$\begin{aligned} {\mathscr {E}}_r(A) = \sum _{k=1}^r \omega _{r,k} e^{-\alpha _{r,k} A} \end{aligned}$$
(5.7)

with certain \(\omega _{r,k}, \alpha _{r,k} >0\). Since the operators \(e^{t \varDelta }\), \(t>0\), are of rank one, this yields almost exponentially convergent rank-r approximations of the inverse Laplacian. In the dependence on the particular topologies in (5.6), there is a marked difference to the results in [59], which are also based on approximations of the form (5.7) but consider the error in Euclidean norm for discretised problems.
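
As a rough, self-contained illustration of approximations of the form (5.7) (not the construction analysed in [36]), the following sketch builds a quadrature-based exponential sum for \(\lambda \mapsto 1/\lambda \) from the integral representation \(1/\lambda = \int _0^\infty e^{-t\lambda }\,dt\) via the substitution \(t = e^s\) and the trapezoidal rule; all parameter choices are ad hoc. The relative error on a test spectrum decreases rapidly as the step size is refined.

```python
import numpy as np

# Quadrature-based exponential sum for 1/lambda: from 1/lambda = int_0^inf exp(-t*lambda) dt,
# substitute t = exp(s) and apply the trapezoidal rule on [-L1, L2].
def exp_sum_coeffs(h, L1=30.0, L2=3.0):
    s = np.arange(-L1, L2 + h, h)
    return h * np.exp(s), np.exp(s)                  # weights omega_k and exponents alpha_k

lam = np.logspace(0, 2, 40)                          # test "spectrum" in [1, 100]
for h in (1.0, 0.5, 0.25):
    omega, alpha = exp_sum_coeffs(h)
    approx = (omega[:, None] * np.exp(-np.outer(alpha, lam))).sum(axis=0)
    print(omega.size, np.max(np.abs(lam * approx - 1.0)))   # number of terms, relative error
```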

The situation is simpler in the case of parameter-dependent PDEs, which are typically posed on tensor product spaces with cross-norm such as \(H^1_0(\varOmega ) \otimes L^2(U)\), where \(\varOmega \) is the lower-dimensional spatial domain and \(U = U_1\times \cdots \times U_d\) is a domain of parameter values (see also Sect. 7.1). Convergence of low-rank approximations faster than any algebraic rate or of exponential type has been established for particular problems of this class in [5, 85, 90].

There are also relevant counterexamples where the ranks required for a certain accuracy grow strongly with respect to d. A variety of such counterexamples originate from ground state computations of quantum lattice systems, such as one- to three-dimensional spin systems, which in many cases exhibit translation symmetries that allow a precise analysis. There are a number of works on area laws in quantum physics, see, e.g. [3] and the references given there.

6 Iterative Thresholding Schemes

Let us consider the variational formulation of the original operator equation \( {{\mathbf {u}}} = \hbox { arg min}_{{\mathbf {v}} \in \ell ^2({\mathscr {I}})} J({\mathbf {v}} ) \) with \(J\) as in (4.2) or (4.3). In the methods we have considered in Sect. 4, this problem is approached in a manner analogous to Ritz–Galerkin discretisations: one restricts the minimisation to the manifold \({\mathscr {H}}_{ \mathfrak {r}}\) of hierarchical tensors with given fixed rank \(\mathfrak {r}\), or better to its closure \( {\mathscr {H}}_{ \le \mathfrak {r}}\), and attempts to solve such constrained minimisation problems for \(J\). However, since \( {\mathscr {H}}_{ \mathfrak {r}}\) and \({\mathscr {H}}_{ \le \mathfrak {r}}\) are not convex, there are generally multiple local minima. Roughly speaking, in this approach one has fixed the model class and aims to achieve a certain accuracy within this class.

Instead, one can also first prescribe an accuracy to obtain a convex admissible set \( {\mathscr {C}}_\varepsilon {:=} \{ {\mathbf {v}} \in \ell ^2({\mathscr {I}}): \Vert {\mathbf {A}} {\mathbf {v}} - {\mathbf {b}} \Vert \le \varepsilon \} \). Over this admissible set, one may now try to minimise the computational costs. Roughly speaking, we want to minimise the largest hierarchical rank of \({\mathbf {v}}\). This can be seen as a motivation for the various methods based on rank truncations that we consider in this section. Note that even in the matrix case \(d=2\), the functional \( {\mathbf {A}} \mapsto \hbox {rank} ({\mathbf {A}} )\) is not convex. The nuclear norm can be regarded as a convex relaxation of this functional, and its minimisation over \({\mathscr {C}}_\varepsilon \) by proximal gradient techniques leads to soft thresholding iterations as in Sect. 6.2 below.

The methods considered in this section, in contrast to those in Sect. 4, thus iteratively modify the tensor ranks of the approximations. Note that while the following theoretical considerations apply to infinite-dimensional sequence spaces with \({\mathscr {I}} = \mathbb {N}^d\), in practice these schemes again operate on fixed finite \({\mathscr {I}}\). They can also be employed, however, as part of methods that adaptively identify suitable finite \({\mathscr {I}}\) for controlling the error with respect to the solution of the continuous problem, as described in Sect. 5.2.

6.1 Iterative Hard Thresholding Schemes

Starting from a (preconditioned) gradient step \( {{\mathbf {u}}}^{n+1} = {{\mathbf {u}}}^{n} - {\mathbf {C}}_n^{-1} \nabla J( {\mathbf { u}}^n )\) in the ambient space \(\ell ^2({\mathscr {I}})\), in order to keep our iterates of low rank, we introduce projection or truncation operators \({\mathbf {R}}_n\) and \(\mathbf {T}_n \), realised by hard thresholding (3.20) of the singular values in the HSVD,

$$\begin{aligned} {{\mathbf {u}}}^{n+1} : = {\mathbf {R}}_n \bigl ( {{\mathbf {u}}}^{n} - \mathbf {T}_n [ {\mathbf {C}}_n^{-1} \nabla J( {\mathbf { u}}^n ) ] \bigr ) . \end{aligned}$$
(6.1)

If we take \( {\mathbf {T}}_n {:=} {\mathbf {I}} \) and \( {\mathbf {R}}_n {:=} {\mathbf {H}}_{\mathfrak {r}} \) (the HSVD projection (3.20)), this can be considered as an analogue of iterative hard thresholding in compressive sensing [19] and matrix recovery [56, 131]. In the context of low-rank approximation, such truncated iterations based on various representation formats have a rather long history, see, e.g. [7, 10, 17, 18, 68, 85, 90].

We consider the choice \( {\mathbf {T}}_n {:=} {\mathbf {I}} \), and \( {\mathbf {R}}_n {:=} {\mathbf {H}}_\mathfrak {r} \) in more detail, using the trivial preconditioner \({\mathbf {C}}_n {:=} {\mathbf {I}}\). Defining the mapping \({\mathbf {B}}\) on \(\ell ^2({\mathscr {I}})\) by \({\mathbf {B}}({{\mathbf {u}}}) {:=} {{\mathbf {u}}} - \nabla J( {{\mathbf {u}}} )\), we then have the iteration

$$\begin{aligned} {{\mathbf {y}}}^{n+1} {:=} {\mathbf {B}}({{\mathbf {u}}}^n),\quad {{\mathbf {u}}}^{n+1} {:=} {\mathbf {H}}_{\mathfrak {r}}({{\mathbf {y}}}^{n+1}),\quad n\in {\mathbb {N}}. \end{aligned}$$
(6.2)

Let \({{\mathbf {u}}}\) be a fixed point of \({\mathbf {B}}\), that is, a stationary point of \(J\). As a consequence of Theorem 3.5, denoting by \({{\mathbf {u}}}_\mathfrak {r}\) the best approximation of \({{\mathbf {u}}}\) of ranks \(\mathfrak {r}\), we have the quasi-optimality property

$$\begin{aligned} \Vert {{\mathbf {y}}}^{ n} - {\mathbf {H}}_{\mathfrak {r}} ({{\mathbf {y}}}^{n}) \Vert \le c_d \Vert {{\mathbf {y}}}^{n} - {{\mathbf {u}}}_{\mathfrak {r}} \Vert , \end{aligned}$$
(6.3)

where \( c_d = \sqrt{d-1} \) in the case of tensor trains, and \(c_d = \sqrt{2d-3}\) in the case of the hierarchical format. Making use of this property, one can proceed similarly to [18, Sect. 4]: since \({{\mathbf {u}}} = {\mathbf {B}}({{\mathbf {u}}})\) and \({{\mathbf {y}}}^{n+1}={\mathbf {B}}({{\mathbf {u}}}^n)\), and by (6.3),

$$\begin{aligned} \Vert {{\mathbf {u}}}^{n+1} - {{\mathbf {u}}} \Vert&\le \Vert {\mathbf {H}}_\mathfrak {r} ({{\mathbf {y}}}^{n+1}) - {{\mathbf {y}}}^{n+1}\Vert + \Vert {\mathbf {B}}({{\mathbf {u}}}^n) - {\mathbf {B}}({{\mathbf {u}}})\Vert \\&\le c_d \Vert {\mathbf {B}}({{\mathbf {u}}}^n) - {{\mathbf {u}}}_{\mathfrak {r}} \Vert + \Vert {\mathbf {B}}({{\mathbf {u}}}^n) - {\mathbf {B}}({{\mathbf {u}}})\Vert \\&\le c_d \Vert {{\mathbf {u}}} - {{\mathbf {u}}}_\mathfrak {r} \Vert + (1 + c_d) \Vert {\mathbf {B}}({{\mathbf {u}}}^n) - {\mathbf {B}}({{\mathbf {u}}}) \Vert . \end{aligned}$$

From this, we immediately obtain the following convergence result.

Proposition 6.1

Let \(\Vert {\mathbf {B}}({\mathbf {v}})-{\mathbf {B}}({\mathbf {w}})\Vert \le \rho \Vert {\mathbf {v}} - {\mathbf {w}}\Vert \) for all \({\mathbf {v}},{\mathbf {w}}\in \ell ^2({\mathscr {I}})\), where \(\beta {:=} (1+c_d)\rho < 1\) with \(c_d\) as in (6.3). Then for any \({{\mathbf {u}}}^0 \in \ell ^2({\mathscr {I}})\),

$$\begin{aligned} \Vert {{\mathbf {u}}}^{n} - {{\mathbf {u}}} \Vert \le \beta ^n \Vert {{\mathbf {u}}}^0 - {{\mathbf {u}}} \Vert + \frac{c_d}{1 - \beta } \Vert {{\mathbf {u}}} - {{\mathbf {u}}}_\mathfrak {r} \Vert . \end{aligned}$$
(6.4)

We thus obtain \(\limsup _n \Vert {{\mathbf {u}}}^{n} - {{\mathbf {u}}} \Vert \lesssim \Vert {{\mathbf {u}}} - {{\mathbf {u}}}_\mathfrak {r} \Vert \); for this we need, however, an extremely restrictive contractivity property for \({\mathbf {B}}\). For instance, in the case of the least squares problem (4.3), where one has \({\mathbf {B}} = {\mathbf {I}} - \omega {\mathbf {A}}^*{\mathbf {A}}\) with suitable \(\omega >0\), this amounts to the requirement

$$\begin{aligned} ( 1- \delta ^2 ) \Vert {\mathbf {v}} \Vert ^2 \le \Vert {\mathbf {A}} {\mathbf {v}}\Vert ^2 \le (1+ \delta ^2 ) \Vert {\mathbf {v}}\Vert ^2 , \quad {\mathbf {v}} \in \ell ^2({\mathscr {I}}), \end{aligned}$$
(6.5)

with \(0 < \delta < 1/\sqrt{1+c_d}\), or in other words, \(\hbox {cond}({\mathbf {A}}) < \sqrt{1 + 2/c_d}\).
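
To make the iteration (6.2) concrete in the simplest setting, the following schematic NumPy sketch treats the matrix case \(d=2\) for the least squares functional (4.3), with \({\mathbf {H}}_{\mathfrak {r}}\) realised by the truncated SVD; the operator, the rank and the step size are ad hoc illustrative choices for which a condition of the type (6.5) is satisfied, and this is not meant as a hierarchical-tensor implementation.

```python
import numpy as np

def truncated_svd(Y, r):
    """H_r in the matrix case d = 2: best rank-r approximation via the SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r, :]

rng = np.random.default_rng(0)
n, r = 20, 3
N = n * n
A = np.eye(N) + 0.05 * rng.standard_normal((N, N)) / np.sqrt(N)     # cond(A) well below sqrt(3)
X_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r solution
b = A @ X_true.ravel()

omega = 1.0 / np.linalg.norm(A, 2) ** 2
X = np.zeros((n, n))
for it in range(30):
    G = (A.T @ (A @ X.ravel() - b)).reshape(n, n)    # gradient of J(X) = 0.5*||A vec(X) - b||^2
    X = truncated_svd(X - omega * G, r)              # gradient step followed by H_r, cf. (6.2)
    if it % 5 == 4:
        print(it + 1, np.linalg.norm(X - X_true))    # errors decrease linearly
```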

Note that the above arguments can be applied also in the case of nontrivial preconditioners \({\mathbf {C}}_n\) in (6.1). Since obtaining such extremely strong preconditioning is essentially as difficult as solving the original problem, the action of \({\mathbf {C}}_n^{-1}\) will typically need to be realised by another iterative solver, as considered in [18]. The setting of Proposition 6.1 in itself may thus be of most interest when it suffices to have (6.5) only on a small subset of \(\ell ^2({\mathscr {I}})\), as in compressive sensing-type problems.

A more common approach is to take \({\mathbf {R}}_n ={\mathbf {H}}_{\mathfrak {r}_n}\) with each \(\mathfrak {r}_n\) adapted to achieve a certain error bound, for instance such that for an \(\varepsilon >0\) each \({{\mathbf {u}}}^{n+1} {:=} {\mathbf {H}}_{\mathfrak {r}_n}({\mathbf {B}}({{\mathbf {u}}}^n))\), now with a general mapping \({\mathbf {B}}\) in (6.2), satisfies \( \Vert {{\mathbf {u}}}^{n+1} - {\mathbf {B}}({{\mathbf {u}}}^n) \Vert \le \varepsilon \). In this case, in the setting of Proposition 6.1, but assuming only \(\rho <1\) (i.e., contractivity of \({\mathbf {B}}\)), we obtain

$$\begin{aligned} \Vert {{\mathbf {u}}}^{n} - {{\mathbf {u}}} \Vert \le \rho ^n \Vert {{\mathbf {u}}}^0 - {{\mathbf {u}}} \Vert + \frac{\varepsilon }{1 - \rho } . \end{aligned}$$
(6.6)

Note that one now has a much weaker assumption on \({\mathbf {B}}\), but in contrast to (6.2) one generally does not obtain information on the ranks of \({{\mathbf {u}}}^n\). To enforce convergence to \({{\mathbf {u}}}\), the parameter \(\varepsilon \) needs to be decreased over the course of the iteration.
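
In the matrix case, such an \(\varepsilon \)-dependent truncation simply retains the smallest number of singular values for which the discarded tail is at most \(\varepsilon \); a small illustrative helper (NumPy; names are ours) reads as follows.

```python
import numpy as np

def truncate_to_tolerance(Y, eps):
    """Smallest-rank truncated SVD with discarded tail (sum_{k>r} s_k^2)^(1/2) <= eps."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    tails = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]     # tails[k] = norm of (s_k, s_{k+1}, ...)
    r = int(np.searchsorted(-tails, -eps))             # smallest r with tails[r] <= eps
    return U[:, :r] * s[:r] @ Vt[:r, :], r

rng = np.random.default_rng(0)
Y = rng.standard_normal((40, 8)) @ rng.standard_normal((8, 40))      # exact rank 8
_, r = truncate_to_tolerance(Y + 1e-8 * rng.standard_normal((40, 40)), eps=1e-6)
print(r)                                               # recovers rank 8 despite the perturbation
```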

When one proceeds in this manner, the appropriate choice of these truncation tolerances is crucial: One does not have direct control over the ranks, and they may become very large when \(\varepsilon \) is chosen too small. A choice of truncation parameters that ensures that the ranks of \({{\mathbf {u}}}^n\) remain comparable to those required for the current error \(\Vert {{\mathbf {u}}}^{n} - {{\mathbf {u}}} \Vert \), while maintaining convergence, is a central part of the adaptive method for linear operator equations in [7, 8] that has been mentioned in Sect. 5.

The choice \( {\mathbf {R}}_n = {\mathbf {I}} \) and \( {\mathbf {T}}_n {:=} {\mathbf {H}}_{ \mathfrak {r}_n } \) leads to the basic concept of the AMEn algorithm [44], although in practice a somewhat different componentwise truncation is used for \({\mathbf {T}}_n\), combined with componentwise solves as in the ALS scheme. Note that the basic version of this method with \( {\mathbf {R}}_n = {\mathbf {I}} \), for which the analysis was carried out in [44], increases the ranks of the iterates in every step; the practically realised version in fact also uses for \({\mathbf {R}}_n\) a particular type of HSVD truncation. Although the theoretically guaranteed error reduction rates depend quite unfavourably on d, this method shows very good performance in practical tests, for instance for coarse discretisations of elliptic problems (see [44, Sect. 7]). Further issues affecting the convergence of the method, however, arise with finer discretisations (see [6, Sect. 4.2] and [80]).

6.2 Iterative Soft Thresholding Schemes

Soft thresholding of sequences by applying \(s_{\kappa }(x) {:=} \hbox {sgn}(x)\max \{ |x|-{\kappa }, 0 \}\) for a \({\kappa }>0\) to each entry is a non-expansive mapping on \(\ell ^2\), cf. [33, 37]. A soft thresholding operation \(S_{\kappa }\) for matrices (and Hilbert–Schmidt operators) can be defined as application of \(s_{\kappa }\) to the singular values. Then \(S_{{\kappa }} \) is non-expansive in the Frobenius (or Hilbert–Schmidt) norm [9, 22].

On this basis, a non-expansive soft thresholding operation for the rank reduction of hierarchical tensors is constructed in [9] as follows. By \(S_{{\kappa }, \eta } \) we denote soft thresholding applied to the \(\eta \)-matricisation \({\mathbf {M}}_\eta (\cdot )\), that is, \(S_{{\kappa },\eta } ({\mathbf {v}}) = {\mathbf {M}}^{-1}_\eta \circ S_{{\kappa }} \circ {\mathbf {M}}_\eta ({\mathbf {v}}) \). The soft thresholding operator \( {\mathbf {S}}_{{{\kappa }}} : \ell ^2({\mathscr {I}})\rightarrow \ell ^2({\mathscr {I}})\) is then given as the successive application of this operation to each matricisation, that is,

$$\begin{aligned} {\mathbf {S}}_{{{\kappa }} } ({\mathbf {v}} ) : = S_{{\kappa }, \eta _E } \circ \ldots \circ S_{{\kappa }, \eta _1 } ({\mathbf {v}}) , \end{aligned}$$
(6.7)

where \(\eta _1,\ldots ,\eta _E\) is an enumeration of the effective edges \({\mathbb {E}}\). It is easy to see that the operator \({\mathbf {S}}_{\kappa }\) defined in (6.7) is non-expansive on \(\ell ^2({\mathscr {I}})\), that is, for any \({\mathbf {v}},{\mathbf {w}} \in \ell ^2({\mathscr {I}})\) and \({\kappa }>0\), one has \( \Vert {\mathbf {S}}_{\kappa }({\mathbf {v}} ) - {\mathbf {S}}_{\kappa }({\mathbf {w}})\Vert \le \Vert {\mathbf {v}} - {\mathbf {w}} \Vert \).
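
A minimal NumPy sketch (names are ours) of the matrix building block, soft thresholding of the singular values, together with a numerical check of non-expansiveness in the Frobenius norm; the operator (6.7) applies this operation successively to the matricisations of a hierarchical representation.

```python
import numpy as np

def soft_threshold_matrix(Y, kappa):
    """S_kappa: apply s_kappa(x) = sign(x) * max(|x| - kappa, 0) to the singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U * np.maximum(s - kappa, 0.0) @ Vt

rng = np.random.default_rng(0)
V, W = rng.standard_normal((30, 30)), rng.standard_normal((30, 30))
d_after = np.linalg.norm(soft_threshold_matrix(V, 1.0) - soft_threshold_matrix(W, 1.0))
print(d_after <= np.linalg.norm(V - W))   # non-expansiveness in the Frobenius norm: True
```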

We now consider the composition of \(\mathbf{S}_{\kappa }\) with an arbitrary convergent fixed point iteration based on a contractive mapping \({ \mathbf B} :\ell ^2({\mathscr {I}})\rightarrow \ell ^2({\mathscr {I}})\), that is, with a \(\rho \in (0,1)\) such that

$$\begin{aligned} \Vert \mathbf{B} ( {\mathbf {v}} ) - \mathbf{B} ({\mathbf {w}}) \Vert \le \rho \Vert {\mathbf {v}}-{\mathbf {w}} \Vert \,,\quad {\mathbf {v}}, {\mathbf {w}} \in \ell ^2({\mathscr {I}})\,. \end{aligned}$$
(6.8)

Lemma 6.2

([9])  Assuming (6.8), let \({{\mathbf {u}}}\) be the unique fixed point of \(\mathbf{B}\). Then for any \({\kappa }>0\), there exists a uniquely determined \({{\mathbf {u}}}^{\kappa }\in \ell ^2({\mathscr {I}})\) such that \({{\mathbf {u}}}^{\kappa }= {\mathbf {S}}_{\kappa }\bigl ( \mathbf{B} ({{\mathbf {u}}}^{\kappa }) \bigr )\), which satisfies

$$\begin{aligned} ( 1+ \rho )^{-1} \Vert {\mathbf {S}}_{\kappa }({{\mathbf {u}}}) - {{\mathbf {u}}} \Vert \le \Vert {{\mathbf {u}}}^{\kappa }- {{\mathbf {u}}} \Vert \le (1-\rho )^{-1} \Vert {\mathbf {S}}_{\kappa }({{\mathbf {u}}}) - {{\mathbf {u}}} \Vert \,. \end{aligned}$$
(6.9)

For any \({{\mathbf {u}}}^0 \in \ell ^2({\mathscr {I}})\), the iterates \({{\mathbf {u}}}^{n+1} {:=} {\mathbf {S}}_{\kappa }\bigl ( \mathbf{B} ({{\mathbf {u}}}^n) \bigr )\) satisfy \( \Vert {{\mathbf {u}}}^n - {{\mathbf {u}}}^{\kappa }\Vert \le \rho ^n \Vert {{\mathbf {u}}}^0 - {{\mathbf {u}}}^{\kappa }\Vert \).

For fixed \({\kappa }\), the thresholded gradient iteration thus converges (at the same rate \(\rho \) as the unperturbed iteration) to a modified solution \({{\mathbf {u}}}^{\kappa }\), and the distance of \({{\mathbf {u}}}^{\kappa }\) to the exact solution \({{\mathbf {u}}}\) is proportional to the error of thresholding \({{\mathbf {u}}}\). This needs to be contrasted with (6.4) and (6.6) in the case of hard thresholding, where the thresholded iterations are not ensured to converge, but only to enter a neighbourhood of the solution, and properties like (6.9) that establish a relation to best approximation errors are much harder to obtain (for instance, by strong contractivity of \({\mathbf {B}}\) as in Proposition 6.1).

Here we now consider the particular case of a quadratic minimisation problem (4.2) with symmetric elliptic \({\mathbf {A}}\), corresponding to a linear operator equation, where \({\mathbf {B}} = {\mathbf {I}} - \omega {\mathbf {A}}\) with a suitable \(\omega >0\). For this problem, based on Lemma 6.2, in [9] a linearly convergent iteration of the form \({{\mathbf {u}}}^{n+1} {:=} {\mathbf {S}}_{{\kappa }_n}\bigl ( \mathbf{B} ({{\mathbf {u}}}^n) \bigr )\) with \({\kappa }_n \rightarrow 0\) is constructed, where each iterate \({{\mathbf {u}}}^n\) is guaranteed to have quasi-optimal ranks. More specifically, for instance if \({{\mathbf {u}}}\) belongs to \({\mathrm {w}}\ell ^p_*\) as defined in Sect. 5, then with a constant \(C>0\),

$$\begin{aligned} \Vert {{\mathbf {u}}}^n - {{\mathbf {u}}} \Vert \le C d^{1+2s} \Vert {{\mathbf {u}}} \Vert _{{\mathrm {w}}\ell ^p_*} \bigl ( \max _{\eta \in {\mathbb {E}}} \hbox {rank}_\eta ({{\mathbf {u}}}^n) \bigr )^{-s},\quad s= \frac{1}{p} - \frac{1}{2} . \end{aligned}$$

An analogous quasi-optimality statement holds in the case of exponential-type decay \(\sigma ^\eta _k({{\mathbf {u}}}) = {\mathscr {O}}(e^{-c k^\beta })\) with some \(c,\beta >0\).

The central issue in achieving these bounds is how to choose \({\kappa }_n\). Clearly, the \({\kappa }_n\) need to decrease sufficiently to provide progress of the iteration towards \({{\mathbf {u}}}\), but if they decrease too rapidly this can lead to very large tensor ranks of the iterates. As shown in [9], both linear convergence and the above quasi-optimality property hold if one proceeds as follows: whenever \( \Vert {{\mathbf {u}}}^{n+1} - {{\mathbf {u}}}^n \Vert \le \frac{1 - \rho }{ 2 \Vert {\mathbf {A}}\Vert \rho } \Vert {\mathbf {A}}{{\mathbf {u}}}^{n+1} - {\mathbf {b}} \Vert \) holds, set \({\kappa }_{n+1} = \frac{1}{2} {\kappa }_n\); otherwise set \({\kappa }_{n+1} = {\kappa }_n\). The resulting procedure is universal in the sense that in order to achieve the stated rank bounds, nothing needs to be known a priori about the low-rank approximability of \({{\mathbf {u}}}\).
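
The following schematic loop (matrix case \(d=2\), NumPy; the operator and all parameters are ad hoc choices of ours, and this is not the full adaptive method of [9]) illustrates the soft-thresholded Richardson iteration for (4.2) together with the \(\kappa \)-halving rule; the ranks of the iterates remain small while the error decreases as \(\kappa \) is driven to zero.

```python
import numpy as np

def soft_threshold_matrix(Y, kappa):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U * np.maximum(s - kappa, 0.0) @ Vt         # s_kappa applied to the singular values

rng = np.random.default_rng(0)
n, r = 20, 3
N = n * n
M = rng.standard_normal((N, N))
A = np.eye(N) + 0.1 * (M + M.T) / np.sqrt(N)           # symmetric, well conditioned
eigs = np.linalg.eigvalsh(A)
omega, rho = 1.0 / eigs[-1], 1.0 - eigs[0] / eigs[-1]  # B = I - omega*A contracts with factor rho
X_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
b = A @ X_true.ravel()

X, kappa = np.zeros((n, n)), 1.0
for it in range(60):
    X_new = soft_threshold_matrix(X - omega * (A @ X.ravel() - b).reshape(n, n), kappa)
    tol = (1 - rho) / (2 * eigs[-1] * rho) * np.linalg.norm(A @ X_new.ravel() - b)
    if np.linalg.norm(X_new - X) <= tol:               # kappa-halving rule of [9]
        kappa *= 0.5
    X = X_new
    if it % 10 == 9:                                   # error and rank of the current iterate
        print(it + 1, kappa, np.linalg.norm(X - X_true), np.linalg.matrix_rank(X, tol=1e-8))
```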

This method has not been combined with an adaptive choice of discretisation so far, but the asymptotic bounds on the ranks of each iterate that this method provides are somewhat stronger than those for the adaptive methods in [7, 8], in the sense that they do not depend on the low-rank structure of \({\mathbf {A}}\).

7 Applications

In principle, high-dimensional partial differential equations on product domains can be discretised directly by tensor product basis functions. This direct approach is suitable for our first example, uncertainty quantification problems. We also discuss two further examples, one from quantum chemistry and another one from molecular dynamics, where such a direct approach is not adequate. In these applications, certain reformulations that exploit specific features are much better suited, and we describe how our general setting of tensor approximations can be adapted to these cases.

7.1 Uncertainty Quantification

We consider linear diffusion problems on a domain \( \varOmega \subset {\mathbb {R}}^m\), \(m=1,2,3\), with given parameter-dependent diffusion coefficients \(a({x},{y})\) for \( {x} \in \varOmega \) and \({y} \in U\), and with the set of parameter values U to be specified. The parametrised problem reads

$$\begin{aligned} - \nabla _x \cdot \bigl ( a ({x}, y ) \nabla _x u ( {x} , y ) \bigr ) = f ({x}) , \quad x \in \varOmega \ , \ y \in U , \end{aligned}$$

with appropriate boundary conditions, for instance homogeneous Dirichlet conditions \(u(x,y)=0\) for all \(y\in U\) and \( x\in \partial \varOmega \). In our setting, we aim to solve such problems in the Bochner space \({\mathscr {H}} = L^2(U, H^1_0(\varOmega ), \mu )\), where \(\mu \) is an appropriate measure.

Examples of particular parameterisations that arise in deterministic formulations of stochastic problems are the affine case

$$\begin{aligned} a ( {x} , y ) = a_0 ({x}) + \sum _{k=1}^{\infty } y_k a_k ({x}) , \end{aligned}$$
(7.1)

where \(U = [-1,1]^{\mathbb {N}}\) and \(\mu \) is the uniform measure on U, and the lognormal case

$$\begin{aligned} a ( {x} , y ) = \exp \Bigl ( a_0 ({x}) + \sum _{k=1}^{\infty } y_k a_k ({x}) \Bigr ) , \end{aligned}$$
(7.2)

where \(U = {\mathbb {R}}^{\mathbb {N}}\) and \(\mu \) is the tensor product of standard Gaussian measures (so that the \(y_k\) correspond to independent identically distributed normal random variables).

In each case, the solution u can be expressed as a tensor product polynomial expansion (also referred to as polynomial chaos) of the form

$$\begin{aligned} u ( {x}, {y} ) = \sum _{k} {{\mathbf {u}}} (x, k) \prod _{i=1}^\infty p_{k_i } ( y_i ), \end{aligned}$$

where \(p_\ell \), \(\ell \in {\mathbb {N}}\), are the univariate polynomials orthonormal with respect to the underlying univariate measure, and the summation over k runs over the finitely supported multi-indices in \({\mathbb {N}}_0^{\mathbb {N}}\). We refer to [14, 29, 30, 57, 58, 98, 125, 145] and the references therein.
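
For instance, in the affine case (7.1), where \(\mu \) is the uniform measure, the \(p_\ell \) are the normalised Legendre polynomials, and in the lognormal case (7.2) they are the normalised (probabilists') Hermite polynomials. The following minimal sketch (NumPy; the function name is ours) evaluates the former and checks their orthonormality with respect to the uniform probability measure on \([-1,1]\) by Gauss–Legendre quadrature.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_orthonormal(k, y):
    """p_k(y) = sqrt(2k + 1) * P_k(y), orthonormal w.r.t. the measure dy/2 on [-1, 1]."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return np.sqrt(2 * k + 1) * legendre.legval(y, c)

# check orthonormality by Gauss-Legendre quadrature with the weight 1/2
y, w = legendre.leggauss(20)
G = np.array([[np.sum(0.5 * w * legendre_orthonormal(k, y) * legendre_orthonormal(l, y))
               for l in range(5)] for k in range(5)])
print(np.allclose(G, np.eye(5)))   # True
```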

In both cases (7.1) and (7.2), due to the Cartesian product structure of U, the underlying energy space \( {\mathscr {V}} \simeq H^1_0 (\varOmega ) \otimes L_2 ( U , \mu ) \) is a (countable) tensor product of Hilbert spaces, endowed with a cross-norm. By truncation of the expansions in (7.1), (7.2), one obtains a finite tensor product.

In this form, tensor decompositions can be used for solving these problems, for instance combined with a finite element discretisation of \(H^1_0(\varOmega )\), cf. [47]. The total solution error is then influenced by the finite element discretisation, by the truncation of coefficient expansions and polynomial degrees, and by the tensor approximation ranks. An adaptive scheme that balances these error contributions by a posteriori error estimators, using tensor train representations and ALS for the tensor optimisation with ranks increased after a few iterations, is presented together with numerical tests in [48].

7.2 Quantum Physics: Fermionic Systems

The electronic Schrödinger equation describes the stationary state of a non-relativistic quantum mechanical system of N electrons in a field of K classical nuclei of charge \(Z_{\eta }\in {\mathbb {N}}\) and fixed positions \(R_{\eta }\in {\mathbb {R}}^{3}\), \(\eta = 1,\ldots , K\). It is an operator eigenvalue equation for the Hamilton operator H, given by

$$\begin{aligned} H {:=} -\frac{1}{2}\,\sum _{\xi =1}^N \varDelta _\xi + V_\text {ext} + \frac{1}{2}\sum _{\xi =1}^{N}\sum _{\begin{array}{c} \zeta =1 \\ \zeta \ne \xi \end{array}}^{N} \frac{1}{| x_\xi -x_\zeta |}, \quad V_\text {ext} {:=} - \sum _{\xi =1}^{N} \sum _{\eta =1}^K\frac{Z_\eta }{| x_\xi -R_\eta |} , \end{aligned}$$

which acts on wave functions \(\varPsi \) that depend on the N spatial coordinates \(x_{\xi }\in {\mathbb {R}}^{3}\) and on the N spin coordinates \(s_{\xi } \in {\mathbb {Z}}_2\) of the electrons. By the Pauli principle, the wave function \(\varPsi \) needs to be antisymmetric with respect to the particle variables, that is, it needs to change sign under exchange of two distinct variable pairs \((x_{\xi }, s_\xi )\) and \( (x_\zeta ,s_\zeta )\), see, e.g. [129]. The corresponding space of wave functions is (see, e.g. [23])

$$\begin{aligned} {\mathbb {H}}^1_N = \bigl [ H^1 ({\mathbb {R}}^{3}\times {\mathbb {Z}}_2, {\mathbb {C}})\bigr ]^{N} \cap \bigwedge _{\xi =1}^N L_2 ({\mathbb {R}}^{3}\times {\mathbb {Z}}_2, {\mathbb {C}}), \end{aligned}$$

where the symbol \(\wedge \) denotes the antisymmetric tensor product (exterior product). For the sake of simplicity, we focus on the approximation of ground states, where one aims to find

$$\begin{aligned} \varPsi _0 = \hbox {argmin} \bigl \{ \langle \varPhi , H\varPhi \rangle : \; \langle \varPhi , \varPhi \rangle =1 ,\; \varPhi \in {\mathbb {H}}^1_N \bigr \} \end{aligned}$$

and the corresponding energy \(E_0 = \langle \varPsi _0, H\varPsi _0\rangle \). It is sufficient in the present setting to consider only real-valued functions, that is, \({\mathbb {C}}\) can be replaced by \({\mathbb {R}}\).

Discretisations can be constructed based on antisymmetric tensor products of single-particle basis functions, so-called Slater determinants. For a given orthonormal one-particle set

$$\begin{aligned} \{ \varphi _\mu : \mu = 1, \ldots , d \} \,\subset \,H^1 ( {\mathbb {R}}^3 \times {\mathbb {Z}}_2 ) , \end{aligned}$$
(7.3)

the corresponding Slater determinants \(\varphi _{\mu _1} \wedge \dots \wedge \varphi _{\mu _N}\), \(1 \le \mu _1< \dots < \mu _N \le d\), form an orthonormal basis of a space \({\mathscr {V}}^d_N\), called the Full-CI space. A Ritz–Galerkin approximation to \(\varPsi _0\) can then be obtained by minimising over the finite-dimensional subspace \({\mathscr {V}}_N^d \subset {\mathbb {H}}^1_N\), which leads to the discretised eigenvalue problem of finding the lowest \(E\in {\mathbb {R}}\) and corresponding \(\varPsi \in {\mathscr {V}}_N^d\) such that

$$\begin{aligned} \langle \varPhi , H\varPsi \rangle = E \langle \varPhi , \varPsi \rangle \quad \text {for all }\varPhi \in {\mathscr {V}}_N^d. \end{aligned}$$
(7.4)

Starting from a single-particle basis (7.3), where d is greater than the number N of electrons, every ordered selection \(\nu _1 , \ldots , \nu _N\) of \(N \le d\) distinct indices corresponds to an N-particle Slater determinant \(\varPsi _{SL} [\nu _1 , \ldots , \nu _N]\). The index of each such basis function can be encoded by a binary string \(\beta = (\beta _1,\ldots ,\beta _d)\) of length d, where \(\beta _i = 1\) if \( i \in \{\nu _1 , \ldots , \nu _N\}\), and \(\beta _i = 0\) otherwise. Setting \( {\mathbf {e}}^0 {:=} (1,~0)^T\), \({\mathbf {e}}^1 : = (0,~1)^T \in {\mathbb {R}}^2 \), the linear mapping defined by

$$\begin{aligned} \iota :\varPsi _{SL} [\nu _1 , \ldots , \nu _N] \mapsto {\mathbf {e}}^{\beta _1} \otimes \ldots \otimes {\mathbf {e}}^{\beta _d} \in {{\mathscr {B}}}_d {:=} \bigotimes _{\mu =1}^d {\mathbb {R}}^2 \end{aligned}$$

is a unitary isomorphism between the Fock space \({\mathscr {F}}_d = \oplus _{M=0}^d {\mathscr {V}}^d_M\) and \({\mathscr {B}}_d \). The solution of the discretised N-electron Schrödinger equation (7.4) is an element of \({\mathscr {F}}_d\), subject to the constraint that it contains only N-particle Slater determinants, which are eigenfunctions of the particle number operator P with eigenvalue N.
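
As a concrete illustration of the encoding \(\iota \) (a purely schematic NumPy sketch with our own naming), the occupation string \(\beta \) of a Slater determinant is mapped to the corresponding elementary tensor product of the unit vectors \({\mathbf {e}}^0, {\mathbf {e}}^1\):

```python
import numpy as np
from functools import reduce

e = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}     # e^0 and e^1

def iota(occupied, d):
    """Image under iota of the Slater determinant with the given occupied orbitals
    (a subset of {1, ..., d}), returned as a vector of length 2^d."""
    beta = [1 if i in set(occupied) else 0 for i in range(1, d + 1)]
    return reduce(np.kron, (e[b] for b in beta))

v = iota([1, 3], d=4)               # N = 2 electrons occupying orbitals 1 and 3 of d = 4
print(v.shape, np.nonzero(v)[0])    # (16,): a single nonzero entry of the order-4 tensor
```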

On \({\mathscr {B}}_d \) one can apply tensor approximation techniques without having to deal explicitly with the antisymmetry requirement. The representation of the discretised Hamiltonian \({\mathbf {H}}: {\mathscr {B}}_d \rightarrow {\mathscr {B}}_d \) is given by \({\mathbf {H}} = \iota \circ H \circ \iota ^\dagger \). For a given particle number N, we have to restrict the eigenvalue problem to the subspace \(\hbox {ker}({\mathbf {P}} - N {\mathbf {I}})\), with the discrete particle number operator \({\mathbf {P}} = \iota \circ P \circ \iota ^\dagger \). For electrically neutral systems, the exact ground state is an N-particle function, and this constraint can be dropped.

The discrete Hamilton operator \( {\mathbf {H}} \) has a canonical tensor product representation in terms of the one- and two-electron integrals. By the Slater-Condon rules [73, 129], one finds

$$\begin{aligned} {\mathbf {H}} = \sum _{ p, q =1 }^{{d}} h^q_p\, {\mathbf {a}}_{p}^\dagger {\mathbf {a}}_q + \sum _{p,q,r,s=1}^{{d}} g^{p,q}_{r,s} \, {\mathbf {a}}_r^\dagger {\mathbf {a}}_s^\dagger {\mathbf {a}}_p {\mathbf {a}}_{q}, \end{aligned}$$

where the coefficients \( h^q_p\), \(g_{p,q}^{r,s}\) are given by

$$\begin{aligned} h^p_q = \bigl \langle \varphi _p , \bigl \{ -\textstyle \frac{1}{2} \displaystyle \varDelta + V_\text {ext} \bigr \} \varphi _q \bigr \rangle , \quad g_{p,q}^{r,s} = \bigl \langle \varphi _p\overline{\varphi _r} , \,\bigl (|\cdot |^{-1} * \varphi _q \overline{\varphi _s} \bigr )\bigr \rangle . \end{aligned}$$

Here the discrete annihilation operators \( {\mathbf {a}}_p\) and creation operators \({\mathbf {a}}^{\dagger }_q\) can be written as Kronecker products of the \( 2\times 2\)-matrices

$$\begin{aligned} \ \mathbf{A} {:=} \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 0 \end{array} \right) ,\quad \mathbf{A}^\dagger = \left( \begin{array}{cc} 0 &{} 0 \\ 1 &{} 0 \end{array} \right) ,\quad {\mathbf { S}} {:=} \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} -1 \end{array} \right) ,\quad {\mathbf { I}} {:=} \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 1 \end{array} \right) , \end{aligned}$$

where \( {\mathbf { a}}_p {:=} {\mathbf {S}} \otimes \ldots \otimes {\mathbf {S}} \otimes {\mathbf { A}} \otimes {\mathbf {I}} \otimes \ldots \otimes {\mathbf {I}} \) with \({\mathbf {A}}\) appearing in the p-th position. Note that compared to the dimension \( 2^{d} \) of the ambient space \({\mathscr {B}}_d\), representation ranks of \({\mathbf {H}}\) thus scale only as \({\mathscr {O}}(d^4)\). For further details, see also [99, 130] and the references given there.
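
The following small check (NumPy, for a toy value of d; our own illustrative code) assembles \({\mathbf {a}}_p\) from these Kronecker factors and verifies the canonical anticommutation relations \({\mathbf {a}}_p {\mathbf {a}}_q^\dagger + {\mathbf {a}}_q^\dagger {\mathbf {a}}_p = \delta _{pq} {\mathbf {I}}\), which is precisely how the Jordan–Wigner-type strings of \({\mathbf {S}}\) factors encode the antisymmetry:

```python
import numpy as np
from functools import reduce

A  = np.array([[0.0, 1.0], [0.0, 0.0]])      # annihilation on a single binary mode
S  = np.array([[1.0, 0.0], [0.0, -1.0]])     # sign (parity) factor
I2 = np.eye(2)

def annihilation(p, d):
    """a_p = S x ... x S x A x I x ... x I with A in the p-th position (1-based)."""
    return reduce(np.kron, [S] * (p - 1) + [A] + [I2] * (d - p))

d = 4
a = [annihilation(p, d) for p in range(1, d + 1)]
ok = all(np.allclose(a[p] @ a[q].T + a[q].T @ a[p], (p == q) * np.eye(2 ** d))
         for p in range(d) for q in range(d))
print(ok)   # True: the canonical anticommutation relations hold
```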

With the representation of the particle number operator \( {\mathbf {P}} = \sum _{ p =1}^{d} {\mathbf {a}}_{p}^\dagger {\mathbf {a}}_p \), finding the ground state of the discretised Schrödinger equation in binary variational form amounts to solving

$$\begin{aligned} \min _{{\mathbf {v}} \in {\mathscr {B}}_d} \bigl \{ \langle {\mathbf {H}}{\mathbf {v}}, {\mathbf {v}} \rangle : \langle {\mathbf {v}}, {\mathbf {v}} \rangle =1 \ , \ {\mathbf {P}}{\mathbf {v}} = N {\mathbf {v}} \bigr \}. \end{aligned}$$
(7.5)

Treating this problem by hierarchical tensor representations (e.g. by tensor trains, in this context usually referred to as matrix product states) for the d-fold tensor product space \({\mathscr {B}}_d\), one can obtain approximations of the wave function \(\varPsi \) that provide insight into separation of quantum systems into subsystems and their entanglement. The formulation (7.5) is fundamental in the modern formulation of many-particle quantum mechanics in terms of second quantisation. For a recent survey of related MPS techniques in physics see [124].

The practical application of the concepts described above in quantum chemistry is challenging due to the high accuracy requirements. For numerical examples, we refer to, e.g. [26, 130, 144]. The approach can be especially advantageous in the case of strongly correlated problems, such as the dissociation of molecules as considered in [107], which cannot be treated by classical methods such as Coupled Cluster. The tensor structure can also be exploited for the efficient computation of several eigenstates [43, 88].

Remark 7.1

Variants of the above binary coding can also be used in a much more general context. This leads to vector-tensorisation [64, 67], in the tensor train context also called quantised TT representation [82, 115], which can be applied to vectors \({\mathbf {x}}\in {\mathbb {K}}^{N}\), \({\mathbb {K}}\in \{ {\mathbb {R}}, {\mathbb {C}}\}\), with \(N=2^{d}\) that are identified with tensors \({{\mathbf {u}}}\in \bigotimes \nolimits _{i=1} ^{d}{\mathbb {K}}^{2}\). This identification can be realised by writing each index \(j \in \{0,\ldots ,2^d-1\}\) in its binary representation \(j = \sum _{i=1}^{d} c_i 2^{i-1}\), \(c_i \in \{0,1\}\). The identification \( j \simeq (c_1, \ldots , c_d ) \), \( c_i \in \{0, 1 \}\), defines a tensor \({{\mathbf {u}}}\) of order d with entries \( {{\mathbf {u}}}(c_1,\ldots ,c_d) {:=} {\mathbf {x}}(j)\). In many cases of interest, the hierarchical representations or approximations of these tensors have low ranks. In particular, for polynomials, exponentials, and trigonometric functions, the ranks are bounded independently of the grid size, and almost exponentially convergent approximations can be constructed for functions with isolated singularities [61, 80, 81]. There is also a relation to multiresolution analysis [83].
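
A brief sketch of this tensorisation (our own minimal TT-SVD in NumPy, for illustration only): a vector of length \(2^d\) is reshaped into a tensor with d binary modes, and successive SVDs of its unfoldings reveal the quantised TT ranks. For a sampled geometric sequence all ranks equal one, and for a sampled cosine they are bounded by two.

```python
import numpy as np

def tt_ranks(x, d, tol=1e-12):
    """Quantised TT ranks of a vector of length 2^d via successive truncated SVDs."""
    ranks, C = [], x.reshape(2, -1)                  # unfold: first binary digit vs. the rest
    for mu in range(d - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = int(np.sum(s > tol * s[0]))              # numerical rank of this unfolding
        ranks.append(r)
        C = (s[:r, None] * Vt[:r, :]).reshape(2 * r, -1)   # carry the remainder to the next mode
    return ranks

d = 10
j = np.arange(2 ** d)
print(tt_ranks(0.97 ** j, d))        # geometric sequence: all quantised TT ranks equal 1
print(tt_ranks(np.cos(0.1 * j), d))  # sampled cosine: ranks bounded by 2
```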

7.3 Langevin Dynamics and Fokker–Planck Equations

Let us consider the Langevin equation, which constitutes a stochastic differential equation (SDE) of the form

$$\begin{aligned} d {x} (t) = - \nabla V \bigl ( {x} (t) \bigl ) \, dt + \sqrt{\frac{2}{\gamma }} dW_t, \quad \gamma = \frac{1}{k_b T}, \quad {x}(t) \in {\mathbb {R}}^d, \end{aligned}$$
(7.6)

where \( W_t \) is a d-dimensional Brownian motion, see, e.g. [116]. The corresponding Fokker–Planck equation describes the transition probability and is given by

$$\begin{aligned} \partial _t u ( {x} , t) = L u ({x} ,t) {:=} \nabla \cdot \bigl ( u ({x}, t ) \nabla V ({x}) \bigr ) + \frac{1}{\gamma } \varDelta u ({x},t), \quad u ({x} , 0 ) = u_0 ({x}) . \end{aligned}$$

The transition probability is the conditional probability density \( u({x} , t) = p ({x}, t \,|\, {x}_0, 0) \) for a particle starting at \({x}_0\) to be found at point x at time t.

For simplicity, let us assume that \( {x} \in \varOmega {:=} [-R, R]^d \) with homogeneous Neumann boundary conditions. Under rather general conditions, the operator L has a discrete spectrum \( 0 = \lambda _0 \ge \lambda _1 \ge \cdots \ge \lambda _j \ge \cdots \) with \( \lambda _j \rightarrow -\infty \) as \(j \rightarrow \infty \), and smooth eigenfunctions \( \varphi _j \), \( j \in {\mathbb {N}}_0\). It is easy to check that \( \varphi _0 ({x}) = \frac{1}{Z} e^{- \gamma V ({x}) } \) is an eigenfunction for the eigenvalue \(\lambda _0 =0\), with the normalisation constant \(\frac{1}{Z}\) chosen such that \( \int _\varOmega \varphi _0 ({x}) \,d{x} = 1 \). Under reasonable conditions [116] it can be shown that \(\varphi _0 \) is the stationary or equilibrium distribution, \( \varphi _0 ({x}) = \lim _{t \rightarrow \infty } u ({x} , t ) \).

Instead of L, see, e.g. [42], we consider the transfer operator defined by mapping a given probability density \( u_0 ({x}) \) to a density at some time \( \tau > 0\),

$$\begin{aligned} u_0 ({x}) \mapsto T_{\tau } u_0 ({x} ) {:=} u ( {x} , \tau ), \quad {x} \in \varOmega = [-R , R]^d . \end{aligned}$$

In general \(T_{\tau } \) can be defined by a stochastic transition function \( p ( {x} , {y} ;\, \tau ) \), which describes the conditional probability of the system travelling from x to y in a finite time step \(\tau > 0\). We do not require explicit knowledge of p, but we make use of the fact that it satisfies the detailed balance condition \( \pi ({x} ) \, p( {x} , {y} , \tau ) = \pi ({y} ) \, p ({y} , {x} , \tau ) \), where \( \pi {:=} \varphi _0 \). Then \(T_\tau \) is self-adjoint with respect to the inner product with weight \(\pi ^{-1}\),

$$\begin{aligned} \langle u , v \rangle _{\pi } {:=} \int _\varOmega u ({x}) \, v({x} )\, \pi ^{-1}({x}) \,d{x} , \end{aligned}$$

that is, \(\langle T_{\tau } u , v \rangle _{\pi } = \langle u ,T_{\tau } v \rangle _{\pi }\). It has the same eigenfunctions \( \varphi _j \) as the Fokker–Planck operator L and eigenvalues \( \sigma _j = e^{ \lambda _j \tau } \), with \(\sigma _j \in [0,1]\), which accumulate at zero. For the description of meta-stable states, we are interested in the first eigenfunctions \( \varphi _j \), \( j=0,1, \ldots , m\), where the corresponding eigenvalues \( \sigma _j\) of \(T_{\tau }\) are close to one. This provides a good approximation of the dynamics after the fast eigenmodes corresponding to \(\sigma _j \approx 0\) are damped out.

In contrast to L, the operator \( T_{\tau } \) is bounded in \(L_2 (\varOmega )\), and the eigenvalue problem can be tackled by Galerkin methods using a basis \( \varPhi _k \), \( k \in {\mathscr {I}} \), and the weighted inner product \(\langle \cdot , \cdot \rangle _{\pi } \): with the ansatz \( \varphi _j = \sum _{k} u_{j,k} \varPhi _k \), the unknown coefficients \( {{\mathbf {u}}} \) and approximate eigenvalues \(\sigma \) are solutions of a generalised discrete eigenvalue problem \( {\mathbf {M}} {{\mathbf {u}}} = \sigma \; {\mathbf {M}}^0 {{\mathbf {u}}}\), where \({\mathbf {M}}_{k,\ell } = \langle \varPhi _k , T_{\tau } \varPhi _{\ell } \rangle _{\pi }\) and \({\mathbf {M}}^0_{k,\ell } = \langle \varPhi _k , \varPhi _{\ell } \rangle _{\pi }\).

We do not have a low-rank representation of the operator \( T_{\tau }\) at our disposal. However, we can discretise the closely related backward Kolmogorov operator, where the matrix entries can be estimated from time series: if samples of sufficiently long trajectories of the SDE (7.6) are available, then the corresponding matrix entries \({\mathbf {M}}_{k,\ell }\) and \({\mathbf {M}}^0_{k,\ell }\) can be computed by Monte Carlo integration.

Typical choices of basis functions \(\varPhi _k\) are, for instance, piecewise constant functions (Markov state models [120]) or Gaussians combined with collocation (diffusion maps [32]). Here we propose a tensor product basis obtained from univariate basis functions \( x_i \mapsto \chi _{\mu _i} (x_i) \), \( x_i \in {\mathbb {R}}\), combined with a low-rank tensor representation of basis coefficients.
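
As a rough illustration in the simplest setting (a one-dimensional double-well potential; all parameter choices below are ad hoc and ours), the following sketch simulates (7.6) by the Euler–Maruyama method and estimates the leading eigenvalues \(\sigma _j\) from the trajectory using a piecewise constant basis, i.e. a Markov state model. In higher dimensions, the matrices \({\mathbf {M}}\), \({\mathbf {M}}^0\) would instead be assembled in the tensor product basis just described, with low-rank representations of the coefficient tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, dt, n_steps = 3.0, 1e-3, 500_000
V_prime = lambda x: 4.0 * x * (x ** 2 - 1.0)           # V(x) = (x^2 - 1)^2, a double well

# Euler-Maruyama discretisation of (7.6): dx = -V'(x) dt + sqrt(2/gamma) dW
x = np.empty(n_steps)
x[0] = -1.0
sig = np.sqrt(2.0 * dt / gamma)
xi = rng.standard_normal(n_steps - 1)
for t in range(n_steps - 1):
    x[t + 1] = x[t] - V_prime(x[t]) * dt + sig * xi[t]

# Markov state model: piecewise constant basis on 20 boxes, lag time tau = lag * dt = 0.5
edges = np.linspace(-1.5, 1.5, 21)
states = np.clip(np.digitize(x, edges) - 1, 0, 19)
lag = 500
C = np.zeros((20, 20))
for s0, s1 in zip(states[:-lag], states[lag:]):
    C[s0, s1] += 1.0
C = 0.5 * (C + C.T)                                    # symmetrised counts (detailed balance)
keep = C.sum(axis=1) > 0                               # drop boxes that were never visited
T = C[np.ix_(keep, keep)]
T /= T.sum(axis=1, keepdims=True)                      # row-stochastic transition matrix
sigma = np.sort(np.real(np.linalg.eigvals(T)))[::-1]
print(sigma[:4])   # sigma_0 = 1; sigma_1 close to one reflects the slow hopping between wells
```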

For instance, combining tensor train (TT) representations with DMRG iterations for finding good low-rank approximations of the eigenfunctions, in preliminary numerical tests we observe that surprisingly small ranks suffice to obtain an accuracy comparable to that of state-of-the-art alternative methods. This will be reported in more detail in a forthcoming work.

8 Outlook

In view of the rapidly growing literature on the subject of this article, the overview that we have given here is necessarily incomplete. Still, we would like to mention some further topics of interest:

  • Adaptive sampling techniques analogous to adaptive cross approximation (ACA) [12, 109], which provide powerful tools to recover not only matrices, but also low-rank hierarchical tensors,

  • Tensor completion or tensor recovery [35, 89, 117], the counterpart to matrix recovery in compressive sensing,

  • Applications of hierarchical tensors in machine learning [27, 28, 31],

  • Greedy methods, based on successive best rank-one approximations [18, 24, 53],

  • Rank-adaptive alternating optimisation methods based on local residuals or low-rank approximations of global residuals [43, 44, 88],

  • HSVD truncation estimates in \(L^{\infty }\)-norm, see [66],

  • Optimisation of the dimension tree for the hierarchical format [11], which can give substantially more favourable ranks [21, 62].

Several aspects of low-rank approximation remain topics for future research. For instance, the exploitation of sparsity in the component tensors has not yet been addressed, and the combination of hierarchical tensor representations with linear transformations of the variables (as in ridge function approximation) also remains unexplored.