1 Introduction

In recent years, it has become clear that many interesting engineering applications feature an intrinsic dependence on a large number of parameters \(y_1,\ldots ,y_d\), leading to a major concentration of efforts in the development and analysis of high-dimensional approximation methods. In many relevant situations, the parameters \(y_j\) are independent random variables distributed on intervals \(I_j\) according to probability measures \(d\rho _j\). We are then typically interested in approximating a function

$$\begin{aligned} y=(y_1,\ldots ,y_d)\mapsto u(y), \end{aligned}$$

depending on these parameters and measuring the error in \(L^2(\varGamma ,d\rho )\), where \(\varGamma =I_1\times \cdots \times I_d\) and \(d\rho =d\rho _1 \otimes \cdots \otimes d\rho _d\). Up to a renormalization, we may assume that \(I_j=[-1,1]\) for all \(j\), so that \(\varGamma =[-1,1]^d\). In certain situations, the number of parameters may even be countably infinite, in which case \(\varGamma =[-1,1]^{\mathbb {N}}\). Examples where such problems occur are recurrent in the numerical treatment of parametric and stochastic PDEs, where a fast and accurate approximation of the parameter-to-solution map over high-dimensional parameter sets is useful to tackle more complex optimization, control, and inverse problems.

In such a context, the potential of specific high-dimensional polynomial approximation methods has been demonstrated in [6, 9, 10, 15, 21]. In these methods, the approximation is picked from a multivariate polynomial space

$$\begin{aligned} \mathbb {P}_\varLambda :=\mathrm{span} \left\{ y \mapsto y^\nu : \nu \in \varLambda \right\} , \end{aligned}$$

where \(\varLambda \) is a given finite subset of \(\mathbb {N}_0^d\). In the case of countably many parameters, \(d=\infty \), we replace \(\mathbb {N}_0^d\) by the set of finitely supported sequences of nonnegative integers.

The set \(\varLambda \) is said to be downward closed if and only if

$$\begin{aligned} \nu \in \varLambda \quad \mathrm{and} \quad \mu \le \nu \implies \mu \in \varLambda , \end{aligned}$$
(1)

where \(\mu \le \nu \) is meant component-wise as \(\mu _i\le \nu _i\) for all i. Polynomial spaces \(\mathbb {P}_\varLambda \) associated with downward closed index sets \(\varLambda \) have been studied in various contexts, see [2, 12, 13, 16, 17].
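Property (1) is easy to test computationally, since it suffices to check the immediate predecessors \(\nu -e_j\) of each index (the general case \(\mu \le \nu \) then follows by induction). The following minimal sketch, with multi-indices represented as integer tuples and all names our own, illustrates this:

```python
def is_downward_closed(Lambda):
    """Check property (1): nu in Lambda and mu <= nu (componentwise)
    implies mu in Lambda.  Testing the immediate predecessors nu - e_j
    suffices, since the general case mu <= nu follows by induction."""
    S = set(Lambda)
    for nu in S:
        for j, nu_j in enumerate(nu):
            if nu_j > 0:
                pred = nu[:j] + (nu_j - 1,) + nu[j + 1:]
                if pred not in S:
                    return False
    return True

# A downward closed set in d = 2, and a set violating (1):
print(is_downward_closed({(0, 0), (1, 0), (0, 1), (1, 1)}))  # True
print(is_downward_closed({(0, 0), (2, 0)}))                  # False: (1,0) is missing
```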

There exist two main approaches to polynomial approximation of a given function u based on pointwise evaluations. The first approach relies on interpolation of the function \(u\) at a given set of points \(\{y^1,\ldots ,y^n\}\) where \(n:=\#(\varLambda )=\dim (\mathbb {P}_\varLambda )\); that is, find \(v\in \mathbb {P}_\varLambda \) such that \(v(y^i)=u(y^i)\) for \(i=1,\ldots ,n\). The second approach relies on projection, which aims at minimizing the \(L^2(\varGamma ,d\rho )\) error between \(u\) and its approximation in \(\mathbb {P}_\varLambda \). Since the exact projection is not available, one typical strategy consists of using the discrete least-squares method, that is, solving the problem

$$\begin{aligned} \min _{v\in \mathbb {P}_\varLambda } \sum _{i=1}^m \left| v\left( y^i\right) -u\left( y^i\right) \right| ^2, \end{aligned}$$

where now \(m>n\). Discrete least-squares methods are often preferred to interpolation methods when the observed evaluations are polluted by noise. Their convergence analysis has been studied in the general context of learning theory, see, for example, [11, 14, 23, 24, 28].

In recent years, an analysis of discrete least-squares methods has been proposed [6, 15, 19, 20, 22], specifically targeted to the above described case of multivariate polynomial spaces associated with downward closed sets, in the case where the \(d\rho _j\) are identical Jacobi-type measures. This analysis, which builds upon the general results from [7], gives conditions ensuring that, in the absence of noise in the pointwise evaluation of u, the accuracy of the discrete least-squares approximation is comparable to the best approximation error achievable in \(\mathbb {P}_\varLambda \), that is,

$$\begin{aligned} e_{\varLambda }(u):=\inf _{v\in \mathbb {P}_\varLambda } \Vert u- v\Vert _{L^2(\varGamma ,d\rho )}. \end{aligned}$$

These conditions are stated in terms of a relation between the size m of the sample and the dimension n of \(\mathbb {P}_\varLambda \). A similar analysis also covers the case of an additive noise in the evaluation of the samples, which results in additional terms in the error estimate, see, e.g., [22].

One remarkable result from the above analysis is that the conditions ensuring that the least-squares method has accuracy comparable to \(e_{\varLambda }(u)\) only involve the dimension of \(\mathbb {P}_\varLambda \). These conditions are independent of the specific shape of the set \(\varLambda \) (as long as it is downward closed), and in particular independent of the dimension d.

The possibility of using arbitrary sets \(\varLambda \) is critical in the context of parametric PDEs in view of the recent results on high-dimensional polynomial approximation obtained in [3, 9, 10]. These results show that for relevant classes of parametric PDEs, the functions \(y\mapsto u(y)\) describing either the full solution or scalar quantities of interest can be approximated with convergence rates \(\mathcal{O}(n^{-s})\) which are independent of the parametric dimension d, when using polynomial spaces \(\mathbb {P}_{\varLambda _n}\) associated to specific sequences of downward closed multi-index sets \((\varLambda _n)_{n \ge 1}\) with \(\#(\varLambda _n)=n\). In summary, we have

$$\begin{aligned} e_n(u):=\min _{\#(\varLambda )=n} e_{\varLambda }(u) \le Cn^{-s}, \end{aligned}$$
(2)

where the minimum is taken over all downward closed sets of given cardinality n.

For each value of n, the optimal set \(\varLambda _n\) is the one that achieves the minimum in (2) among all downward closed \(\varLambda \) of cardinality n. This set is unknown to us when observing only the samples \(u(y^i)\). Therefore, a legitimate objective is to develop least-squares methods for which the accuracy is comparable to the quantity \(e_n(u)\).

In this paper, we discuss least-squares approximations on multivariate polynomial spaces for which the choice of \(\varLambda \) is optimized based on the sample. In particular, we prove that the performance of such approximations is comparable to the quantity in (2), under a relation between m and n where the dimension d enters as a logarithmic factor. Furthermore, we show that this logarithmic dependence on d can be fully removed by considering a more restricted class of downward closed sets called anchored sets, for which similar approximation rates as in (2) can be achieved. The resulting least-squares methods are thus immune to the curse of dimensionality.

The outline of the paper is the following: in Sect. 2, we introduce the notation and briefly review some of the previous results achieved in the analysis of discrete least squares on fixed multivariate polynomial spaces. In Sect. 3, we present the main results of the paper concerning discrete least-squares approximations on optimized polynomial spaces. Our analysis is based on establishing upper bounds on the number of downward closed or anchored sets of a given cardinality, or on the cardinality of their union.

The selection of the optimal polynomial space is based on minimizing the least-squares error among all possible choices of downward closed or anchored sets of a given cardinality n. Let us stress that, when performed as an exhaustive search, this task becomes computationally intensive when n and d are simultaneously large. Our results should therefore mainly be viewed as a benchmark in arbitrary dimension d for assessing the performance of fast selection algorithms, such as greedy algorithms, that still need to be developed and analyzed in this context. A general discussion on alternative selection strategies with reasonable computational cost is presented in Sect. 4.

2 Discrete Least-Squares Approximation by Multivariate Polynomials

In this section, we introduce some useful notation and recall from [6] the main results achieved for the analysis of the stability and accuracy of discrete least-squares approximations in multivariate polynomial spaces.

2.1 Notation

In any given dimension \(d \in \mathbb {N}\), we consider the domain \(\Gamma :=[-1,1]^d\), and for some given real numbers \(\theta _1,\theta _2>-1\), the tensorized Jacobi measure

$$\begin{aligned} d\rho = \otimes _{j=1}^d d\beta , \end{aligned}$$

with density

$$\begin{aligned} \rho (y):= \prod _{j=1}^d \beta (y_j), \end{aligned}$$

where

$$\begin{aligned} d\beta =\beta (t)dt:=c(1-t)^{\theta _1}(1+t)^{\theta _2} dt, \quad c:=\left( \int _{-1}^{1} (1-t)^{\theta _1}(1+t)^{\theta _2} dt \right) ^{-1}. \end{aligned}$$
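For sampling purposes, note that \(d\beta \) is, up to the affine change of variable \(t=2x-1\), a Beta distribution on \([0,1]\): if \(x\sim \mathrm{Beta}(\theta _2+1,\theta _1+1)\), then \(t=2x-1\) has density proportional to \((1-t)^{\theta _1}(1+t)^{\theta _2}\) on \([-1,1]\). A minimal sketch (our own helper, using numpy) draws i.i.d. samples from the tensorized measure \(d\rho \):

```python
import numpy as np

def sample_jacobi(m, d, theta1, theta2, rng=None):
    """Draw m i.i.d. points in [-1,1]^d from the tensorized Jacobi
    measure: in each coordinate, x ~ Beta(theta2 + 1, theta1 + 1) on
    [0,1] is mapped to t = 2x - 1, whose density is proportional to
    (1-t)^theta1 (1+t)^theta2 on [-1,1]."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.beta(theta2 + 1.0, theta1 + 1.0, size=(m, d))
    return 2.0 * x - 1.0

# theta1 = theta2 = 0 is the uniform measure on [-1,1]^d, and
# theta1 = theta2 = -1/2 is the Chebyshev (arcsine) measure.
Y = sample_jacobi(m=1000, d=3, theta1=0.0, theta2=0.0)
```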

We may also consider the case \(\Gamma := [-1,1]^\mathbb {N}\) for which \(d=+\infty \) and \(d\rho \) is the Jacobi measure defined over \(\Gamma \) in the usual manner. We denote by \(L^2(\Gamma ,d\rho )\) the Hilbert space of real-valued square-integrable functions with respect to \(d\rho \) and denote by \(\Vert \cdot \Vert \) the associated norm, i.e.,

$$\begin{aligned} \Vert v\Vert := \left( \int _\varGamma |v(y)|^2 d\rho (y)\right) ^{1/2}. \end{aligned}$$

Moreover, let \({\mathcal {F}}\) be defined as the set \(\mathbb {N}_0^d\), where \(\mathbb {N}_0:=\{0,1,2,\ldots \}\), in the case \(d<+\infty \), or as the countable set of all finitely supported sequences from \(\mathbb {N}_0^\mathbb {N}\) in the case \(d=+\infty \). For any \(\nu \in {\mathcal {F}}\), we define

$$\begin{aligned} {{\mathrm{supp}}}(\nu ):=\{ j\ge 1 : \nu _j\ne 0 \}, \end{aligned}$$

and for any multi-index set \(\varLambda \subseteq {\mathcal {F}}\), we define

$$\begin{aligned} {{\mathrm{supp}}}(\varLambda ):=\bigcup _{\nu \in \varLambda } {{\mathrm{supp}}}(\nu ). \end{aligned}$$

We say that a coordinate \(y_j\) is active in the space \(\mathbb {P}_\varLambda \) when \(j\in {{\mathrm{supp}}}(\varLambda )\).

For the given real parameters \(\theta _1,\theta _2>-1\), we introduce the family \((J_n)_{n\ge 0}\) of univariate orthonormal Jacobi polynomials associated with the measure \(d\beta \), and their tensorized counterpart

$$\begin{aligned} J_\nu (y)=\prod _{j\ge 1} J_{\nu _j}(y_j), \quad y=(y_j)_{j\ge 1}, \end{aligned}$$

for any \(\nu \in {\mathcal {F}}\). The \((J_{\nu })_{\nu \in {\mathcal {F}}}\) are an \(L^2(\varGamma ,d\rho )\)-orthonormal basis. Particular instances of these polynomials are tensorized Legendre polynomials when \(\theta _1=\theta _2=0\) and tensorized Chebyshev polynomials of the first kind when \(\theta _1=\theta _2=-1/2\).
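To make the normalization concrete in the Legendre case \(\theta _1=\theta _2=0\): the classical Legendre polynomials \(P_k\) satisfy \((k+1)P_{k+1}(t)=(2k+1)tP_k(t)-kP_{k-1}(t)\), and since \(d\beta \) is here the uniform probability measure on \([-1,1]\), the orthonormal family is \(J_k=\sqrt{2k+1}\,P_k\). A short sketch (our own helper names) evaluates the tensorized basis functions:

```python
import numpy as np

def legendre_orthonormal(t, kmax):
    """Evaluate J_0,...,J_kmax at the points t (1-D array), where
    J_k = sqrt(2k+1) P_k is orthonormal for the uniform probability
    measure on [-1,1].  Returns an array of shape (len(t), kmax+1)."""
    t = np.asarray(t, dtype=float)
    P = np.zeros((t.size, kmax + 1))
    P[:, 0] = 1.0
    if kmax >= 1:
        P[:, 1] = t
    for k in range(1, kmax):
        # three-term recurrence: (k+1) P_{k+1} = (2k+1) t P_k - k P_{k-1}
        P[:, k + 1] = ((2 * k + 1) * t * P[:, k] - k * P[:, k - 1]) / (k + 1)
    return P * np.sqrt(2 * np.arange(kmax + 1) + 1)

def J_nu(Y, nu):
    """Tensorized basis function J_nu(y) = prod_j J_{nu_j}(y_j),
    evaluated at the points Y of shape (m, d)."""
    vals = np.ones(Y.shape[0])
    for j, k in enumerate(nu):
        if k > 0:
            vals *= legendre_orthonormal(Y[:, j], k)[:, k]
    return vals
```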

In the present paper, we focus on finite multi-index sets \(\varLambda \) that are downward closed in the sense of (1). We also say that a polynomial space \(\mathbb {P}_\varLambda \) is downward closed when it is associated with a downward closed multi-index set \(\varLambda \subset {\mathcal {F}}\). Recall that \(\mathbb {P}_\varLambda \) has been defined as the span of the monomials \(y\mapsto y^\nu \) for \(\nu \in \varLambda \). Therefore it admits \((J_\nu )_{\nu \in \varLambda }\) as an \(L^2(\varGamma ,d\rho )\)-orthonormal basis in the case of \(\varLambda \) downward closed. Sometimes we enumerate the indices \(\nu \) using the lexicographical ordering, and denote this basis by \((\psi _k)_{k=1,\ldots ,n}\), where

$$\begin{aligned} n:=\#(\varLambda )=\dim (\mathbb {P}_\varLambda ). \end{aligned}$$

Given a finite downward closed multi-index set \(\varLambda \subset {\mathcal {F}}\), we would like to approximate the target function \(u:\varGamma \rightarrow \mathbb {R}\) in the \(L^2\) sense, using the noiseless evaluations \((u(y^i))_{i=1,\ldots ,m}\) of \(u\) at the points \((y^i)_{i=1,\ldots ,m}\), where the \(y^i\) are i.i.d. random variables distributed according to \(\rho \). We define the continuous \(L^2\) projection of \(u\) on the polynomial space \(\mathbb {P}_\varLambda \) as

$$\begin{aligned} \varPi _\varLambda u:= \mathop {{{\mathrm{argmin}}}}\limits _{v\in \mathbb {P}_\varLambda } \Vert u- v\Vert , \end{aligned}$$

and the discrete least-squares approximation \(\varPi _\varLambda ^m u\) as

$$\begin{aligned} \varPi _\varLambda ^m u:= \mathop {{{\mathrm{argmin}}}}\limits _{v\in \mathbb {P}_\varLambda } \Vert u- v\Vert _m, \end{aligned}$$
(3)

where we have used the notation

$$\begin{aligned} \Vert v \Vert _m := \left( \frac{1}{m}\sum _{i=1}^m v\left( y^i\right) ^2\right) ^{\frac{1}{2}}. \end{aligned}$$

We introduce the \(m\times \#(\varLambda )\) design matrix \(\mathbf D\) and the vector \({\varvec{b}}\in \mathbb {R}^m\) whose entries are given by \(\mathbf D_{i,k}=\psi _k(y^i)\) and \({\varvec{b}}_{i}=u(y^i)\). We define the Gramian matrix \(\mathbf G:=m^{-1} \mathbf D^{\mathrm{T}} \mathbf D\). The discrete least-squares projection in (3) is then given by

$$\begin{aligned} \varPi _\varLambda ^m u=\sum _{k=1}^{\#(\varLambda )}{} \mathbf{a}_k \psi _k, \end{aligned}$$

where the vector \(\mathbf{a}=(\mathbf{a}_k)\in \mathbb {R}^{\#(\varLambda )}\) is the solution to the normal equations

$$\begin{aligned} \mathbf G\mathbf{a} = m^{-1}\mathbf D^{\mathrm{T}} {{\varvec{b}}}. \end{aligned}$$
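A minimal end-to-end sketch (our own, reusing the helpers above) assembles \(\mathbf D\), \(\mathbf G\), and \({\varvec{b}}\) and solves the normal equations. In practice one would typically solve the least-squares problem through a QR factorization of \(\mathbf D\) rather than by forming \(\mathbf G\) explicitly; we form it here only to exhibit the quantities of the text, and we also return the eigenvalues of \(\mathbf G\), whose distance to 1 measures the stability of the discrete problem, as discussed in the next subsection:

```python
import numpy as np

def least_squares_projection(u, Lambda, Y):
    """Discrete least-squares approximation (3) of u on P_Lambda from
    the samples Y of shape (m, d).  Returns the coefficient vector a of
    Pi^m_Lambda u in the basis (J_nu), together with the eigenvalues
    of the Gramian G."""
    basis = sorted(Lambda)                                # fix an ordering of Lambda
    m = Y.shape[0]
    D = np.column_stack([J_nu(Y, nu) for nu in basis])    # design matrix
    b = u(Y)                                              # noiseless evaluations
    G = D.T @ D / m                                       # Gramian matrix
    a = np.linalg.solve(G, D.T @ b / m)                   # normal equations
    return a, np.linalg.eigvalsh(G)

# Example with Lambda = {(0,0), (1,0), (0,1)} and m = 200 samples:
u = lambda Y: np.exp(0.3 * Y[:, 0] + 0.1 * Y[:, 1])
Y = sample_jacobi(m=200, d=2, theta1=0.0, theta2=0.0)
a, eigs = least_squares_projection(u, {(0, 0), (1, 0), (0, 1)}, Y)
```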

2.2 Previous Results on the Stability and Accuracy of Discrete Least Squares

We introduce the quantity

$$\begin{aligned} K(\mathbb {P}_\varLambda ) := \sup _{y\in \Gamma } \sum _{\nu \in \varLambda } \left| J_\nu (y) \right| ^2. \end{aligned}$$
(4)

It is proved in [6] that discrete least squares in multivariate polynomial spaces are stable and accurate provided that a precise condition involving m and \(K(\mathbb {P}_\varLambda )\) is satisfied. Similar results have been proved in [22] for the case of noisy observations of the target function, with several noise models.

For any \(\delta \in ]0,1[\), we introduce the positive quantity

$$\begin{aligned} \zeta (\delta ):=\delta + (1-\delta ) \ln (1-\delta ). \end{aligned}$$
(5)

Given a threshold \(\tau \in \mathbb {R}_0^+\), we introduce the truncation operator

$$\begin{aligned} T_\tau (t):=\mathrm{sign}(t)\min \{ \tau ,\vert t \vert \}, \quad \text {for any } t \in \mathbb {R}, \end{aligned}$$

and use it to define the truncated discrete least-squares projection operator \(u\mapsto T_\tau (\varPi _\varLambda ^m u)\). The main results from [6] concerning stability and accuracy of the discrete least-squares approximation with noiseless evaluations can be summarized as follows.

Theorem 1

In any dimension d, for any \(r > 0\), any \(\delta \in ]0,1[\), and any downward closed multi-index set \(\varLambda \subset \mathbb {N}_0^d\), one has

$$\begin{aligned}&\Pr \left( \left\{ (1-\delta ) \Vert v\Vert ^2 \le \Vert v\Vert _m^2 \le (1+\delta ) \Vert v\Vert ^2 , \ \forall v\in \mathbb {P}_\varLambda \right\} \right) \nonumber \\&\quad \ge 1-2n\exp \left( -\zeta (\delta )m/K(\mathbb {P}_\varLambda )\right) . \end{aligned}$$
(6)

If the following condition between m and \(K(\mathbb {P}_\varLambda )\) is satisfied:

$$\begin{aligned} \dfrac{m}{\ln m} \ge \dfrac{(1+r)}{\zeta (\delta )} K(\mathbb {P}_\varLambda ), \end{aligned}$$
(7)

then

$$\begin{aligned} \Pr \left( \left\{ (1-\delta ) \Vert v\Vert ^2 \le \Vert v\Vert _m^2 \le (1+\delta ) \Vert v\Vert ^2 , \quad \forall v\in \mathbb {P}_\varLambda \right\} \right) \ \ge 1 - 2 m^{-r}. \end{aligned}$$

Moreover, for any \(u\in L^\infty (\Gamma )\) with \(\Vert u\Vert _{L^\infty (\Gamma )} \le \tau \), the following holds:

$$\begin{aligned}&\Pr \left( \Vert u- \varPi _\varLambda ^m u\Vert \le \left( 1+\sqrt{\dfrac{1}{1-\delta }}\right) \inf _{v\in \mathbb {P}_\varLambda } \Vert u- v\Vert _{L^\infty (\varGamma )} \right) \ge 1 - 2 m^{-r}, \end{aligned}$$
(8)
$$\begin{aligned}&\mathbb {E}\left( \Vert u- T_\tau \left( \varPi _\varLambda ^m u\right) \Vert ^2 \right) \le \left( 1+ \frac{4\zeta (\delta )}{(1+r) \ln m} \right) \Vert u- \varPi _\varLambda u\Vert ^2 + 8 \tau ^2 m^{-r}. \end{aligned}$$
(9)

The above theorem states that under condition (7), the discrete least-squares approximation is stable, since one has

$$\begin{aligned} (1-\delta ) \Vert v\Vert ^2 \le \Vert v\Vert _m^2 \le (1+\delta ) \Vert v\Vert ^2 ,\quad \forall v\in \mathbb {P}_{\varLambda } \Leftrightarrow (1-\delta ) \mathbf{I}\le \mathbf G\le (1+\delta ) \mathbf{I}, \end{aligned}$$

where \(\mathbf{I}\) denotes the identity matrix. Under the same condition, the discrete least-squares approximation is also accurate in probability, from (8), and in expectation, from (9), since the approximation error behaves like the best approximation error in \(L^\infty \) or in \(L^2\).

The quantity \(K(\mathbb {P}_\varLambda )\) depends both on \(\varLambda \) and on the chosen Jacobi measure, and therefore on the parameters \(\theta _1,\theta _2\). The following result from [6, 18] shows that, once these two parameters are fixed, the quantity \(K(\mathbb {P}_\varLambda )\) satisfies bounds that only depend on \(\#(\varLambda )\), independently of the particular shape of \(\varLambda \) and of the dimension d.

Lemma 1

In any dimension d and for any finite downward closed set \(\varLambda \subset {\mathcal {F}}\), one has

$$\begin{aligned} \#(\varLambda ) \le K(\mathbb {P}_\varLambda ) \le {\left\{ \begin{array}{ll} (\#(\varLambda ))^{\ln 3 / \ln 2} &{}\quad \mathrm{if}\,\, \theta _1=\theta _2=-1/2, \\ (\#(\varLambda ))^{2\max \{\theta _1,\theta _2\}+2} &{}\quad \mathrm{if}\,\,\theta _1,\theta _2\in \mathbb {N}_0. \end{array}\right. } \end{aligned}$$

Combining the two results, one therefore obtains sufficient conditions for stability and optimal accuracy expressed only in terms of a relation between \(\#(\varLambda )\) and m. For example, in the case of the uniform measure that corresponds to \(\theta _1=\theta _2=0\), this relation is of the form

$$\begin{aligned} \dfrac{m}{\ln m} \ge c \left( \#(\varLambda ) \right) ^2, \quad c:=c(\delta ,r). \end{aligned}$$
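For the two cases covered by Lemma 1, the quantity \(K(\mathbb {P}_\varLambda )\) can be computed explicitly: the univariate factors \(|J_k|\) attain their maximum at \(t=1\), with \(J_k(1)^2=2k+1\) in the Legendre case \(\theta _1=\theta _2=0\) and \(\sup |J_k|^2=2\) for \(k\ge 1\) in the Chebyshev case, so the supremum in (4) is attained at \(y=(1,\ldots ,1)\). A short sketch (our own) evaluates both formulas and checks Lemma 1 on a small example:

```python
import numpy as np

def K_legendre(Lambda):
    """K(P_Lambda) for tensorized Legendre polynomials: the supremum
    in (4) is attained at y = (1,...,1), giving
    K = sum_nu prod_j (2 nu_j + 1)."""
    return sum(int(np.prod([2 * k + 1 for k in nu])) for nu in Lambda)

def K_chebyshev(Lambda):
    """K(P_Lambda) for tensorized Chebyshev polynomials of the first
    kind: |J_0| = 1 and sup |J_k| = sqrt(2) for k >= 1, attained at
    t = 1, giving K = sum_nu 2^(#supp(nu))."""
    return sum(2 ** sum(1 for k in nu if k > 0) for nu in Lambda)

# The rectangular block {0,1}^2 (n = 4) saturates both bounds of Lemma 1:
Lam = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(K_legendre(Lam))    # 16 = n^2           (Jacobi bound with theta1 = theta2 = 0)
print(K_chebyshev(Lam))   #  9 = n^(ln3/ln2)   (Chebyshev bound)
```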

3 Optimal Selection of Downward Closed Polynomial Spaces

The results recalled in the previous section hold for a given downward closed set \(\varLambda \subset {\mathcal {F}}\). We now consider the problem of optimizing the choice of \(\varLambda \), or equivalently that of the space \(\mathbb {P}_\varLambda \).

3.1 Optimized Index Sets

We define the family

$$\begin{aligned} \mathcal{M}^d_n := \{\varLambda \subset {\mathcal {F}}: \varLambda \text { is downward closed and } \#(\varLambda )=n\}, \end{aligned}$$

of all downward closed sets of cardinality n in d coordinates. Note that, in contrast to the family of all subsets of \({\mathcal {F}}\) of cardinality n, the family \(\mathcal{M}^d_n\) is finite.

The error of best n-term polynomial approximation by downward closed sets is then defined by

$$\begin{aligned} \sigma _n(u):=\min _{\varLambda \in \mathcal{M}_{n}^{d}} \min _{ v\in \mathbb {P}_\varLambda } \Vert u-v\Vert . \end{aligned}$$

A best n-term downward closed set is a \(\varLambda \in \mathcal{M}_{n}^{d}\) that achieves this minimum, that is, such that

$$\begin{aligned} \varLambda ^{\mathrm{opt}} := \mathop {{{\mathrm{argmin}}}}\limits _{\varLambda \in \mathcal{M}_{n}^{d} } \min _{ v\in \mathbb {P}_\varLambda } \Vert u-v\Vert . \end{aligned}$$

Using the Parseval identity, we find that \(\varLambda ^{\mathrm{opt}}\) is also given by

$$\begin{aligned} \varLambda ^{\mathrm{opt}} = \mathop {{{\mathrm{argmin}}}}\limits _{\varLambda \in \mathcal{M}_{n}^{d} } \sum _{\nu \notin \varLambda } |u_\nu |^2, \quad u_\nu =\int _\varGamma u(y) J_\nu (y) d\rho (y). \end{aligned}$$

Note that such a set may not be unique due to possible ties in the values of the coefficients, in which case we consider a unique choice by breaking the ties in some arbitrary but fixed way. We set

$$\begin{aligned} u_n:=\varPi _{\varLambda ^{\mathrm{opt}} }u=\mathop {{{\mathrm{argmin}}}}\limits _{ v\in \mathbb {P}_{\varLambda ^{\mathrm{opt}}}} \Vert u-v\Vert . \end{aligned}$$
(10)

Of course, in the least-squares method, the discrete data do not allow us to identify \(\varLambda ^{\mathrm{opt}}\). Instead, we rely on

$$\begin{aligned} \varLambda ^{\mathrm{opt}}_{ m } := \mathop {{{\mathrm{argmin}}}}\limits _{\varLambda \in \mathcal{M}_{n}^{d} } \min _{ v\in \mathbb {P}_\varLambda } \Vert u- v\Vert _m, \end{aligned}$$
(11)

and compute

$$\begin{aligned} w_n:=\varPi _{\varLambda ^{\mathrm{opt}}_m}^m u=\mathop {{{\mathrm{argmin}}}}\limits _{ v\in \mathbb {P}_{\varLambda ^{\mathrm{opt}}_m}} \Vert u-v\Vert _m. \end{aligned}$$

Our objective is now to compare the accuracy of the polynomial least-squares approximation based on \(\varLambda ^{\mathrm{opt}}_{m}\) with the above optimal error \(\sigma _n(u)\). For this purpose, we shall use the random variable

$$\begin{aligned} C_{n}^{d} := \max _{\varLambda \in \mathcal{M}_{n}^{d} } \max _{v\in \mathbb {P}_\varLambda } \frac{\Vert v\Vert ^2}{\Vert v\Vert ^2_{m} }. \end{aligned}$$

Note that the search for \(\varLambda ^{\mathrm{opt}}_{m}\) remains a difficult task from the computational point of view, due to the fact that \(\#(\mathcal{M}_n^d )\) becomes very large even for moderate values of n and d. As we discuss further, this cardinality also affects the conditions between m and n which guarantee the optimality of the least-squares approximation based on \(\varLambda ^{\mathrm{opt}}_m\).
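For small n and d, however, the exhaustive search is perfectly feasible, and the following sketch (our own, reusing the helpers of Sect. 2.1) makes the procedure concrete: it enumerates \(\mathcal{M}_n^d\) by growing downward closed sets one admissible multi-index at a time, and then selects \(\varLambda ^{\mathrm{opt}}_{m}\) by brute force:

```python
import numpy as np

def downward_closed_sets(n, d):
    """Enumerate M_n^d, the downward closed sets of cardinality n in d
    coordinates, by growing sets one admissible multi-index at a time
    (exhaustive, hence feasible only for small n and d)."""
    level = {frozenset([(0,) * d])}
    for _ in range(n - 1):
        nxt = set()
        for Lam in level:
            for mu in Lam:
                for j in range(d):
                    nu = mu[:j] + (mu[j] + 1,) + mu[j + 1:]
                    # nu is admissible if Lam + {nu} stays downward closed
                    if nu not in Lam and all(
                        nu[i] == 0 or
                        nu[:i] + (nu[i] - 1,) + nu[i + 1:] in Lam
                        for i in range(d)
                    ):
                        nxt.add(Lam | {nu})
        level = nxt
    return level

def optimal_set_m(u, Y, n):
    """Lambda^opt_m of (11): brute-force search for the downward closed
    set of cardinality n minimizing the empirical error ||u - v||_m."""
    best, best_err = None, np.inf
    for Lam in downward_closed_sets(n, Y.shape[1]):
        a, _ = least_squares_projection(u, Lam, Y)
        D = np.column_stack([J_nu(Y, nu) for nu in sorted(Lam)])
        err = np.sqrt(np.mean((u(Y) - D @ a) ** 2))   # ||u - v||_m
        if err < best_err:
            best, best_err = Lam, err
    return best, best_err

print(len(downward_closed_sets(3, 2)))   # the 3 downward closed sets of size 3 in d = 2
```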

For this reason, it is useful to introduce an additional restriction on the potential index sets. We say that \(\varLambda \) is anchored if and only if it is downward closed and satisfies in addition

$$\begin{aligned} e_{j} \in \varLambda \quad \text { and }\quad j^{\prime } \le j\implies e_{j^{\prime }} \in \varLambda , \end{aligned}$$

where \(e_j\) and \(e_{j^{\prime }}\) are the Kronecker sequences with 1 at positions \(j\) and \(j^{\prime }\), respectively. We also say that a polynomial space \(\mathbb {P}_\varLambda \) is anchored when \(\varLambda \) is anchored. Likewise, we define the family

$$\begin{aligned} \mathcal{A}_{n}^{} := \{\varLambda \subset {\mathcal {F}}: \varLambda \text { is anchored} \text { and } \#(\varLambda )=n\}. \end{aligned}$$

The restriction to anchored sets introduces an order of priority among the coordinates: given any \(j\ge 1\), the coordinate \(y_j\) is active in \(\varLambda \) only if all the coordinates \(y_k\) for \(k<j\) are also active. In particular, for any set \(\varLambda \in \mathcal{A}_n\), we have

$$\begin{aligned} {{\mathrm{supp}}}(\varLambda )=\{1,\ldots ,k\} \end{aligned}$$
(12)

for some \(k\le n-1\).

It is proved in [9] that, for relevant classes of parametric PDEs, the same algebraic convergence rates \(\mathcal{O}(n^{-s})\) can be obtained when imposing the anchored structure on the optimally selected sets \((\varLambda _n)_{n \ge 1}\) with \(\#(\varLambda _n)=n\). As we shall see further, one specific advantage of anchored sets is that they completely remove the dependence on the dimension d from the convergence analysis of the least-squares method and, in particular, allow us to consider the infinite-dimensional framework.

Using the same notation as before with obvious modifications, we introduce the anchored counterparts of (10) and (11):

$$\begin{aligned} \tilde{\varLambda }^{\mathrm{opt}} := \mathop {{{\mathrm{argmin}}}}\limits _{\varLambda \in \mathcal{A}_{n} } \min _{ v\in \mathbb {P}_\varLambda } \Vert u-v\Vert , \quad \tilde{u}_n:=\varPi _{\tilde{\varLambda }^{\mathrm{opt}} }u, \quad \tilde{\varLambda }^{\mathrm{opt}}_{ m } := \mathop {{{\mathrm{argmin}}}}\limits _{\varLambda \in \mathcal{A}_{n} } \min _{ v\in \mathbb {P}_\varLambda } \Vert u- v\Vert _m, \quad \tilde{w}_n:=\varPi _{\tilde{\varLambda }^{\mathrm{opt}}_m}^m u, \end{aligned}$$
(13)

and

$$\begin{aligned} \widetilde{C}_{n}^{} := \max _{\varLambda \in \mathcal{A}_{n}^{}} \max _{v\in \mathbb {P}_\varLambda } \frac{\Vert v\Vert ^2}{\Vert v\Vert ^2_{m} }. \end{aligned}$$

Remark 1

The estimators \(w_n\) and \(\tilde{w}_n\) can be viewed as the solutions of a nonconvex optimization problem. This problem has a natural algebraic formulation. Recalling the \(L^2(\varGamma ,d\rho )\)-orthonormal basis \((J_\nu )_{\nu \in {\mathcal {F}}}\), we introduce, for a given finite set \(\mathcal{J}\subset {\mathcal {F}}\), the design matrix

$$\begin{aligned} \mathbf D=\left( J_\nu \left( y^i\right) \right) , \end{aligned}$$
(14)

where the row index i ranges in \(\{1,\ldots ,m\}\) and the column index \(\nu \) ranges in \(\mathcal{J}\). Then, recalling the data vector \(\mathbf{b}=(u(y^i))_{i=1,\ldots ,m}\), and taking \(\mathcal{J}\) as the union of all downward closed sets of cardinality n, we find that the component vector \(\mathbf{a}=(a_\nu )_{\nu \in \mathcal{J}}\) of \(w_n=\sum \nolimits _{\nu \in \mathcal{J}} a_\nu J_\nu \) is the solution to the constrained minimization

$$\begin{aligned} \min \left\{ \Vert \mathbf D\mathbf{a}-\mathbf{b}\Vert _{\ell ^2} : \Vert \mathbf{a}\Vert _{\ell ^0_d}\le n\right\} . \end{aligned}$$

Here \(\Vert \mathbf{a}\Vert _{\ell ^0_d}\) is the cardinality of the smallest downward closed set \(\varLambda \subset \mathcal{J}\) that contains all the nonzero entries of \(\mathbf{a}\). In other words, we minimize over those \(\mathbf{a}\) whose support is contained in a downward closed set of cardinality at most n. Likewise, taking \(\mathcal{J}\) as the union of all anchored sets of cardinality n, we find that the component vector \(\widetilde{\mathbf{a}}=(\tilde{a}_\nu )_{\nu \in \mathcal{J}}\) of \(\tilde{w}_n=\sum \nolimits _{\nu \in \mathcal{J}} \tilde{a}_\nu J_\nu \) is the solution of the constrained minimization

$$\begin{aligned} \min \left\{ \Vert \mathbf D\widetilde{\mathbf{a}}-\mathbf{b}\Vert _{\ell ^2} : \Vert \widetilde{\mathbf{a}}\Vert _{\ell ^0_a}\le n\right\} . \end{aligned}$$

Here \(\Vert \widetilde{\mathbf{a}}\Vert _{\ell ^0_a}\) is the cardinality of the smallest anchored set \(\varLambda \subset \mathcal{J}\) that contains all the nonzero entries of \(\widetilde{\mathbf{a}}\). It is well known that such optimization problems with \(\ell ^0\)-type constraints have a combinatorial nature, and in turn numerical algorithms for computing their solutions do not generally have polynomial complexity. This justifies in practice the use of convex relaxation methods such as basis pursuit, or greedy selection strategies such as orthogonal matching pursuit. While our paper is mainly directed towards the convergence properties of the estimators \(w_n\) and \(\tilde{w}_n\), which are the exact solutions to the above problems, we discuss these numerical methods in the final section.

We now would like to compare the estimators \(w_n\) and \(\tilde{w}_n\) with the best n-term approximation in the aforementioned classes of multi-index sets \(\mathcal{M}_{n}^d\) and \(\mathcal{A}_n^{}\). The following lemma shows the role played by \(C_{n}^{d}\) and \(\widetilde{C}_{n}^{}\) in quantifying the relation between the error achieved by the optimal discrete least-squares projection and the error achieved by the optimal \(L^2\) projection.

Lemma 2

It holds that, for any \(\varLambda \in \mathcal{M}^d_n\),

$$\begin{aligned} \Vert u- w_{n} \Vert \le \Vert u- v \Vert + 2\sqrt{C_{2n-1}^{d} } \Vert u- v \Vert _m,\quad v\in \mathbb {P}_\varLambda , \end{aligned}$$
(15)

and for any \(\varLambda \in \mathcal{A}_n\),

$$\begin{aligned} \Vert u- \tilde{w}_{n} \Vert \le \Vert u- \tilde{v} \Vert + 2\sqrt{\widetilde{C}_{2n -1}^{} } \Vert u- \tilde{v} \Vert _m, \quad \tilde{v} \in \mathbb {P}_\varLambda . \end{aligned}$$
(16)

Proof

Let \(\varLambda \in \mathcal{M}^d_n\), and define \(\varLambda ^{\prime }:=\varLambda \cup \varLambda ^{\mathrm{opt}}_m\). We first observe that \(\varLambda ^{\prime }\) is also downward closed and \(\#(\varLambda ^{\prime })\le 2n-1\) because any downward closed set contains the null multi-index. Since \(w_n \in \mathbb {P}_{\varLambda ^{\mathrm{opt}}_m}\), we have \(v-w_n \in \mathbb {P}_{\varLambda ^{\prime }}\) and \(\Vert u-w_n \Vert _m \le \Vert u-v \Vert _m\) for any \(v \in \mathbb {P}_{\varLambda }\). It follows that

$$\begin{aligned} \Vert v- w_n \Vert&\le \sqrt{C_{ 2n -1 }^{d} } \Vert v - w_n \Vert _m \le \sqrt{C_{ 2n -1 }^{d} }\left( \Vert u- v \Vert _m + \Vert u - w_n \Vert _m\right) \\&\le 2\sqrt{C_{ 2n -1 }^{d} } \Vert u - v \Vert _m, \end{aligned}$$

and therefore

$$\begin{aligned} \Vert u- w_n \Vert \le \Vert u- v \Vert + \Vert v- w_n \Vert \le \Vert u- v \Vert + 2\sqrt{C_{ 2n -1 }^{d} } \Vert u- v \Vert _m, \end{aligned}$$

which is (15). The proof of (16) is analogous. \(\square \)

Note that the estimates in the above lemma imply in particular that

$$\begin{aligned} \Vert u- w_{n} \Vert \le \Vert u- u_n \Vert + 2\sqrt{C_{2n-1}^{d} } \Vert u- u_n \Vert _m, \end{aligned}$$
(17)

and

$$\begin{aligned} \Vert u- \tilde{w}_{n} \Vert \le \Vert u- \tilde{u}_n \Vert + 2\sqrt{\widetilde{C}_{2n -1}^{} } \Vert u- \tilde{u}_n \Vert _m, \end{aligned}$$

with \(u_n\) and \(\tilde{u}_n\) defined by (10) and (13). Note that they also imply

$$\begin{aligned} \Vert u- w_{n} \Vert \le \left( 1 + 2\sqrt{C_{2n-1}^{d} }\right) \Vert u- v \Vert _{L^\infty },\quad v\in \mathbb {P}_{\varLambda }, \end{aligned}$$
(18)

for any \(\varLambda \in \mathcal{M}_n^d\), and

$$\begin{aligned} \Vert u- \tilde{w}_{n} \Vert \le \left( 1 + 2\sqrt{\widetilde{C}_{2n -1}^{} }\right) \Vert u- \tilde{v} \Vert _{L^\infty }, \quad \tilde{v} \in \mathbb {P}_\varLambda , \end{aligned}$$
(19)

for any \(\varLambda \in \mathcal{A}_n\).

3.2 Probabilistic Bounds

In view of Lemma 2, we are interested in bounding the random variables \(C_{n}^{d}\) and \(\widetilde{C}_{n}^{}\). In this section, we give probabilistic bounds, which ensure that under certain conditions between m and n, these random variables do not exceed a fixed value, here set to 2, with high probability. In the whole section, we choose \(\delta =1/2\), so that, with the notation (5), one has

$$\begin{aligned} \zeta :=\zeta (\delta )=\zeta (1/2)=(1-\ln 2)/2 \approx 0.153. \end{aligned}$$

We define, for any \(\nu \in {\mathcal {F}}\), the “rectangular” set \(\mathcal{R}_\nu := \{ \mu \in {\mathcal {F}} : \mu \le \nu \}\), and for any \(n \ge 1\), the hyperbolic cross set

$$\begin{aligned} \mathcal{H}_{n}^{d} := \left\{ \mu \in {\mathcal {F}}: \prod _{j=1}^d (\mu _j+1) \le n \right\} . \end{aligned}$$

Note that

$$\begin{aligned} \mathcal{H}_{n}^{d} = \bigcup _{\#(\mathcal{R}_\nu ) \le n } \mathcal{R}_\nu . \end{aligned}$$

The cardinality of \(\mathcal{H}_{n}^{d} \) is bounded by

$$\begin{aligned} \#\left( \mathcal{H}_{n}^{d}\right) \le n (1+\ln (n ))^{d-1}, \end{aligned}$$
(20)

see [15, Appendix A.2] for a proof and some remarks on the accuracy of this upper bound. The relevance of the hyperbolic cross for our purposes is due to the following observation.
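A short recursive sketch (our own) generates \(\mathcal{H}_{n}^{d}\) and compares its cardinality with the bound (20):

```python
import math

def hyperbolic_cross(n, d):
    """The set H_n^d of multi-indices mu with prod_j (mu_j + 1) <= n."""
    if d == 1:
        return [(k,) for k in range(n)]            # condition (k+1) <= n
    H = []
    for k in range(n):                              # choose mu_1 = k, then
        for tail in hyperbolic_cross(n // (k + 1), d - 1):
            H.append((k,) + tail)                   # remaining product <= n/(k+1)
    return H

n, d = 8, 3
H = hyperbolic_cross(n, d)
print(len(H), n * (1 + math.log(n)) ** (d - 1))     # cardinality vs. bound (20)
```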

Lemma 3

The union of all downward closed sets of cardinality at most n in finite dimension d coincides with \(\mathcal{H}_{n}^{d}\); that is,

$$\begin{aligned} \bigcup _{ \varLambda \in \mathcal{M}_{n}^{d} }\varLambda = \mathcal{H}_{n}^{d}. \end{aligned}$$
(21)

Proof

On the one hand, all rectangles \(\mathcal{R}_\nu \) such that \(\#(\mathcal{R}_\nu ) \le n\) belong to \( \mathcal{M}_{n}^{d}\), so that inclusion holds from right to left. On the other hand, inclusion from left to right follows by observing that for any \(\varLambda \in \mathcal{M}_{n}^{d}\), one has \(\varLambda =\cup _{\mu \in \varLambda } \mathcal{R}_\mu \) and \(\mathcal{R}_\mu \subset \mathcal{H}_{n}^{d}\) for all \(\mu \in \varLambda \). \(\square \)

This leads us to a first probabilistic bound for the random variable \(C_n^d\). Indeed, using (21) we obtain that

$$\begin{aligned} \Pr \left( C_n^d> 2 \right)&= \Pr \left( \max _{\varLambda \in \mathcal{M}_{n}^{d}} \max _{v\in \mathbb {P}_\varLambda } \frac{\Vert v\Vert ^2}{\Vert v\Vert _m^2}> 2 \right) \le \Pr \left( \max _{ v\in \mathbb {P}_{\mathcal{H}_{n}^{d}}} \frac{\Vert v\Vert ^2}{\Vert v\Vert _m^2} >2 \right) . \end{aligned}$$

Thus, using Theorem 1 with \(\delta =1/2\) combined with the estimates in Lemma 1, we obtain that, in any dimension d and for any \(r>0\), if m and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge {\left\{ \begin{array}{ll} \frac{(1+r)}{\zeta } \left( \#\left( \mathcal{H}_{n}^{d}\right) \right) ^{\ln 3 / \ln 2} &{}\text { with Chebyshev polynomials of the first kind},\\ \frac{(1+r)}{\zeta } \left( \#\left( \mathcal{H}_{n}^{d}\right) \right) ^{2\max \{\theta _1,\theta _2 \} +2 } &{} \text { with Jacobi polynomials and } \theta _1,\theta _2 \in \mathbb {N}_0, \end{array}\right. } \end{aligned}$$
(22)

then

$$\begin{aligned} \Pr \left( C_{n}^{d}>2\right) \le 2m^{ - r }. \end{aligned}$$

From (21) and (12), we also find that the union of all anchored sets of cardinality at most n satisfies the following inclusion:

$$\begin{aligned} \bigcup _{\varLambda \in \mathcal{A}_n} \varLambda \subset \mathcal{H}_n^{n-1}. \end{aligned}$$

By similar arguments, we obtain the following probabilistic bound for the random variable \(\widetilde{C}_{n}^{}\): in any dimension d, for any \(r>0\), if m and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge {\left\{ \begin{array}{ll} \frac{(1+r)}{\zeta } \left( \#\left( \mathcal{H}_{n}^{n-1 }\right) \right) ^{\ln 3 / \ln 2} &{} \text { with Chebyshev polynomials of the first kind},\\ \frac{(1+r)}{\zeta } \left( \#\left( \mathcal{H}_{n}^{n-1}\right) \right) ^{2\max \{\theta _1,\theta _2 \} +2 } &{} \text { with Jacobi polynomials and } \theta _1,\theta _2 \in \mathbb {N}_0, \end{array}\right. } \end{aligned}$$
(23)

then

$$\begin{aligned} \Pr \left( \widetilde{C}_{n }^{}>2\right) \le 2m^{ - r }. \end{aligned}$$

The above results describe the regimes of m and n such that \(C_{n}^{d}\) and \(\widetilde{C}_{n}\) do not exceed 2 with high probability. In the case of downward closed sets, this regime is quite restrictive due to the presence of the factor \((1+\ln (n))^{d-1}\) in the upper bound (20) for the cardinality of \(\mathcal{H}_{n}^{d}\), which forces the sample size m to be extremely large as d grows. Likewise, m has to be extremely large compared to n in the case of anchored sets.

We next describe another strategy that yields similar probabilistic bounds under less restrictive regimes. It is based on estimating the cardinalities of \(\mathcal{M}_n^d\) and \(\mathcal{A}_n\) and using union bounds. Our way of estimating \(\#(\mathcal{M}_{n}^{d})\) and \(\#(\mathcal{A}_n)\) relies on strategies for encoding any downward closed or anchored set. A first such strategy leads to the following result.

Lemma 4

(First Cardinality Bound) One has the cardinality estimates

$$\begin{aligned} \#\left( \mathcal{M}_{n}^{d}\right) \le 2^{n d} \end{aligned}$$
(24)

and

$$\begin{aligned} \#\left( \mathcal{A}_{n}^{}\right) \le 2^{n(n-1)}. \end{aligned}$$
(25)

Proof

We encode any downward closed set \(\varLambda \subset {\mathcal {F}}\) of cardinality n in d dimensions, by associating d bits \(b_{\nu ,1},\ldots ,b_{\nu ,d}\) to each multi-index \(\nu \in \varLambda \), where the value of the jth bit is equal to one if \(\nu + e_j \in \varLambda \) and equal to zero if \(\nu + e_j \notin \varLambda \). We order these blocks of bits according to the lexicographic order \(\nu ^1,\ldots , \nu ^n\) of appearance of \(\nu \) in \(\varLambda \), where \(\nu ^1=(0,\ldots ,0)\). The resulting bitstream

$$\begin{aligned} B_\varLambda := \left( b_{\nu ^1,1},\ldots ,b_{\nu ^1,d},b_{\nu ^2,1},\ldots ,b_{\nu ^2,d},\ldots ,b_{\nu ^n,1},\ldots ,b_{\nu ^n,d}\right) \end{aligned}$$

uniquely encodes \(\varLambda \), that is, the encoding map \(\varLambda \mapsto B_\varLambda \) is injective. Indeed, assuming that \(\nu ^1,\ldots ,\nu ^k\) have been already identified, then the partial bitstream \((b_{\nu ^1,1},\ldots ,b_{\nu ^k,d})\) contains the information on the positions of all indices \(\nu \in \varLambda \) such that \(\nu \notin \{\nu ^1,\ldots ,\nu ^k\}\) and \(\nu -e_j\in \{\nu ^1,\ldots ,\nu ^k\}\) for some j. Therefore \(\nu ^{k+1}\) is uniquely determined as the index with smallest lexicographic order among such indices.

Since the length of \(B_\varLambda \) is nd, this encoding leads us to the upper bound (24). The same encoding can be applied to anchored sets of cardinality at most n with \(n-1\) bits for each index, leading to (25), which also directly follows from (24) and the fact that \(\#(\mathcal{A}_{n}) \le \#(\mathcal{M}_{n}^{n-1})\). \(\square \)
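The encoding map used in this proof is straightforward to implement, and its injectivity can be checked numerically by round-tripping a set through encoding and decoding. A minimal sketch (our own) follows the proof verbatim, recovering \(\nu ^{k+1}\) as the lexicographically smallest flagged candidate:

```python
def encode_bits(Lam, d):
    """Bitstream B_Lambda from the proof of Lemma 4: the elements of
    Lambda in lexicographic order, each contributing d bits indicating
    which successors nu + e_j also belong to Lambda."""
    S = set(Lam)
    bits = []
    for nu in sorted(Lam):                   # nu^1 = (0,...,0) comes first
        for j in range(d):
            succ = nu[:j] + (nu[j] + 1,) + nu[j + 1:]
            bits.append(1 if succ in S else 0)
    return bits

def decode_bits(bits, n, d):
    """Inverse map: reconstruct Lambda from its nd bits, recovering
    nu^{k+1} as the lexicographically smallest flagged candidate."""
    order = [(0,) * d]
    while len(order) < n:
        flagged = set()
        for k, nu in enumerate(order):
            for j in range(d):
                if bits[k * d + j]:
                    succ = nu[:j] + (nu[j] + 1,) + nu[j + 1:]
                    if succ not in order:
                        flagged.add(succ)
        order.append(min(flagged))           # smallest in lexicographic order
    return set(order)

Lam = {(0, 0), (1, 0), (0, 1), (2, 0)}
assert decode_bits(encode_bits(Lam, 2), len(Lam), 2) == Lam
```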

Remark 2

Note that the cardinality estimate for the anchored sets is independent of the dimension d. In particular, this allows us to derive some further results in the infinite-dimensional framework, when using anchored sets.

Recalling the definition of \(K(\mathbb {P}_\varLambda )\) from (4), we introduce the following notation:

$$\begin{aligned} K_n := \max _{\varLambda \in \mathcal{M}_{n}^{d} } K(\mathbb {P}_\varLambda ), \qquad \widetilde{K}_n := \max _{\varLambda \in \mathcal{A}_{n} } K(\mathbb {P}_\varLambda ). \end{aligned}$$

Note that, according to Lemma 1, one has the estimate

$$\begin{aligned} \widetilde{K}_n\le K_n\le {\left\{ \begin{array}{ll} n^{\ln 3 / \ln 2} &{}\quad \text {if}\,\,\theta _1=\theta _2=-1/2, \\ n^{2\max \{\theta _1,\theta _2\}+2} &{}\quad \text {if}\,\,\theta _1,\theta _2\in \mathbb {N}_0. \end{array}\right. } \end{aligned}$$
(26)

We are now in a position to establish probabilistic bounds for the random variables \(C^d_n\) and \(\widetilde{C}_n\).

Lemma 5

In any finite dimension d, for any \(r>0\), if m and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge \left( 1+r+\dfrac{ n d \ln 2}{\ln m} \right) \dfrac{K_n }{\zeta }, \end{aligned}$$
(27)

then

$$\begin{aligned} \Pr \left( C_{n}^{d} >2 \right) \le 2 n m^{-(r+1)} \le 2 m^{-r}. \end{aligned}$$

In any finite dimension d or in infinite dimension, for any \(r>0\), if m and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge \left( 1+r+\dfrac{ n^2 \ln 2}{\ln m} \right) \dfrac{ \widetilde{K}_n}{\zeta }, \end{aligned}$$
(28)

then

$$\begin{aligned} \Pr \left( \widetilde{C}_{n} >2 \right) \le 2 n m^{-(r+1)}\le 2 m^{-r}. \end{aligned}$$

Proof

Recalling (6) and using a union bound, we obtain

$$\begin{aligned} \Pr \left( C_{n}^{d}>2 \right)&= \Pr \left( \max _{\varLambda \in \mathcal{M}_{n}^{d}} \max _{v\in \mathbb {P}_\varLambda } \frac{\Vert v\Vert ^2}{\Vert v\Vert _m^2}> 2 \right) \nonumber \\&\le \sum _{\varLambda \in \mathcal{M}_n^d} \Pr \left( \max _{ v\in \mathbb {P}_{ \varLambda }} \frac{\Vert v\Vert ^2}{\Vert v\Vert _m^2} >2 \right) \nonumber \\&\le 2^{n d} 2 n \exp \left\{ - \zeta m / K_n \right\} \nonumber \\&= 2 n \exp \left\{ - \zeta m / K_n + n d \ln (2) \right\} , \end{aligned}$$

where we have used the cardinality estimate (24). The final bound is smaller than \(2m^{-r}\) under condition (27). The proof for \(\widetilde{C}_n\) is completely similar, using the cardinality estimate (25). \(\square \)

Remark 3

Combining with (26), we find that (27) holds if r, d, m, and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge {\left\{ \begin{array}{ll} \left( 1+r+\dfrac{ n d \ln 2}{\ln m} \right) \dfrac{n^{\ln 3 / \ln 2} }{\zeta }, &{} \text {with Chebyshev polynomials of the first kind},\\ \left( 1+r+\dfrac{ n d \ln 2 }{\ln m} \right) \dfrac{ n^{2\max \{\theta _1,\theta _2 \} +2} }{\zeta } , &{} \text {with Jacobi polynomials and }\,\theta _1,\theta _2 \in \mathbb {N}_0. \end{array}\right. } \end{aligned}$$
(29)

Similarly, we find that (28) holds if r, m, and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge {\left\{ \begin{array}{ll} \left( 1+r+\dfrac{ n^2 \ln 2}{\ln m} \right) \dfrac{ n^{\ln 3 / \ln 2}}{\zeta }, &{}\text {with Chebyshev polynomials of the first kind},\\ \left( 1+r+\dfrac{ n^2 \ln 2 }{\ln m} \right) \dfrac{ n^{2\max \{\theta _1,\theta _2 \} +2} }{\zeta }, &{}\text {with Jacobi polynomials and }\,\theta _1,\theta _2 \in \mathbb {N}_0. \end{array}\right. } \end{aligned}$$
(30)

We next give an improved result on the cardinality of \(\mathcal{M}_n^d\) and \(\mathcal{A}_n\) based on another encoding strategy.

Lemma 6

(Second Cardinality Bound) One has the cardinality estimates

$$\begin{aligned} \#\left( \mathcal{M}_{n}^{d}\right) \le d^{n -1} (n-1)! \end{aligned}$$
(31)

and

$$\begin{aligned} \#(\mathcal{A}_{n}^{}) \le \left( \left( n-1 \right) ! \right) ^2. \end{aligned}$$
(32)

Proof

Given any downward closed multi-index set \(\varLambda \) with \(\#(\varLambda )=n\), we order the elements of \(\varLambda \) in such a way that the set

$$\begin{aligned} \varLambda ^k:= \left\{ \nu ^1,\ldots ,\nu ^{k}\right\} , \end{aligned}$$

obtained by retaining only the first k elements of \(\varLambda \), is downward closed for any \(k=1,\ldots ,n\). Such an ordering always exists, and in general it is not unique. Notice that it always holds that \(\nu ^1=(0,\ldots ,0)\). One way to impose a unique ordering is by taking for \(\nu ^k\) the smallest index \(\nu \) in lexicographic order among those \(\nu \in \varLambda {\setminus }\{\nu ^1,\ldots ,\nu ^{k-1}\}\) such that \(\{ \nu ^1,\ldots ,\nu ^{k}\}\) is downward closed. Each \(\nu ^{k}\) can be uniquely characterized by choosing some \(l_k\in \{1,\ldots ,k-1\}\) and \(j_k\in \{1,\ldots ,d\}\) such that

$$\begin{aligned} \nu ^{k}=\nu ^{l_k} + e_{j_k}. \end{aligned}$$

Again this choice can be made unique by asking that \(j_k\) is the smallest number with such a property. Therefore, the resulting map

$$\begin{aligned} \varLambda \mapsto \left( j_2,l_3,j_3,\ldots ,l_n,j_n\right) \end{aligned}$$

is well defined and injective. Hence

$$\begin{aligned} \#\left( \mathcal{M}_{n}^{d}\right) \le d \cdot (2d) \cdots ((n-1)d), \end{aligned}$$

which is (31). We obtain (32) from the inequality \(\#(\mathcal{A}_n)\le \#(\mathcal{M}_{n}^{n-1})\). \(\square \)

The above results lead to improved probabilistic bounds for the random variables \(C^d_n\) and \(\widetilde{C}_n\).

Lemma 7

In any finite dimension d, for any \(r>0\), if m and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge \left( 1+r+ \dfrac{n \ln (dn)}{\ln m} \right) \dfrac{K_n }{\zeta }, \end{aligned}$$
(33)

then

$$\begin{aligned} \Pr \left( C_{n}^{d} >2 \right) \le 2 n m^{-(r+1)} \le 2 m^{-r}. \end{aligned}$$

In any finite dimension d or in infinite dimension, for any \(r>0\), if m and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge \left( 1+r+\dfrac{ 2n \ln n}{\ln m} \right) \dfrac{ \widetilde{K}_n}{\zeta }, \end{aligned}$$
(34)

then

$$\begin{aligned} \Pr \left( \widetilde{C}_{n} >2 \right) \le 2 n m^{-(r+1)}\le 2 m^{-r}. \end{aligned}$$

Proof

Recalling (6) and using the inequality \(n ! \le e \sqrt{n} (n/e)^{n}\), which holds for any \(n\ge 1\), we obtain by means of a union bound that, for any \(n\ge 2\),

$$\begin{aligned} \Pr \left( C_{n}^{d} >2 \right)&\le d^{n -1} (n-1)! \ (2 n) \ \exp \left\{ - \zeta m / K_n \right\} \\&\le d^{n -1} e \sqrt{n-1} \left( \dfrac{n-1}{e}\right) ^{n-1} \ (2 n) \ \exp \left\{ - \zeta m / K_n \right\} \\&= 2 n \exp \left\{ - \zeta m / K_n + (n-1/2) \ln \left( \dfrac{ d (n-1) }{e} \right) - \dfrac{1}{2} \ln \left( \dfrac{d}{e} \right) + 1 \right\} \\&\le 2 n \exp \left\{ - \zeta m / K_n + n \ln (dn) \right\} , \end{aligned}$$

where we have used the cardinality estimate (31). The final bound is smaller than \(2m^{-r}\) under condition (33). Trivially the final bound holds true also when \(n=1\). The proof for \(\widetilde{C}_n\) is completely similar, using the cardinality estimate (32). \(\square \)

Remark 4

Combining with (26), we find that (33) holds if r, d, m, and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge {\left\{ \begin{array}{ll} \left( 1+r+\dfrac{ n \ln (dn)}{\ln m} \right) \dfrac{n^{\ln 3 / \ln 2} }{\zeta }, &{}\text {with Chebyshev polynomials of the first kind},\\ \left( 1+r+\dfrac{ n \ln (dn) }{\ln m} \right) \dfrac{ n^{2\max \{\theta _1,\theta _2 \} +2} }{\zeta } , &{}\text {with Jacobi polynomials and }\,\theta _1,\theta _2 \in \mathbb {N}_0. \end{array}\right. } \end{aligned}$$
(35)

Similarly, we find that (34) holds if r, m, and n satisfy

$$\begin{aligned} \frac{m}{\ln m} \ge {\left\{ \begin{array}{ll} \left( 1+r+\dfrac{ 2n \ln n}{\ln m} \right) \dfrac{ n^{\ln 3 / \ln 2}}{\zeta }, &{}\text {with Chebyshev polynomials of the first kind},\\ \left( 1+r+\dfrac{ 2n\ln n }{\ln m} \right) \dfrac{ n^{2\max \{\theta _1,\theta _2 \} +2} }{\zeta }, &{}\text {with Jacobi polynomials and }\,\theta _1,\theta _2 \in \mathbb {N}_0. \end{array}\right. } \end{aligned}$$
(36)

The regimes of m and n described by the above results are in principle less restrictive than those previously obtained using the cardinality of \(\mathcal{H}_{n}^{d}\) or \(\mathcal{H}_{n}^{n-1}\). Indeed, for most regimes of n and d, the cardinalities \(\#(\mathcal{H}_{n}^{d})\) or \(\#(\mathcal{H}_{n}^{n-1})\) are larger than \(n\ln (dn)\) and \(n\ln (n)\), respectively. We may summarize the probabilistic bounds established in this section as follows: for any \(r>0\) and any \(n\ge 1\), one has \(C_n^d\le 2\) with probability larger than \(1-2m^{-r}\) provided that (22) or (29) or (35) holds. Likewise, one has \(\widetilde{C}_n\le 2\) with probability larger than \(1-2m^{-r}\) provided that (23) or (30) or (36) holds.
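For concreteness, the minimal sample size prescribed by conditions (29)/(30) and (35)/(36) is easily computed numerically, since each condition has the form \(m\ge ((1+r)\ln m + c)K_n/\zeta \) for an explicit constant c and is amenable to fixed-point iteration. The following sketch (our own, specialized to the Jacobi case \(\theta _1=\theta _2=\theta \in \mathbb {N}_0\), so that \(K_n\) is bounded by \(n^{2\theta +2}\)) illustrates how much the second cardinality bound improves on the first for large d:

```python
import math

ZETA = (1 - math.log(2)) / 2                 # zeta(1/2), approx 0.153

def minimal_m(n, d, r, theta=0, anchored=False, second_bound=True):
    """Smallest sample size m satisfying condition (29)/(30) (first
    cardinality bound) or (35)/(36) (second bound), for the Jacobi case
    theta1 = theta2 = theta, found by fixed-point iteration on
    m = ((1+r) ln m + c) K_n / zeta."""
    K = n ** (2 * theta + 2)                 # upper bound (26) on K_n
    if anchored:
        c = 2 * n * math.log(n) if second_bound else n * n * math.log(2)
    else:
        c = n * math.log(d * n) if second_bound else n * d * math.log(2)
    m = 10.0
    for _ in range(100):                      # converges rapidly
        m = ((1 + r) * math.log(m) + c) * K / ZETA
    return math.ceil(m)

# The second encoding pays only n ln(dn) instead of nd ln 2 extra samples:
print(minimal_m(n=5, d=100, r=1, second_bound=False))   # condition (29)
print(minimal_m(n=5, d=100, r=1, second_bound=True))    # condition (35)
```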

Remark 5

The encoding strategies that are used for proving Lemmas 4 and 6 are both redundant, leading to an overestimation of \(\#(\mathcal{M}_n^d)\) and \(\#(\mathcal{A}_n)\). We are not aware of estimates for these cardinalities that are provably sharp up to multiplicative constants independent of n and d. However, we can establish lower bounds that show that for certain particular regimes of n and d, these cardinalities grow exponentially fast. One simple instance of a lower bound for \(\#(\mathcal{M}_n^d)\) in the regime where \(n-1\le d\) is obtained as follows: we note that \(\mathcal{M}_n^d\) includes in particular all sets of the form \(\{(0,\ldots ,0)\}\cup \{e_j:j\in S\}\) for \(S\subset \{1,\ldots ,d\}\) such that \(\#(S)=n-1\). It follows that

$$\begin{aligned} \#\left( \mathcal{M}_n^d\right) \ge \binom{d}{n-1}. \end{aligned}$$

In the regime where \(n=d/2\) (for even d), we thus find that \(\log _2(\#(\mathcal{M}_n^d))\) grows at least as fast as d.

Remark 6

It is interesting to compare the probabilistic bounds obtained in this section with the restricted isometry properties (RIP) recently obtained in [5], which are a common vehicle in the analysis of compressed sensing schemes. Recalling the design matrix \(\mathbf D\) introduced in (14), and defining its renormalized version \(\varPhi :=m^{-1/2} \mathbf D\), we indeed see that the property \(C_n^d \le 2\) can be rephrased as

$$\begin{aligned} \frac{1}{2} \Vert \mathbf{a}\Vert ^2 \le \Vert \varPhi \mathbf{a}\Vert ^2, \quad \Vert \mathbf{a}\Vert _{\ell ^0_d}\le n, \end{aligned}$$

that is, for all \(\mathbf{a}\) whose support is contained in a downward closed set of cardinality at most n. In [5], it is shown that the RIP property

$$\begin{aligned} (1-\delta ) \Vert \mathbf{a}\Vert ^2 \le \Vert \varPhi \mathbf{a}\Vert ^2\le (1+\delta ) \Vert \mathbf{a}\Vert ^2, \quad \Vert \mathbf{a}\Vert _{\ell ^0_d}\le n, \end{aligned}$$
(37)

holds with probability at least \(1-\gamma \) if

$$\begin{aligned} m\ge 2^6 e\frac{K_n}{\tilde{\delta }^2}\ln \left( \frac{K_n}{\tilde{\delta }^2}\right)&\max \left\{ \frac{2^5}{\tilde{\delta }^4}\ln \left( 40\frac{K_n}{\tilde{\delta }^2}\ln \left( \frac{K_n}{\tilde{\delta }^2}\right) \right) \ln (4N),\right. \nonumber \\&\quad \left. \frac{1}{\tilde{\delta }}\ln \left( \frac{1}{\gamma \tilde{\delta }}\ln \left( \frac{K_n}{\tilde{\delta }^2}\right) \right) \right\} ,\quad \tilde{\delta }:=\frac{\delta }{13}, \end{aligned}$$
(38)

where \(N=\#(\mathcal{J})\) with \(\mathcal{J}\) the union of all downward closed sets of cardinality at most n, that is, \(\mathcal{J}=\mathcal{H}_n^d\). Note that in our analysis, we are only interested in establishing the lower inequality in the RIP, with the particular constant \(\delta =\frac{1}{2}\). However, since we have started from the two-sided estimate in (6), one easily checks that the same analysis leads to the validity of the RIP property (37) with probability at least \(1-\gamma \), under the regime

$$\begin{aligned} \frac{m}{\ln m} \ge \left( 1+ \frac{\ln (2/\gamma )}{\ln m}+ \dfrac{n \ln (dn)}{\ln m} \right) \dfrac{K_n}{\zeta (\delta )}, \end{aligned}$$
(39)

where \(\zeta (\delta )\) is again given by (5). From an asymptotic point of view, it can be checked that the regime (38) is more favorable than (39): indeed, n appears on the right side of (38) only through \(K_n\) up to logarithmic factors, while it appears through \(nK_n\) in the right side of (39). However, the multiplicative constant on the right side of (38) is prohibitively large: for example, in the case \(\delta =\frac{1}{2}\) that is considered in the present paper, we find that \(\tilde{\delta }=\frac{1}{26}\) and therefore \(2^6 e\frac{1}{\tilde{\delta }^2}\frac{2^5}{\tilde{\delta }^4} > 10^{12}\). This shows that the regime described by (39) is more favorable for moderate values of n. Similar remarks hold when considering anchored sets rather than downward closed sets.

3.3 Accuracy of the Optimized Discrete Least-Squares Approximation

We are now in a position to state our main results concerning the accuracy of the discrete least-squares approximations \(w_n\) and \(\tilde{w}_n\) over the optimized index sets \(\varLambda ^{\mathrm{opt}}_m\) and \(\tilde{\varLambda }^{\mathrm{opt}}_m\). These results show that the accuracy compares favorably with the best approximation error of the function u, measured either in \(L^\infty \) or \(L^2\), using optimal choices of downward closed or anchored sets (which might differ from the sets \(\varLambda ^{\mathrm{opt}}_m\) and \(\tilde{\varLambda }^{\mathrm{opt}}_m\)). We begin with a result expressed in probability.

Theorem 2

Consider a function u defined on \(\varGamma \) and let \(r>0\). In any finite dimension, under condition (22) or (29) or (35), with n replaced by \(2n-1\), it holds that

$$\begin{aligned} \Pr \left( \Vert u - w_n \Vert \le (1+2\sqrt{2}) \min _{\varLambda \in \mathcal{M}^d_n}\min _{v \in \mathbb {P}_{\varLambda } } \Vert u - v \Vert _{L^\infty (\varGamma )} \right) \ge 1 - 2 m^{-r}. \end{aligned}$$

In any finite or infinite dimension, under condition (23) or (30) or (36), with n replaced by \(2n-1\), it holds that

$$\begin{aligned} \Pr \left( \Vert u - \tilde{w}_n \Vert \le (1+2\sqrt{2}) \min _{\varLambda \in \mathcal{A}_n}\min _{v \in \mathbb {P}_{\varLambda } }\Vert u - v \Vert _{L^\infty (\varGamma )} \right) \ge 1 - 2 m^{-r}. \end{aligned}$$

Proof

These estimates immediately follow from (18) and (19) combined with the probabilistic bounds from the previous section. \(\square \)

We next give a result expressed in expectation for the truncated discrete least-squares projections \(T_\tau (w_n)\) and \(T_\tau (\tilde{w}_n)\).

Theorem 3

Consider a function u defined on \(\varGamma \), such that \(|u(y)|\le \tau \) for any \(y\in \Gamma \), and let \(r>0\). In any finite dimension, under condition (22) or (29) or (35), with n replaced by \(2n-1\), it holds that

$$\begin{aligned}&\mathbb {E}( \Vert u- T_\tau (w_n) \Vert ^2) \le (9 + 4\sqrt{2}) \Vert u - u_n \Vert ^2 + 8 \tau ^2 m^{ - r }. \end{aligned}$$
(40)

In any finite or infinite dimension, under condition (23) or (30) or (36), with n replaced by \(2n-1\), it holds that

$$\begin{aligned}&\mathbb {E}\left( \Vert u- T_\tau (\tilde{w}_n) \Vert ^2\right) \le \left( 9 + 4\sqrt{2}\right) \Vert u - \tilde{u}_n \Vert ^2 + 8 \tau ^2 m^{ - r }. \end{aligned}$$
(41)

Proof

For (40), we distinguish between the two complementary events \(\Omega _1:=\{C_{2n-1}^d\le 2\}\) and \(\Omega _2:=\{C_{2n-1}^d> 2\}\) and write

$$\begin{aligned} \mathbb {E}\left( \Vert u- T_\tau (w_n) \Vert ^2\right)= & {} \mathbb {E}\left( \Vert u- T_\tau (w_n) \Vert ^2| \Omega _1\right) \Pr (\Omega _1)\\&+\mathbb {E}\left( \Vert u- T_\tau (w_n) \Vert ^2| \Omega _2\right) \Pr (\Omega _2)=:E_1+E_2. \end{aligned}$$

Since \(|u-T_\tau (w_n)|\le 2\tau \) and \(\Pr (\Omega _2)\le 2m^{-r}\), the second term \(E_2\) is bounded by \(8 \tau ^2 m^{ - r }\). For the first term \(E_1\), we combine (17) and the fact that \(|u-T_\tau (w_n)|\le |u-w_n|\) to obtain the bound

$$\begin{aligned} E_1&\le \mathbb {E}\left( \left( \Vert u- u_n \Vert + 2\sqrt{2} \Vert u- u_n \Vert _m\right) ^2\right) \\&= \Vert u- u_n \Vert ^2 + 4 \sqrt{2} \Vert u- u_n \Vert \mathbb {E}\left( \Vert u- u_n \Vert _m \right) + 8 \mathbb {E}\left( \Vert u- u_n \Vert _m^2 \right) \\&\le \left( 9 + 4\sqrt{2}\right) \Vert u- u_n \Vert ^2. \end{aligned}$$

The proof of (41) is analogous. \(\square \)

Remark 7

The constants \(1+2\sqrt{2}\) and \(9+4\sqrt{2}\) in the above theorems can be reduced if one further restricts the regime between m and n so that \(C_{2n-1}^d\) and \(\tilde{C}_{2n-1}\) are close to 1 with high probability.

4 Alternative Algorithms for Model Selection

The actual computation of the optimized discrete least-squares approximations \(w_n\) and \(\tilde{w}_n\) appearing in Theorems 2 and 3 would require an exhaustive search among all possible choices of downward closed or anchored sets of a given cardinality n, and this task might become computationally prohibitive when n and d are simultaneously large. Our results should therefore mainly be viewed as a benchmark in arbitrary dimension d for assessing the performance of fast selection algorithms.

We review hereafter other strategies, which could be used for the purpose of selecting a proper polynomial space, and relate them to the results from this paper.

  • Model selection by complexity regularization: the above discussed least-squares methods on optimized downward closed or anchored sets may be viewed as an instance of a model selection procedure. Model selection is a widely studied topic in statistical learning theory, in various settings such as classification, density estimation, denoising, regression, and inverse problems. In the regression framework, a typical approach consists of adding a complexity penalization term to the least-squares empirical risk to be minimized, see for example Chapter 12 in [14]. One general result for such estimators is provided by Theorem 12.1 of [14] established in the bounded regression framework. It gives an oracle bound which has the typical form of the minimum over all considered models of the sum of the approximation error and of a penalty term that is always larger than \(m^{-1}\), and that persists even as the noise level tends to zero. Therefore we cannot retrieve from these standard results for model selection the potentially fast convergence rates that can be derived from Theorem 3 above in the noise-free case. Let us note that, from a computational point of view, this approach has the same complexity as the one required to compute the estimators \(w_n\) or \(\tilde{w}_n\) in the present paper, since it is based on an exhaustive optimization over the index sets. It is therefore prohibitive for large values of d and n.

  • Convex relaxation of \(\ell ^0\) problems: as explained in Remark 1, the estimators \(w_n\) and \(\tilde{w}_n\) are solutions to nonconvex optimization problems of \(\ell ^0\) type up to the additional prescription of the downward closed or anchored structure of the index sets. Convex relaxation methods based on \(\ell ^1\) or weighted-\(\ell ^1\) minimization, such as basis pursuit or LASSO, have been intensively studied, in particular under RIP properties for the design matrix. In the context of multivariate approximation, they have been first studied in the trigonometric polynomial framework [8] and then in the algebraic polynomial framework [5, 25, 26]. The corresponding optimization algorithms are computationally much less intensive than the complete optimization of the index set that is needed to compute \(w_n\) or \(\tilde{w}_n\), yet their complexity is still polynomial in the cardinality of the set \(\mathcal{J}\) that indexes the columns of \(\mathbf D\). This set coincides with the hyperbolic cross \(\mathcal{H}^d_n\) in the downward closed case, leading to a computational cost that could still be prohibitive for simultaneously large values of n and d. Note that these methods do not necessarily generate downward closed or anchored sets. While one may use the compressed sensing theory based on RIP properties to analyze the accuracy of the resulting estimators, see in particular the recovery guarantee established in Theorem 4.7 of [5], it is not clear to us that they achieve the same optimality bounds as described by Theorems 2 and 3.

  • Greedy algorithms: one classical alternative to the convex relaxation methods described above is given by greedy algorithms such as orthogonal matching pursuit (OMP) and its variants, such as iterative hard or soft thresholding. Convergence bounds for the estimators produced by these algorithms have been established under RIP properties, see [4, 27] for OMP in a general framework and Theorem 4.9 of [5] for iterative hard thresholding in the context of multivariate polynomial approximation. Similar to convex relaxation methods, it is not clear that the estimators obtained by these approaches achieve the same optimality bounds as described by Theorems 2 and 3. From a computational point of view, the complexity of each step of OMP scales linearly in the cardinality of \(\mathcal{J}\), leading to a smaller computational cost than convex relaxation methods, yet one that could still be prohibitive for simultaneously large values of n and d. One natural way to reduce the complexity is to enforce the downward closed or anchored structure in the index selection: if \(\varLambda _k\) is the index set generated after k steps of OMP, we construct \(\varLambda _{k+1}=\varLambda _k\cup \{\nu \}\) by maximizing the inner product of the residual with the columns of \(\mathbf D\) corresponding only to the indices \(\nu \) such that the set \(\varLambda _k\cup \{\nu \}\) remains downward closed (or anchored). This restriction has the effect of significantly reducing the complexity. However it is not clear that any recovery guarantee can be established for such an algorithm; a sketch of this restricted variant is given after this list.

  • Relaxed minimization: let us observe that one could replace the set \(\varLambda ^{\mathrm{opt}}_{m}\) defined in (11) with a near-optimal set \(\varLambda ^{\mathrm{near}}_{m}\) in the sense that one has

    $$\begin{aligned} \min _{ v\in \mathbb {P}_{\varLambda ^{\mathrm{near}}_{m}}} \Vert u- v\Vert _m \le C \min _{\varLambda \in \mathcal{M}_{n}^{d} }\min _{ v\in \mathbb {P}_\varLambda } \Vert u- v\Vert _m \end{aligned}$$

    for some fixed constant \(C\ge 1\). Then, it is easily checked that similar convergence bounds can be established for the resulting estimators, up to changing the multiplicative constants. If one is only interested in establishing convergence rates, the optimality criterion can be even further relaxed by only asking that

    $$\begin{aligned} \min _{ v\in \mathbb {P}_{\varLambda ^{\mathrm{near}}_{m}}} \Vert u- v\Vert _m \le C \min _{\varLambda \in \mathcal{M}_{\xi n}^{d} }\min _{ v\in \mathbb {P}_\varLambda } \Vert u- v\Vert _m \end{aligned}$$

    for some fixed \(0<\xi \le 1\). However, designing a fast selection algorithm that would produce such near-optimal sets is currently an open problem. A similar objective has been accomplished in the setting of tree structured index sets, see [1].
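To conclude, we give a minimal sketch (our own, reusing the helpers of Sect. 2.1) of the greedy variant mentioned above: OMP restricted to downward closed extensions. No recovery guarantee is claimed for this algorithm; it is included only to make the restricted index selection concrete:

```python
import numpy as np

def omp_downward_closed(u, Y, n):
    """Greedy OMP sketch with the downward closed restriction described
    above: grow Lambda_k one multi-index at a time, always choosing the
    admissible nu maximizing the correlation of the column (J_nu(y^i))_i
    with the current residual."""
    m, d = Y.shape
    Lam = [(0,) * d]
    b = u(Y)
    for _ in range(n - 1):
        D = np.column_stack([J_nu(Y, nu) for nu in Lam])
        coef, *_ = np.linalg.lstsq(D, b, rcond=None)
        res = b - D @ coef                       # current residual
        # admissible nu: Lam + {nu} stays downward closed
        cand = set()
        for mu in Lam:
            for j in range(d):
                nu = mu[:j] + (mu[j] + 1,) + mu[j + 1:]
                if nu not in Lam and all(
                    nu[i] == 0 or
                    nu[:i] + (nu[i] - 1,) + nu[i + 1:] in Lam
                    for i in range(d)
                ):
                    cand.add(nu)
        # pick the candidate most correlated with the residual
        Lam.append(max(cand, key=lambda nu: abs(J_nu(Y, nu) @ res)))
    return Lam
```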