1 Introduction

In this paper we explore a connection between the approximation-theoretic notion of polynomial mesh (norming set) of a compact set K and the statistical theory of optimal polynomial regression designs on K. We begin by recalling some basic definitions and properties.

Let \(\mathbb {P}_n^d(K)\) denote the space of polynomials of degree not exceeding n restricted to a compact set \(K\subset \mathbb {R}^d\), and \(\Vert f\Vert _Y\) the sup-norm of a bounded function on the compact set Y. We recall that a polynomial mesh on K (with constant \(c>0\)) is a sequence of norming subsets \(X_n\subset K\) such that

$$\begin{aligned} \Vert p\Vert _K\le c\,\Vert p\Vert _{X_n},\quad \forall p\in \mathbb {P}_n^d(K), \end{aligned}$$
(1)

where \(\hbox {card}(X_n)\) grows algebraically in

$$\begin{aligned} N=N_n(K)=\hbox {dim}(\mathbb {P}_n^d(K)). \end{aligned}$$
(2)

Notice that necessarily \(\hbox {card}(X_n)\ge N\), since \(X_n\) is determining for \(\mathbb {P}_n^d(K)\) (i.e., polynomials vanishing there vanish everywhere on K). With a slight abuse of notation, below we shall call “polynomial mesh” both the entire sequence \(\{X_n\}\) and (more frequently) the single norming set \(X_n\).
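To make (1) concrete, here is a minimal numeric check on \(K=[-1,1]\) (a sketch in Python; the snippets in this paper are illustrative only, while the actual experiments of Section 3 were run in Matlab). The mesh is a Chebyshev–Lobatto grid with mn+1 points, whose norming constant \(c=1/\cos (\pi /(2m))\) follows from Lemma 1 below.

```python
import numpy as np

n, m = 10, 3                                    # degree and oversampling factor
X = np.cos(np.pi * np.arange(m*n + 1) / (m*n))  # Chebyshev-Lobatto mesh, mn+1 points
K = np.linspace(-1.0, 1.0, 100001)              # dense proxy for the whole set K

rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    coeffs = rng.standard_normal(n + 1)         # random polynomial of degree <= n
    p = np.polynomial.chebyshev.Chebyshev(coeffs)
    ratios.append(np.abs(p(K)).max() / np.abs(p(X)).max())

# every ratio should stay below c = 1/cos(pi/(2m)), cf. Lemma 1 below
print(max(ratios), 1.0 / np.cos(np.pi / (2*m)))
```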

Observe also that \(N=\mathcal {O}(n^\beta )\) with \(\beta \le d\), in particular \(N=\binom{n+d}{d}\sim n^d/d!\) on polynomial determining compact sets (i.e., polynomials vanishing there vanish everywhere in \(\mathbb {R}^d\)), but we can have \(\beta <d\) for example on compact algebraic varieties, like the sphere in \(\mathbb {R}^d\) where \(N=\binom{n+d}{d}-\binom{n-2+d}{d}\).

The notion of polynomial mesh, though present in the literature for specific instances, was introduced in a systematic way in the seminal paper (Calvi and Levenberg 2008), and since then has attracted increasing interest, also in the computational framework. Among their properties, we recall that polynomial meshes are invariant under affine transformations, can be extended by algebraic transformations, finite unions and products, and are stable under small perturbations. Finite union, for example, is a powerful constructive tool: it is easily checked that if \(X_n^{(i)}\) are polynomial meshes for \(K_i\) with constants \(c_i\), \(1\le i\le s\), then

$$\begin{aligned} \Vert p\Vert _{\cup K_i}\le \max \{c_i\}\,\Vert p\Vert _{\cup X_n^{(i)}},\quad \forall p\in \mathbb {P}_n^d(\cup K_i). \end{aligned}$$
(3)

Polynomial meshes give good discrete models of a compact set for polynomial approximation: for example, it is easily seen that the uniform norm of the unweighted Least Squares operator on a polynomial mesh, say \(L_n:\,C(K)\rightarrow \mathbb {P}_n^d(K)\), is bounded as

$$\begin{aligned} \Vert L_n\Vert =\sup _{f\ne 0}{\frac{\Vert L_n f\Vert _K}{\Vert f\Vert _K}}\le c\,\sqrt{\hbox {card}(X_n)}. \end{aligned}$$
(4)

Moreover, polynomial meshes contain extremal subsets of Fekete and Leja type for polynomial interpolation, and have been applied in polynomial optimization and in pluripotential numerics; cf., e.g., Bos et al. (2011), Piazzon (2019) and Piazzon and Vianello (2019).

The class of compact sets which admit (constructively) a polynomial mesh is very wide. For example, it has been proved in Calvi and Levenberg (2008) that a polynomial mesh with cardinality \(\mathcal {O}(N^r)=\mathcal {O}(n^{rd})\) can always be constructed simply by intersection with a sufficiently dense uniform covering grid, on compact sets satisfying a Markov polynomial inequality with exponent r (in particular, on compact bodies with Lipschitz boundary, in which case \(r=2\)).

From the computational point of view, it is however important to deal with low cardinality polynomial meshes. Indeed, polynomial meshes with \(\hbox {card}(X_n)=\mathcal {O}(N)\), that are said to be optimal (in the sense of cardinality growth), have been constructed on compact sets with different geometric structure, such as polygons and polyhedra, convex, starlike and even more general bodies with smooth boundary, sections of a sphere, ball and torus; cf., e.g., Kroó (2011), Piazzon (2016) and Sommariva and Vianello (2018). By (4), on optimal meshes we have \(\Vert L_n\Vert =\mathcal {O}(\sqrt{\hbox {card}(X_n)})= \mathcal {O}(\sqrt{N})\); we stress, however, that even with an optimal mesh typically \(\hbox {card}(X_n)\gg N\).

The problem of reducing the sampling cardinality while maintaining estimate (4) (Least Squares compression) has been considered for example in Piazzon et al. (2017a, b), where a strategy is proposed, based on weighted Least Squares on \(N_{2n}\) Caratheodory-Tchakaloff points extracted from the mesh by Linear or Quadratic Programming. Nevertheless, reducing the Least Squares uniform operator norm, though much more costly, is also quite relevant, and this will be addressed in the next Section via the theory of optimal designs.

2 Near optimal designs by polynomial meshes

Let \(\mu \) be a probability measure supported on a compact set \(K\subset \mathbb {R}^d\). In statistics, \(\mu \) is usually called a design and K the design space. The literature on optimal designs dates back at least a century, and is so vast and ramified that we cannot even attempt any kind of survey. We may for example quote, among many others, a classical and a quite recent textbook (Atkinson and Donev 1992; Celant and Broniatowski 2017), and the new paper (De Castro et al. 2019), which bear witness to the vitality of the field. Below we simply recall some relevant notions and results, trying to follow an apparently unexplored connection to the theory of polynomial meshes in the framework of polynomial regression.

Assume that \(supp(\mu )\) is determining for \(\mathbb {P}^d(K)\) (the space of d-variate real polynomials restricted to K); for a fixed degree n, we could even assume that \(supp(\mu )\) is determining for \(\mathbb {P}_n^d(K)\). We recall a function that plays a key role in the theory of optimal designs, the diagonal of the reproducing kernel for \(\mu \) in \(\mathbb {P}_n^d(K)\) (often called Christoffel polynomial), namely

$$\begin{aligned} K_n^\mu (x,x)=\sum _{j=1}^{N}{p_j^2(x)}, \end{aligned}$$
(5)

where \(\{p_j\}\) is any \(\mu \)-orthonormal basis of \(\mathbb {P}_n^d(K)\), for example that obtained from the standard monomial basis by applying the Gram–Schmidt orthonormalization process [it can be shown that \(K_n^\mu (x,x)\) does not depend on the choice of the orthonormal basis, cf. (7) below]. It has the important property that

$$\begin{aligned} \Vert p\Vert _K\le \sqrt{\max _{x\in K}K_n^\mu (x,x)}\,\Vert p\Vert _{L^2_{\mu }(K)},\quad \forall p\in \mathbb {P}_n^d(K), \end{aligned}$$
(6)

and also the following relevant characterization

$$\begin{aligned} K_n^\mu (x,x)=\max _{p\in \mathbb {P}_n^d(K),\,p(x)=1} \frac{1}{\int _K{p^2\,d\mu }}. \end{aligned}$$
(7)

Now, by (5) \(\int _K{K_n^\mu (x,x)\,d\mu }=N\), which entails that \(\max _{x\in K}K_n^\mu (x,x)\ge N\). A probability measure \(\mu ^*=\mu ^*(K)\) is then called a G-optimal design for polynomial regression of degree n on K if

$$\begin{aligned} \min _\mu \max _{x\in K}K_n^{\mu }(x,x)=\max _{x\in K}K_n^{\mu ^*}(x,x)=N. \end{aligned}$$
(8)

Observe that, since \(\int _K{K_n^{\mu }(x,x)\,d\mu }=N\) for every \(\mu \), an optimal design has the following property

$$\begin{aligned} K_n^{\mu ^*}(x,x)=N\;\;\mu ^*\hbox {-}a.e.\quad \hbox {in}\;K. \end{aligned}$$
(9)
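As a small numeric illustration of (5)–(9) for a discrete measure (a Python sketch; nodes, weights and basis are illustrative choices): the integral of the Christoffel polynomial against \(\mu \) equals N, while its maximum exceeds N unless the design is G-optimal.

```python
import numpy as np

def christoffel(x_nodes, w, n, x_eval):
    """K_n^mu(x,x) of the discrete measure (x_nodes, w), via (5): a
    w-orthonormal basis is obtained from a QR factorization."""
    V = np.polynomial.chebyshev.chebvander(x_nodes, n)   # M x N Vandermonde
    _, R = np.linalg.qr(np.sqrt(w)[:, None] * V)         # D(w)^{1/2} V = QR
    P = np.polynomial.chebyshev.chebvander(x_eval, n) @ np.linalg.inv(R)
    return np.sum(P**2, axis=1)                          # sum_j p_j(x)^2

M, n = 200, 8
x = np.linspace(-1.0, 1.0, M)
w = np.full(M, 1.0 / M)              # uniform probability weights on x
Kn = christoffel(x, w, n, x)
print(np.dot(w, Kn))                 # = N = n+1, the integral of K_n against mu
print(Kn.max())                      # >= N, with equality mu-a.e. iff G-optimal
```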

As is well-known, by the celebrated General Equivalence Theorem of Kiefer and Wolfowitz (1960), the difficult min–max problem (8) is equivalent to the much simpler maximization

$$\begin{aligned} \max _\mu det(G_n^\mu ),\quad G_n^\mu =\left( \int _K{q_i(x)q_j(x)\,d\mu }\right) _{1\le i,j\le N}, \end{aligned}$$
(10)

where \(G_n^\mu \) is the Gram matrix of \(\mu \) in a fixed polynomial basis \(\{q_i\}\) (also called information matrix in statistics). Such optimality is called D-optimality, and it entails that an optimal measure exists, since the set of Gram matrices of probability measures is compact (and convex); see e.g. Atkinson and Donev (1992), Bloom et al. (2010) and Bos (1990) for a quite general proof of these results, valid for both continuous and discrete compact sets. An optimal measure is neither unique nor necessarily discrete (unless K is discrete itself), but an equivalent discrete optimal measure always exists by the Tchakaloff Theorem on positive quadratures of degree 2n for K; cf. Putinar (1997) for a general proof of the Tchakaloff Theorem. Moreover, the asymptotics of optimal designs as the degree n goes to \(\infty \) can be described using multivariate pluripotential theory, see Bloom et al. (2010, 2011).

G-optimality has two important interpretations in terms of probabilistic and deterministic polynomial regression. From a statistical point of view, a G-optimal measure minimizes the maximum prediction variance of n-th degree polynomial regression over K, cf. Atkinson and Donev (1992).

From the approximation theory point of view, calling \(L_n^{\mu ^*}\) the corresponding weighted Least Squares operator, by (6) we can write for every \(f\in C(K)\)

$$\begin{aligned} \Vert L_n^{\mu ^*} f\Vert _K&\le \sqrt{\max _{x\in K}K_n^{\mu ^*}(x,x)}\,\Vert L_n^{\mu ^*} f\Vert _{L^2_{\mu ^*}(K)} \le \sqrt{N}\,\Vert L_n^{\mu ^*} f\Vert _{L^2_{\mu ^*}(K)}\\&\le \sqrt{N}\,\Vert f\Vert _{L^2_{\mu ^*}(K)} \le \sqrt{N}\,\Vert f\Vert _K,\quad \hbox {i.e.}\;\;\Vert L_n^{\mu ^*}\Vert \le \sqrt{N}, \end{aligned}$$
(11)

which shows that a G-optimal measure minimizes (the estimate of) the weighted Least Squares uniform operator norm.

The computational literature on D-optimal designs is huge, with a variety of approaches and methods. A classical approach is the discretization of K followed by D-optimization over the discrete set; see e.g. the references in De Castro et al. (2019) (where however a different approach is proposed, based on a moment-sum-of-squares hierarchy of semidefinite programming problems). In the discretization framework, the possible role of polynomial meshes seems to have been overlooked. We summarize the corresponding simple but meaningful near G-optimality result in the following Proposition.

Proposition 1

Let \(K\subset \mathbb {R}^d\) be a compact set, admitting a polynomial mesh \(\{X_n\}\) with constant c.

Then for every \(n\in \mathbb {N}\) and \(m\in \mathbb {N}\), \(m\ge 1\), the probability measure

$$\begin{aligned} \nu =\nu (n,m)=\mu ^*(X_{2mn}) \end{aligned}$$
(12)

is a near G-optimal design on K, in the sense that

$$\begin{aligned} \max _{x\in K}{K_n^\nu (x,x)}\le c_m\,N,\quad c_m=c^{1/m}. \end{aligned}$$
(13)

Proof

First, observe that for every \(p\in \mathbb {P}_{2n}^d(K)\)

$$\begin{aligned} \Vert p^m\Vert _K=\Vert p\Vert ^m_K\le c\,\Vert p^m\Vert _{X_{2mn}}=c\,\Vert p\Vert ^m_{X_{2mn}}, \end{aligned}$$

and thus

$$\begin{aligned} \Vert p\Vert _K\le c^{1/m}\,\Vert p\Vert _{X_{2mn}}. \end{aligned}$$

Now, \(X_{2mn}\) is clearly \(\mathbb {P}_{n}^d(K)\)-determining; hence, denoting by \(\nu =\mu ^*(X_{2mn})\) an optimal measure for degree n on \(X_{2mn}\), which exists by the General Equivalence Theorem with \(supp(\nu )\subseteq X_{2mn}\), we get

$$\begin{aligned} \max _{x\in X_{2mn}}{K_n^\nu (x,x)}=N_n(X_{2mn})=N_n(K)=N. \end{aligned}$$

Since \(K_n^\nu (x,x)\) is a polynomial of degree 2n, we finally obtain

$$\begin{aligned} \max _{x\in K}{K_n^\nu (x,x)}\le c^{1/m}\,\max _{x\in X_{2mn}}{K_n^\nu (x,x)} \le c^{1/m}\,N. \end{aligned}$$

\(\square \)

Proposition 1 shows that polynomial meshes are good discretizations of a compact set for the purpose of computing a near G-optimal measure, and that the G-optimality condition (8) is approached at a rate proportional to 1/m, since \(c_m\sim 1+\log (c)/m\). In terms of the statistical notion of G-efficiency on K we have

$$\begin{aligned} \hbox {G}_{{\mathrm{eff}}}(\nu )=\frac{N}{\max _{x\in K}{K_n^\nu (x,x)}}\ge c^{-1/m}\sim 1-\log (c)/m. \end{aligned}$$
(14)

It is worth showing that a better rate proportional to \(1/m^2\) can be obtained on certain compact sets, where an (optimal) polynomial mesh can be constructed via the approximation theoretic notion of Dubiner distance.

We recall that the Dubiner distance on a compact set K, introduced in the seminal paper (Dubiner 1995), is defined as

$$\begin{aligned} dub_K(x,y)=\sup _{deg(p)\ge 1,\,\Vert p\Vert _K\le 1} \left\{ \frac{1}{deg(p)}\,\left| \arccos (p(x))-\arccos (p(y))\right| \right\} . \end{aligned}$$
(15)

Among its basic properties, we recall that it is invariant under invertible affine transformations, i.e., if \(\sigma (x)=Ax+b\), \(det(A)\ne 0\), then

$$\begin{aligned} dub_K(x,y)=dub_{\sigma (K)}(\sigma (x),\sigma (y)). \end{aligned}$$
(16)

The notion of Dubiner distance plays a deep role in multivariate polynomial approximation, cf. e.g. Bos et al. (2008) and Dubiner (1995). Unfortunately, such a distance is explicitly known only on intervals (the univariate case, where it is the arccos distance by the Van der Corput–Schaake inequality) and on the cube, simplex, sphere and ball in any dimension, cf. Bos et al. (2008) and Dubiner (1995). On the other hand, it can be estimated on some classes of compact sets, for example on smooth convex bodies via a tangential Markov inequality on the boundary, cf. Piazzon and Vianello (2019). Its connection with the theory of polynomial meshes is given by the following elementary but powerful lemma (Piazzon and Vianello 2019); for the reader’s convenience, we recall also the simple proof.

Lemma 1

Let \(Y_n=Y_n(\alpha )\), \(n\ge 1\), be a sequence of finite subsets of a compact set \(K\subset \mathbb {R}^d\), whose covering radius with respect to the Dubiner distance does not exceed \(\alpha /n\), where \(\alpha \in (0,\pi /2)\), i.e.

$$\begin{aligned} r(Y_n)=\max _{x\in K}\,dub_K(x,Y_n)=\max _{x\in K} \min _{y\in Y_n}\,dub_K(x,y)\le \frac{\alpha }{n}. \end{aligned}$$
(17)

Then, \(\{Y_n\}\) is a polynomial mesh on K with constant \(c=1/\cos (\alpha )\).

Proof

Given \(p\in \mathbb {P}_n^d(K)\), \(p\not \equiv 0\), possibly normalizing and/or multiplying p by \(-1\), we can assume that \(\Vert p\Vert _K=p(\hat{x})=1\) for a suitable \(\hat{x} \in K\). Since (17) holds for \(Y_n\), there exists \(\hat{y}\in Y_n\) such that

$$\begin{aligned} |\arccos {(p(\hat{x}))}-\arccos {(p(\hat{y}))}|= |\arccos {(p(\hat{y}))}|\le \frac{\alpha \,deg(p)}{n}\le \alpha <\frac{\pi }{2}. \end{aligned}$$

Now the \(\arccos \) function is monotonically decreasing and nonnegative, thus we have that \(p(\hat{y})\ge \cos (\alpha )>0\), and finally

$$\begin{aligned} \Vert p\Vert _K=1\le \frac{p(\hat{y})}{\cos \alpha }\le \frac{1}{\cos \alpha }\,\Vert p\Vert _{Y_n}. \end{aligned}$$

\(\square \)

By Lemma 1 we can now prove the following proposition on near G-optimality by polynomial meshes constructed via the Dubiner distance.

Proposition 2

Let \(K\subset \mathbb {R}^d\) be a compact set and \(\{Y_n(\alpha )\}\) be the polynomial mesh of Lemma 1.

Then for every \(n\in \mathbb {N}\) and \(m>1\), the probability measure

$$\begin{aligned} \nu =\nu (n,m)=\mu ^*(Y_{2n}(\pi /(2m))) \end{aligned}$$
(18)

is a near G-optimal design on K, in the sense that

$$\begin{aligned} \max _{x\in K}{K_n^\nu (x,x)}\le c_m\,N,\quad c_m=\frac{1}{\cos (\pi /(2m))}. \end{aligned}$$
(19)

The proof follows essentially the lines of that of Proposition 1, with \(Y_{2n}(\pi /(2m))\) replacing \(X_{2mn}\), observing that by Lemma 1 for every \(p\in \mathbb {P}_{2n}^d(K)\) we have \(\Vert p\Vert _K\le c_m\,\Vert p\Vert _{Y_{2n}(\pi /(2m))}\). We stress that in this case \(c_m\sim 1+\pi ^2/(8m^2)\), i.e. G-optimality is approached at a rate proportional to \(1/m^2\). In terms of G-efficiency we have in this case

$$\begin{aligned} \hbox {G}_{{\mathrm{eff}}}(\nu )\ge \cos (\pi /(2m))\sim 1-\pi ^2/(8m^2). \end{aligned}$$
(20)

We recall that optimal polynomial meshes like those in Proposition 2 have been recently constructed in the framework of polynomial optimization on some compact sets where the Dubiner distance is known or can be estimated, such as the cube, the sphere, convex bodies with smooth boundary; cf. Piazzon and Vianello (2018, 2019) and Vianello (2018a).

Similar results can be obtained for compact sets of the general form

$$\begin{aligned}&K=\sigma (I\times \Theta ),\quad \sigma =\left( \sigma _\ell (t,\theta )\right) _{1\le \ell \le d},\nonumber \\&t\in I=I_1\times \cdots \times I_{d_1},\quad \theta \in \Theta =\Theta _1\times \cdots \times \Theta _{d_2+d_3}, \end{aligned}$$
(21)
$$\begin{aligned}&\sigma _\ell \in \bigotimes _{i=1}^{d_1}\mathbb {P}_1(I_i) \otimes \bigotimes _{j=1}^{d_2+d_3}\mathbb {T}_1(\Theta _j),\quad 1\le \ell \le d, \end{aligned}$$
(22)

where \(d_1,d_2,d_3\ge 0\), and \(I_i=[a_i,b_i]\), \(1\le i\le d_1\) (algebraic variables), \(\Theta _j=[u_j,v_j]\) with \(v_j-u_j=2\pi \), \(1\le j\le d_2\) (periodic trigonometric variables) and \(v_j-u_j<2\pi \), \(d_2+1\le j\le d_2+d_3\) (subperiodic trigonometric variables). Here and below \(\mathbb {T}_n=span(1,\cos (\theta ),\sin (\theta ),\ldots ,\cos (n\theta ),\sin (n\theta ))\) denotes the space of univariate trigonometric polynomials of degree not exceeding n. Notice that the mapping \(\sigma \) can be non-injective.

The class (21)–(22) contains many common domains in applications, which have in some sense a tensorial structure. For example, in the 2-dimensional case convex quadrangles (with triangles as special degenerate cases) fall into this class, since they are bilinear transformations of a square (the triangle case corresponding to the so-called Duffy transform), with \(d_1=2\), \(d_2=d_3=0\). The same holds for the disk described in polar coordinates (\(d_1=d_2=1\), \(d_3=0\)), the 2-sphere in spherical coordinates (\(d_1=0\), \(d_2=d_3=1\)), and the torus in toroidal–poloidal coordinates (\(d_1=0\), \(d_2=2\), \(d_3=0\)); cf. Bos and Vianello (2012) and Sommariva and Vianello (2018).

Moreover, many examples of sections of disk, sphere, ball, surface and solid torus can be written as (21)–(22). For example, a circular sector of the unit disk with angle \(2\omega \), \(\omega <\pi \), can be described by such a \(\sigma \) with \(d_1=d_3=1\), \(d_2=0\), e.g.,

$$\begin{aligned} \sigma (t,\theta ) =(t\cos (\theta ),t\sin (\theta )),\quad (t,\theta )\in [0,1]\times [-\omega ,\omega ], \end{aligned}$$
(23)

(polar coordinates). Similarly, a circular segment with angle \(2\omega \) (one of the two portions of the disk cut out by a line) can be described by such a \(\sigma \) with \(d_1=d_3=1\), \(d_2=0\), e.g.,

$$\begin{aligned} \sigma (t,\theta )=(\cos (\theta ),t\sin (\theta )),\quad (t,\theta )\in [-1,1]\times [-\omega ,\omega ]. \end{aligned}$$
(24)

On the other hand, a toroidal rectangle is described with \(d_3=2\), \(d_1=d_2=0\), by the transformation

$$\begin{aligned} \sigma (\theta )=\left( (R+r\cos (\theta _1))\cos (\theta _2),(R+r\cos (\theta _1)) \sin (\theta _2),r\sin (\theta _1)\right) , \end{aligned}$$
(25)

\(\theta =(\theta _1,\theta _2)\in [\omega _1,\omega _2]\times [\omega _3,\omega _4]\), where R and r are the major and minor radii of the torus. In the degenerate case \(R=0\) we get a so-called geographic rectangle of a sphere of radius r, i.e. the region between two given latitudes and longitudes. For other planar, surface and solid examples we refer the reader to Sommariva and Vianello (2015b, 2018).
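For concreteness, the parametrizations (23)–(25) read as follows in code (a hedged illustration of the structure (21)–(22); the plain tensor grid at the end is for visualization only, since the meshes of Lemma 2 below require Chebyshev-like algebraic and subperiodic trigonometric grids):

```python
import numpy as np

def sector(t, th):                            # circular sector (23): d1 = d3 = 1
    return t * np.cos(th), t * np.sin(th)

def segment(t, th):                           # circular segment (24): d1 = d3 = 1
    return np.cos(th), t * np.sin(th)

def toroidal_rect(th1, th2, R=3.0, r=1.0):    # toroidal rectangle (25): d3 = 2
    return ((R + r * np.cos(th1)) * np.cos(th2),
            (R + r * np.cos(th1)) * np.sin(th2),
            r * np.sin(th1))

omega = np.pi / 3                             # half-angle of the sector
T, TH = np.meshgrid(np.linspace(0.0, 1.0, 20),
                    np.linspace(-omega, omega, 20))
X, Y = sector(T, TH)                          # mapped grid covering the sector
```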

By the geometric structure (21)–(22), we have that if \(p\in \mathbb {P}_n^d(K)\) then

$$\begin{aligned} p\circ \sigma \in \bigotimes _{i=1}^{d_1}\mathbb {P}_n(I_i) \otimes \bigotimes _{j=1}^{d_2+d_3}\mathbb {T}_n(\Theta _j), \end{aligned}$$
(26)

and this allows us to construct product-like polynomial meshes on such domains. Indeed, in the univariate case Chebyshev-like optimal polynomial meshes are known for algebraic polynomials and for trigonometric polynomials (even on subintervals of the period). This result is stated in the following

Lemma 2

Let \(K\subset \mathbb {R}^d\) be a compact set of the form (21)–(22). Then, for every fixed \(m>1\), K possesses a polynomial mesh \(\{Z_n(m)\}\) with constant \(c=(1/\cos (\pi /(2m)))^{d_1+d_2+d_3}\) and cardinality not exceeding \((mn)^{d_1+d_2}\,(2mn)^{d_3}\).

The proof is essentially that given in Sommariva and Vianello (2018) by resorting to algebraic-trigonometric Chebyshev-like grids mapped by \(\sigma \), with minor modifications to take into account the later results on subperiodic trigonometric Dubiner distance given in Vianello (2018b). If Chebyshev–Lobatto-like grids are used, mn and 2mn have to be substituted by \(mn+1\) and \(2mn+1\), respectively. If m is not an integer, all these quantities should be substituted by their ceiling (the least integer not smaller than).

By Lemma 2 we get immediately the following proposition.

Proposition 3

Let \(K\subset \mathbb {R}^d\) be a compact set of the form (21)–(22) and \(\{Z_{n}(m)\}\) the polynomial mesh of Lemma 2.

Then for every \(n\in \mathbb {N}\) and \(m>1\), the probability measure

$$\begin{aligned} \nu =\nu (n,m)=\mu ^*(Z_{2n}(m)) \end{aligned}$$
(27)

is a near G-optimal design on K, in the sense that

$$\begin{aligned} \max _{x\in K}{K_n^\nu (x,x)}\le c_m\,N,\quad c_m=\left( \frac{1}{\cos (\pi /(2m))}\right) ^{d_1+d_2+d_3}. \end{aligned}$$
(28)

Concerning G-efficiency we have now

$$\begin{aligned} \hbox {G}_{{\mathrm{eff}}}(\nu )\ge (\cos (\pi /(2m)))^{d_1+d_2+d_3}\sim 1-(d_1+d_2+d_3)\pi ^2/(8m^2). \end{aligned}$$
(29)

Remark 1

Observe that, by Propositions 1–3, reasoning as in (11) we get

$$\begin{aligned} \Vert L^\nu _n\Vert \le \sqrt{c_m\,N}, \end{aligned}$$
(30)

i.e. the discrete probability measure \(\nu \) nearly minimizes (the estimate of) the weighted Least Squares uniform operator norm.

3 Caratheodory-Tchakaloff design concentration

Propositions 1–3 and the General Equivalence Theorem suggest a standard way to compute near G-optimal designs. First, one constructs a polynomial mesh such as \(X_{2mn}\) or \(Y_{2n}(\pi /(2m))\) or \(Z_{2n}(m)\), then computes a D-optimal design for degree n on the mesh by one of the available algorithms. Observe that such designs will in general be approximate, that is, we compute a discrete probability measure \(\tilde{\nu }\approx \nu \) such that on the polynomial mesh

$$\begin{aligned} \max _{x\in mesh} K_n^{\tilde{\nu }}(x,x)\le \tilde{N}\approx N \end{aligned}$$
(31)

(with \(\tilde{N}\) not necessarily an integer); nevertheless, estimates (13), (19) and (30) still hold with \(\tilde{\nu }\) and \(\tilde{N}\) replacing \(\nu \) and N, respectively.

Again, we cannot even attempt to survey the vast literature on computational methods for D-optimal designs; we may quote, among others, the class of exchange algorithms and the class of multiplicative algorithms, cf. e.g. Celant and Broniatowski (2017), Mandal et al. (2015) and Torsney and Martin-Martin (2009) and the references therein.

Our computational strategy is, in brief, the following. We first approximate a D-optimal design for degree n on the polynomial mesh by a standard multiplicative algorithm, and then concentrate the measure via Caratheodory-Tchakaloff compression of degree 2n, keeping the Christoffel polynomial, and thus the G-efficiency, invariant. Such a compression is based on a suitable implementation of a discrete version of the well-known Tchakaloff Theorem, which in general asserts that any (probability) measure has a representing atomic measure with the same polynomial moments up to a given degree, with cardinality not exceeding the dimension of the corresponding polynomial space; cf. e.g. Putinar (1997), Piazzon et al. (2017a) and Sommariva and Vianello (2015a) and the references therein. In such a way we get near optimality with respect to both G-efficiency and support cardinality, since the latter will not exceed \(N_{2n}=\hbox {dim}(\mathbb {P}_{2n}^d(K))\).

To simplify the notation, in what follows we shall denote by \(X=\{x_i\}\) either the polynomial mesh \(X=X_{2mn}\) or \(X=Y_{2n}(\pi /(2m))\) or \(X=Z_{2n}(m)\) (cf. Propositions 1–3), by M its cardinality, by \(w=\{w_i\}\) the weights of a probability measure on X (\(w_i\ge 0\), \(\sum w_i=1\)), and by \(K_n^{w}(x,x)\) the corresponding Christoffel polynomial.

The first step is the application of the standard Titterington’s multiplicative algorithm (Titterington 1976) to compute a sequence \(w(\ell )\) of weight arrays

$$\begin{aligned} w_i(\ell +1)=w_i(\ell )\,\frac{K_n^{w(\ell )}(x_i,x_i)}{N},\quad 1\le i\le M,\quad \ell \ge 0, \end{aligned}$$
(32)

where we take \(w(0)=(1/M,\ldots ,1/M)\). Observe that the weights \(w_i(\ell +1)\) determine a probability measure on X, since they are clearly nonnegative and \(\sum _i{w_i(\ell )\,K_n^{w(\ell )}(x_i,x_i)}=N\). The sequence \(w(\ell )\) is known to converge, for any initial choice of probability weights, to the weights of a D-optimal design (with a nondecreasing sequence of Gram determinants), cf. e.g. Dette et al. (2008) and the references therein.

In order to implement (32), we need an efficient way to compute the right-hand side. Denote by \(V_n=(\phi _j(x_i))\in \mathbb {R}^{M\times N}\) the rectangular Vandermonde matrix at X in a fixed polynomial basis \((\phi _1,\ldots ,\phi _N)\), and by D(w) the diagonal matrix of a weight array w. In order to avoid severe ill-conditioning that may already occur for low degrees, we have discarded the monomial basis and used the product Chebyshev basis of the smallest box containing X, a choice that turns out to work effectively in multivariate instances; cf. e.g. Bos et al. (2011), Piazzon (2019) and Piazzon et al. (2017b).

By the QR factorization

$$\begin{aligned} D^{1/2}(w)\,V_n=QR, \end{aligned}$$

with \(Q=(q_{ij})\) orthogonal (rectangular) and R square upper triangular, we have that \((p_1,\ldots ,p_N)=(\phi _1,\ldots ,\phi _N)R^{-1}\) is a w-orthonormal basis and

$$\begin{aligned} w_i\,K_n^{w}(x_i,x_i)=w_i\,\sum _{j=1}^N{p_j^2(x_i)}= \sum _{j=1}^N{q_{ij}^2},\quad 1\le i\le M. \end{aligned}$$
(33)

Thus we can update the weights at each step of (32) by a single QR factorization, using directly the squared 2-norms of the rows of the orthogonal matrix Q.
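A compact sketch of the iteration (32) with the QR update (33) might look as follows (univariate mesh and Chebyshev basis assumed for brevity; the experiments below use the product Chebyshev basis in Matlab):

```python
import numpy as np

def titterington(V, n_iter):
    """Multiplicative iteration (32) on the mesh underlying the M x N
    Vandermonde matrix V; returns near D-optimal probability weights."""
    M, N = V.shape
    w = np.full(M, 1.0 / M)                           # w(0): uniform weights
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(np.sqrt(w)[:, None] * V)  # D(w)^{1/2} V = QR
        w = np.sum(Q**2, axis=1) / N                  # w_i K_n^w(x_i,x_i)/N, cf. (33)
    return w

def g_efficiency(V, w):
    """G-efficiency N / max_i K_n^w(x_i,x_i) of w on the mesh itself;
    the weights stay positive along (32), so the division is safe."""
    Q, _ = np.linalg.qr(np.sqrt(w)[:, None] * V)
    Kn = np.sum(Q**2, axis=1) / w                     # K_n^w(x_i,x_i) by (33)
    return V.shape[1] / Kn.max()

n, m = 10, 5
X = np.cos(np.pi * np.arange(2*m*n + 1) / (2*m*n))    # univariate Y_2n-type mesh
V = np.polynomial.chebyshev.chebvander(X, n)
w = titterington(V, 30)
print(g_efficiency(V, w))                             # close to 1 after ~30 iterations
```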

The convergence of (32) can be slow, but a few iterations usually suffice to obtain an already quite good design on X. Indeed, in all our numerical tests with bivariate polynomial meshes, after 10 or 20 iterations we already get \(90\%\) G-efficiency on X, and \(95\%\) after 20 or 30 iterations; cf. Figures 1-left and 3-left for typical convergence profiles. On the other hand, \(99\%\) G-efficiency would require hundreds of iterations, and \(99.9\%\) thousands. When a G-efficiency very close to 1 is needed, one should choose one of the more sophisticated approximation algorithms available in the literature, cf. e.g. De Castro et al. (2019), Dette et al. (2008) and Mandal et al. (2015) and the references therein.

Fig. 1

Left: G-efficiency of the approximate optimal designs computed by (32) on a \(101\times 101\) Chebyshev–Lobatto grid of the square (upper curve, \(n=10\), \(m=5\)), and estimate (36) (lower curve); Right: Caratheodory-Tchakaloff compressed support (231 points) after \(\ell =22\) iterations (\(\hbox {G}_{{\mathrm{eff}}}\approx 0.95\))

Fig. 2

Caratheodory-Tchakaloff compressed support (165 points) on a \(41\times 41\times 41\) Chebyshev–Lobatto grid of the cube for regression degree \(n=4\) (with \(m=5\)), after \(\ell =35\) iterations (\(\hbox {G}_{{\mathrm{eff}}}\approx 0.95\))

Though the designs given by (32) will concentrate in the limit on the support of an optimal design, which typically has relatively low cardinality (with respect to M), this will not be numerically evident after only a small number of iterations. Hence, in practice, the support of the optimal measure is not readily identified, and a practitioner may be presented with a measure with full support (albeit with many small weights). However, using Tchakaloff compression (described below) the cardinality of the support can be immediately reduced, providing the practitioner with a typically much smaller, and hence more practical, design set.

Let \(V_{2n}\in \mathbb {R}^{M\times N_{2n}}\) be the rectangular Vandermonde matrix at X with respect to a fixed polynomial basis for \(\mathbb {P}_{2n}^d(X)=\mathbb {P}_{2n}^d(K)\) (recall that the chosen polynomial mesh is determining on K for polynomials of degree up to 2n), and w the weight array of a probability measure supported on X (in our instance, the weights produced by (32) after a suitable number of iterations, to get a prescribed G-efficiency on X). In this fully discrete framework the Tchakaloff Theorem is equivalent to the existence of a sparse nonnegative solution u to the underdetermined moment system

$$\begin{aligned} V_{2n}^tu=b=V_{2n}^tw,\quad u\ge 0, \end{aligned}$$
(34)

where b is the vector of discrete w-moments of the polynomial basis up to degree 2n. The celebrated Caratheodory Theorem on conical finite-dimensional linear combinations (Caratheodory 1911) ensures that such a solution exists and has no more than \(N_{2n}\) nonzero components.

In order to compute a sparse solution, we can resort to Linear or Quadratic Programming. We recall here the second approach, which turned out to be the most efficient in all the tests on bivariate discrete measure compression, for degrees in the order of tens, that we carried out, cf. Piazzon et al. (2017b). It consists of seeking a sparse solution \(\hat{u}\) to the NonNegative Least Squares problem

$$\begin{aligned} \Vert V_{2n}^t\hat{u}-b\Vert _2^2=\min _{u\ge 0}\Vert V_{2n}^tu-b\Vert _2^2 \end{aligned}$$
(35)

using the Lawson-Hanson active set algorithm (Lawson and Hanson 1995), which is implemented for example in the Matlab native function \(\mathtt {lsqnonneg}\). The nonzero components of \(\hat{u}\) determine the resulting design, whose support, say \(T=\{x_i:\,\hat{u}_i>0\}\), has at most \(N_{2n}\) points.
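A hedged sketch of this compression step, using scipy's implementation of the Lawson-Hanson algorithm (scipy.optimize.nnls, the analogue of the Matlab routine \(\mathtt {lsqnonneg}\) used in our tests):

```python
import numpy as np
from scipy.optimize import nnls

def caratheodory_tchakaloff(V2n, w, tol=1e-12):
    """Compress probability weights w on a mesh X: V2n is the M x N_2n
    Vandermonde of degree 2n at X. Returns the support indices
    T = {x_i : u_i > 0} (at most N_2n points) and the new weights."""
    b = V2n.T @ w                     # discrete w-moments, right-hand side of (34)
    u, _ = nnls(V2n.T, b)             # sparse nonnegative solution of (35)
    idx = np.flatnonzero(u > tol)
    return idx, u[idx]

# demo on a univariate mesh with uniform weights; the compressed measure
# matches all w-moments up to degree 2n, hence the Christoffel polynomial
n, M = 10, 1001
X = np.linspace(-1.0, 1.0, M)
w = np.full(M, 1.0 / M)
V2n = np.polynomial.chebyshev.chebvander(X, 2 * n)
idx, u = caratheodory_tchakaloff(V2n, w)
print(len(idx), np.allclose(V2n.T @ w, V2n[idx].T @ u))
```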

Observe that by construction \(K_n^{\hat{u}}(x,x)=K_n^{w}(x,x)\) on K, since the underlying probability measures have the same moments up to degree 2n and hence generate the same orthogonal polynomials. Now, since

$$\begin{aligned} \max _{x\in K}K_n^{w}(x,x)\le c_m\,\max _{x\in X}K_n^{w}(x,x)=\frac{c_mN}{\theta }, \end{aligned}$$

where \(\theta \) is the G-efficiency of w on X, in terms of G-efficiency on K we have the estimate

$$\begin{aligned} \hbox {G}_{{\mathrm{eff}}}(\hat{u})=\hbox {G}_{{\mathrm{eff}}}(w) \ge \frac{\theta }{c_m}, \end{aligned}$$
(36)

cf. Propositions 1–3, while in terms of the uniform norm of the weighted Least Squares operator we get the estimate

$$\begin{aligned} \Vert L_n^{\hat{u}}\Vert \le \sqrt{\frac{c_mN}{\theta }}. \end{aligned}$$
(37)

We present now several numerical tests. All the computations have been made in Matlab R2017b on a 2.7 GHz Intel Core i5 CPU with 16GB RAM. As a first example we consider polynomial regression on the square \(K=[-1,1]^2\). Since \(dub_{[-1,1]^2}(x,y)=\max \{|\arccos (x_1)-\arccos (y_1)|,|\arccos (x_2)-\arccos (y_2)|\}\), cf. Bos et al. (2008), by Proposition 2 we can take as initial support \(Y_{2n}(\pi /(2m))\) a \((2mn+1)\times (2mn+1)\) Chebyshev–Lobatto grid (here \(c_m=1/\cos (\pi /(2m))\), cf. Piazzon and Vianello 2018), apply the iteration (32) up to a given G-efficiency, and then Caratheodory-Tchakaloff measure compression via (35).

The results corresponding to \(n=10\) and \(m=5\) are reported in Fig. 1. Notice that (36) turns out to be an underestimate of the actual G-efficiency on K (the maximum has been computed on a much finer Chebyshev–Lobatto grid, say \(Y_{2n}(\pi /(8m))\)). All the information required for polynomial regression up to \(95\%\) G-efficiency is compressed into \(231=\hbox {dim}(\mathbb {P}_{20}^2)\) sampling nodes and weights, in about 1.7 s.

In Fig. 2 we present a trivariate example, where \(K=[-1,1]^3\) and we consider regression degree \(n=4\) and \(m=5\), with a corresponding \(41\times 41\times 41\) Chebyshev–Lobatto grid. This polynomial mesh of about 68,900 points is compressed into \(165=\hbox {dim}(\mathbb {P}_{8}^3)\) sampling nodes and weights still ensuring \(95\%\) G-efficiency, in about 9 s.

In order to check the algorithm behavior on a more complicated domain, we take a 14-sided nonconvex polygon. An application of polygonal compact sets is the approximation of geographical regions; for example, the polygon of Fig. 3 resembles a rough approximation of the shape of Belgium. The problem could be that of locating a near minimal number of sensors for near optimal polynomial regression, to sample continuous scalar or vector fields that have to be reconstructed or modelled on the whole region.

Fig. 3

Left: G-efficiency of the approximate optimal designs computed by (32) on a polynomial mesh with about 84,200 points of a 14-sided nonconvex polygon (upper curve, \(n=8\), \(m=5\)), and estimate (36) (lower curve); Right: Caratheodory-Tchakaloff compressed support (153 points) after \(\ell =26\) iterations (\(\hbox {G}_{{\mathrm{eff}}}\approx 0.95\))

With polygons we can resort to triangulation and finite union as in (3), constructing on each triangle a polynomial mesh like \(Z_{2n}(m)\) in Proposition 3 by the Duffy transform of a Chebyshev grid of the square with approximately \((2mn)^2\) points; here \(c_m=1/\cos ^2(\pi /(2m))\) for any triangle and hence for the whole polygon. The results corresponding to \(n=8\) and \(m=5\) are reported in Fig. 3. The G-efficiency convergence profile is similar to that of Fig. 1, and the whole polynomial mesh of about 84,200 points is compressed into \(153=\hbox {dim}(\mathbb {P}_{16}^2)\) sampling nodes and weights still ensuring \(95\%\) G-efficiency, in about 8 s.

Remark 2

The practical implementation of any design requires an interpretation of the weights. As before, let \(X:=\{x_i\}_{i=1}^M\subset K\) be the support of a discrete measure with weights \(w_i>0\), \(1\le i\le M\); we denote this measure by \(\mu _X\). Further, let \(V_n:=(\phi _j(x_i))\in \mathbb {R}^{M\times N}\) be the Vandermonde evaluation matrix for the basis \(\{\phi _1,\ldots ,\phi _N\}\). The Gram (information) matrix is then given by

$$\begin{aligned} G&=\left( \int _K \phi _i(x)\phi _j(x)\,d\mu _X\right) _{1\le i,j\le N}\\&=\left( \sum _{k=1}^M w_k \phi _i(x_k)\phi _j(x_k)\right) _{1\le i,j\le N}\\&=V_n^tD(w)V_n \end{aligned}$$

where \(D(w)\in \mathbb {R}^{M\times M}\) is the diagonal matrix of the weights w. Then the best least squares approximation from \(\mathbb {P}_n^d(K)\), with respect to the measure \(\mu _X\), to observations \(y_i\), \(1\le i\le M\), is given by

$$\begin{aligned} \sum _{j=1}^N c_j\phi _j(x),\qquad c=(V_n^tD(w)V_n)^{-1}V_n^tD(w)y. \end{aligned}$$
(38)
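In code, (38) can be computed by a scaled least squares solve, which is numerically preferable to forming \(V_n^tD(w)V_n\) explicitly (a sketch; the function name is ours):

```python
import numpy as np

def wls_fit(V, w, y):
    """Weighted least squares (38): minimize sum_i w_i ((V c)_i - y_i)^2,
    i.e. c = (V^t D(w) V)^{-1} V^t D(w) y, via a scaled lstsq solve."""
    s = np.sqrt(w)
    c, *_ = np.linalg.lstsq(s[:, None] * V, s * y, rcond=None)
    return c
```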

For an optimal design, the determinant of the Gram matrix \(V_n^tD(w)V_n\) is as large as possible, and hence (38) can be used as a numerically stable (at least as much as possible) algorithm for computing an approximation to the given data. However, it is also useful to exploit the statistical meaning of the weights. Indeed, underlying the statistical interpretation of least squares is the assumption that the observations are samples of a polynomial \(p\in \mathbb {P}_n^d(K)\) at the design points \(x_i\), each with an error \(\epsilon _i\), these errors being assumed independent normal random variables, \(\epsilon _i \sim N(0,\sigma _i^2)\). One then minimizes the sum of the squares of the normalized variables,

$$\begin{aligned} \frac{\epsilon _i}{\sigma _i}=\frac{p(x_i)-y_i}{\sigma _i}, \end{aligned}$$

i.e., one minimizes

$$\begin{aligned} \sum _{i=1}^M \left( \frac{p(x_i)-y_i}{\sigma _i}\right) ^2. \end{aligned}$$

If we write \(p(x)=\sum _{j=1}^Nc_j\phi _j(x),\) then this may be expressed in matrix-vector form as

$$\begin{aligned} \Vert C^{-1/2}(V_nc-y)\Vert _2^2 \end{aligned}$$

where \(C=D(\sigma _i^2)\in \mathbb {R}^{M\times M}\) is the (diagonal) covariance matrix. This is minimized by

$$\begin{aligned} c=(V_n^tC^{-1}V_n)^{-1}V_n^tC^{-1}y. \end{aligned}$$
(39)

Comparing (39) with (38) we see that the weights \(w_i\) correspond to the reciprocals of the variances, \(w_i\sim 1/\sigma _i^2\), \(1\le i\le M\), normalized so that \(\sum _{i=1}^Mw_i=1\).

Now, if in principle any specific measurement has an error with a fixed variance \(\sigma ^2\), then the variance for an observation may be reduced by repeating the ith measurement \(m_i\) (say) times and then using the average \(\overline{y}_i\) in place of \(y_i\), with resulting error variance \(\sigma ^2/m_i\). Then \(w_i\sim 1/\sigma _i^2=m_i/\sigma ^2\) which, after normalization, results in

$$\begin{aligned} w_i=\frac{m_i}{\sum _{j=1}^Mm_j},\quad 1\le i\le M. \end{aligned}$$

In other words, the weights indicate the percentage of the total measurement budget to use at the ith observation point.
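As an illustration (the rounding scheme below is our own choice, not prescribed above), the weights can be turned into integer measurement counts for a total budget of B observations:

```python
import numpy as np

def allocate(w, B):
    """Turn probability weights w into integer counts m_i with sum m_i = B
    (largest-remainder rounding, an illustrative choice)."""
    counts = np.floor(B * w).astype(int)
    leftover = B - counts.sum()                  # budget still to distribute
    top = np.argsort(B * w - counts)[::-1][:leftover]
    counts[top] += 1                             # assign it to the largest fractions
    return counts

w = np.array([0.42, 0.33, 0.17, 0.08])
print(allocate(w, 100))                          # e.g. [42 33 17 8]
```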

However, the computed weights are rarely rational, and thus to obtain a usable percentage some rounding must be done. It turns out that the effect of this rounding on the determinant of the Gram matrix can be readily estimated. Indeed, we may calculate

$$\begin{aligned} \frac{\partial }{\partial w_i}\log (\mathrm{det}(G_n^\mu ))&= \frac{\partial }{\partial w_i}\mathrm{tr}(\log (G_n^\mu ))\\&=\mathrm{tr}\left( \frac{\partial }{\partial w_i}\log (G_n^\mu )\right) \\&=\mathrm{tr}\left( (G_n^\mu )^{-1}\frac{\partial G_n^\mu }{\partial w_i}\right) . \end{aligned}$$

But, as we may write

$$\begin{aligned} G_n^\mu =\sum _{k=1}^M w_k p(x_k)p^t(x_k) \end{aligned}$$

where p(x) is the vector \(p(x)=[\phi _1(x),\phi _2(x),\ldots , \phi _N(x)]^t\in \mathbb {R}^N\), we have

$$\begin{aligned} \frac{\partial G_n^\mu }{\partial w_i}= p(x_i)p^t(x_i) \end{aligned}$$

and hence

$$\begin{aligned} \frac{\partial }{\partial w_i}\log (\mathrm{det}(G_n^\mu ))&=\mathrm{tr}\left( (G_n^\mu )^{-1}p(x_i)p^t(x_i)\right) \\&=p^t(x_i)(G_n^\mu )^{-1}p(x_i)\\&=K_n^\mu (x_i,x_i). \end{aligned}$$
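This identity is easy to verify numerically, e.g. by a central finite difference (an illustrative univariate set-up):

```python
import numpy as np

M, n, i, h = 50, 5, 7, 1e-7
x = np.linspace(-1.0, 1.0, M)
w = np.full(M, 1.0 / M)
V = np.polynomial.chebyshev.chebvander(x, n)

def logdet(w):
    return np.linalg.slogdet(V.T @ (w[:, None] * V))[1]    # log det(V^t D(w) V)

e_i = np.eye(M)[i]
fd = (logdet(w + h * e_i) - logdet(w - h * e_i)) / (2 * h) # central difference
Q, _ = np.linalg.qr(np.sqrt(w)[:, None] * V)
Kn_i = np.sum(Q[i]**2) / w[i]                              # K_n^w(x_i,x_i), cf. (33)
print(fd, Kn_i)                                            # the two should agree
```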

For an optimal design \(K_n^\mu (x_i,x_i)=N\), \(1\le i \le M\), and for our near optimal designs this is nearly so. Hence a perturbation in a weight results in a relative perturbation in the determinant amplified by a factor of around N. In adjusting the weights, some roundings will be up and others down, so that these perturbations will tend to cancel one another. In other words, the rounding strategy is entirely practical.

4 Summary

In this paper we have shown that polynomial meshes (norming sets) provide good discretizations of compact sets \(K\subset \mathbb {R}^d\) also for the purpose of computing (near) optimal statistical designs. We have further shown how Tchakaloff compression of a discrete measure can be used to concentrate the design measure onto a relatively small subset of its support, thus making any least squares calculation rather more practical.