1 Introduction

We consider the problem of minimizing a polynomial \(f \in {\mathbb {R}}[x]\) of degree \(d \le n\) over the n-dimensional boolean hypercube \({\mathbb {B}}^{n} = \{0, 1\}^n\), i.e., of computing

$$\begin{aligned} f_{\min }:= \min _{x \in {\mathbb {B}}^{n}} f(x). \end{aligned}$$
(1)

This optimization problem is NP-hard in general. Indeed, as is well-known, one can model an instance of max-cut on the complete graph \(K_n\) with edge weights \(w=(w_{ij})\) as a problem of the form (1) by setting:

$$\begin{aligned} f(x) = -\sum _{1\le i < j\le n} w_{ij} (x_i - x_j)^2. \end{aligned}$$

As another example one can compute the stability number \(\alpha (G)\) of a graph \(G=(V,E)\) via the program

$$\begin{aligned} \alpha (G) = \max _{x\in {\mathbb {B}}^{|V|}} \sum _{i\in V}x_i -\sum _{\{i,j\}\in E} x_ix_j. \end{aligned}$$
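As a concrete toy instance (an assumed example, not from the original text), the 5-cycle \(C_5\) has stability number 2, which can be checked by brute-force evaluation of this program over \({\mathbb {B}}^{5}\):

```python
from itertools import product

# Brute-force evaluation of alpha(G) = max_{x in {0,1}^n} sum_i x_i - sum_{ij in E} x_i x_j
# for the 5-cycle C5 (an assumed toy example; any small graph works).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
n = 5

def objective(x):
    return sum(x) - sum(x[i] * x[j] for (i, j) in edges)

alpha = max(objective(x) for x in product((0, 1), repeat=n))
assert alpha == 2  # the stability number of C5
```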

One may replace the boolean cube \({\mathbb {B}}^{n}=\{0,1\}^n\) by the discrete cube \(\{\pm 1\}^n\), in which case maximizing a quadratic polynomial \(x^\top Ax\) has many other applications, e.g., to max-cut [13], to the cut norm [1], or to correlation clustering [4]. Approximation algorithms are known depending on the structure of the matrix A (see [1, 6, 13]), but the problem is known to be NP-hard to approximate within any factor less than 13/11 [2].

Problem (1) also allows one to capture polynomial optimization over a general region of the form \({\mathbb {B}}^{n}\cap P\) where P is a polyhedron [17], and thus a broad range of combinatorial optimization problems. The general intractability of problem (1) motivates the search for tractable bounds on the minimum value in (1). In particular, several lift-and-project methods have been proposed, based on lifting the problem to a higher dimension by introducing new variables modelling higher-degree monomials. Such methods also apply to constrained problems on \({\mathbb {B}}^{n}\) where the constraints can be linear or polynomial; see, e.g., [3, 18, 27, 35, 39, 44]. In [21] it is shown that the sum-of-squares hierarchy of Lasserre [18] in fact refines the other proposed hierarchies. As a consequence, the sum-of-squares approach for polynomial optimization over \({\mathbb {B}}^{n}\) has received a great deal of attention in recent years and there is a vast literature on this topic. Among many other results, let us just mention its use to show lower bounds on the size of semidefinite programming relaxations for combinatorial problems such as max-cut, maximum stable set and TSP in [25], and the links to the Unique Games Conjecture in [5]. For background on the sum-of-squares hierarchy applied to polynomial optimization over general semialgebraic sets we refer to [16, 19, 24, 32] and further references therein.

This motivates the interest in gaining a better understanding of the quality of the bounds produced by the sum-of-squares hierarchy. Our objective in this paper is to provide such an error analysis for this hierarchy applied to binary polynomial optimization as in (1).

1.1 The sum-of-squares hierarchy on the boolean cube

The sum-of-squares hierarchy was introduced by Lasserre [16, 18] and Parrilo [32] as a tool to produce tractable lower bounds for polynomial optimization problems. When applied to problem (1) it provides for any integer \(r\in {\mathbb {N}}\) a lower bound \(\smash {f_{({r})}} \le f_{\min }\) on \(f_{\min }\), given by:

$$\begin{aligned} \smash {f_{({r})}} := \sup _{\lambda \in {\mathbb {R}}} \left\{ \lambda : f(x) - \lambda \text { is a sum-of-squares of degree at most } 2r \text { on } {\mathbb {B}}^{n} \right\} . \end{aligned}$$
(2)

The condition ‘\(f(x)-\lambda \) is a sum-of-squares of degree at most 2r on \({\mathbb {B}}^{n}\)’ means that there exists a sum-of-squares polynomial \(s \in \varSigma _r\) such that \(f(x) - \lambda = s(x)\) for all \(x \in {\mathbb {B}}^{n}\), or, equivalently, the polynomial \(f-\lambda -s\) belongs to the ideal generated by the polynomials \(x_1-x_1^2,\ldots ,x_n-x_n^2\). Throughout, \(\varSigma _r\) denotes the set of sum-of-squares polynomials with degree at most 2r, i.e., of the form \(\sum _i p_i^2\) with \(p_i\in {\mathbb {R}}[x]_r\).
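As a tiny numerical illustration (not part of the original argument), the polynomial \(f(x)=x_1\) is not a sum of squares in \({\mathbb {R}}[x]\), yet it agrees with the square \(x_1^2\) at every point of \({\mathbb {B}}^{n}\), since \(x_1-x_1^2\) belongs to the ideal generated by the polynomials \(x_i-x_i^2\); a brute-force check:

```python
from itertools import product

# f(x) = x1 agrees with the square s(x) = x1^2 on B^n, because x1 - x1^2
# vanishes on {0,1}; hence f - f_min is a sum-of-squares "on B^n" in the
# sense above (toy check with assumed n = 3, f_min = 0, s = x1^2).
n = 3
for x in product((0, 1), repeat=n):
    assert x[0] - 0 == x[0] ** 2  # f(x) - lambda = s(x) with lambda = f_min = 0
```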

As sums of squares of polynomials can be modelled using semidefinite programming, problem (2) can be reformulated as a semidefinite program of size polynomial in n for fixed r [16, 32], see also [23]. In the case of unconstrained boolean optimization, the resulting semidefinite program is known to have an optimum solution with small coefficients (see [31, 33]). For fixed r, the parameter \(\smash {f_{({r})}}\) may therefore be computed efficiently (up to any precision).

The bounds \(\smash {f_{({r})}}\) have finite convergence: \(\smash {f_{({r})}}=f_{\min }\) for \(r\ge n\) [18, 21]. In fact, it has been shown in [36] that the bound \(\smash {f_{({r})}}\) is exact already for \(2r \ge n+d-1\). That is,

$$\begin{aligned} \smash {f_{({r})}} = f_{\min }\text { for } r \ge \frac{n+d-1}{2}. \end{aligned}$$
(3)

In addition, it is shown in [36] that the bound \(\smash {f_{({r})}}\) is exact for \({2r\ge n+d-2}\) when the polynomial f has only monomials of even degree. This extends an earlier result of [12] shown for quadratic forms (\(d=2\)), which applies in particular to the case of max-cut. Furthermore, this result is tight for max-cut, since one needs to go up to order \(2r\ge n\) in order to reach finite convergence (in the cardinality case when all edge weights are 1) [22]. Similarly, the result (3) is tight when d is even and n is odd [15].

The main contribution of this work is an analysis of the quality of the bounds \(\smash {f_{({r})}}\) for parameters \(r, n \in {\mathbb {N}}\) which fall outside of this regime, i.e., \(2r<n+d-1\). The following is our main result, which expresses the error of the bound \(\smash {f_{({r})}}\) in terms of the roots of Krawtchouk polynomials, which are classical univariate orthogonal polynomials with respect to a discrete measure on the set \(\{0, 1, \ldots , n\}\) (see Sect. 2 for details).

Theorem 1

Fix \(d \le n\) and let \(f \in {\mathbb {R}}[x]\) be a polynomial of degree d. For \(r, n \in {\mathbb {N}}\), let \(\xi ^n_{r}\) be the least root of the degree r Krawtchouk polynomial (19) with parameter n. Then, if \((r+1)/n \le 1/2\) and \({d(d+1) \cdot \xi _{r+1}^n / n \le 1/2}\), we have:

$$\begin{aligned} \frac{f_{\min }- \smash {f_{({r})}}}{\Vert f\Vert _{\infty }} \le 2 C_d \cdot \xi _{r+1}^n / n. \end{aligned}$$
(4)

Here \(C_d > 0\) is an absolute constant depending only on d and we set \(\Vert f\Vert _\infty := \max _{x \in {\mathbb {B}}^{n}} | f(x) |.\)

The extremal roots of Krawtchouk polynomials are well-studied in the literature. The following result of Levenshtein [26] shows their asymptotic behaviour.

Theorem 2

([26], Section 5) For \(t \in [0, 1/2]\), define the function

$$\begin{aligned} \varphi (t) = 1/2 - \sqrt{t(1-t)}. \end{aligned}$$
(5)

Then the least root \(\xi _r^n\) of the degree r Krawtchouk polynomial with parameter n satisfies

$$\begin{aligned} \xi _r^n / n \le \varphi (r/n) + c \cdot (r / n)^{-1/6} \cdot n^{-2/3} \end{aligned}$$
(6)

for some universal constant \(c > 0\).

Applying (6) to (4), we find that the relative error of the bound \(\smash {f_{({r})}}\) in the regime \(r \approx t \cdot n\) behaves as the function \(\varphi (t) = 1/2 - \sqrt{t(1-t)}\), up to a term in \(O(1/n^{2/3})\), which vanishes as n tends to \(\infty \). As an illustration, Fig. 1 in Appendix A shows the function \(\varphi (t)\).
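To make this limiting behaviour concrete, the function \(\varphi \) can be evaluated directly; the following short sketch (an illustration, not from the paper) checks its endpoint values and monotonicity:

```python
import math

def phi(t):
    """The limiting error profile phi(t) = 1/2 - sqrt(t(1 - t)) from (5)."""
    return 0.5 - math.sqrt(t * (1 - t))

# At t = 0 the bound on the relative error is 1/2; at t = 1/2 it vanishes,
# matching the exactness result (3) asymptotically.
assert phi(0.0) == 0.5
assert abs(phi(0.5)) < 1e-12
# phi is strictly decreasing on [0, 1/2]:
assert all(phi((i + 1) / 20) < phi(i / 20) for i in range(10))
```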

1.2 A second hierarchy of bounds

In addition to the lower bound \(\smash {f_{({r})}}\), Lasserre [20] also defines an upper bound \(\smash {f^{({r})}} \ge f_{\min }\) on \(f_{\min }\) as follows:

$$\begin{aligned} \smash {f^{({r})}} := \inf _{s \in \varSigma _r} \left\{ \int _{{\mathbb {B}}^{n}} f(x) \cdot s(x) d\mu (x) : \int _{{\mathbb {B}}^{n}} s(x) d\mu (x) = 1\right\} , \end{aligned}$$
(7)

where \(\mu \) is the uniform probability measure on \({\mathbb {B}}^{n}\). For fixed r, similarly to \(\smash {f_{({r})}}\), one may compute \(\smash {f^{({r})}}\) (up to any precision) efficiently by reformulating problem (7) as a semidefinite program [20]. Furthermore, as shown in [20], the bound is exact at some finite order r; in fact, it is not difficult to see that the bound \(\smash {f^{({r})}}\) is exact at order \(r = n\) and that this is tight (see Sect. 5).

Essentially as a side result in the proof of Theorem 1, we can show the following analog for the upper bounds \(\smash {f^{({r})}}\), which we believe to be of independent interest.

Theorem 3

Fix \(d \le n\) and let \(f \in {\mathbb {R}}[x]\) be a polynomial of degree d. Then, for any \(r, n \in {\mathbb {N}}\) with \((r+1)/n \le 1/2\), we have:

$$\begin{aligned} \frac{\smash {f^{({r})}} - f_{\min }}{\Vert f\Vert _{\infty }} \le C_d \cdot \xi _{r+1}^n / n, \end{aligned}$$

where \(C_d > 0\) is the constant mentioned in Theorem 1.

So we have the same estimate of the relative error for the upper bounds \(\smash {f^{({r})}}\) as for the lower bounds \(\smash {f_{({r})}}\) (up to a constant factor 2) and indeed we will see that our proof relies on an intimate connection between both hierarchies. Note that the above analysis of \(\smash {f^{({r})}}\) does not require any condition on the size of \(\xi _{r+1}^n\) as was necessary for the analysis of \(\smash {f_{({r})}}\) in Theorem 1. Indeed, as will become clear later, the condition put on \(\xi _{r+1}^n\) follows from a technical argument (see Lemma 6), which is not required in the proof of Theorem 3.

1.3 Asymptotic analysis for both hierarchies

The results above show that the relative error of both hierarchies is bounded asymptotically by the function \(\varphi (t)\) from (5) in the regime \(r\approx t\cdot n\). This is summarized in the following corollary, which can be seen as an asymptotic version of Theorem 1 and Theorem 3.

Corollary 1

Fix \(d\le n\) and for \(n,r\in {\mathbb {N}}\) write

$$\begin{aligned} E_{(r)}(n)&:= \sup _{f \in {\mathbb {R}}[x]_d} \big \{f_{\min }- \smash {f_{({r})}} : \Vert f\Vert _\infty = 1 \big \}, \\ E^{(r)}(n)&:= \sup _{f \in {\mathbb {R}}[x]_d} \big \{ \smash {f^{({r})}} - f_{\min }: \Vert f\Vert _\infty = 1 \big \}. \end{aligned}$$

Let \(C_d\) be the constant of Theorem 1 and let \(\varphi (t)\) be the function from (5). Then, for any \(t\in [0,1/2]\), we have:

$$\begin{aligned} \lim _{r/n\rightarrow t} E^{(r)}(n) \le C_d\cdot \varphi (t) \end{aligned}$$

and, if \(d(d+1)\cdot \varphi (t) \le 1/2\), we also have:

$$\begin{aligned} \lim _{r/n\rightarrow t} E_{(r)}(n) \le 2\cdot C_d\cdot \varphi (t). \end{aligned}$$

Here, the limit notation \(r/n\rightarrow t\) means that the claimed convergence holds for all sequences \((n_j)_j\) and \((r_j)_j\) of integers such that \(\lim _{j\rightarrow \infty } n_j=\infty \) and \(\lim _{j\rightarrow \infty }r_j/n_j=t\).

We close with some remarks. First, note that \(\varphi (1/2) = 0\). Hence Corollary 1 tells us that the relative error of both hierarchies tends to 0 as \(r / n \rightarrow 1/2\). We thus ‘asymptotically’ recover the exactness result (3) of [36].

Our results in Theorems 1 and 3 and Corollary 1 extend directly to the case of polynomial optimization over the discrete cube \(\{\pm 1\}^n\) instead of the boolean cube \({\mathbb {B}}^{n}=\{0,1\}^n\), as can easily be seen by applying a change of variables \(x\in \{0,1\}\mapsto 2x-1\in \{\pm 1\}\). In addition, as we show in Appendix A, our results extend to the case of polynomial optimization over the q-ary cube \(\{0,1,\ldots ,q-1\}^n\) for \(q > 2\).

After replacing f by its negation \(-f\), one may use the lower bound \(\smash {f_{({r})}}\) on \(f_{\min }\) in order to obtain an upper bound on the maximum \(f_{\max }\) of f over \({\mathbb {B}}^{n}\). Similarly, one may obtain a lower bound on \(f_{\max }\) using the upper bound \(\smash {f^{({r})}}\) on \(f_{\min }\). To avoid possible confusion, we will also refer to \(\smash {f_{({r})}}\) as the outer Lasserre hierarchy (or simply the sum-of-squares hierarchy), whereas we will refer to \(\smash {f^{({r})}}\) as the inner Lasserre hierarchy. This terminology (borrowed from [7]) is motivated by the following observations. As is well-known (and easy to see) the parameter \(f_{\min }\) can be reformulated as an optimization problem over the set \({\mathcal {M}}\) of Borel measures on \({\mathbb {B}}^{n}\):

$$\begin{aligned} f_{\min }=\min \Big \{\int _{{\mathbb {B}}^{n}} f(x)d\nu (x): \nu \in {\mathcal {M}}, \int _{{\mathbb {B}}^{n}} d\nu (x)=1\Big \}. \end{aligned}$$

If we replace the set \({\mathcal {M}}\) by its inner approximation consisting of all measures of the form \(d\nu (x)=s(x)d\mu (x)\) with polynomial density \(s\in \varSigma _r\) with respect to a given fixed measure \(\mu \), then we obtain the bound \(f^{(r)}\). On the other hand, any \(\nu \in {\mathcal {M}}\) corresponds to a linear functional \(L_\nu :p\in {\mathbb {R}}[x]_{2r}\mapsto \int _{{\mathbb {B}}^{n}}p(x)d\nu (x)\) which is nonnegative on sums-of-squares on \({\mathbb {B}}^{n}\). Such linear functionals thus provide an outer approximation for \({\mathcal {M}}\), and minimizing \(L(f)\) over this outer approximation gives the bound \(f_{(r)}\) (in dual formulation).

1.4 Related work

As mentioned above, the bounds \(\smash {f_{({r})}}\) defined in (2) are known to be exact when \(2r \ge n+d -1\). The case \(d=2\) (which includes max-cut) was treated in [12], positively answering a question posed in [22]. Extending the strategy of [12], the general case was settled in [36]. These exactness results are best possible when d is even and n is odd [15].

In [14], the sum-of-squares hierarchy is considered for approximating instances of knapsack. This can be seen as a variation on problem (1), restricting to a linear objective with positive coefficients but introducing a single linear constraint of the form \(a_1x_1+\cdots +a_nx_n\le b\) with \(a_i>0\). There, the authors show that the outer hierarchy has relative error at most \(1 / (r - 1)\) for any integer \(r\ge 2\). To the best of our knowledge, this is the only known case where one can analyze the quality of the outer bounds for all orders \(r\le n\).

For optimization over sets other than the boolean cube, the following results on the quality of the outer hierarchy \(\smash {f_{({r})}}\) are available. When considering general semi-algebraic sets (satisfying a compactness condition), it has been shown in [30] that there exists a constant \(c>0\) (depending on the semi-algebraic set) such that \(\smash {f_{({r})}}\) converges to \(f_{\min }\) at a rate in \(O(1 / \log (r/c)^{1/c})\) as r tends to \(\infty \). This rate can be improved to \(O(1 / r^{1/c})\) if one considers a variation of the sum-of-squares hierarchy which is stronger (based on the preordering instead of the quadratic module), but much more computationally intensive [38]. Specializing to the hypersphere \(S^{n-1}\), better rates in O(1/r) were shown in [10, 34], and recently improved to \(O(1/r^2)\) in [11]. Similar improved results exist also for the case of polynomial optimization on the simplex and the continuous hypercube \([-1,1]^n\); we refer, e.g., to [7] for an overview.

The results for semi-algebraic sets other than \({\mathbb {B}}^{n}\) mentioned above all apply in the asymptotic regime where the dimension n is fixed and \(r \rightarrow \infty \). This makes it difficult to compare them directly to our new results. Indeed, we have to consider a different regime in the case of the boolean cube \({\mathbb {B}}^{n}\), as the hierarchy always converges in at most n steps. The regime where we are able to provide an analysis in this paper is when \(r \approx t\cdot n\) with \(0<t\le 1/2\).

Turning now to the inner hierarchy (7), as far as we are aware, nothing is known about the behaviour of the bounds \(\smash {f^{({r})}}\) on \({\mathbb {B}}^{n}\). For full-dimensional compact sets, however, results are available. It has been shown that, on the hypersphere [8], the unit ball and the simplex [40], and the unit box [9], the bound \(\smash {f^{({r})}}\) converges at a rate in \(O(1/r^2)\). A slightly weaker convergence rate in \(O(\log ^2 r / r^2)\) is known for general (full-dimensional) semi-algebraic sets [40, 42]. Again, these results are all asymptotic in r, and thus hard to compare directly to our analysis on \({\mathbb {B}}^{n}\).

1.5 Overview of the proof

Here, we give an overview of the main ideas that we use to show our results. Our broad strategy follows the one employed in [11] to obtain information on the sum-of-squares hierarchy on the hypersphere. The following four ingredients will play a key role in our proof:

  1. we use the polynomial kernel technique in order to produce low-degree sum-of-squares representations of polynomials that are positive over \({\mathbb {B}}^{n}\), thus allowing an analysis of \(f_{\min }- \smash {f_{({r})}}\);

  2. using classical Fourier analysis on the boolean cube \({\mathbb {B}}^{n}\) we are able to exploit symmetry and reduce the search for a multivariate kernel to a univariate sum-of-squares polynomial on the discrete set \([0:n] := \{0,1,\ldots ,n\}\);

  3. we find this univariate sum-of-squares by applying the inner Lasserre hierarchy to an appropriate univariate optimization problem on \([0:n]\);

  4. finally, we exploit a known connection between the inner hierarchy and the extremal roots of corresponding orthogonal polynomials (in our case, the Krawtchouk polynomials).

Following these steps we are able to analyze the sum-of-squares hierarchy \(\smash {f_{({r})}}\) as well as the inner hierarchy \(\smash {f^{({r})}}\). We now sketch how our proof articulates along these four main steps.

Let \(f\in {\mathbb {R}}[x]_d\) be a polynomial of degree d for which we wish to analyze the bounds \(f_{(r)}\) and \(f^{(r)}\). After rescaling, and up to a change of coordinates, we may assume w.l.o.g. that f attains its minimum over \({\mathbb {B}}^{n}\) at \(0 \in {\mathbb {B}}^{n}\) and that \(f_{\min }=0\) and \(f_{\max } = 1\). So we have \(\Vert f\Vert _\infty =1\). To simplify notation, we will make these assumptions throughout.

The first key idea is to consider a polynomial kernel \(K\) on \({\mathbb {B}}^{n}\) of the form:

$$\begin{aligned} K(x, y) = u^2(d(x,y)), \end{aligned}$$
(8)

where \(u \in {\mathbb {R}}[t]_r\) is a univariate polynomial of degree at most r and \(d(x,y)\) is the Hamming distance between x and y. Such a kernel K induces an operator \({\mathbf {K}}\), which acts linearly on the space of polynomials on \({\mathbb {B}}^{n}\) by:

$$\begin{aligned} p\in {\mathbb {R}}[x] \mapsto {\mathbf {K}}p(x) := \int _{{\mathbb {B}}^{n}} p(y) K(x,y) d\mu (y) = {1\over 2^n} \sum _{y \in {\mathbb {B}}^{n}} p(y) K(x, y). \end{aligned}$$

Recall that \(\mu \) is the uniform probability distribution on \({\mathbb {B}}^{n}\). An easy but important observation is that, if p is nonnegative on \({\mathbb {B}}^{n}\), then \({\mathbf {K}}p\) is a sum-of-squares (on \({\mathbb {B}}^{n}\)) of degree at most 2r. We use this fact as follows.

Given a scalar \(\delta \ge 0\), define the polynomial \({\tilde{f}} := f + \delta \). Assuming that the operator \({\mathbf {K}}\) is non-singular, we can express \({\tilde{f}}\) as \({\tilde{f}} = {\mathbf {K}}({\mathbf {K}}^{-1} {\tilde{f}})\). Therefore, if \({\mathbf {K}}^{-1} {\tilde{f}}\) is nonnegative on \({\mathbb {B}}^{n}\), then we find that \({\tilde{f}}\) is a sum-of-squares on \({\mathbb {B}}^{n}\) with degree at most 2r, which implies \(f_{\min }- \smash {f_{({r})}} \le \delta \).
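The key observation above can be illustrated numerically on a toy instance (the choices of u and p below are assumed for illustration, not taken from the paper): since \(K(x,y)=u^2(d(x,y))\ge 0\), the operator \({\mathbf {K}}\) maps polynomials that are nonnegative on \({\mathbb {B}}^{n}\) to nonnegative functions on \({\mathbb {B}}^{n}\):

```python
from itertools import product

# Toy check: since K(x, y) = u(d(x, y))^2 >= 0, the averaging operator K
# preserves nonnegativity on B^n (assumed n = 4, u and p chosen arbitrarily).
n = 4
cube = list(product((0, 1), repeat=n))

def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def u(t):          # an arbitrary univariate polynomial (degree 1)
    return 1 - 0.3 * t

def K(x, y):       # the kernel K(x, y) = u^2(d(x, y))
    return u(hamming(x, y)) ** 2

def p(x):          # a polynomial nonnegative on B^n
    return (sum(x) - 1) ** 2

def Kp(x):         # (Kp)(x) = 2^{-n} sum_y p(y) K(x, y)
    return sum(p(y) * K(x, y) for y in cube) / 2 ** n

assert all(Kp(x) >= 0 for x in cube)
```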

One way to guarantee that \({\mathbf {K}}^{-1} {\tilde{f}}\) is indeed nonnegative on \({\mathbb {B}}^{n}\) is to select the operator \({\mathbf {K}}\) in such a way that \({\mathbf {K}}(1)=1\) and

$$\begin{aligned} \Vert {\mathbf {K}}^{-1}-I\Vert := \sup _{p \in {\mathbb {R}}[x]_d} \frac{\Vert {\mathbf {K}}^{-1} p - p \Vert _\infty }{\Vert p\Vert _\infty } \le \delta . \end{aligned}$$
(9)

We collect this as a lemma for further reference.

Lemma 1

If the kernel operator \({\mathbf {K}}\) associated to \(u \in {\mathbb {R}}[t]_r\) via relation (8) satisfies \({\mathbf {K}}(1)=1\) and \(\Vert {\mathbf {K}}^{-1}-I\Vert \le \delta \), then we have \(f_{\min }-f_{(r)}\le \delta \).

Proof

With \({\tilde{f}}=f+\delta \), we have: \(\Vert {\mathbf {K}}^{-1}{\tilde{f}}-\tilde{f}\Vert _\infty =\Vert {\mathbf {K}}^{-1}f -f\Vert _\infty \le \delta \Vert f\Vert _\infty =\delta \), where we use the fact that \({\mathbf {K}}(1) = 1 = {\mathbf {K}}^{-1} (1)\) and thus \({\mathbf {K}}^{-1}\tilde{f}-{\tilde{f}} ={\mathbf {K}}^{-1}f -f\) for the first equality and (9) for the inequality. The inequality \(\Vert {\mathbf {K}}^{-1}{\tilde{f}}-{\tilde{f}}\Vert _\infty \le \delta \) then implies \({\mathbf {K}}^{-1}{\tilde{f}}(x)\ge {\tilde{f}}(x)-\delta =f(x)\ge f_{\min }=0\) on \({\mathbb {B}}^{n}\). \(\square \)

In other words, we want the operator \({\mathbf {K}}^{-1}\) (and thus \({\mathbf {K}}\)) to be ‘close to the identity operator’ in a certain sense. As kernels of the form (8) are invariant under the symmetries of \({\mathbb {B}}^{n}\), we are able to use classical Fourier analysis on the boolean cube to express the eigenvalues \(\lambda _i\) of \({\mathbf {K}}\) in terms of the polynomial u. More precisely, it turns out that these eigenvalues are given by the coefficients of the expansion of \(u^2\) in the basis of Krawtchouk polynomials. As we show later (see (33)), inequality (9) then holds for \(\delta \approx \sum _{i} |\lambda _i^{-1} - 1|\).

It therefore remains to find a univariate polynomial \(u \in {\mathbb {R}}[t]_r\) for which these coefficients, and thus the eigenvalues of \({\mathbf {K}}\), are sufficiently close to 1. Interestingly, this problem boils down to analyzing the quality of the inner bound \(g^{(r)}\) (see (7)) for a particular univariate polynomial g (given in (35)).

In order to perform this analysis and conclude the proof of Theorem 1, we make use of a connection between the inner Lasserre hierarchy and the least roots of orthogonal polynomials (in this case the Krawtchouk polynomials).

Finally, we generalize our analysis of the inner hierarchy (performed in Step 3 for the special case of the selected polynomial g) to arbitrary polynomials, which yields Theorem 3.

1.6 Matrix-valued polynomials

As we explain in this section the results of Theorems 1 and 3 may be extended to the setting of matrix-valued polynomials.

For fixed \(k \in {\mathbb {N}}\), let \({\mathcal {S}}^k\subseteq {\mathbb {R}}^{k \times k}\) be the space of \(k \times k\) real symmetric matrices. We write \({\mathcal {S}}^k[x] \subseteq {\mathbb {R}}^{k \times k}[x]\) for the space of n-variate polynomials whose coefficients lie in \({\mathcal {S}}^k\), i.e., the set of \(k\times k\) symmetric polynomial matrices. The polynomial optimization problem (1) may be generalized to polynomials \(F \in {\mathcal {S}}^k[x]\) as:

$$\begin{aligned} F_{\min } := \min _{x \in {\mathbb {B}}^{n}} \lambda _{\min }(F(x)), \end{aligned}$$
(10)

where \(F\in {\mathcal {S}}^k[x]\) is a polynomial matrix and \(\lambda _{\min }(F(x))\) denotes the smallest eigenvalue of F(x). That is, \(F_{\min }\) is the largest value for which \(F(x) - F_{\min } I \succeq 0\) for all \(x \in {\mathbb {B}}^{n}\), where \(\succeq \) denotes the positive semidefinite (Löwner) order.

The Lasserre hierarchies (2) and (7) may also be defined in this setting. A polynomial matrix \(S \in {\mathcal {S}}^k[x]\) is called a sum-of-squares polynomial of degree 2r if it can be written as:

$$\begin{aligned} S(x) = \sum _{i} U_i(x) U_i(x)^\top \quad (U_i \in {\mathbb {R}}^{k \times m}[x], ~~\deg U_i \le r, ~~ m\in {\mathbb {N}}). \end{aligned}$$

We write \(\varSigma ^{k \times k}_r\) for the set of all such polynomials. See, e.g., [37] for background and results on sum-of-squares polynomial matrices. Note that a sum-of-squares polynomial matrix \(S \in \varSigma ^{k \times k}_r\) satisfies \(S(x) \succeq 0\) for all \(x \in {\mathbb {B}}^{n}\). For a matrix \(A \in {\mathcal {S}}^k\), \(\Vert A\Vert \) denotes its spectral (or operator) norm, defined as the largest absolute value of an eigenvalue of A. Then, for \(F\in {\mathcal {S}}^k[x]\), we set

$$\begin{aligned} \Vert F\Vert _\infty = \max _{x\in {\mathbb {B}}^{n}} \Vert F(x)\Vert . \end{aligned}$$
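As a quick numerical check (with assumed small dimensions \(n=3\), \(k=m=2\) and random affine entries), a matrix of the form \(S(x)=U(x)U(x)^\top \) is indeed symmetric and positive semidefinite at every point of \({\mathbb {B}}^{n}\):

```python
import random
from itertools import product

random.seed(42)
n, k, m = 3, 2, 2  # assumed small dimensions for illustration
cube = list(product((0, 1), repeat=n))

# A random k x m polynomial matrix U(x) with affine entries.
coeffs = [[[random.uniform(-1, 1) for _ in range(n + 1)]
           for _ in range(m)] for _ in range(k)]

def U(x):
    return [[coeffs[i][j][0] + sum(c * xi for c, xi in zip(coeffs[i][j][1:], x))
             for j in range(m)] for i in range(k)]

def S(x):
    # S(x) = U(x) U(x)^T, a sum-of-squares polynomial matrix (single term)
    Ux = U(x)
    return [[sum(Ux[i][l] * Ux[j][l] for l in range(m)) for j in range(k)]
            for i in range(k)]

# S(x) is symmetric and PSD at every point of B^n: check v^T S(x) v >= 0
# for random directions v (up to floating-point tolerance).
for x in cube:
    Sx = S(x)
    assert abs(Sx[0][1] - Sx[1][0]) < 1e-12
    for _ in range(20):
        v = [random.gauss(0, 1) for _ in range(k)]
        q = sum(v[i] * Sx[i][j] * v[j] for i in range(k) for j in range(k))
        assert q >= -1e-9
```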

The outer hierarchy \(F_{(r)}\) for the polynomial F is then given by:

$$\begin{aligned} F_{(r)} := \sup _{\lambda \in {\mathbb {R}}} \left\{ \lambda : F(x) - \lambda \cdot I = S(x) \text { on } {\mathbb {B}}^{n} \text { for some } S \in \varSigma ^{k \times k}_r \right\} . \end{aligned}$$
(11)

Similarly, the inner hierarchy \(F^{(r)}\) is defined as:

$$\begin{aligned} F^{(r)} := \inf _{S \in \varSigma ^{k \times k}_r} \left\{ \int _{{\mathbb {B}}^{n}} \mathrm{Tr} \big ( F(x) S(x)\big ) d\mu (x) : \int _{{\mathbb {B}}^{n}} \mathrm{Tr} \big (S(x)\big ) d\mu (x) = 1\right\} . \end{aligned}$$
(12)

Indeed, (2) and (7) are the special cases of the above programs when \(k=1\). As before, the parameters \(F_{(r)}\) and \(F^{(r)}\) may be computed efficiently for fixed r (and fixed k) using semidefinite programming, and they satisfy

$$\begin{aligned} F_{(r)} \le F_{(r+1)} \le F_{\min } \le F^{(r+1)} \le F^{(r)}. \end{aligned}$$

As was already noted in [11] (in the context of optimization over the unit sphere), the proof strategy outlined in Sect. 1.5 naturally extends to the setting of polynomial matrices. This yields the following generalizations of our main theorems.

Theorem 4

Fix \(d \le n\) and let \(F \in {\mathcal {S}}^k[x]\) (\(k\ge 1\)) be a polynomial matrix of degree d. For \(r, n \in {\mathbb {N}}\), let \(\xi ^n_{r}\) be the least root of the degree r Krawtchouk polynomial (19) with parameter n. Then, if \((r+1)/n \le 1/2\) and \({d(d+1) \cdot \xi _{r+1}^n / n \le 1/2}\), we have:

$$\begin{aligned} \frac{F_{\min } - F_{(r)} }{\Vert F\Vert _{\infty }} \le 2 C_d \cdot \xi _{r+1}^n / n, \end{aligned}$$
(13)

where \(C_d > 0\) is the absolute constant depending only on d from Theorem 1.

Theorem 5

Fix \(d \le n\) and let \(F \in {\mathcal {S}}^k[x]\) be a matrix-valued polynomial of degree d. Then, for any \(r, n \in {\mathbb {N}}\) with \((r+1)/n \le 1/2\), we have:

$$\begin{aligned} \frac{F^{(r)} - F_{\min } }{\Vert F\Vert _{\infty }} \le C_d \cdot \xi _{r+1}^n / n, \end{aligned}$$

where \(C_d > 0\) is the constant of Theorem 1.

1.7 Organization

The rest of the paper is structured as follows. We review the necessary background on Fourier analysis on the boolean cube in Sect. 2. In Sect. 3, we recall a connection between the inner Lasserre hierarchy and the roots of orthogonal polynomials. Then, in Sect. 4, we give a proof of Theorem 1. In Sect. 5, we discuss how to generalize the proofs of Sect. 4 to obtain Theorem 3. We group the proofs of some technical results needed to prove Lemma 5 in Sect. 6. In Appendix A, we indicate how our arguments extend to the case of polynomial optimization over the q-ary hypercube \(\{0,1,\ldots ,q-1\}^n\) for \(q>2\). Finally, we give the proofs of our results in the matrix-valued setting (Theorems 4 and 5) in Appendix B.

2 Fourier analysis on the Boolean cube

In this section, we cover some standard Fourier analysis on the boolean cube. For a general reference on Fourier analysis on (finite, abelian) groups, see e.g. [43]. See also Section 4 of [45] for the boolean case.

Some notation. For \(n \in {\mathbb {N}}\), we write \({\mathbb {B}}^{n} = \{0, 1\}^n\) for the boolean hypercube of dimension n. We let \(\mu \) denote the uniform probability measure on \({\mathbb {B}}^{n}\), given by \(\mu = \frac{1}{2^n} \sum _{x \in {\mathbb {B}}^{n}}\delta _x\), where \(\delta _x\) is the Dirac measure at x. Further, we write \(|x| = \sum _{i}x_i = |\{i\in [n]: x_i = 1 \}|\) for the Hamming weight of \(x \in {\mathbb {B}}^{n}\), and \(d(x,y) = |\{i\in [n]: x_i \ne y_i\}|\) for the Hamming distance between \(x, y \in {\mathbb {B}}^{n}\). We let \(\text {Sym}(n)\) denote the set of permutations of the set \([n]=\{1,\ldots ,n\}\).

We consider polynomials \(p : {\mathbb {B}}^{n} \rightarrow {\mathbb {R}}\) on \({\mathbb {B}}^{n}\). The space \({\mathcal {R}}[{x}]\) of such polynomials is given by the quotient ring of \({\mathbb {R}}[x]\) over the equivalence relation \(p \sim q\) if \( p(x) = q(x)\) on \({\mathbb {B}}^{n}\). In other words, \({\mathcal {R}}[{x}]= \mathbb R[x]/{\mathcal {I}}\), where \({\mathcal {I}}\) is the ideal generated by the polynomials \(x_i-x_i^2\) for \(i\in [n]\), which can also be seen as the vector space spanned by the (multilinear) polynomials \(\prod _{i\in I}x_i\) for \(I\subseteq [n].\)

For \(a\le b \in {\mathbb {N}}\), we let [a : b] denote the set of integers \(a, a + 1, \dots , b\).

The character basis. Let \(\langle \cdot , \cdot \rangle _\mu \) be the inner product on \({\mathcal {R}}[{x}]\) given by:

$$\begin{aligned} \langle p, q \rangle _{\mu } = \int _{{\mathbb {B}}^{n}} p(x) q(x) d \mu (x) = \frac{1}{2^n} \sum _{x \in {\mathbb {B}}^{n}} p(x) q(x). \end{aligned}$$

The space \({\mathcal {R}}[{x}]\) has an orthonormal basis w.r.t. \(\langle \cdot , \cdot \rangle _{\mu }\) given by the characters:

$$\begin{aligned} \chi _a(x) := (-1)^{x \cdot a} = \prod _{i:a_i=1}(1-2x_i) \quad \left( a \in {\mathbb {B}}^{n}\right) . \end{aligned}$$
(14)

In other words, the set \(\{ \chi _a : a \in {\mathbb {B}}^{n} \}\) of all characters on \({\mathbb {B}}^{n}\) forms a basis for \({\mathcal {R}}[{x}]\) and

$$\begin{aligned} \langle \chi _a, \chi _b \rangle _{\mu } = \frac{1}{2^n} \sum _{x \in {\mathbb {B}}^{n}} \chi _a(x) \chi _b(x) = \delta _{a, b} \quad \forall a, b \in {\mathbb {B}}^{n}. \end{aligned}$$
(15)

Then any polynomial \(p \in {\mathcal {R}}[{x}]\) can be expressed in the basis of characters, known as its Fourier expansion:

$$\begin{aligned} p(x) = \sum _{a \in {\mathbb {B}}^{n}} \widehat{p}(a) \chi _a(x) \quad \forall x \in {\mathbb {B}}^{n} \end{aligned}$$
(16)

with Fourier coefficients \(\widehat{p}(a) := \langle p, \chi _a \rangle _{\mu } \in {\mathbb {R}}\).
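The orthonormality relation (15) and the Fourier expansion (16) are easy to verify numerically; the following sketch (illustrative, with assumed \(n=3\) and a sample polynomial chosen here) does so by brute force:

```python
from itertools import product

# Brute-force verification of orthonormality (15) and the Fourier
# expansion (16), for assumed n = 3 and the sample polynomial x1 + x1*x2.
n = 3
cube = list(product((0, 1), repeat=n))

def chi(a, x):  # character chi_a(x) = (-1)^{x . a}
    return (-1) ** sum(ai * xi for ai, xi in zip(a, x))

def inner(p, q):  # <p, q>_mu, mu the uniform probability measure on B^n
    return sum(p(x) * q(x) for x in cube) / 2 ** n

# Orthonormality: <chi_a, chi_b> = delta_{a,b}
for a in cube:
    for b in cube:
        assert inner(lambda x: chi(a, x), lambda x: chi(b, x)) == (1 if a == b else 0)

# Fourier expansion of p recovers p at every point of B^n.
p = lambda x: x[0] + x[0] * x[1]
phat = {a: inner(p, lambda x: chi(a, x)) for a in cube}
for x in cube:
    assert abs(sum(phat[a] * chi(a, x) for a in cube) - p(x)) < 1e-12
```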

The group \({{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n})\) of automorphisms of \({\mathbb {B}}^{n}\) is generated by the coordinate permutations, of the form \(x\mapsto \sigma (x):=(x_{\sigma (1)},\ldots ,x_{\sigma (n)})\) for \(\sigma \in \text {Sym}(n)\), and the automorphisms corresponding to bit-flips, of the form \(x\in {\mathbb {B}}^{n}\mapsto x\oplus a\in {\mathbb {B}}^{n}\) for \(a\in {\mathbb {B}}^{n}\). If we set

$$\begin{aligned} H_k := {{\,\mathrm{span}\,}}\{ \chi _a : |a| = k\} \quad (0 \le k \le n), \end{aligned}$$

then each \(H_k\) is an irreducible, \({{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n})\)-invariant subspace of \({\mathcal {R}}[{x}]\) of dimension \({n \atopwithdelims ()k}\). Using (16), we may then decompose \({\mathcal {R}}[{x}]\) as the direct sum

$$\begin{aligned} {\mathcal {R}}[{x}] = H_0 \perp H_1 \perp \dots \perp H_n, \end{aligned}$$

where the subspaces \(H_k\) are pairwise orthogonal w.r.t. \(\langle \cdot , \cdot \rangle _\mu \). In fact, we have that \({\mathcal {R}}[{x}]_d = H_0 \perp H_1 \perp \dots \perp H_d\) for all \(d \le n\), and we may thus write any \(p \in {\mathcal {R}}[{x}]_d\) (in a unique way) as

$$\begin{aligned} p = p_0 + p_1 + \dots + p_d \quad (p_k \in H_k). \end{aligned}$$
(17)

The polynomials \(p_k\in H_k\) (\(k=0,\ldots ,d\)) are known as the harmonic components of p and the decomposition (17) as the harmonic decomposition of p. We will make extensive use of this decomposition throughout.

Let \({\mathrm{St}}(0) \subseteq {{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n})\) be the set of automorphisms fixing \(0 \in {\mathbb {B}}^{n}\), which consists of the coordinate permutations \(x\mapsto \sigma (x) =(x_{\sigma (1)},\ldots ,x_{\sigma (n)})\) for \(\sigma \in \text {Sym}(n)\). The subspace of functions in \(H_k\) that are invariant under \({\mathrm{St}}(0)\) is one-dimensional and it is spanned by the function

$$\begin{aligned} X_k(x) := \sum _{|a| = k} \chi _a(x). \end{aligned}$$
(18)

These functions \(X_k\) are known as the zonal spherical functions with pole \({0 \in {\mathbb {B}}^{n}}\).

Krawtchouk polynomials For \(k \in {\mathbb {N}}\), the Krawtchouk polynomial of degree k (and with parameter n) is the univariate polynomial in t given by:

$$\begin{aligned} {\mathcal {K}}^n_k(t) := \sum _{i=0}^k (-1)^i {t \atopwithdelims ()i} {n-t \atopwithdelims ()k-i} \end{aligned}$$
(19)

(see, e.g. [41]). The Krawtchouk polynomials form an orthogonal basis for \({\mathbb {R}}[t]\) with respect to the inner product given by the following discrete probability measure on the set \([0:n]=\{0,1,\ldots ,n\}\):

$$\begin{aligned} \omega := \frac{1}{2^n} \sum _{t=0}^n w(t) \delta _t, \text { with } w(t) := {n \atopwithdelims ()t}. \end{aligned}$$

Indeed, for all \(k, k' \in {\mathbb {N}}\) we have:

$$\begin{aligned} \langle {\mathcal {K}}^n_k, {\mathcal {K}}^n_{k'}\rangle _\omega :={1\over 2^n} \sum _{t=0}^n {\mathcal {K}}^n_k(t) {\mathcal {K}}^n_{k'}(t) w(t) = \delta _{k, k'} {n \atopwithdelims ()k} . \end{aligned}$$
(20)
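As a quick sanity check, the defining formula (19) and the orthogonality relation (20) can be verified by direct enumeration for small n. The following Python sketch (the function names are ours) works in exact integer arithmetic:

```python
from math import comb

def kraw(n, k, t):
    # Krawtchouk polynomial K^n_k(t) at an integer point t, via formula (19);
    # math.comb(a, b) returns 0 when b > a, matching the binomial convention
    return sum((-1)**i * comb(t, i) * comb(n - t, k - i) for i in range(k + 1))

n = 10
for k in range(n + 1):
    for kp in range(n + 1):
        # 2^n * <K_k, K_k'>_omega, computed exactly over the integers
        ip = sum(kraw(n, k, t) * kraw(n, kp, t) * comb(n, t) for t in range(n + 1))
        assert ip == (2**n * comb(n, k) if k == kp else 0)  # relation (20)
```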

The following (well-known) lemma explains the connection between the Krawtchouk polynomials and the character basis on \({\mathcal {R}}[{x}]\).

Lemma 2

Let \(t \in [0:n]\) and choose \(x, y \in {\mathbb {B}}^{n}\) so that \(d(x, y) = t\). Then for any \(0\le k \le n\) we have:

$$\begin{aligned} {\mathcal {K}}^n_k(t) = \sum _{|a| = k} \chi _a(x) \chi _a(y). \end{aligned}$$
(21)

In particular, we have:

$$\begin{aligned} {\mathcal {K}}^n_k(t) = \sum _{|a| = k} \chi _a(1^t 0^{n-t}) = X_k(1^t 0^{n-t}), \end{aligned}$$
(22)

where \(1^t 0^{n-t} \in {\mathbb {B}}^{n}\) is given by \((1^t 0^{n-t})_i = 1\) if \(1\le i \le t\) and \((1^t 0^{n-t})_i = 0\) if \(t+1\le i\le n\).

Proof

Noting that \(\chi _a(x) \chi _a(y) = \chi _a(x+y)\) and \(|x + y| = d(x,y) = t\), we have:

$$\begin{aligned} \sum _{|a| = k} \chi _a(x) \chi _a(y)&= \sum _{i = 0}^k (-1)^i \cdot \# \{|a| = k : a \cdot (x+y) = i\} \\&= \sum _{i = 0}^k (-1)^i {t \atopwithdelims ()i} {n - t \atopwithdelims ()k - i} = {\mathcal {K}}^n_k(t). \end{aligned}$$

\(\square \)
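Identity (21) can likewise be checked by brute force on a small cube, using \(\chi _a(x)=(-1)^{a\cdot x}\). A Python sketch (our own naming; the value of n is an arbitrary small choice):

```python
from math import comb
from itertools import product

def kraw(n, k, t):
    # Krawtchouk polynomial (19) at an integer point t
    return sum((-1)**i * comb(t, i) * comb(n - t, k - i) for i in range(k + 1))

def chi(a, x):
    # character chi_a(x) = (-1)^{a.x} on the boolean cube
    return (-1) ** sum(ai * xi for ai, xi in zip(a, x))

n = 5
cube = list(product((0, 1), repeat=n))
for k in range(n + 1):
    for x in cube:
        for y in cube:
            t = sum(xi != yi for xi, yi in zip(x, y))  # Hamming distance d(x, y)
            lhs = sum(chi(a, x) * chi(a, y) for a in cube if sum(a) == k)
            assert lhs == kraw(n, k, t)  # identity (21)
```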

From this, we see that any polynomial \(p\in {\mathbb {R}}[x]_d\) that is invariant under the action of \({\mathrm{St}}(0)\) is of the form \(\sum _{i=0}^d \lambda _i {\mathcal {K}}^n_i(|x|)\) for some scalars \(\lambda _i\), and thus \(p(x)=u(|x|)\) for some univariate polynomial \(u \in {\mathbb {R}}[t]_d\).

It will sometimes be convenient to work with a different normalization of the Krawtchouk polynomials, given by:

$$\begin{aligned} \widehat{{\mathcal {K}}}^n_k(t) := {\mathcal {K}}^n_k(t) / {\mathcal {K}}^n_k(0) \quad (k \in {\mathbb {N}}). \end{aligned}$$
(23)

So \(\widehat{{\mathcal {K}}}^n_k(0)=1\). Note that for any \(k \in {\mathbb {N}}\), we have

$$\begin{aligned} \Vert {\mathcal {K}}^n_k\Vert ^2_\omega := \langle {\mathcal {K}}^n_k, {\mathcal {K}}^n_{k}\rangle _\omega = {n \atopwithdelims ()k} = {\mathcal {K}}^n_k(0), \end{aligned}$$

meaning that \(\widehat{{\mathcal {K}}}^n_k(t) = {\mathcal {K}}^n_k(t) / \Vert {\mathcal {K}}^n_k\Vert ^2_\omega \).

Finally we give a short proof of two basic facts on Krawtchouk polynomials that will be used below.

Lemma 3

We have:

$$\begin{aligned} \widehat{{\mathcal {K}}}^n_k(t) \le \widehat{{\mathcal {K}}}^n_0(t) = 1 \end{aligned}$$

for all \(0\le k \le n\) and \(t \in [0 : n]\).

Proof

Given \(t \in [0:n]\) consider an element \(x \in {\mathbb {B}}^{n}\) with Hamming weight t, for instance the element \(1^t 0^{n-t}\) from Lemma 2. By (22) we have

$$\begin{aligned} {\mathcal {K}}^n_k(t) = \sum _{|a| = k} \chi _a(x) \le {n \atopwithdelims ()k} = {\mathcal {K}}^n_k(0), \end{aligned}$$

making use of the fact that \(|\chi _a(x)| = 1\) for all \(a \in {\mathbb {B}}^{n}\). \(\square \)

Lemma 4

We have:

$$\begin{aligned} \begin{aligned} |\widehat{{\mathcal {K}}}^n_k(t) - \widehat{{\mathcal {K}}}^n_k(t+1)|&\le \frac{2k}{n}, \quad \quad&(t=0,1,\dots ,n-1)\\ |\widehat{{\mathcal {K}}}^n_k(t) - 1|&\le {2k\over n} \cdot t&(t=0,1,\dots ,n) \end{aligned} \end{aligned}$$
(24)

for all \(0\le k \le n\).

Proof

Let \(t\in [0:n-1]\) and \(0 < k \le n\) (the case \(k=0\) being trivial). Consider the elements \(1^t 0^{n-t}\) and \(1^{t+1} 0^{n-t-1}\) of \({\mathbb {B}}^{n}\) from Lemma 2. We have:

$$\begin{aligned} |{\mathcal {K}}^n_k(t) - {\mathcal {K}}^n_k(t+1)|&\overset{(22)}{=} |\sum _{|a| = k} \chi _a(1^t 0^{n-t}) - \chi _a(1^{t+1} 0^{n-t-1})| \\&\le 2 \cdot \#\big \{ a \in {\mathbb {B}}^{n} : |a| = k, a_{t+1} = 1\big \} = 2{n - 1 \atopwithdelims ()k-1}, \end{aligned}$$

where for the inequality we note that \(\chi _a(1^t 0^{n-t}) = \chi _a(1^{t+1} 0^{n-t-1})\) if \(a_{t+1} = 0\). As \({\mathcal {K}}^n_k(0) = {n \atopwithdelims ()k}\), this implies that:

$$\begin{aligned} |\widehat{{\mathcal {K}}}^n_k(t) - \widehat{{\mathcal {K}}}^n_k(t+1)| \le 2{n - 1 \atopwithdelims ()k-1} / {n \atopwithdelims ()k} = \frac{2k}{n}. \end{aligned}$$

This shows the first inequality of (24). The second inequality follows from the first by the triangle inequality and a telescoping sum, using the fact that \(\widehat{{\mathcal {K}}}^n_k(0)=1\). \(\square \)
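Both inequalities in (24) can be confirmed numerically in exact rational arithmetic. A Python sketch, assuming formula (19) for the Krawtchouk polynomials (the choice of n is arbitrary):

```python
from math import comb
from fractions import Fraction

def kraw(n, k, t):
    # Krawtchouk polynomial (19) at an integer point t
    return sum((-1)**i * comb(t, i) * comb(n - t, k - i) for i in range(k + 1))

n = 25
for k in range(n + 1):
    # normalized Krawtchouk values (23) as exact rationals
    kh = [Fraction(kraw(n, k, t), comb(n, k)) for t in range(n + 1)]
    for t in range(n):
        assert abs(kh[t] - kh[t + 1]) <= Fraction(2 * k, n)   # first bound in (24)
    for t in range(n + 1):
        assert abs(kh[t] - 1) <= Fraction(2 * k, n) * t       # second bound in (24)
```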

Invariant kernels and the Funk–Hecke formula Given a univariate polynomial \(u \in {\mathbb {R}}[t]\) of degree r with \(2r \ge d\), consider the kernel \({K: {\mathbb {B}}^{n} \times {\mathbb {B}}^{n} \rightarrow {\mathbb {R}}}\) defined by

$$\begin{aligned} K(x, y) := u^2(d(x,y)). \end{aligned}$$
(25)

A kernel of the form (25) coincides on the binary cube \({\mathbb {B}}^{n}\) with a polynomial of degree at most \(2\mathrm {deg}(u)\) in x, as \(d(x,y) = \sum _i (x_i + y_i - 2x_iy_i)\) for \(x, y \in {\mathbb {B}}^{n}\). Furthermore, it is invariant under \({{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n})\), in the sense that:

$$\begin{aligned} K(x, y) = K(\pi (x), \pi (y)) \quad \forall x,y\in {\mathbb {B}}^{n}, \pi \in {{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n}). \end{aligned}$$

The kernel \(K\) acts as a linear operator \({\mathbf {K}}: {\mathcal {R}}[{x}] \rightarrow {\mathcal {R}}[{x}]\) by:

$$\begin{aligned} {\mathbf {K}}p(x) := \int _{{\mathbb {B}}^{n}} p(y) K(x, y) d\mu (y) = \frac{1}{2^n} \sum _{y \in {\mathbb {B}}^{n}} p(y) K(x,y). \end{aligned}$$
(26)

We may expand the univariate polynomial \(u^2 \in {\mathbb {R}}[t]_{2r}\) in the basis of Krawtchouk polynomials as:

$$\begin{aligned} u^2(t) = \sum _{i = 0}^{2r} \lambda _i {\mathcal {K}}^n_i(t) \quad (\lambda _i \in {\mathbb {R}}). \end{aligned}$$
(27)

As we show now, the eigenvalues of the operator \({\mathbf {K}}\) are given precisely by the coefficients \(\lambda _i\) occurring in this expansion. This relation is analogous to the classical Funk–Hecke formula for spherical harmonics (see, e.g., [11]).

Theorem 6

(Funk–Hecke) Let \(p \in {\mathcal {R}}[{x}]_d\) with harmonic decomposition \(p= p_0 + p_1 + \dots + p_d\). Then we have:

$$\begin{aligned} {\mathbf {K}}p = \lambda _0 p_0 + \lambda _1 p_1 + \dots + \lambda _d p_d. \end{aligned}$$
(28)

Proof

It suffices to show that \( {\mathbf {K}}\chi _z = \lambda _{|z|} \chi _z \) for all \(z \in {\mathbb {B}}^{n}\). So we compute for \(x \in {\mathbb {B}}^{n}\):

$$\begin{aligned} {\mathbf {K}}\chi _z(x) = \frac{1}{2^n} \sum _{y \in {\mathbb {B}}^{n}} \chi _z(y) u^2(d(x, y))&\overset{(27)}{=} \frac{1}{2^n} \sum _{y \in {\mathbb {B}}^{n}} \chi _z(y) \sum _{i = 0}^{2r} \lambda _i {\mathcal {K}}^n_i(d(x,y)) \\&\overset{(21)}{=} \frac{1}{2^n} \sum _{i = 0}^{2r} \lambda _i \sum _{y \in {\mathbb {B}}^{n}} \chi _z(y) \sum _{|a| = i} \chi _{a}(x) \chi _{a}(y) \\&= \frac{1}{2^n} \sum _{i = 0}^{2r} \lambda _i \sum _{|a| = i} \Big (\sum _{y \in {\mathbb {B}}^{n}} \chi _z(y) \chi _{a}(y) \Big ) \chi _{a}(x) \\&\overset{(15)}{=} \frac{1}{2^n} \sum _{i = 0}^{2r} \lambda _i \sum _{|a| = i} 2^n \delta _{z, a} \chi _{a}(x) \\&= \lambda _{|z|} \chi _z(x). \end{aligned}$$

\(\square \)

Finally, we note that since the Krawtchouk polynomials form an orthogonal basis for \({\mathbb {R}}[t]\), we may express the coefficients \(\lambda _i\) in the decomposition (27) of \(u^2\) in the following way:

$$\begin{aligned} \lambda _i = \langle {\mathcal {K}}^n_i, u^2 \rangle _\omega ~/~ \Vert {\mathcal {K}}^n_i\Vert ^2_\omega = \langle \widehat{{\mathcal {K}}}^n_i, u^2 \rangle _\omega . \end{aligned}$$
(29)

In addition, since in view of Lemma 3 we have \(\widehat{{\mathcal {K}}}^n_i(t)\le \widehat{{\mathcal {K}}}^n_0(t)\) for all \(t\in [0:n]\), it follows that

$$\begin{aligned} \lambda _i \le \lambda _0 \quad \text { for } 0\le i\le 2r. \end{aligned}$$
(30)
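The Funk–Hecke formula (28), together with the expression (29) for the coefficients \(\lambda _i\), can be verified by brute force on a small cube. In the Python sketch below, the choice \(u(t)=1+2t\) is an arbitrary illustration, not a choice used elsewhere in the paper:

```python
from math import comb
from fractions import Fraction
from itertools import product

n = 5
cube = list(product((0, 1), repeat=n))

def kraw(k, t):
    # Krawtchouk polynomial (19) with parameter n
    return sum((-1)**i * comb(t, i) * comb(n - t, k - i) for i in range(k + 1))

def chi(a, x):
    return (-1) ** sum(ai * xi for ai, xi in zip(a, x))

def dist(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

u2 = lambda t: (1 + 2 * t) ** 2   # u(t) = 1 + 2t, an arbitrary test polynomial

# coefficients of u^2 in the Krawtchouk basis, via the orthogonality relation (29)
lam = [Fraction(sum(u2(t) * kraw(i, t) * comb(n, t) for t in range(n + 1)),
                2**n * comb(n, i)) for i in range(n + 1)]

# check K chi_z = lambda_{|z|} chi_z, i.e. the Funk-Hecke formula (28)
for z in cube:
    for x in cube:
        Kchi = Fraction(sum(chi(z, y) * u2(dist(x, y)) for y in cube), 2**n)
        assert Kchi == lam[sum(z)] * chi(z, x)
```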

3 The inner Lasserre hierarchy and orthogonal polynomials

The inner Lasserre hierarchy, which we have defined for the boolean cube in (7), may be defined more generally for the minimization of a polynomial g over a compact set \(M \subseteq {\mathbb {R}}^n\) equipped with a measure \(\nu \) with support M, by setting:

$$\begin{aligned} g^{(r)} := g^{(r)}_{M, \nu } = \inf _{s \in \varSigma _r} \left\{ \int _M {g \cdot s d\nu } : \int _M s d\nu = 1 \right\} \end{aligned}$$
(31)

for any integer \(r\in {\mathbb {N}}\). So we have: \(g^{(r)}\ge g_{\min }:=\min _{x\in M}g(x).\) A crucial ingredient of the proof of our main theorem below is an analysis of the error \(g^{(r)} - g_{\min }\) in this hierarchy for a special choice of \(M \subseteq {\mathbb {R}}\), g and \(\nu \).

Here, we recall a technique for performing such an analysis in the univariate case, developed in [9] and further employed for this purpose, e.g., in [8, 40].

First, we observe that we may always replace g by a suitable upper estimator \({\widehat{g}}\) which satisfies \(\widehat{g}_{\min } = g_{\min }\) and \({\widehat{g}}(x) \ge g(x)\) for all \(x \in M\). Indeed, it is clear that for such \({\widehat{g}}\) we have:

$$\begin{aligned} g^{(r)} - g_{\min } \le {\widehat{g}}^{(r)} - g_{\min } = \widehat{g}^{(r)} - {\widehat{g}}_{\min }. \end{aligned}$$

Next, we consider the special case when \(M\subseteq {\mathbb {R}}\) and \(g(t) = t\). Here, the bound \(g^{(r)}\) may be expressed in terms of the orthogonal polynomials on M w.r.t. the measure \(\nu \), i.e., the polynomials \(p_i \in {\mathbb {R}}[t]_i\) determined by the relation:

$$\begin{aligned} \int _M p_i p_j d\nu = 0 \text { if } i \ne j. \end{aligned}$$

Theorem 7

([9]) Let \(M \subseteq {\mathbb {R}}\) be an interval and let \(\nu \) be a measure supported on M with corresponding orthogonal polynomials \(p_i\in {\mathbb {R}}[t]_i\) (\(i \in {\mathbb {N}}\)). Then the Lasserre inner bound \(g^{(r)}\) from (31) of order r for the polynomial \(g(t) = t\) equals

$$\begin{aligned} g^{(r)} = \xi _{r+1}, \end{aligned}$$

where \(\xi _{r+1}\) is the smallest root of the polynomial \(p_{r+1}\).

Remark 1

The upshot of the above result is that, for any polynomial \(g : {\mathbb {R}}\rightarrow {\mathbb {R}}\) with minimum \(g_{\min }=g(0)=0\) on an interval \(M\subseteq [0,\infty )\) containing 0, and which is upper bounded on M by a linear polynomial \({\widehat{g}}(t) = ct\) for some \(c > 0\), we have:

$$\begin{aligned} g^{(r)} - g_{\min } \le c \cdot \xi _{r+1}, \end{aligned}$$
(32)

where \(\xi _{r+1}\) is the smallest root of the corresponding orthogonal polynomial of degree \(r+1\).

4 Proof of Theorem 1

Throughout, \(d\le n\) is a fixed integer (the degree of the polynomial f to be minimized over \({\mathbb {B}}^{n}\)). Recall that \(u \in {\mathbb {R}}[t]\) is a univariate polynomial of degree r with \(2r\ge d\), which we still need to select. Consider the associated kernel \(K\) defined in (25) and the corresponding linear operator \({\mathbf {K}}\) from (26). Recall from (9) that we are interested in bounding the quantity:

$$\begin{aligned} \Vert {\mathbf {K}}^{-1} - I\Vert := \sup _{p \in {\mathbb {R}}[x]_d} \frac{\Vert {\mathbf {K}}^{-1} p - p \Vert _\infty }{\Vert p\Vert _\infty }. \end{aligned}$$

Our proof consists of two parts. First, we relate the coefficients \(\lambda _i\), that appear in the decomposition (27) of \(u^2(t) = \sum _{i = 0}^{2r} \lambda _i {\mathcal {K}}^n_i(t)\) into the basis of Krawtchouk polynomials, to the quantity \(\Vert {\mathbf {K}}^{-1} - I\Vert \).

Then, using this relation and the connection between the inner Lasserre hierarchy and extremal roots of orthogonal polynomials outlined in Sect. 3, we show that u may be chosen such that \(\Vert {\mathbf {K}}^{-1}-I\Vert \) is of the order \(\xi ^n_{r+1} / n\), with \(\xi ^n_{r+1}\) the smallest root of the degree \(r+1\) Krawtchouk polynomial (with parameter n).

Bounding \(\Vert {\mathbf {K}}^{-1}-I\Vert \) in terms of the \(\lambda _i\) We need the following technical lemma, which bounds the sup-norm \(\Vert p_k\Vert _\infty \) of the harmonic components \(p_k\) of a polynomial \(p \in {\mathcal {R}}[{x}]\) in terms of \(\Vert p\Vert _\infty \), the sup-norm of p itself. The key point is that this bound is independent of the dimension n. We delay the proof, which is rather technical, to Sect. 6.

Lemma 5

There exists a constant \(\gamma _d > 0\), depending only on d, such that for any \(p = p_0 + p_1 + \ldots + p_d \in {\mathcal {R}}[{x}]_d\), we have:

$$\begin{aligned} \Vert p_k \Vert _\infty \le \gamma _d \Vert p\Vert _\infty \text { for all } 0\le k \le d. \end{aligned}$$

Corollary 2

Assume that \(\lambda _0 = 1\) and \(\lambda _k \ne 0\) for \(1\le k\le d\). Then we have:

$$\begin{aligned} \Vert {\mathbf {K}}^{-1} - I\Vert \le \gamma _d \cdot \varLambda , \text { where } \varLambda := \sum _{i = 1}^d |\lambda _i^{-1} - 1|. \end{aligned}$$
(33)

Proof

By assumption, the operator \({\mathbf {K}}\) is invertible on \({\mathcal {R}}[{x}]_d\) and, in view of the Funk–Hecke relation (28), its inverse is given by \({\mathbf {K}}^{-1}p =\sum _{i=0}^d \lambda _i^{-1}p_i\) for any \(p = p_0 + p_1 + \ldots + p_d \in {\mathcal {R}}[{x}]_d\). Then we have:

$$\begin{aligned} \begin{aligned} \Vert {\mathbf {K}}^{-1} p - p\Vert _\infty&= \Vert \sum _{i=1}^d (\lambda _i^{-1} - 1)p_i \Vert _\infty \\&\le \sum _{i=1}^d |\lambda _i^{-1} - 1| \Vert p_i\Vert _\infty \\&\le \sum _{i=1}^d |\lambda _i^{-1} - 1| \cdot \gamma _d \Vert p \Vert _\infty , \end{aligned} \end{aligned}$$
(34)

where we use Lemma 5 for the last inequality. \(\square \)

The expression \(\varLambda \) in (33) is difficult to analyze. Therefore, following [11], we consider instead the simpler expression:

$$\begin{aligned} {\tilde{\varLambda }}:= \sum _{i=1}^d (1 - \lambda _i) = d - \sum _{i=1}^d \lambda _i, \end{aligned}$$

which is linear in the \(\lambda _i\). Under the assumption that \(\lambda _0 = 1\), we have \({\lambda _i \le \lambda _0 = 1}\) for all i (recall relation (30)). Thus, \(\varLambda \) and \({\tilde{\varLambda }}\) are both minimized when the \(\lambda _i\) are close to 1. The following lemma makes this precise.

Lemma 6

Assume that \(\lambda _0 = 1\) and that \({\tilde{\varLambda }}\le 1/2\). Then we have \(\varLambda \le 2 {\tilde{\varLambda }}\), and thus

$$\begin{aligned} \Vert {\mathbf {K}}^{-1} - I\Vert \le 2 \gamma _d \cdot {\tilde{\varLambda }}. \end{aligned}$$

Proof

As we assume \({\tilde{\varLambda }}\le 1/2\), we must have \(1/2 \le \lambda _i \le 1\) for all i. Therefore, we may write:

$$\begin{aligned} \varLambda = \sum _{i = 1}^d |\lambda _i^{-1} - 1| = \sum _{i = 1}^d |(1 - \lambda _i) / \lambda _i| = \sum _{i = 1}^d (1 - \lambda _i) / \lambda _i \le 2\sum _{i = 1}^d (1 - \lambda _i) = 2 {\tilde{\varLambda }}. \end{aligned}$$

\(\square \)

Optimizing the choice of the univariate polynomial u In light of Lemma 6, and recalling (29), we wish to find a univariate polynomial \(u \in {\mathbb {R}}[t]_r\) for which

$$\begin{aligned} \lambda _0&= \langle 1, u^2 \rangle _\omega = 1, \text { and }\\ {\tilde{\varLambda }}= d - \sum _{i = 1}^d \lambda _i&= d - \sum _{i=1}^d \langle \widehat{{\mathcal {K}}}^n_i, u^2 \rangle _\omega \text { is small.} \end{aligned}$$

Unpacking the definition of \(\langle \cdot , \cdot \rangle _{\omega }\), we thus need to solve the following optimization problem:

$$\begin{aligned} \inf _{u \in {\mathbb {R}}[t]_r} \left\{ \int g \cdot u^2 d\omega : \int u^2 d \omega = 1 \right\} , \text { where } g(t) := d - \sum _{i=1}^d \widehat{{\mathcal {K}}}^n_i(t). \end{aligned}$$
(35)

(Indeed \(\int gu^2d\omega =\langle g,u^2\rangle _\omega = {\tilde{\varLambda }}\) and \(\int u^2d\omega = \langle 1,u^2\rangle _\omega \).) We recognize this program to be the same as the program (31) defining the inner Lasserre bound of order r for the minimum \(g_{\min } = g(0) = 0\) of the polynomial g over [0 : n], computed with respect to the discrete measure \(\omega \) with weights \(\omega (\{t\}) = 2^{-n}{n \atopwithdelims ()t}\). Hence the optimal value of (35) is equal to \(g^{(r)}\) and, using Lemma 6, we may conclude the following.

Theorem 8

Let g be as in (35). Assume that \(g^{(r)} - g_{\min } \le 1/2\). Then there exists a polynomial \(u \in {\mathbb {R}}[t]_r\) such that \(\lambda _0 = 1\) and

$$\begin{aligned} \Vert {\mathbf {K}}^{-1} - I \Vert \le 2 \gamma _d \cdot (g^{(r)} - g_{\min }). \end{aligned}$$

Here, \(g^{(r)}\) is the inner Lasserre bound on \(g_{\min }\) of order r, computed on [0 : n] w.r.t. \(\omega \) via the program (35), and \(\gamma _d\) is the constant of Lemma 5.

It remains, then, to analyze the gap \(g^{(r)} - g_{\min }\). For this purpose, we follow the technique outlined in Sect. 3. We first show that g can be upper bounded by its linear approximation at \(t = 0\).

Lemma 7

We have:

$$\begin{aligned} g(t) \le {\widehat{g}}(t) := d(d+1) \cdot (t/n) \quad \forall t \in [0:n]. \end{aligned}$$

Furthermore, the minimum \({\hat{g}}_{\min }\) of \({\hat{g}}\) on [0 : n] clearly satisfies \({\widehat{g}}_{\min } = {\widehat{g}}(0) = g(0) = g_{\min }\).

Proof

Using (24), we find for each \(k \le n\) that:

$$\begin{aligned} \widehat{{\mathcal {K}}}^n_k(t) \ge \widehat{{\mathcal {K}}}^n_k(0) - \frac{2k}{n} \cdot t = 1 - \frac{2k}{n} \cdot t \quad \forall t \in [0 : n]. \end{aligned}$$

Therefore, we have:

$$\begin{aligned} g(t) := d - \sum _{k=1}^d \widehat{{\mathcal {K}}}^n_k(t) \le \sum _{k=1}^d \frac{2k}{n} \cdot t = d(d+1) \cdot (t/n) \quad \forall t \in [0 : n]. \end{aligned}$$

\(\square \)
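The estimator of Lemma 7 can be checked pointwise in exact arithmetic. A Python sketch (the sample values of n and d are arbitrary):

```python
from math import comb
from fractions import Fraction

def kraw(n, k, t):
    # Krawtchouk polynomial (19) at an integer point t
    return sum((-1)**i * comb(t, i) * comb(n - t, k - i) for i in range(k + 1))

n, d = 30, 4
for t in range(n + 1):
    # g(t) = d - sum_k khat_k(t), as in (35), evaluated exactly
    g = d - sum(Fraction(kraw(n, k, t), comb(n, k)) for k in range(1, d + 1))
    assert 0 <= g <= Fraction(d * (d + 1) * t, n)   # Lemma 7, and g >= g_min = 0
```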

Lemma 8

We have:

$$\begin{aligned} g^{(r)} - g_{\min } \le d(d+1) \cdot (\xi _{r+1}^n/ n), \end{aligned}$$
(36)

where \(\xi _{r + 1}^n\) is the smallest root of the Krawtchouk polynomial \({\mathcal {K}}^n_{r+1}(t)\).

Proof

This follows immediately from Lemma 7 and Remark 1 at the end of Sect. 3, noting that the Krawtchouk polynomials are indeed orthogonal w.r.t. the measure \(\omega \) on [0 : n] (cf. (20)). \(\square \)
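The quantity \(\xi ^n_{r+1}\) appearing in (36) can be computed numerically by extending (19) to real arguments and locating the first sign change of \({\mathcal {K}}^n_{r+1}\) by bisection. The following Python sketch (our own construction, for illustration only) confirms that the smallest root decreases as the degree grows, as guaranteed by the interlacing of roots of orthogonal polynomials:

```python
from math import factorial

def binom_real(x, j):
    # binomial coefficient C(x, j) extended to real x
    p = 1.0
    for m in range(j):
        p *= (x - m)
    return p / factorial(j)

def kraw_real(n, k, t):
    # Krawtchouk polynomial (19) evaluated at a real point t
    return sum((-1)**i * binom_real(t, i) * binom_real(n - t, k - i)
               for i in range(k + 1))

def smallest_root(n, k, steps=8000):
    # K^n_k(0) = C(n,k) > 0; scan for the first sign change, then bisect
    prev = 0.0
    for s in range(1, steps + 1):
        t = n * s / steps
        if kraw_real(n, k, t) <= 0:
            lo, hi = prev, t
            for _ in range(60):
                mid = (lo + hi) / 2
                if kraw_real(n, k, mid) > 0:
                    lo = mid
                else:
                    hi = mid
            return (lo + hi) / 2
        prev = t

n = 40
xis = [smallest_root(n, k) for k in (3, 6, 12, 24)]
assert all(a > b for a, b in zip(xis, xis[1:]))   # xi^n_{r+1} decreases as r grows
```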

Putting things together, we may prove our main result, Theorem 1.

Proof of Theorem 1

Assume that r is large enough so that \({d(d+1)\cdot (\xi ^n_{r+1}/n) \le 1/2}\). By Lemma 8, we then have

$$\begin{aligned} g^{(r)} - g_{\min } \le d(d+1) \cdot (\xi _{r+1}^n/ n) \le 1/2. \end{aligned}$$

By Theorem 8 we are thus able to choose a polynomial \(u \in {\mathbb {R}}[t]_r\) whose associated operator \({\mathbf {K}}\) satisfies \({\mathbf {K}}(1)=1\) and

$$\begin{aligned} \Vert {\mathbf {K}}^{-1}-I\Vert \le 2\gamma _d \cdot d(d+1) \cdot (\xi _{r+1}^n/n). \end{aligned}$$

We may then use Lemma 1 to obtain Theorem 1 with constant \({C_d := \gamma _d \cdot d(d+1)}\). \(\square \)

5 Proof of Theorem 3

We turn now to analyzing the inner hierarchy \(\smash {f^{({r})}}\) defined in (7) for a polynomial \(f\in {\mathbb {R}}[x]_d\) on the boolean cube, whose definition is repeated for convenience:

$$\begin{aligned} \smash {f^{({r})}} := \min _{s \in \varSigma [x]_r} \left\{ \int _{{\mathbb {B}}^{n}} f(x) \cdot s(x) d\mu : \int _{{\mathbb {B}}^{n}} s(x) d \mu = 1 \right\} \ge f_{\min }. \end{aligned}$$
(37)

As before, we may assume w.l.o.g. that \(f_{\min } = f(0) = 0\) and that \(f_{\max } = 1\). To facilitate the analysis of the bounds \(f^{(r)}\), the idea is to restrict in (37) to polynomials s(x) that are invariant under the action of \({\mathrm{St}}(0) \subseteq {{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n})\), i.e., depending only on the Hamming weight |x|. Such polynomials are of the form \(s(x)=u(|x|)\) for some univariate polynomial \(u\in {\mathbb {R}}[t]\). Hence this leads to the following, weaker hierarchy, where we now optimize over univariate sums-of-squares:

$$\begin{aligned} \smash {f_{\mathrm{sym}}^{({r})}} := \min _{u\in \varSigma [t]_r} \left\{ \int _{{\mathbb {B}}^{n}} f(x) \cdot u(|x|) d\mu (x) : \int _{{\mathbb {B}}^{n}} u(|x|) d \mu (x) = 1 \right\} . \end{aligned}$$

By definition, we must have \(\smash {f_{\mathrm{sym}}^{({r})}} \ge \smash {f^{({r})}} \ge f_{\min }\), and so an analysis of \(\smash {f_{\mathrm{sym}}^{({r})}}\) extends immediately to \(\smash {f^{({r})}}\).

The main advantage of working with the hierarchy \(\smash {f_{\mathrm{sym}}^{({r})}}\) is that we may now assume that f is itself invariant under \({\mathrm{St}}(0)\), after replacing f by its symmetrization:

$$\begin{aligned} \frac{1}{|{\mathrm{St}}(0)|}\sum _{\sigma \in {\mathrm{St}}(0)} f(\sigma (x)). \end{aligned}$$

Indeed, for any \(u \in \varSigma [t]_r\), we have that:

$$\begin{aligned} \int _{{\mathbb {B}}^{n}} f(x) u(|x|) d\mu (x)&= \frac{1}{|{\mathrm{St}}(0)|} \sum _{\sigma \in {\mathrm{St}}(0)}\int _{{\mathbb {B}}^{n}} f(\sigma (x)) u(|\sigma (x)|) d\mu (\sigma (x)) \\&= \int _{{\mathbb {B}}^{n}} \frac{1}{|{\mathrm{St}}(0)|} \sum _{\sigma \in {\mathrm{St}}(0)} f(\sigma (x)) u(|x|) d\mu (x). \end{aligned}$$

So we now assume that f is \({\mathrm{St}}(0)\)-invariant, and thus we may write:

$$\begin{aligned} f(x) = F(|x|) \text { for some polynomial } F(t) \in {\mathbb {R}}[t]_d. \end{aligned}$$

By the definitions of the measures \(\mu \) on \({\mathbb {B}}^{n}\) and \(\omega \) on [0 : n] we have the identities:

$$\begin{aligned} \int _{{\mathbb {B}}^{n}}u(|x|)d\mu (x)&= \int _{[0:n]}u(t)d\omega (t), \\ \int _{{\mathbb {B}}^{n}} F(|x|) u(|x|) d\mu (x)&= \int _{[0:n]} F(t)u(t)d\omega (t). \end{aligned}$$

Hence we get

$$\begin{aligned} \smash {f_{\mathrm{sym}}^{({r})}} = \min _{u \in \varSigma [t]_r} \Big \{ \int _{[0:n]} F(t) \cdot u(t) d\omega (t) : \int _{[0:n]} u(t) d \omega (t) = 1 \Big \} = F_{[0: n], \omega }^{(r)}. \end{aligned}$$
(38)

In other words, the behaviour of the symmetrized inner hierarchy \(\smash {f_{\mathrm{sym}}^{({r})}}\) over the boolean cube w.r.t. the uniform measure \(\mu \) is captured by the behaviour of the univariate inner hierarchy \(F_{[0: n], \omega }^{(r)}\) over [0 : n] w.r.t. the discrete measure \(\omega \).
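The reduction above rests on the fact that the pushforward of \(\mu \) under \(x \mapsto |x|\) is exactly \(\omega \): the level set \(\{x : |x| = t\}\) has \(\mu \)-measure \(2^{-n}{n \atopwithdelims ()t}\). A Python sketch with an arbitrary test polynomial:

```python
from math import comb
from fractions import Fraction
from itertools import product

n = 8
u = lambda t: 3 * t**2 - 5 * t + 7   # arbitrary univariate test polynomial

# integral of u(|x|) over B^n w.r.t. the uniform measure mu
lhs = Fraction(sum(u(sum(x)) for x in product((0, 1), repeat=n)), 2**n)
# integral of u over [0:n] w.r.t. the discrete measure omega
rhs = Fraction(sum(u(t) * comb(n, t) for t in range(n + 1)), 2**n)
assert lhs == rhs
```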

Now, we are in a position to make use again of the technique outlined in Sect. 3. First we find a linear upper estimator \({\widehat{F}}\) for F on [0 : n].

Lemma 9

We have

$$\begin{aligned} F(t) \le {\widehat{F}}(t):= d(d+1) \cdot \gamma _d \cdot t/n \quad \forall t \in [0 : n], \end{aligned}$$

where \(\gamma _d\) is the same constant as in Lemma 5.

Proof

Write \(F(t)= \sum _{i=0}^d \lambda _i \widehat{{\mathcal {K}}}^n_i(t) \) for some scalars \(\lambda _i\). By assumption, \({F(0)=0}\) and thus \(\sum _{i=0}^d\lambda _i=0\). We now use an analogous argument as for Lemma 7:

$$\begin{aligned} F(t)=\sum _{i=0}^d \lambda _i (\widehat{{\mathcal {K}}}^n_i(t)-1)&\le \sum _{i=0}^d |\lambda _i| |\widehat{{\mathcal {K}}}^n_i(t)-1| {\mathop {\le }\limits ^{(24)}} \max _i|\lambda _i| \cdot t \cdot \sum _{i=0}^d \frac{2i}{n} \\&\le \max _i|\lambda _i| \cdot t \cdot {d(d+1)\over n}. \end{aligned}$$

As \(\Vert f\Vert _\infty =1\), using Lemma 5, we can conclude that:

$$\begin{aligned} |\lambda _i|=\max _{t\in [0:n]} |\lambda _i\widehat{{\mathcal {K}}}^n_i(t)| \le \gamma _d \end{aligned}$$

which gives the desired result. \(\square \)

In light of Remark 1 in Sect. 3, we may now conclude that

$$\begin{aligned} F_{[0: n], \omega }^{(r)} \le d(d+1) \gamma _d \cdot \xi _{r+1}^n/n. \end{aligned}$$

As \(\smash {f^{({r})}} \le \smash {f_{\mathrm{sym}}^{({r})}} = F_{[0: n], \omega }^{(r)}\), we have thus shown Theorem 3 with constant \({C_d = d(d+1) \gamma _d}\). Note that in comparison to Lemma 8, we only have the additional constant factor \(\gamma _d\).

Exactness of the inner hierarchy As is the case for the outer hierarchy, the inner hierarchy is exact when r is large enough. However, whereas the outer hierarchy is already exact for \(r \ge (n+d-1)/2\), the inner hierarchy is exact in general if and only if \(r \ge n\). We give a short proof of this fact below, for reference.

Lemma 10

Let f be a polynomial on \({\mathbb {B}}^{n}\). Then \(\smash {f^{({r})}} = f_{\min }~\) for all \(r \ge n\).

Proof

We may assume w.l.o.g. that \(f(0) = f_{\min }\). Consider the interpolation polynomial:

$$\begin{aligned} s(x) := \sqrt{2^n} \prod _{i=1}^n (1-x_i) \in {\mathbb {R}}[x]_n, \end{aligned}$$

which satisfies \(s^2(0) = 2^n\) and \(s^2(x) = 0\) for all \(0 \ne x \in {\mathbb {B}}^{n}\). Clearly, we have:

$$\begin{aligned} \int f s^2 d\mu = f(0) = f_{\min }\ \text { and }\ \int s^2 d\mu = 1, \end{aligned}$$

and so \(\smash {f^{({n})}} = f_{\min }\). As the bounds \(\smash {f^{({r})}}\) are nonincreasing in r, it follows that \(\smash {f^{({r})}} = f_{\min }\) for all \(r \ge n\). \(\square \)
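The computation in the proof of Lemma 10 is easily verified by enumeration. A Python sketch with an arbitrary example f satisfying \(f_{\min }=f(0)=0\):

```python
from itertools import product
from fractions import Fraction

n = 6
cube = list(product((0, 1), repeat=n))

def s2(x):
    # s(x)^2 for s(x) = sqrt(2^n) * prod_i (1 - x_i): equals 2^n at x = 0, else 0
    return 2**n if not any(x) else 0

f = lambda x: sum(x) ** 2 + 3 * sum(x)   # example with f_min = f(0) = 0

assert Fraction(sum(s2(x) for x in cube), 2**n) == 1          # int s^2 dmu = 1
assert Fraction(sum(f(x) * s2(x) for x in cube), 2**n) == 0   # int f s^2 dmu = f(0)
```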

The next lemma shows that this result is tight, by giving an example of a polynomial f for which the bound \(\smash {f^{({r})}}\) is exact only at order \(r=n\).

Lemma 11

Let \(f(x) = |x|=x_1+\cdots +x_n\). Then \(\smash {f^{({r})}} - f_{\min }> 0\) for all \(r < n\).

Proof

Suppose not, i.e., \(f^{(r)}= f_{\min }=0\) for some \(r\le n-1\). As \({f(x) > 0 = f_{\min }}\) for all \(0 \ne x \in {\mathbb {B}}^{n}\), this implies that there exists a polynomial \(s \in {\mathbb {R}}[x]_r\) such that (after rescaling) \(s^2\) is interpolating at 0, i.e. such that \(s^2(0) = 1\) and \(s^2(x) = 0\) for all \(0 \ne x \in {\mathbb {B}}^{n}\). But then s is itself interpolating at 0 and has degree \(r < n\), a contradiction. \(\square \)

6 Proof of Lemma 5

In this section we give a proof of Lemma 5, where we bound the sup-norm \(\Vert p_k \Vert _{\infty }\) of the harmonic components \(p_k\) of a polynomial p by \(\gamma _d \Vert p\Vert _\infty \) for some constant \(\gamma _d\) depending only on the degree d of p. The following definitions will be convenient.

Definition 1

For \(n \ge d \ge k \ge 0\) integers, we write:

$$\begin{aligned} \rho (n, d, k)&:= \sup \{\Vert p_k\Vert _\infty : p = p_0 + p_1 + \dots + p_d \in {\mathcal {R}}[{x}]_d, \Vert p\Vert _{\infty } \le 1 \}, \text { and }\\ \rho (n, d)&:= \max _{0 \le k \le d} \rho (n, d, k). \end{aligned}$$

We are thus interested in finding a bound \(\gamma _d\) depending only on d such that:

$$\begin{aligned} \gamma _d \ge \rho (n, d) \text { for all } n \in {\mathbb {N}}. \end{aligned}$$
(39)

We will now show that in the computation of the parameter \(\rho (n,d,k)\) we may restrict to feasible solutions p having strong structural properties. First, we show that we may assume that the sup-norm of the harmonic component \(p_k\) of p is attained at \(0 \in {\mathbb {B}}^{n}\).

Lemma 12

We have

$$\begin{aligned} \rho (n, d, k) = \sup _{p \in {\mathcal {R}}[{x}]_d^n} \left\{ p_k(0) : \Vert p\Vert _\infty \le 1 \right\} . \end{aligned}$$
(40)

Proof

Let p be a feasible solution for \(\rho (n, d, k)\) and let \(x \in {\mathbb {B}}^{n}\) be such that \(p_k(x) = \Vert p_k\Vert _\infty \) (after possibly replacing p by \(-p\)). Now choose \(\sigma \in {{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n})\) such that \(\sigma (0) = x\) and set \({\widehat{p}} = p \circ \sigma \). Clearly, \({\widehat{p}}\) is again a feasible solution for \(\rho (n, d, k)\). Moreover, as \(H_k\) is invariant under \({{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n})\), we have:

$$\begin{aligned} \Vert {\widehat{p}}_k\Vert _\infty = {\widehat{p}}_k(0) = (p \circ \sigma )_k(0) = (p_k \circ \sigma )(0) = \Vert p_k\Vert _\infty , \end{aligned}$$

which shows the lemma. \(\square \)

Next we show that we may in addition restrict to polynomials that are highly symmetric.

Lemma 13

In the program (40) we may restrict the optimization to polynomials of the form

$$\begin{aligned} p(x) = \sum _{i = 0}^d \lambda _i \sum _{|a| = i} \chi _a(x) = \sum _{i = 0}^d \lambda _i {\mathcal {K}}^n_i(|x|)\quad \text { where } \lambda _i\in {\mathbb {R}}. \end{aligned}$$

Proof

Let p be a feasible solution to (40). Consider the following polynomial \({\widehat{p}}\), obtained as the symmetrization of p under the action of \({\mathrm{St}}(0)\), the set of automorphisms of \({\mathbb {B}}^{n}\) corresponding to the coordinate permutations:

$$\begin{aligned} {\widehat{p}}(x) = \frac{1}{|{\mathrm{St}}(0)|} \sum _{\sigma \in {\mathrm{St}}(0)} (p \circ \sigma ) (x). \end{aligned}$$

Then \(\Vert {\widehat{p}}\Vert _\infty \le 1\) and \({\widehat{p}}_k(0) = p_k(0)\), so \({\widehat{p}}\) is still feasible for (40) and has the same objective value as p. Furthermore, for each i, \({\widehat{p}}_i\) is invariant under \({\mathrm{St}}(0)\), which implies that \({\widehat{p}}_i(x) = \lambda _i X_i(x) = \lambda _i \sum _{|a| = i} \chi _a(x)=\lambda _i {\mathcal {K}}^n_i(|x|)\) for some \(\lambda _i \in {\mathbb {R}}\) (see (18)). \(\square \)

A simple rescaling \(\lambda _i \leftarrow \lambda _i \cdot {n \atopwithdelims ()i}\) allows us to switch from \({\mathcal {K}}^n_i\) to \(\widehat{{\mathcal {K}}}^n_i={\mathcal {K}}^n_i/{n\atopwithdelims ()i}\) and to obtain the following reformulation of \(\rho (n, d, k)\) as a linear program.

Lemma 14

For any \(n\ge d\ge k\) we have:

$$\begin{aligned} \begin{aligned}&\rho (n, d, k) = \max \quad \lambda _k \\&s.t. \quad -1 \le \sum _{i=0}^d \lambda _i \widehat{{\mathcal {K}}}^n_i(t) \le 1 \quad (t=0,1, \dots , n). \end{aligned} \end{aligned}$$
(41)
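The linear program (41) is small enough to solve directly for moderate n and d. The following Python sketch (assuming scipy is available; the helper names are ours) computes \(\rho (n,d,k)\) this way, and checks the monotonicity in n established in Lemma 16 below:

```python
import numpy as np
from math import comb
from scipy.optimize import linprog

def kraw(n, k, t):
    # Krawtchouk polynomial (19) at an integer point t
    return sum((-1)**i * comb(t, i) * comb(n - t, k - i) for i in range(k + 1))

def rho(n, d, k):
    # solve the linear program (41): maximize lambda_k subject to
    # |sum_i lambda_i * khat_i(t)| <= 1 for t = 0, ..., n
    A = np.array([[kraw(n, i, t) / comb(n, i) for i in range(d + 1)]
                  for t in range(n + 1)])
    A_ub = np.vstack([A, -A])
    b_ub = np.ones(2 * (n + 1))
    c = np.zeros(d + 1)
    c[k] = -1.0   # linprog minimizes, so maximize lambda_k via -lambda_k
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return -res.fun

# lambda = e_k is feasible since |khat_k(t)| <= 1, so rho(n, d, k) >= 1
assert rho(8, 2, 2) >= 1 - 1e-7
# monotonicity in n, cf. Lemma 16
assert rho(9, 2, 2) >= rho(8, 2, 2) - 1e-7
```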

Limit functions The idea now is to prove a bound on \(\rho (n, d, d)\) which holds for fixed d and is independent of n. We will do this by considering ‘the limit’ of problem (41) as \(n \rightarrow \infty \). For each \(k \in {\mathbb {N}}\), we define the limit function:

$$\begin{aligned} \widehat{{\mathcal {K}}}^\infty _k(t) := \lim _{n \rightarrow \infty } \widehat{{\mathcal {K}}}^n_{k}(nt), \end{aligned}$$

which, as shown in Lemma 15 below, is in fact a polynomial. We first present the polynomial \(\widehat{{\mathcal {K}}}^\infty _k(t)\) for small k as an illustration.

Example 1

We have:

$$\begin{aligned} \widehat{{\mathcal {K}}}^n_{0}(nt) = 1&\implies \widehat{{\mathcal {K}}}^\infty _{0}(t) = 1,&\\ \widehat{{\mathcal {K}}}^n_{1}(nt) = -2t + 1&\implies \widehat{{\mathcal {K}}}^\infty _{1}(t) = -2t + 1,&\\ \widehat{{\mathcal {K}}}^n_{2}(nt) = \frac{2n^2t^2 - 2n^2t + {n \atopwithdelims ()2}}{{n \atopwithdelims ()2}}&\implies \widehat{{\mathcal {K}}}^\infty _{2}(t) = 4t^2 - 4t + 1=(1-2t)^2.&\end{aligned}$$

Lemma 15

We have: \(\widehat{{\mathcal {K}}}^\infty _k(t) = (1 - 2t)^k\) for all \(k \in {\mathbb {N}}\).

Proof

The Krawtchouk polynomials satisfy the following three-term recurrence relation (see, e.g., [28]):

$$\begin{aligned} (k+1){\mathcal {K}}^n_{k+1}(t) = (n - 2t) {\mathcal {K}}^n_{k}(t) - (n - k + 1) {\mathcal {K}}^n_{k-1}(t) \end{aligned}$$

for \(1\le k\le n-1\). By evaluating the polynomials at nt we obtain:

$$\begin{aligned}&(k+1){\mathcal {K}}^n_{k+1}(nt) = (n - 2nt) {\mathcal {K}}^n_{k}(nt) - (n - k + 1) {\mathcal {K}}^n_{k-1}(nt), \\&\quad \implies (k+1){n \atopwithdelims ()k+1}\widehat{{\mathcal {K}}}^n_{k+1}(nt)\\&\quad = (n - 2nt) {n \atopwithdelims ()k}\widehat{{\mathcal {K}}}^n_{k}(nt) - (n - k + 1){n \atopwithdelims ()k-1} \widehat{{\mathcal {K}}}^n_{k-1}(nt), \\&\quad \implies \widehat{{\mathcal {K}}}^n_{k+1}(nt) = {n(1 - 2t)\over (n-k)} \cdot \widehat{{\mathcal {K}}}^n_{k}(nt) - {k\over n-k} \cdot \widehat{{\mathcal {K}}}^n_{k-1}(nt), \\&\quad \implies \widehat{{\mathcal {K}}}^\infty _{k+1}(t) = (1-2t) \widehat{{\mathcal {K}}}^\infty _k(t). \end{aligned}$$

As \(\widehat{{\mathcal {K}}}^\infty _{0}(t) = 1\) and \(\widehat{{\mathcal {K}}}^\infty _{1}(t) = 1 - 2t\) we can conclude that indeed \({\widehat{{\mathcal {K}}}^\infty _k(t) = (1 - 2t)^k}\) for all \(k \in {\mathbb {N}}\). \(\square \)
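The convergence \(\widehat{{\mathcal {K}}}^n_k(nt)\rightarrow (1-2t)^k\) can be observed numerically at rational points t for which nt is an integer. A Python sketch (the choices of k and t are arbitrary):

```python
from math import comb

def kraw(n, k, t):
    # Krawtchouk polynomial (19) at an integer point t
    return sum((-1)**i * comb(t, i) * comb(n - t, k - i) for i in range(k + 1))

k, t = 3, 0.3
errs = []
for n in (10, 100, 1000, 10000):
    khat = kraw(n, k, int(n * t)) / comb(n, k)   # khat^n_k(nt), nt integer here
    errs.append(abs(khat - (1 - 2 * t) ** k))    # compare with (1 - 2t)^k
assert errs[-1] < 1e-2 and errs[-1] < errs[0]    # error shrinks as n grows
```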

Next, we show that solutions to (41) remain feasible after increasing the dimension n.

Lemma 16

Let \(\lambda = (\lambda _0, \lambda _1, \dots , \lambda _d)\) be a feasible solution to (41) for a certain \(n \in {\mathbb {N}}\). Then it is also feasible to (41) for \(n+1\) (and thus for any \(n'\ge n+1\)). Therefore, \(\rho (n+1,d,k) \ge \rho (n,d,k)\) for all \(n\ge d\ge k\) and thus \(\rho (n+1,d)\ge \rho (n,d)\) for all \(n\ge d\).

Proof

We may view \({\mathbb {B}}^{n}\) as a subset of \({\mathbb {B}}^{n+1}\) via the map \(a \mapsto (a,0)\), and analogously \({\mathbb {R}}[{x_1,\ldots ,x_n}]\) as a subset of \( {\mathbb {R}}[{x_1,\ldots ,x_n,x_{n+1}}]\) via \(\chi _a \mapsto \chi _{(a,0)}\). Now, for \(m, i \in {\mathbb {N}}\), we consider again the zonal spherical harmonic (18):

$$\begin{aligned} X_i^m = \sum _{|a|= i, a \in {\mathbb {B}}^{m}} \chi _a. \end{aligned}$$

Consider the set \({\mathrm{St}}(0) \subseteq {{\,\mathrm{Aut}\,}}({\mathbb {B}}^{n+1})\) of automorphisms fixing \(0 \in {\mathbb {B}}^{n+1}\), i.e., the coordinate permutations arising from \(\sigma \in \text {Sym}(n+1)\). We will use the following identity:

$$\begin{aligned} \frac{1}{|{\mathrm{St}}(0)|} \sum _{\sigma \in {\mathrm{St}}(0)} \frac{X_i^n}{{n \atopwithdelims ()i}} \circ \sigma = \frac{X_i^{n+1}}{{n+1 \atopwithdelims ()i}}. \end{aligned}$$
(42)

To see that (42) holds note that its left hand side is equal to

$$\begin{aligned} {1\over (n+1)!{n\atopwithdelims ()i}} \sum _{\sigma \in \text {Sym}(n+1)} \sum _{a\in {\mathbb {B}}^{n}, |a|=i} \chi _{(a,0)}\circ \sigma = {1\over (n+1)!{n\atopwithdelims ()i}}\sum _{b\in {\mathbb {B}}^{n+1}, |b|=i} N_b \chi _b, \end{aligned}$$

where \(N_b\) denotes the number of pairs \((\sigma ,a)\) with \(\sigma \in \text {Sym}(n+1)\), \(a\in {\mathbb {B}}^{n}\), \(|a|=i\) such that \(b=\sigma (a,0)\). As there are \(n\atopwithdelims ()i\) choices for a and \(i!(n + 1 - i)!\) choices for \(\sigma \) we have \(N_b= {n\atopwithdelims ()i}i!(n + 1 - i)!\) and thus (42) holds.

Assume \(\lambda \) is a feasible solution of (41) for a given value of n. Then, in view of (21), this means

$$\begin{aligned} \Big |\sum _{i=0}^d \lambda _i \cdot \frac{X_i^n(x)}{{n \atopwithdelims ()i}}\Big | \le 1 \quad \text { for all } x\in {\mathbb {B}}^{n}, \quad \text { and thus for all } x \in {\mathbb {B}}^{n + 1}. \end{aligned}$$

Using (42) we obtain:

$$\begin{aligned} \Big |\sum _{i=0}^d \lambda _i \frac{X_i^{n+1}(x)}{{n+1 \atopwithdelims ()i}}\Big |&= \Big |\sum _{i=0}^d \lambda _i \cdot \frac{1}{|{\mathrm{St}}(0)|} \sum _{\sigma \in {\mathrm{St}}(0)} \frac{X_i^n(\sigma (x))}{{n \atopwithdelims ()i}}\Big | \\&= \Big | \bigg (\frac{1}{|{\mathrm{St}}(0)|} \sum _{\sigma \in {\mathrm{St}}(0)} \big (\sum _{i=0}^d \lambda _i \frac{X_i^n}{{n \atopwithdelims ()i}}\big ) \circ \sigma \bigg )(x)\Big |\le 1 \end{aligned}$$

for all \(x\in {\mathbb {B}}^{n+1}\). Using (21) again, this shows that \(\lambda \) is a feasible solution of program (41) for \(n+1\). \(\square \)

Example 2

To illustrate the identity (42), we give a small example with \(n=i=2\). Consider:

$$\begin{aligned} X^2_2 = \sum _{|a| = 2, a \in {\mathbb {B}}^{2}} \chi _a = \chi _{11}. \end{aligned}$$

The automorphisms in \({\mathrm{St}}(0) \subseteq {{\,\mathrm{Aut}\,}}({\mathbb {B}}^{3})\) fixing \(0 \in {\mathbb {B}}^{3}\) are the permutations of \(x_1, x_2, x_3\). So we get:

$$\begin{aligned} \frac{1}{|{\mathrm{St}}(0)|} \sum _{\sigma \in {\mathrm{St}}(0)} X^2_2 \circ \sigma&= \frac{1}{6}(\chi _{110} + \chi _{101} + \chi _{110} + \chi _{011} + \chi _{101} + \chi _{011}) \\&= \frac{2}{6}(\chi _{110} + \chi _{101} + \chi _{011}) = \frac{1}{3}X^3_2, \end{aligned}$$

and indeed \({2 \atopwithdelims ()2} / {3 \atopwithdelims ()2} = 1/3\).
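The averaging in this example can also be carried out mechanically. In the sketch below (our own bookkeeping: each \(\chi _a\) is represented by its index vector a, and composing with a coordinate permutation \(\sigma \) permutes that index vector), we recover the coefficient \(1/3 = {2 \atopwithdelims ()2}/{3 \atopwithdelims ()2}\) on each \(\chi _b\) with \(|b| = 2\):

```python
from itertools import permutations
from collections import Counter
from math import factorial

n, i = 2, 2
base = [(1, 1, 0)]  # the single index a with |a| = i in B^2, embedded in B^3 as (a, 0)

counts = Counter()
for sigma in permutations(range(n + 1)):  # St(0) = Sym(3), acting by coordinate permutation
    for a in base:
        # chi_a composed with sigma is chi_b where b is a with permuted coordinates
        counts[tuple(a[sigma[j]] for j in range(n + 1))] += 1

size = factorial(n + 1)  # |St(0)| = 6
average = {b: c / size for b, c in counts.items()}
print(average)  # each of chi_110, chi_101, chi_011 appears with coefficient 1/3
```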

Lemma 17

For \(d, k \in {\mathbb {N}}\) with \(d \ge k\), define the program:

$$\begin{aligned} \begin{aligned}&\rho (\infty , d, k) := \max \quad \lambda _k \\&s.t. \quad -1 \le \sum _{i=0}^d \lambda _i \widehat{{\mathcal {K}}}^\infty _i(t) \le 1 \quad (t \in [0, 1]). \end{aligned} \end{aligned}$$
(43)

Then, for any \(n\ge d\), we have: \(\rho (n, d, k) \le \rho (\infty , d, k)\).

Proof

Let \(\lambda \) be a feasible solution to (41) for \((n, d, k)\). We show that \(\lambda \) is feasible for (43). For this, fix \(t \in [0, 1] \cap {\mathbb {Q}}\). Then there exists a sequence of integers \((n_j)_j \rightarrow \infty \) such that \(n_j \ge n\) and \(tn_j \in [0, n_j]\) is an integer for each \(j \in {\mathbb {N}}\). As \(n_j\ge n\), we know from Lemma 16 that \(\lambda \) is also a feasible solution of program (41) for \((n_j,d,k)\). Hence, since \(n_jt\in [0:n_j]\), we obtain

$$\begin{aligned} |\sum _{i=0}^d \lambda _i \widehat{{\mathcal {K}}}_{i}^{n_j}(n_j t)| \le 1 \quad \forall j \in {\mathbb {N}}. \end{aligned}$$

But this immediately gives:

$$\begin{aligned} |\sum _{i=0}^d \lambda _i \widehat{{\mathcal {K}}}^\infty _{i}(t)| = \lim _{j \rightarrow \infty } |\sum _{i=0}^d \lambda _i \widehat{{\mathcal {K}}}_{i}^{n_j}(n_j t)| \le 1. \end{aligned}$$
(44)

As \([0, 1] \cap {\mathbb {Q}}\) lies dense in [0, 1] (and the \(\widehat{{\mathcal {K}}}^\infty _i\)’s are continuous) we may conclude that (44) holds for all \(t \in [0, 1]\). This shows that \(\lambda \) is feasible for (43) and we thus have \(\rho (n, d, k) \le \rho (\infty , d, k)\), as desired. \(\square \)

It remains to compute the optimal solution of program (43). In light of Lemma 15, and after the change of variables \(x = 1 - 2t\), this program may be reformulated as:

$$\begin{aligned} \begin{aligned}&\max \quad |\lambda _k| \\&s.t. \quad -1 \le \sum _{i=0}^d \lambda _i x^i \le 1 \quad (x \in [-1, 1]). \end{aligned} \end{aligned}$$
(45)

In other words, we are tasked with finding a polynomial p(x) of degree at most d satisfying \(|p(x)| \le 1\) for all \(x \in [-1, 1]\), whose k-th coefficient is as large as possible in absolute value. This is a classical extremal problem, solved by V. Markov.

Theorem 9

(see, e.g., Theorem 7, p. 53 in [29]) For \(m \in {\mathbb {N}}\), let \(T_m(x) = \sum _{i=0}^m t_{m, i} x^i\) be the Chebyshev polynomial of degree m. Then the optimal solution \(\lambda \) to (45) is given by:

$$\begin{aligned} \sum _{i=0}^d \lambda _i x^i = {\left\{ \begin{array}{ll} T_{d}(x) \quad &{} \text {if } k \equiv d \mod 2, \\ T_{d-1}(x) \quad &{} \text {if } k \not \equiv d \mod 2. \end{array}\right. } \end{aligned}$$

In particular, \(\rho (\infty , d, k)\) is equal to \(|t_{d, k}|\) if \(k \equiv d \mod 2\), and to \(|t_{d-1, k}|\) otherwise.

As the coefficients of the Chebyshev polynomials are known explicitly, Theorem 9 allows us to give exact values of the constant \(\gamma _d\) appearing in our main results (see Table 1). Using the following identity:

$$\begin{aligned} \sum _{i=0}^d |t_{d, i}| = \frac{1}{2}(1+\sqrt{2})^d + \frac{1}{2}(1 - \sqrt{2})^d \le (1+\sqrt{2})^d, \end{aligned}$$

we are also able to concretely estimate:

$$\begin{aligned} \gamma _d \le \max _{k \le d} \rho (\infty , d, k) \le (1+\sqrt{2})^d. \end{aligned}$$
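These quantities are straightforward to tabulate. The sketch below (our own helper names) generates the coefficients \(t_{m,i}\) from the recurrence \(T_{m+1}(x) = 2xT_m(x) - T_{m-1}(x)\), checks the coefficient-sum identity above, and evaluates \(\rho (\infty , d, k)\) as prescribed by Theorem 9:

```python
def cheb_coeffs(m):
    """Coefficient list [t_{m,0}, ..., t_{m,m}] of the Chebyshev polynomial T_m."""
    if m == 0:
        return [1]
    prev, cur = [1], [0, 1]
    for _ in range(m - 1):
        nxt = [0] + [2 * c for c in cur]  # 2x * T_m
        for j, c in enumerate(prev):
            nxt[j] -= c                   # - T_{m-1}
        prev, cur = cur, nxt
    return cur

def rho(d, k):
    """rho(infinity, d, k): |t_{d,k}| if k = d mod 2, else |t_{d-1,k}| (Theorem 9)."""
    c = cheb_coeffs(d) if k % 2 == d % 2 else cheb_coeffs(d - 1)
    return abs(c[k])

d = 5
# coefficient-sum identity: sum_i |t_{d,i}| = ((1+sqrt 2)^d + (1-sqrt 2)^d) / 2
assert sum(abs(c) for c in cheb_coeffs(d)) == round(
    0.5 * ((1 + 2**0.5)**d + (1 - 2**0.5)**d))
print([rho(d, k) for k in range(d + 1)])  # max over k bounds gamma_d from above
```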

7 Concluding remarks

Summary We have shown a theoretical guarantee on the quality of the sum-of-squares hierarchy \(\smash {f_{({r})}} \le f_{\min }\) for approximating the minimum of a polynomial f of degree d over the boolean cube \({\mathbb {B}}^{n}\). As far as we are aware, this is the first such analysis that applies to values of r smaller than \((n+d) / 2\), i.e., when the hierarchy is not exact. Additionally, our guarantee applies to a second, measure-based hierarchy of bounds \(\smash {f^{({r})}} \ge f_{\min }\). Our result may therefore also be interpreted as bounding the range \(\smash {f^{({r})}} - \smash {f_{({r})}}\). Our analysis also applies to polynomial optimization over the cube \(\{\pm 1\}^n\) (by a simple change of variables), over the q-ary cube (see Appendix A) and in the setting of matrix-valued polynomials (see Appendix B).

Analysis for small values of r A limitation of Theorem 1 is that the analysis of \(\smash {f_{({r})}}\) applies only for choices of \(d, r, n\) satisfying \(d(d+1) \xi ^n_{r+1} \le 1/2\). One may partially avoid this limitation by proving a slightly sharper version of Lemma 6, showing instead that \( \varLambda \le {\tilde{\varLambda }}/(1 - {\tilde{\varLambda }}), \) assuming now only that \({\tilde{\varLambda }} < 1\). Indeed, Lemma 6 is a special case of this result: assuming \({\tilde{\varLambda }}\le 1/2\) gives \(\varLambda \le 2 {\tilde{\varLambda }}\). Nevertheless, our methods exclude values of r outside the regime \(r = \varOmega (n)\).

The constant \(\gamma _d\) The strength of our results depends in large part on the size of the constant \(C_d\) appearing in Theorem 1 and Theorem 3, where we may set \(C_d=d(d+1)\gamma _d\). Recall that \(\gamma _d\) is defined in Lemma 5 as a constant for which \(\Vert p_k\Vert _{\infty } \le \gamma _d \Vert p\Vert _{\infty }\) on \({\mathbb {B}}^{n}\) for any polynomial \({p = p_0 + p_1 + \ldots + p_d}\) of degree d and any \(k \le d\), independently of the dimension n. In Sect. 6 we have shown the existence of such a constant. Furthermore, we have shown there that we may choose \(\gamma _d \le (1 + \sqrt{2})^d\), and have given an explicit expression for the smallest possible value of \(\gamma _d\) in terms of the coefficients of Chebyshev polynomials. Table 1 lists these values for small d.

Table 1 Values of the constant \(\gamma _d\)

Computing extremal roots of Krawtchouk polynomials Although Theorem 2 provides only an asymptotic bound on the least root \(\xi _{r}^n\) of \({\mathcal {K}}^n_r\), it should be noted that \(\xi _{r}^n\) can be computed explicitly for small values of r and n, thus allowing for a concrete estimate of the error of both Lasserre hierarchies via Theorems 1 and 3, respectively. Indeed, as is well-known, the root \(\xi _{r+1}^n\) is equal to the smallest eigenvalue of the \((r+1)\times (r+1)\) matrix A (known as the Jacobi matrix), whose entries are given by \(A_{i,j} = \langle t\widehat{{\mathcal {K}}}^n_i(t), \widehat{{\mathcal {K}}}^n_j(t) \rangle _\omega \) for \(i,j\in \{0,1,\ldots , r\}\). See, e.g., [41] for more details.
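As an illustration of this computation, the following sketch (our own code, not from [41]) uses the three-term recurrence from the proof of Lemma 15 rewritten as \(x\,\widehat{{\mathcal {K}}}^n_k(x) = -\tfrac{k}{2}\widehat{{\mathcal {K}}}^n_{k-1}(x) + \tfrac{n}{2}\widehat{{\mathcal {K}}}^n_k(x) - \tfrac{n-k}{2}\widehat{{\mathcal {K}}}^n_{k+1}(x)\) to build a tridiagonal \((r+1)\times (r+1)\) truncation of the multiplication-by-x operator, whose eigenvalues are the roots of \({\mathcal {K}}^n_{r+1}\); this non-symmetric truncation plays the role of the matrix A above:

```python
import numpy as np

def least_root(n, r):
    """Least root xi_{r+1}^n of K_{r+1}^n, computed as the smallest eigenvalue
    of the truncated multiplication operator in the basis Khat_0, ..., Khat_r."""
    J = np.zeros((r + 1, r + 1))
    for k in range(r + 1):
        J[k, k] = n / 2
        if k > 0:
            J[k, k - 1] = -k / 2
        if k < r:
            J[k, k + 1] = -(n - k) / 2
    return min(np.linalg.eigvals(J).real)

# sanity check: K_2^n(x) = C(n-x,2) - x(n-x) + C(x,2) has roots (n +- sqrt(n))/2
n = 20
print(least_root(n, 1), (n - n**0.5) / 2)  # both approx 7.7639
```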

Connecting the hierarchies Our analysis of the outer hierarchy \(\smash {f_{({r})}}\) on \({\mathbb {B}}^{n}\) relies essentially on knowledge of the inner hierarchy \(\smash {f^{({r})}}\). Although not explicitly mentioned there, this is the case for the analysis on \(S^{n-1}\) in [11] as well. As the behaviour of \(\smash {f^{({r})}}\) is generally quite well understood, this suggests a potential avenue for proving further results on \(\smash {f_{({r})}}\) in other settings.

For instance, the inner hierarchy \(\smash {f^{({r})}}\) is known to converge at a rate in \(O(1/r^2)\) on the unit ball \(B^n\) or the unit box \([-1, 1]^n\), but matching results on the outer hierarchy \(\smash {f_{({r})}}\) are not available. The question is thus whether the strategy used for the hypersphere \(S^{n-1}\) in [11] and for the boolean cube \({\mathbb {B}}^{n}\) here might be extended to these cases as well.

A difficulty is that the sets \(B^n\) and \([-1, 1]^n\) have a more complicated symmetry structure than \(S^{n-1}\) and \({\mathbb {B}}^{n}\), respectively. In particular, the group actions have uncountably many orbits in these cases, and a direct analog of the Funk–Hecke formula (28) is not available. New ideas are therefore needed to define the kernel \(K(x,y)\) (cf. (8)) and analyze its eigenvalues.