
1 Introduction

Sparse polynomial interpolation algorithms, where the number of values required depends on the number of nonzero terms in a chosen representation base rather than on the degree of the polynomial, originate from two sources. One is Prony’s 1795 algorithm for reconstructing an exponential sum [18] (see also [2]) and another is Blahut’s exact sparse polynomial interpolation algorithm in the decoding phase of the Reed–Solomon error correcting code. Both algorithms first determine the term structure via the generator (“error locator polynomial”) of the linear recurrent sequence of the values \(f(\omega ^i), i=0,1,2,\ldots \), of the sparse function \(f\). Blahut’s algorithm has led to a rich collection of exact sparse multivariate polynomial interpolation algorithms, among them [1, 12, 13, 16, 20]. Prony’s algorithm suffers from numerical instability unless randomization controls, with high probability and for functions of significant sparsity, the conditioning of intermediate Hankel matrices. The probabilistic spectral analysis in the GLL algorithm  [5, 7] adapts the analysis of the exact early termination algorithm of [13]. The resulting numerical sparse interpolation algorithms have recently had a high impact on medical signal processing; see the web site http://smartcare.be of Wen-shin Lee and her collaborators. The GLL algorithm can be generalized to multivariate polynomial and rational function recovery via Zippel’s variable-by-variable sparse interpolation  [14].

Already in the beginning days of symbolic computation, the importance of the choice of polynomial basis was recognized: \((x-2)^{100} + 1\) is a concise representation of a polynomial with \(101\) terms in the power basis representation. The discrete-continuous optimization problem of computing the sparsest shift of an exact univariate polynomial surprisingly has a polynomial-time solution [4, 9, 10]. Our subject is the computation of an approximate interpolant that is sparsified through a shift. One can interpret our algorithm as a numerical version of the exact sparsest shift algorithms. As in least squares fitting, noise can be controlled by oversampling (cf. [8]). The main difficulty is that the shift is unknown. Our numerical algorithm adapts Algorithm UniSparsestShifts \(\,\langle \) one proj, two seq \(\,\rangle \) in [4] to compute the shift: UniSparsestShifts \(\,\langle \) one proj, two seq \(\,\rangle \) carries the shift as a symbolic variable \(z\) throughout the sparse interpolation algorithm. Since the coefficients of the polynomials in the shift variable \(z\) are spoiled by noise, the GCD step becomes an approximate polynomial GCD. A main question answered here is whether the arising nonlinear optimization problems remain well-conditioned. Our answer is a conditional yes: an optimal approximate shift is found among the arguments of all local minima, but the number of local minima is high, preventing the application of standard approximate GCD algorithms. Instead, as a fallback, we perform global optimization by computing all zeros of the gradient ideal. In addition, our algorithm requires high precision floating point arithmetic.

In [3], we have introduced outlier values to the sparse interpolation problem. There, outlier removal requires high oversampling, as the worst case of \(k\)-error linear complexity is \(2t(2k+1)\), where \(t\) is the generator degree. However, this is only an upper bound for sparse interpolation. The situation is different for Algorithm UniSparsestShifts \(\,\langle \) one proj, two seq \(\,\rangle \). Outliers can be removed at the construction stage of the values containing the shift variable \(z\), by a numeric version of Blahut’s decoding algorithm for interpolation with errors. The algorithm, numerical interpolation with outliers, is interesting in its own right. As we will show in Sect. 3, the analysis in [3, 7] does not directly apply, as randomization can only be applied with a limited choice of random evaluation points. We have successfully tested it as a subroutine of our numerical sparsest shift algorithm. Note that a few outliers per interpolation lead to a very small sparse interpolation problem for error location, which can be handled successfully by sparse interpolation with noisy values.

For the sake of comparison with this algorithm, we restrict ourselves to characteristic \(0\) and view a sparse shift representation as a Taylor expansion at a point that makes the representation sparse. This leads to finding a root common to many derivatives. Combined with a weighted least squares fit for removing outliers and tolerating noise, this approach compares favorably to the main algorithm.

In Sect. 4, we present preliminary experimental results showing that our algorithms can recover sparse models even in the presence of substantial noise and outliers. See Sect. 4.3 for our conclusions.

2 Computing Sparse Shifts

In this section we introduce an algorithm to compute a shifted sparse interpolant in a numerical setting. The exact algorithm accepts outliers and uses early termination; we adapt it to a numerical setting with noisy and erroneous data. It is based on a numerical version of Blahut’s decoding algorithm.

2.1 Main Algorithm with Early Termination

The Early Termination Theorem in [13] is at the heart of computing a sparsest shift. Let

$$\begin{aligned} g(x) = \sum _{j=1}^t c_j x^{e_j},\quad c_j \ne 0 \quad \text {for all}\quad 1\le j\le t, \end{aligned}$$

be a \(t\)-sparse polynomial with coefficients in an integral domain \(\mathsf {D}\). Furthermore, let

$$\begin{aligned} \alpha _i(y) = g(y^i) \in \mathsf {D}[y],\quad \text { for} \quad i = 1,2,\ldots \end{aligned}$$

be evaluations of \(g\) at powers of an indeterminate \(y\). Prony’s/Blahut’s theorem states that the sequence of the \(\alpha _i\) is linearly generated by \(\prod _{j=1}^t (\lambda -y^{e_j})\). Therefore, if one considers the Hankel matrices

$$\begin{aligned} {\fancyscript{H}}_i(y) = \begin{bmatrix} \alpha _{1}&\alpha _{2}&\ldots&\alpha _{i}\\ \alpha _{2}&\alpha _{3}&\ldots&\alpha _{i+1} \\ \vdots&\vdots&\ddots&\vdots \\ \alpha _{i}&\alpha _{i+1}&\ldots&\alpha _{2i-1} \end{bmatrix} \in \mathsf {D}[y]^{i\times i},\quad \text {for }i=1,2,\ldots \end{aligned}$$

one must have \(\det ({\fancyscript{H}}_{t+1}) = 0\). Theorem 4 in [13] simply states that \(\det ({\fancyscript{H}}_i) \ne 0\) for all \(1\le i \le t\). One can replace the indeterminate \(y\) by a randomly sampled coefficient domain element to have \(\det ({\fancyscript{H}}_i) \ne 0\) for all \(1\le i \le t\) with high probability (w.h.p.).
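
As an illustration, the following Python sketch applies the early-termination criterion numerically: the projection variable \(y\) is replaced by a random root of unity of prime order, and the first \(i\) for which \(\det ({\fancyscript{H}}_i)\) numerically vanishes reveals the sparsity. The example polynomial, the tolerance, and the use of the smallest singular value as a zero test are illustrative assumptions, not part of the algorithms of [13].

```python
import numpy as np

def sparsity_by_early_termination(alpha, max_t, tol=1e-8):
    """Given evaluations alpha[j] = g(omega^(j+1)), j = 0, 1, 2, ...,
    return the first i - 1 such that the i x i Hankel matrix H_i is
    numerically singular, i.e., the (presumed) sparsity t of g."""
    for i in range(1, max_t + 2):
        H = np.array([[alpha[r + c] for c in range(i)] for r in range(i)])
        # smallest singular value as a scale-aware substitute for det(H_i) = 0
        if np.linalg.svd(H, compute_uv=False)[-1] < tol:
            return i - 1
    return None                                   # sparsity exceeds max_t

# illustrative example: a 3-sparse polynomial evaluated at powers of a
# random root of unity of prime order p
p = 10007
omega = np.exp(2j * np.pi * np.random.randint(1, p) / p)
g = lambda x: 3 * x**5 - 2 * x**17 + x**40
max_t = 6
alpha = [g(omega**j) for j in range(1, 2 * max_t + 2)]
print(sparsity_by_early_termination(alpha, max_t))    # prints 3
```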

We seek an \(s\) in any extension of the field \(\mathsf {K}\) such that for a given \(f(x) \in \mathsf {K}[x]\) the polynomial \(f(x+s) = h(x)\) is \(t\)-sparse for a minimal \(t\). Now consider \(g(x) = f(x+z) \in \mathsf {D}[x]\) with \(\mathsf {D}= \mathsf {K}[z]\). We have \(\varDelta _i(y,z) = \det ({\fancyscript{H}}_{i}) \in (\mathsf {K}[z])[y]\); note that \(\alpha _i(y,z) = g(y^i) = f(y^i+z)\). By the above Theorem 4, the sparsest shift is an \(s\) with \(\varDelta _{t+1}(y,s) = 0\) for the smallest \(t\). Algorithm UniSparsestShifts \(\,\langle \) one proj, two seq \(\,\rangle \) computes \(s\) as

$$\begin{aligned}&z\,+\,s\,\mathrm{divides\,(w.h.p.)\,GCD}(\varDelta _{t+1}(y_1,z),\varDelta _{t+1}(y_2,z)), \\&\qquad \qquad \qquad \quad {\mathrm{where}}\,y_1,y_2\,{\mathrm{are\,random\,in}}\, \mathsf {K}; \end{aligned}$$

note that, for the projections \(y = y_1\) and \(y = y_2\), the first \(t\) with a nontrivial GCD may be smaller than the true sparsity, but only with low probability.
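
For concreteness, here is a small exact-arithmetic sketch of this GCD step in sympy; the example polynomial, its sparsity \(t=2\), and the two projection points are our own illustrative choices (in the algorithm, \(t\) is not known in advance but is found by iterating over \(i\)). Note that, since \(\varDelta _{t+1}(y,s) = 0\) for the sparsest shift \(s\) as stated above, the shift appears as a root of the GCD (here the factor \(z - 7\)).

```python
import sympy as sp

x, z = sp.symbols('x z')
f = (x - 7)**3 + 5*(x - 7)**8                 # hidden sparsest shift 7, t = 2

def hankel_det(f, y_val, i):
    """det(H_i) built from alpha_j(z) = f(y^j + z) for the projection y = y_val."""
    alpha = [sp.expand(f.subs(x, y_val**j + z)) for j in range(1, 2 * i)]
    H = sp.Matrix(i, i, lambda r, c: alpha[r + c])
    return sp.expand(H.det())

t = 2
y1, y2 = 2, 3                                  # two "random" projection points
g = sp.gcd(sp.Poly(hankel_det(f, y1, t + 1), z),
           sp.Poly(hankel_det(f, y2, t + 1), z))
print(sp.factor(g.as_expr()))     # the factor z - 7 appears, revealing the shift
```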

For numeric sparse interpolation with a shift, we assume that for \(f(x) \in {\mathbb C}[x]\) we can obtain

$$\begin{aligned} f(\zeta ) + noise + outlier\,error,\quad {\text {for any}}\quad \zeta \in {\mathbb C}. \end{aligned}$$

Here only a fraction of the values contain an outlier error, and noise is a random perturbation of \(f(\zeta )\) by a relative error of \(10^{-10}\), say. Our algorithm returns a sparse interpolant \(g(x)\) that at all probed values \(\zeta \), save for a fraction that are removed as outliers, approximates the returned \(f(\zeta ) + {\textit{noise}}\). Note that probing \(f\) at \(\zeta \) twice may produce a different noise and possibly an outlier.
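
For concreteness, a toy model of such a black box is sketched below; the noise level, the outlier probability, and the outlier magnitude are illustrative placeholders, not the settings of the experiments in Sect. 4.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_black_box(f, zeta, rel_noise=1e-10, outlier_prob=0.05, outlier_rel=1e5):
    """Return f(zeta) perturbed by a small relative noise and, with small
    probability, by a large outlier error; probing the same zeta twice
    gives different noise and possibly an outlier, as described above."""
    value = f(zeta)
    value *= 1 + rel_noise * rng.normal() * np.exp(2j * np.pi * rng.random())
    if rng.random() < outlier_prob:
        value += outlier_rel * abs(value) * np.exp(2j * np.pi * rng.random())
    return value
```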

We now give the outline of our Algorithm ApproxUniSparseShift \(\,\langle \) one proj, sev seq \(\,\rangle \). Note that because of the approximate nature of the shifted sparse interpolant, there is a trade-off between backward error and sparsity. Hence we call our algorithm a “sparse shift” algorithm. As in Algorithm UniSparsestShifts \(\,\langle \) one proj, two seq \(\,\rangle \), for \(L\) complex values \(y =\omega ^{[1]}\), \(\omega ^{[2]}\), \(\ldots \), \(\omega ^{[L]}\) we compute \({\widetilde{\delta _i}}^{[\ell \,]}(z) = {\widetilde{\varDelta _i}}(\omega ^{[\ell \,]},z)\) from \({\widetilde{\alpha _i}}(\omega ^{[\ell \,]},z)\), \(\ell =1,2,\ldots ,L\). Here the tilde accent mark \({\widetilde{\ \ }}\) indicates that the values have noise in their scalars. As in [7], we choose the \(\omega ^{[\ell \,]}\) to be different random roots of unity of prime order. Our algorithm consists of the four following tasks:

Step 1::

For \(\ell =1,2,\ldots ,L\), compute the numeric complex polynomials \({\widetilde{\alpha _i}}(\omega ^{[\ell \,]},z)\) via a numeric version of the Blahut decoding algorithm; see Sect. 3. Step 1 is assumed to have removed all outliers.

Step 2::

Compute the determinants \({\widetilde{\delta _i}}^{[\ell \,]}(z)\) of numeric polynomial Hankel matrices \({\widetilde{{\fancyscript{H}}_i}}^{[\ell \,]}(z)\) for all \(\ell \), iterating Steps 3 and 4 on \(i\). We perform the determinant computations with twice the floating point precision used for Steps 1, 3 and 4.

Step 3::

Determine the sparsity and an approximate shift. Note that the approximate shift \(\widetilde{s}\) is an approximate root of the polynomials \({\widetilde{\delta _i}}^{[1]}(z)\), \({\widetilde{\delta _i}}^{[2]}(z)\), \(\ldots \), \({\widetilde{\delta _i}}^{[L]}(z)\). Our method finds the smallest perturbation of the \(\widetilde{\delta }_i^{[\ell \,]}(z)\) that produces a common root, simultaneously for all \(\ell \). If that distance is large, we assume that there is no common root and the dimension of the Hankel matrix was too small. It might happen that an accurate shift is diagnosed too early, but then the constructed model produces a worse backward error. The common root \(\widetilde{s}\) of the nearest (in the 2-norm) polynomial system is given by the following formula (see [11] and the literature cited there):

$$\begin{aligned} \widetilde{s} = \mathop {\text {arginf}}\limits _{\zeta \in {\mathbb C}} \sum _{\ell =1}^L |\widetilde{\delta }_i^{[\ell \,]}(\zeta )|^2 \Big / \Big (\sum _{m=0}^d |\zeta ^m|^2\Big ), \end{aligned}$$

where \(d = \max _{\ell } \{\deg (\widetilde{\delta }_i^{[\ell \,]}(z))\}\) and, in all polynomials, the coefficients of \(z^m\) for \(m \in \{0,1,\ldots ,d\}\) may be perturbed. In our experiments in Sect. 4, we have only considered real shifts \(\widetilde{s}\in {\mathbb R}\). The optimization problem is then

$$\begin{aligned} \widetilde{s}_{\text {real}} = \mathop {\text {arginf}}\limits _{\xi \in {\mathbb R}} \sum _{\ell =1}^L \Big ((\mathfrak {R}\widetilde{\delta }_i^{[\ell \,]})(\xi )^2 + (\mathfrak {I}\widetilde{\delta }_i^{[\ell \,]})(\xi )^2\Big ) \Big / \Big (\sum _{m=0}^d \xi ^{2m}\Big ), \end{aligned}$$
(1)

where \(\mathfrak {R}\widetilde{\delta }_i^{[\ell \,]}\) and \(\mathfrak {I}\widetilde{\delta }_i^{[\ell \,]}\) are the real and imaginary parts of the polynomials \(\widetilde{\delta }_i^{[\ell \,]}\), respectively. We find \(\widetilde{s}_{\text {real}}\) among the real roots of the numerator of the derivative of the objective function in (1),

$$\begin{aligned}&\frac{\textstyle \partial \sum _{\ell =1}^L((\mathfrak {R}\widetilde{\delta }_i^{[\ell \,]})(z)^2 + (\mathfrak {I}\widetilde{\delta }_i^{[\ell \,]})(z)^2)}{\partial z} \times \Big (\sum _{m=0}^d z^{2m}\Big ) \nonumber \\&\quad - \sum _{\ell =1}^L((\mathfrak {R}\widetilde{\delta }_i^{[\ell \,]})(z)^2 + (\mathfrak {I}\widetilde{\delta }_i^{[\ell \,]})(z)^2) \times \Big ( \sum _{m=0}^d (2m) z^{2m-1}\Big ), \end{aligned}$$
(2)

and choose the root that minimizes the objective function in (1).

We have observed that a larger number \(L\) of separate \(\omega ^{[\ell \,]}\) can improve the accuracy of the optimal shift, at the cost of oversampling. We have also observed that the optimization problem (1), (2) has numerous local optima, some near the optimal approximate shift, which prevents the use of any local approximate GCD algorithm (see the sketch following Step 4 below).

Step 4::

With the approximate sparsest shift \(\widetilde{s}\), complete the sparse polynomial reconstruction, as in [6] and [3, Sect. 6]. One reuses the evaluations from previous steps, having removed those that were declared outliers in Step 1.
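
To make Step 3 concrete, the following sketch minimizes the real-shift objective (1) by computing the real roots of the gradient numerator (2) with numpy; the \(\widetilde{\delta }_i^{[\ell \,]}\) are assumed to be given as complex coefficient arrays (lowest degree first), and the high-precision arithmetic that is needed in practice is omitted here.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def real_shift(deltas):
    """deltas: coefficient arrays (lowest degree first, complex) of the
    polynomials delta_i^[l](z), l = 1..L.  Returns the real shift minimizing
    the objective (1), found among the real roots of the numerator (2)."""
    deltas = [np.asarray(c, dtype=complex) for c in deltas]
    d = max(len(c) for c in deltas) - 1
    # N(z) = sum_l (Re delta)^2 + (Im delta)^2,  D(z) = sum_{m=0}^d z^(2m)
    N = np.zeros(1)
    for c in deltas:
        N = P.polyadd(N, P.polyadd(P.polymul(c.real, c.real),
                                   P.polymul(c.imag, c.imag)))
    D = np.zeros(2 * d + 1)
    D[::2] = 1.0
    # numerator of the derivative of N/D, i.e., formula (2)
    grad_num = P.polysub(P.polymul(P.polyder(N), D), P.polymul(N, P.polyder(D)))
    candidates = [r.real for r in P.polyroots(grad_num) if abs(r.imag) < 1e-8]
    if not candidates:
        return None
    objective = lambda xi: P.polyval(xi, N) / P.polyval(xi, D)
    s = min(candidates, key=objective)
    return s, objective(s)
```

As noted above, in our experiments this computation must be carried out in high-precision arithmetic; plain double precision, as in this sketch, only suffices for very mild noise.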

In the remainder of this section, we restrict ourselves to characteristic \(0\). We now describe a more naïve approach for the same problem. Some early termination can also be achieved here. Unlike the main algorithm, this one cannot recover errors in the exact setting.

2.2 Using Taylor Expansions

Let \(f(x) = \sum _{i=1}^{t} c_i (x-s)^{e_i}\) be a \(t\)-sparse shifted polynomial of degree \(d\). We can see this expression as a Taylor expansion of \(f\) at \(x=s\):

$$\begin{aligned} f(x) = \sum _{i=0}^\infty \frac{\left( \partial ^{i}{f}/\partial {x}^{i}\right) (s)}{i!} \left( x-s \right) ^i. \end{aligned}$$

A sparsest shift is then an \(s\) that is a root of the maximum number of polynomials in the list \(S = \left\{ \left( \partial ^{i}{f}/\partial {x}^{i}\right) (x) \mid i \in \{ 0,\ldots , d-1 \}\right\} \).
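
The following sympy sketch illustrates this criterion directly on an expanded polynomial of our own choosing; it simply counts, for each rational root, how many of the derivatives in \(S\) it zeros.

```python
import sympy as sp
from collections import Counter

x = sp.symbols('x')
f = sp.expand(2*(x - 3)**2 + 5*(x - 3)**7)      # sparsest shift 3, t = 2
d = int(sp.degree(f, x))

# S = { f, f', ..., f^(d-1) }
S = [sp.diff(f, x, i) for i in range(d)]
votes = Counter()
for p in S:
    for r in sp.roots(sp.Poly(p, x), filter='Q'):   # rational roots only
        votes[r] += 1

shift, hits = votes.most_common(1)[0]
print(shift, hits)     # 3 is a root of 6 of the 7 derivatives (all but i = 2)
```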

Remark 1

It is stated in Theorem 1 in [17] that if \(t \le (d+1)/2\) then the shift \(s\) is unique and rational. Moreover, the proof gives the stronger statement: for any other shift \(\hat{s}\), with a sparsity \(\hat{t}\), one has \(\hat{t} > d+1-t\).

This statement is not true in characteristic \(p \ne 0\): for instance, consider the two shifts \(-1\) and \(0\) in the polynomial \(\left( x+1 \right) ^p \equiv x^p + 1 \pmod p\).

Lemma 1

Let \(S_{2t}\) be the list of the last \(2t\) elements in \(S\). The root that zeros the maximum number of polynomials in \(S_{2t}\) is the sparsest shift.

Proof

We prove this by contradiction. Let the sparsest shift \(s\), with sparsity \(t\), be a root of \(r\) elements of \(S_{2t}\), and let another shift \(\hat{s}\), with corresponding sparsity \(\hat{t}\), be a root of \(\hat{r}\) elements. We first notice that the number of derivatives \(\left( \partial ^{i}{f}/\partial {x}^{i}\right) \), \(0\le i\le d\), for which \(\hat{s}\) is a root, and the number for which it is not a root, sum to \(d+1\); the latter number is exactly \(\hat{t}\), and the former is at least \(\hat{r}\). So we have the inequality \(\hat{r}+\hat{t} \le d+1\).

Suppose now that \(\hat{r} \ge r\). Since the sparsity of \(f\) in the \(s\)-shifted basis is \(t\), at most \(t\) of the \(2t\) elements of \(S_{2t}\) do not have \(s\) as a root, so \(r \ge t\) and hence \(\hat{r} \ge t\). On the other hand, by Remark 1, \(\hat{t} > d+1-t\). Summing these last two inequalities yields \(\hat{r}+\hat{t} > d+1\), which is impossible.   \(\square \)

Early termination can be achieved; indeed, under certain circumstances, one need not compute all \(2t\) derivatives. For example, suppose that the degree \((d-1)\) term of \(f\) in the sparsest shifted basis is missing and try \(\bar{s}\), the root of \(\left( \partial ^{d-1}{f}/\partial {x}^{d-1}\right) (x)\), as a shift; this is the Tschirnhaus transformation (originally introduced for solving cubic equations). If the “back-shifted” polynomial \(f(x+\bar{s})\) has fewer than \((d+1)/2\) terms, then by Remark 1, \(\bar{s}\) is the unique sparsest shift. We can extend this technique by trying all rational roots in the list \(S_{\tau }\) for a small \(\tau \).
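
A sketch of this early-termination test follows, with an illustrative polynomial whose degree-\((d-1)\) coefficient vanishes in the shifted basis; the acceptance test \(2t \le d+1\) is the one from Remark 1.

```python
import sympy as sp

x = sp.symbols('x')

def tschirnhaus_shift(f):
    """Try the root of the (d-1)-st derivative as the shift and accept it
    when the back-shifted polynomial is sparse enough for Remark 1."""
    d = int(sp.degree(f, x))
    lead = sp.diff(f, x, d - 1)                      # a linear polynomial
    s = list(sp.roots(sp.Poly(lead, x)))[0]          # the Tschirnhaus point
    back = sp.Poly(sp.expand(f.subs(x, x + s)), x)
    t = sum(1 for c in back.all_coeffs() if c != 0)  # sparsity after shifting
    return (s, t) if 2 * t <= d + 1 else None        # unique by Remark 1

f = sp.expand(2*(x - 3)**2 + 5*(x - 3)**7)           # no degree-6 shifted term
print(tschirnhaus_shift(f))                          # -> (3, 2)
```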

Now we state the naïve algorithm based on the above, then we modify it for a numeric setting. Consider first the following exact algorithm:

Step 1::

Compute the exact interpolant using \(D+1\) calls to the black box, where \(D \ge d\).

Step 2::

Try early termination on \(S_{\tau }\), for a small \(\tau \), and return if successful.

Step 3::

Compute all remaining derivatives in \(S_{\theta }\), for \(\theta =\min (2T,D)\), where \(T\) is an a priori bound on the sparsity \(t\).

Step 4::

The sparsest shift is the rational root \(s\) that zeros the most derivatives in \(S_{\theta }\).

Step 5::

The “back-shifted” polynomial \(f(x+s)\) gives the support for the sparse polynomial.

This algorithm can be easily translated to a numerical one, based on least squares fitting:

Step 1::

Compute a degree-\(D\) weighted least squares fit with \(O(D+E)\) calls to the black box.

Step 2::

Remove outliers by comparing relative errors, then update the fit.

Step 3::

Compute the \(\theta \) derivatives in \(S_{\theta }\) (possibly terminate early and proceed to Step 6).

Step 4::

The approximate root \(s\) that zeros most derivatives is the sparsest shift.

Step 5::

The polynomial \(f(x+s)\) gives the support for the sparse polynomial.

Step 6::

A Newton iteration can be conducted on the result of Step 5 to increase accuracy.
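
The following condensed sketch implements Steps 1, 3, 4 and 5 above (outlier removal and the Newton refinement of Step 6 are omitted); the oversampling factor and the root-clustering tolerance are arbitrary illustrative choices.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def numeric_sparse_shift(black_box, D, oversample=4, tol=1e-2):
    # Step 1: (unweighted) least squares fit of degree D from noisy samples
    xs = np.random.uniform(-1, 1, oversample * (D + 1))
    ys = np.array([black_box(x) for x in xs])
    coeffs = P.polyfit(xs, ys, D)
    # Step 3: the derivatives f, f', ..., f^(D-1)
    derivs, c = [], coeffs
    for _ in range(D):
        derivs.append(c)
        c = P.polyder(c)
    # Step 4: cluster the (approximately real) roots of all derivatives;
    # the cluster hit by the most derivatives gives the approximate shift
    votes = {}
    for dc in derivs:
        keys = {round(r.real / tol) * tol
                for r in P.polyroots(dc) if abs(r.imag) < tol}
        for key in keys:
            votes[key] = votes.get(key, 0) + 1
    s = max(votes, key=votes.get)
    # Step 5: coefficients in the shifted basis give the sparse support
    return s, P.polyfit(xs - s, ys, D)

# hypothetical usage with noise of relative size about 1e-10
f = lambda x: 2*(x - 0.8)**2 + 5*(x - 0.8)**7
noisy = lambda x: f(x) * (1 + 1e-10 * np.random.randn())
print(numeric_sparse_shift(noisy, D=7)[0])           # approximately 0.8
```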

2.3 Discussion on the Numeric Algorithm

Step 4 is sensitive to noise and requires more sampling in Step 1. The approximate roots are deemed equal up to a certain tolerance (for instance \(10^{-2}\)). In Steps 5 and 6, the coefficients near \(0\) may be forced to \(0\) (which would accelerate convergence in Step 6). Step 6 is conducted on the model \(f(x;m'_1,\ldots ,m'_k,s') = \sum _{i=1}^{k} m'_i (x-s')^{h_i}\), with the initial condition from Step 5, random samples \(x_j\) and noisy evaluations \(f(x_j)\); the outliers are removed by checking relative errors. If the random samples \(x_j\) are not only taken from the data in Step 1, then oversampling will help de-noise the outputs.

Remark 2

It is unknown to us how, in the exact algorithm, to compute the derivatives with a number of calls to the black box in Step 1 that depends only on \(T\). However, it is reasonable to observe the following: we are only interested in the higher-degree terms of \(f\). Consider the Euclidean division \(f(x) = Q(x)x^q+R(x)\); then, with high numeric precision and large random \(x_i\), we can recover an approximation of \(Q\) by a least squares fit on the samples \(f(x_i)/x_i^q \approx Q(x_i)\).
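
A toy illustration of this observation (the split point \(q\), the polynomials, and the sample range are arbitrary):

```python
import numpy as np
from numpy.polynomial import polynomial as P

# f(x) = Q(x) x^q + R(x) with deg R < q; recover Q from f(x_i)/x_i^q
Q = np.array([-1.0, 0.0, 3.0])                 # Q(x) = 3x^2 - 1
R = np.array([2.0, -5.0])                      # R(x) = -5x + 2
q = 6
f = lambda x: P.polyval(x, Q) * x**q + P.polyval(x, R)

xs = np.random.uniform(1e3, 2e3, 20)           # large random sample points
ys = f(xs) / xs**q                             # = Q(x_i) + R(x_i)/x_i^q
print(P.polyfit(xs, ys, 2))                    # approximately [-1, 0, 3]
```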

3 Numeric Interpolation with Outliers

Blahut’s decoding algorithm for Reed–Solomon codes is based on sparse interpolation. Suppose one has values of

$$\begin{aligned} f(x) = c_{d-1}x^{d-1} + \cdots + c_0\in \mathsf {K}[x],\quad \deg (f)\le d-1 \end{aligned}$$

at powers \(\omega ^i\): \(a_i = f(\omega ^i)\), \(i=0,1,2,\ldots ,n-1\), where \(n=d+2E\). Furthermore, suppose that for some \(k \le E\), where the upper bound \(E\) is known, those values are spoiled by \(k\) outlier errors: \(b_i = a_i+a^\prime _i\), with \(a^\prime _{e_j} \ne 0\) exactly at the indices \(0 \le e_1 < e_2 < \cdots < e_k \le d + 2E-1\). If \(\omega \) is a primitive \(n\)th \(= (d + 2E)\)th root of unity, then the \(n\times n\) Fourier (Vandermonde) matrix \(V(\omega ) = [\omega ^{i\cdot m}]_{0\le i,m\le n-1}\) satisfies

$$\begin{aligned} W = V(\omega )^{-1} = \frac{1}{n} V(\omega ^{-1})\quad \text {where}\quad \omega ^{-1} = \omega ^{n-1}\text {,} \end{aligned}$$
(3)

hence

$$\begin{aligned} W \mathbf b = W \mathbf a + W \mathbf a^\prime = \begin{bmatrix} \mathbf c\\ 0 \end{bmatrix} + \frac{1}{n} V(\omega ^{-1})\mathbf a^\prime . \end{aligned}$$
(4)

The last \(2E\) entries in \(W \mathbf b\) allow sparse interpolation of \(g(x) = \sum _{j=1}^k a^\prime _{e_j} x^{e_j}\):

$$\begin{aligned} c^\prime _l= (V(\omega ^{-1}) \mathbf b\,)_{l} = g(\omega ^{-l}) \quad \text {for }d\le l\le d+2E-1. \end{aligned}$$

Note that all vectors are indexed \(0,1,\ldots ,n-1\), e.g.,

$$\begin{aligned} \mathbf a = \begin{bmatrix}a_0\\ \vdots \\ a_{n-1}\end{bmatrix}\quad \text {and}\quad \mathbf b = \begin{bmatrix}b_0\\ \vdots \\ b_{n-1}\end{bmatrix}. \end{aligned}$$

By our convention, primed \(^{\prime }\) quantities contain outlier information. Thus, as in Sect. 2, the sequence \(c^\prime _d,c^\prime _{d+1},\ldots \) is linearly generated by \(\Lambda (\lambda ) = \prod _{j=1}^k (\lambda -\omega ^{-e_j})\) (called the “error locator polynomial”), which is a squarefree polynomial by virtue of the primitivity of \(\omega \). One may also compute \(\Lambda \) from the reverse sequence \(c^\prime _{d+2E-1},c^\prime _{d+2E-2},\ldots \), which is linearly generated by the reciprocal polynomial \(\prod _{j=1}^k (\lambda -\omega ^{e_j})\).

Since \(k\) is not known, the probabilistic analysis of early termination as in [13] and Sect. 2 does not directly apply, because the choice of \(\omega \) is restricted to a primitive \(n\)th root of unity. Furthermore, the locations \(e_j\) of the outlier errors \(a_{e_j}^\prime \) may depend on the evaluation points \(\omega ^i\). Blahut’s decoding algorithm processes all \(2E\) values \(c^\prime _l\).

If, in addition, each evaluation carries noise \(\varepsilon _i\), namely \(\widetilde{b}_i = a_i + a_i^\prime + \varepsilon _i\), where \(|a_{e_j} - a_{e_j}^\prime |/|a_{e_j}| \gg ~0\) (one may assume that \(\varepsilon _{e_j} = 0\)), then

$$\begin{aligned} W {\tilde{\mathbf {b}}} = \begin{bmatrix}\mathbf c\\ 0\end{bmatrix} + \frac{1}{n} V(\omega ^{-1})\mathbf a^\prime + \frac{1}{n} V(\omega ^{-1}) \mathbf \varepsilon , \end{aligned}$$

so

$$\begin{aligned} \widetilde{c}_l^\prime = (V(\omega ^{-1}) \tilde{\mathbf {b}}\,)_l= g(\omega ^{-l}) + (V(\omega ^{-1}) \mathbf \varepsilon \,)_l= g(\omega ^{-l}) + \bar{\varepsilon }_l\quad \text {for}\quad d\le l\le d+2E-1, \end{aligned}$$

where \(|\bar{\varepsilon }_l| \le |\varepsilon _0|\,+\,\cdots \,+\,|\varepsilon _{n-1}|\). Again, there is an immediate trade-off between noise and outliers: at what magnitude does noise \(\varepsilon _i\) become an outlier \(a_i^\prime \)? For now we assume that the relative error in noise is small, say \(10^{-10}\), while the relative error in outliers is large, say \(10^5\). The recovery of an approximate interpolant \(\widetilde{g}(x) = \sum _{j=1}^k \widetilde{a}_{e_j}^\prime x^{e_j}\) from the evaluations \(\widetilde{c}_l^\prime \) hinges on the condition number of the \(k\times k\) Hankel matrix:

$$\begin{aligned} \widetilde{\fancyscript{H}}_k^\prime = \begin{bmatrix} \widetilde{c}^\prime _{d}&\widetilde{c}^\prime _{d+1}&\ldots&\widetilde{c}^\prime _{d+k-1}\\ \widetilde{c}^\prime _{d+1}&\widetilde{c}^\prime _{d+2}&\ldots&\widetilde{c}^\prime _{d+k} \\ \vdots&\vdots&\ddots&\vdots \\ \widetilde{c}^\prime _{d+k-1}&\widetilde{c}^\prime _{d+k}&\ldots&\widetilde{c}^\prime _{d+2k-2} \end{bmatrix}. \end{aligned}$$

If the matrix is well-conditioned, the error locations \(e_j\) can be determined from the approximate linear generator \({\widetilde{\varLambda }}\) as in the GLL algorithm [3, 7]. As is shown there, the conditioning is bounded by \(1/|\omega ^{e_u} - \omega ^{e_v}|\). In the GLL setting, large values are prevented by randomizing \(\omega \), as the term exponents \(e_j\) are fixed for any evaluation. Using an \(\omega ^r\) instead of \(\omega \) here, where \(\text {GCD}(r,n)=1\), allows a redistribution of the \(\omega ^{e_j}\), but the error locations \(e_j\) may then change, as they can depend on the evaluation points.

A special case is \(k=1\): In that case

$$\begin{aligned} {\widetilde{{\fancyscript{H}}_1}}^\prime = [\widetilde{c}^\prime _d ] = [g(\omega ^{-d})+\bar{\varepsilon }_d] = [a^\prime _{e_1} \omega ^{-de_1} +\bar{\varepsilon }_d], \end{aligned}$$

which, by our assumption on a large outlier \(a^\prime _{e_1}\) and small noise, is a well-conditioned matrix. This is the case we tested in Sect. 4.
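
The following sketch runs this \(k=1\) case end to end: it forms the two values \(\widetilde{c}_d^\prime \), \(\widetilde{c}_{d+1}^\prime \), locates the outlier from their ratio, and then interpolates the remaining evaluations. The polynomial, noise level and outlier size are illustrative choices in the spirit of the experiments in Sect. 4.

```python
import numpy as np

rng = np.random.default_rng(0)

d, E = 12, 1
n = d + 2 * E
omega = np.exp(2j * np.pi / n)
coeffs = rng.integers(-99, 99, d).astype(complex)         # c_0, ..., c_{d-1}

# evaluations a_i = f(omega^i), spoiled by noise and by one outlier at index e1
a = np.array([np.polyval(coeffs[::-1], omega**i) for i in range(n)])
e1 = int(rng.integers(n))
b = a + 1e-10 * (rng.normal(size=n) + 1j * rng.normal(size=n))
b[e1] += 1e5                                               # the outlier error

# the last 2E entries of V(omega^{-1}) b are g(omega^{-l}), l = d, ..., d+2E-1
c_d, c_d1 = (sum(b[m] * omega**(-l * m) for m in range(n)) for l in (d, d + 1))
omega_tilde = c_d1 / c_d                                   # approximates omega^{-e1}
e1_found = int(round(-np.angle(omega_tilde) * n / (2 * np.pi))) % n
print(e1_found == e1)                                      # True: outlier located

# discard the outlier evaluation and recover the coefficients of f
keep = [i for i in range(n) if i != e1_found]
V = np.vander(omega**np.array(keep), d, increasing=True)
recovered = np.linalg.lstsq(V, b[keep], rcond=None)[0]
print(np.max(np.abs(recovered - coeffs)))                  # on the order of 1e-10
```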

Remark 3

When the relative difference between the magnitudes of the outlier \(a_{e_1}^\prime \) and the noise \(\varepsilon _0,\varepsilon _1,\ldots ,\varepsilon _{d+2E-1}\) is not so pronounced, erroneous recovery of the exponent \(e_1\) can occur: with \(E=1\) we have \((\widetilde{c}_d^\prime , \widetilde{c}_{d+1}^\prime , \ldots , \widetilde{c}_{d+2E-1}^\prime ) = (\widetilde{c}_d^\prime ,\widetilde{c}_{d+1}^\prime )\), so the linear generator \({\widetilde{\varLambda }}(\lambda )=\lambda -\omega ^{-e_1}\) can be approximated by computing

$$\begin{aligned} \frac{ \widetilde{c}_{d+1}^\prime }{ \widetilde{c}_d^\prime } = \frac{ a_{e_1}^\prime \omega ^{-(d+1)e_1}+\bar{\varepsilon }_{d+1} }{ a_{e_1}^\prime \omega ^{-de_1}+\bar{\varepsilon }_d } = \omega ^{-e_1} + \frac{ \bar{\varepsilon }_{d+1}-\omega ^{-e_1}\bar{\varepsilon }_d }{ a_{e_1}^\prime \omega ^{-de_1}+\bar{\varepsilon }_d } = \widetilde{\omega }. \end{aligned}$$
(5)

For this reason, we define a bound \(\varepsilon _{\text {noise}}\ge \max _i|\varepsilon _i|\) and assume \(n\varepsilon _{\text {noise}}<|a_{e_1}^\prime |\) so that

$$\begin{aligned} |\widetilde{\omega }-\omega ^{-e_1}| = \left| \frac{ \bar{\varepsilon }_{d+1}-\omega ^{-e_1}\bar{\varepsilon }_d }{ a_{e_1}^\prime \omega ^{-de_1}+\bar{\varepsilon }_d } \right| \le \frac{ |\bar{\varepsilon }_{d+1}|+|\bar{\varepsilon }_d| }{ |a_{e_1}^\prime |-|\bar{\varepsilon }_d| } \le \frac{ 2n\varepsilon _{\text {noise}}}{ |a_{e_1}^\prime |-n\varepsilon _{\text {noise}}} . \end{aligned}$$
(6)

By the distribution of complex roots of unity (of order \(n\)) on the unit circle, we have that \(|\omega ^{s+1}-\omega ^s|=|\omega -1|=2\sin (\uppi /n)\) for any integer \(s\). Thus, \(|\widetilde{\omega }-\omega ^{-e_1}|<\sin (\uppi /n)\) will guarantee \(|\widetilde{\omega }-\omega ^{-e_1}|<|\widetilde{\omega }-\omega ^s|\) for any \(s \not \equiv -e_1 \pmod n\).

Combining this fact with (6) above, we arrive at the sufficient condition

$$\begin{aligned} |\widetilde{\omega }-\omega ^{-e_1}| \le \frac{ 2n\varepsilon _{\text {noise}}}{ |a_{e_1}^\prime |-n\varepsilon _{\text {noise}}} < \sin (\uppi /n) \quad \Leftrightarrow \quad n \varepsilon _{\text {noise}}< |a_{e_1}^\prime | \cdot \frac{ \sin (\uppi /n) }{ 2+\sin (\uppi /n) } . \end{aligned}$$
(7)

Table 1 shows some experiments of decreasing \(\varTheta _\text {outlier}^\text {[abs]}\) for a fixed \(\varepsilon _\text {noise}^\text {[abs]}\). Throughout the experiment, we have \(f(x) = 87\,{x}^{11} -56\,{x}^{10} -62\,{x}^{8} +97\,{x}^{7} -73\,{x}^{4} -4\,{x}^{3} -83\,x-10\) and \(d-1=11\), evaluating at powers of the order \(n = d+2E= 14\) complex root of unity \(\omega =\exp (2\uppi {{\varvec{i}}}\,/14)\). We add to each evaluation noise, which is implemented as a complex number with polar modulus uniformly chosen at random in the range \([0,\varepsilon _\text {noise}^\text {[abs]}]\) and polar argument uniformly chosen at random in the range \([0,2\uppi ]\). An absolute outlier value is chosen the same way, but the modulus is in the range \([\varTheta _\text {outlier}^\text {[abs]},2\varTheta _\text {outlier}^\text {[abs]}]\); the exponent \(e_1\) is also chosen uniformly at random from \(\{0,1,\ldots ,d+2E-1=13\}\). Each row of the table corresponds to 1,000 realizations of the random variable that generates noise and outliers, re-seeding the random number generator with each run. All computations were performed with 15 floating point digits of precision. In the table, \(C_n = \sin (\uppi /n)/\big (2+\sin (\uppi /n)\big )\).

Table 1 Experiments of varying outlier error in the presence of noise

The column “% Circle” shows the percentage of runs where \(|\widetilde{\omega }- \omega ^{-e_1}| < \sin (\uppi /n)\); “% Sector” shows the percentage of runs where \(|\widetilde{\omega }- \omega ^{-e_1}| \ge \sin (\uppi /n)\), but \(|\widetilde{\omega }- \omega ^{-e_1}| < |\widetilde{\omega }-\omega ^s|\) for any \(s \not \equiv -e_1 \pmod n\); “% Wrong” shows the percentage of the remainder of the runs. When the ratio \(\varepsilon _\text {noise}^\text {[abs]}\)/\(\varTheta _\text {outlier}^\text {[abs]}\) is either sufficiently large or small, one can see from (5) that the value of \(\widetilde{\omega }\) is determined mainly by the value of either \(a_{e_1}^\prime \) or \(\bar{\varepsilon }_{d+1}/\bar{\varepsilon }_d\), respectively; this corresponds with the first and last rows of each section of Table 1, where \(\bar{\varepsilon }_{d+1}/\bar{\varepsilon }_d\) is far from \(\omega ^{-e_1}\) in general.

Fig. 1 Examples of varying outlier relative error (labeled as percentages). Noise relative error is approximately \(0.40\,\%\)

However, between the extreme values of \(\widetilde{\omega }\), more interesting behavior can occur. Figure 1 shows two individual algorithm runs of the table rows for \(\varepsilon _\text {noise}^\text {[abs]}=1\). Each power of \(\omega \) is represented by a “\(\times \)”; a circle of radius \(\sin (\uppi /n)\) is drawn around each power of \(\omega \), as well as the corresponding (interior) sector; the solid square denotes \(\omega ^{-e_1}\), while the solid circle denotes \(\bar{\varepsilon }_{d+1}/\bar{\varepsilon }_d\); a complex outlier \(a_{e_1}^\prime =\xi \) is fixed, then the function \(\widetilde{\omega }(t\xi )\) (for \(t\in [2^{-7},2^7]\)) is plotted as a curve, with several points whose label is the relative error of \(t\xi \) compared to \(\omega ^{-e_1}\). In Fig. 1a, outliers of relative error less than \(6\,\%\) cause \(\widetilde{\omega }\) to approach \(0\), so that it becomes infeasible to compute a reliable guess for \(e_1\); here, noise constitutes approximately a \(0.38\,\%\) relative error. By contrast, Fig. 1b shows an example where nearly any outlier relative error greater than \(0.375\,\%\) would result in \(|\widetilde{\omega }-\omega ^s| < \sin (\uppi /n)\) for one of three values of \(s \pmod n\), so that the “nearest \(\omega ^s\) neighbor” criterion is no longer reliable; here, noise constitutes approximately a \(0.40\,\%\) relative error.

Decoding the interpolant \(W \mathbf b\) can also be done via the extended Euclidean algorithm for any \(\omega \) with \(\omega ^{e_u} \ne \omega ^{e_v}\): the Berlekamp–Welch algorithm; see [15]. We will study the numerical properties of variants based on approximate GCD techniques in follow-up work.

4 Implementation and Experiments of Numeric Sparsest Shift

4.1 Illustrative Examples for the Main Algorithm

We reverse engineer a noisy black box for

$$\begin{aligned} f_1(x)&= 2\,(x-7)^3+3\,(x-7)^6-7\,(x-7)^{10}\nonumber \\&=-7 x^{10} + 490 x^9 - 15435 x^8+ 288120 x^7 - 3529467 x^6 \nonumber \\&\quad \,+ 29647422 x^5 - 172941825 x^4 + 691755542 x^3 - 1815804312 x^2\nonumber \\&\quad \,+ 2824450258 x - 1976974482. \end{aligned}$$
(8)

Our algorithm computes with a precision of \(100\) floating-point digits (except in Step 2, where the precision is doubled). To each evaluation, we add random noise causing a relative error of \(1 \times 10^{ -28 }\). For each interpolation problem of a given degree \(i\) in Step 1, we add one outlier error of relative error \(5\). We use \(L = 3\) different \(17\)th roots of unity \(\omega ^{[\ell \,]}\).

Step 1 correctly locates each of the outliers in its \(21 = 3 \times 7\) interpolation calls. The relative \(2\)-norm differences

$$\begin{aligned} \Vert \delta _4^{[\ell \,]}(z) - \widetilde{\delta }_4^{[\ell \,]}(z)\Vert _2/\Vert \delta _4^{[\ell \,]}(z)\Vert _2 \end{aligned}$$

of the coefficient vectors of the \(4 \times 4\) matrix determinants after Step 2 are \(2.126 \times 10^{ -27 }\), \(2.681 \times 10^{ -27 }\), \(6.596 \times 10^{ -27 }\) for \(\ell = 1,2,3\) all within the added noise (after outlier removal).

The polynomial (2) in Step 3 has 4 real roots; among them, the objective function (1) attains its minimum \(2.028 \times 10^{ -57 }\) at \(\widetilde{s} = 6.9989162726\), as opposed to the value \(2.280 \times 10^{ -71 }\) at \(s = 7\) in the exact case (without noise); there is one more root with a much larger objective value.

The sparse model recovered from \(\widetilde{s}\) produces the correct term exponents \(e_1 = 3\), \(e_2 = 6\), and \(e_3 = 10\), and the least squares fit at the non-erroneous \(252 = 273 - 21\) prior black box evaluations produces the approximate model for (8),

$$\begin{aligned} 2.009369 (x - \widetilde{s})^3 +2.998102 (x - \widetilde{s})^6 -6.997705 (x - \widetilde{s})^{10},\quad \widetilde{s} = 6.9989162726. \end{aligned}$$

The relative 2-norm backward error of the model (with respect to the noisy black box evaluations) is \(1.596557 \times 10^{ -3 }\), while that of \(f_1\) itself is \(5.774667 \times 10^{ -28 }\). A similar model can be produced with \(90\) floating-point digits, but not with \(80\).

When doubling the noise to relative error \(2 \times 10^{ -28 }\) with \(100\) floating-point digits, the computed model is

$$\begin{aligned} 2.036489 (x - \widetilde{s})^3 +2.992182 (x - \widetilde{s})^6 -6.991277 (x - \widetilde{s})^{10},\quad \widetilde{s} = 6.9957389337, \end{aligned}$$

with relative 2-norm backward error \(6.222096 \times 10^{ -3 }\), compared to \(1.154933 \times 10^{ -27 }\) for \(f_1\). Even with relative noise of \(4 \times 10^{ -28 }\), the computed model is

$$\begin{aligned} 2.125876 (x - \widetilde{s})^3 +2.967832 (x - \widetilde{s})^6 -6.972579 (x - \widetilde{s})^{10},\quad \widetilde{s} = 6.9848087178, \end{aligned}$$

with relative 2-norm backward error \(2.151040 \times 10^{ -2 }\), compared to \(2.309866 \times 10^{ -27 }\) for \(f_1\). At relative noise of \(8 \times 10^{ -28 }\), the algorithm fails to determine a sparse approximant, even when increasing the number of sequences to \(L=10\).

Such a failure is deceptive. The moderate sparsity, namely 3 out of a maximum of 11 terms, allows denser models to provide good fits. In addition, a shift of \(7\) produces large evaluations at roots of unity, as indicated by the power basis representation of (8). Making the shift smaller and the degree larger, and considering the polynomial

$$\begin{aligned} f_2(x) = 2\,(x-1.55371)^3+3\,(x-1.55371)^6-7\,(x-1.55371)^{15}, \end{aligned}$$

we can recover from \(L =3\) sequences, with a relative noise in the evaluations of \(1 \times 10^{ -14 }\), and again 1 outlier per interpolation, the approximate model

$$\begin{aligned} 1.999718 (x - \widetilde{s})^3 + 2.998609 (x - \widetilde{s})^6 - 7.000117 (x - \widetilde{s})^{15},\quad \widetilde{s} = 1.5537114392, \end{aligned}$$

with relative 2-norm backward error \(8.000329 \times 10^{ -1 }\), compared to \(8 \times 10^{ -1 }\) (to 7 digits) for \(f_2\) itself.

For this particular example, we see a case of the effect mentioned in [3], where the sparse model can fit the noisy evaluations nearly as well as (and sometimes better than) the exact black box.

Increasing the noise further, another model with \(\widetilde{s} = 1.5537013193\) can be recovered with relative noise of \(2 \times 10^{ -13 }\), where now the model and \(f_2\) relative 2-norm backward errors are \(5.180450 \times 10^{ -2 }\) and \(1.108027 \times 10^{ -11 }\), respectively. In this case, a different choice of \(L=3\) different \(17\)th roots of unity was needed in order to compute a sparse model. Both computations used \(357\) black box evaluations.

4.2 Comparison with the Naïve Algorithm

For the examples given above, the naïve algorithm recovers the sparsest representation with noise such as \(1 \times 10^{ -10 }\) and precision \(20\) floating point digits. The precision obtained is close to the level of noise (\(1 \times 10^{ -8 }\) relative error for the shift and \(2 \times 10^{ -10 }\) maximum relative error on the coefficients in the shifted basis). The number of calls to the black box is below \(170\).

For a more demanding example, such as a degree \(55\) polynomial with sparsity \(8\) and a shift between \(1\) and \(2\), a level of relative noise \(1 \times 10^{ -28 }\) is tolerable with precision \(200\) digits (as in an example above). However, the number of calls was above \(600\) to get a relative error of less than \(1 \times 10^{ -20 }\) on the shift and coefficients. Due to the numerical optimization in Step 3, such accuracy is unattainable with the main algorithm, for the moment. The Tschirnhaus early termination was not yet used.

Besides, with more calls to the black box during the Newton iterations, we can further increase the precision on the shift and coefficients; this may, however, be considered de-noising.

We can also run experiments on a black box of the type \(P+Q_{\varepsilon }\), where \(P\) is a polynomial with a sparse shift representation and \(Q_{\varepsilon }\) is a dense polynomial of the same degree with coefficients bounded by \(\varepsilon \); this may be viewed as a perturbation of the coefficients. The described algorithms perform well; however, they do not remove outliers if these are introduced as an erroneous term.

4.3 Discussion

Our preliminary experiments lead to the following conclusions: Our correction of \(1\) outlier per interpolation with Blahut’s numerical decoding is numerically highly reliable. The optimization problem in Step 3 requires substantial precision for its real root finding, and is numerically sensitive when the shift is large and there is noise in the evaluations. Our main algorithm works well without noise and outliers, or, in high precision, with noise when the shift is small and the sparsity is high. We plan to work on a more thorough experimental analysis, including the case of two or more outliers per interpolation. The naïve algorithm provides motivation and potential improvements for the main one. On the other hand, the number of calls to the black box in the former could be reduced.