1 Introduction

The problem of root finding for a polynomial f(z) is a classical problem, dating to antiquity, and it remains the subject of active research to the present day [6]. We consider a classic version of root finding:

(figure a: statement of the local root isolation problem)

It is local because we only look for roots in a locality, as specified by \(B_0\). The local problem is useful in applications (especially in geometric computation) where we know where to look for the roots of interest. There are several variants of this problem: in the global version, we are not given \(B_0\), signifying that we wish to find all the roots of f. The global version is easily reduced to the local one by specifying a \(B_0\) that contains all roots of f. If we omit \(\varepsilon \), it amounts to setting \(\varepsilon =\infty \), representing the pure isolation problem.

Our main interest is a generalization of root isolation, to the lesser-studied problem of root clustering [8, 10, 12]. It is convenient to introduce two definitions: for any set \(S\subseteq {\mathbb {C}}\), let \(Z_f(S)\) denote the set of roots of f in S, and let \(\#_f(S)\) count the total multiplicity of the roots in \(Z_f(S)\). Typically, S is a disc or a box. For boxes and discs, we may write kS (for any \(k>0\)) to denote the dilation of S by factor k, keeping the same center. The following problem was introduced in [16]:
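As a toy illustration of these definitions (and not the method used by Ccluster, which relies on Pellet tests), the count \(\#_f(S)\) for a disc S can be approximated numerically via the argument principle; the function names below are our own:

```python
import cmath

def poly_eval(coeffs, z):
    # Horner evaluation; coeffs listed from highest to lowest degree.
    v = 0
    for c in coeffs:
        v = v * z + c
    return v

def poly_deriv(coeffs):
    # Coefficients of f', same high-to-low convention.
    d = len(coeffs) - 1
    return [c * (d - i) for i, c in enumerate(coeffs[:-1])]

def count_roots_in_disc(coeffs, center, radius, n=512):
    # Argument principle: #_f(disc) = (1/2*pi*i) * contour integral of f'/f
    # over the boundary circle, here approximated by the trapezoid rule.
    dcoeffs = poly_deriv(coeffs)
    total = 0
    for k in range(n):
        w = cmath.exp(2j * cmath.pi * k / n)
        z = center + radius * w
        dz = radius * 2j * cmath.pi / n * w
        total += poly_eval(dcoeffs, z) / poly_eval(coeffs, z) * dz
    return round((total / (2j * cmath.pi)).real)

# f(z) = (z-1)^2 (z+2) = z^3 - 3z + 2
f = [1, 0, -3, 2]
print(count_roots_in_disc(f, 1, 0.5))   # the double root z=1 counts with multiplicity 2
```

Unlike the certified tests discussed below, this sketch assumes no root lies on (or very near) the boundary circle.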

(figure b: statement of the local root clustering problem)

This generalization of root isolation is necessary when we consider polynomials whose coefficients are non-algebraic (or when f(z) is an analytic function, as in [16]). The requirement that \(\#_f(\varDelta _i)=\#_f(3\varDelta _i)\) ensures that our output clusters are natural [1]; a polynomial of degree d has at most \(2d-1\) natural clusters (see [16, Lemma 1]). The local root clustering algorithm for analytic functions of [16] has a termination proof, but no complexity analysis. By restricting f(z) to a polynomial, Becker et al. [2] succeeded in giving an algorithm together with a complexity analysis based on the geometry of the roots. When applied to the benchmark problem, where f(z) is an integer polynomial of degree d with L-bit coefficients, the algorithm can isolate all the roots of f(z) with bit complexity \(\widetilde{O}(d^2(L+d))\). Pan [13] calls such bounds near-optimal (at least when \(L\ge d\)). The clustering algorithm studied in this paper comes from [1], which in turn is based on [2]. Previously, the Pan-Schönhage algorithm had achieved near-optimal bounds with divide-and-conquer methods [13], but [1, 2] gave the first subdivision algorithm to achieve the near-optimal bound for complex roots. For real roots, Sagraloff-Mehlhorn [15] had earlier achieved a near-optimal bound via subdivision.

Why the emphasis on “subdivision”? It is because such algorithms are implementable and quite practical (e.g., [14]). Thus the near-optimal real subdivision algorithm of [15] was implemented shortly after its discovery, and reported in [11] with excellent results. In contrast, the asymptotically efficient root-finding algorithms (not necessarily near-optimal) based on the divide-and-conquer methods of the last 30 years have never been implemented, save for a proof-of-concept implementation of Schönhage’s algorithm reported in Gourdon’s thesis [9]. Computer algebra systems mainly rely on algorithms with a priori guarantees of correctness. But in practice, algorithms without such guarantees are widely used. For complex root isolation, one of the most highly regarded multiprecision packages is MPSolve [3]. The original algorithm in MPSolve was based on the Ehrlich-Aberth (EA) iteration; but since 2014, a “hybrid” algorithm [4] has been used. It is based on the secular equation, and combines ideas from EA and eigensolve [7]. These algorithms are inherently global solvers (they must approximate all roots of a polynomial simultaneously). Another theoretical limitation is that the global convergence of these methods has not been proven.

In this paper, we give a preliminary report about Ccluster, our implementation of the root clustering algorithm from [1].

Fig. 1.

Left: the connected components isolating all roots of the Bernoulli polynomial of degree 100. Right: the connected components isolating all roots of the Spiral polynomial of degree 64.

To illustrate the performance for the local versus global problem, consider the Bernoulli polynomials \(\texttt {Bern}_d(z) \mathrel {:=}\sum _{k=0}^{d} {{d}\atopwithdelims (){k}}b_{d-k}z^k\) where \(b_i\)’s are the Bernoulli numbers. Figure 1(Left) shows the graphical output of Ccluster for \(\texttt {Bern}_{100}(z)\). Table 1 has four timings \(\tau _{X}\) (for \(X=\ell , g, u, s\)) in seconds: \(\tau _\ell \) is the time for solving the local problem over a box \(B_0=[-1,1]^2\); \(\tau _g\) is the time for the global problem over the box \(B_0=[-150,150]^2\) (which contains all the roots). The other two timings from MPSolve (\(\tau _u\) for unisolve, \(\tau _s\) for secsolve) will be explained later. For each instance, we also indicate the numbers of solutions (#Sols) and clusters (#Clus). When #Sols equals #Clus, we know the roots are isolated. Subdivision algorithms like ours naturally solve the local problem, but MPSolve can only solve the global problem. Table 1 shows that MPSolve remains unchallenged for the global problem. But in applications where locality can be exploited, local methods may win, as seen in the last two rows of the table. The corresponding time for Maple’s fsolve is also given; fsolve is not a guaranteed algorithm and may fail.
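For readers who wish to reproduce such inputs, the coefficients of \(\texttt {Bern}_d(z)\) can be generated exactly with a few lines of Python; this standalone sketch (independent of Ccluster) uses the standard recurrence for the Bernoulli numbers:

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    # b_0 .. b_n via the recurrence sum_{j=0}^{m} C(m+1, j) b_j = 0,
    # with the convention b_1 = -1/2 used by the Bernoulli polynomials.
    b = [Fraction(1)]
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))
    return b

def bernoulli_poly(d):
    # Coefficients of Bern_d(z) = sum_{k=0}^d C(d,k) b_{d-k} z^k,
    # listed from degree 0 up to degree d, as exact rationals.
    b = bernoulli_numbers(d)
    return [comb(d, k) * b[d - k] for k in range(d + 1)]

print(bernoulli_poly(3))  # B_3(z) = z^3 - (3/2) z^2 + (1/2) z
```

Exact rational coefficients can then be scaled to the integer inputs expected by a solver.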

Table 1. Bernoulli polynomials with five timings: local (\(\tau _\ell \)), global (\(\tau _g\)), unisolve (\(\tau _u\)), secsolve (\(\tau _s\)) and Maple’s fsolve (\(\tau _f\)).

Overview of Paper. In Sect. 2, we describe the experimental setup for Ccluster. Sections 3–5 describe some techniques for speeding up the basic algorithm. We conclude with Sect. 6.

2 Implementation and Experiments

The main implementation of Ccluster is in the C language. We also provide an interface for Julia. We base our big-number computation on the arb library, which implements ball arithmetic for real numbers, complex numbers and polynomials with complex coefficients. Each arithmetic operation is carried out with error bounds.
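The following toy Python class illustrates the principle of ball arithmetic; arb’s actual representation uses arbitrary-precision midpoints and also accounts for floating-point rounding of the midpoint, which this sketch ignores:

```python
class Ball:
    # Toy ball arithmetic: a value is a midpoint m with radius r,
    # representing every real number in [m - r, m + r].
    def __init__(self, mid, rad=0.0):
        self.mid, self.rad = mid, rad

    def __add__(self, other):
        # Radii add: the sum of the two intervals is enclosed.
        return Ball(self.mid + other.mid, self.rad + other.rad)

    def __mul__(self, other):
        # |xy - m1*m2| <= |m1|*r2 + |m2|*r1 + r1*r2 for x, y in the balls.
        return Ball(self.mid * other.mid,
                    abs(self.mid) * other.rad + abs(other.mid) * self.rad
                    + self.rad * other.rad)

    def contains(self, x):
        return abs(x - self.mid) <= self.rad

x = Ball(2.0, 0.1) * Ball(3.0, 0.2)   # encloses all products from [1.9,2.1]*[2.8,3.2]
print(x.mid, x.rad)                   # midpoint 6.0, radius about 0.72
```

The point of such enclosures is that a comparison can be decided rigorously whenever the balls do not overlap; otherwise, the precision is increased, exactly as in the predicates of Sect. 3.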

Test Suite. We consider 7 families of polynomials: classic ones, as well as some new ones constructed to have interesting clustering or multiple-root structure.

  1. (F1)

    The Bernoulli polynomial \(\texttt {Bern}_d(z)\) of degree d is described in Sect. 1.

  2. (F2)

    The Mignotte polynomial \(\texttt {Mign}_d(z;a) \mathrel {:=}z^d - 2(2^az-1)^2\), for a positive integer a, has two roots whose separation is near the theoretical minimum separation bound.

  3. (F3)

    The Wilkinson polynomials \(\texttt {Wilk}_d(z)\mathrel {:=}\prod _{k=1}^d (z-k)\).

  4. (F4)

    The Spiral polynomial \(\texttt {Spir}_d(z) \mathrel {:=}\prod _{k=1}^d\Big (z - \frac{k}{d}e^{4k{\mathbf{i }}\pi /d}\Big )\). See Fig. 1(Right) for \(\texttt {Spir}_{64}(z)\).

  5. (F5)

    Wilkinson Multiple: \(\texttt {WilkMul}_{(D)}(z)\mathrel {:=}\prod _{k=1}^D (z-k)^k\). \(\texttt {WilkMul}_{(D)}(z)\) has degree \(d=D(D+1)/2\), where the root \(z=k\) has multiplicity k (for \(k=1,\ldots ,D\)).

  6. (F6)

    Mignotte Cluster: \(\texttt {MignClu}_{d}(z;a,k) \mathrel {:=}z^d - 2(2^az-1)^k(2^az+1)^k\). This polynomial has degree d (assuming \(d\ge 2k\)) and has a cluster of k roots near \(2^{-a}\) and a cluster of k roots near \(-2^{-a}\).

  7. (F7)

    Nested Cluster: \(\texttt {NestClu}_{(D)}(z)\) has degree \(d=3^D\) and is defined by induction on D: \(\texttt {NestClu}_{(1)}(z)\mathrel {:=}z^3-1\) with roots \({\varvec{\omega }}, {\varvec{\omega }}^2, {\varvec{\omega }}^3=1\) where \({\varvec{\omega }}=e^{2\pi {\mathbf{i }}/3}\). Inductively, if the roots of \(\texttt {NestClu}_{(D)}(z)\) are \(\left\{ r_j: j=1,\ldots ,3^D \right\} \), then we define \(\texttt {NestClu}_{(D+1)}(z) \mathrel {:=}\prod _{j=1}^{3^D} \Big (z - r_j - \frac{\omega }{16^D}\Big ) \Big (z - r_j - \frac{\omega ^2}{16^D}\Big ) \Big (z - r_j - \frac{1}{16^D}\Big ).\) See Fig. 2 for the natural \(\varepsilon \)-clusters of \(\texttt {NestClu}_{(3)}(z)\).
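The inductive definition of (F7) is easy to realize numerically. The sketch below (floating-point, with a helper name of our own) generates the \(3^D\) roots; each level replaces every root by three nearby copies at distance \(16^{-D}\), producing the nested cluster structure:

```python
import cmath

def nestclu_roots(D):
    # Roots of NestClu_(1): the three cube roots of unity.
    omega = cmath.exp(2j * cmath.pi / 3)
    roots = [omega, omega ** 2, 1]
    # Level D -> D+1: replace each root r by r + omega^t / 16^D, t = 1, 2, 3.
    for level in range(1, D):
        scale = 16 ** level
        roots = [r + w / scale for r in roots for w in (omega, omega ** 2, 1)]
    return roots

print(len(nestclu_roots(3)))  # 3^3 = 27 roots forming nested clusters
```

Expanding the product of \((z - r)\) over these roots would give the coefficients of \(\texttt {NestClu}_{(D)}(z)\), though for large D exact arithmetic is needed to avoid losing the cluster structure to rounding.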

Fig. 2.

Left: 3 clusters of \(\texttt {NestClu}_{(3)}\) found with \(\varepsilon =1\). Right: zoomed view of 9 clusters of \(\texttt {NestClu}_{(3)}\) found with \(\varepsilon =\frac{1}{10}\). Note: the initial box is drawn in thick lines; the thin lines show the subdivision tree.

Timing. Running times are sequential times on an Intel(R) Core(TM) i3 CPU 530 @ 2.93 GHz machine running Linux. Ccluster implements the algorithm described in [1], with differences coming from the improvements described in Sects. 3–5 below. Unless explicitly specified, the value of \(\varepsilon \) for Ccluster is set to \(2^{-53}\); roughly speaking, this amounts to asking for 15 guaranteed decimal digits.

MPSolve. For external comparison, we use MPSolve. It was shown to be superior to major software such as Maple or Mathematica [3]. There are two root solvers in MPSolve: the original unisolve [3], which is based on the Ehrlich-Aberth iteration, and the new hybrid algorithm called secsolve [4]. These are called with the commands and (respectively). -Gi means that MPSolve tries to find for each root a unique complex disc containing it, such that Newton iteration is guaranteed to converge quadratically toward the root starting from the center of the disc. means that \(10^{-\gamma }\) is used as an escape bound, i.e., the algorithm stops when the complex disc containing the root has radius less than \(10^{-\gamma }\), regardless of whether it is isolating or not. Unless explicitly specified, we set \(\gamma =16\). -j1 means that the process is not parallelized. Although MPSolve does not do general local search, it has an option to search only within the unit disc. This option does not seem to lead to much improvement.

Fig. 3.

\(\widetilde{T}^{G}_{k}(\varDelta ,k)\). \(|\tilde{f}|_i\) is the absolute value of the coefficient of the monomial of degree i of \(\tilde{f}\), for \(0\le i \le d\).

3 Improved Soft Pellet Test

The key predicate in [1] is a form of Pellet test denoted \(\widetilde{T}^{G}_{k}(\varDelta ,k)\) (with implicit f(z)). This is modified in Fig. 3 by adding an outer while-loop to control the number of Graeffe-Dandelin iterations. We try to get a definite decision (i.e., anything other than unresolved) from the soft comparison for the current Graeffe iteration. This is done by increasing the precision L for approximating the coefficients of \(\tilde{f}\) in the innermost while-loop. Thus we have two versions of our algorithm: (V1) uses the original \(\widetilde{T}^{G}_{k}(\varDelta ,k)\) of [1], and (V2) uses the modified form in Fig. 3. Let \(\tau \)V1 and \(\tau \)V2 be the timings for the two versions. Table 2 shows the time \(\tau \)V1 (in seconds) and the ratio \(\tau \)V1/\(\tau \)V2. We see that (V2) achieves a consistent 2.3- to 3-fold speedup.
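To make the ingredients concrete, here is a plain floating-point sketch of a Graeffe-Dandelin iteration followed by a classical (unsoft) Pellet count on the unit disc. The real predicate works on arbitrary discs with L-bit approximate coefficients and soft comparisons, so the helper names and simplifications below are ours:

```python
def graeffe(coeffs):
    # One Graeffe-Dandelin iteration: maps f to a polynomial whose roots
    # are the squares of the roots of f. coeffs is low-degree-first.
    d = len(coeffs) - 1
    out = []
    for k in range(d + 1):
        s = (-1) ** k * coeffs[k] ** 2
        for j in range(max(0, 2 * k - d), k):
            s += 2 * (-1) ** j * coeffs[j] * coeffs[2 * k - j]
        out.append(s)
    return out

def pellet_count(coeffs):
    # Classical Pellet test on the unit disc: if |f_k| > sum_{j != k} |f_j|,
    # then f has exactly k roots (counted with multiplicity) in the closed
    # unit disc. Returns that k, or -1 when no index succeeds (unresolved).
    a = [abs(c) for c in coeffs]
    total = sum(a)
    for k in range(len(a)):
        if a[k] > total - a[k]:
            return k
    return -1

def tstar(coeffs, N=4):
    # Sketch of the Graeffe-boosted test: after N root-squaring iterations,
    # roots inside the unit disc shrink and roots outside blow up, which
    # makes the Pellet inequality far more likely to resolve.
    for _ in range(N):
        coeffs = graeffe(coeffs)
    return pellet_count(coeffs)

print(tstar([1.5, -3.5, 1]))  # f = (z-0.5)(z-3): one root in the unit disc
```

A disc \(\varDelta (c,\rho )\) reduces to this setting by the change of variable \(z \mapsto c + \rho z\) (the Taylor shift of Sect. 4).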

Table 2. Solving within the initial box \([-50,50]^2\) with \(\varepsilon =2^{-53}\) with versions (V1), (V2) and (V3) of Ccluster. n1: number of discarding tests. n2: number of discarding tests returning \(-1\) (inconclusive). n3: total number of Graeffe iterations. \(\tau \)V1 (resp. \(\tau \)V2, \(\tau \)V3): sequential time for V1 (resp. V2, V3) in seconds.

In (V2), as in [1], we use \(\widetilde{T}^{G}_{0}(\varDelta )\) (defined as \(\widetilde{T}^{G}_{k}(\varDelta ,0)\)) to prove that a box B has no root. We propose a new version (V3) that uses \(\widetilde{T}^{G}_{*}(\varDelta )\) (defined as \(\widetilde{T}^{G}_{k}(\varDelta ,d)\), where d is the degree of f) instead of \(\widetilde{T}^{G}_{0}(\varDelta )\) for this purpose: instead of just showing that B has no root, it upper bounds \(\#_f(B)\). Although counter-intuitive, this yields a substantial improvement because it leads to fewer Graeffe iterations overall. The timing for (V3) is \(\tau \)V3, but we display only the ratio \(\tau \)V1/\(\tau \)V3 in the last column of Table 2. This ratio shows that (V3) enjoys a 3.3- to 7.7-fold speedup. Comparing n3 for (V2) and (V3) explains this speedup.

4 Filtering

A technique for speeding up the evaluation of predicates is the idea of filters (e.g., [5]). The various Pellet tests can be viewed as a box predicate C that maps a box \(B\subseteq {\mathbb {C}}\) to a truth value. If \(C^-\) is another box predicate with the property that \(C^-(B)=\mathbf{false }\) implies \(C(B)=\mathbf{false }\), we call \(C^-\) a falsehood filter. If \(C^-\) is efficient relative to C, and “efficacious” (informally, \(C(B)=\mathbf{false }\) is likely to yield \(C^-(B)=\mathbf{false }\)), then it is useful to first compute \(C^-(B)\). If \(C^-(B)=\mathbf{false }\), we do not need to compute C(B). The predicate \(C_0\) used in Ccluster  is defined as follows: \(C_0(B)\) is \(\mathbf{true }\) if \(\widetilde{T}^{G}_{*}(\varDelta _B)\) returns 0 (then B contains no root of f) and is \(\mathbf{false }\) if \(\widetilde{T}^{G}_{*}(\varDelta _B)\) returns \(-1\) or some \(k>0\) (then B may contain some roots of f). We next present the falsehood filter \(C_0^-(B)\) for \(C_0\).

Table 3. Solving within the initial box \([-50,50]^2\) with \(\varepsilon =2^{-53}\) with versions (V3), (V4) of Ccluster. n3: number of Graeffe iterations. \(\tau \)V3 and \(\tau \)V4: sequential time in seconds.

Let \(f_{\varDelta }\) denote the Taylor shift of f in \(\varDelta \), \(f_{\varDelta }^{[i]}\) its i-th Graeffe iterate, \((f_{\varDelta }^{[i]})_j\) the j-th coefficient of \(f_{\varDelta }^{[i]}\), and \(|f_{\varDelta }^{[i]}|_j\) the absolute value of the j-th coefficient. Let d be the degree of f. The assertion below is a direct consequence of the classical test of Pellet (see [2, p. 12]) and justifies the correctness of our filters:

(A) if \(|f_{\varDelta }^{[N]}|_0\le |f_{\varDelta }^{[N]}|_1+|f_{\varDelta }^{[N]}|_d\) then \(\widetilde{T}^{G}_{*}(\varDelta )\) returns \(-1\) or \(k>0\).

Our \(C_0^-\) filter computes \(|f_{\varDelta }^{[N]}|_0\), \(|f_{\varDelta }^{[N]}|_1\) and \(|f_{\varDelta }^{[N]}|_d\) and checks the hypothesis of (A) using IntCompare. \(|f_{\varDelta }^{[N]}|_0\) and \(|f_{\varDelta }^{[N]}|_d\) can respectively be computed as \((|f_{\varDelta }|_0)^{2^N}\) and \((|f_{\varDelta }|_d)^{2^N}\). \(|f_{\varDelta }^{[N]}|_1\) can be computed with the following well-known formula:

$$\begin{aligned} (f^{[i+1]}_{\varDelta })_k = (-1)^k((f^{[i]}_{\varDelta })_k)^2 + 2\sum \limits _{j=0}^{k-1} (-1)^j(f^{[i]}_{\varDelta })_j(f^{[i]}_{\varDelta })_{2k-j} \end{aligned}$$
(1)

Obtaining \(|f^{[N]}_{\varDelta }|_1\) with Eq. (1) requires knowing \(2^{N-1}+1\) coefficients of \(f^{[1]}_{\varDelta }\), \(2^{N-2}+1\) coefficients of \(f^{[2]}_{\varDelta }, \ldots ,\) and finally \(3=2^1+1\) coefficients of \(f^{[N-1]}_{\varDelta }\). In particular, it requires computing entirely those iterates \(f^{[i]}_{\varDelta }\) for which \(2^{N-i}\le d\), and this can be done more efficiently than with Eq. (1) (for instance with the formula given in Definition 2 of [2]).

Our \(C_0^-\) filter takes as input a precision L, the Taylor shift \(f_{\varDelta }\) of the L-bit approximation of f, and its i-th Graeffe iterate \(f^{[i]}_{\varDelta }\), where i is such that \(2^{N-i}\le \frac{d}{4}\) and \(2^{N-(i+1)}> \frac{d}{4}\). It computes \(|f_{\varDelta }^{[N]}|_0\), \(|f_{\varDelta }^{[N]}|_d\) and the first \(2^{N-j}+1\) coefficients of \(f^{[j]}_{\varDelta }\) for \(i<j\le N\) with Eq. (1). Then it checks the hypothesis of (A) using IntCompare, and returns \(\mathbf{false }\) if it is verified, and \(\mathbf{true }\) otherwise. In practice, it is implemented within the procedure implementing \(\widetilde{T}^{G}_{*}(\varDelta _B)\).
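The coefficient cascade just described might be sketched as follows, in plain Python over exact numbers (ignoring the L-bit approximations and the IntCompare soft test; the helper names are ours, and for simplicity the cascade starts from \(f_{\varDelta }\) itself rather than from an intermediate iterate):

```python
def graeffe_prefix(coeffs, m):
    # First m+1 coefficients of the next Graeffe iterate via Eq. (1);
    # reads only the first 2m+1 coefficients of the current iterate
    # (coeffs is low-degree-first and must have length >= 2m+1).
    out = []
    for k in range(m + 1):
        s = (-1) ** k * coeffs[k] ** 2
        for j in range(k):
            s += 2 * (-1) ** j * coeffs[j] * coeffs[2 * k - j]
        out.append(s)
    return out

def filter_quantities(coeffs, N):
    # The three quantities used by the C_0^- filter:
    #   |f^[N]|_0 = |f_0|^(2^N),  |f^[N]|_d = |f_d|^(2^N),
    #   |f^[N]|_1 from the cascade of Eq. (1), keeping 2^(N-i)+1
    #   coefficients of the i-th iterate.
    c0 = abs(coeffs[0]) ** (2 ** N)
    cd = abs(coeffs[-1]) ** (2 ** N)
    cur = list(coeffs)
    for i in range(1, N + 1):
        m = 2 ** (N - i)
        # Coefficients above degree d are zero; pad so index 2m is valid.
        padded = cur + [0] * max(0, 2 * m + 1 - len(cur))
        cur = graeffe_prefix(padded, m)
    return c0, abs(cur[1]), cd

def c0_minus_filter(coeffs, N):
    # Assertion (A): if |f^[N]|_0 <= |f^[N]|_1 + |f^[N]|_d, the Pellet test
    # cannot return 0, so report False ("the box may contain roots").
    c0, c1, cd = filter_quantities(coeffs, N)
    return not (c0 <= c1 + cd)
```

For example, for \(f(z)=z-2\) (no root in the unit disc) the filter returns true, while for \(f(z)=z-\frac{1}{2}\) it returns false, correctly signalling that the full test need not be run in the second case only.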

Incorporating \(C_0^-\) into version (V3), we obtain (V4); the speedup can be seen in Table 3. Filtering with \(C_0^-\) becomes more effective as the degree grows, because one then has \(2^{N-i}\le \frac{d}{4}\) for smaller i (recall that \(N=4+\left\lceil {\log _2(1+\log _2(d))} \right\rceil \)).

5 Escape Bound

The \(\varepsilon \) parameter is usually understood as the desired precision for the roots. But we can also view it as an escape bound for multiple roots, as follows: we do not refine a disc that contains a simple root, even if its radius is \(\ge \varepsilon \); but for clusters of size greater than one, we only stop when the radius is \(<\varepsilon \). MPSolve has a similar option. This variant of (V4) is denoted (V4\(^{\prime }\)). We see from Table 4 that (V4\(^{\prime }\)) gives a modest improvement (up to 25% speedup) over (V4) when \(-\log \varepsilon =53\). This improvement generally grows with \(-\log \varepsilon \) (but \(\texttt {WilkMul}_{(11)}(z)\) shows no difference).

Table 4. Solving within the box \([-50,50]^2\) with versions (V4) and (V4\(^{\prime }\)) of Ccluster  with three values of \(\varepsilon \). \(\tau \)53 (resp. \(\tau \)530, \(\tau \)5300): sequential time for (V4) and (V4\(^{\prime }\)) in seconds.

6 Conclusion

Implementing subdivision algorithms is relatively easy, but achieving state-of-the-art performance requires much optimization and low-level development. This paper explores several such techniques. We do well compared to Maple’s fsolve, but for the global problem the performance of MPSolve remains superior to that of Ccluster. Nevertheless, Ccluster can shine when looking for local roots or when \(\varepsilon \) is large.