
1 Introduction

The goal of this paper is to investigate how several optimization techniques for subresultant chain computations benefit polynomial system solving in practice. These optimizations rely on ideas which have appeared in previous works, but without the support of successful experimental studies. Therefore, this paper aims at filling this gap.

The first of these optimizations takes advantage of the Half-GCD algorithm for computing GCDs of univariate polynomials over a field \(\mathbf {k}\). For input polynomials of degree (at most) n, this algorithm runs within \(O(\text {M}(n)\log n)\) operations in \(\mathbf {k}\), where \(\text {M}(n)\) is a polynomial multiplication time, as defined in [12, Chapter 8]. The Half-GCD algorithm originated in the ideas of [16, 18] and [26], while a robust implementation was a challenge for many years. One of the earliest correct designs was introduced in [28].

The idea of speeding up subresultant chain computations by means of the Half-GCD algorithm takes various forms in the literature. In [25], Reischert proposes a fraction-free adaptation of the Half-GCD algorithm, which can be executed over an effective integral domain \({\mathbb {B}}\), within \(O(\text {M}(n)\log n)\) operations in \({\mathbb {B}}\). We are not aware of any implementation of Reischert’s algorithm.

In [20], Lickteig and Roy propose a “divide and conquer” algorithm for computing subresultant chains, the objective of which is to control coefficient growth in defective cases. Lecerf in [17] introduces extensions and a complexity analysis of the algorithm of Lickteig and Roy, with a particular focus on bivariate polynomials. When run over an effective ring endowed with a partially defined division routine, the algorithm yields a running time estimate similar to that of Reischert’s. Lecerf realized an implementation of that algorithm, but observed that computations of subresultant chains based on Ducos’ algorithm [10], or on evaluation-interpolation strategies, were faster in practice.

In [12, Chapter 11], von zur Gathen and Gerhard show how the nominal leading coefficients (see Sect. 2 for this term) of the subresultant chain of two univariate polynomials a, b over a field can be computed within \(O(\text {M}(n)\log n)\) operations in \({\mathbf {k}}\), by means of an adaptation of the Half-GCD algorithm. In this paper, we extend their approach to compute any pair of consecutive non-zero subresultants of a, b within the same time bound. The details are presented in Sect. 3.

Our next optimization for subresultant chain computations relies on the observation that not all non-zero subresultants of a given subresultant chain may be needed. To illustrate this fact, consider two commutative rings \(\mathbb {A}\) and \(\mathbb {B}\), two non-constant univariate polynomials a, b in \({\mathbb {A}}[y]\) and a ring homomorphism \(\varPsi \) from \(\mathbb {A}\) to \(\mathbb {B}\) so that \(\varPsi (\mathrm{lc}(a)) \ne 0\) and \(\varPsi (\mathrm{lc}(b)) \ne 0\) both hold. Then, the specialization property of subresultants (see the precise statement in Sect. 2) tells us that the subresultant chain of \(\varPsi (a), \varPsi (b)\) is the image of the subresultant chain of a, b via \(\varPsi \).

This property has at least two important practical applications. When \(\mathbb {A}\) is a polynomial ring over a field, say \(\mathbb {A}\) is \({\mathbb {Z}}/p{\mathbb {Z}}[x]\) and \(\mathbb {B}\) is \({\mathbb {Z}}/p{\mathbb {Z}}\), then one can compute a GCD of \(\varPsi (a), \varPsi (b)\) via evaluation and interpolation techniques. Similarly, if \(\mathbb {B}\) is \({\mathbb {Q}}[x]/\langle m(x) \rangle \), where m(x) is a square-free polynomial, then \(\mathbb {B}\) is a product of fields; letting \(\mathbb {A}\) be \({\mathbb {Q}}[x]\), one can compute a GCD of \(\varPsi (a), \varPsi (b)\) using the celebrated D5 Principle [8]. More generally, if \(\mathbb {B}\) is \({\mathbb {Q}}[x_1, \ldots , x_n]/\langle T \rangle \), where \(T = (t_1(x_1), \ldots , t_n(x_1,\ldots , x_n))\) is a zero-dimensional regular chain (generating a radical ideal), and \(\mathbb {A}\) is \({\mathbb {Q}}[x_1, \ldots , x_n]\), then one can compute a so-called regular GCD of a and b modulo \(\langle T \rangle \), see [5]. The principle of that calculation generalizes the D5 Principle as follows:

  1.

    if the resultant of a, b is invertible modulo \(\langle T \rangle \) then 1 is a regular GCD of a and b modulo \(\langle T \rangle \);

  2.

    if, for some k, the nominal leading coefficients \(s_0, \ldots , s_{k-1}\) are all zero modulo \(\langle T \rangle \), and \(s_k\) is invertible modulo \(\langle T \rangle \), then the subresultant \(S_k\) of index k of a, b is a regular GCD of a and b modulo \(\langle T \rangle \); and

  3.

    one can always reduce to one of the above two cases by splitting T, when a zero-divisor of \(\mathbb {B}\) is encountered.

In practice, in the above procedure, k is often zero, which can be seen as a consequence of the celebrated Shape Lemma [4]. This suggests computing the subresultant chain of a, b in \({\mathbb {A}}[y]\) speculatively. To be precise, and taking advantage of the Half-GCD algorithm, it is desirable to compute the subresultants of index 0 and 1, delaying the computation of subresultants of higher index until proven necessary.

We discuss that idea of computing subresultants speculatively in Sect. 3. Making that approach successful, in comparison to non-speculative approaches, requires overcoming several obstacles:

  1.

    computing efficiently the subresultants \(S_0\) and \(S_1\), via the Half-GCD; and

  2.

    developing an effective “recovery” strategy in case of “misprediction”, that is, when subresultants of index higher than 1 turn out to be needed.

To address the first obstacle, our implementation combines various schemes for the Half-GCD, inspired by the work done in NTL [27]. To address the second obstacle, when we compute the subresultants of index 0 and 1 via the Half-GCD, we record or cache the sequence of quotients (associated with the Euclidean remainders) so as to easily obtain subresultants of index higher than 1, if needed.

There are subresultant algorithms in almost all computer algebra software. Most notably, the RegularChains library [19] in Maple provides three different algorithms to compute the entire chain, based on Ducos’ optimization [9], the Bézout matrix [1], or evaluation-interpolation based on FFT. Each one is well-suited for a particular type of input polynomials w.r.t. the number of variables and the coefficient ring; see the Maple help page for the SubresultantChain command. Similarly, the Algebramix library in Mathemagix [14] implements different subresultant algorithms, including routines based on evaluation-interpolation, Ducos’ algorithm, and an enhanced version of the Lickteig–Roy algorithm [17].

The extensive experimentation results in Sect. 5 indicate that the performance of our routines for univariate polynomials over finite fields (based on FFT) is closely comparable with that of their counterparts in NTL. In addition, we have aggressively tuned our subresultant schemes based on evaluation-interpolation techniques. Our modular subresultant chain algorithms are up to 10\(\times \) and 400\(\times \) faster than non-modular counterparts (mainly Ducos’ subresultant chain algorithm) in \(\mathbb {Z}[y]\) and \(\mathbb {Z}[x, y]\), respectively. Further, utilizing the Half-GCD algorithm to compute subresultants yields an additional speed-up factor of 7\(\times \) and 2\(\times \) for polynomials in \(\mathbb {Z}[y]\) and \(\mathbb {Z}[x, y]\), respectively.

Further still, we present a third optimization for subresultant chain computations through a simple improvement of Ducos’ subresultant chain algorithm. In particular, we consider memory usage and data locality to improve practical performance; see Sect. 4. We have implemented both the original Ducos algorithm [10] and our optimized version over arbitrary-precision integers. For univariate polynomials of degree as large as 2000, the optimized algorithm uses 3.2\(\times \) and 11.7\(\times \) less memory, respectively, than our implementation of the original Ducos’ algorithm and the implementation of Ducos’ algorithm in Maple.

All of our code, providing also univariate and multivariate polynomial arithmetic, is open source and part of the Basic Polynomial Algebra Subprograms (BPAS) library available at www.bpaslib.org. Our many subresultant schemes have been integrated, tested, and utilized in the multithreaded BPAS polynomial system solver [3].

This paper is organized as follows. Section 2 presents a review of subresultant theory following the presentations of [9] and [15]. Our modular method to compute subresultants speculatively via Half-GCD is discussed in Sect. 3. Section 4 examines practical memory optimizations for Ducos’ subresultant chain algorithm. Lastly, implementation details and experimental results are presented in Sect. 5.

2 Review of Subresultant Theory

In this review of subresultant theory, we follow the presentations of [9] and [15]. Let \(\mathbb {B}\) be a commutative ring with identity and let \(m \le n\) be positive integers. Let M be an \(m \times n\) matrix with coefficients in \(\mathbb {B}\). Let \(M_i\) be the square submatrix of M consisting of the first \(m-1\) columns of M and the i-th column of M, for \(m \le i \le n\); let \(\text{ det }(M_i)\) be the determinant of \(M_i\). The determinantal polynomial of M, denoted by \({\text{ dpol }(M)}\), is a polynomial in \(\mathbb {B}[y]\), given by

$$\begin{aligned} \text{ dpol }(M) = \text{ det }(M_m) y^{n-m} + \text{ det }(M_{m+1}) y^{n-m-1} + \cdots + \text{ det }(M_n). \end{aligned}$$

Note that, if \({\text{ dpol }(M)}\) is not zero, then its degree is at most \(n-m\). Let \(f_1, \ldots , f_m\) be polynomials of \({\mathbb {B}}[y]\) of degree less than n. We denote by \(\text{ mat }(f_1, \ldots , f_m)\) the \(m \times n\) matrix whose i-th row contains the coefficients of \(f_i\), sorted in order of decreasing degree, and such that \(f_i\) is treated as a polynomial of degree \(n-1\). We denote by \(\text{ dpol }(f_1, \ldots , f_m)\) the determinantal polynomial of \(\text{ mat }(f_1, \ldots , f_m)\).

Let \(a, b \in {\mathbb {B}}[y]\) be non-constant polynomials of respective degrees \(m = \deg (a),\) \(n = \deg (b)\) with \(m \ge n\). The leading coefficient of a w.r.t. y is denoted by \(\mathrm{lc}(a)\). Let k be an integer with \(0 \le k < n\). Then, the k-th subresultant of a and b (also known as the subresultant of index k of a and b), denoted by \(S_k(a, b)\), is

$$\begin{aligned} S_k(a, b) = \text{ dpol }(y^{n-k-1}a, y^{n-k-2}a, \ldots , a, y^{m-k-1}b,\ldots , b). \end{aligned}$$

This is a polynomial which belongs to the ideal generated by a and b in \({\mathbb {B}}[y]\). In particular, \(S_0(a, b)\) is the resultant of a and b, denoted by \({\text{ res }(a,b)}\). Observe that if \(S_k(a, b)\) is not zero then its degree is at most k. If \(S_k(a, b)\) has degree k, then \(S_k(a, b)\) is said to be non-defective or regular; if \(S_k(a, b) \ne 0\) and \(\deg (S_k(a, b)) < k\), then \(S_k(a, b)\) is said to be defective. We call k-th nominal leading coefficient, denoted by \(s_k\), the coefficient of \(S_k(a, b)\) in \(y^k\). Observe that if \(S_k(a, b)\) is defective, then we have \(s_k = 0\). For convenience, we extend the definition to the n-th subresultant as follows:

$$ S_n(a, b) = \left\{ \begin{array}{ll} \gamma (b)b, &{} \ \text {if } m > n \text { or }\mathrm{lc}(b)\in \mathbb {B}\text { is regular} \\ \text {undefined}, &{} \ \text {otherwise} \end{array} \right. , $$

where \(\gamma (b) ={ \mathrm{lc}(b)}^{m-n-1}\). In the above, regular means not a zero-divisor. Note that when m equals n and \(\mathrm{lc}(b)\) is a regular element in \(\mathbb {B}\), then \(S_n(a, b)=\mathrm{lc}(b)^{-1}b\) is in fact a polynomial over the total fraction ring of \(\mathbb {B}\). We call specialization property of subresultants the following property. Let \(\mathbb {A}\) be another commutative ring with identity and \(\varPsi \) a ring homomorphism from \(\mathbb {B}\) to \(\mathbb {A}\) such that we have \(\varPsi (\mathrm{lc}(a)) \ne 0\) and \(\varPsi (\mathrm{lc}(b)) \ne 0\). Then, for \(0 \le k \le n\), we have \(S_k(\varPsi (a),\varPsi (b)) = \varPsi (S_k(a, b)).\)

From now on, we assume that the ring \({\mathbb {B}}\) is an integral domain. Writing \({\delta } = \deg (a) - \deg (b)\), there exists a unique pair \((q, r)\) of polynomials in \({\mathbb {B}}[y]\) satisfying \(h a = q b + r\), where \(h = {\mathrm{lc}(b)}^{\delta + 1}\), and either \(r = 0\) or \(\deg (r) < \deg (b)\); the polynomials q and r, denoted respectively \(\text{ pquo }(a,b)\) and \(\text{ prem }(a,b)\), are the pseudo-quotient and pseudo-remainder of a by b. The subresultant chain of a and b, defined as \(\text{ subres }(a, b) = ( S_{n}(a, b), S_{n-1}(a, b), S_{n-2}(a, b), \ldots , S_0(a, b)),\) satisfies relations which induce a Euclidean-like algorithm for computing the entire subresultant chain \(\text{ subres }(a, b)\). This algorithm runs within \(O(n^2)\) operations in \({\mathbb {B}}\), when \(m=n\), see [9]. For convenience, we simply write \(S_{k}\) instead of \(S_{k}(a, b)\) for each k. We write \(a \sim b\), for \(a, b \in {\mathbb {B}}[y]\), whenever a, b are associate elements in \(\text{ frac }(\mathbb {B})[y]\), the field of fractions of \(\mathbb {B}\). Then for \(1 \le k < n\), we have:

(i):

\(S_{n-1} = \text{ prem }(a, -b)\); if \(S_{n-1}\) is non-zero, defining \(e := {\deg }(S_{n-1})\), then we have:

$$\begin{aligned} S_{e-1} = \frac{\text{ prem }(b, - S_{n-1})}{{\mathrm{lc}(b)}^{(m-n)(n-e)+1}}, \end{aligned}$$
(ii):

if \(S_{k-1} \ne 0\), defining \(e := {\deg }(S_{k-1})\) and assuming \(e < k-1\) (thus assuming \(S_{k-1}\) defective), then we have:

(a):

\({\deg }(S_{k}) = k\), thus \(S_{k}\) is non-defective,

(b):

\(S_{k-1} \sim S_e\) and \({\mathrm{lc}(S_{k-1})}^{k-e-1}S_{k-1} = {s_k}^{k-e-1} {S_e}\), thus \(S_e\) is non-defective,

(c):

\(S_{k-2} = S_{k-3} = \cdots = S_{e+1} = 0\),

(iii):

if both \(S_k\) and \(S_{k-1}\) are non-zero, with respective degrees k and e then we have:

$$\begin{aligned} S_{e-1} = \frac{\text{ prem }(S_k, - S_{k-1})}{{\mathrm{lc}(S_k)}^{k-e+1}}. \end{aligned}$$
[Algorithm 1]

Algorithm 1 from [10] is a known version of this procedure that computes all non-zero subresultants of \(a, b \in \mathbb {B}[y]\). Note that the core of this algorithm is the while-loop in which the computation of the subresultants \(S_{e}\) and \(S_{e-1}\), with the notations of the above points (ii) and (iii), is carried out.
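As an illustration, the chain of relations above can be transcribed into a short routine over \(\mathbb {Z}\). The sketch below is ours, not BPAS code: polynomials are coefficient lists (lowest degree first), and the helper names (prem, subres_chain) are hypothetical.

```python
def deg(p): return len(p) - 1
def lc(p): return p[-1]

def trim(p):
    while p and p[-1] == 0:
        p.pop()
    return p

def sub(p, q):
    n = max(len(p), len(q))
    p, q = p + [0]*(n - len(p)), q + [0]*(n - len(q))
    return trim([x - y for x, y in zip(p, q)])

def prem(a, b):
    """Pseudo-remainder r with lc(b)^(deg a - deg b + 1) a = q b + r."""
    r, k = a[:], deg(a) - deg(b) + 1
    while r and deg(r) >= deg(b):
        shifted = [0]*(deg(r) - deg(b)) + [lc(r)*c for c in b]
        r = sub([lc(b)*c for c in r], shifted)
        k -= 1
    return [lc(b)**k * c for c in r]

def subres_chain(a, b):
    """Non-zero subresultants of a, b (deg a > deg b > 0), by decreasing index."""
    out = []
    s = lc(b)**(deg(a) - deg(b))
    A, B = b, prem(a, [-c for c in b])
    while B:
        d, e = deg(A), deg(B)
        out.append(B)                                   # B = S_{d-1}
        if d - e > 1:                                   # defective case: recover S_e by (ii)(b)
            C = [lc(B)**(d - e - 1) * c // s**(d - e - 1) for c in B]
            out.append(C)
        else:
            C = B
        if e == 0:
            break
        # next subresultant S_{e-1}, by relation (iii); the division is exact
        B = [c // (s**(d - e) * lc(A)) for c in prem(A, [-c for c in B])]
        A, s = C, lc(C)
    return out
```

The divisions above are exact over \(\mathbb {Z}\); for \(a = y^3 - y\) and \(b = y^2 - 4\), the routine returns \(S_1 = 3y\) and \(S_0 = \text{res}(a, b) = -36\).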

3 Computing Subresultant Chains Speculatively

As discussed in the introduction, when the ring \({\mathbb {B}}\) is a field \(\mathbf {k}\), the computation of the subresultant chain of the polynomials \(a, b \in {\mathbb {B}}[y]\) can take advantage of asymptotically fast algorithms for computing \(\text{ gcd }(a,b)\). After recalling its specifications, we explain how we take advantage of the Half-GCD algorithm in order to compute the subresultants in \(\text{ subres }(a, b)\) speculatively.

Consider two non-zero univariate polynomials \(a, b \in \mathbf {k}[y]\) with \(n_0 := \deg (a),\) \(n_1 := \deg (b)\) with \(n_0 \ge n_1\). The extended Euclidean algorithm (EEA) computes the successive remainders \((r_0 := a, r_1 := b, r_2, \ldots , r_{\ell } = \text{ gcd }(a,b))\) with degree sequence \((n_0, n_1, n_2 :=\deg (r_2), \ldots , n_{\ell } :=\deg (r_{\ell }))\) and the corresponding quotients \((q_1, q_2, \ldots , q_{\ell })\) defined by \(r_{i+1} = \text{ rem }(r_{i-1}, r_{i}) = r_{i-1} - q_{i}r_{i}\) and \(q_i = \text{ quo }(r_{i-1}, r_{i})\) for \(1 \le i \le {\ell }\), with \(n_{i+1} < n_{i}\) for \(1 \le i < {\ell }\), and \(r_{{\ell }+1} = 0\) with \(\deg (r_{\ell +1}) = -\infty \). This computation requires \(O(n^2)\) operations in \(\mathbf {k}\). We denote by \(Q_i\) the quotient matrices, defined, for \(1 \le i \le {\ell }\), by \(Q_i = \begin{bmatrix} 0 &{} 1 \\ 1 &{} -q_i \end{bmatrix},\) so that, for \(1 \le i < {\ell }\), we have

$$\begin{aligned} \begin{bmatrix} r_{i} \\ r_{i+1} \end{bmatrix} = Q_i \begin{bmatrix} r_{i-1} \\ r_{i} \end{bmatrix}. \end{aligned}$$
(1)

We define \(m_i := \deg (q_i)\), so that we have \(m_i = n_{i-1} - n_{i}\) for \(1 \le i \le {\ell }\). The degree sequence \((n_0, \ldots , n_{\ell })\) is said to be normal if \(n_{i+1} = n_{i} - 1\) holds, for \(1 \le i < {\ell }\), or, equivalently, if \(\deg (q_i) = 1\) holds, for \(2 \le i \le {\ell }\).
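For concreteness, the remainder and quotient sequences can be generated with classical (quadratic) arithmetic over a small prime field; the sketch and names below (polydivmod, eea_sequences, the prime p = 101) are illustrative, not the BPAS implementation.

```python
P = 101  # a small illustrative prime

def polydivmod(a, b, p=P):
    """Quotient and remainder of a by b over Z_p; coefficient lists, lowest degree first."""
    r = [c % p for c in a]
    q = [0] * max(len(r) - len(b) + 1, 1)
    inv = pow(b[-1], -1, p)                  # lc(b)^(-1) mod p
    while len(r) >= len(b):
        k = len(r) - len(b)
        c = r[-1] * inv % p
        q[k] = c
        for i, bc in enumerate(b):
            r[i + k] = (r[i + k] - c * bc) % p
        while r and r[-1] == 0:
            r.pop()
    return q, r

def eea_sequences(a, b, p=P):
    """Remainders (r_0, ..., r_l) and quotients (q_1, ..., q_l) of the EEA."""
    rems, quos = [a, b], []
    while rems[-1]:
        qq, rr = polydivmod(rems[-2], rems[-1], p)
        quos.append(qq)
        rems.append(rr)
    return rems[:-1], quos
```

For \(a = y^3 - y\) and \(b = y^2 - 4\) modulo 101, the degree sequence is (3, 2, 1, 0) and all three quotients have degree 1, so the sequence is normal.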

Using the remainder and degree sequences of non-zero polynomials \(a, b \in \mathbf {k}[y]\), Proposition 1, known as the fundamental theorem on subresultants, introduces a procedure to compute the nominal leading coefficients of polynomials in the subresultant chain.

Proposition 1

For \(k = 0, \ldots , n_1\), the nominal leading coefficient \(s_{k}\) of the k-th subresultant of (a, b) is zero, unless there exists \(i \le \ell \) such that \(k = \deg (r_i)\), in which case

$$ s_{k} = (-1)^{\tau _{i}} \prod _{1 \le j < i} \mathrm{lc}(r_j)^{n_{j-1} - n_{j+1}} \mathrm{lc}(r_{i})^{n_{i-1} - n_{i}}, $$

where \(\tau _i = \sum _{1 \le j < i} (n_{j-1} - n_{i})(n_{j} - n_{i})\)  [12, Theorem 11.16].
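As a sanity check, the formula can be evaluated on a small example. The data below (the Euclidean remainders of \(a = y^3 - y\) and \(b = y^2 - 4\) over \(\mathbb {Q}\)) is hard-coded, and nominal_lc is our hypothetical helper name, not a routine from the paper.

```python
from math import prod

# Euclidean remainders of a = y^3 - y, b = y^2 - 4 over Q:
# r_0 = a, r_1 = b, r_2 = 3y, r_3 = -4
n = [3, 2, 1, 0]      # degree sequence n_i = deg(r_i)
lcs = [1, 1, 3, -4]   # leading coefficients lc(r_i)

def nominal_lc(i):
    """s_k for k = n[i], via the fundamental theorem on subresultants."""
    tau = sum((n[j-1] - n[i]) * (n[j] - n[i]) for j in range(1, i))
    head = prod(lcs[j] ** (n[j-1] - n[j+1]) for j in range(1, i))
    return (-1) ** tau * head * lcs[i] ** (n[i-1] - n[i])

print(nominal_lc(3), nominal_lc(2))  # s_0 = res(a, b) = -36, s_1 = 3
```

The value \(s_0 = -36\) agrees with the resultant of a and b, and \(s_1 = 3\) is the leading coefficient of the regular subresultant \(S_1 = 3y\).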

The Half-GCD, also known as the fast extended Euclidean algorithm, is a divide and conquer algorithm for computing a single row of the EEA, say the last one. This can be interpreted as the computation of a \(2 \times 2\) matrix Q over \(\mathbf {k}[y]\) so that we have:

$$\begin{aligned} \begin{bmatrix} \text{ gcd }(a,b) \\ 0 \end{bmatrix} = Q \begin{bmatrix} a \\ b \end{bmatrix}. \end{aligned}$$

The major difference between the classical EEA and the Half-GCD algorithm is that, while the EEA computes all the remainders \(r_0, r_1, \ldots , r_{\ell } = \text{ gcd }(r_0, r_1)\), the Half-GCD computes only two consecutive remainders; these are derived from the \(Q_i\) quotient matrices, which in turn are obtained from a sequence of “truncated remainders” rather than the original remainders \(r_i\).

Here, we take advantage of the Half-GCD algorithm presented in [12, Chapter 11]. For a non-negative integer \(k \le n_0\), this algorithm computes the quotients \(q_1, \ldots , q_{h_k}\) where \(h_k\) is defined as

$$\begin{aligned} h_k = \text {max}\Big \{ 0 \le j \le \ell \ \vert \ \sum _{i = 1}^{j} m_i \le k \Big \}, \end{aligned}$$
(2)

the maximum \(j \in \mathbb {N}\) so that \(\sum _{1 \le i \le j} \deg (q_i) \le k\). This is done within \((22\text {M}(k) + O(k))\log k\) operations in \(\mathbf {k}\). From Eq. 2, \(h_{k} \le \min (k, \ell )\), and

$$\begin{aligned} \sum \limits _{i = 1}^{h_k} m_i = \sum \limits _{i = 1}^{h_k} (n_{i-1} - n_{i}) = n_0 - n_{h_k} \le k < \sum \limits _{i = 1}^{h_{k}+1} m_i = n_0 - n_{h_{k}+1}. \end{aligned}$$
(3)

Thus, \( n_{h_{k}+1} < n_0 - k \le n_{h_{k}}\), and so \(h_{k}\) can be uniquely determined; see Algorithm 11.6 in [12] for more details.
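The index \(h_k\) is straightforward to compute once the quotient degrees \(m_i\) are known; a minimal sketch (the function name is ours):

```python
def h_k(k, m):
    """Largest j with m_1 + ... + m_j <= k (Eq. 2); m lists the quotient degrees."""
    total, j = 0, 0
    for mi in m:
        if total + mi > k:   # adding the next quotient degree would exceed k
            break
        total += mi
        j += 1
    return j
```

For quotient degrees m = (1, 1, 2), one gets \(h_0 = 0\), \(h_1 = 1\), \(h_2 = h_3 = 2\), and \(h_4 = 3\), consistent with the inequalities of Eq. 3.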

Due to the deep relation between subresultants and the remainders of the EEA, the Half-GCD technique can support computing subresultants. This approach is studied in [12]. The Half-GCD algorithm is used to compute the nominal leading coefficient of subresultants up to \(s_{\rho }\) for \(\rho = n_{h_{k}}\) by computing the quotients \(q_1, \ldots , q_{h_{k}}\), calculating the \(\mathrm{lc}(r_{i}) = {\mathrm{lc}(r_{i-1})}/{\mathrm{lc}(q_i)}\) from \(\mathrm{lc}(r_{0})\) for \(1 \le i \le h_{k}\), and applying Proposition 1. The resulting procedure runs within the same complexity as the Half-GCD algorithm.

However, to compute two successive subresultants \(S_{n_v}, S_{n_{v+1}}\), given \( 0 \le \rho < n_1\) and \(0 \le v < \ell \) such that \(n_{v+1} \le \rho < n_{v}\), we need to compute the quotients \(q_1, \ldots , q_{h_{\rho }}\) where \(h_{\rho }\) is defined as

$$\begin{aligned} h_{\rho } = \text {max}\Big \{ 0 \le j < \ell \ \vert \ n_j > \rho \Big \}, \end{aligned}$$
(4)

using Half-GCD. Letting \(k = n_0 - \rho \), Eqs. 3 and 4 yield \( n_{h_{\rho } +1} \le n_0 - k < n_{h_{\rho }}\), and \(h_{\rho } \le h_k\). So, to compute the array of quotients \(q_1, \ldots , q_{h_{\rho }}\), we can utilize an adaptation of the Half-GCD algorithm of [12]. Algorithm 2 is this adaptation, and it runs within the same complexity as the algorithm of [12].

[Algorithm 2]

Algorithm 2 receives as input two polynomials \(r_0 :=a, r_1 :=b\) in \(\mathbf {k}[y]\), with \(n_0 \ge n_1\), \(0 \le k \in \mathbb {N}\), \(\rho \le n_0\) where \(\rho \), by default, is \(n_0-k\), and the array \(\mathcal {A}\) of the leading coefficients of the remainders that have been computed so far. This array should be initialized to size \(n_0 + 1\) with \(\mathcal {A}[n_0] = \mathrm{lc}(r_0)\) and \(\mathcal {A}[i] = 0\) for \(0 \le i < n_0\). \(\mathcal {A}\) is updated in-place as necessary. The algorithm returns the array of quotients \(\mathcal {Q}:=( q_1, \ldots , q_{h_{\rho }} )\) and matrix \(M :=Q_{h_{\rho }} \cdots Q_1\).

Algorithm 2 and the fundamental theorem on subresultants yield Algorithm 3. This algorithm is a speculative subresultant algorithm based on Half-GCD to calculate two successive subresultants without computing others in the chain. Moreover, this algorithm returns intermediate data that has been computed by the Half-GCD algorithm—the array \(\mathcal {R}\) of the remainders, the array \(\mathcal {Q}\) of the quotients and the array \(\mathcal {A}\) of the leading coefficients of the remainders in the Euclidean sequence—to later calculate higher subresultants in the chain without calling Half-GCD again. This caching scheme is shown in Algorithm 4.

[Algorithm 3]

Let us explain this technique with an example. Consider non-zero polynomials \(a, b \in \mathbf {k}[y]\) with \(n_0 = \deg (a)\) and \(n_1 = \deg (b)\), so that \(n_0 \ge n_1\) holds. The call \(\text{ Subresultant }(a, b, 0)\) speculatively returns \(S_0(a, b), S_1(a, b)\), without computing \((S_{n_1}, S_{n_1-1}, S_{n_1-2}, \ldots , S_2)\), along with the arrays \(\mathcal {Q}= (q_1, \ldots , q_{\ell })\), \(\mathcal {R}= (r_{\ell }, r_{\ell -1})\), and \(\mathcal {A}\). Therefore, any later request for subresultants of higher index can be served using the arrays \(\mathcal {Q}, \mathcal {R}, \mathcal {A}\) instead of calling Half-GCD again. In the Triangularize algorithm for solving systems of polynomial equations by triangular decomposition, the RegularGCD subroutine relies on this technique for improved performance; see [3, 5] for more details and algorithms.

[Algorithm 4]

For polynomials \(a, b \in \mathbb {Z}[y]\) with integer coefficients, a modular algorithm can be achieved by utilizing the Chinese remainder theorem (CRT). In this approach, we use Algorithms 2 and 3 for a prime field \(\mathbf {k}\). We define \(\mathbb {Z}_{p}[y]\) as the ring of univariate polynomials with coefficients in \(\mathbb {Z}/p\mathbb {Z}\), for some prime p. Further, we use an iterative and probabilistic approach to CRT from [22]. We iteratively calculate subresultants modulo different primes \(p_0, p_1, \ldots \), continuing to add modular images to the CRT direct product \(\mathbb {Z}_{p_0} \otimes \cdots \otimes \mathbb {Z}_{p_{i}}\) for \(i \in \mathbb {N}\) until the reconstruction stabilizes. That is to say, the reconstruction does not change from \(\mathbb {Z}_{p_0} \otimes \cdots \otimes \mathbb {Z}_{p_{i-1}}\) to \(\mathbb {Z}_{p_0} \otimes \cdots \otimes \mathbb {Z}_{p_{i}}\).
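The incremental reconstruction with early termination can be sketched as follows; the stabilization test and the names are ours, and a real implementation works coefficient-wise on polynomials rather than on a single integer as here.

```python
def crt_pair(r1, m1, r2, m2):
    """Combine x = r1 (mod m1) and x = r2 (mod m2), with gcd(m1, m2) = 1."""
    t = (r2 - r1) * pow(m1, -1, m2) % m2
    return r1 + m1 * t, m1 * m2

def symmetric(r, m):
    """Representative of r mod m in the symmetric range (-m/2, m/2]."""
    return r - m if r > m // 2 else r

def reconstruct(images):
    """images: (value mod p, p) pairs; stop once the symmetric-range
    reconstruction does not change when a new prime is added."""
    it = iter(images)
    r, m = next(it)
    prev = symmetric(r % m, m)
    for rp, p in it:
        r, m = crt_pair(r, m, rp, p)
        cur = symmetric(r % m, m)
        if cur == prev:          # reconstruction has stabilized
            return cur
        prev = cur
    return prev
```

For instance, the integer -36 is recovered from its images modulo 101 and 103, the second prime merely confirming the reconstruction obtained from the first.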

We further exploit this technique to compute subresultants of bivariate polynomials over prime fields and the integers. Let \(a, b \in \mathbb {B}[y]\) be polynomials with coefficients in \(\mathbb {B}= \mathbb {Z}_{p}[x]\), thus \(\mathbb {B}[y] = \mathbb {Z}_p[x, y]\), where the main variable is y and \(p \in \mathbb {N}\) is an odd prime. A desirable subresultant algorithm then uses an evaluation-interpolation scheme and the aforementioned univariate routines to compute subresultants of univariate images of ab over \(\mathbb {Z}_p[y]\) and then interpolates back to obtain subresultants over \(\mathbb {Z}_{p}[x, y]\). This approach is well-studied in [22] to compute the resultant of bivariate polynomials. We can use the same technique to compute the entire subresultant chain, or even particular subresultants speculatively through Algorithms 2 and 3.

We begin by choosing a set of \(N \in \mathbb {N}\) evaluation points and evaluating each coefficient of \(a, b \in \mathbb {Z}_{p}[x, y]\) with respect to the main variable y at those points. Then, we call the subresultant algorithm to compute subresultant images over \(\mathbb {Z}_{p}[y]\). Finally, we retrieve the bivariate subresultants by interpolating each coefficient of each subresultant from these images. The number of evaluation points is determined from an upper bound on the degree of subresultants and resultants with respect to x. From [12], the following inequality holds: \(N \ge \deg (b, y) \deg (a, x) + \deg (a, y) \deg (b, x) + 1.\)
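The evaluation-interpolation scheme can be sketched end to end for the resultant (the subresultant of index 0). In this sketch of ours, univariate resultants are computed via a Sylvester-matrix determinant rather than the EEA-based routines described above, and we assume that the evaluation points do not cancel the leading coefficients and that p exceeds the number of points; all names are illustrative.

```python
P = 101

def polymul(a, b, p=P):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def resultant(a, b, p=P):
    """res(a, b) over Z_p as the determinant of the Sylvester matrix."""
    m, n = len(a) - 1, len(b) - 1
    size = m + n
    rows = [[0]*i + a[::-1] + [0]*(n - 1 - i) for i in range(n)]
    rows += [[0]*i + b[::-1] + [0]*(m - 1 - i) for i in range(m)]
    det = 1
    for c in range(size):                     # Gaussian elimination mod p
        piv = next((r for r in range(c, size) if rows[r][c]), None)
        if piv is None:
            return 0
        if piv != c:
            rows[c], rows[piv] = rows[piv], rows[c]
            det = -det
        det = det * rows[c][c] % p
        inv = pow(rows[c][c], -1, p)
        for r in range(c + 1, size):
            f = rows[r][c] * inv % p
            rows[r] = [(x - f * y) % p for x, y in zip(rows[r], rows[c])]
    return det % p

def bivariate_resultant(a, b, p=P):
    """a, b: lists of Z_p[x] coefficient lists, one per power of y."""
    dax = max(len(c) - 1 for c in a)
    dbx = max(len(c) - 1 for c in b)
    N = (len(b) - 1) * dax + (len(a) - 1) * dbx + 1   # the bound from the text
    images = []
    for t in range(N):                                # evaluation at x = t
        av = [sum(ci * t**i for i, ci in enumerate(c)) % p for c in a]
        bv = [sum(ci * t**i for i, ci in enumerate(c)) % p for c in b]
        images.append((t, resultant(av, bv, p)))
    res = [0] * N                                     # Lagrange interpolation
    for i, (xi, yi) in enumerate(images):
        num, den = [1], 1
        for j, (xj, _) in enumerate(images):
            if j != i:
                num = polymul(num, [(-xj) % p, 1], p)
                den = den * (xi - xj) % p
        f = yi * pow(den, -1, p) % p
        for k, c in enumerate(num):
            res[k] = (res[k] + f * c) % p
    return res
```

With \(a = y^2 + x\) and \(b = y - x\), encoded as [[0,1],[0],[1]] and [[0,100],[1]] modulo 101, four evaluation points suffice and the routine recovers \(\text{res}_y(a, b) = x^2 + x\).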

For bivariate polynomials with integer coefficients, we can use the CRT algorithm in a similar manner to that which has already been reviewed for univariate polynomials over \(\mathbb {Z}\). Figure 1 demonstrates this procedure for two polynomials \(a, b \in \mathbb {Z}[x, y]\). In this commutative diagram, \(\bar{a}, \bar{b}\) represent the modular images of the polynomials ab modulo prime \(p_i\) for \(0 \le i \le e\).

Fig. 1.

Computing the subresultant chain of \(a, b \in \mathbb {Z}[x, y]\) using modular arithmetic, evaluation-interpolation and CRT algorithms, where \(( t_0, \ldots , t_N )\) is the list of evaluation points, \(( p_0, \ldots , p_i )\) is the list of distinct primes, \(\bar{a} = a \text{ mod } p_i\), and \(\bar{b} = b \text{ mod } p_i\)

In practice, as the number of variables increases, the use of dense evaluation-interpolation schemes becomes less effective, since degree bound estimates become less sharp. In fact, sparse evaluation-interpolation schemes become more attractive [23, 29], and we will consider them in future work.

4 Optimized Ducos’ Subresultant Chain

In [10], Ducos proposes two optimizations for Algorithm 1. The first one, attributed to Lazard, deals with the potentially expensive exponentiations and division at Line 11 of Algorithm 1. The second optimization considers the potentially expensive exact division (of a pseudo-remainder by an element from the coefficient ring) at Line 15 of this algorithm. Applying both improvements to Algorithm 1 yields an efficient subresultant chain procedure that is known as Ducos’ algorithm.

[Algorithm 5]

The Ducos optimization that is presented in Algorithm 5, and borrowed from [10], is a well-known improvement of Algorithm 1 to compute the subresultant \(S_{e-1}\) (Line 15). This optimization provides a faster procedure to compute the pseudo-division of two successive subresultants, namely \(S_d, S_{d-1} \in \mathbb {B}[y]\), and a division by a power of \(\mathrm{lc}(S_d)\). The main part of this algorithm is a pair of for-loops computing:

$$\begin{aligned} D :=\frac{\sum \limits _{j = 0}^{d-1} \text{ coeff }(S_{d} ,j) H_j}{\mathrm{lc}(S_{d})}, \end{aligned}$$

where \(\text{ coeff }(S_{d}, j)\) is the coefficient of \(S_{d}\) in \(y^j\).

We now introduce a new optimization for this algorithm to make better use of memory resources through in-place arithmetic. This is shown in Algorithm 6. In this algorithm we use a procedure named InplaceTail to compute the tail (the reductum of a polynomial with respect to its main variable) of a polynomial, and its leading coefficient, in-place. This operation is essentially a coefficient shift. In this way, we reuse existing memory allocations for the tails of polynomials \(S_d, S_{d-1}\), and \(S_e\).

[Algorithm 6]

Furthermore, we reduce the cost of calculating \(\sum _{j = e}^{d-1} \text{ coeff }(S_{d} ,j) H_j\) by computing the summation iteratively and in-place in the same for-loop that is used to update the polynomial h (lines 6–10 in Algorithm 6). This greatly improves data locality. We also update the value of h depending on its degree with respect to y, since \(\deg (h) \le e-1\) holds for all \(e+1 \le i < d\). We utilize an optimized exact division algorithm, denoted by ExactQuotient, to compute quotients rather than a classical Euclidean algorithm.

5 Implementation and Experimentation

In this section, we discuss the implementation and performance of our various subresultant algorithms and their underlying core routines. Our methods are implemented as part of the Basic Polynomial Algebra Subprograms (BPAS) library [2] and we compare their performance against the NTL library [27] and Maple 2020 [21]. Throughout this section, our benchmarks were collected on a machine running Ubuntu 18.04.4, BPAS v1.791, GMP 6.1.2, and NTL 11.4.3, with an Intel Xeon X5650 processor running at 2.67 GHz, with 12\(\times \)4GB DDR3 memory at 1.33 GHz.

5.1 Routines over \(\mathbb {Z}_p[y]\)

We begin with foundational routines for arithmetic in finite fields and polynomials over finite fields. For basic arithmetic over a prime field \(\mathbb {Z}_{p}\) where p is an odd prime, Montgomery multiplication, originally presented in [24], is used to speed up multiplication. This method avoids division by the modulus, without any effect on the performance of addition, and so yields faster modular inverse and division algorithms.
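A word-level sketch of Montgomery reduction (REDC) in Python, with \(R = 2^{64}\): the parameter names are ours, and the arbitrary-precision arithmetic below only mimics what fixed-width machine words do in an actual implementation.

```python
R_BITS = 64
R = 1 << R_BITS

def montgomery_params(p):
    """p odd; returns p' = -p^(-1) mod R and R^2 mod p (for conversions)."""
    pprime = (-pow(p, -1, R)) % R
    return pprime, R * R % p

def redc(t, p, pprime):
    """t * R^(-1) mod p for 0 <= t < p*R, with no division by p."""
    m = (t % R) * pprime % R            # chosen so that t + m*p is divisible by R
    u = (t + m * p) >> R_BITS
    return u - p if u >= p else u

def mont_mul(a, b, p, pprime):
    return redc(a * b, p, pprime)

p = 4179340454199820289                 # the 64-bit prime used in this section
pprime, r2 = montgomery_params(p)
a, b = 123456789, 987654321987654321
am = mont_mul(a, r2, p, pprime)         # to Montgomery form: a*R mod p
bm = mont_mul(b, r2, p, pprime)
c = redc(mont_mul(am, bm, p, pprime), p, pprime)   # back to standard form
```

Staying in Montgomery form across a whole computation is what makes the scheme pay off: only the final result needs the conversion back.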

We have developed a dense representation of univariate polynomials which take advantage of Montgomery arithmetic (following the implementation in [6]) for prime fields with \(p < 2^{64}\). Throughout this section we examine the performance of each operation for two randomly generated dense polynomials \(a, b \in \mathbb {Z}_{p}[y]\) with a 64-bit prime \(p = 4179340454199820289\). Figures 2, 3, 4 and 5 examine, respectively, multiplication, division, GCD, and subresultant chain operations. These plots compare the various implementations within BPAS against NTL.

Fig. 2.

Comparing plain, Karatsuba, and FFT-based multiplication in BPAS with the wrapper mul method in NTL to compute ab for polynomials \(a, b \in \mathbb {Z}_{p}[y]\) with \(\deg (a) = \deg (b)+1 = d\)

Fig. 3.

Comparing Euclidean and fast division algorithms in BPAS with the division method in NTL to compute \(\text{ rem }(a, b)\) and \(\text{ quo }(a, b)\) for polynomials \(a, b \in \mathbb {Z}_{p}[y]\) with \(\deg (a) = 2(\deg (b)-1) = d\)

Fig. 4.

Comparing Euclidean-based GCD and Half-GCD-based GCD algorithms in BPAS with the GCD algorithm in NTL to compute \(\text {gcd}(a, b) = 1\) for polynomials \(a, b \in \mathbb {Z}_{p}[y]\) with \(\deg (a) = \deg (b)+1 = d\)

Fig. 5.

Comparing EEA, modular subresultant, and Half-GCD-based subresultant (BPAS_specSRC, \(\rho =0, 2\)), in BPAS for dense polynomials \(a, b \in \mathbb {Z}_{p}[y]\) with \(\deg (a) = \deg (b)+1 = d\)

Our multiplication over \(\mathbb {Z}_{p}[y]\) dynamically chooses the appropriate algorithm based on the input polynomials: plain or Karatsuba algorithms (following the routines in [12, Chapter 8]), or multiplication based on fast Fourier transform (FFT). The implementation of FFT itself follows that which was introduced in [7]. Figure 2 shows the performance of these routines in BPAS against a similar “wrapper” multiplication routine in NTL. From empirical data, our wrapper multiplication function calls the appropriate implementation of multiplication as follows. For polynomials a, b over \(\mathbb {Z}_{p}[y]\), with \(p < 2^{63}\), the plain algorithm is called when \(s :=\min {(\deg (a), \deg (b))} < 200\) and the Karatsuba algorithm is called when \(s \ge 200\). For 64-bit primes (\(p > 2^{63}\)), plain and Karatsuba algorithms are called when \(s < 10\) and \(s < 40\), respectively, otherwise FFT-based multiplication is performed.
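The dispatch between schoolbook and Karatsuba multiplication can be sketched as follows. The cutoff of 200 mirrors the empirical threshold above, but the splitting details are our own simplification, and the FFT branch of the BPAS wrapper is omitted.

```python
def padd(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0]*(n - len(a)), b + [0]*(n - len(b))
    return [(x + y) % p for x, y in zip(a, b)]

def psub(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0]*(n - len(a)), b + [0]*(n - len(b))
    return [(x - y) % p for x, y in zip(a, b)]

def mul(a, b, p, cutoff=200):
    if min(len(a), len(b)) <= cutoff:        # plain schoolbook below the cutoff
        out = [0] * (len(a) + len(b) - 1)
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                out[i + j] = (out[i + j] + x * y) % p
        return out
    k = min(len(a), len(b)) // 2             # Karatsuba split at y^k
    a0, a1, b0, b1 = a[:k], a[k:], b[:k], b[k:]
    z0 = mul(a0, b0, p, cutoff)
    z2 = mul(a1, b1, p, cutoff)
    z1 = psub(psub(mul(padd(a0, a1, p), padd(b0, b1, p), p, cutoff), z0, p), z2, p)
    out = [0] * (len(a) + len(b) - 1)        # recombine z0 + z1*y^k + z2*y^(2k)
    for i, c in enumerate(z0):
        out[i] = (out[i] + c) % p
    for i, c in enumerate(z1):
        out[i + k] = (out[i + k] + c) % p
    for i, c in enumerate(z2):
        out[i + 2*k] = (out[i + 2*k] + c) % p
    return out
```

Lowering the cutoff (e.g. to 1) exercises the recursive branch on small inputs, which is convenient for testing against the schoolbook product.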

The division operation is again a wrapper function, dynamically choosing between Euclidean (plain) and fast division algorithms. The fast algorithm is an optimized power series inversion procedure, first implemented in Aldor [11], using the so-called middle-product trick. Figure 3 shows the performance of these two algorithms in comparison with the NTL division over \(\mathbb {Z}_{p}[y]\). For polynomials a, b over \(\mathbb {Z}_{p}[y]\), with b the divisor, empirical data again guides the choice of appropriate implementation. Plain division is called for primes \(p < 2^{63}\) and \(\deg (b) < 1000\). However, for 64-bit primes, the plain algorithm is used when \(\deg (b) < 100\), otherwise fast division supported by FFT is used.

Our GCD operation over \(\mathbb {Z}_{p}[y]\) has two implementations: the classical extended Euclidean algorithm (EEA) and the Half-GCD (fast EEA) algorithm, following respectively the pseudo-code in [12, Chapter 11] and the implementation in the NTL library [27]. Figure 4 shows the performance of these two approaches, named BPAS_plainGCD and BPAS_fastGCD respectively, in comparison with the NTL GCD algorithm for polynomials \(a, b \in \mathbb {Z}_{p}[y]\) where \(\text{ gcd }(a, b) = 1\).
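For reference, the plain variant can be sketched as follows; this is the classical EEA that BPAS_plainGCD embodies, not the Half-GCD algorithm (which instead recurses on the high-order halves of the coefficient vectors). Helper names are illustrative.

```python
def pmul(a, b, p):
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c

def psub(a, b, p):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a)); b = b + [0] * (n - len(b))
    r = [(x - y) % p for x, y in zip(a, b)]
    while len(r) > 1 and r[-1] == 0:
        r.pop()
    return r

def pdivmod(a, b, p):
    """Plain division with remainder in Z_p[y]."""
    a = a[:]; db = len(b) - 1
    inv = pow(b[-1], p - 2, p)
    q = [0] * max(len(a) - db, 1)
    for i in range(len(a) - 1, db - 1, -1):
        c = (a[i] * inv) % p
        q[i - db] = c
        for j, bj in enumerate(b):
            a[i - db + j] = (a[i - db + j] - c * bj) % p
    r = a[:db] or [0]
    while len(r) > 1 and r[-1] == 0:
        r.pop()
    return q, r

def ext_gcd(a, b, p):
    """Return (g, s, t) with s*a + t*b = g = gcd(a, b), g monic."""
    r0, r1 = a[:], b[:]
    s0, s1 = [1], [0]
    t0, t1 = [0], [1]
    while r1 != [0]:
        q, r = pdivmod(r0, r1, p)
        r0, r1 = r1, r
        s0, s1 = s1, psub(s0, pmul(q, s1, p), p)
        t0, t1 = t1, psub(t0, pmul(q, t1, p), p)
    inv = pow(r0[-1], p - 2, p)               # normalize to a monic gcd
    mon = lambda f: [(c * inv) % p for c in f]
    return mon(r0), mon(s0), mon(t0)
```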

To analyze the performance of our subresultant schemes, we compare the naïve EEA algorithm with the modular subresultant chain and the speculative subresultant algorithm for \(\rho = 0, 2\) in Fig. 5. As this figure shows, using the Half-GCD algorithm to compute the two successive subresultants \(S_{1}, S_{0}\) (\(\rho = 0\)) is approximately 5\(\times \) faster than computing the entire chain, while computing further subresultants, e.g. \(S_{3}, S_{2}\) for \(\rho = 2\), by taking advantage of the information cached by the first call (for \(\rho = 0\)), is nearly instantaneous.

5.2 Subresultants over \(\mathbb {Z}[y]\) and \(\mathbb {Z}[x,y]\)

We have developed a dense representation of univariate and bivariate polynomials over arbitrary-precision integers, using low-level procedures of the GNU Multiple Precision Arithmetic library (GMP) [13]. Basic dense arithmetic operations, like addition, multiplication, and division, follow [12]. A dense bivariate polynomial \(a \in \mathbb {Z}[x, y]\) (or \(\mathbb {Z}_{p}[x, y]\) for a prime p) is stored as a dense array of coefficients (polynomials in \(\mathbb {Z}[x]\)), possibly including zeros.
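As a toy illustration of this representation (the layout shown, an array indexed by \(y\)-degree whose entries are dense coefficient lists in \(x\), is our reading of the description above, not the BPAS data structure itself):

```python
# Dense bivariate polynomial a = 3*x^2 + (x + 2)*y + 5*y^2 in Z[x, y],
# stored as a dense array indexed by y-degree; each entry is the dense
# coefficient list of a polynomial in Z[x], zeros included.
a = [
    [0, 0, 3],   # y^0 coefficient: 3*x^2
    [2, 1],      # y^1 coefficient: x + 2
    [5],         # y^2 coefficient: 5
]

def eval_xy(f, x0, y0):
    """Evaluate f at (x0, y0): Horner in x for each row, then Horner in y."""
    def horner(cs, t):
        v = 0
        for c in reversed(cs):
            v = v * t + c
        return v
    return horner([horner(cx, x0) for cx in f], y0)
```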

Following our previous discussion of various schemes for subresultants, we have implemented several subresultant algorithms over \(\mathbb {Z}[y]\) and \(\mathbb {Z}[x,y]\). We have four families of implementations:

(i) BPAS_modSRC, that computes the entire subresultant chain using Proposition 1 and the CRT algorithm (and evaluation-interpolation over \(\mathbb {Z}[x,y]\));

(ii) BPAS_specSRC, that refers to Algorithms 3 and 4 to compute two successive subresultants using Half-GCD and caching techniques;

(iii) BPAS_Ducos, for Ducos’ algorithm, based on Algorithm 5; and

(iv) BPAS_OptDucos, for Ducos’ algorithm, based on Algorithm 6.
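To illustrate the small-prime CRT strategy underlying BPAS_modSRC, here is a toy sketch for the resultant (the subresultant of index 0) of two integer polynomials: compute images modulo machine primes, then combine them by CRT into the symmetric range. The Euclid-based image computation assumes \(\deg a \ge \deg b\) and primes that do not divide the leading coefficients; helper names are ours, and a real implementation would also bound the resultant to choose how many primes to use.

```python
def prem(a, b, p):
    """Remainder of plain division in Z_p[y]."""
    a = a[:]; db = len(b) - 1
    inv = pow(b[-1], p - 2, p)
    for i in range(len(a) - 1, db - 1, -1):
        c = (a[i] * inv) % p
        for j, bj in enumerate(b):
            a[i - db + j] = (a[i - db + j] - c * bj) % p
    r = a[:db] or [0]
    while len(r) > 1 and r[-1] == 0:
        r.pop()
    return r

def resultant_mod(a, b, p):
    """res(a, b) in Z_p via the Euclidean recurrence
    res(a, b) = (-1)^(deg a * deg b) * lc(b)^(deg a - deg r) * res(b, r),
    assuming deg a >= deg b >= 0."""
    res = 1
    while True:
        m, n = len(a) - 1, len(b) - 1
        if n == 0:
            return (res * pow(b[0], m, p)) % p
        r = prem(a, b, p)
        if r == [0]:
            return 0
        res = (res * pow(-1, m * n, p) * pow(b[-1], m - (len(r) - 1), p)) % p
        a, b = b, r

def crt_pair(r1, m1, r2, m2):
    """Combine x = r1 (mod m1) and x = r2 (mod m2) for coprime moduli."""
    t = ((r2 - r1) * pow(m1, -1, m2)) % m2
    return r1 + m1 * t, m1 * m2

def resultant_crt(a, b, primes):
    """Resultant over Z: combine modular images, assuming the product of
    the primes exceeds twice the absolute value of the true resultant."""
    r, m = 0, 1
    for p in primes:
        rp = resultant_mod([c % p for c in a], [c % p for c in b], p)
        r, m = crt_pair(r, m, rp, p)
    return r - m if r > m // 2 else r     # map into the symmetric range
```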

Fig. 6.

Comparing (optimized) Ducos’ subresultant chain algorithm, modular subresultant chain, and speculative subresultant for \(\rho =0, 2\), algorithms in BPAS with Ducos’ subresultant chain algorithm in Maple for polynomials \(a, b \in \mathbb {Z}[y]\) with \(\deg (a) = \deg (b)+1 = d\)

Fig. 7.

Comparing (optimized) Ducos’ subresultant chain, modular subresultant chain, and speculative subresultant for \(\rho =0, 2, 4, 6\), in BPAS with Ducos’ algorithm in Maple for dense polynomials \(a, b \in \mathbb {Z}[x < y]\) with \(\deg (a, y) = \deg (b, y)+1 = 50\) and \(\deg (a, x) = \deg (b, x) + 1 = d\)

Figure 6 compares the running times of these subresultant schemes over \(\mathbb {Z}[y]\) in the BPAS library and in Maple. The modular approach is up to 5\(\times \) faster than the optimized Ducos’ algorithm. Using the speculative algorithms to compute only two successive subresultants yields a speedup factor of 7 for \(d = 2000\). Figure 7 provides a favourable comparison between the family of subresultant schemes in BPAS and the subresultant algorithm in Maple for dense bivariate polynomials \(a, b \in \mathbb {Z}[x, y]\), where the main degree is fixed to 50, i.e. \(\deg (a, y) = \deg (b, y)+1 = 50\), and \(\deg (a, x) = \deg (b, x) + 1 = d\) for \( d \in \{ 10, 20, \ldots , 100 \}\). Note that the BPAS_specSRC algorithm for \(\rho = 0, 2, 4, 6\) caches information for subsequent calls, taking advantage of Algorithm 4.

We further compare our routines with the Ducos subresultant chain algorithm in Maple, which is implemented as part of the RegularChains library [19]. Table 1 shows the memory usage for computing the entire subresultant chain of polynomials \(a, b \in \mathbb {Z}[y]\), with \(\deg (a) = \deg (b) + 1 = d\); the table covers BPAS_Ducos, BPAS_OptDucos, and Maple_Ducos. For \(d = 2000\), Table 1 shows that the optimized algorithm uses approximately 3\(\times \) less memory than our original implementation and 11\(\times \) less than Ducos’ algorithm in Maple.

Table 1. Comparing memory usage (GB) of Ducos’ subresultant chain algorithms for polynomials \(a, b \in \mathbb {Z}[y]\) with \(\deg (a) = \deg (b) + 1 = d\) in Fig. 6 over \(\mathbb {Z}[y]\)
Fig. 8.

Comparing Opt. Ducos’ algorithm (the top surface) and modular subresultant chain (the bottom surface) to compute the entire chain for polynomials \(a, b \in \mathbb {Z}[x < y]\) with \(\deg (a, y) = \deg (b, y)+1 = Y\) and \(\deg (a, x) = \deg (b, x) + 1 = X\)

Fig. 9.

Comparing modular subresultant chain with using FFT (the top surface), and speculative subresultant \((\rho =0)\) (the bottom surface) for polynomials \(a, b \in \mathbb {Z}[x < y]\) with \(\deg (a, y) = \deg (b, y)+1 = Y\) and \(\deg (a, x) = \deg (b, x) + 1 = X\)

We next compare more closely the two main ways of computing an entire subresultant chain: the direct approach following Algorithm 1, and a modular approach using evaluation-interpolation and CRT (see Fig. 1). Figure 8 shows the performance of the direct approach (the top surface), calling our memory-optimized Ducos’ algorithm BPAS_OptDucos, in comparison with the modular approach (the bottom surface), calling BPAS_modSRC. Note that, in this figure, interpolation may be based on Lagrange interpolation or FFT algorithms depending on the degrees of the input polynomials.
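For a single prime, the evaluation-interpolation pipeline of the modular approach can be sketched as follows: specialize \(x\) at sufficiently many points, compute univariate resultant images in \(y\), and rebuild \(\mathrm{res}(a, b) \in \mathbb {Z}_{p}[x]\) by Lagrange interpolation. The sketch below uses Lagrange interpolation only (the implementation switches to FFT-based interpolation for large degrees, as noted above), assumes \(\deg (a, y) \ge \deg (b, y)\), and uses illustrative helper names.

```python
def pmul(a, b, p):
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c

def prem(a, b, p):
    """Remainder of plain division in Z_p[y]."""
    a = a[:]; db = len(b) - 1
    inv = pow(b[-1], p - 2, p)
    for i in range(len(a) - 1, db - 1, -1):
        c = (a[i] * inv) % p
        for j, bj in enumerate(b):
            a[i - db + j] = (a[i - db + j] - c * bj) % p
    r = a[:db] or [0]
    while len(r) > 1 and r[-1] == 0:
        r.pop()
    return r

def resultant_mod(a, b, p):
    """Univariate resultant in Z_p[y] via the Euclidean recurrence."""
    res = 1
    while True:
        m, n = len(a) - 1, len(b) - 1
        if n == 0:
            return (res * pow(b[0], m, p)) % p
        r = prem(a, b, p)
        if r == [0]:
            return 0
        res = (res * pow(-1, m * n, p) * pow(b[-1], m - (len(r) - 1), p)) % p
        a, b = b, r

def eval_x(f, x0, p):
    """Specialize x = x0 in f, stored as y-degree rows of x-coefficients."""
    return [sum(c * pow(x0, i, p) for i, c in enumerate(cx)) % p for cx in f]

def lagrange_interp(points, p):
    """Interpolate the polynomial through (x_i, y_i) pairs over Z_p."""
    out = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        num, den = [1], 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = pmul(num, [(-xj) % p, 1], p)   # multiply by (x - x_j)
                den = (den * (xi - xj)) % p
        scale = (yi * pow(den, p - 2, p)) % p
        for k, c in enumerate(num):
            out[k] = (out[k] + scale * c) % p
    return out

def bivariate_resultant(a, b, p):
    """res_y(a, b) in Z_p[x] by evaluation at x and interpolation."""
    dax = max(len(cx) for cx in a) - 1
    dbx = max(len(cx) for cx in b) - 1
    day, dby = len(a) - 1, len(b) - 1
    npts = dax * dby + dbx * day + 1      # degree bound in x, plus one
    pts, x0 = [], 0
    while len(pts) < npts:
        ay, by = eval_x(a, x0, p), eval_x(b, x0, p)
        if ay[-1] != 0 and by[-1] != 0:   # skip unlucky evaluation points
            pts.append((x0, resultant_mod(ay, by, p)))
        x0 += 1
    return lagrange_interp(pts, p)
```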

Next, Fig. 9 highlights the benefit of our speculative approach to computing the resultant and the subresultant of index 1, compared with computing the entire chain. The FFT-based modular algorithm is presented as the top surface, while the speculative subresultant algorithm based on the Half-GCD is the bottom surface.

Lastly, we investigate the effects of different subresultant algorithms on the performance of the BPAS polynomial system solver, which is based on triangular decomposition and regular chains; see [3, 5]. Subresultants play a crucial role in computing regular GCDs (see Sect. 1) and thus in solving systems via triangular decomposition. Tables 2, 3, and 4 investigate the performance of BPAS_modSRC and BPAS_specSRC, together with the caching technique, for system solving.

Table 2 shows the running times on well-known and challenging bivariate systems, where we have forced the solver to use only one particular subresultant scheme. When caching is disabled, BPAS_specSRC does not reuse the sequence of quotients computed in previous calls. Among those systems, the caching ratios of vert_lines, L6_circles, ten_circles, and SA_2_4_eps are 24.5, 21.6, 19.8, and 9.2, respectively, while the speculative ratios of tryme, mignotte_xy, and vert_lines are 1.5, 1.2, and 1.2, respectively.

Table 2. Comparing the execution time (in seconds) of subresultant schemes on the BPAS Triangularize solver for well-known bivariate systems in the literature. We call the optimized Ducos’ subresultant chain algorithm in the OptDucos mode, the modular subresultant chain algorithms (FFT and Lagrange) in the ModSRC mode, and the Half-GCD-based subresultant algorithms in the speculative modes. We cache subresultant information for further calls in the ModSRC and cached speculative modes; \(\deg (\text {src[idx]})\) lists the minimum main degrees of the computed subresultants in each subresultant call, and Indexes lists the requested subresultant indexes.

Tables 3 and 4 examine the performance of the polynomial system solver on constructed systems which aim to exploit the maximum speed-up of these new schemes. Listings 1.1 and 1.2 in Appendix A provide the Maple code to construct these input systems. For the systems created by Listing 1.1, we get a 3\(\times \) speed-up by caching the intermediate speculative data rather than repeatedly calling the Half-GCD algorithm for each subresultant call. Using BPAS_specSRC provides a 1.5\(\times \) speed-up over the BPAS_modSRC algorithm. Another family of constructed examples, created by Listing 1.2, is evaluated in Table 4. Here, we get up to a 3\(\times \) speed-up from the cached data, and up to a 2\(\times \) speed-up over the modular method.

Table 3. Comparing the execution time (in seconds) of subresultant schemes on the BPAS Triangularize system solver for the constructed bivariate systems in Listing 1.1, which exploit the speculative scheme. Column headings follow Table 2, and FFTBlockSize is the block size used in the FFT-based evaluation and interpolation algorithms.
Table 4. Comparing the execution time (in seconds) of subresultant schemes on the BPAS Triangularize system solver for constructed bivariate systems in Listing 1.2 to exploit the speculative scheme. Column headings follow Table 3.