1 Introduction

Multiplication and inversion operations in rings and finite fields are the most frequently used arithmetic operations in many cryptographic algorithms such as the RSA algorithm [1], the Diffie–Hellman key exchange algorithm [2], the US federal Digital Signature Standard (DSS) [3], elliptic and hyper-elliptic curve cryptography [4,5,6], and pairing-based cryptography [7]. The multiplication operation dominates the execution times of these algorithms, and thus, its performance is of extreme importance. The Montgomery multiplication algorithm introduced in a seminal work by Montgomery [8] is usually considered to be the most efficient method in digital computers as it replaces prohibitively expensive division operations by bit-level or word-level shifts.

The Montgomery multiplication algorithm requires its operands in the Montgomery domain, i.e., \(a2^n \bmod {p}\) and \(b2^n\bmod {p}\), given \(a, b <p\) where p is an odd integer or prime and n is generally taken as the bit size of the modulus p. The algorithm takes \(a2^n \bmod {p}\) and \(b2^n\bmod {p}\) as inputs and computes \(c2^n \bmod {p}\), where \(c = a \cdot b \bmod {p}\). In other words, the algorithm ensures the result in the Montgomery domain provided that the operands are given in the Montgomery domain. Therefore, other arithmetic operations are expected to satisfy the same condition when Montgomery arithmetic is used to avoid forward and backward transformation from and to the Montgomery domain. For addition and subtraction, the condition is satisfied immediately, as \(a2^n \pm b2^n \equiv (a \pm b)2^n\equiv c2^n \bmod {p}\). The inversion, on the other hand, requires more effort since the classical inversion algorithm outputs \(a^{-1}2^{-n} \bmod {p}\) given \(a2^{n} \bmod {p}\), instead of \(a^{-1}2^{n} \bmod {p}\). The correct definition of the Montgomery inversion would be \(a^{-1}2^{n} \bmod {p}\), as given in [9].
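For concreteness, the domain transform and the closure property under Montgomery multiplication can be checked with a toy example. The sketch below uses Python's arbitrary-precision integers and a small illustrative prime; a real Montgomery multiplier removes the \(R^{-1}\) factor with shifts and additions rather than an explicit modular inverse.

```python
p = 23                      # toy odd prime (illustration only)
n = p.bit_length()          # n = 5, the bit size of p
R = 1 << n                  # R = 2^n = 32

def to_mont(a):
    """Map a into the Montgomery domain: a * 2^n mod p."""
    return a * R % p

def mon_mul(x, y):
    """Montgomery product x * y * R^{-1} mod p (naive reference version)."""
    return x * y * pow(R, -1, p) % p

a, b = 7, 11
c = mon_mul(to_mont(a), to_mont(b))   # operands and result stay in the domain
assert c == to_mont(a * b % p)
```

The assertion confirms that multiplying two domain values yields the domain representation of \(a\cdot b \bmod p\), so no forward/backward transforms are needed between consecutive multiplications.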

The roots of direct inversion algorithms in finite fields can be traced to the Euclidean algorithm for the computation of the greatest common divisor (GCD) reported by Euclid in his Elements [10]. Stein’s algorithm [11] for computing the GCD of two nonnegative integers replaces division operations with less complex comparison, subtraction and shift operations. The extended version of Stein’s algorithm [12], referred to as the binary extended Euclidean algorithm (bEEA), can be used to compute the inverse of an integer \(a<p\), where GCD(a, p) = 1.

Kaliski’s method [13] further improves bEEA by reducing the number of addition and subtraction operations. However, it computes \(a^{-1}2^k \bmod {p}\) given \(a<p\) (alternatively, \(a^{-1}2^{k-n} \bmod {p}\) given \(a2^{n} \bmod {p}\)), where k is the number of iterations in the algorithm and follows a random distribution in the interval [n, 2n]. Therefore, Kaliski’s method requires a correction phase to obtain the correct result. In the literature, several other algorithms have been proposed to accelerate the computation of the inversion operation in hardware or software implementations [14,15,16,17,18,19].

The inversion operation is used in elliptic curve cryptography to transform projective coordinates to affine coordinates after the point operations are carried out in projective coordinates [20]. It is also used in the DSS algorithm to compute the inverse of the ephemeral key. Finally, in elliptic curve cryptography, when precomputation is applied for scalar multiplication in the fixed-base point scenario, one can use the affine formulae for precomputed points, which requires the computation of an inversion. In some cases, these computations involve sensitive information, and simple side-channel attacks become feasible if the inversion computation is not protected. A recent work [21] modifies Kaliski’s method for inversion to obtain a constant-time algorithm as a protection against a class of side-channel attacks.

Our contributions in this paper are as follows:

  • We show that there are alternative methods to compute the Montgomery inverse of an integer modulo a prime integer including a method based on the classical extended Euclidean algorithm.

  • We propose a simple method to parallelize the inversion algorithms, which can be useful for multi-core processors if inter-core data sharing is possible with low overhead.

  • We provide two constant-time algorithms for computing Montgomery inversion efficiently.

Intentionally, we provide neither hardware nor software implementation results, such as timing benchmarks on a specific platform, as we are not proposing a new and fast algorithm for computing the multiplicative inverse. Naturally, we refrain from giving any assessment as to whether inversion can be computed sufficiently fast to avoid using projective coordinates in elliptic curve cryptography, which eliminates almost all inversion operations at the expense of an increased number of multiplications and higher memory/storage requirements. This paper can be regarded as a guidebook or a short survey that can be useful for future work on the subject. The literature provides a wide collection of works on modular inversion targeting various other aspects that are not included in this work. Hardware implementations of various designs for performing fast, efficient and scalable computation of inversion are reported in [22,23,24,25,26,27,28,29]. A recent work [30] uses the approach based on Fermat’s little theorem to compute the multiplicative inverse on constrained computing platforms such as those used in wireless sensor nodes to save code space, as the approach utilizes only modular multiplication as a building block. On the other hand, the work in [31] uses the extended Euclidean algorithm in an embedded processor. In [32], the authors discuss the ratio of inversion timing to multiplication timing, which is the key factor in choosing between affine and projective coordinates in elliptic curve cryptography. As the ratio is usually large, as attested by state-of-the-art implementations such as GMP, using projective coordinates, which trades inversion for several extra multiplication operations, is the better strategy for implementing elliptic curve cryptography. Finally, [21] proposes a constant-time inversion algorithm and reports implementation results.

The organization of the paper is as follows: in Sect. 2, we start with the classical Euclidean algorithm and show how we can modify it to compute the Montgomery inversion. Section 3 presents the definitions pertaining to Montgomery arithmetic. In Sect. 4, we describe the binary GCD algorithm in a slightly different way that is useful to follow the explanations in subsequent sections. Section 5 explains the method to compute the Montgomery inversion using bEEA, while Sect. 6 focuses on Kaliski’s inversion algorithm for the same purpose. Section 7 compares the binary EEA and Kaliski’s method from an efficiency point of view. We briefly explain the parallelization idea that can be utilized with many inversion algorithms in Sect. 8. Constant-time inversion algorithms are provided in Sect. 9. Finally, we conclude our paper in Sect. 10.

2 The classical Euclidean algorithm

The Euclidean algorithm provides a simple means for computing the greatest common divisor, denoted GCD(u, v), of two positive integers u and v without factoring them. The extended Euclidean algorithm (EEA) returns two unique integers s and r in addition to the greatest common divisor of u and v. The EEA is depicted in Algorithm 1.

[Algorithm 1: The extended Euclidean algorithm (EEA); figure not reproduced]

After the EEA successfully terminates, the following relation holds

$$\begin{aligned} ar + ps = \text{ GCD }(a, p). \end{aligned}$$

If a and p are relatively prime, \(\text{ GCD }(a, p)\) would be equal to 1, and thus, it immediately follows that

$$\begin{aligned} ar \equiv 1 \pmod p. \end{aligned}$$

Hence, the extended Euclidean algorithm computes the modular inverse \(r=a^{-1} \bmod {p}\). The classical EEA can be modified to compute the Montgomery inverse of an integer by changing the initialization in Step 3 of the algorithm: the coefficient that is initialized to 1 in the classical EEA is instead initialized to \(2^{2n} \bmod {p}\) (and the other to 0). Given the input \(a<p\), the modified algorithm mEEA computes

$$\begin{aligned} r = \text{ mEEA }(a2^n \bmod {p}, p) = a^{-1}\cdot 2^{-n}\cdot 2^{2n} \ \bmod {p}, \end{aligned}$$

which is the Montgomery inverse of \(a2^n \bmod {p}\). The classical inverse algorithm, on the other hand, heavily relies on the division operation; although division can be computed efficiently over integers of relatively small sizes and on computers with large word sizes, it is a prohibitively expensive operation in general. Therefore, the algorithms in the subsequent sections, which avoid the expensive division operation, are preferred in many circumstances.
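A minimal sketch of this modification, assuming an iterative formulation of the EEA that tracks only the coefficient of the first input (the variable names below are ours, not those of Algorithm 1):

```python
def meea(x, p, M):
    """Modified EEA sketch: the coefficient that the classical EEA
    initializes to 1 is initialized to the adjuster M instead, so the
    algorithm returns x^{-1} * M mod p. With M = 2^{2n} mod p and
    x = a*2^n mod p, the result is the Montgomery inverse a^{-1}*2^n mod p."""
    a, b = x, p
    r0, r1 = M, 0           # modified initialization (classically r0 = 1)
    while b:
        q, rem = divmod(a, b)
        a, b = b, rem
        r0, r1 = r1, r0 - q * r1
    assert a == 1           # requires gcd(x, p) = 1
    return r0 % p
```

For example, with the toy values \(p=23\) (so \(n=5\)) and \(a=7\), the adjuster is \(2^{10} \bmod 23 = 12\), and meea(7*32 % 23, 23, 12) returns \(7^{-1}\cdot 2^5 \bmod 23 = 21\).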

3 Definitions

In this section, we provide definitions pertaining to Montgomery arithmetic with a focus on the inversion operation. Throughout the paper, we always assume \(p>2\) is a prime.

Definition 1

The positive integer R with GCD(p, R) = 1 is the Montgomery constant, selected as a power of two for efficient arithmetic in digital computers. A typical selection is \(R=2^n\), where \(n = \lceil \log _2{p} \rceil \); namely, n is the bit size of p. Without loss of generality, we always use this selection in the rest of the paper.

Definition 2

Given an integer \(a \bmod {p}\), the integer \(a\cdot R \bmod {p}\) is said to be in the Montgomery domain.

Definition 3

Given two integers, \(a, b < p\), the operation \(a\cdot b \cdot R^{-1}\bmod {p}\) is the Montgomery multiplication of a and b and denoted as \(\text{ MonMul }(a, b, p)\). Given two integers in the Montgomery domain, \(aR \bmod {p}\) and \(bR \bmod {p}\), the Montgomery multiplication yields

$$\begin{aligned} \text{ MonMul }(aR \bmod {p}, bR \bmod {p}, p) = aR\cdot bR \cdot R^{-1} =abR=cR \bmod {p}, \end{aligned}$$

where \(c=a\cdot b \bmod {p}\).

Definition 4

Given the positive integer in the Montgomery domain, \(aR \bmod {p}\), the positive integer \(r=a^{-1}\cdot R \bmod {p}\) is the Montgomery inverse of a.

Definition 5

Given the positive integer in the Montgomery domain, \(a\cdot R \bmod {p}\), the integer \(r=a^{-1} \bmod {p}\) is the backward Montgomery inverse of a.

Definition 6

Given the positive integer in the regular domain, \(a \bmod {p}\), the integer \(r=a^{-1}R \bmod {p}\), which is in the Montgomery domain, is the forward Montgomery inverse of a.
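Definitions 4–6 can be checked numerically with a toy prime. In the sketch below, pow(x, -1, p) denotes a plain modular inverse and stands in for any inversion routine; the values are illustrative only.

```python
p, n = 23, 5                 # toy prime and its bit size
R = 1 << n                   # Montgomery constant 2^n
a = 7
aR = a * R % p               # a in the Montgomery domain

mont_inv     = pow(aR, -1, p) * R * R % p   # Definition 4: a^{-1} * R mod p
backward_inv = pow(aR, -1, p) * R % p       # Definition 5: a^{-1} mod p
forward_inv  = pow(a, -1, p) * R % p        # Definition 6: a^{-1} * R mod p

assert mont_inv == forward_inv              # both equal a^{-1} * R mod p
assert backward_inv == pow(a, -1, p)
```

Note that the Montgomery inverse (Definition 4) and the forward Montgomery inverse (Definition 6) are the same value; they differ only in the domain of the input they are computed from.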

4 The binary GCD algorithm

Stein’s algorithm [11] is an efficient method for computing the greatest common divisor of two positive integers; it employs only comparison, subtraction and shift operations, which are easy to implement in digital computers. Its most important aspect is that expensive division operations in the classical Euclidean GCD algorithm are replaced by divisions by 2, which can be implemented as simple logical shift operations over the binary representation of the integers, hence the name binary GCD.

A version of Stein’s algorithm that works when the numbers are odd is depicted in Algorithm 2. Steps 3, 4 and 10 of the algorithm perform logical operations; therefore, Boolean variables \(\pi _i\)s take only binary values of TRUE and FALSE. Note that the while loop starts with a logical operation, which checks whether the Boolean variable \(\pi _0\) is TRUE or FALSE. Note also that \(\bar{\pi _i}\) denotes the complement of Boolean variable \(\pi _i\). The operations in Steps 6–9 are predicated, which are executed only when the associated Boolean variable (henceforth predicate) in the bracket is TRUE. One important aspect of the notation is that all Boolean variables in Steps 6–9 are checked; it is as if there is an if statement preceding these steps as opposed to if–elseif–else statements. Note that only one of \(\pi _1\)\(\pi _4\) is TRUE at a time; thus, only one of Steps 6–9 is executed in each iteration of the while loop. We adopt this notation for two basic reasons: (i) propose a method to parallelize the inversion algorithms and (ii) explain versions of inversion algorithms that executes in constant time which protects them against a class of side-channel attacks.

[Algorithm 2: The binary GCD algorithm with predicated steps; figure not reproduced]
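Since the figure is not reproduced here, the following sketch captures the predicated structure described above; the exact loop condition and predicate numbering are our assumptions. Exactly one of \(\pi_1\)–\(\pi_4\) holds per iteration, and each predicated step is guarded by its own if statement rather than an if–elseif–else chain.

```python
def binary_gcd(u, v):
    """Predicated binary GCD sketch for positive integers, at least one of
    which is odd (e.g., v an odd modulus), so the GCD itself is odd."""
    while u != v:                                   # pi0: loop predicate
        pi1 = u % 2 == 0
        pi2 = (not pi1) and v % 2 == 0
        pi3 = (not pi1) and (not pi2) and u > v
        pi4 = (not pi1) and (not pi2) and (not pi3)
        # exactly one of pi1..pi4 is TRUE; every guard is evaluated
        if pi1: u //= 2
        if pi2: v //= 2
        if pi3: u = (u - v) // 2                    # u and v both odd here
        if pi4: v = (v - u) // 2
    return u
```

For instance, binary_gcd(21, 35) returns 7, and binary_gcd(17, 23) returns 1, using only shifts, subtractions and comparisons.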

From a complexity point of view, the expected number of iterations is experimentally found to be about 1.4n, where n is the bit size of p. In addition, we can give upper and lower bounds for the number of iterations as \(n\le k \le 2n\). Since either u or v gets smaller after each iteration, performing shift and subtraction operations over these integers can become easier, depending on the implementation platform, as the computation progresses. In particular, when multi-precision arithmetic is used in the implementation, it is advantageous to work with the exact lengths of u and v. We can assume that the average sizes of u and v are about n/2 during the computations.

As we will see in the following sections, both binary extended Euclidean and Kaliski’s algorithms employ the while loop in Algorithm 2, which determines the number of iterations needed to compute multiplicative inverse of a given integer modulo a prime number.

5 The Montgomery inversion with bEEA

In Algorithm 3, we provide a slightly modified version of the original bEEA described in [12]. The algorithm also works for an odd modulus p when GCD(a, p) = 1. While the original algorithm directly computes \(a^{-1} \bmod {p}\) given \(a<p\), the new version computes \(a^{-1}M \bmod {p}\), where M is an additional input of the algorithm besides a and p, called the adjuster. Accordingly, in Step 1 of the algorithm, the initialization of s is different (compare \(s\leftarrow M\) in Algorithm 3 with \(s\leftarrow 1\) in the original algorithm).

[Algorithm 3: The binary extended Euclidean algorithm with adjuster input M; figure not reproduced]

The modification in the initialization step does not cause any problem, as the operations applied to r and s are mostly right shifts, subtractions and, occasionally, additions with the modulus. Therefore, the integers r and s do not grow faster in the new algorithm than in the original algorithm. Consequently, we can safely state that Algorithm 3 is no less efficient than the original algorithm from the perspective of complexity. However, this modification, which is useful for the computation of the Montgomery inversion, cannot be adopted in Kaliski’s method, as can be seen in the next section.

Using Algorithm 3, we can compute all three types of Montgomery inverses. The Montgomery inverse of a number in the Montgomery domain, \(a\cdot R \bmod {p}\), is calculated as follows

$$\begin{aligned} \begin{array}{l} \text{ MontInvbEEA }(a\cdot R \bmod {p}, p, R^2 \bmod {p}) \\ \quad = a^{-1}\cdot R^{-1} \cdot R^2 = a^{-1} \cdot R \bmod {p}~. \end{array} \end{aligned}$$

As can be seen, the adjuster used in the computation is \(M=R^2 \bmod {p}\). Similarly, we can compute the forward and backward Montgomery inverses of \(a \bmod {p}\) and \(a\cdot R \bmod {p}\) using the formulae

$$\begin{aligned} \begin{array}{l} \text{ MontInvbEEA }(a \bmod {p}, p, R \bmod {p}) \\ \quad = a^{-1} \cdot R \bmod {p} , \\ \text{ MontInvbEEA }(a\cdot R \bmod {p}, p, R \bmod {p}) \\ \quad = a^{-1}\cdot R^{-1} \cdot R \bmod {p} = a^{-1} \bmod {p}, \end{array} \end{aligned}$$

respectively. In both cases, \(M=R \bmod {p}\).
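A sketch of the adjuster-based binary inversion is given below. Since Algorithm 3 itself is not reproduced here, the exact step structure is modeled on the well-known binary inversion algorithm, and the variable roles are our assumption; the adjuster initialization matches the description above.

```python
def mont_inv_beea(x, p, M):
    """Compute x^{-1} * M mod p for odd p with gcd(x, p) = 1 using only
    shifts, additions and subtractions. M is the adjuster: with
    M = R^2 mod p and x = a*R mod p the result is the Montgomery inverse
    a^{-1} * R mod p; with M = R mod p it yields the forward/backward
    Montgomery inverses discussed above."""
    u, v = p, x
    r, s = 0, M                 # modified initialization (classically s = 1)
    while u != 1 and v != 1:
        while u % 2 == 0:       # halve u; divide r by 2 modulo p
            u //= 2
            r = r // 2 if r % 2 == 0 else (r + p) // 2
        while v % 2 == 0:       # halve v; divide s by 2 modulo p
            v //= 2
            s = s // 2 if s % 2 == 0 else (s + p) // 2
        if u >= v:
            u, r = u - v, r - s
        else:
            v, s = v - u, s - r
    return (r if u == 1 else s) % p
```

With \(p=23\) and \(R=32\), the call mont_inv_beea(7*32 % 23, 23, 32*32 % 23) returns \(7^{-1}\cdot 32 \bmod 23 = 21\), matching the first formula of this section.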

6 Montgomery inversion with Kaliski’s method

Kaliski’s almost inversion algorithm [13], described in Algorithm 4, computes \(a^{-1}2^{k} \pmod {p}\) given \(a<p\). The algorithm also works for an odd modulus p provided that \(\text{ GCD }(a, p)=1\). As can be observed, the operations on u and v are identical to those in the binary GCD and inversion algorithms (see Algorithms 2 and 3), with the only difference being in the operations performed over r and s. Therefore, the expected numbers of iterations are also identical in all three algorithms.

[Algorithm 4: Kaliski’s almost inverse algorithm; figure not reproduced]

Several analyses, such as the one in [33], suggest that Kaliski’s method is more efficient than the classical EEA method based on Algorithm 1. Also, as the operations over r and s are much simpler than those in the binary inversion algorithm, the almost inverse algorithm (Kaliski’s method) is generally regarded as more efficient. However, since the almost inversion algorithm returns \(a^{-1} 2^k \bmod {p}\), where \(n\le k \le 2n\), we need a correction phase to obtain the Montgomery inversion of the input. A method for the correction phase is described in Algorithm 5, which computes the Montgomery inversion (Steps 2–10) and the backward and forward Montgomery inversions (Steps 11–19).

[Algorithm 5: The correction phase for Kaliski’s almost inverse algorithm; figure not reproduced]

Using Algorithms 4 and 5, we can compute all three types of Montgomery inverses. Assuming \(R=2^n\) and knowing \(n \le k \le 2n\), the Montgomery inversion of an integer a in the Montgomery domain, \(a\cdot 2^n \bmod {p}\), is calculated as follows

$$\begin{aligned} \begin{array}{l} \text{ AlmInv }(a2^n \bmod {p}, p) \\ \quad = a^{-1} \cdot 2^{k-n} \bmod {p} \\ \text{ CorPhase }(a^{-1} 2^{k-n} \bmod {p}, p, k, n) \\ \quad = (a^{-1} \cdot 2^{k-n}) \cdot 2^{2n-k} \bmod {p} \\ \quad = a^{-1} 2^n \bmod {p} ~. \end{array} \end{aligned}$$

The forward Montgomery inversion algorithm is computed as follows

$$\begin{aligned} \begin{array}{l} \text{ AlmInv }(a \bmod {p}, p) \\ \quad = a^{-1} \cdot 2^{k} \bmod {p} \\ \text{ CorPhase }(a^{-1} 2^{k} \bmod {p}, p, k, n) \\ \quad = (a^{-1} \cdot 2^{k})/2^{k-n} \bmod {p} \\ \quad = a^{-1} 2^n \bmod {p}. \end{array} \end{aligned}$$

Finally, the backward Montgomery inversion algorithm is computed using exactly the same steps of Algorithm 5 as in the case of the forward Montgomery inversion algorithm.

$$\begin{aligned} \begin{array}{l} \text{ AlmInv }(a 2^n \bmod {p}, p) \\ \quad = a^{-1} \cdot 2^{k-n} \bmod {p} \\ \text{ CorPhase }(a^{-1} 2^{k-n} \bmod {p}, p, k, n) \\ \quad = (a^{-1} \cdot 2^{k-n})/2^{k-n} \bmod {p} \\ \quad = a^{-1} \bmod {p} . \end{array} \end{aligned}$$
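The two phases for the Montgomery inversion case can be sketched as follows. This is a Python sketch over arbitrary-precision integers following Kaliski's published algorithm; a real implementation operates on machine words, and the correction loop shown covers only the Montgomery-inversion branch of the correction phase.

```python
def alm_inv(a, p):
    """Kaliski's almost inverse: returns (r, k) with r = a^{-1} * 2^k mod p
    and n <= k <= 2n, assuming p odd and gcd(a, p) = 1."""
    u, v, r, s, k = p, a, 0, 1, 0
    while v > 0:
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1
    if r >= p:
        r -= p
    return p - r, k

def cor_phase_mont(r, p, k, n):
    """Correction for the Montgomery inversion: multiply r = a^{-1}*2^{k-n}
    by 2^{2n-k} via repeated modular doubling."""
    for _ in range(2 * n - k):
        r <<= 1
        if r >= p:
            r -= p
    return r
```

For \(p=23\) (\(n=5\)) and the domain value \(7\cdot 2^5 \bmod 23 = 17\), alm_inv(17, 23) followed by cor_phase_mont recovers \(7^{-1}\cdot 2^5 \bmod 23 = 21\).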

Although the correction steps can also be implemented using several Montgomery multiplication operations as shown in [9], we prefer Algorithm 5, which consists of only addition, subtraction and shift operations (as in the cases of Algorithms 3 and 4). The fact that the solution in [9] features different types of operations, which can be more efficient than Algorithm 5 in certain platforms, renders a fair and generic comparison with the bEEA-based Montgomery inversion algorithm impossible.

An alternative method to compute Montgomery inverse using Kaliski’s method is to modify Step 1 of Algorithm 4 as follows:

$$\begin{aligned} u \leftarrow p, \quad v \leftarrow a, \quad r \leftarrow 0, \quad s \leftarrow M, \end{aligned}$$

where M is the adjuster, similar to that in Algorithm 3. Although this results in a functionally correct algorithm, it will not be efficient. A close inspection of the operations on r and s in Algorithm 4 reveals that they grow in bit size. Therefore, both r and s in the alternative solution will reach bit sizes of about 2n when the algorithm terminates, and performing arithmetic with double-size integers is not efficient.

7 Comparing binary and Kaliski’s algorithms

In this section, we compare the Montgomery inversion algorithms based on bEEA (Algorithm 3) and Kaliski’s method (Algorithms 4 + 5) by counting the average numbers of iterations, shifts, and additions/subtractions. We run the algorithms 10,000 times with randomly generated parameters for bit sizes \(n \in \{256,384,512\}\) and list the results in Table 1, separating the results for the three bit sizes with the symbol “/.” As all algorithms utilize the main loop of Algorithm 2, the average number of iterations is the same. The number of iterations in the correction phase of Kaliski’s method (see Algorithm 5) is not included in these figures, while the additions and shifts due to the correction phase are all taken into account in the table (see the last three rows).

Table 1 Comparison of binary and Kaliski’s inverse algorithms for \(n \in \{256, 384, 512\}\)

As can be observed from Table 1, the bEEA-based algorithm executes more additions than the other algorithms, while its average number of shifts is always smaller. When the Montgomery inversion is computed, the total number of shifts and additions in Algorithm 3 is slightly less than in the algorithm based on Kaliski’s method. One can always argue that shifts are less complex operations than additions and that Kaliski’s method is therefore more efficient. However, owing to extensive hardware support for efficient addition and shift operations in many modern microprocessors, this argument does not necessarily hold, from a clock count perspective, for multiple-precision integers on all microprocessors. As there are different microprocessor architectures with different hardware support for arithmetic operations, we do not provide implementation results for any particular microprocessor. Notwithstanding, we can argue that the bEEA-based algorithm and the algorithm based on Kaliski’s method for computing the Montgomery inverse have comparable performance.

On the other hand, for backward and forward Montgomery inverse computations, one can argue that Kaliski’s method is more efficient (compare the total number of shifts and additions) provided that a multi-precision shift operation does not take significantly more time than a multi-precision addition on a given processor platform.

One issue with Algorithm 3 is that the integers r and s can take negative values, while they are always positive in Algorithms 4 + 5. This basically means that we should use a signed integer representation for r and s, which can require one additional computer word in software implementations. On the other hand, as r can become larger than the modulus p in Algorithms 4 + 5, we may have to use one computer word in addition to the number of words used to represent p there as well. Consequently, we can safely claim that r and s taking negative values has no adverse performance implications.

8 Parallelization of inversion algorithms

All inversion algorithms presented here utilize the main loop of Algorithm 2, which operates on u and v. The inversion algorithms additionally process the secondary variables r and s, and the types of operations performed on them differ from one inversion algorithm to another. One simple parallelization idea is to perform the operations on u and v and those on r and s on two separate computing cores concurrently. The first core, processing u and v, computes the predicates \(\pi _0\)–\(\pi _4\) in iteration i of the main loop, sends them to the second core, and processes u and v for that iteration. The second core, using the predicates, performs the operations on r and s concurrently. Assuming that the predicates reach the second core at most \(\varDelta \) seconds later, we can theoretically say (under some simplifying assumptions about the idiosyncratic behavior of the computing platform) that the second core completes its computations on r and s (\(\varDelta - \tau _{uv} + \tau _{rs}\)) seconds after the first core, where \(\tau _{uv}\) and \(\tau _{rs}\) are the execution times of all operations in one iteration on u and v, and on r and s, respectively. We further assume that \(\tau _{uv}< \varDelta < k\cdot \tau _{uv}\).

The proposed parallelization method can result in a speedup (upper-bounded by 2) only on multi-core processors where data sharing among cores incurs comparatively low overhead. Our experiments, conducted on a processor architecture where we conclude this overhead is high, yield no speedup. We use a computer featuring six cores with hyper-threading support, running the 64-bit Ubuntu Linux 12.04 operating system. Each core is an Intel Xeon CPU E1650 operating at 3.50 GHz. We utilized the C++ programming language with the GMP library, which is optimized for big number arithmetic. For the parallel implementation, we used the OpenMP API, which allows shared-memory multiprocessing programming. As the overhead for sharing the predicates between two cores is presumably high, we observe no speedup due to parallelization for primes of bit lengths of interest to contemporary cryptographic usage. If the proposed parallelization method turns out to be useful on a multi-core processor, we speculate that the binary inversion algorithm would perform slightly better than Kaliski’s method, as the latter requires a correction phase that is not easily parallelizable.
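The predicate-passing scheme can be illustrated with two threads and a shared queue. This is a toy Python sketch of the data flow only (Python's GIL prevents genuine parallel speedup here); the operations on r and s follow Kaliski's almost inverse, and the function name is ours.

```python
import queue
import threading

def parallel_almost_inverse(a, p):
    """Two-'core' sketch: core 1 updates (u, v) and emits predicates;
    core 2 consumes them and updates (r, s).
    Returns (r, k) with r = a^{-1} * 2^k mod p."""
    predicates = queue.Queue()
    out = {}

    def core1():                      # operates on u and v
        u, v = p, a
        while v > 0:
            pi1 = u % 2 == 0
            pi2 = (not pi1) and v % 2 == 0
            pi3 = (not pi1) and (not pi2) and u > v
            pi4 = not (pi1 or pi2 or pi3)
            predicates.put((pi1, pi2, pi3, pi4))
            if pi1: u //= 2
            if pi2: v //= 2
            if pi3: u = (u - v) // 2
            if pi4: v = (v - u) // 2
        predicates.put(None)          # termination signal

    def core2():                      # operates on r and s
        r, s, k = 0, 1, 0
        while True:
            pred = predicates.get()
            if pred is None:
                break
            pi1, pi2, pi3, pi4 = pred
            if pi1: s *= 2
            if pi2: r *= 2
            if pi3: r, s = r + s, 2 * s
            if pi4: s, r = s + r, 2 * r
            k += 1
        if r >= p:
            r -= p
        out['r'], out['k'] = p - r, k

    t1 = threading.Thread(target=core1)
    t2 = threading.Thread(target=core2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return out['r'], out['k']
```

The queue plays the role of the inter-core channel with latency \(\varDelta \); on real hardware, the channel would be shared memory or a core-to-core message facility.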

Finally, we do not expect that the proposed parallelization method leads to any acceleration in hardware implementations as they usually feature different functional units. However, the method can simplify the control circuit and/or the design.

9 The constant-time inversion algorithms

In this section, we propose two Montgomery inversion algorithms that take constant execution time as a protection against a class of side-channel attacks. They are basically modifications of Algorithms 3 and 4 + 5. The following assumptions are necessary (yet most probably not sufficient) for constant execution time of the algorithms given in the subsequent sections. Note that more assumptions may be needed depending on the hardware platform and software tool chain employed for implementation.

Assumption 1

All predicates, \(\pi _0-\pi _8\) and \(\bar{\pi _1}\), are evaluated and checked once in every iteration of the main loop.

Assumption 2

When the integers (u, v, r, and s) are represented as multiple precision numbers, the precision is the same for all integers.

Example 1

When p is a 256-bit integer, we represent \(r=5\) as \(r = 0 0 0 0 5\) on a 64-bit processor, where each digit is a 64-bit number. Note that we use five 64-bit words to represent 256-bit integers to capture carry out bits resulting from arithmetic operations.

Assumption 3

Arithmetic and logic operations (addition, subtraction, shift and comparison) are always executed on the full precision of integers in case they are represented as multiple precision numbers.

Example 2

Suppose p is a 256-bit integer on a 64-bit processor, and we want to perform the addition \(r+s\), where \(r,s<2^{64}\). Then we perform the following word-wise addition operations in this order

$$\begin{aligned} (C_0, S_0)&\leftarrow r + s, \\ (C_1, S_1)&\leftarrow C_0 + 0, \\ (C_2, S_2)&\leftarrow 0 + 0, \\ (C_3, S_3)&\leftarrow 0 + 0, \\ (C_4, S_4)&\leftarrow 0 + 0, \end{aligned}$$
(1)

where \(C_i\) and \(S_i\) correspond to the carry and sum parts of the addition operation result.
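The full-precision pattern of Example 2 can be sketched as a word-wise addition loop. The word size W and the word extraction below are illustrative; a real implementation operates on machine words in place, but the point is the same: short operands still incur the full chain of word additions.

```python
W = 64                      # word size in bits
MASK = (1 << W) - 1

def full_precision_add(x, y, nwords):
    """Word-wise addition executed over all nwords words, as required by
    Assumption 3: every word position performs an addition, even when both
    words are zero (dummy operations included)."""
    total = 0
    carry = 0
    for i in range(nwords):                 # always the full precision
        xi = (x >> (W * i)) & MASK          # i-th word of x (possibly 0)
        yi = (y >> (W * i)) & MASK
        t = xi + yi + carry                 # (C_i, S_i) <- x_i + y_i + carry
        carry = t >> W
        total |= (t & MASK) << (W * i)
    return total
```

With nwords = 5, the loop reproduces the five additions of Example 2 regardless of how small r and s actually are.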

Assumption 4

No register-level optimization is applied to assignment operations such as \(c \leftarrow a \text{ op } b\). The operands a and b are read from the memory, the operation is executed, and the result c is written back to the memory. The addresses of the operands are also read from the memory.

These assumptions can only be satisfied through extensive support of assembly-level programming as software tool chains such as compilers tend to perform various optimizations that violate these assumptions. A secure implementation in assembly language is challenging, but feasible and sometimes preferred as assembly-level support in cryptographic implementations is a common practice either for speed or for security. Note that the assumptions necessitate dummy operations to be performed (e.g., Example 2), which can make the implementations of the constant-time algorithms vulnerable to fault attacks.

9.1 Constant-time inversion based on bEEA

Algorithm 6 is a version of Montgomery inversion algorithm based on bEEA (Algorithm 3), which executes in constant time independent of its inputs for a given input size (\(n = \lceil \log _2{p}\rceil \)). More precisely, Algorithm 6 always iterates 2n times and every iteration features two subtractions, one addition and two shift operations. Note that the if–then–else statements in Steps 10-12 are dummy operations as both if and else branches perform exactly the same operations on the variable s. These steps correct the final result r whenever \(r>p\) or \(r<0\), but perform dummy operations otherwise. Therefore, the Boolean expression “if s is even” must always be evaluated and protected against any compiler optimization.

[Algorithm 6: The constant-time Montgomery inversion based on bEEA; figure not reproduced]

9.2 Constant-time inversion using Kaliski’s method

Algorithm 7 always finishes after exactly \(2n+1\) iterations; only one addition, one subtraction, two shift operations and one assignment operation are executed in each iteration. Assuming an assignment operation is always much less complicated than a subtraction operation, we can claim that Algorithm 7 is more efficient than Algorithm 6.

[Algorithm 7: The constant-time Montgomery inversion based on Kaliski’s method; figure not reproduced]

A similar algorithm for constant-time execution is proposed in [21], which has two disadvantages compared to Algorithm 7: (i) in every iteration, the algorithm in [21] executes one addition, one subtraction, six shifts and several predicated assignment operations, and (ii) Algorithm 7 has an integrated correction phase while the algorithm in [21] needs an extra operation for it.

Another method for computing the inverse in constant time is to use Fermat’s little theorem. For prime p, we have \(a^{p-1} \equiv 1 \bmod {p}\); writing \(a^{p-1} = a^{p-2}\cdot a \equiv 1 \bmod {p}\), we obtain \(a^{p-2} \equiv a^{-1} \bmod {p}\). Consequently, we can compute the Montgomery inverse of \(a2^n \bmod {p}\) as \(r \equiv (a2^n)^{p-2}\cdot 2^{2n} \equiv a^{-1}2^{-n}\cdot 2^{2n} \equiv a^{-1}2^{n} \bmod {p}\). Basically, one inversion operation takes approximately the same time as one exponentiation operation. An exponentiation with an exponent of general form takes about 1.5n modular multiplications on average, where n is the bit size of the exponent. As modular multiplication is of quadratic complexity, we can claim that the proposed constant-time algorithm performs better than the method based on Fermat’s theorem for sufficiently large numbers.
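A sketch of the Fermat-based approach is given below using Python's three-argument pow as the exponentiation primitive. Note that a production constant-time implementation would use its own fixed-sequence modular exponentiation (typically with Montgomery multiplication) rather than a library call, whose timing behavior is not guaranteed.

```python
def fermat_mont_inverse(aR, p, n):
    """Montgomery inverse via Fermat's little theorem:
    (a*2^n)^(p-2) * 2^(2n) = a^{-1} * 2^n (mod p), for prime p."""
    return pow(aR, p - 2, p) * pow(2, 2 * n, p) % p
```

For the running toy example \(p=23\), \(n=5\): fermat_mont_inverse(7*32 % 23, 23, 5) returns \(7^{-1}\cdot 2^5 \bmod 23 = 21\). The exponent p − 2 is fixed, which is what makes the running time independent of a.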

Blinding, which is a common technique in many cryptographic applications as a protection against side-channel attacks, can also be applied to inverse computation. In blinding, a random integer is used to blind the (sensitive/secret) inputs so that the computation is randomized and therefore does not depend on sensitive/secret values. The technique can be used as a generic countermeasure against side-channel attacks; for instance, it is used in the context of the computation of quadratic residues in a recent work [34]. In inversion, blinding requires a random number generator and two extra multiplication operations. Suppose we want to compute the inverse of \(a<p\) with respect to the modulus p. Before the inversion operation, we perform the modular multiplication \(\alpha = a\cdot \rho \bmod p\), where the blinding factor \(\rho \) is selected uniformly at random in the interval [\(1,p-1\)]. The inversion algorithm, which takes the randomized input \(\alpha \), returns \(\alpha ^{-1} = a^{-1}\cdot \rho ^{-1} \bmod p\). A second modular multiplication, \(\alpha ^{-1}\cdot \rho \bmod p = a^{-1} \bmod p\), yields the correct result. It is not straightforward to compare the blinding technique, which is effective and simple to implement, with constant-time inversion algorithms. If modular multiplication takes a relatively small fraction of the modular inversion time on a given platform, then blinding is probably the superior approach.
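The blind–invert–unblind steps above can be sketched as follows; the function name is ours, and any inversion routine (here a plain modular inverse stands in) can be passed as the inner primitive.

```python
import secrets

def blinded_inverse(a, p, inversion):
    """Blinded inversion sketch: randomize the input with a fresh factor rho,
    invert the randomized value, then unblind the result."""
    rho = secrets.randbelow(p - 1) + 1        # blinding factor in [1, p-1]
    alpha = a * rho % p                        # first multiplication: blind
    alpha_inv = inversion(alpha, p)            # alpha^{-1} = a^{-1} * rho^{-1}
    return alpha_inv * rho % p                 # second multiplication: unblind
```

For example, blinded_inverse(7, 23, lambda x, q: pow(x, -1, q)) returns \(7^{-1} \bmod 23 = 10\) for every choice of \(\rho \).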

10 Conclusions

We explored alternative algorithms for computing Montgomery inverse of an integer modulo a prime number. We showed that a Montgomery inversion algorithm based on binary extended Euclidean algorithm has comparable complexity to the method based on Kaliski’s almost inverse algorithm. We proposed a parallelization method that can be applied to all inversion algorithms intended for software implementations on multi-core processors featuring low inter-process data sharing overhead. Finally, we introduced two efficient, constant-time algorithms for computing Montgomery inverse. The computational complexity of the proposed constant-time algorithms compares favorably with a method in the literature.