
1 Introduction

The first algorithms performing public-key cryptography, such as the Rivest-Shamir-Adleman (RSA) algorithm [27], were published in the 1970s and remain widely used nowadays. However, the key lengths now required by these protocols limit their efficiency. Elliptic Curve Cryptography (ECC) provides equivalent cryptographic primitives with significant improvements in terms of speed and memory, and is now recommended by governmental organizations such as the National Institute of Standards and Technology (NIST). The main resource-consuming operation in ECC is the computation of a scalar multiplication \(\left[ k\right] P\) for a secret scalar \(k\) and a public point \(P\) on an elliptic curve.

We hence consider the implementation on smart cards of scalar multiplications on standardized elliptic curves over \(\mathbb {F}_p\). In this context, side-channel resistance, memory consumption and power consumption have to be taken into account before designing a fast implementation.

Side-Channel Analysis (SCA) is one of the main attacks used to disclose secret data hidden in low-resource devices. SCA exploits the fact that a device leaks information about the processed operations and data through physically measurable quantities: timing, power consumption, electromagnetic emanations, etc. Among all kinds of SCA, Simple Side-Channel Analysis (SSCA) [21] focuses on detecting, on a single execution trace, differences of behavior that depend on a secret value. Many proposals have been made to thwart SSCA, and the atomicity principle introduced by Chevallier-Mames et al. [3] is one of the most efficient. This countermeasure has been widely studied and improved, first by P. Longa [22] and then by C. Giraud and V. Verneuil [12].

In this paper, we revisit EC formulæ in a novel way and propose corresponding patterns to optimally benefit from the Straus-Shamir trick, a method that evaluates double scalar multiplications \(\left[ u\right] P+\left[ v\right] Q\) roughly twice as fast as two separate multiplications. Few ECC protocols explicitly require double scalar multiplications. However, as shown in [28], the trick can be adapted to process single scalar multiplications \(\left[ k\right] P\), which gives us an advantage since the best implementation known so far [12] cannot take advantage of it. When this method is not used (e.g. because of memory constraints), our formulæ still offer the best trade-off between performance and memory cost. In all cases, our implementation requires less memory than other methods. We also point out a security flaw affecting many implementations computing over Weierstrass curves with \(a=0\). Therefore, our method outperforms existing solutions in terms of security, memory and speed and is suited for low-resource devices.

The rest of the paper is organized as follows. In the next section we introduce some background on elliptic curves and detail known techniques to perform efficient scalar multiplications. Then Sect. 3 deals with the security of scalar multiplications against SSCA. In Sect. 4 we present our formulæ, which allow the use of the most efficient scalar multiplication algorithms. Finally, we conclude in Sect. 5.

2 Elliptic Curve Background

2.1 Definitions

An elliptic curve \(\mathcal {E}\) over \(\mathbb {F}_p\), for a prime \(p>3\), is defined by the short Weierstrass equation [15]:

$$\begin{aligned} \mathcal {E}:y^2= x^3+ax+b, \end{aligned}$$
(1)

where \(x,y,a,b\in \mathbb {F}_p\) and \(4a^3+27b^2\ne 0\).
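The non-singularity condition can be checked directly at initialization time. The sketch below uses small toy parameters for illustration only (not a standardized curve):

```python
# Minimal non-singularity check for a short Weierstrass curve.
# Toy parameters for illustration only, not a standardized curve.
p = 23        # small prime field
a, b = 1, 1   # coefficients of y^2 = x^3 + a*x + b

disc = (4 * a**3 + 27 * b**2) % p
assert disc != 0, "singular curve: 4a^3 + 27b^2 = 0 mod p"
print("non-singular:", disc)
```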

With the so-called chord-and-tangent law, the set of all points on the elliptic curve together with the point at infinity (denoted by \(\mathcal {O}\)) forms an abelian group \(\mathcal {E}(\mathbb {F}_p)\). Except in trivial cases, the group law requires the computation of one inverse in \(\mathbb {F}_p\), which is significantly more expensive than a multiplication. We therefore use Jacobian coordinates to represent points in order to limit the number of inversions performed in a scalar multiplication. These coordinates use the following equivalence class, for triples that are not all zero:

$$ (X:Y:Z)=\{(\lambda ^2X,\lambda ^3Y,\lambda Z):\lambda \in \mathbb {F}_p^*\}. $$

In this case the short Weierstrass equation (1) becomes:

$$\begin{aligned} \mathcal {E}:Y^2=X^3+aXZ^4+bZ^6 \end{aligned}$$
(2)

and \(\mathcal {O}=(1:1:0)\). The opposite of \((X:Y:Z)\) is the point \((X:-Y:Z)\).

The sum of two points \(P=(X_p:Y_p:Z_p)\) and \(Q=(X_q:Y_q:Z_q)\) is the point \(P+Q=(X_{p+q}:Y_{p+q}:Z_{p+q})\) such that:

$$\begin{aligned} \begin{array}{lcr} \left\{ \begin{array}{rcl} X_{p+q} &{}=&{} F^2-E^3-2AE^2\\ Y_{p+q} &{}=&{} F(AE^2-X_{p+q}) - CE^3\\ Z_{p+q} &{}=&{} Z_pZ_qE \\ \end{array} \right. &{} \text {with} &{} \begin{array}{rcl} A &{}=&{} X_pZ_q^2\\ B &{}=&{} X_q Z_p^2\\ C &{}=&{} Y_pZ_q^3 \\ D &{}=&{} Y_q Z_p^3\\ E &{}=&{} B-A\\ F &{}=&{} D-C\\ \end{array} \end{array} \end{aligned}$$
(3)

However, the addition formula is only valid under the following assumptions: \(P\ne Q\), \(P\ne \mathcal {O}\) and \(Q\ne \mathcal {O}\). Several operations in \(\mathbb {F}_p\) are required to evaluate this formula: squarings (denoted by \(S\)), multiplications (denoted by \(M\)), additions and subtractions (both denoted by \(A\)). This formula requires \(4S+12M+7A\).
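Formula (3) translates directly into code. The following sketch evaluates it over a toy curve \(y^2=x^3+x+1\) over \(\mathbb{F}_{23}\) (illustrative parameters, not a standardized curve) and checks the result against the Jacobian curve equation (2):

```python
# Jacobian point addition (Formula (3)) on a toy curve y^2 = x^3 + x + 1
# over F_23 (illustrative parameters, not a standardized curve).
p, a, b = 23, 1, 1

def jacobian_add(P, Q):
    """Sum of two points, assuming P != Q and neither is the point at infinity."""
    Xp, Yp, Zp = P
    Xq, Yq, Zq = Q
    A = Xp * Zq * Zq % p
    B = Xq * Zp * Zp % p
    C = Yp * pow(Zq, 3, p) % p
    D = Yq * pow(Zp, 3, p) % p
    E = (B - A) % p
    F = (D - C) % p
    X3 = (F * F - pow(E, 3, p) - 2 * A * E * E) % p
    Y3 = (F * (A * E * E - X3) - C * pow(E, 3, p)) % p
    Z3 = Zp * Zq * E % p
    return X3, Y3, Z3

# Add the affine points (0, 1) and (1, 7) of the toy curve (both with Z = 1) ...
X3, Y3, Z3 = jacobian_add((0, 1, 1), (1, 7, 1))
# ... and verify the Jacobian curve equation Y^2 = X^3 + aXZ^4 + bZ^6.
assert (Y3 * Y3 - pow(X3, 3, p) - a * X3 * pow(Z3, 4, p) - b * pow(Z3, 6, p)) % p == 0
```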

The double of a point \(P=(X:Y:Z)\) is the point \(\left[ 2\right] P=(X_2:Y_2:Z_2)\) such that:

$$\begin{aligned} \begin{array}{lcr} \left\{ \begin{array}{rcl} X_2 &{}=&{} A^2-2C\\ Y_2 &{}=&{} A(C-X_2) - D\\ Z_2 &{}=&{} 2YZ \\ \end{array} \right. &{} \text {with} &{} \begin{array}{rcl} A &{}=&{} 3X^2+aZ^4\\ B &{}=&{} 2Y\cdot Y\\ C &{}=&{} 2BX \\ D &{}=&{} 2B\cdot B\\ \end{array} \end{array} \end{aligned}$$
(4)

Using this formula, the evaluation of a doubling requires \(4S+6M+9A\).
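Formula (4) can be sketched in the same way, again on the toy curve \(y^2=x^3+x+1\) over \(\mathbb{F}_{23}\) (illustrative parameters only):

```python
# Jacobian doubling (Formula (4)) on the toy curve y^2 = x^3 + x + 1
# over F_23 (illustrative parameters only).
p, a, b = 23, 1, 1

def jacobian_double(P):
    X, Y, Z = P
    A = (3 * X * X + a * pow(Z, 4, p)) % p
    B = 2 * Y * Y % p
    C = 2 * B * X % p
    D = 2 * B * B % p
    X2 = (A * A - 2 * C) % p
    Y2 = (A * (C - X2) - D) % p
    Z2 = 2 * Y * Z % p
    return X2, Y2, Z2

X2, Y2, Z2 = jacobian_double((0, 1, 1))   # double the affine point (0, 1)
# Verify Y^2 = X^3 + aXZ^4 + bZ^6 for the result.
assert (Y2 * Y2 - pow(X2, 3, p) - a * X2 * pow(Z2, 4, p) - b * pow(Z2, 6, p)) % p == 0
```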

The result of \(P+P\) is naturally denoted by \(\left[ 2\right] P\) and such an operation is called a doubling, whereas an addition rather refers to the computation of \(P+Q\) with \(P\ne \pm Q\). More generally, the \(k\)-fold sum \(P+\ldots +P\) is called a scalar multiplication and is denoted by \(\left[ k\right] P\).

2.2 Efficient Scalar Multiplication Implementation

This section presents the known tools used to optimize the implementation of an elliptic curve scalar multiplication.

Double and Add. The scalar multiplication is efficiently computed with the so-called Double and Add algorithm [20]. Using the EC group law, this algorithm evaluates:

$$ \left[ k\right] P=\sum _{i=0}^{\ell -1} \left[ k_i\right] \left[ 2^i\right] P =\left[ k_0\right] P+\left[ 2\right] \left[ \sum _{i=1}^{\ell -1}k_i2^{i-1}\right] P $$

where digits \(k_i\in \mathcal {S}\) and \(\mathcal {S}\) is some set of integers containing \(0\) and \(1\), as presented in [23]. The \(\ell \) digits of \(k\) can be scanned in two ways, i.e. starting from the least significant ones (right to left) or from the most significant ones (left to right); see Algorithm 1 for the latter case.

(Algorithm 1: left-to-right Double and Add — figure not reproduced)
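The left-to-right Double and Add of Algorithm 1 can be sketched as follows; for readability the group law is written in affine coordinates on a toy curve (illustrative parameters, not a protected implementation):

```python
# Left-to-right Double and Add (Algorithm 1) on the toy curve
# y^2 = x^3 + x + 1 over F_23; O denotes the point at infinity.
p, a = 23, 1
O = None

def ec_add(P, Q):
    """Affine chord-and-tangent group law (handles doubling and inverses)."""
    if P is O: return Q
    if Q is O: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0:
        return O                                  # P = -Q
    if P == Q:
        lam = (3 * P[0] * P[0] + a) * pow(2 * P[1], p - 2, p) % p
    else:
        lam = (Q[1] - P[1]) * pow(Q[0] - P[0], p - 2, p) % p
    x3 = (lam * lam - P[0] - Q[0]) % p
    return x3, (lam * (P[0] - x3) - P[1]) % p

def double_and_add(k, P):
    R = O
    for bit in bin(k)[2:]:          # scan binary digits left to right
        R = ec_add(R, R)            # doubling at every step
        if bit == '1':
            R = ec_add(R, P)        # addition only for non-zero digits
    return R

P = (0, 1)                          # affine point of the toy curve
# Sanity check: [5]P via Double and Add equals P added five times.
naive = O
for _ in range(5):
    naive = ec_add(naive, P)
assert double_and_add(5, P) == naive
```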

Double and Add trade-off. A scalar multiplication consists of a succession of doublings and additions. Depending on the ratio between the number of evaluated additions and the number of doublings, one formula may be favored at the expense of the other in order to reduce the overall cost.

  • First, the cost of the addition formula (3) can be reduced with some assumptions. Indeed, in Algorithm 1, the points \(\left[ m\right] P\) are constant. Hence Chudnovsky [4] proposed to compute \(Z_{\left[ m\right] P}^2\) and \(Z_{\left[ m\right] P}^3\) once and for all, which saves \(1S+1M\) per point addition. Furthermore, at the cost of one inversion and a few multiplications during the precomputation phase, one can choose the representative of \(\left[ m\right] P\) with \(Z_{\left[ m\right] P}=1\), which instead saves \(1S+4M\) in (3). Besides, since there is always a doubling of \(R\) before the point addition \(R+\left[ k_i\right] P\), we propose to move the computation of \(Z_r^2\) and \(Z_r^3\) into the doubling formula, which also saves \(1S+1M\).

  • The doubling formula (4) can be sped up [5] with the help of one extra value \(W\) initialized to \(aZ^4\). In this case, \(A\) is evaluated as \(3X^2+W\) and \(W_2\) is equal to \(2DW\), which gives a global cost of \(2S+6M+10A\). However, in Algorithm 1, if the point \(R\) is represented as \((X_r:Y_r:Z_r:W_r)\) then the value \(W_r\) has to remain consistent when the addition \(R+\left[ k_i\right] P\) is performed. Therefore Formula (3) has to update this value \(W\): \(W_{r+q}=W_r (E^2Z_{\left[ k_i\right] P}^2)^2\), which costs \(1S+2M\) if \(Z_{\left[ k_i\right] P}\ne 1\) or \(1S+1M\) otherwise.

We will now see two techniques used to change the addition/doubling ratio.

Straus-Shamir Trick. The Straus-Shamir trick [8, 30] is a simple but very efficient way of evaluating double scalar multiplications \( \left[ u\right] P+\left[ v\right] Q \). The naive approach performs two separate multiplications and adds the results; using two calls to Algorithm 1, it costs on average \(2\ell \) doublings and \(\ell \) additions (for random scalars \(u\) and \(v\)). However, neither \(\left[ u\right] P\) nor \(\left[ v\right] Q\) is needed on its own. The trick then consists in building one sequence of intermediate results converging directly to the value \(\left[ u\right] P+\left[ v\right] Q\) in a single execution of a Double and Add algorithm. Algorithm 2 implements this trick, which only requires on average \(\ell \) doublings and \(0.75\ell \) additions.

This trick can also be used to improve the evaluation of \(\left[ k\right] P\) if the multiplication is decomposed as:

$$ \left[ k\right] P=\left[ k_0\right] P+\left[ k_1\right] \left( \left[ \lambda \right] P\right) . $$

Of course, the trick is of interest when \(k_0\) and \(k_1\) have size \(\ell /2\) and when \(k_0\), \(k_1\) and \(Q=\left[ \lambda \right] P\) are available at reasonable cost. As recalled in [28], one can take \(\lambda =2^{\ell /2}\) and precompute \(Q\) when the point \(P\) is reused for many scalar multiplications. One can remark that this trick – which works only with a left-to-right scan of digits – generalizes to multi-scalar multiplication.
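Algorithm 2 can be sketched as follows: one shared doubling per bit, plus at most one addition chosen among \(P\), \(Q\) and the precomputed \(P+Q\). The toy curve and affine group law are illustrative only:

```python
# Straus-Shamir trick: evaluate [u]P + [v]Q in a single Double-and-Add
# pass. Toy curve y^2 = x^3 + x + 1 over F_23; O is the point at infinity.
p, a = 23, 1
O = None

def ec_add(P, Q):
    """Affine chord-and-tangent group law."""
    if P is O: return Q
    if Q is O: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0:
        return O
    if P == Q:
        lam = (3 * P[0] * P[0] + a) * pow(2 * P[1], p - 2, p) % p
    else:
        lam = (Q[1] - P[1]) * pow(Q[0] - P[0], p - 2, p) % p
    x3 = (lam * lam - P[0] - Q[0]) % p
    return x3, (lam * (P[0] - x3) - P[1]) % p

def straus_shamir(u, v, P, Q):
    PQ = ec_add(P, Q)                       # precomputed once
    R = O
    for i in range(max(u.bit_length(), v.bit_length()) - 1, -1, -1):
        R = ec_add(R, R)                    # one shared doubling per bit
        ui, vi = (u >> i) & 1, (v >> i) & 1
        if ui and vi:
            R = ec_add(R, PQ)               # joint non-zero digits
        elif ui:
            R = ec_add(R, P)
        elif vi:
            R = ec_add(R, Q)
    return R

def naive_mul(k, P):
    R = O
    for _ in range(k):
        R = ec_add(R, P)
    return R

P, Q = (0, 1), (1, 7)                       # two points of the toy curve
assert straus_shamir(2, 3, P, Q) == ec_add(naive_mul(2, P), naive_mul(3, Q))
```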

Table 1. Several scalar recoding techniques and their average number of point additions per bit of \(k\) in the context of single scalar multiplication \(\left[ k\right] P\)

Scalar Recoding. The previous trick halves the number of doublings. Independently, the number of additions can be reduced by using windowing and signed-digit representations in order to maximize the number of null digits of the scalar. The best signed-digit representations for reducing the number of non-zero digits are the Non-Adjacent Form (NAF) [1] for one scalar, and the Joint Sparse Form (JSF) [28] for a pair of scalars. NAF and JSF have been generalized to digit sets larger than \(\{0,\pm 1\}\) [26, 29], but only \(\text {NAF}_{w+1}\) (which indicates a NAF representation with window size \(w\)) fits smart card memory constraints. Tables 1 and 2 indicate, for several cases, the definition of the set \(\mathcal {S}\), the number of involved points (which corresponds to a certain RAM cost), and the average number of point additions performed per bit (which also corresponds to the ratio between additions and doublings).
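As an illustration, NAF recoding of a scalar can be sketched in a few lines (a standard textbook routine, not the paper's side-channel protected treatment):

```python
# Non-Adjacent Form: signed-digit recoding with digits in {0, +1, -1} and
# no two consecutive non-zero digits. Textbook sketch only.
def naf(k):
    digits = []                 # least significant digit first
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)     # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

d = naf(7)                      # 7 = 8 - 1, i.e. digits [-1, 0, 0, 1]
# The recoding must evaluate back to k and be non-adjacent.
assert sum(di * 2**i for i, di in enumerate(d)) == 7
assert all(d[i] == 0 or d[i + 1] == 0 for i in range(len(d) - 1))
```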

Table 2. Several scalar recoding techniques and their average number of point additions per bit of \(k\) in the context of double scalar multiplication \(\left[ u\right] P+\left[ v\right] Q\)

3 Secure Scalar Multiplication Implementation

As shown in Formulæ (3) and (4), the evaluation of a doubling differs from the evaluation of an addition: neither the number of each type of operation nor the total number of operations is the same. These differences are easily detected via SSCA and give information on the handled scalar \(k\). Indeed, in Algorithm 1 the addition is performed only if \(k_i\) is not null, so observing two successive doublings means that the corresponding \(k_i\) was null. If the binary representation has been used, i.e. \(k_i\in \mathcal {S}=\{0,1\}\), then the knowledge of all indexes where \(k_i\) is null is enough to retrieve the whole value \(k\). Using other scalar representations reduces the amount of information leaked, but sophisticated attacks [9] only need some bits to be successful. This section deals with the countermeasures proposed to thwart such a leakage.

3.1 State-of-the-Art

Secure Implementation. Many algorithms have been published so far to resist SSCA, with different strategies. A unified formula has been proposed [2] to evaluate the sum of two points \(P,Q\) (different from \(\mathcal {O}\)) without the restriction \(P\ne Q\). The Double and Add Always algorithm [6] performs the addition whatever the value of the digit \(k_i\). Other regular algorithms such as the Montgomery Ladder [2, 11, 18] perform a doubling and an addition for each bit but without dummy operations, which is more interesting from a security point of view. Recent works [14, 16] have investigated efficient Montgomery Ladder formulæ with good results, but this approach cannot benefit from scalar recoding and hence remains costly compared with the best non-secure implementations. The atomicity principle focuses on reaching the same security level at a cost closer to that of non-secure implementations.

Atomicity Principle. The atomicity principle [3] can be seen as a refined unified formula. Instead of having one formula valid in both situations (\(P=Q\) and \(P\ne \pm Q\)), this countermeasure focuses on evaluating two different formulæ using the same operations or the same flow of operations. One can notice that such an evaluation is always possible but may induce additional dummy operations. The tour-de-force of Chevallier-Mames et al. [3] was to propose such an evaluation at an almost negligible cost, i.e. without any dummy modular multiplication. They split the two formulæ into a sequence of identical atomic patterns such that ten calls of this pattern make a doubling, while sixteen calls give an addition. Their pattern is depicted below, where the \(R_i\)'s denote intermediate values.

$$\begin{aligned} \left[ \begin{aligned} R_1&\leftarrow R_2\cdot R_3\\ R_4&\leftarrow R_5+R_6\\ R_7&\leftarrow -R_8\\ R_9&\leftarrow R_{10}+R_{11}\\ \end{aligned} \right. \end{aligned}$$
(5)

However, as shown in Table 3, such a pattern implies a lot of dummy additions (\(A\)) and negations (\(N\)) when evaluating Formulæ (3) and (4). Although the cost of an addition or a negation is small compared to that of a modular multiplication, it cannot be neglected. Another drawback of the formula lies in the squares traded for multiplications, which implies a loss of efficiency when a dedicated function is available to evaluate squares faster than multiplications. Therefore, pattern (5) was first improved by Longa [22] and then further reworked by Giraud and Verneuil (GV) [12]. In the rest of the paper, as explained in [12], we assume \(a=-3\) and the Chudnovsky optimization for Longa's implementation of Algorithm 1. The GV implementation optimizes doublings with the extra value \(W\) but uses a right-to-left scan of digits to limit the cost of point additions.

Table 3. Comparing operation cost between main atomic implementations and non-secure implementations, with \(2N=A\)

3.2 Scalar Evaluation

The elliptic curve formulæ are not the only part to secure when considering implementations protected with the atomicity principle. The security of the scalar treatment is also crucial to thwart SSCA, as the following example shows. Assume that the evaluation at round \(i\) of a doubling \(\mathcal {D}_i\) cannot be distinguished from the evaluation of a point addition \(\mathcal {A}_i\) (i.e. atomic patterns have been used), and let \(E_i\) denote the scalar treatment at round \(i\). Then observing the sequence (6) gives information on the scalar \(k\): patterns corresponding to a doubling and an addition are performed between \(E_i\) and \(E_{i-1}\), which means that \(k_i\ne 0\), while \(k_{i-1}=0\) since only a doubling is performed between \(E_{i-1}\) and \(E_{i-2}\).

$$\begin{aligned} \mathcal {D}_i, E_i, \mathcal {A}_i, \mathcal {D}_{i-1}, E_{i-1}, \mathcal {D}_{i-2}, E_{i-2}, \ldots \end{aligned}$$
(6)

Hence, if the atomic pattern must be called \(x\) (resp. \(y\)) times to perform a point addition (resp. point doubling), then a call to the scalar treatment has to be done every \(\gcd (x,y)\) patterns. Besides, this scalar treatment has to implement the atomicity principle in order to have the same behavior whatever the value of the digit \(k_i\). The overhead induced by this secure evaluation of the secret scalar is generally under-estimated. It actually constitutes the main limitation of the atomic countermeasure when the scalar multiplication has to be implemented on components that cannot process the scalar treatment in parallel with the pattern evaluation. Taking \(x=y\) reduces the cost of this treatment.
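To illustrate the scheduling constraint, take the pattern of Chevallier-Mames et al., where an addition uses \(x=16\) pattern calls and a doubling \(y=10\):

```python
# Scalar-treatment slots occur every gcd(x, y) pattern calls, where x and y
# are the pattern counts of an addition and a doubling (16 and 10 for the
# Chevallier-Mames et al. pattern).
from math import gcd

x, y = 16, 10
g = gcd(x, y)                           # treatment slot every g patterns
print(f"slots per doubling: {y // g}, per addition: {x // g}")

# With x = y (unified patterns), gcd(x, y) = x: a single treatment slot
# per point operation, whatever the common pattern count.
assert gcd(x, x) == x
```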

4 New Atomic Pattern

4.1 New Formulæ

Up to now, improvements on atomic implementations have mainly focused on optimizing the doubling formula. This strategy comes from the observation that a doubling has to be computed for each bit of the scalar, while the addition is only performed for non-zero digits. In the context of processing one scalar (e.g. Algorithm 1), the low addition/doubling ratio validates this strategy (see Table 1). However, this is not the case for double scalar multiplication.

Hence, we propose hereafter to focus on improving the point addition to better match the double scalar multiplication case. The new formulæ are detailed below to ease the verification of the proposed atomic patterns. The sum of \(P=(X:Y:Z:Z^2:Z^3)\) and \(Q=(X_q:Y_q:1)\) is the point \(P+Q=(X_3:Y_3:Z_3:Z_3^2)\) such that:

$$\begin{aligned} \left\{ \begin{aligned} X_3&= F^2-E^3-2AE^2\\ Y_3&= F(AE^2-X_3) - CE^3\\ Z_3&= ZE \\ Z_3^2&= (Z_3)^2\\ \end{aligned} \right. \quad \text {with}\quad \begin{aligned} A&= X\\ B&= X_q Z^2\\ C&= Y\\ D&= Y_q Z^3\\ E&= B-A\\ F&= D-C\\ \end{aligned} \end{aligned}$$
(7)

The subtraction \(P-Q\) is obtained by replacing \(F\) by \(\bar{F}=D+2C-C\) (i.e. \(D+C\), evaluated in this staged way to preserve the pattern's operation flow) and then \(Y_3=\bar{F}(X_3-AE^2)-CE^3\). The double of a point \(P=(X:Y:Z:Z^2)\) is the point \(\left[ 2\right] P=(X_2:Y_2:Z_2:Z_2^2:Z_2^3)\) such that:

$$\begin{aligned} \left\{ \begin{aligned} X_2&= A^2-2C\\ Y_2&= A(C-X_2) - D\\ Z_2&= 2YZ \\ Z_2^2&= (Z_2)^2\\ Z_2^3&= (Z_2)^3\\ \end{aligned} \right. \quad \text {with}\quad \begin{aligned} A&= 3(X-IZ^2)(X+IZ^2)\\ B&= 2Y\cdot Y\\ C&= 2BX \\ D&= 2B\cdot B\\ \end{aligned} \end{aligned}$$
(8)

We introduce a new constant value \(I=\sqrt{-a\cdot 3^{-1}}\) in order to reduce the cost of doublings. For a random value \(a\) (i.e. a random curve) this constant \(I\) exists with probability \(0.5\). However, for most standard curves (NIST [7], Brainpool [7], ANSSI [19]) the constant \(I\) exists and may take a particular value (e.g. \(I=1\) or \(I=0\)). The case \(I=0\) receives a special treatment in Sect. 4.3. A pattern working for all curves is given in Appendix A; it is less efficient, trading an addition for a square, but still of interest.
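The constant \(I\) can be derived once at initialization time. A minimal sketch for primes \(p \equiv 3 \pmod 4\), using a toy prime and \(a=-3 \bmod p\) (illustrative parameters; standard curves use their own \(p\) and \(a\)):

```python
# Existence test and computation of I = sqrt(-a/3) mod p.
# Toy prime p = 3 mod 4, where a square root of t is pow(t, (p+1)//4, p).
p = 23
a = p - 3                         # mimic the common choice a = -3
t = (-a * pow(3, p - 2, p)) % p   # t = -a/3 mod p
# Euler's criterion: t is a square iff t^((p-1)/2) is 0 or 1 mod p.
assert pow(t, (p - 1) // 2, p) in (0, 1), "I does not exist for this curve"
I = pow(t, (p + 1) // 4, p)       # square root, valid since p = 3 mod 4
assert (-3 * I * I) % p == a      # check: a = -3 * I^2 mod p
print("I =", I)                   # here a = -3 mod p gives I = 1
```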

In total, we have \(3S+7M+7A\) to evaluate the addition and \(2S+8M+10A\) for the doubling, which means that one square has to be performed as a modular multiplication and three dummy modular additions are required in the addition formula in order to get a unified cost. Besides, in order to save some memory when using NAF or JSF recodings, we take advantage of the dummy modular additions to also present a subtraction formula, which avoids the storage of opposite points.

From \(P=(X:Y:Z:Z^2:Z^3)\) and \(Q=(X_q:Y_q:1)\), one can compute \(P\leftarrow P+Q\), \(P\leftarrow P-Q\) and \(P\leftarrow \left[ 2\right] P\) with the following patterns, where each pattern requires \(2S+8M+10A\):

Addition

Subtraction

Doubling

\( \begin{aligned} R_1&\leftarrow X_q\cdot Z^2 \\R_1&\leftarrow R_1-X \\\star&\leftarrow \star +\star \\R_2&\leftarrow R_1\cdot R_1 \\\star&\leftarrow \star + \star \\R_3&\leftarrow X\cdot R_2 \\R_0&\leftarrow Y_q\cdot Z^3 \\\star&\leftarrow \star +\star \\Z^3&\leftarrow R_1\cdot R_2 \\R_2&\leftarrow Z\cdot R_1 \\X_3&\leftarrow R_3+R_3 \\X_3&\leftarrow Z^3+X_3 \\Z_3^2&\leftarrow (R_2)^2 \\R_0&\leftarrow R_0-Y \\R_1&\leftarrow (R_0)^2 \\X_3&\leftarrow R_1 - X_3 \\R_1&\leftarrow R_3 - X_3 \\R_3&\leftarrow R_1\cdot R_0 \\R_0&\leftarrow Y\cdot Z^3 \\Y_3&\leftarrow R_3-R_0 \\Z_3&\leftarrow R_2 \\\end{aligned} \)

\( \begin{aligned} R_1&\leftarrow X_q\cdot Z^2 \\R_1&\leftarrow R_1-X \\Z^2&\leftarrow Y+Y \\R_2&\leftarrow R_1\cdot R_1 \\\star&\leftarrow \star + \star \\R_3&\leftarrow X\cdot R_2 \\R_0&\leftarrow Y_q\cdot Z^3 \\R_0&\leftarrow Z^2+R_0 \\Z^3&\leftarrow R_1\cdot R_2 \\R_2&\leftarrow Z\cdot R_1 \\X_3&\leftarrow R_3+R_3 \\X_3&\leftarrow Z^3+X_3 \\Z_3^2&\leftarrow (R_2)^2 \\R_0&\leftarrow R_0-Y \\R_1&\leftarrow (R_0)^2 \\X_3&\leftarrow R_1 - X_3 \\R_1&\leftarrow X_3 - R_3 \\R_3&\leftarrow R_1\cdot R_0 \\R_0&\leftarrow Y\cdot Z^3 \\Y_3&\leftarrow R_3-R_0 \\Z_3&\leftarrow R_2 \\\end{aligned} \)

\( \begin{aligned} R_0&\leftarrow I\cdot Z^2\\R_1&\leftarrow X-R_0\\R_2&\leftarrow Y+Y\\Z_2^2&\leftarrow Y\cdot R_2\\Y_2&\leftarrow Z_2^2+ Z_2^2\\R_3&\leftarrow R_2\cdot Z\\R_2&\leftarrow Y_2\cdot X\\X_2&\leftarrow X+R_0\\R_0&\leftarrow R_1\cdot X_2\\R_1&\leftarrow Z_2^2\cdot Y_2\\X_2&\leftarrow R_0+R_0\\R_0&\leftarrow R_0+X_2\\X_2&\leftarrow (R_0)^2\\X_2&\leftarrow X_2-R_2\\Z_2^2&\leftarrow (R_3)^2\\X_2&\leftarrow X_2-R_2\\R_2&\leftarrow R_2-X_2\\Z_2^3&\leftarrow Z_2^2\cdot R_3\\Y_2&\leftarrow R_0\cdot R_2\\Y_2&\leftarrow Y_2-R_1\\Z_2&\leftarrow R_3\\\end{aligned} \)

In order to reduce the number of intermediate buffers \(R_i\), the output buffers (e.g. \(X_3\), \(X_2\)) have been used as intermediate buffers, which makes it possible to use only four intermediate buffers \(R_i\). Besides, particular attention has been paid to allow in-place EC operations (e.g. an overlap of buffers \(X\) and \(X_2\)). However, as discussed in [16], the patterns do not contain in-place modular multiplications (\(R_1\leftarrow R_1\cdot R_2\)) in order to limit the memory consumption.

4.2 Performances

In this section, the performances of the new pattern are compared with the best known atomic patterns for both the single scalar and the double scalar multiplication cases. In order to evaluate the cost of an implementation we use the average per-bit cost. We set aside the cost of pre-computations and post-computations since it only gives a small advantage to the right-to-left variant. The per-bit cost is obtained with the following formula:

$$\begin{aligned} \mathcal {D}+E+H\cdot (\mathcal {A}+E) \end{aligned}$$

where \(\mathcal {D}\) (resp. \(\mathcal {A}\)) stands for the cost of a doubling (resp. addition), \(E\) is the cost to treat the scalar and \(H\) is the average number of point additions per bit of the scalar.

Theoretical Study. We combine here the costs of point addition and doubling (see Table 3 and Sect. 4.1) with the addition/doubling ratios given in Tables 1 and 2. Remark: as indicated in Table 3, we assume \(2N=A\). Indeed, the negation of a value \(v\) already reduced modulo \(p\) can be performed as the subtraction \(p-v\) without the need of a modular reduction.
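The cost model is easy to instantiate. The sketch below uses the unified per-pattern cost \(2S+8M+10A\) of the new pattern together with hypothetical \(S/M\), \(A/M\) and \(E/M\) ratios (placeholders for illustration, not the measured values of Tables 7-9):

```python
# Per-bit cost D + E + H*(A + E), expressed in modular-multiplication units.
def op_cost(s, m, adds, s_over_m, a_over_m):
    """Cost of a point operation: s squares, m products, adds additions."""
    return s * s_over_m + m + adds * a_over_m

def per_bit_cost(D, A_point, E, H):
    return D + E + H * (A_point + E)

# New pattern: addition and doubling share the unified cost 2S + 8M + 10A.
unified = op_cost(2, 8, 10, s_over_m=1.0, a_over_m=0.1)    # hypothetical ratios
c = per_bit_cost(D=unified, A_point=unified, E=0.5, H=0.5) # E/M and H assumed
print(f"per-bit cost ~ {c:.2f} M")
```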

Double Scalar Multiplication. The RAM available in the smart card context prevents the ratio \(H\) from going below \(0.5\) when using Algorithm 2. As shown in Table 4, our pattern always offers the best performances when compared with Longa's pattern.

Table 4. Double scalar multiplication cost per bit

Single Scalar Multiplication. Due to smart card memory constraints, lower values of \(H\) (i.e. less than \(0.5\)) correspond to single scalar multiplication cases. Table 5 shows the limit of our method for our practical values (Table 9): the new solution still requires fewer modular additions than other propositions, but it requires more products. The ratios \(A/M\) and \(S/M\) help to select the solution with the best per-bit cost. The cost \(E\) of exponent recoding must also be carefully taken into account in the case of on-the-fly scalar recoding. Indeed, protecting this heavy scalar treatment against SSCA may be costly.

Table 5. Single scalar multiplication cost per bit

Memory Cost. The smaller the ratio of point additions per bit (i.e. \(H\)), the more buffers are required in memory. Depending on the supported ECC bit lengths, these buffers correspond to more or less RAM. The developer therefore selects a ratio with regard to the desired memory cost. The new pattern has been optimized to use only four intermediate registers (see Sect. 4.1); the same result can be obtained for the pattern of Giraud-Verneuil, and should be possible for Longa's pattern. If \(X\) denotes the number of extra points (see Tables 1 and 2), an implementation using Longa's pattern requires \(12+5X\) buffers (assuming a subtraction pattern), \(11+3X\) buffers are required using the GV pattern, and only \(11+2X\) buffers are required with the new pattern.

Table 6. Number of buffers required for the scalar multiplication

Practical Values. One usually considers the chip characteristics before selecting the best matching algorithm. In Table 7 we give measured costs of elementary operations performed in the field \(\mathbb {F}_p\) on our chip. We also give the relative cost of a secured on-the-fly NAF scalar recoding. Table 8 therefore contains a line with \(E/M=0\) since we did not implement on-the-fly JSF recoding. This choice is greatly in favor of Longa's pattern (see Table 4) and in practice \(E/M\) may not be negligible. If we look at the \(224\)-bit size (considered as secure in the midterm [13]) we see an improvement of at least \(14.6\,\%\) for the double scalar multiplication case. In the single scalar multiplication case (Table 9), the new proposition has a performance close to the Giraud-Verneuil solution. However, a fair comparison must take into account the memory cost of each algorithm, which emphasizes the interest of the new pattern. Besides, the small gain or loss of the new pattern in the single scalar multiplication case may be seen as a fair price if we consider that only one implementation is now required for both the double and the single scalar multiplication situations. Finally, concerning components with \(S<M\), our proposition benefits from fast squares, but to a lesser extent than GV since it contains relatively fewer squares (see Tables 4 and 5).

In conclusion, though the new pattern is not optimal in case of a low addition/doubling ratio – i.e. when a large amount of memory is available – and is asymptotically not optimal – i.e. when \(A/M\approx 0\) – it has proved efficient in our context.

Table 7. Characteristics of our implementation on a smart card
Table 8. Number of equivalent modular multiplications for the double scalar multiplication with \(S/M=1\)
Table 9. Number of equivalent modular multiplications for the single scalar multiplication with \(S/M=1\)

4.3 The Special Case \(a=0\)

If the curve parameter \(a\) is null then a multiplication by zero is performed in the doubling formula, which may result in a security flaw. Indeed, such a product may have a leakage distinguishable from other products. Since the addition formula does not use this parameter \(a\), it can hence be distinguished from a doubling. The case \(a=0\) only saves one multiplication in efficient doubling formulas (using an extra coordinate \(W\) or \(a=-3\)). Therefore, using Longa's pattern or the Giraud-Verneuil pattern, one cannot benefit from this case to improve the scalar multiplication since their atomic patterns contain several multiplications. An improvement is possible with the pattern of Chevallier-Mames et al., but this pattern remains costly due to the high number of dummy additions. Hence, a new formula is proposed here to thwart this attack and to benefit from the possible speed improvement.

With \(a=0\), there is no longer a modular subtraction available at the beginning of the doubling formula, therefore the atomic pattern has to be rebuilt to match the addition formula. The trick used here consists in saving in memory the opposite values of the coordinates of \(Q\): \(-X_q,-Y_q\). The point equivalence class is also used with \(\lambda =-1\) to represent \(P+Q=(X_3:-Y_3:-Z_3)\). However, no trick has been found to propose a subtraction formula.

The sum of \(P=(X:Y:Z:Z^2:Z^3)\) and \(Q=(X_q:Y_q:1)\) is the point \(P+Q=(X_3:Y_3:Z_3)\). This sum is depicted in Formula (9). The double of a point \(P=(X:Y:Z)\) is the point \(\left[ 2\right] P=(X_2:Y_2:Z_2:Z_2^2:Z_2^3)\). The evaluation is performed using Formula (8) with \(A=3X^2\). The resulting pattern only contains 9 multiplications (\(2S+7M\)) and 8 additions which represents an improvement of more than \(10\,\%\) if compared with the case \(a\ne 0\). It is depicted in Table 11 in Appendix A.

$$\begin{aligned} \left\{ \begin{aligned} X_3&= \bar{F}^2-(2A\bar{E}^2-\bar{E}^3)\\ Y_3&= \bar{F}(A\bar{E}^2-X_3) - C\bar{E}^3\\ Z_3&= Z\bar{E} \\ \end{aligned} \right. \quad \text {with}\quad \begin{aligned} A&= X\\ \bar{B}&= (-X_q) Z^2\\ C&= Y\\ \bar{D}&= (-Y_q) Z^3\\ \bar{E}&= \bar{B}+A\\ \bar{F}&= \bar{D}+C\\ \end{aligned} \end{aligned}$$
(9)
Table 10. Example costs to perform a secure scalar multiplication on a \(256\)-bit curve

5 Conclusion

In this paper a new atomic pattern has been proposed that outperforms the implementations of most scalar multiplications used in elliptic curve cryptography standards. Our pattern makes it possible to securely perform double scalar multiplications on curves over \(\mathbb {F}_p\) with the Straus-Shamir trick, which had not been done before. The new pattern also turns out to be efficient for single scalar multiplication. Table 10 gives the results obtained on our chip when using the main standard algorithms, compared with the best known implementations. The first row corresponds to the processing of \(\left[ k\right] G\) evaluated as \(\left[ k_0\right] G+\left[ k_1\right] 2^{\ell /2}G\). The double scalar multiplication can also be used in the PACE [17] protocol to evaluate a point \(\tilde{G}\). A Key Agreement requires the computation of \(\left[ k\right] P\). Longa's atomic pattern using the Straus-Shamir trick is given for information only, as its implementation requires too much memory to be used with our component. Hence the GV implementation was the fastest algorithm available to perform signatures and key generations, and this proposal improves it by 38.6 %.