Keywords

1 Introduction

1996 wasGiraud, Christophe one of the most amazing years for the Crypto community. Indeed in a few months, two revolutionary attacks called Side-Channel Analysis (SCA) [21] and Fault Injection (FI) [5] were published. These two attacks definitely affected practitioners by changing the way of implementing cryptographic algorithms and they also challenged theoreticians to design new cryptosystems meant to resist such threats. The first kind of attack takes advantage of physical interactions between the embedded device and its environment during the execution of the cryptosystem to recover information on the corresponding secret key [24]. Indeed, it was noticed that these interactions, such as the device power consumption [22] or its electromagnetic radiation [15], contain information on the operations and on the variables manipulated by the device. The second kind of attack aims to disturb the correct execution of the algorithm and uses the corresponding faulty output to obtain information on the secret key [19]. Of course, numerous countermeasures have been published since 1996 to efficiently counteract these attacks and the fields of SCA and FI are now the most active fields of research in cryptography.

As well as being the first practical public-key cryptosystem, RSA [29] has also been the most widely used for many years, especially in electronic signature schemes. It has thus been a privileged target for cryptologists to mount effective SCA and FI and to propose efficient countermeasures. Concerning FI-resistant RSA implementation, the most efficient method consists in using the public exponent to verify the signature before outputting it. Whereas such an approach has been published more than 15 years ago [6], no publication deals with the exploitation of the public exponent to improve RSA implementation efficiency and side-channel resistance. This article addresses such an open topic.

The rest of this paper is organised as follows. Section 2 briefly presents the state-of-the-art of secure CRT-RSA implementation on embedded devices. In Sect. 3, we present our new approach to implement CRT-RSA by taking advantage of the knowledge of the public exponent. After presenting a functional version of our implementation, we improve its side-channel resistance by proposing a free message blinding method. This new approach is then compared with the state-of-the-art implementation. Finally, we conclude in Sect. 4.

2 State-of-the-Art Secure CRT-RSA Implementation

In this section, we firstly describe RSA before presenting the main SCA and FI countermeasures used nowadays to obtain a secure implementation.

2.1 RSA Presentation

In the following we briefly recall how to compute the RSA signature in both standard and CRT modes.

Let \(N\) denote the public modulus being the product of two secret large prime integers \(p\) and \(q\). Let \(d\) refer to the private exponent and \(e\) refer to the public exponent satisfying \(d \cdot e \equiv 1 \mathrm{{~mod~}}\varphi (N)\), where \(\varphi \) denotes Euler’s totient function. The RSA signature of a message \(m \in \mathbb {Z}_N\) is then obtained by computing \(S= m^d \mathrm{{~mod~}}N\). To verify the signature, one computes \(S^e \mathrm{{~mod~}}N\) and checks if the corresponding result is equal to \(m\).

In embedded systems, most RSA implementations use the Chinese Remainder Theorem (CRT) which yields a speed-up factor of four [13]. By using the CRT, the signature generation is composed of two exponentiations \(S_p= m^{d_p} \mathrm{{~mod~}}p \) and \(S_q= m^{d_q} \mathrm{{~mod~}}q \), where \(d_p= d \mathrm{{~mod~}}(p-1)\) and \(d_q =d \mathrm{{~mod~}}(q-1)\). The signature is then obtained by recombining \(S_p\) and \(S_q\), which is usually done by using Garner’s formula [16]:

$$\begin{aligned} S = S_q + q \cdot (i_q \cdot (S_p-S_q)\ \mathrm{{mod}}\ p), \end{aligned}$$
(1)

where \(i_q = q^{-1}~\mathrm{{mod}}~p\).

We depict in Algorithm 1 the algorithmic of a standard CRT-RSA implementation as described above.

figure a

Although RSA cryptosystem using signature protocol PSS [27] is proved secure against theoretical cryptanalysis, it can be broken if straightforwardly implemented on embedded devices by using Side-Channel Analysis or Fault Injection. In the next sections, we present the main countermeasures which are generally implemented to counteract SCA and FI.

2.2 SCA Countermeasures

When published, SCA was divided into two groups: Simple Side-Channel Analysis (SSCA) and Differential Side-Channel Analysis (DSCA). The first kind aims at recovering information on the secret key by using the side-channel leakage of only one execution of the algorithm whereas DSCA uses several executions of the algorithm and applies statistical analysis to the corresponding measurements to exhibit information on the secret key.

In the particular case of RSA, the most common countermeasure to prevent SSCA consists in using exponentiation methods where the sequence of modular operations does not depend on the corresponding secret exponent. Example of such exponentiations are the Montgomery Ladder [20] or the Atomic exponentiation [8]. Concerning DSCA countermeasures, most techniques aim at randomizing the message and the exponents. This can be done for instance by applying additive masking to these variables [9]. In such a case, one can pick four 64-bit random values \(k_i\), \(i\in \{0,\cdots ,3\}\), and compute \(S'_p = (m+k_0\cdot p)^{d_p+k_1\cdot (p-1)}\mathrm{{~mod~}}2^{64}\cdot p\) and \(S'_q = (m+k_2\cdot q)^{d_p+k_3\cdot (q-1)}\mathrm{{~mod~}}2^{64}\cdot q\) before combining them using the CRT-recombination. The expected signature is finally obtained by performing a final reduction modulo \(N=p\cdot q\).

Instead of using additive masking to blind the message, one can apply multiplicative masking [24] which consists generally in multiplying the message with \(r^e\mathrm{{~mod~}}N\) where \(r\) is a non null random value. The blinding is then removed at the end of the computation by multiplying the final result with \(r^{-1}\mathrm{{~mod~}}N\). However, the inverse computation to obtain \(r^{-1}\mathrm{{~mod~}}N\) is costly in terms of both performance and memory consumption. Such an approach is therefore generally avoided in favor of the traditional additive masking.

When combining both SSCA and DSCA countermeasures, RSA implementations resist most kind of side-channel attacks. However, a third class of SCA called Horizontal Analysis (HA) has been published recently and could defeat such implementations by using only one execution of the algorithm [4, 10, 11, 31]. This kind of attack aims generally at distinguishing if each modular operation is a multiplication with the input message or not. To counteract such powerful attacks, one must randomize the order of the single-precision multiplications [4, 11] or randomize the blinding of each operand before each long integer multiplication [10].

2.3 FI Countermeasures

RSA has been the first cryptosystem to be analysed versus Fault Injection [6]. In the case of CRT-RSA, only one fault injected during one of the two exponentiations provides a faulty signature which allows the attacker to recover one of the two secret primes \(p\) or \(q\). For instance, if a fault is injected during the computation of \(S_p\) leading to a faulty signature \(\widetilde{S}\) then one can notice that \(\widetilde{S} \equiv S \mathrm{{~mod~}}q\) but \(\widetilde{S} \not \equiv S \mathrm{{~mod~}}p\). Therefore, the secret parameter \(q\) can be easily recovered by computing the gcd of \(S-\widetilde{S}\) and \(N\). The other private key parameters can then be straightforwardly deduced.

To protect RSA against such a threat, dozens of countermeasures have been proposed over the last decade. These methods can be divided into four different groups. The first group is based on Shamir’s method proposed in [30]. The idea is to perform the two exponentiations over GF\((p\cdot t)\) and GF\((q\cdot t)\) respectively where \(t\) is a small random value and then compare both results modulo \(t\). Amongst the numerous variants of Shamir’s method, only the improved version of Vigilant’s proposal is considered secure against fault injection [12]. The second methodology has been proposed by Giraud in which the fault detection comes from the exponentiation algorithm itself [17]. He pointed out that by using the Montgomery powering ladder [20], both values \(m^{d-1} \mathrm{{~mod~}}N\) and \(m^d \mathrm{{~mod~}}N\) are available at the end of the computation. These values can then be used to verify the integrity of the exponentiation by testing if \(m\) times the first value is equal to the second one. This method has then been extended in [7, 28]. The third group corresponds to the infective computation method which has been introduced by Yen et al. in [32]. The idea of the countermeasure consists in modifying the signature if a fault is detected such that it provides no information to the attacker. Despite several proposals, each and every infective method has been broken [3]. The fourth and last kind of countermeasure consists in verifying the signature by using the public exponent before outputting it [6].

In the rest of this paper we assume that the public exponent is small, typically less than \(2^{16}+1\), which is nearly always the case in practiceFootnote 1. Therefore the fourth approach presented above is the most efficient way to counteract fault attacks on CRT-RSA in terms of both security and performances.

2.4 Summary

To sum up Sect. 2, we depict in Algorithm 2 the skeleton of a state-of-the-art secure CRT-RSA [2, Sect. 6.1].

figure b

3 A New Approach

Whereas the public exponent has been used for more than 15 years to counteract Fault Injection, no study has been done to investigate how such a value can be used to improve RSA performance and side-channel resistance. This is unfortunate since when setting an RSA private key, the corresponding public exponent \(e\) is often known. For example in the case of EMV banking applications [14], there are only 2 different public exponents possible (\(3\) or \(2^{16}+1\)) and the correct one can be recovered from the private key by using 2 multiplications [18]. It is therefore interesting to investigate an alternative implementation of CRT-RSA taking advantage of the knowledge of the public exponent value.

3.1 Generic Description

In practice, the CRT-recombination is implemented by using Garner’s formula as presented in (1) since it is the most efficient formula published so far. However, the CRT-recombination can also be performed by using the Gauss recombination:

$$\begin{aligned} S = p\cdot i_p\cdot S_q + q\cdot i_q\cdot S_p \mathrm{{~mod~}}N \end{aligned}$$
(2)

where \(S_p = m^{d_p} \mathrm{{~mod~}}p\), \(S_q = m^{d_q} \mathrm{{~mod~}}q\), \(i_p = p^{-1} \mathrm{{~mod~}}q\) and \(i_q = q^{-1} \mathrm{{~mod~}}p\). Of course, such a method requires either to consume extra memory to add the extra private parameter \(i_p\) or to perform a costly inverse computation to obtain such a value on-the-fly. However, we explain in the following that CRT-RSA using such a recombination can be more efficient than using Garner’s method if the public exponent is known.

Our new method is based on Relation (3):

$$\begin{aligned} (m\cdot q^e)^{d_p-1}\cdot m\cdot q^{e-2} \equiv i_q \cdot S_p \mathrm{{~mod~}}p \end{aligned}$$
(3)

Proof

When expanding the first term of left part of (3), we obtain:

$$\begin{aligned} (m\cdot q^e)^{d_p-1}&\equiv m^{d_p-1}\cdot q^{e\cdot (d_p-1)}\mathrm{{~mod~}}p\end{aligned}$$
(4)
$$\begin{aligned}&\equiv m^{d_p-1}\cdot q^{e\cdot d_p-e}\mathrm{{~mod~}}p\end{aligned}$$
(5)
$$\begin{aligned}&\equiv m^{d_p-1}\cdot q^{1-e} \mathrm{{~mod~}}p \end{aligned}$$
(6)

Therefore

$$\begin{aligned} (m \cdot q^e)^{d_p-1}\cdot m\cdot q^{e-2}&\equiv m^{d_p-1} \cdot q^{1-e}\cdot m\cdot q^{e-2}\mathrm{{~mod~}}p \end{aligned}$$
(7)
$$\begin{aligned}&\equiv m^{d_p}\cdot q^{1-e+e-2} \mathrm{{~mod~}}p \end{aligned}$$
(8)

This straightforwardly leads to (3).    \(\square \)

Obviously, a similar relation is obtained modulo \(q\):

$$\begin{aligned} (m \cdot p^e)^{d_q-1}\cdot m\cdot p^{e-2} \equiv i_p\cdot S_q\mathrm{{~mod~}}q \end{aligned}$$
(9)

Finally, one may note that (2) is equivalent to the following relation:

$$\begin{aligned} S = p\cdot (i_p \cdot S_q\mathrm{{~mod~}}q)+ q\cdot (i_q \cdot S_p\mathrm{{~mod~}}p)\mathrm{{~mod~}}N \end{aligned}$$
(10)

Therefore, by combining Relations (3) and (9) with Relation (10), the signature \(S=m^d \mathrm{{~mod~}}N\) of a message \(m\) can be computed by using the following relation:

$$\begin{aligned} S = p\cdot S1_q + q\cdot S1_p \mathrm{{~mod~}}N \end{aligned}$$
(11)

where

$$\begin{aligned} S1_p&= (m \cdot q^e)^{d_p-1}\cdot m\cdot q^{e-2} \mathrm{{~mod~}}p,\\ S1_q&= (m \cdot p^e)^{d_q-1}\cdot m\cdot p^{e-2} \mathrm{{~mod~}}q. \end{aligned}$$

We depict in Algorithm 3 the algorithmic of our new method.

figure c

Comparison with the Standard Method. The main advantage of Algorithm 3 over Algorithm 1 consists in a much smaller key since it does not require the private parameter \(i_q\). This leads to a gain of log\(_2(N)/2-\)log\(_2(e)\) bits of memory to store the key. When using a 2048-bit RSA for instance, we gain 125 bytes when \(e=2^{16}+1\). Such an improvement is of uttermost importance on embedded devices where the memory space is very limited.

By comparing the complexity of the standard method depicted in Algorithm 1 and of our new proposal depicted in Algorithm 3, one can notice that the performances are very similar for public exponents which are generally used. For instance, if \(e=3\) (resp. \(e=2^{16}+1\)) then we add \(8\) (resp. \(68\)) modular operations to perform the two exponentiations. In the case of a 2048-bit RSA, this corresponds to a tiny overhead of \(0.3\,\%\) (resp. \(2.2\,\%\)) on average in terms of modular operationsFootnote 2. Moreover, one may note that the modular reduction of Step 7 of Algorithm 3 can be replaced by a conditional subtraction with \(N\) since \(p\cdot S1_q+q\cdot S1_p\) is always smaller than \(2\cdot N\).

One can also notice that the key generation of our method is slightly faster than the traditional one, cf. Algorithms 5 and 6 in Appendix A. Indeed in such a case, the costly inverse computation of \(i_q=q^{-1} \mathrm{{~mod~}}p\) is not necessary.

Last but not least, we do not need to change the key structure defined in the Java Card standard [26] to use our method. Indeed, we just need to store the public exponent \(e\) instead of the parameter \(i_q\). To do so, the methods setPQ and getPQ, which are meant to set and to output the value of \(i_q\) respectively, must be adapted to fit our approach while keeping in line with the Java Card standard functionality. The first method setPQ must compute the public key \(e\) from the private key parameters \((p,q,d_p,d_q)\) and store it in the buffer PQ. Most of the time, such a computation can be performed by using the efficient method of [18]. Regarding the method getPQ, it must output \(q^{-1}\mathrm{{~mod~}}p\) instead of outputting the content of the buffer PQ. Even if this inverse computation is costly, this is not a problem in practice since this method is almost never used.

3.2 A Free Message Blinding Method

In this section, we take advantage of the new approach previously described to provide a very efficient message blinding method to counteract Side-Channel Analysis.

We notice that by replacing \(q\) in Relation (3) by \(q'=q\cdot r\mathrm{{~mod~}}p\) where \(r\) is a random different from \(0\) modulo \(p\) and modulo \(q\), we obtain:

$$\begin{aligned} (m \cdot q'^e)^{d_p-1}\cdot m\cdot q'^{e-2} \equiv i_q\cdot S_p \cdot r^{-1} \mathrm{{~mod~}}p \end{aligned}$$
(12)

Similarly, by replacing \(p\) with \(p'=p\cdot r\mathrm{{~mod~}}q\) in Relation (9), we obtain:

$$\begin{aligned} (m \cdot p'^e)^{d_q-1}\cdot m\cdot p'^{e-2} \equiv i_p\cdot S_q\cdot r^{-1} \mathrm{{~mod~}}q \end{aligned}$$
(13)

By combining Relations (12) and (13) with Relation (11), we obtain a randomized signature \(S'\) which is equal to:

$$\begin{aligned} S'&= p\cdot S1'_q + q\cdot S1'_p \mathrm{{~mod~}}N\end{aligned}$$
(14)
$$\begin{aligned}&= r^{-1}\cdot S\mathrm{{~mod~}}N \end{aligned}$$
(15)

where

$$\begin{aligned} S1'_p&= (m \cdot q'^e)^{d_p-1}\cdot m\cdot q'^{e-2} \mathrm{{~mod~}}p,\\ S1'_q&= (m \cdot p'^e)^{d_q-1}\cdot m\cdot p'^{e-2} \mathrm{{~mod~}}q. \end{aligned}$$

The expected signature \(S\) is then obtained by multiplying \(S'\) with \(r\) modulo \(N\).

To reach a fully secure CRT-RSA, one need also to blind the exponents \(d_p\) and \(d_q\), use SSCA-HA-resistant exponentiations and to verify the signature by using the public exponent. We depict in Algorithm 4 such an implementation.

figure d

Comparison with the Standard State-of-the-Art Method. Firstly, Algorithm 4 inherits from the various advantages presented in Sect. 3.1 over the standard Algorithm 2. In particular, it does not require the private key parameter \(i_q\). Since Algorithms 2 and 4 both require the value of the public exponent \(e\), we gain in memory the size of one private key parameter, i.e. log\(_2(N)/2\) bits. Moreover, since Algorithm 2 requires the full private key and the public exponent, the latter must be computed on-the-fly in the context of Java Card environment where the format of the CRT-RSA key is standardized and an extra parameter cannot be added. In such a context, our method simply stores the public exponent \(e\) instead of the private parameter \(i_q\).

From a performance point of view, our method keeps the original size of the operands whereas the traditional additive masking used in Algorithm 2 requires to work with 64-bit longer operands and modulus. Since the performance of the crypto-processor is directly linked to the size of the variables which are used, Algorithm 4 is thus expected to be faster than Algorithm 2 since we work with smaller operand length.

Table 1 represents our analysis on a smart card providing a 32-bit modular multiplication co-processor. The difference between Algorithms 4 and 2 could be much more significant in some cases, especially with co-processors having a precision of 128 bits.

Moreover, when comparing our method versus the original multiplicative message blinding, one can notice that the costly inverse computation \(r^{-1}\mathrm{{~mod~}}N\) is done for free during the exponentiations.

Table 1. Performance improvement of Algorithm 4 compared to Algorithm 2 on a smart card providing a 32-bit modular multiplication co-processor.

Our approach is not only faster but it also provides various advantages versus Side-Channel Analysis. For instance, since we use Gauss’ method to recombine the results of the exponentiations, our method is not vulnerable to specific side-channel attacks on Garner’s formula such as the ones presented in [1, 25].

4 Conclusion

Despite the fact that the public exponent has been used for a long time to protect RSA implementation against Fault Injection, no study has been done to investigate the benefit we can obtain from a performance and side-channel point of view. In this paper, we present a novel approach to implement CRT-RSA making use of the knowledge of the public exponent. We show that we can shrink the key length and reach the same level of performance. Moreover, we also show that this new approach can be combined with multiplicative message blinding method without any overhead, leading to the most efficient message blinding scheme published so far.