
1 Introduction

The main operation in many cryptographic protocols based on elliptic curves is scalar multiplication, which is performed via repeated point addition and doubling. In early works, formulas for the group operation used different sequences of instructions for addition and doubling [22, 28]. This allowed for more optimized implementations, since doublings can be faster than general additions, but naïve implementations suffered from side-channel attacks [23]. Indeed, as all special cases have to be treated differently, it is not straightforward to come up with an implementation that is both efficient and side-channel secure.

A class of elliptic curves which avoids these problems is the family of curves proposed by Bernstein and Lange, the so-called Edwards curves [8]. Arguably, the primary reason for their popularity is their “complete” addition law, that is, a single addition law which can be used for all inputs. The benefit of having a complete addition law is obvious for both simplicity and side-channel security. Namely, having only one set of formulas that works for all inputs simplifies the task of implementers and thwarts side-channel analysis and more refined attacks, e. g. safe-error attacks [38]. After the introduction of Edwards curves, more curve models have been shown to possess complete addition laws [6, 7]. Moreover, (twisted) Edwards curves are being deployed in software, for example in the library NaCl [10]. In particular, software implementations typically rely on specific curves, e. g. on the Montgomery curves Curve25519 [5] by Bernstein or Curve448 [19] proposed by Hamburg.

Moving to a hardware scenario, using the nice properties of these specific curves is not as straightforward anymore. Hardware development is costly, and industry prefers IP cores as generic solutions for all possible clients. Moreover, backwards compatibility is a serious concern, and most current standards [12, 15, 29] regarding large prime fields contain prime order curves in short Weierstrass form. This prohibits using (twisted) Edwards, (twisted) Hessian and Montgomery curves, since the group order of these curve models is always divisible by a small cofactor and can therefore not be prime. The desire for complete addition formulas for prime order curves in short Weierstrass form was recognized, and Renes, Costello and Batina [31] proved this to be realistic. They present complete addition formulas with an efficiency loss of 34%–44% in software when compared to formulas based on Jacobian coordinates, depending on the size of the field.

As the authors mention, one can expect better performance in hardware, but they do not present results. In particular, when using Montgomery multiplication one can benefit from very efficient modular additions and subtractions (which appear frequently in their formulas), which changes the performance ratio derived in the original paper. Therefore, it is of interest to investigate the new complete formulas from a hardware point of view. In this paper we show that the hardware performance is competitive with the literature, building scalar multiplication on top of three parallel Montgomery multipliers. In more detail, we summarize our contributions as follows:

  • we present the first hardware implementation based on the work of [31], working for every prime order curve over a prime field of up to 522 bits, and obtain competitive results;

  • we present algorithms for various levels of parallelism for the new formulas to boost the performance.

Related Work. There are numerous works on curve-based hardware implementations, targeting various FPGA platforms, which makes a meaningful comparison very difficult. Güneysu and Paar [17] proposed a new speed-optimized architecture that makes intensive use of the DSP blocks in an FPGA platform. Guillermin [18] introduced a prime field ECC hardware architecture and implemented it on several Altera FPGA boards. The design is based on the Residue Number System (RNS), facilitating carry-free arithmetic and parallelism. Yao et al. [37] followed the idea of using RNS to design a high-speed ECC co-processor for pairings. Sakiyama et al. [33] proposed a superscalar coprocessor that could deal with three different curve-based cryptosystems, all over characteristic 2 fields. Varchola et al. [35] designed a processor-like architecture, with an instruction set and decoder, on top of which they implemented ECC. This approach has the benefit that a portion is written in software, which can be easily maintained and updated, while special optimized instructions are available for the elliptic curve operations. The downside of this approach is that the resource costs are higher than for a fully optimized processor. As was the case for Güneysu and Paar [17], their targets were the standardized NIST prime curves P-224 and P-256. Consequently, each of their synthesized circuits works for only one of the two primes. Pöpper et al. [30] follow the same approach as Varchola et al. [35], with some side-channel related improvements; the paper focuses on an analysis of each countermeasure and its effective cost. Roy et al. [32] followed the same path, but with more optimizations with respect to resources, and only for the curve NIST P-256. However, the number of Block RAMs necessary for their architecture is much larger than that of Pöpper et al. [30] or Varchola et al. [35]. Fan et al. [16] created an architecture for special primes and curves, namely the standardized NIST P-192. The approach was to parallelize Montgomery multiplication and the formulas for point addition and doubling on the curve. Vliegen et al. [36] attempted to reduce the resources with a small core aimed at 256-bit primes.

Organization. We start with preliminaries in Sect. 2 and briefly discuss parallelism for the complete formulas in Sect. 3. Finally, we present our hardware implementation using three Montgomery multipliers in Sect. 4.

2 Preliminaries for Elliptic Curve Cryptography

Let \(\mathbb{F}_q\) be a finite field of characteristic \(p\), i. e. \(q=p^n\) for some \(n\), and assume that \(p\) is not two or three. For well-chosen \(a,b\in \mathbb{F}_q\), an elliptic curve \(E\) over \(\mathbb{F}_q\) is defined as the set of solutions \((x,y)\) to the curve equation \(E:y^2=x^3+ax+b\), together with an additional point \(\mathcal{O}\), called the point at infinity. The \(\mathbb{F}_q\)-rational points \(E(\mathbb{F}_q)\) are all \((x,y)\in E\) such that \((x,y)\in \mathbb{F}_q^2\), together with \(\mathcal{O}\). They form a group, with \(\mathcal{O}\) as its identity element. From now on, when we write \(E\) we mean \(E(\mathbb{F}_q)\). The order of \(E\) is the order of this group. To compute the group law on \(E\) one can use the chord-and-tangent process. To implement this, however, it is necessary to use at least one inversion. Since inversions are very costly, we choose a different point representation to avoid them.
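To make the cost concrete, here is a small sketch (ours, for illustration only) of the affine chord-and-tangent law over a prime field; note the field inversion in every slope computation, as well as the special cases that a complete addition law removes:

```python
# Affine chord-and-tangent addition over F_p, p > 3 (illustrative sketch,
# not constant time). None encodes the point at infinity O.
def affine_add(P, Q, a, p):
    if P is None:
        return Q                                   # O + Q = Q
    if Q is None:
        return P                                   # P + O = P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
```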

Define an equivalence relation on \(\mathbb{F}_q^3\) by letting \((x_0,x_1,x_2)\sim (y_0,y_1,y_2)\) if and only if there exists \(\lambda \in \mathbb{F}_q^{*}\) such that \((x_0,x_1,x_2)=(\lambda y_0,\lambda y_1,\lambda y_2)\). Then the projective plane over \(\mathbb{F}_q\), denoted \(\mathbb{P}^2(\mathbb{F}_q)\), is defined as \(\mathbb{F}_q^3\setminus \{(0,0,0)\}\) modulo the equivalence relation \(\sim \). We write \((x_0:x_1:x_2)\) to emphasize that the tuple belongs to \(\mathbb{P}^2(\mathbb{F}_q)\) as opposed to \(\mathbb{F}_q^3\). Now we can define \(E(\mathbb{F}_q)\) to be the set of solutions \((X:Y:Z)\in \mathbb{P}^2(\mathbb{F}_q)\) to the homogenized curve equation \(E:Y^2Z=X^3+aXZ^2+bZ^3\). Note that we can easily map between the two representations by \((x,y)\mapsto (x:y:1)\), \(\mathcal{O}\mapsto (0:1:0)\), and \((X:Y:Z)\mapsto (X/Z,Y/Z)\) (for \(Z\ne 0\)), \((0:1:0)\mapsto \mathcal{O}\).
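For concreteness, the two maps can be written as follows (a small sketch of ours; None again encodes \(\mathcal{O}\)):

```python
def to_projective(P):
    """(x, y) -> (x : y : 1); the point at infinity maps to (0 : 1 : 0)."""
    return (0, 1, 0) if P is None else (P[0], P[1], 1)

def to_affine(P, p):
    """(X : Y : Z) -> (X/Z, Y/Z) for Z != 0; (0 : 1 : 0) maps back to O."""
    X, Y, Z = P
    if Z % p == 0:
        return None
    z_inv = pow(Z, -1, p)          # the single field inversion
    return (X * z_inv % p, Y * z_inv % p)
```

In a scalar multiplication all intermediate results stay projective, so this inversion is paid only once, at the very end.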

There are many ways to compute the group law on E, see [9]. These differ depending on the representation of the curve and the points. As mentioned in the introduction, we put emphasis on complete addition formulas for prime order elliptic curves. The work of Renes et al. [31] presents addition formulas for curves in short Weierstrass form embedded in the projective plane. They compute the sum of two points \(P=(X_1:Y_1:Z_1)\) and \(Q=(X_2:Y_2:Z_2)\) as \(P+Q=(X_3:Y_3:Z_3)\), where

$$\begin{aligned} X_3 ={}& (X_1Y_2+X_2Y_1)(Y_1Y_2 - a(X_1Z_2+X_2Z_1) - 3bZ_1Z_2)\\ &- (Y_1Z_2 + Y_2Z_1)(aX_1X_2 + 3b(X_1Z_2+X_2Z_1)-a^2Z_1Z_2),\\ Y_3 ={}& (3X_1X_2 + aZ_1Z_2)(aX_1X_2 + 3b(X_1Z_2+X_2Z_1)-a^2Z_1Z_2)\\ &+ (Y_1Y_2 + a(X_1Z_2+X_2Z_1) + 3bZ_1Z_2)(Y_1Y_2 - a(X_1Z_2+X_2Z_1) - 3bZ_1Z_2),\\ Z_3 ={}& (Y_1Z_2 + Y_2Z_1)(Y_1Y_2 + a(X_1Z_2+X_2Z_1) + 3bZ_1Z_2)\\ &+ (X_1Y_2+X_2Y_1)(3X_1X_2 + aZ_1Z_2). \end{aligned}$$
(1)
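Written out as code, Eq. (1) reads as follows (our direct transcription for reference testing; it takes the precomputed constant \(b_3 = 3b \bmod p\) and makes no claim about operation scheduling or side-channel protection):

```python
def complete_add(P, Q, a, b3, p):
    """Sum of P and Q in projective coordinates via Eq. (1); b3 = 3*b mod p."""
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    xy = (X1 * Y2 + X2 * Y1) % p             # X1*Y2 + X2*Y1
    yz = (Y1 * Z2 + Y2 * Z1) % p             # Y1*Z2 + Y2*Z1
    xz = (X1 * Z2 + X2 * Z1) % p             # X1*Z2 + X2*Z1
    xx, yy, zz = X1 * X2 % p, Y1 * Y2 % p, Z1 * Z2 % p
    t0 = (yy - a * xz - b3 * zz) % p         # Y1*Y2 - a(..) - 3b*Z1*Z2
    t1 = (yy + a * xz + b3 * zz) % p         # Y1*Y2 + a(..) + 3b*Z1*Z2
    t2 = (a * xx + b3 * xz - a * a * zz) % p
    t3 = (3 * xx + a * zz) % p
    return ((xy * t0 - yz * t2) % p,         # X3
            (t3 * t2 + t1 * t0) % p,         # Y3
            (yz * t1 + xy * t3) % p)         # Z3
```

Because the formulas are complete, the same function doubles a point (call it with \(P=Q\)) and handles \(\mathcal{O}=(0:1:0)\) correctly.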

Elliptic curve cryptography [22, 28] commonly relies on the hardness of the “Elliptic Curve Discrete Logarithm Problem (ECDLP)”: given two points \(P, Q\) on an elliptic curve, it is hard to find a scalar \(k\in \mathbb{Z}\) such that \(Q=kP\), if it exists. Therefore the main component of curve-based cryptosystems is the scalar multiplication operation \((k,P)\mapsto kP\). Since in many cases \(k\) is a secret, this operation is very sensitive to attacks. In particular, many side-channel attacks [4, 23] and countermeasures [14] have been proposed. To ensure protection against simple power analysis (SPA) attacks it is important to use regular scalar multiplication algorithms, e. g. the Montgomery ladder [20] or Double-And-Add-Always [14], which execute both an addition and a doubling operation per scalar bit.

3 Parallelism

An important way to increase the efficiency of the implementation is to use multiple Montgomery multipliers in parallel. In this section we give a brief explanation for our choice of three multipliers.

The addition formulas on which our scalar multiplication is built are shown in Algorithm 1 of [31]. We choose to ignore additions and subtractions, since we rely on a Montgomery multiplier for which the cost of a field multiplication is far higher than that of a field addition. The total (multiplicative) cost in the most general case is \(12\mathbf{M}+2\mathbf{m_a}+3\mathbf{m_{3b}}\), where \(\mathbf{m_a}\) resp. \(\mathbf{m_{3b}}\) denote multiplications by the constants \(a\) resp. \(3b\). Because our processors do not distinguish between general multiplications and multiplications by constants, we consider this cost simply as 17M. The authors of [31] introduce optimizations for mixed addition and doubling, but in our case this only saves a single multiplication (and some additions). Since this does not make up for the price we would have to pay for implementing a second algorithm, we only examine the most general case. In Table 1 we show the interdependencies of the multiplications.

Table 1. Dependencies of multiplications inside the complete addition formulas
Table 2. Efficiency approximation of the number of Montgomery multipliers against the area used.

This allows us to write down algorithms for implementations running \(n\) processors in parallel. Denote by \(\mathbf{M_n}\) resp. \(\mathbf{a_n}\) the cost of doing \(n\) multiplications resp. additions (or subtractions) in parallel. In Table 2 we present the costs for \(1\le n\le 6\), under the simple approximations \(\mathbf{M_n}=\mathbf{M}\) and \(\mathbf{a_n}=\mathbf{a}\). We note that this ignores some practical aspects; for example, a larger number of Montgomery multipliers can result in scheduling overhead, which we do not take into account. All algorithms and their respective Magma [11] verification code can be found in Appendices B and C. For our implementation we have chosen \(n=3\), i. e. three Montgomery multipliers. This number of multipliers achieves a good area-time trade-off, obtaining a significant speed-up compared to \(n=1\), while the aforementioned practical issues (e. g. scheduling) are not as complicated to deal with as for larger \(n\).
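Under these idealized assumptions, a simple lower bound on the number of sequential multiplication stages per point addition is (our back-of-the-envelope estimate; the dependencies of Table 1 may prevent reaching it):

$$\text{stages}(n) \ge \left\lceil \tfrac{17}{n} \right\rceil = 17,\, 9,\, 6,\, 5,\, 4,\, 3 \quad \text{for } n = 1,\ldots ,6.$$

The step from \(n=3\) to \(n=4\) saves only a single stage, which is one way to see why \(n=3\) offers a good trade-off.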

4 Implementation of the Formulas with Three Processors

In this section we introduce a novel hardware implementation, parallelizing the new formulas across three Montgomery processors. We make use of the Montgomery processors proposed by Massolino et al. [26] for Microsemi® IGLOO2® FPGAs, whose architecture is shown in Fig. 1. We give a short description of the processor in Sect. 4.1; for more details on its internals we refer to [26]. As a consequence of building on top of this processor, we target the same FPGA. However, it is straightforward to port the design to other FPGAs, or even ASICs, that have a Montgomery multiplier with the same interface and instructions.

Fig. 1. Montgomery addition, subtraction and multiplication processor.

The elliptic curve scalar multiplication routine is constructed on top of the Montgomery processors. As mentioned before, to protect against simple power analysis attacks, we implement a regular scalar multiplication algorithm (i. e. Double-And-Add-Always [14]). The algorithm relies on three registers \(\mathtt{R_0}\), \(\mathtt{R_1}\) and \(\mathtt{R_2}\). The register \(\mathtt{R_0}\) contains the operand which is always doubled, while \(\mathtt{R_1}\) resp. \(\mathtt{R_2}\) contain the result of the addition when the scalar bit is zero resp. one. This algorithm should be applied carefully, since it is prone to fault attacks [3]. From a very high-level point of view the architecture consists of the three Montgomery multipliers and a single BRAM block, shown in Fig. 2. We note that this BRAM block is more than large enough to store the necessary temporary variables; although Algorithm 2 tries to minimize their number, this is not necessary in our case. In the rest of this section we elaborate on the details of the implementation.
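The following sketch (ours) shows the register flow just described, reusing the complete_add transcription from Sect. 2; we assume, as the description suggests, that \(\mathtt{R_1}\) serves as the dummy accumulator and the result is read from \(\mathtt{R_2}\). The secret-dependent branch is only illustrative: in the hardware the selection amounts to uniform register addressing, and both branches perform the same operations.

```python
# Double-And-Add-Always with registers R0, R1, R2 (illustrative sketch; a
# real implementation must select R1/R2 without a secret-dependent branch).
O = (0, 1, 0)                    # identity, a valid input to complete_add

def scalar_mult(k, P, a, b3, p, bits):
    R0, R1, R2 = P, O, O         # R0 is the operand that is always doubled
    for i in range(bits):        # one addition and one doubling per bit
        if (k >> i) & 1:
            R2 = complete_add(R2, R0, a, b3, p)   # bit one: real accumulator
        else:
            R1 = complete_add(R1, R0, a, b3, p)   # bit zero: dummy addition
        R0 = complete_add(R0, R0, a, b3, p)       # doubling, same formulas
    return R2                    # R2 = k * P
```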

4.1 The Montgomery Processor

Massolino et al. [26] proposed two different Montgomery processors. Our scalar multiplication builds on top of “version 2”, which has support for two internal multipliers and two memory blocks. It can perform three operations: Montgomery multiplication, addition without reduction and subtraction without reduction. To perform Montgomery multiplication, the processor employs the FIOS algorithm proposed by Koç et al. [21]. In short, FIOS computes the partial product and the partial reduction inside the same iterative loop. This translates into a hardware architecture, see Fig. 1, with one unit for the partial product and another for the partial modular reduction. The circuit behaves like a three-stage pipeline: in the first stage operands are fed into the circuit, in the second the computation takes place, and in the third the results are stored into memory. The pipeline system is reused for the addition and subtraction operations in the multiplier, where values are added or subtracted directly. In the case of subtraction the computation also adds a multiple of the prime modulus, to avoid negative intermediate values. These operations can be done without applying reduction, because reduction will be applied later during a multiplication operation. However, there is a limit to the number of consecutive additions/subtractions without reduction, on which we elaborate in Sect. 4.4.
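To illustrate the principle (though not the two-pipeline datapath of Fig. 1), the following big-integer sketch of ours performs the word-serial loop in which every round interleaves one partial product with one partial reduction, which is the essence of FIOS:

```python
def montmul(a, b, n, n0_inv, s, W=17):
    """Word-serial Montgomery product a * b * 2^(-W*s) mod n (sketch).
    Requires a, b < n < 2^(W*s) and n0_inv = (-n)^(-1) mod 2^W."""
    mask = (1 << W) - 1
    t = 0
    for i in range(s):
        t += ((b >> (W * i)) & mask) * a      # partial product: b_i * a
        m = ((t & mask) * n0_inv) & mask      # quotient digit for this round
        t = (t + m * n) >> W                  # partial reduction: low word -> 0
    return t - n if t >= n else t             # final conditional subtraction
```

The precomputed constant n0_inv depends only on the prime, e. g. pow(-n, -1, 1 << W) in Python; for a 256-bit prime and 17-bit words one takes \(s=16\), since \(16\times 17 = 272 > 256\).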

4.2 Memory

The main RAM memory in Fig. 2 is subdivided in order to lower control logic resources and to simplify the interface. The main memory operates as a true dual port memory of 1024 words of 17 bits. We partition it into big words of 32 words (i. e. \(32\times 17 = 544\) bits) each, so that the memory consists of 32 big words of 32 words each. A big word can accommodate any temporary variable, input or output of our architecture. The one possible exception is the scalar of the scalar multiplication: although a single big word would be large enough to contain 523-bit scalars (in the largest case of a 523-bit field), the scalar blinding technique can double the size of the scalar. Therefore, we use two big words to store the scalar. By doing this, it will in the future be possible to execute scalar multiplication with a blinded scalar [13]. Lastly, there is a 17-bit shift register into which the scalar is loaded word by word.
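A sketch of the resulting addressing (our illustration; in the core this is simply hard-wired, and the names are ours):

```python
WORD_BITS = 17
WORDS_PER_BIG_WORD = 32          # 32 * 17 = 544 bits per big word
NUM_BIG_WORDS = 32               # 32 * 32 = 1024 words in total

def word_address(big_word, offset):
    """Flat address of word `offset` of big word `big_word` in the
    1024 x 17-bit dual-port RAM."""
    assert 0 <= big_word < NUM_BIG_WORDS and 0 <= offset < WORDS_PER_BIG_WORD
    return big_word * WORDS_PER_BIG_WORD + offset
```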

Fig. 2. Entire architecture with three Montgomery processors from [26], where MM = Montgomery processor, SHR = Shift register, REG = Register.

4.3 Control Logic

The formulas and the control system are implemented through two state machines: a main one, which controls everything, and a second one dedicated to memory transfers.

The memory-transfer state machine was created to reduce the number of states in the main machine. It provides the transfer operation between the main memory and the memories of the Montgomery processors. Therefore, the main machine can transfer values with just one state and can reuse most of the transfer logic. The memory-transfer machine is responsible for various parts of the bus between the main memory, the processors and various counters. However, the main state machine still has to be able to control everything; hence it shares some components with the memory-transfer machine, which increases the cost of the control circuit.

The main state machine controls all the circuits that compose the entire cryptographic core. Since it controls the entire circuit, this machine also has the complete scheduling of Table 3 implemented as states. The advantage of doing this through states is that the design and the entire control can be optimized. However, the maintenance cost is much higher than for a small instruction set or microcode that could also implement the addition formulas or the scalar multiplication. Because the addition formulas are complete, both addition and doubling can be performed through the addition formulas alone. This decreases the number of states and therefore makes the final implementation considerably more compact. Hence, the implementation simply iterates over the addition formulas until the end of the computation.

4.4 Consecutive Additions

For the Montgomery processor to work in our architecture, part of the original design had to be changed. The authors of [26] did not need to reduce after each addition or subtraction, as they assumed that these operations would always be followed by a Montgomery multiplication (and its corresponding reduction). However, they were not able to do multiple consecutive additions and subtractions, as the Montgomery division value \(r\) was chosen to be only 4 bits larger than the prime. On the other hand, it is readily seen that Algorithm 2 contains several consecutive additions and subtractions. One example: \(t_9\) is computed as a sum in line 7, added to another value and stored in \(t_{10}\) in line 8, which in line 10 is added to a fourth value. To be able to execute these without having to reduce, we need a Montgomery division value at least 5 bits larger than the prime. As a consequence, the processor only works for primes of up to 522 bits (as opposed to 523), which is still one bit more than the largest standardized prime curve [29].
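One consistent instantiation of these numbers (an assumption on our part, based on the 17-bit word size of the processor) is an operand length of 31 words, giving

$$r = 2^{17\cdot 31} = 2^{527}, \qquad r \ge 2^{5}p \;\Longrightarrow \; \lceil \log _2 p \rceil \le 527-5 = 522,$$

whereas the original 4 bits of headroom allowed primes of up to \(527-4=523\) bits.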

Table 3. Scheduling for point addition \(P\leftarrow P+Q\), where \(P=(X_1:Y_1:Z_1)\) and \(Q=(X_2:Y_2:Z_2)\). For doubling simply put \(P=Q\).

4.5 Scheduling

The architecture presented in Fig. 2 has one dual port memory, whereas it has three processors. This means that we can only load values to two processors at the same time. As a consequence the three processors do not run completely in parallel, but one of the three runs out of step with the other two. Table 3 shows how the operations are split over the different processors. They are distributed with the goal of minimizing the number of loads and stores for each processor and the idle time of MM2. The process begins by loading the necessary values into MM0 and MM1 and executing their respective operations. As soon as the operations in MM0 and MM1 are initialized, the corresponding values are loaded into MM2 and its operation is executed. As soon as MM0 and MM1 finish their operations, this process restarts. Since the operations executed in MM2 are not synchronized with those in MM0 and MM1, the operations in MM0 and MM1 must be independent of the output of MM2, and vice versa. Furthermore, since multiplications are at least ten times slower than additions for our choice of processor [26], the additions and subtractions from lines seven and eight of Algorithm 2 can be done by the otherwise idle processor MM2 in stage six. This makes them essentially free.
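A schedule in the spirit of Table 3 can be found with a simple greedy pass over the dependency table (a sketch of ours; the example dependency sets are illustrative placeholders, not the contents of Table 1, and the real schedule must additionally respect the two memory ports):

```python
def schedule(deps, n_units=3):
    """Greedy list scheduling: deps maps each multiplication to the set of
    multiplications whose results it needs. Returns rounds of at most
    n_units operations, each round fully parallelizable."""
    done, rounds = set(), []
    while len(done) < len(deps):
        ready = [m for m in deps if m not in done and deps[m] <= done]
        if not ready:
            raise ValueError("cyclic dependency")
        batch = ready[:n_units]      # at most one operation per multiplier
        rounds.append(batch)
        done.update(batch)
    return rounds

# Example: schedule({1: set(), 2: set(), 3: set(), 4: {1, 2}, 5: {3}})
# returns [[1, 2, 3], [4, 5]].
```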

4.6 Comparison

As our architecture supports primes from 116 to 522 bits, we can run benchmarks and make comparisons for multiple bit sizes. The results for different common prime sizes are shown in Table 5 in Appendix A. In this section we consider only the currently widely adopted 128-bit security level, presented in Table 4. The integer addition, subtraction and Montgomery modular multiplication results are the same as in Massolino et al. [26]. Ours is the first work implementing the new complete formulas for elliptic curves in short Weierstrass form [31], and it leads to a scalar multiplication routine which takes about 14.21 ms for a 256-bit prime.

It is not straightforward to make a well-founded comparison with the work in the literature. Table 4 contains different implementations of elliptic curve scalar multiplication, but they have different optimization goals. For example, we beat [35, 36] in terms of milliseconds per scalar multiplication, but they use fewer multipliers or run at a lower frequency. On the other hand, [1, 17, 18, 25, 27, 34] outperform our architecture in terms of speed, but use a much larger number of embedded multipliers. Also, implementations focusing only on NIST curves are able to use the special prime shape, yielding a significant speed-up. Depending on the needs of a specific hardware designer, this specialization of curves might not always be desirable; as mentioned before, many parties in industry might prefer generic cores. Despite these remarks, we argue that our implementation is competitive with the literature, making a similar trade-off between size and speed. Thus the new formulas can be implemented with little to no penalty, while having the benefit of not having to deal with exceptional cases.