
1 Introduction

The main operation in many cryptographic protocols based on elliptic curves is scalar multiplication, which is performed via repeated point addition and doubling. In early works, formulas for the group operation used different sequences of instructions for addition and doubling [22, 28]. This allowed for more optimized implementations, since doublings can be faster than general additions, but naïve implementations suffered from side-channel attacks [23]. Indeed, as all special cases have to be treated differently, it is not straightforward to come up with an implementation that is both efficient and side-channel secure.

A class of elliptic curves which avoids these problems is the family of curves proposed by Bernstein and Lange, the so-called Edwards curves [8]. Arguably, the primary reason for their popularity is their “complete” addition law, that is, a single addition law which can be used for all inputs. The benefit of having a complete addition law is obvious for both simplicity and side-channel security. Namely, having only one set of formulas that works for all inputs simplifies the task of implementers and thwarts side-channel analysis and more refined attacks, e. g. safe-error attacks [38]. After the introduction of Edwards curves, more curve models have been shown to possess complete addition laws [6, 7]. Moreover, (twisted) Edwards curves are being deployed in software, for example in the library NaCl [10]. In particular, software implementations typically rely on specific curves, e. g. on the Montgomery curves Curve25519 [5] by Bernstein or Curve448 [19] proposed by Hamburg.

Moving to a hardware scenario, using the nice properties of these specific curves is not as straightforward anymore. Hardware development is costly, and industry prefers IP cores as generic solutions for all possible clients. Moreover, backwards compatibility is a serious concern, and most current standards [12, 15, 29] regarding large prime fields contain prime order curves in short Weierstrass form. This prohibits using (twisted) Edwards, (twisted) Hessian and Montgomery curves, since the group order of these curve models is always divisible by a small cofactor and can therefore not be prime. The desire for complete addition formulas for prime order curves in short Weierstrass form was recognized, and Renes, Costello and Batina [31] proved this to be realistic. They present complete addition formulas with an efficiency loss of 34%–44% in software when compared to formulas based on Jacobian coordinates, depending on the size of the field.

As the authors mention, one can expect better performance in hardware, but they do not present results. In particular, when using Montgomery multiplication one can benefit from very efficient modular additions and subtractions (which appear frequently in their formulas), which changes the performance ratio derived in the original paper. Therefore, it is of interest to investigate the new complete formulas from a hardware point of view. In this paper we show that the hardware performance is competitive with the literature, building scalar multiplication on top of three parallel Montgomery multipliers. In more detail, we summarize our contributions as follows:

  • we present the first hardware implementation based on the work of [31], working for every prime order curve over a prime field of up to 522 bits, and obtain competitive results;

  • we present algorithms for various levels of parallelism for the new formulas to boost the performance.

Related Work. There are numerous works on curve-based hardware implementations, targeting various FPGA platforms, which makes a meaningful comparison very difficult. Güneysu and Paar [17] proposed a new speed-optimized architecture that makes intensive use of the DSP blocks in an FPGA platform. Guillermin [18] introduced a prime field ECC hardware architecture and implemented it on several Altera FPGA boards. The design is based on the Residue Number System (RNS), facilitating carry-free arithmetic and parallelism. Yao et al. [37] followed the idea of using RNS to design a high-speed ECC co-processor for pairings. Sakiyama et al. [33] proposed a superscalar coprocessor that could deal with three different curve-based cryptosystems, all over characteristic 2 fields. Varchola et al. [35] designed a processor-like architecture, with an instruction set and decoder, on top of which they implemented ECC. This approach has the benefit that a portion is written in software, which can be easily maintained and updated, while special optimized instructions are available for the elliptic curve operations. The downside of this approach is that the resource costs are higher than for a fully optimized processor. As was the case for Güneysu and Paar [17], their targets were the standardized NIST prime curves P-224 and P-256. Consequently, each of their synthesized circuits works for only one of the two primes. Pöpper et al. [30] follow the same approach as Varchola et al. [35], with some side-channel related improvements; the paper focuses on an analysis of each countermeasure and its effective cost. Roy et al. [32] followed the same path, but with more optimizations with respect to resources, and only for the curve NIST P-256. However, the number of Block RAMs necessary for their architecture is much larger than that of Pöpper et al. [30] or Varchola et al. [35]. Fan et al. [16] created an architecture for special primes and curves, namely the standardized NIST P-192. The approach was to parallelize Montgomery multiplication and the formulas for point addition and doubling on the curve. Vliegen et al. [36] attempted to reduce the resources with a small core aimed at 256-bit primes.

Organization. We start with preliminaries in Sect. 2 and briefly discuss parallelism for the complete formulas in Sect. 3. Finally, we present our hardware implementation using three Montgomery multipliers in Sect. 4.

2 Preliminaries for Elliptic Curve Cryptography

Let \(\mathbb{F}_q\) be a finite field of characteristic \(p\), i. e. \(q=p^n\) for some \(n\), and assume that \(p\) is not two or three. For well-chosen \(a,b\in \mathbb{F}_q\), an elliptic curve \(E\) over \(\mathbb{F}_q\) is defined as the set of solutions \((x,y)\) to the curve equation \(E:y^2=x^3+ax+b\), together with an additional point \(\mathcal{O}\), called the point at infinity. The \(\mathbb{F}_q\)-rational points \(E(\mathbb{F}_q)\) are all \((x,y)\in E\) such that \((x,y)\in \mathbb{F}_q^2\), together with \(\mathcal{O}\). They form a group, with \(\mathcal{O}\) as its identity element. From now on, when we write \(E\) we mean \(E(\mathbb{F}_q)\). The order of \(E\) is the order of this group. To compute the group law on \(E\) one can use the chord-and-tangent process. To implement this, however, it is necessary to use at least one inversion. Since inversions are very costly, we choose a different point representation to avoid them.
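To make the cost concrete, here is a small sketch (ours, for illustration only) of the affine chord-and-tangent law over a prime field; note the field inversion in every slope computation, as well as the special cases that a complete addition law removes:

```python
# Affine chord-and-tangent addition over F_p, p > 3 (illustrative sketch,
# not constant time). None encodes the point at infinity O.
def affine_add(P, Q, a, p):
    if P is None:
        return Q                                   # O + Q = Q
    if Q is None:
        return P                                   # P + O = P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
```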

Define an equivalence relation on \(\mathbb{F}_q^3\) by letting \((x_0,x_1,x_2)\sim (y_0,y_1,y_2)\) if and only if there exists \(\lambda \in \mathbb{F}_q^{*}\) such that \((x_0,x_1,x_2)=(\lambda y_0,\lambda y_1,\lambda y_2)\). Then the projective plane over \(\mathbb{F}_q\), denoted \(\mathbb{P}^2(\mathbb{F}_q)\), is defined as \(\mathbb{F}_q^3\setminus \{(0,0,0)\}\) modulo the equivalence relation \(\sim \). We write \((x_0:x_1:x_2)\) to emphasize that the tuple belongs to \(\mathbb{P}^2(\mathbb{F}_q)\) as opposed to \(\mathbb{F}_q^3\). Now we can define \(E(\mathbb{F}_q)\) to be the set of solutions \((X:Y:Z)\in \mathbb{P}^2(\mathbb{F}_q)\) to the homogenized curve equation \(E:Y^2Z=X^3+aXZ^2+bZ^3\). Note that we can easily map between the two representations by \((x,y)\mapsto (x:y:1)\), \(\mathcal{O}\mapsto (0:1:0)\), and \((X:Y:Z)\mapsto (X/Z,Y/Z)\) (for \(Z\ne 0\)), \((0:1:0)\mapsto \mathcal{O}\).
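For concreteness, the two maps can be written as follows (a small sketch of ours; None again encodes \(\mathcal{O}\)):

```python
def to_projective(P):
    """(x, y) -> (x : y : 1); the point at infinity maps to (0 : 1 : 0)."""
    return (0, 1, 0) if P is None else (P[0], P[1], 1)

def to_affine(P, p):
    """(X : Y : Z) -> (X/Z, Y/Z) for Z != 0; (0 : 1 : 0) maps back to O."""
    X, Y, Z = P
    if Z % p == 0:
        return None
    z_inv = pow(Z, -1, p)          # the single field inversion
    return (X * z_inv % p, Y * z_inv % p)
```

In a scalar multiplication all intermediate results stay projective, so this inversion is paid only once, at the very end.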

There are many ways to compute the group law on E, see [9]. These differ depending on the representation of the curve and the points. As mentioned in the introduction, we put emphasis on complete addition formulas for prime order elliptic curves. The work of Renes et al. [31] presents addition formulas for curves in short Weierstrass form embedded in the projective plane. They compute the sum of two points \(P=(X_1:Y_1:Z_1)\) and \(Q=(X_2:Y_2:Z_2)\) as \(P+Q=(X_3:Y_3:Z_3)\), where

$$\begin{aligned} X_3 ={}& (X_1Y_2+X_2Y_1)(Y_1Y_2 - a(X_1Z_2+X_2Z_1) - 3bZ_1Z_2)\\ &- (Y_1Z_2 + Y_2Z_1)(aX_1X_2 + 3b(X_1Z_2+X_2Z_1)-a^2Z_1Z_2),\\ Y_3 ={}& (3X_1X_2 + aZ_1Z_2)(aX_1X_2 + 3b(X_1Z_2+X_2Z_1)-a^2Z_1Z_2)\\ &+ (Y_1Y_2 + a(X_1Z_2+X_2Z_1) + 3bZ_1Z_2)(Y_1Y_2 - a(X_1Z_2+X_2Z_1) - 3bZ_1Z_2),\\ Z_3 ={}& (Y_1Z_2 + Y_2Z_1)(Y_1Y_2 + a(X_1Z_2+X_2Z_1) + 3bZ_1Z_2)\\ &+ (X_1Y_2+X_2Y_1)(3X_1X_2 + aZ_1Z_2). \end{aligned}$$
(1)
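Written out as code, Eq. (1) reads as follows (our direct transcription for reference testing; it takes the precomputed constant \(b_3 = 3b \bmod p\) and makes no claim about operation scheduling or side-channel protection):

```python
def complete_add(P, Q, a, b3, p):
    """Sum of P and Q in projective coordinates via Eq. (1); b3 = 3*b mod p."""
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    xy = (X1 * Y2 + X2 * Y1) % p             # X1*Y2 + X2*Y1
    yz = (Y1 * Z2 + Y2 * Z1) % p             # Y1*Z2 + Y2*Z1
    xz = (X1 * Z2 + X2 * Z1) % p             # X1*Z2 + X2*Z1
    xx, yy, zz = X1 * X2 % p, Y1 * Y2 % p, Z1 * Z2 % p
    t0 = (yy - a * xz - b3 * zz) % p         # Y1*Y2 - a(..) - 3b*Z1*Z2
    t1 = (yy + a * xz + b3 * zz) % p         # Y1*Y2 + a(..) + 3b*Z1*Z2
    t2 = (a * xx + b3 * xz - a * a * zz) % p
    t3 = (3 * xx + a * zz) % p
    return ((xy * t0 - yz * t2) % p,         # X3
            (t3 * t2 + t1 * t0) % p,         # Y3
            (yz * t1 + xy * t3) % p)         # Z3
```

Because the formulas are complete, the same function doubles a point (call it with \(P=Q\)) and handles \(\mathcal{O}=(0:1:0)\) correctly.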

Elliptic curve cryptography [22, 28] commonly relies on the hardness of the “Elliptic Curve Discrete Logarithm Problem (ECDLP)”: given two points \(P, Q\) on an elliptic curve, it is hard to find a scalar \(k\in \mathbb{Z}\) such that \(Q=kP\), if it exists. Therefore the main component of curve-based cryptosystems is the scalar multiplication operation \((k,P)\mapsto kP\). Since in many cases \(k\) is a secret, this operation is very sensitive to attacks. In particular, many side-channel attacks [4, 23] and countermeasures [14] have been proposed. To ensure protection against simple power analysis (SPA) attacks it is important to use regular scalar multiplication algorithms, e. g. the Montgomery ladder [20] or Double-And-Add-Always [14], which execute both an addition and a doubling operation per scalar bit.

3 Parallelism

An important way to increase the efficiency of the implementation is to use multiple Montgomery multipliers in parallel. In this section we give a brief explanation for our choice of three multipliers.

The addition formulas on which our scalar multiplication is built are shown in Algorithm 1 of [31]. We choose to ignore additions and subtractions, since we rely on a Montgomery multiplier for which the cost of a field multiplication is far higher than that of a field addition. The total (multiplicative) cost in the most general case is \(12\mathbf{M}+2\mathbf{m_a}+3\mathbf{m_{3b}}\), where \(\mathbf{m_a}\) resp. \(\mathbf{m_{3b}}\) denote multiplications by the constants \(a\) resp. \(3b\). Because our processors do not distinguish between general multiplications and multiplications by constants, we consider this cost simply as 17M. The authors of [31] introduce optimizations for mixed addition and doubling, but in our case this only saves a single multiplication (and some additions). Since this does not make up for the price we would have to pay for implementing a second algorithm, we only examine the most general case. In Table 1 we show the interdependencies of the multiplications.

Table 1. Dependencies of multiplications inside the complete addition formulas
Table 2. Efficiency approximation of the number of Montgomery multipliers against the area used.

This allows us to write down algorithms for implementations running \(n\) processors in parallel. Denote by \(\mathbf{M_n}\) resp. \(\mathbf{a_n}\) the cost of doing \(n\) multiplications resp. additions (or subtractions) in parallel. In Table 2 we present the costs for \(1\le n\le 6\), under the simple approximations \(\mathbf{M_n}=\mathbf{M}\) and \(\mathbf{a_n}=\mathbf{a}\). We note that this ignores some practical aspects; for example, a larger number of Montgomery multipliers can result in scheduling overhead, which we do not take into account. All algorithms and their respective Magma [11] verification code can be found in Appendices B and C. For our implementation we have chosen \(n=3\), i. e. three Montgomery multipliers. This number of multipliers achieves a good area-time trade-off, obtaining a significant speed-up compared to \(n=1\), while the aforementioned practical issues (e. g. scheduling) are not as complicated to deal with as for larger \(n\).
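Under these idealized assumptions, a simple lower bound on the number of sequential multiplication stages per point addition is (our back-of-the-envelope estimate; the dependencies of Table 1 may prevent reaching it):

$$\text{stages}(n) \ge \left\lceil \tfrac{17}{n} \right\rceil = 17,\, 9,\, 6,\, 5,\, 4,\, 3 \quad \text{for } n = 1,\ldots ,6.$$

The step from \(n=3\) to \(n=4\) saves only a single stage, which is one way to see why \(n=3\) offers a good trade-off.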

4 Implementation of the Formulas with Three Processors

In this section we introduce a novel hardware implementation, parallelizing the new formulas across three Montgomery processors. We make use of the Montgomery processors proposed by Massolino et al. [26] for Microsemi® IGLOO2® FPGAs, whose architecture is shown in Fig. 1. We give a short description of the processor in Sect. 4.1; for more details on its internals we refer to [26]. As a consequence of building on top of this processor, we target the same FPGA. However, it is straightforward to port the design to other FPGAs, or even ASICs, that have a Montgomery multiplier with the same interface and instructions.

Fig. 1. Montgomery addition, subtraction and multiplication processor.

The elliptic curve scalar multiplication routine is constructed on top of the Montgomery processors. As mentioned before, to protect against simple power analysis attacks, we implement a regular scalar multiplication algorithm (i. e. Double-And-Add-Always [14]). The algorithm relies on three registers \(\mathtt{R_0}\), \(\mathtt{R_1}\) and \(\mathtt{R_2}\). The register \(\mathtt{R_0}\) contains the operand which is always doubled, while \(\mathtt{R_1}\) resp. \(\mathtt{R_2}\) contain the result of the addition when the scalar bit is zero resp. one. This algorithm should be applied carefully, since it is prone to fault attacks [3]. From a very high-level point of view the architecture consists of the three Montgomery multipliers and a single BRAM block, shown in Fig. 2. We note that this BRAM block is more than large enough to store the necessary temporary variables; although Algorithm 2 tries to minimize their number, this is not necessary in our case. In the rest of this section we elaborate on the details of the implementation.
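The following sketch (ours) shows the register flow just described, reusing the complete_add transcription from Sect. 2; we assume, as the description suggests, that \(\mathtt{R_1}\) serves as the dummy accumulator and the result is read from \(\mathtt{R_2}\). The secret-dependent branch is only illustrative: in the hardware the selection amounts to uniform register addressing, and both branches perform the same operations.

```python
# Double-And-Add-Always with registers R0, R1, R2 (illustrative sketch; a
# real implementation must select R1/R2 without a secret-dependent branch).
O = (0, 1, 0)                    # identity, a valid input to complete_add

def scalar_mult(k, P, a, b3, p, bits):
    R0, R1, R2 = P, O, O         # R0 is the operand that is always doubled
    for i in range(bits):        # one addition and one doubling per bit
        if (k >> i) & 1:
            R2 = complete_add(R2, R0, a, b3, p)   # bit one: real accumulator
        else:
            R1 = complete_add(R1, R0, a, b3, p)   # bit zero: dummy addition
        R0 = complete_add(R0, R0, a, b3, p)       # doubling, same formulas
    return R2                    # R2 = k * P
```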

4.1 The Montgomery Processor

Massolino et al. [26] proposed two different Montgomery processors. Our scalar multiplication builds on top of “version 2”, which has support for two internal multipliers and two memory blocks. It can perform three operations: Montgomery multiplication, addition without reduction and subtraction without reduction. To perform Montgomery multiplication, the processor employs the FIOS algorithm proposed by Koç et al. [21]. In short, FIOS computes the partial product and the partial reduction inside the same iterative loop. This translates into a hardware architecture, see Fig. 1, with one unit for the partial product and another for the partial modular reduction. The circuit behaves like a three-stage pipeline: in the first stage operands are fed into the circuit, in the second the computation takes place, and in the third the results are stored into memory. The pipeline system is reused for the addition and subtraction operations in the multiplier, where values are added or subtracted directly. In the case of subtraction the computation also adds a multiple of the prime modulus, to avoid negative intermediate values. These operations can be done without applying reduction, because reduction will be applied later during a multiplication operation. However, there is a limit to the number of consecutive additions/subtractions without reduction, on which we elaborate in Sect. 4.4.
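To illustrate the principle (though not the two-pipeline datapath of Fig. 1), the following big-integer sketch of ours performs the word-serial loop in which every round interleaves one partial product with one partial reduction, which is the essence of FIOS:

```python
def montmul(a, b, n, n0_inv, s, W=17):
    """Word-serial Montgomery product a * b * 2^(-W*s) mod n (sketch).
    Requires a, b < n < 2^(W*s) and n0_inv = (-n)^(-1) mod 2^W."""
    mask = (1 << W) - 1
    t = 0
    for i in range(s):
        t += ((b >> (W * i)) & mask) * a      # partial product: b_i * a
        m = ((t & mask) * n0_inv) & mask      # quotient digit for this round
        t = (t + m * n) >> W                  # partial reduction: low word -> 0
    return t - n if t >= n else t             # final conditional subtraction
```

The precomputed constant n0_inv depends only on the prime, e. g. pow(-n, -1, 1 << W) in Python; for a 256-bit prime and 17-bit words one takes \(s=16\), since \(16\times 17 = 272 > 256\).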

4.2 Memory

The main RAM memory in Fig. 2 is subdivided in order to lower control logic resources and to simplify the interface. The main memory operates as a true dual port memory of 1024 words of 17 bits. We partition it into big words of 32 words (i. e. \(32\times 17 = 544\) bits) each, so that the memory consists of 32 big words of 32 words each. A big word can accommodate any temporary variable, input or output of our architecture. The one possible exception is the scalar of the scalar multiplication: although a single big word would be large enough to contain 523-bit scalars (in the largest case of a 523-bit field), the scalar blinding technique can double the size of the scalar. Therefore, we use two big words to store the scalar. By doing this, it will in the future be possible to execute scalar multiplication with a blinded scalar [13]. Lastly, there is a 17-bit shift register into which the scalar is loaded word by word.
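A sketch of the resulting addressing (our illustration; in the core this is simply hard-wired, and the names are ours):

```python
WORD_BITS = 17
WORDS_PER_BIG_WORD = 32          # 32 * 17 = 544 bits per big word
NUM_BIG_WORDS = 32               # 32 * 32 = 1024 words in total

def word_address(big_word, offset):
    """Flat address of word `offset` of big word `big_word` in the
    1024 x 17-bit dual-port RAM."""
    assert 0 <= big_word < NUM_BIG_WORDS and 0 <= offset < WORDS_PER_BIG_WORD
    return big_word * WORDS_PER_BIG_WORD + offset
```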

Fig. 2. Entire architecture with three Montgomery processors from [26], where MM = Montgomery processor, SHR = Shift register, REG = Register.

4.3 Control Logic

The formulas and the control system are implemented through two state machines: a main one, which controls everything, and a second one dedicated to memory transfers.

The memory-transfer state machine was created to reduce the number of states in the main machine. It provides the transfer operation between the main memory and the memories of the Montgomery processors. Therefore, the main machine can transfer values with just one state and can reuse most of the transfer logic. The memory-transfer machine is responsible for various parts of the bus between the main memory, the processors and various counters. However, the main state machine still has to be able to control everything; hence it shares some components with the memory-transfer machine, which increases the cost of the control circuit.

The main state machine controls all the circuits that compose the entire cryptographic core. Since it controls the entire circuit, this machine also has the complete scheduling of Table 3 implemented as states. The advantage of doing this through states is that the design and the entire control can be optimized. However, the maintenance cost is much higher than for a small instruction set or microcode that could also implement the addition formulas or the scalar multiplication. Because the addition formulas are complete, both addition and doubling can be performed through the addition formulas alone. This decreases the number of states and therefore makes the final implementation considerably more compact. Hence, the implementation simply iterates over the addition formulas until the end of the computation.

4.4 Consecutive Additions

For the Montgomery processor to work in our architecture, part of the original design had to be changed. The authors of [26] did not need to reduce after each addition or subtraction, as they assumed that these operations would always be followed by a Montgomery multiplication (and its corresponding reduction). However, they were not able to do multiple consecutive additions and subtractions, as the Montgomery division value \(r\) was chosen to be only 4 bits larger than the prime. On the other hand, it is readily seen that Algorithm 2 contains several consecutive additions and subtractions. One example: \(t_9\) is computed as a sum in line 7, added to another value and stored in \(t_{10}\) in line 8, which in line 10 is added to a fourth value. To be able to execute these without having to reduce, we need a Montgomery division value at least 5 bits larger than the prime. As a consequence, the processor only works for primes of up to 522 bits (as opposed to 523), which is still one bit more than the largest standardized prime curve [29].
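One consistent instantiation of these numbers (an assumption on our part, based on the 17-bit word size of the processor) is an operand length of 31 words, giving

$$r = 2^{17\cdot 31} = 2^{527}, \qquad r \ge 2^{5}p \;\Longrightarrow \; \lceil \log _2 p \rceil \le 527-5 = 522,$$

whereas the original 4 bits of headroom allowed primes of up to \(527-4=523\) bits.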

Table 3. Scheduling for point addition \(P\leftarrow P+Q\), where \(P=(X_1:Y_1:Z_1)\) and \(Q=(X_2:Y_2:Z_2)\). For doubling simply put \(P=Q\).

4.5 Scheduling

The architecture presented in Fig. 2 has one dual port memory, whereas it has three processors. This means that we can only load values to two processors at the same time. As a consequence the three processors do not run completely in parallel, but one of the three runs out of step with the other two. Table 3 shows how the operations are split over the different processors. They are distributed with the goal of minimizing the number of loads and stores for each processor and the idle time of MM2. The process begins by loading the necessary values into MM0 and MM1 and executing their respective operations. As soon as the operations in MM0 and MM1 are initialized, the corresponding values are loaded into MM2 and its operation is executed. As soon as MM0 and MM1 finish their operations, this process restarts. Since the operations executed in MM2 are not synchronized with those in MM0 and MM1, the operations in MM0 and MM1 must be independent of the output of MM2, and vice versa. Furthermore, since multiplications are at least ten times slower than additions for our choice of processor [26], the additions and subtractions from lines seven and eight of Algorithm 2 can be done by the otherwise idle processor MM2 in stage six. This makes them essentially free.
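A schedule in the spirit of Table 3 can be found with a simple greedy pass over the dependency table (a sketch of ours; the example dependency sets are illustrative placeholders, not the contents of Table 1, and the real schedule must additionally respect the two memory ports):

```python
def schedule(deps, n_units=3):
    """Greedy list scheduling: deps maps each multiplication to the set of
    multiplications whose results it needs. Returns rounds of at most
    n_units operations, each round fully parallelizable."""
    done, rounds = set(), []
    while len(done) < len(deps):
        ready = [m for m in deps if m not in done and deps[m] <= done]
        if not ready:
            raise ValueError("cyclic dependency")
        batch = ready[:n_units]      # at most one operation per multiplier
        rounds.append(batch)
        done.update(batch)
    return rounds

# Example: schedule({1: set(), 2: set(), 3: set(), 4: {1, 2}, 5: {3}})
# returns [[1, 2, 3], [4, 5]].
```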

4.6 Comparison

As our architecture supports primes from 116 to 522 bits, we can run benchmarks and make comparisons for multiple bit sizes. The results for different common prime sizes are shown in Table 5 in Appendix A. In this section we consider only the currently widely adopted 128-bit security level, presented in Table 4. The integer addition, subtraction and Montgomery modular multiplication results are the same as in Massolino et al. [26]. Ours is the first work implementing the new complete formulas for elliptic curves in short Weierstrass form [31], and it leads to a scalar multiplication routine which takes about 14.21 ms for a 256-bit prime.

It is not straightforward to make a well-founded comparison with the work in the literature. Table 4 contains different implementations of elliptic curve scalar multiplication, but they have different optimization goals. For example, we beat [35, 36] in terms of milliseconds per scalar multiplication, but they use fewer multipliers or run at a lower frequency. On the other hand, [1, 17, 18, 25, 27, 34] outperform our architecture in terms of speed, but use a much larger number of embedded multipliers. Also, implementations focusing only on NIST curves are able to use the special prime shape, yielding a significant speed-up. Depending on the needs of a specific hardware designer, this specialization of curves might not always be desirable; as mentioned before, many parties in industry might prefer generic cores. Despite these remarks, we argue that our implementation is competitive with the literature, making a similar trade-off between size and speed. Thus the new formulas can be implemented with little to no penalty, while having the benefit of not having to deal with exceptional cases.