1 Introduction

1.1 Background

One can argue that public-key cryptosystems become more secure as advances in hardware speed up the computation of cryptographic algorithms. Take the RSA cryptosystem as an example. The effort of cracking RSA through factorisation of the product of two large primes approximately doubles for every 35 bits added to the key at a key size of \(2^{10}\) bits [12]. However, adding 35 bits to the key increases the work involved in decryption by only 10 %. Thus, speeding up the hardware by just 10 % makes the cryptosystem about twice as strong without any other extra resources [26]. Speed, therefore, is an important goal for public-key cryptosystems. Indeed, it is essential not just for cryptographic strength but also to handle the large number of transactions performed by central servers in electronic commerce systems.

This work aims to speed up public-key cryptosystems by accelerating their fundamental operation: the multiplication \(X = A \times B\) followed by a reduction modulo M, \(X \mod M = \langle X \rangle _M\), where A, B and the modulus M are all n-bit positive integers. This is the most frequent operation in elliptic curve cryptosystems (ECC) [15]. In RSA, it is the only operation required to implement the modular exponentiations which constitute encryption and decryption [36]. The Residue Number System (RNS) [41] offers advantages for long wordlength arithmetic of this kind by representing integers in independent short wordlength channels.

Indeed, implementing public-key cryptosystems using RNS is an interesting avenue of research [6, 25]. The drawback of this approach is RNS modular reduction, which is a computationally complex operation. Early publications [42] avoided it altogether by converting from the RNS representation back to a positional system, performing modular reduction there, and converting the result back into RNS. Later, algorithms using look-up tables [23, 39, 40, 43] were proposed to perform short wordlength modular reduction. Most of these avoided converting numbers from RNS to positional systems, but were limited to 32-bit inputs [9, 19, 46] by the size of the available tables. The work in [1] uses the Chinese remainder theorem (CRT) to perform modular reduction within RNS channels; however, no implementation results are given for the proposed algorithm. Another alternative is the use of the core function to perform RNS-based modular multiplication [27]. More recently, variations of Montgomery’s reduction algorithm [31] have been developed which work entirely within an RNS [5, 17, 34].

Montgomery’s reduction algorithm is only one of the alternatives available in positional number systems [28]. This raises a question: can any of the other reduction algorithms from positional number systems be applied to RNS? This paper answers in the affirmative by presenting an RNS reduction architecture which uses sum of residues reduction and has a fast implementation on FPGA.

Early attempts in positional number systems reduce \(Z = A \times B\) modulo \(M\) by finding a sum of residues modulo \(M\) [16, 44]. If \(Z = \sum _i Z_i\) then we have \(\sum _i \langle Z_i \rangle _M \equiv Z \mod M\). Although this does not produce a fully reduced result, it is possible to determine bounds for intermediate values such that the output from one modular multiplication can be used as the input to subsequent modular multiplications without overflow.
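A quick numeric illustration of this identity (toy values of our own, not parameters from this paper) shows that the sum of residues is congruent to \(Z\) modulo \(M\) without being fully reduced:

```python
# Sum-of-residues sketch with toy values: the sum of the parts' residues
# is congruent to Z mod M, but may exceed M (it is not fully reduced).
M = 97
parts = [10000, 2000, 300, 45]           # Z = 12345 split into parts
Z = sum(parts)

partial = sum(p % M for p in parts)      # 9 + 60 + 9 + 45 = 123
assert partial % M == Z % M              # 123 mod 97 == 12345 mod 97 == 26
```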

We apply this sum of residues method to the modular multiplication of large integers in the RNS, with the advantage that all of the residues \(\langle Z_i \rangle _M\) can be evaluated in parallel. The proposed algorithm performs the modular reduction entirely within the RNS channels, without any conversion to or from a binary number system, ensuring high-speed operation.

The rest of the paper is arranged as follows: Section 1.2 highlights our contribution to the topic of modular multiplication in RNS. Section 2 briefly explains the representation and benefits of the residue number system. Section 3 describes the Barrett algorithm used in the proposed design to perform the modulo operation within each RNS channel. Section 4 develops the proposed algorithm for modular multiplication in RNS. Section 5 describes the implementation of the proposed algorithm and a comparison with other RNS-based modular multipliers. The work is concluded in Sect. 6.

1.2 Contribution

This paper makes the following contributions.

  1.

    Confinement of all the computations of modular multiplication within the RNS channels without any long wordlength operations or conversion to a positional number system.

  2.

    Proposal of a novel algorithm that performs modular multiplication in a single iteration. The proposed architecture, based on this algorithm, is the first hardware implementation of single-iteration modular multiplication.

  3.

    High scalability of the proposed algorithm and modular multiplier architecture. The dynamic range of the modular multiplication can easily be decreased or increased by changing the RNS channel width or the number of channels. This allows the architecture to be scaled up simply by adding RNS channels to the existing design. The paper provides the detailed analysis and criteria needed to compute the pre-computed values required to construct modular multipliers of different wordlengths.

2 The Residue Number System

A Residue Number System [42] is characterised by a set of \(N\) co-prime moduli \(\{m_1,...,m_N\}\) with \(m_1< m_2< \dots < m_N\). In the RNS a non-negative integer \(A\) is represented in \(N\) channels: \(A=\{a_1,a_2,...,a_N\}\), where \(a_i\) is the residue of \(A\) with respect to \(m_i\), i.e. \(a_i = \langle A \rangle _{m_i} = A \mod m_i\). The wordlength of \(m_i\) (in bits), which defines the RNS channel width, is denoted by \(w\). Within the RNS there is a unique representation of all integers in the range \(0 \le A < D\) where \(D = m_1m_2...m_N\); \(D\) is therefore known as the dynamic range of the RNS. Two other values, \(D_i\) and \(\langle D_i^{-1}\rangle _{m_i}\), are commonly used in RNS computations and are worth defining here: \(D_i = D/m_i\), and \(\langle D_i^{-1}\rangle _{m_i}\) is its multiplicative inverse with respect to \(m_i\), such that \(\langle D_i \times D_i^{-1}\rangle _{m_i} = 1\).
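For illustration, the representation and the constants \(D_i\) and \(\langle D_i^{-1}\rangle _{m_i}\) can be computed as follows (a toy moduli set of our own, not the design set of Sect. 5):

```python
# RNS representation with a toy co-prime moduli set {7, 11, 13}.
from math import prod

moduli = [7, 11, 13]                     # m_1 < m_2 < m_3, pairwise co-prime
D = prod(moduli)                         # dynamic range D = 1001

A = 123
a = [A % m for m in moduli]              # RNS digits: [4, 2, 6]

D_i = [D // m for m in moduli]           # [143, 91, 77]
D_i_inv = [pow(d, -1, m) for d, m in zip(D_i, moduli)]
for d, inv, m in zip(D_i, D_i_inv, moduli):
    assert (d * inv) % m == 1            # <D_i * D_i^{-1}>_{m_i} = 1
```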

If \(A\), \(B\) and \(C\) have RNS representations given by \(A = \{a_1,a_2, \dots ,a_N\}\), \(B = \{b_1,b_2, \dots ,b_N\}\) and \(C = \{c_1,c_2, \dots ,c_N\}\), then, using * to denote any of the operations +, − or \(\times \), the RNS version of \(C = A\) * \(B\) satisfies

$$\begin{aligned} C = \{\langle a_1 \text{* } b_1 \rangle _{m_1}, \langle a_2 \text{* } b_2 \rangle _{m_2}, \dots ,\langle a_N \text{* } b_N \rangle _{m_N}\}. \end{aligned}$$
(1)

Thus, addition, subtraction and multiplication can be performed concurrently on the \(N\) residues within \(N\) parallel channels, and it is this high-speed parallel operation that makes the RNS attractive. There is, however, no such parallel form of modular reduction for the long wordlength moduli used in public-key cryptosystems. In order to implement RNS-based public-key cryptosystems, it is therefore essential to devise an algorithm which can perform fast modular multiplication in RNS.
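Before turning to reduction, Eq. (1) can be made concrete with the toy moduli introduced above (our illustration, not part of the design):

```python
# Channel-wise RNS arithmetic per Eq. (1): each channel is independent.
moduli = [7, 11, 13]
A, B = 123, 45
a = [A % m for m in moduli]
b = [B % m for m in moduli]

product = [(x * y) % m for x, y, m in zip(a, b, moduli)]
assert product == [(A * B) % m for m in moduli]   # matches A*B channel-wise
```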

3 The Modular Reduction within RNS Channels

In Eq. (1), each operation is accomplished by performing the basic operation (addition, subtraction or multiplication) first and a reduction modulo the channel modulus \(m_i\) second. Compared with the modular reduction, these basic operations of addition, subtraction and multiplication are trivial. This section explains how the Barrett modular reduction algorithm [8] is used in our implementation to perform this modular reduction within the RNS channels.

The relationship between division and modular reduction is made explicit in Eq. (2).

$$\begin{aligned} z = c \mod m = c - \left\lfloor \frac{c}{m} \right\rfloor \times m. \end{aligned}$$
(2)

where \(c\) is \(2w\) bits, \(m\) is the \(w\)-bit modulus and \(\lfloor x \rfloor \) returns the largest integer less than or equal to \(x\). To distinguish this from the large modular multiplication over the whole RNS discussed in Sect. 4, lower case letters are used here to indicate an operation running within an RNS channel \(m_i\). The Barrett algorithm, proposed for positional number systems in [7] and [8], gives a fast computation of the division \(y = \left\lfloor \frac{c}{m} \right\rfloor \) as

$$\begin{aligned} y = \left\lfloor \frac{c}{m} \right\rfloor = \left\lfloor \frac{\frac{c}{2^{w + v}}\frac{2^{w + u}}{m}}{2^{u - v}} \right\rfloor , \end{aligned}$$

where \(u\) and \(v\) are two parameters. Furthermore, the quotient \(y\) can be estimated with an error of at most 1 from

$$\begin{aligned} \hat{y} = \left\lfloor \frac{\left\lfloor \frac{c}{2^{w + v}} \right\rfloor \left\lfloor \frac{2^{w + u}}{m} \right\rfloor }{2^{u - v}} \right\rfloor . \end{aligned}$$

The value \(K = \left\lfloor \frac{2^{w + u}}{m} \right\rfloor \) is a constant and can be pre-computed.

The algorithm used in our implementation is shown in Algorithm 1, where \(u\) and \(v\) are set to \(w+3\) and −2, respectively, as suggested by [14] and [13]. The bounds on the quotient, input and output for these specific values of \(u\) and \(v\) are calculated to be \(w+3\), \(2w+2\) and \(w+1\) bits, respectively [13].

figure a
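For reference, a software sketch of this channel reduction is given below (our rendering of Algorithm 1; the function and test values are ours). With \(u = w+3\) and \(v = -2\), the two divisions become right shifts by \(w-2\) and \(w+5\):

```python
# Barrett reduction within one RNS channel (sketch of Algorithm 1).
# With u = w+3, v = -2: K = floor(2^(2w+3)/m) is pre-computed, and the
# quotient estimate errs by at most 1, so the result fits in w+1 bits
# (it may still exceed m by less than m, i.e. not fully reduced).
def barrett_reduce(c, m, w, K):
    y_hat = ((c >> (w - 2)) * K) >> (w + 5)   # estimated quotient
    return c - y_hat * m                      # < 2m, i.e. w+1 bits

w, m = 14, 16183                              # a channel from Table 4
K = (1 << (2 * w + 3)) // m
c = 123456789                                 # any input below 2w+2 bits
z = barrett_reduce(c, m, w, K)
assert z % m == c % m and z < 2 * m
```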

4 Modular Multiplication in Residue Number System

This section derives our main RNS modular multiplication (MM) algorithm using a sum of residues. Upper case variables again denote large operands involved in modular multiplication over the whole RNS.

4.1 Moduli Selection

In our application, the RNS is used to accelerate a 256-bit modular multiplication; therefore, the dynamic range \(D\) of the RNS should be no smaller than 512 bits so that the product of two 256-bit numbers does not overflow. One important consideration is the uniform distribution of this 512-bit dynamic range across the \(N\) moduli. The smaller the RNS channel width \(w\), the faster the computation within each channel and the greater the advantage of the RNS; therefore, we want \(w\) to be as small as possible. In this paper, the \(N\) moduli are selected to have the same wordlength, so that the dynamic range of the RNS is evenly distributed across the \(N\) channels.

Much of the work in the literature uses moduli of special forms, e.g. pseudo-Mersenne numbers [10] or numbers of the form \(2^w \pm 1\) [32]. This work, however, focuses on general moduli rather than special ones, demonstrating through the proposed algorithm that a fast implementation of modular multiplication does not have to rely on special characteristics of the moduli.

4.2 Sum of Residues Reduction in the RNS

To define an RNS modular reduction algorithm, we start with the Chinese remainder theorem (CRT) [42]. Using the CRT, an integer \(X\) can be expressed as

$$\begin{aligned} X = \left\langle \sum _{i=1}^N D_i \langle D_i^{-1}x_i \rangle _{m_i} \right\rangle _D, \end{aligned}$$
(3)

where \(D\), \(D_i\) and \(\langle D_i^{-1} \rangle _{m_i}\) are pre-computed constants. Defining \(\gamma _i = \langle D_i^{-1}x_i \rangle _{m_i}\) in (3) yields,

$$\begin{aligned} X= & {} \left\langle \sum _{i=1}^N \gamma _i D_i \right\rangle _{D}\nonumber \\= & {} \sum _{i=1}^N \gamma _i D_i - \alpha D. \end{aligned}$$
(4)

Reducing this modulo the long wordlength modulus \(M\) yields

$$\begin{aligned} Z= & {} \sum _{i=1}^N \gamma _i \langle D_i \rangle _M - \langle \alpha D \rangle _{M}\nonumber \\= & {} \sum _{i=1}^N Z_i - \langle \alpha D \rangle _M \nonumber \\\equiv & {} X \mod M \end{aligned}$$
(5)

where \(Z_i = \gamma _i \langle D_i \rangle _M\). Thus, we have expressed \(Z \equiv X \mod M\) as a sum of residues \(Z_i\) modulo \(M\) and a correction factor \(\langle \alpha D \rangle _M\).

Note that \(\gamma _i = \langle D_i^{-1}x_i \rangle _{m_i}\) can be found using a single RNS multiplication as \(\langle D_i^{-1} \rangle _{m_i}\) is just a pre-computed constant. For the same reason, only one RNS multiplication is needed for \(Z_i = \gamma _i \langle D_i \rangle _M\) as \(\left\langle \langle D_i \rangle _M \right\rangle _{m_i} \) can be pre-computed.

In addition, to avoid negative residues arising in the RNS channels from the subtraction in (5), the term \(- \langle \alpha D \rangle _M\) can be replaced by \(+ \langle - \alpha D \rangle _M\), which in the RNS is also a set of \(N\) pre-computed residues \(\left\langle \langle - \alpha D \rangle _M \right\rangle _{m_i}\). This makes the last operation in (5) a simple RNS addition, and (5) becomes

$$\begin{aligned} Z = \sum _{i=1}^N \gamma _i \langle D_i \rangle _M + \langle - \alpha D \rangle _M. \end{aligned}$$
(6)

A further expansion to an expression of vectors of the pre-computed residues will make this equation clearer:

$$\begin{aligned} \left( \begin{array}{c} z_1\\ z_2\\ \vdots \\ z_N\\ \end{array} \right)= & {} \sum _{i=1}^N \langle D_i^{-1}x_i \rangle _{m_i} \left( \begin{array}{c} \left\langle \langle D_i \rangle _M \right\rangle _{m_1}\\ \left\langle \langle D_i \rangle _M \right\rangle _{m_2}\\ \vdots \\ \left\langle \langle D_i \rangle _M \right\rangle _{m_N}\\ \end{array} \right) \nonumber \\&+ \alpha \left( \begin{array}{c} \left\langle \langle - D \rangle _M \right\rangle _{m_1}\\ \left\langle \langle - D \rangle _M \right\rangle _{m_2}\\ \vdots \\ \left\langle \langle - D \rangle _M \right\rangle _{m_N}\\ \end{array} \right) \end{aligned}$$
(7)
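In software terms, the pre-computed tables appearing in Eq. (7) can be generated as below (a sketch with toy parameters of our own; the hardware stores these values in look-up tables):

```python
# Pre-computed residue tables for Eq. (7), toy RNS and toy modulus M.
from math import prod

moduli = [7, 11, 13]
N = len(moduli)
D = prod(moduli)
M = 59                                          # toy long-wordlength modulus

Di_mod_M = [(D // m) % M for m in moduli]       # <D_i>_M
table_Di = [[d % mj for mj in moduli]           # <<D_i>_M>_{m_j}
            for d in Di_mod_M]
table_aD = [[(-a * D) % M % mj for mj in moduli]  # <<-alpha*D>_M>_{m_j}
            for a in range(N)]                  # alpha < N, so N entries
```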

4.3 Approximation of \(\alpha \)

Now \(\alpha \) is the only value still to be determined. Here the method provided by Kawamura [24] is improved by decomposing its approximations, and more accuracy is achieved by permitting exact \(\gamma _i\).

Dividing both sides of (4) by \(D\) yields

$$\begin{aligned} \alpha + \frac{X}{D} = \frac{\sum _{i=1}^N \gamma _i D_i}{D} = \sum _{i=1}^N \frac{\gamma _i}{m_i}. \end{aligned}$$
(8)

Since \(0 \le X/D < 1\), \(\alpha \le \sum _{i=1}^N \frac{\gamma _i}{m_i} < \alpha + 1\) holds. Therefore,

$$\begin{aligned} \alpha = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{m_i} \right\rfloor . \end{aligned}$$
(9)

In subsequent discussions, \(\hat{\alpha }\) is used to approximate \(\alpha \). Firstly, an approximation satisfying \(\hat{\alpha } = \alpha \) or \(\alpha - 1\) will be derived. Secondly, some extra work will guarantee \(\hat{\alpha } = \alpha \) under certain prerequisites.

4.3.1 Deduction of \(\hat{\alpha } = \alpha \) or \(\alpha - 1\)

The first approximation is introduced here: the denominators \(m_i\) in (9) are replaced by \(2^w\), where \(w\) is the RNS channel width and \(2^{w-1} < m_i \le 2^w - 1\). The estimate of (9) then becomes

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{2^w} \right\rfloor . \end{aligned}$$
(10)

The error incurred by this denominator’s approximation is denoted as

$$\begin{aligned} \epsilon _i = \frac{(2^w - m_i)}{2^w}. \end{aligned}$$

Then,

$$\begin{aligned} 2^w = \frac{m_i}{1 - \epsilon _i}. \end{aligned}$$

According to the definition of RNS in Sect. 2, the RNS moduli are ordered such that \(m_i < m_j\) for all \(i < j\). Therefore, the largest error

$$\begin{aligned} \epsilon = \max (\epsilon _i) = \frac{(2^w - m_1)}{2^w}. \end{aligned}$$

The accuracy of \(\hat{\alpha }\) can be investigated:

$$\begin{aligned} 0\le & {} \gamma _i \le m_i - 1 \nonumber \\ \Rightarrow 0\le & {} \sum _{i=1}^N \frac{\gamma _i}{m_i} < N. \end{aligned}$$
(11)

Therefore,

$$\begin{aligned} \sum _{i=1}^N \frac{\gamma _i}{2^w}= & {} \sum _{i=1}^N \frac{\gamma _i(1 - \epsilon _i)}{m_i} \end{aligned}$$
(12)
$$\begin{aligned}\ge & {} \sum _{i=1}^N \frac{\gamma _i}{m_i} - \epsilon \sum _{i=1}^N \frac{\gamma _i}{m_i} \nonumber \\ \Rightarrow \sum _{i=1}^N \frac{\gamma _i}{2^w}> & {} \sum _{i=1}^N \frac{\gamma _i}{m_i} - N \epsilon . \end{aligned}$$
(13)

The last inequality holds due to Eq. (11). If \(0 \le N \epsilon \le 1\), then \(\sum _{i=1}^N \frac{\gamma _i}{m_i} - N \epsilon > \sum _{i=1}^N \frac{\gamma _i}{m_i} - 1\). Thus, \(\sum _{i=1}^N \frac{\gamma _i}{2^w} > \sum _{i=1}^N \frac{\gamma _i}{m_i} - 1\). In addition, obviously \(\sum _{i=1}^N \frac{\gamma _i}{2^w} < \sum _{i=1}^N \frac{\gamma _i}{m_i}\). Therefore,

$$\begin{aligned} \sum _{i=1}^N \frac{\gamma _i}{m_i} - 1< \sum _{i=1}^N \frac{\gamma _i}{2^w} < \sum _{i=1}^N \frac{\gamma _i}{m_i}. \end{aligned}$$
(14)

Then,

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{2^w} \right\rfloor = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{m_i} \right\rfloor = \alpha , \end{aligned}$$

or,

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{m_i} \right\rfloor - 1 = \alpha - 1. \end{aligned}$$

when \(0 \le N \epsilon \le 1\).

This raises the question: is it easy to satisfy the condition \(0 \le N \epsilon \le 1\) in an RNS? The answer is that the larger the dynamic range of the RNS, the easier it becomes. This is in contrast to most published techniques, which are only applicable to an RNS with a small dynamic range [9, 19, 39, 43].

Given \(0 \le N \epsilon \le 1\) and \(\epsilon = \frac{(2^w - m_1)}{2^w}\),

$$\begin{aligned} \frac{N - 1}{N} \le \frac{m_1}{2^w} \le 1, \end{aligned}$$

which means there must be at least \(N\) co-prime numbers within the interval \(I = [\frac{N - 1}{N}2^w, 2^w]\) to serve as RNS moduli. It is also easy to satisfy the stricter condition \(0 \le N \epsilon \le \frac{1}{2}\). This requires

$$\begin{aligned} \frac{2N - 1}{2N} \le \frac{m_1}{2^w} \le 1, \end{aligned}$$

which can be derived using the process above. Thus, the new interval for RNS moduli is given by Eq. (15) and will be used for further developments in the next subsection.

$$\begin{aligned} I = \left[ \frac{2N - 1}{2N}2^w, 2^w\right] \end{aligned}$$
(15)

Table 1 lists the maximum \(N\) against different \(w\) from 4 to 24 within the interval \(I = [\frac{2N - 1}{2N}2^w, 2^w]\). It is evident that the number of available channels \(N\) increases dramatically with the linear increase of the channel width \(w\). This is because the span of the interval \(I\) is \(2^w - \frac{2N - 1}{2N}2^w = \frac{2^w}{2N}\); \(2^w\) increases much faster than \(N\), which gives a sharp increase in the span of \(I\), with more candidate moduli available within it, as the dynamic range \(D\) of the RNS increases.

Table 1 Maximum possible N against w in new RNS modular multiplication
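The entries of Table 1 can be estimated along the following lines (our sketch, not the authors' procedure; greedily collecting pairwise co-prime moduli from the top of the interval gives a lower bound on the attainable \(N\)):

```python
# Estimate the maximum N for a given w: count pairwise co-prime moduli
# in I = [(2N-1)/(2N)*2^w, 2^w]. Greedy selection is a lower bound, and
# trial gcds are fine for small w; large w would need a faster search.
from math import gcd

def coprime_count(lo, hi):
    chosen = []
    for n in range(hi, lo - 1, -1):              # sweep down from the top
        if all(gcd(n, c) == 1 for c in chosen):
            chosen.append(n)
    return len(chosen)

def max_channels(w):
    N = 1
    while True:
        lo = ((2 * N - 1) * (1 << w) + 2 * N - 1) // (2 * N)   # ceil
        if coprime_count(lo, (1 << w) - 1) < N:  # moduli at most 2^w - 1
            return N - 1
        N += 1
```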

The remaining problem is that \(\hat{\alpha }\) could be either \(\alpha \) or \(\alpha - 1\). From Eq. (4), the corresponding \(\hat{X}\) could be \(X\) or \(X + D\). Two values of \(X \mod M\) would then result, and it is difficult to tell which is correct. Thus, \(\hat{\alpha }\) needs to be exactly \(\alpha \).

4.3.2 Ensuring \(\hat{\alpha } = \alpha \)

To make sure \(\hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{2^w} \right\rfloor \) in (10) is equal to \(\alpha \) instead of \(\alpha - 1\), a correction factor \(\Delta \) can be added to the floor function. Equation (10) becomes

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{2^w} + \Delta \right\rfloor . \end{aligned}$$
(16)

Substituting Eq. (8) in Eqs. (13) and (14) yields

$$\begin{aligned} \alpha + \frac{X}{D} - N \epsilon< \sum _{i=1}^N \frac{\gamma _i}{2^w} < \alpha + \frac{X}{D}. \end{aligned}$$

Adding \(\Delta \) on both sides yields

$$\begin{aligned} \alpha + \frac{X}{D} - N \epsilon + \Delta< \sum _{i=1}^N \frac{\gamma _i}{2^w} + \Delta < \alpha + \frac{X}{D} + \Delta . \end{aligned}$$
(17)

If \(\Delta \ge N \epsilon \), then \(\Delta - N \epsilon \ge 0\) and \(\alpha + \frac{X}{D} - N \epsilon + \Delta \ge \alpha \). If \(0 \le X < (1 - \Delta )D\), then \(\frac{X}{D} + \Delta < 1\) and \(\alpha + \frac{X}{D} + \Delta < \alpha + 1\). Hence,

$$\begin{aligned} \alpha< \sum _{i=1}^N \frac{\gamma _i}{2^w} + \Delta < \alpha + 1. \end{aligned}$$
(18)

Therefore,

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\gamma _i}{2^w} + \Delta \right\rfloor = \alpha \end{aligned}$$

holds. The two prerequisites obtained from the deduction above are

$$\begin{aligned} {\left\{ \begin{array}{ll} N \epsilon \le \Delta< 1 \\ 0 \le X < (1 - \Delta )D. \end{array}\right. } \end{aligned}$$
(19)

It has already been shown in the previous section that the first condition \(N \epsilon \le \Delta < 1\) is easily satisfied as long as \(\Delta \) is not too small; for example, \(\Delta \) could be \(\frac{1}{2}\). The second condition looks less feasible at first sight, as it requires \(X\) to be less than half the dynamic range \(D\) in the case of \(\Delta = \frac{1}{2}\). However, \(\frac{1}{2}D\) is just one bit shorter than \(D\), which is a number of several hundred bits here. The condition can therefore easily be met by extending \(D\) by several bits to cover the upper bound of \(X\), as deduced in the following subsection. Hence, \(\hat{\alpha } = \alpha \) is obtained.

4.4 Bound Deduction

The RNS dynamic range for a 256-bit multiplication must be at least 512 bits. However, RNS algorithms always require some redundant RNS channels, and this subsection establishes how many channels are actually needed for the new RNS modular multiplication algorithm. Note that the result \(Z\) in Eq. (6) (the basis of the RNS modular multiplication algorithm) may be greater than the modulus \(M\) and would require subtraction of a multiple of \(M\) to be fully reduced. Instead, the dynamic range \(D\) of the RNS can be made large enough that the results of modular multiplications can be used as operands for subsequent modular multiplications without overflow.

Given that \(\gamma _i< m_i < 2^w\), \(\langle D_i \rangle _M < M\) and \(\langle \alpha D\rangle _M \ge 0\),

$$\begin{aligned} Z = \sum _{i=1}^{N} \gamma _i \langle D_i\rangle _M - \langle \alpha D \rangle _M < N2^wM. \end{aligned}$$
(20)

Thus, take operands \(A < N2^wM\) and \(B < N2^wM\) such that \(X = A \times B < N^2 2^{2w} M^2\).

According to Eq. (19), we must ensure that \(X\) does not overflow \((1-\Delta )D\). If it is assumed \(M\) can be represented in \(h\) channels so that \(M < 2^{wh}\), then

$$\begin{aligned} X < N^2 2^{2wh+2w}. \end{aligned}$$

Since the moduli are drawn from the interval (15), the dynamic range satisfies

$$\begin{aligned} D > 2^{wN-1}, \end{aligned}$$

so the requirement \(X < (1-\Delta )D\) will be satisfied if

$$\begin{aligned} N^2 2^{2wh+2w} < (1-\Delta ) 2^{wN-1}. \end{aligned}$$

This will hold whenever

$$\begin{aligned} N > 2h + 2 + \frac{1+2\log _2 \frac{N}{1-\Delta }}{w}. \end{aligned}$$

For example, for \(w \ge 32\), \(N < 128\) and \(\Delta = \frac{1}{2}\), it is sufficient to choose \(N \ge 2h+7\). Note that this bound is conservative, and fewer channels may suffice for a particular RNS, because the bound on \(Z\) can be computed directly as

$$\begin{aligned} Z = \sum _{i=1}^{N} \gamma _i \langle D_i\rangle _M - \langle \alpha D \rangle _M \le \sum _{i=1}^{N} (m_i - 1) \langle D_i\rangle _M \end{aligned}$$

using the pre-computed RNS constants \(m_i\) and \(\langle D_i \rangle _M\) instead of the worst-case bounds \(N\) and \(M\) as in (20).
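For instance, this direct bound is easily evaluated from the constants (a sketch under our naming):

```python
# Direct bound on Z from the pre-computed constants (end of Sect. 4.4),
# instead of the worst case N * 2^w * M of Eq. (20).
from math import prod

def z_bound(moduli, M):
    D = prod(moduli)
    return sum((m - 1) * ((D // m) % M) for m in moduli)

# The channel count N is adequate when operands bounded by this value
# keep X = A * B below (1 - Delta) * D.
```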

4.5 The New RNS Modular Multiplication Algorithm

4.5.1 Another Approximation

The computation of \(\alpha \) in Eq. (16) can be optimised by representing each \(\gamma _i\) by its most significant \(q\) bits, where \(q<w\). The approximated \(\gamma _i\) can be written as

$$\begin{aligned} \hat{\gamma _i} = 2^{w-q} \left\lfloor \frac{\gamma _i}{2^{w-q}} \right\rfloor . \end{aligned}$$
(21)

The error incurred by this numerator’s approximation is denoted as

$$\begin{aligned} \delta _i = \frac{\gamma _i - \hat{\gamma _i}}{m_i}. \end{aligned}$$

Then

$$\begin{aligned} \hat{\gamma _i} = \gamma _i - \delta _i m_i. \end{aligned}$$

The largest possible error will be

$$\begin{aligned} \delta = \frac{2^{w-q} - 1}{m_1}. \end{aligned}$$

Note that this approximation, treated as a necessary part of the computation of \(\alpha \) in [24], is not in fact essential. The preceding discussion has shown that the algorithm works without it, although it does simplify the computations in hardware.

Replacing the \(\gamma _i\) in Eq. (16) by \(\hat{\gamma _i}\) yields

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\hat{\gamma _i}}{2^w} + \Delta \right\rfloor . \end{aligned}$$
(22)

Then, Eq. (12) becomes

$$\begin{aligned} \sum _{i=1}^N \frac{\hat{\gamma _i}}{2^w}= & {} \sum _{i=1}^N \frac{(\gamma _i - \delta _i m_i)(1 - \epsilon _i)}{m_i} \nonumber \\= & {} \sum _{i=1}^N \frac{\gamma _i(1 - \epsilon _i)}{m_i} - \sum _{i=1}^N (1 - \epsilon _i) \delta _i \nonumber \\\ge & {} (1 - \epsilon ) \sum _{i=1}^N \frac{\gamma _i}{m_i} - N \delta \nonumber \\ \sum _{i=1}^N \frac{\hat{\gamma _i}}{2^w}> & {} \sum _{i=1}^N \frac{\gamma _i}{m_i} - N (\epsilon + \delta ). \end{aligned}$$
(23)

This is because

$$\begin{aligned} 0<&1 - \epsilon _i = \frac{m_i}{2^w}< 1 \\ \Rightarrow 0<&\sum _{i=1}^N (1 - \epsilon _i) < N. \end{aligned}$$

Note that the only difference between Eqs. (13) and (23) is that the \(\epsilon \) in the former is replaced by the \(\epsilon + \delta \) in the latter. Following a similar development to Sect. 4.3, Eq. (17) becomes

$$\begin{aligned} \alpha + \frac{X}{D} - N (\epsilon + \delta ) + \Delta< \sum _{i=1}^N \frac{\hat{\gamma _i}}{2^w} + \Delta < \alpha + \frac{X}{D} + \Delta . \end{aligned}$$
(24)

The two prerequisites in (19) are now

$$\begin{aligned} {\left\{ \begin{array}{ll} N (\epsilon + \delta ) \le \Delta< 1 \\ 0 \le X < (1 - \Delta )D \end{array}\right. } \end{aligned}$$
(25)

This will again guarantee

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\hat{\gamma _i}}{2^w} + \Delta \right\rfloor = \alpha . \end{aligned}$$

Substituting (21) in Eq. (22) yields

$$\begin{aligned} \hat{\alpha } = \left\lfloor \sum _{i=1}^N \frac{\left\lfloor \frac{\gamma _i}{2^{w-q}} \right\rfloor }{2^q} + \Delta \right\rfloor . \end{aligned}$$
(26)

This is the final equation used in the new algorithm to estimate \(\alpha \).
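In integer arithmetic, Eq. (26) reduces to shifts and one short addition. The sketch below (our naming; \(\Delta \) is taken as a rational num/den so the floor is computed exactly) illustrates this:

```python
# Estimate alpha from the top q bits of each gamma_i, Eq. (26).
# With Delta = num/den: floor(s/2^q + Delta) = (s*den + num*2^q) // (den*2^q).
def alpha_hat(gammas, w, q, num=3, den=4):
    s = sum(g >> (w - q) for g in gammas)        # q-bit truncations
    return (s * den + num * (1 << q)) // (den << q)

# Example parameters of the design in Sect. 4.5.2: w = 14, q = 8, Delta = 3/4.
```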

4.5.2 RNS Modular Multiplication Algorithm and Design Example

The new sum of residues modular multiplication algorithm in RNS is shown in Algorithm 2. It computes \(Z \equiv A \times B \mod M\) using Eq. (6). Note that from Eqs. (9) and (11), \(\alpha < N\). Thus, \(\langle - \alpha D \rangle _M\) can be pre-computed in RNS for \(\alpha = 0 \dots N - 1\).

figure b
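The following end-to-end sketch mirrors Algorithm 2 (our Python rendering with toy parameters chosen to satisfy the prerequisites (25); the design itself uses \(w=14\), \(N=40\), \(q=8\), \(\Delta =3/4\)). The result is verified by CRT reconstruction:

```python
# End-to-end sketch of Algorithm 2: sum-of-residues RNS modular
# multiplication. Toy parameters satisfy N*(eps + delta) <= Delta < 1
# and X < (1 - Delta)*D from (25), so alpha is estimated exactly.
from math import prod

def rns_modmul(a, b, moduli, M, w, q, num, den):
    D = prod(moduli)
    D_i_inv = [pow(D // m, -1, m) for m in moduli]
    Di_mod_M = [(D // m) % M for m in moduli]

    x = [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]            # Step 1
    gam = [(xi * inv) % m for xi, inv, m in zip(x, D_i_inv, moduli)]  # Step 2
    s = sum(g >> (w - q) for g in gam)                                # Step 3
    alpha = (s * den + num * (1 << q)) // (den << q)
    corr = (-alpha * D) % M                                  # <-alpha*D>_M
    z = []                                                   # Steps 4-6
    for mj in moduli:
        acc = sum(g * (dm % mj) for g, dm in zip(gam, Di_mod_M))
        z.append((acc + corr) % mj)
    return z

moduli = [25, 27, 29, 31]       # pairwise co-prime, all close to 2^5
w, q, M = 5, 5, 23              # q = w disables the gamma truncation
A, B = 150, 300                 # A*B < (1 - Delta)*D with Delta = 9/10
a = [A % m for m in moduli]
b = [B % m for m in moduli]
z = rns_modmul(a, b, moduli, M, w, q, num=9, den=10)

D = prod(moduli)                # verify: CRT-reconstruct Z, check mod M
Z = sum((D // m) * ((pow(D // m, -1, m) * zi) % m)
        for zi, m in zip(z, moduli)) % D
assert Z % M == (A * B) % M     # Z is congruent, though not fully reduced
```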

The flow chart of the proposed algorithm is shown in Fig. 1. The thick lines represent RNS values, whereas the thin lines represent short wordlength binary values.

Fig. 1
figure 1

RNS modular multiplication flow chart

The proposed Algorithm 2 can be further explained with the help of an example with the following inputs and pre-computed values:

  • \(moduli=[16183, 16187, \dots , 16383]\)

  • \(M = 2^{256} - 2^{32} - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1\) (NIST standard for Koblitz curve)

  • \(N = 40, w=14, \Delta =0.75, q=8\) (from Tables 1 and 3)

  • \(D_i^{-1}=[1027, 13322, ..., 698]\)

  • \(\langle D_i\rangle _M = \{[3064, 11630, \dots , 14819], [2396, 10967, \dots , 6494], \dots , [10399, 1229, \dots , 678]\}\)

  • \(A_i = [11169, 1811, \dots , 15]\)

  • \(B_i = [6273, 5504, \dots , 9258]\)

The complete moduli set used in this example is given in Table 4. The formulae for \(D_i^{-1}\) and \(\langle D_i\rangle _M\) are given in Sect. 2. The steps below show the computation of \(A\times B \bmod M\), where each step corresponds to a step of Algorithm 2.

  1.

    \(x_i = [\langle 11169\times 6273\rangle _{16183}, \langle 1811\times 5504\rangle _{16187}, ..., \langle 15\times 9258\rangle _{16383}]\) \( = [6930, 12739, ..., 7806]\)

  2.

    \(\gamma _i = [\langle 6930\times 1027\rangle _{16183}, \langle 12739\times 13322\rangle _{16187}, ..., \langle 7806\times 698\rangle _{16383}]\) \( = [12773, 4450, ..., 9432]\)

  3.

    \(\alpha = \lfloor \frac{(199+69+91+...+147)}{2^8}+0.75 \rfloor \) \( = 25\)

  4.

    \(Y_i = \{[\langle 12773\times 3064\rangle _{16183}, \langle 12773\times 11630\rangle _{16187}, ..., \langle 12773\times 14819\rangle _{16383}],\)

    \([\langle 4450\times 2396\rangle _{16183}, \langle 4450\times 10967\rangle _{16187}, ..., \langle 4450\times 6494\rangle _{16383}], ...,\)

    \([\langle 9432\times 10399\rangle _{16183}, \langle 9432\times 1229\rangle _{16187}, ..., \langle 9432\times 678\rangle _{16383}] \}\)

    \(Y_i = \{[5978, 1891, ..., 10288], [13786, 15532, ..., 15071], ..., [14388, 2036, ..., 5526] \}\)

  5.

    \(Sum = [\langle 5978+13786+...+14388\rangle ,\) \( \langle 1891+15532+...+2036\rangle , ...,\) \( \langle 10288+15071+...+5526\rangle ]\) \(Sum = [446438, 403741, ..., 373271]\)

  6.

    \(\langle -\alpha D\rangle _M = [13693, 3365, ..., 10031]\) \(Z = [\langle 446438+13693\rangle _{16183}, \langle 403741+3365\rangle _{16187}, ..., \langle 373271+10031\rangle _{16383}]\) \(Z = [7007, 2431, ..., 6493]\)

The values of A, B and Z are given below in binary number system for better understanding.

5 Implementation and Synthesis Results

This section describes the implementation and synthesis results of the 256-bit modular multiplier (MM) in RNS using the proposed algorithm.

5.1 Architecture of the RNS-Based MM

The architecture of the modular multiplier of Algorithm 2 is shown in Fig. 2. The pre-computed values required for the architecture are given in Table 2.

Fig. 2
figure 2

Highly parallel architecture of RNS MM

Table 2 Pre-computed values for Algorithm 1 and Algorithm 2

In Fig. 2, all of the computations are done in short wordlength (at most \(w\) bits, the RNS channel width) within the RNS. The architecture performs the following steps (the step numbers follow those in Algorithm 2).

  • In Step 1, the product \(X = A \times B\) is computed within the RNS. This RNS multiplication involves three short wordlength multiplications and one subtraction.

  • In Step 2, one RNS multiplication is performed to find the \(\gamma \). This corresponds to three multiplications followed by one subtraction in the architecture of Fig. 2.

  • Steps 3 and 4 are performed in parallel. RNS multiplications are used to compute the \(Y_i\)s in Step 4, while the \(\gamma _i\)s are used to generate \(\alpha \) in Step 3. Note that the divisions in Step 3 are accomplished by simple right shifts.

  • Step 5 and part of Step 6 are also performed simultaneously. The sum \(\sum Y_i\) is computed in Step 5 using counter-based Wallace tree reduction, while \(\langle - \alpha D \rangle _M\) is retrieved from memory in Step 6. Note that \(\sum Y_i\) involves simple additions without a modulo operation; the channel width of Sum therefore grows to \(w+7\) bits. This is not a problem because the result is still within the input bound of the Barrett algorithm and is reduced in the next step.

  • Finally in the other part of Step 6, \(Z\) is produced by adding \(\langle - \alpha D \rangle _M\) and the Sum.

Hence, this is a highly parallel structure with only 3 RNS multiplications, 1 Wallace reduction tree and 2 RNS additions in the critical path, and all of the computations are performed at the RNS channel width. In order to achieve a higher speed, the counter-based Wallace tree [4] is used, which has a shorter delay than the conventional Wallace tree.

5.2 Design Specifications of 256-bit RNS MM

The architecture of Fig. 2 is used to implement a 256-bit modular multiplier in VHDL. The 256-bit modular multiplier consists of 40 RNS channels, each 15 bits wide. Each modulus is 14 bits, and the one extra bit per channel is required by the approximation used in the Barrett algorithm, as mentioned in Sect. 3. The design parameters and the RNS moduli set of the architecture are given in Tables 3 and 4, respectively. The values of q and \(\Delta \) are chosen according to the criteria discussed in Sect. 4.

Table 3 Design parameters of 256-bit RNS MM
Table 4 RNS moduli set for w=14, N=40

5.3 Complexity Comparison

The complexity of the proposed architecture can be analysed from Algorithm 2 and Fig. 2 as follows:

  1.

    Step 1 of Algorithm 2 performs one RNS multiplication. This is implemented by N w-bit multipliers followed by Barrett reductions as shown in Fig. 2. One Barrett reduction requires 2 w-bit multipliers and 1 w-bit subtractor. Hence, Step 1 requires 3N w-bit multipliers and N w-bit subtractors.

  2.

    Step 2 also performs one RNS multiplication; therefore, the complexity of this step is the same as that of Step 1.

  3.

    Step 3 requires two division operations, which are implemented as simple right shifts because the divisors are powers of 2. The shifted values are then added together using a Wallace tree reduction of \(N+1\) rows (the \(N\) shifted \(\gamma _i\) values plus \(\Delta \)). The Wallace tree reduction block is followed by an adder to add the last two rows. Hence, this step requires one \((N+1)\)-row tree reduction block and one w-bit adder.

  4.

    Step 4 performs N RNS multiplications in parallel. Each RNS multiplication requires 3N w-bit multipliers and N w-bit subtractors. Hence, the overall complexity of this step is \(3N^2\) w-bit multipliers and \(N^2\) w-bit subtractors.

  5.

    Step 5 performs addition on the output of Step 4, i.e. N RNS values. This is done by adding channel 1 of \(Y_0\) to \(Y_{N-1}\), channel 2 of \(Y_0\) to \(Y_{N-1}\), and so on. Each addition is performed by using a Wallace tree reduction of N rows followed by one w-bit adder to add the last two rows. The complexity of this step is analysed as N Wallace tree reductions and N w-bit adders.

  6.

    Step 6 of Algorithm 2 performs one RNS addition. This is implemented by N w-bit adders followed by Barrett reductions. Thus, this step requires 2N w-bit multipliers, N w-bit adders and N w-bit subtractors.

Based on this analysis, the complexity of the proposed architecture is summarised in Table 5. Note that the adders and subtractors are assumed to be of equal complexity for simplicity of analysis.

Table 5 Complexity analysis of the proposed architecture

It can be seen from Table 5 that the critical path of the proposed architecture consists of only eleven \(w\)-bit multipliers and six \(w\)-bit adders. It is important to note that the numbers of multipliers and adders in the critical path are independent of the number of channels. This property allows the proposed architecture to be scaled with very little loss of speed.

In order to compare the proposed architecture with other RNS-based modular multipliers, we evaluated the complexity in terms of \(w\)-bit modular multiplications and modular additions by following the approach given in [38]. The step-by-step analysis of the complexity is described as follows:

  • Steps 1, 2: \(N\) modular multiplications are performed in parallel in these steps. Thus, a total of \(2N\) modular multipliers are required. However, the critical path consists of only two modular multiplications.

  • Step 3: Step 3 adds \(N+1\) values using the Wallace tree reduction, where each value is \(q+1\) bits long; \(q\) is usually a few bits less than \(w\), as discussed in Sect. 4.5. Since the process is very similar to that of a multiplication (with the exception of partial product generation), we evaluate the complexity of the Wallace tree in terms of multipliers. It is reasonable to say that a Wallace tree reduction of 9 columns (\(q+1\) bits) and 41 rows (\(N+1\)) has a complexity similar to that of two 15-bit (channel-width) multipliers. Hence, the complexity of Step 3 is estimated to be equivalent to \(\frac{2}{3}\) of a modular multiplication.

  • Step 4: in this step \(N^2\) modular multiplications are performed in parallel, which means the delay of this step is the same as that of one modular multiplier.

  • Step 5: Step 5 adds \(N\) \(w\)-bit values using the Wallace tree reduction. Based on the explanation above, we can estimate the complexity of this Wallace tree reduction as equivalent to three \(w\)-bit multipliers. Hence, the complexity of Step 5 is estimated to be equivalent to one modular multiplication.

  • Step 6: this step consists of one \(w\times N \times N\) ROM and \(N\) modular additions. One modular addition is estimated to be equivalent to \(\frac{3}{4}\) of a modular multiplication.

Table 6 shows the comparison of the complexity of the proposed architecture with existing state-of-the-art RNS-based modular multipliers.

Table 6 Number of \(w\)-bit modular multiplications in the considered RNS MM algorithms

It can be seen from Table 6 that the proposed design requires about one-half and one-third of the number of modular multiplications of the designs in [45] and [38], respectively.

The design in [18] implements a 512-bit modular multiplier; therefore, a detailed analysis is required in order to perform a fair comparison. The dynamic range of [18] is 1055 bits with \(N=33\) and \(w=32\). Putting this value of N in Table 6 gives \(2N^2+5N = 2343\) modular multiplications, each of 32 bits.

In order to perform a fair comparison, the proposed design needs to be scaled to the same dynamic range as [18]. This can be done by increasing the channel width and/or the number of channels N. We propose \(N=62\) and \(w=17\), which increases the dynamic range to 1054 bits with very little effect on the delay (note that the channel width is \(w+1\) rather than w, as explained in Sect. 3). The RNS moduli for the scaled-up design are given in Table 7.

Table 7 RNS moduli set for w=17, N=62

Thus, the proposed scaled-up design requires \(N^2+3N+2 = 4032\) modular multiplications, each of \(w+1=18\) bits. For simplicity, we can take one 18-bit modular multiplier as equivalent to \(\frac{18}{32} \approx 0.56\) of a 32-bit modular multiplier. Hence, the proposed design requires \(4032\times 0.56 \approx 2258\) 32-bit modular multiplications to compute one 512-bit modular multiplication. Based on this analysis, the complexity of the proposed design is 3.6 % lower than that of [18].
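The arithmetic of this comparison can be summarised as follows (a simple check using the counts quoted above):

```python
# Complexity comparison with [18] using the counts from Table 6.
N_ref = 33                                  # [18]: w = 32, 1055-bit range
ref = 2 * N_ref**2 + 5 * N_ref              # 2343 32-bit modular mults

N = 62                                      # scaled-up proposed design
ours_18bit = N**2 + 3 * N + 2               # 4032 18-bit modular mults
ours_32bit = ours_18bit * 0.56              # 18/32 ~ 0.56 scaling factor
print(ref, round(ours_32bit))               # 2343 vs 2258: ~3.6% lower
```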

5.4 Synthesis Results

VHDL code is developed for both pipelined and non-pipelined versions of the proposed 256-bit modular multiplier. The designs are simulated extensively using ModelSim SE and synthesised in Xilinx ISE 14.4 for a Virtex-6 XC6VLX75T-3-FF784 FPGA with an optimisation goal of ‘Speed’ and optimisation effort of ‘Normal’. The designs are also synthesised in Synopsys Design Compiler using 90-nm CMOS technology in order to compare with recent ASIC architectures. They are compiled using the SAED90nm_typ library at the typical process corner, with a 1.2 V supply and a temperature of 25 \(^\circ \)C, using a medium compile effort.

A large number of modular multipliers have been presented in the literature for high-speed operation, but a straightforward comparison is not always possible due to differences in implementation technologies. The authors of [20] and [3] present results only for elliptic curve point multiplication and do not provide any results for the modular multiplication itself. Similarly, the work in [11] focuses on 128-bit pairing accelerators in RNS and analyses different types of pairing architectures. The unavailability of modular multiplication results excludes these papers from the comparison. Yao et al. [47] perform a detailed analysis of RNS parameter selection and its effects on modular multiplication, and demonstrate the advantage of their method by reporting the clock cycles for one modular multiplication; however, the paper lacks actual results and implementation details. Similarly, some recent modular multipliers [29, 37, 38, 48] are designed for 1024-bit operands and cannot be used for the comparison.

In order to analyse the benefits of the proposed modular multiplier, it is compared with state-of-the-art modular multiplier implementations on ASIC and FPGA. Table 8 compares the synthesis results of the proposed and reference modular multipliers. The clock cycles represent the total cycles required to perform one modular multiplication, whereas the cycle time represents the minimum clock period. The results in Table 8 are divided into two sections for ASIC- and FPGA-based implementations.

Table 8 Performance comparison of the 256-bit modular multipliers

The results in Table 8 show that the proposed modular multiplier is faster than the existing high-speed modular multipliers. Note that the average time for one modular multiplication of the proposed architectures is the same as their minimum cycle time, because the proposed architectures require only one iteration to compute one modular multiplication. The latency of the proposed pipelined architecture is 25.0\(\times \)3 = 75 ns and 14.2\(\times \)3 = 42.6 ns for the ASIC and FPGA implementations, respectively, due to the additional cycles required to fill the pipeline registers. The latency of the non-pipelined architecture is the same as its average time for one modular multiplication.

The comparison of the proposed architecture with other FPGA implementations is straightforward because they are implemented on the same FPGA device. The proposed MM outperforms the existing MM architectures on FPGA in terms of speed and clock cycles. The closest existing FPGA implementation is [2], which is 0.03 \(\upmu \)s slower than the proposed architecture. The proposed architecture is 37, 93 and 94 % faster than [2], [22] and [21], respectively.

The ASIC comparison includes two designs, of which [18] is implemented on a more advanced 45-nm CMOS technology library, whereas the design in [33] uses the same 90-nm technology as this work. The proposed architecture outperforms both in terms of throughput, clock cycles and average time for one modular multiplication. The cycle time of [33] is smaller than that of the proposed design; however, the proposed design can be modified to operate at a much higher frequency by increasing the number of pipeline stages. The main advantage of the proposed architecture is its low clock cycle count, which enables a higher throughput in spite of its slower clock frequency.

The design in [18] performs 512-bit modular multiplication, and the proposed design can be scaled up for 512-bit operation as explained in Sect. 5.3, which enables a fair analysis of the delay. The increase in the delay of the scaled design can be computed accurately by considering the scaling of the critical path given in Table 5. It is evident from Table 5 that the number of channels has very little impact on the critical path of the architecture, which means that the delay of a larger modular multiplier will be approximately the same as that of the 256-bit modular multiplier. The impact of an increased N and w on the critical path is explained as follows:

  • w-bit multiplier: the critical path consists of eleven w-bit multipliers. The channel width increases from 15 to 18 bits in the scaled design, which results in a minor increase in delay due to the interconnections and a slightly larger final adder in the multiplier. For more precise results, we synthesised the 15-bit and 18-bit multipliers separately using the same device and synthesis parameters as for the original design.

  • w-bit adder/subtractor: the critical path of the proposed modular multiplier contains 6 w-bit adders/subtractors. The effect of a larger channel width on this part of the critical path is even smaller than for the multipliers. The delay increase of this component is likewise analysed by synthesising 15-bit and 18-bit adders separately.

  • Tree reduction of N rows: the critical path includes one tree reduction to add N values of w bits. This component requires five reduction stages for N = 40 as well as for N = 62; therefore, the only increase in delay is due to the additional interconnection wires.

Table 9 shows the synthesised delay of the different components of the critical path for the 256-bit and 512-bit versions of the proposed architecture. Equation (27) is used to calculate the total delay.

$$\begin{aligned} Delay_\mathrm{MM} = 11 \times Delay_\mathrm{mult} + 6 \times Delay_\mathrm{add} + Delay_\mathrm{tree} \end{aligned}$$
(27)
Table 9 Synthesis results for the critical path delay of the proposed architecture

The total delay of the 256-bit architecture calculated in Table 9 is slightly less than the actual result given in Table 8. The reason is the interconnection delay, which is not modelled in the calculated result; it can, however, be estimated by calculating the percentage of interconnection delay in the 256-bit architecture and adding the same percentage to the calculated result for the 512-bit architecture. The synthesised delay of the 256-bit architecture is 72.7 ns, of which the unmodelled interconnection delay accounts for about 22 % relative to the calculated delay of 56.69 ns. Based on this percentage, the total delay of the 512-bit architecture is estimated to be \(68.09 + (0.22\times 68.09) = 83.07\) ns. Hence, we can claim that the proposed architecture is 12 % faster than [18]. It should also be noted that the design of [18] is implemented on a much more advanced technology; our design is therefore expected to show a significant further increase in performance if implemented on the same 45-nm CMOS technology. Furthermore, the use of pipeline stages can greatly improve the throughput of the proposed architecture. The advantage of the proposed architecture over [18] in terms of complexity has already been established in Sect. 5.3.
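The interconnect-corrected estimate can be reproduced from the quoted figures (a simple arithmetic check):

```python
# Reproduce the interconnect-corrected delay estimate of Sect. 5.4.
calc_256, synth_256 = 56.69, 72.7            # ns, from Tables 8 and 9
interconnect = (synth_256 - calc_256) / synth_256   # ~0.22 of total delay

calc_512 = 68.09                             # ns, Eq. (27) from Table 9
print(calc_512 * (1 + interconnect))         # ~83.07 ns for 512-bit MM
```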

6 Conclusion and Future Work

A highly parallel and scalable architecture has been described to perform modular multiplication in the RNS using a sum of residues. The algorithm performs the modular multiplication completely within the RNS channels without any need for conversion to a positional number system.

A 256-bit modular multiplier is implemented in a 40-channel RNS where each channel is 15 bits wide. Pre-computed values are stored in look-up tables to speed up the operations. The pipelined and non-pipelined versions of the architecture are implemented in VHDL and synthesised in Xilinx ISE for a Virtex-6 FPGA, as well as on ASIC using 90-nm CMOS technology in Synopsys Design Compiler. The comparison shows that the delay and complexity of the proposed modular multiplier are 12 and 3.6 % lower, respectively, than those of the state-of-the-art RNS modular multiplier.

As future work, we plan to implement a 2048-bit modular multiplier based on the proposed algorithm. The proposed 256-bit and 2048-bit modular multipliers can then be used to implement RNS-based ECC and RSA cryptosystems, respectively.