1 Introduction

A number of applications require performing long multiplications in performance-restricted environments. Indeed, low-end devices such as the 68hc05 or the 80c51 microprocessors have very limited instruction sets, very limited memory, and operations such as multiplication are rather slow: a mul instruction typically takes 10 to 20 cycles.

General multiplication has been studied extensively, and there exist algorithms with very good asymptotic complexity such as the Schönhage-Strassen algorithm [18] which runs in time O(n log n log log n) or the more recent Fürer algorithm [13], some variants of which achieve the slightly better \(O(2^{3\log ^{*}\!n}n\log n)\) complexity [14]. Such algorithms are interesting when dealing with extremely large integers, where these asymptotics prove faster than more naive approaches.

In many cryptographic contexts however, multiplication is performed between a variable and a pre-determined constant:

  • During Diffie-Hellman key exchange [9] or ElGamal encryption [10], a constant g must be repeatedly multiplied by itself to compute \(g^{x} \bmod p\).

  • The essential computational effort of a Fiat-Shamir prover [11, 12] is the multiplication of a subset of fixed keys (denoted \(s_{i}\) in [11]).

  • A number of modular reduction algorithms use, as a building block, multiplications (in ℕ) by a constant depending on the modulus. This is for instance the case of Barrett’s algorithm [2] or Montgomery’s algorithm [17].

The main strategy to exploit the fact that one operand is constant consists in finding a decomposition of the multiplication into simpler, hardware-friendly operations (additions, subtractions, bitshifts) [3]. The problem of finding the decomposition with the least number of operations is known as the “single constant multiplication” (SCM) problem. SCM is NP-complete, as shown in [4], though fairly good approaches exist [1, 7, 8, 20] for small numbers. For larger numbers, performance is unsatisfactory unless the constant operand has a predetermined format allowing for ad hoc simplifications.

In this paper, we propose a completely different approach: the constant operand is encoded in a computation-friendly way, which makes multiplication faster. This encoding is based on linear relationships detected amongst the constant’s digits (or, more generally, subwords), and can be performed offline in a reasonable time for 1024-bit numbers and 8-bit microprocessors. We use a graph-based backtracking algorithm [16] to discover these linear relationships, using recursion to keep the encoder as short and simple as possible.

2 Multiplication algorithms

We now provide a short overview of popular multiplication methods. This summary will serve as a baseline to evaluate the new algorithm’s performance.

Multiplication algorithms usually fall into two broad categories: general divide-and-conquer algorithms such as Toom-Cook [6, 19] and Karatsuba [15]; and the generation of integer multiplications by compilers, where one of the arguments is statically known. We are interested in the case where small-scale optimizations such as Bernstein’s [3] are impractical, but general purpose multiplication algorithms à la Toom-Cook are not yet interesting.

Throughout the paper we will assume unsigned integers, and denote by w the word size (typically, w = 8) and by \(a_{i}\), \(b_{i}\) and \(r_{i}\) the base-\(2^{w}\) digits of a, b and r respectively:

$$a = \sum\limits_{i=0}^{n-1} 2^{wi} a_{i}, \quad b = \sum\limits_{i=0}^{n-1} 2^{wi} b_{i}, \quad \text{and} \quad r = a \times b = \sum\limits_{i=0}^{2n-1} 2^{wi} r_{i}. $$

2.1 Textbook multiplication

A direct way to implement long multiplication consists in extending textbook multiplication to several words. This is often done by using a mad routine.

A mad routine takes as input four w-bit words {x, y, c, ρ}, and returns the two w-bit words c′, ρ′ such that \(2^{w} c' + \rho' = x \times y + c + \rho\). We write

$$\{c^{\prime},\rho^{\prime}\} \gets \textsc{mad}(x,y,c,\rho). $$

If such a routine is available, then multiplication can be performed in \(n^{2}\) mad calls using Algorithm 1. The MIRACL big number library [5] provides this functionality.

figure a

This approach is unsatisfactory: it often performs more computation than needed. Assuming a constant-time mad instruction, Algorithm 1 runs in time \(O(n^{2})\).
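The mad-based schoolbook scheme can be sketched in C as follows. This is a hedged illustration with hypothetical names (`mad`, `mul_schoolbook`), not the paper’s implementation; we take w = 8 so that the double-word product fits in 16 bits:

```c
#include <stdint.h>

/* Hypothetical mad: 2^8 * c' + rho' = x*y + c + rho.
   The maximum value 255*255 + 255 + 255 = 65535 always fits in 16 bits. */
static void mad(uint8_t x, uint8_t y, uint8_t c, uint8_t rho,
                uint8_t *c_out, uint8_t *rho_out)
{
    uint16_t t = (uint16_t)x * y + c + rho;
    *c_out   = (uint8_t)(t >> 8);
    *rho_out = (uint8_t)(t & 0xFF);
}

/* Schoolbook multiplication of two n-word little-endian numbers
   into a 2n-word result: exactly n^2 mad calls, as in Algorithm 1. */
static void mul_schoolbook(const uint8_t *a, const uint8_t *b,
                           uint8_t *r, int n)
{
    for (int i = 0; i < 2 * n; i++) r[i] = 0;
    for (int i = 0; i < n; i++) {
        uint8_t c = 0;
        for (int j = 0; j < n; j++)
            mad(a[j], b[i], c, r[i + j], &c, &r[i + j]);
        r[i + n] = c;   /* the word above the window is still zero here */
    }
}
```

Each digit of b triggers one pass of n mad calls; the final carry lands one word above the current window, which has not yet been written at that point.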

2.2 Karatsuba’s algorithm

Karatsuba [15] proposed an ingenious divide-and-conquer multiplication algorithm, where the operands a and b are split as follows:

$$r=a\times b=(2^{L}\bar{a}+\underline{a})\times (2^{L}\bar{b}+\underline{b}), $$

where typically L = nw/2. Instead of computing a multiplication between long integers, Karatsuba performs multiplications between shorter integers, and (virtually costless) multiplications by powers of 2. Karatsuba’s algorithm is described in Algorithm 2.

figure b

This approach is much faster than naive multiplication – on which it still relies for multiplication between short integers – and runs in \({\Theta } (n^{\log _{2} 3})\).
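A single level of the recursion can be illustrated in C on 32-bit operands split into 16-bit halves. This is only a sketch (`karatsuba32` is a hypothetical name); a real implementation recurses on multi-word integers:

```c
#include <stdint.h>

/* One Karatsuba level: r = a*b with three 16x16 multiplications
   instead of four, using the cross term m - u - v. */
static uint64_t karatsuba32(uint32_t a, uint32_t b)
{
    uint32_t ah = a >> 16, al = a & 0xFFFF;   /* a = 2^16*ah + al */
    uint32_t bh = b >> 16, bl = b & 0xFFFF;   /* b = 2^16*bh + bl */

    uint64_t u = (uint64_t)ah * bh;               /* high * high  */
    uint64_t v = (uint64_t)al * bl;               /* low  * low   */
    uint64_t m = (uint64_t)(ah + al) * (bh + bl); /* mixed term   */

    /* (2^16 ah + al)(2^16 bh + bl) = 2^32 u + 2^16 (m - u - v) + v */
    return (u << 32) + ((m - u - v) << 16) + v;
}
```

The cross product ah·bl + al·bh is recovered as m − u − v, which is the source of the saving: three short multiplications replace four.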

2.3 Bernstein’s multiplication algorithm

When one of the operands is constant, different ways to optimize multiplication exist. Bernstein [3] provides a branch-and-bound algorithm based on a cost function.

The minimal cost, and an associated sequence of operations, are found by exploring a tree, possibly using memoization to avoid redundant searches. More elaborate pruning heuristics exist to further speed up the search. The minimal-cost path yields a list of operations that computes the result of the multiplication.

Because of its exponential complexity, Bernstein’s algorithm is quickly overwhelmed when dealing with large integers. It is however often implemented by compilers for short (32- to 64-bit) constants.
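To give a flavour of what such a search returns, here is a shift-and-add sequence for the constant 105 = (2⁴ − 1)(2³ − 1). This is an illustrative sketch of the kind of output produced, not Bernstein’s actual result for this constant; `mul105` is a hypothetical name:

```c
#include <stdint.h>

/* Multiplication by the constant 105 using only two shifts and
   two subtractions: 105 = 15 * 7 = (2^4 - 1)(2^3 - 1). */
static uint32_t mul105(uint32_t x)
{
    uint32_t t = (x << 4) - x;   /* t = 15x */
    return (t << 3) - t;         /* 7t = 105x */
}
```

A compiler applying this technique emits such a sequence in place of a mul instruction whenever the shift-add cost beats the hardware multiplier.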

3 The proposed algorithm

3.1 Intuitive idea

The proposed algorithm works with an alternative representation of the constant operand a. Namely, we wish to express some a i as a linear combination of other a j s with small coefficients. It is then easy to reconstruct the whole multiplication b × a from the values of the b × a j only.

The more linear combinations we can find, the fewer multiplications we need to perform. Our algorithm therefore tries to find the longest sequence of linear relationships between the digits of a. We call this sequence’s length the coverage of a.

Yet another performance parameter is the number of registers used by the multiplier. Ideally, only two registers holding intermediate values should be in use at any point in time. This is not always possible, and depends on the digits of a.

As an example, consider the set of relations of Table 1, where all words are ultimately expressed in terms of the values of a 3 and a 7. In Table 1, we select a subset of words A ⊆ {a 0, … , a n−1} and build a sparse table U whose entries U i, j ∈ {−1, 0, 1, 2, =} encode linear relationships between individual words. During multiplication, U describes how the different a i can be derived from each other.

Table 1 An example showing how linear relationships between individual words are encoded and interpreted. All the a i are computed from a 3 and a 7 only

Hence it suffices to compute b × a 3 and b × a 7 to infer all other b × a i by long integer additions. Note that the algorithm only needs to allocate three (n + 1)-word registers reg1, reg2 and reg3 to store intermediate results.
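To make the reconstruction step concrete, suppose (hypothetically; these are not the actual relations of Table 1) that a 0 = a 3 + a 7 and a 1 = 2 a 3. Then only two real multiplications are needed. The sketch below uses 32-bit variables for readability, whereas the actual algorithm works on (n + 1)-word registers:

```c
#include <stdint.h>

/* Hypothetical digit relations a0 = a3 + a7 and a1 = 2*a3:
   once b*a3 and b*a7 are known, the remaining products follow
   by one addition and one doubling -- no further multiplications. */
static void derive(uint32_t b, uint32_t a3, uint32_t a7,
                   uint32_t *b_a0, uint32_t *b_a1)
{
    uint32_t p3 = b * a3;   /* the only two real multiplications */
    uint32_t p7 = b * a7;
    *b_a0 = p3 + p7;        /* b * (a3 + a7) = b * a0 */
    *b_a1 = p3 << 1;        /* b * (2 * a3)  = b * a1 */
}
```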

The values allowed in U can easily be extended to include more complex relationships (larger coefficients, more variables, etc.), but this immediately impacts the algorithm’s performance: the corresponding search graph has many more branches at each node.

Operations can be performed without overflowing (i.e., so that results fit in a word), or modulo \(2^{w}\), where w is the word size. In the latter case, it is necessary to subtract \(2^{w} \times b\) from the result to obtain the correct value, which incurs some additional cost.
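A minimal sketch of this correction, assuming w = 8 and a relation of the form a k = (a i + a j) mod 256 (`derive_mod` is a hypothetical helper name):

```c
#include <stdint.h>

/* If a digit relation holds only modulo 2^w (here w = 8), the derived
   product must be corrected by subtracting 2^w * b, because
   (ai + aj) mod 256 = ai + aj - 256 whenever the byte sum overflows. */
static uint32_t derive_mod(uint32_t b, uint8_t ai, uint8_t aj)
{
    uint32_t p = b * ai + b * aj;    /* b * (ai + aj), exact */
    if ((uint32_t)ai + aj >= 256)    /* relation only held mod 256 */
        p -= b << 8;                 /* subtract 2^8 * b */
    return p;
}
```

When the byte sum does not overflow, the relation already holds over the integers and no correction is applied.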

As a concrete example, if the operations in Table 1 are understood modulo 256 (and the a i are bytes), it provides the decomposition of the following 80-bit number:

$$a = \texttt{0x8d69f6\underline{2a}9249ec\underline{1f}8d24} $$

where we have underlined the values of a 3 and a 7.

3.2 Backtracking algorithm

figure c
figure d
figure e
figure f

Linear combinations amongst words of a are found by backtracking [16], the pseudocode of which is given in Algorithm 5. Our implementation focuses on linear dependencies amongst 8-bit words, as the main target platform for the proposed multiplication algorithm is precisely an 8-bit microprocessor.

We take advantage of recursion and macro expansion (see Algorithms 3 to 5) to achieve more compact code (see Appendix). In this implementation, p encodes the current depth’s three registers of Table 1 as well as the current operation. Suitably instrumented, Algorithm 6 outputs a set of related values, along with the corresponding relation. The dependencies that we take into account in our C code (provided in the full version or upon request) do not go beyond depth 2. Thus, the corresponding operations are \(\mathcal C = \{+, -, \times 2\}\). We also allow these operations to be performed modulo 256, to obtain more solutions. The alternative to this approach is to consider a greater depth, which naturally leads to more possibilities.

Our program takes as input an integer p that represents the percentage of a being covered (i.e., the coverage is p/100 times the length of a). In a typical lightweight scenario, a 128-byte number is involved in the multiplication process. Our software attempts to backtrack over a coverage-related number of values out of 256. It follows immediately that at most a 50% coverage would be required for performing such a multiplication (as byte collisions are likely to happen).

The program takes as parameter the list of bytes of a. If some bytes appear multiple times, it is not necessary to re-generate each of them individually: generation is performed once, and the value is cloned and dispatched where needed.

Note that if precomputation takes too long, the list of a i can be partitioned into several sub-lists, on which backtracking is run independently. This entails as many initial multiplications by the online multiplier, but still yields appreciable speed-ups.
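The following C fragment gives the general flavour of the relation search, though as a greedy fixed-point closure rather than the paper’s actual backtracking (Algorithm 5); the names and the digit count N are illustrative assumptions:

```c
#include <stdint.h>

#define N 8   /* illustrative number of digits */

/* covered[i] != 0 marks bytes already derivable from the roots.
   Each pass marks every uncovered byte reachable by one operation in
   C = {+, -, x2} (taken mod 256) from covered bytes, until a fixed
   point is reached. Returns the number of newly covered bytes. */
static int extend_coverage(const uint8_t a[N], int covered[N])
{
    int found = 0, progress = 1;
    while (progress) {
        progress = 0;
        for (int k = 0; k < N; k++) {
            if (covered[k]) continue;
            for (int i = 0; i < N && !covered[k]; i++) {
                if (!covered[i]) continue;
                if (a[k] == (uint8_t)(2 * a[i]))           /* x2   */
                    covered[k] = 1;
                for (int j = 0; j < N && !covered[k]; j++) {
                    if (!covered[j]) continue;
                    if (a[k] == (uint8_t)(a[i] + a[j]) ||  /* +    */
                        a[k] == (uint8_t)(a[i] - a[j]))    /* -    */
                        covered[k] = 1;
                }
            }
            if (covered[k]) { found++; progress = 1; }
        }
    }
    return found;
}
```

Unlike real backtracking, this greedy closure never reconsiders a choice; it merely shows how coverage grows outward from the root bytes.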

3.3 Multiplication algorithm

figure g

With the encoding of a generated by Algorithm 6, it is now possible to implement multiplication efficiently.

To that end we make use of a specific-purpose multiplication virtual machine (VM) described in Algorithm 7. The VM is provided with instructions of the form

$$\texttt{opcode t, i, j, p} $$

that are extracted offline from U. Here, opcode is the operation to perform, i and j are the indices of the operands, t is the index of the result, and p = w × t is the position in r at which to place the result, w being the word size. The value of p is pre-computed offline to allow for a more efficient implementation.

We store the result in a 2n-byte register initialized to zero. We also make use of a long addition procedure PlaceAt(p, i) which “places” the contents of the (n + 1)-byte register reg[i] at position p in r: it performs the addition of register reg[i] starting from offset p in r, propagating the carry as needed.
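A sketch of such a PlaceAt routine in C (hedged: the parameter layout is an assumption, and explicit lengths are passed instead of the fixed register file of Algorithm 7):

```c
#include <stdint.h>

/* PlaceAt sketch: add the reglen-byte register reg into the rlen-byte
   result r starting at byte offset p, propagating the carry upward. */
static void place_at(uint8_t *r, int rlen,
                     const uint8_t *reg, int reglen, int p)
{
    unsigned carry = 0;
    for (int i = 0; i < reglen; i++) {        /* add reg at offset p */
        unsigned t = r[p + i] + reg[i] + carry;
        r[p + i] = (uint8_t)t;
        carry = t >> 8;
    }
    for (int i = p + reglen; carry && i < rlen; i++) {
        unsigned t = r[i] + carry;            /* ripple remaining carry */
        r[i] = (uint8_t)t;
        carry = t >> 8;
    }
}
```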

Finally, we assume that the list R = {(i, v)} of root nodes (position and value) of U is provided.

After executing all the operations described in U, Algorithm 7 has computed r = a × b.

Remark 1

(Karatsuba multiplication) Using the notations of Algorithm 2 one can see that in settings where a is a constant, the numbers u, v, w all result from the multiplication of \(\bar {b},\underline {b}\) and \(\bar {b}+\underline {b}\) (which are variable) by \(\bar {a},\underline {a}\) and \(\bar {a}+\underline {a}\) (which are constant). Hence our approach can independently be combined with Karatsuba’s algorithm to yield further improvements.

4 Performance

The algorithm has an offline step (backtracking) and an online step (multiplication), which are implemented on different devices.

The offline step is naturally the longest; its performance depends heavily on the digit combination operations allowed and on how many numbers are being dealt with. More precisely, results are near-instant when dealing with 64 individual bytes and operations {+, −, ×2}. It takes much longer if operations modulo 256 are considered as well, but this gives a better coverage of a, and hence better online results. That being said, modulo-256 operations are slightly less efficient than operations over the integers (≃ 1.5× more costly), since they require an additional subtraction of b afterwards.

Table 2 provides comparative performance data for a multiplication by the processed constant \(\lfloor \pi \cdot 2^{1024} \rfloor\). Backtracking this constant took 85 days on an Altix UV1000 cluster.

Table 2 Performance on a 68hc05 clocked at 5 MHz

As a final remark, note that one can also reverse the idea and generate a key by which multiplication is easy. This can be done by progressively picking VM operations until an operand (key) with sufficient entropy is obtained. While this is not equivalent to randomly selecting keys, the authors conjecture that, in practice, the existence of linear relations between key bytes should not significantly weaken public-key implementations.