Keywords

1 Introduction

The field of lightweight cryptography is evolving rapidly in order to meet future needs, where interconnected, highly constrained hardware and software devices are expected to proliferate.

Over the last several years the international cryptographic community has made significant strides in the development of lightweight block ciphers. There are now more than a dozen to choose from, including Present [6], Clefia [20], Twine [21], Hight [14], Klein [12], LBlock [25], Piccolo [19], Prince [7], LED [13], Katan [9], TEA [23], and ITUbee [16]. Typically, a design will offer improved performance on some platform (e.g., ASIC, FPGA, microcontroller, microprocessor) relative to its predecessors. Unfortunately, most have good performance in hardware or software, but not both. This is likely to cause problems when communication is required across a network consisting of many disparate (hardware- and software-based) devices, and the highly-connected nature of developing technologies will likely lead to many applications of this type. Ideally, lightweight cryptography should be light on a variety of hardware and software platforms. Cryptography of this sort is relatively difficult to create and such general-purpose lightweight designs are rare.

Recently, the U.S. National Security Agency (NSA) introduced two general-purpose lightweight block cipher families, Simon and Speck [2], each offering high performance in both hardware and software. In hardware, Simon and Speck have among the smallest reported implementations of existing block ciphers with a flexible key.Footnote 1 Unlike most hardware-oriented lightweight block ciphers, Simon and Speck also have excellent software performance.

In this paper, we focus on the software performance of Simon and Speck. Specifically, we show how to create high performance implementations of Simon and Speck on 8-bit Atmel AVR microcontrollers. This paper should be useful for developers trying to implement Simon or Speck for their own applications and interesting for cryptographic designers wishing to see how certain design decisions affect performance.

2 The Simon and Speck Block Ciphers

Simon and Speck are block cipher families, each comprising ten block ciphers with differing block and key sizes to closely fit application requirements. Table 1 shows the available block and key sizes for Simon and Speck.

Table 1. Simon and Speck parameters.

The Simon (resp. Speck) block cipher with a \(2n\)-bit block and \(wn\)-bit key is denoted by Simon \(2n/wn\) (resp. Speck \(2n/wn\)). Together, the round functions of the two algorithms make use of the following operations on \(n\)-bit words:

  • bitwise XOR, \(\oplus \),

  • bitwise AND, &,

  • left circular shift, \(S^j\), by \(j\) bits,

  • right circular shift, \(S^{-j}\), by \(j\) bits, and

  • modular addition, \(+\).

For \(k\in \text {GF}(2)^{n}\), the key-dependent Simon \(2n\) round function is the two-stage Feistel map \(R_{k}:\text {GF}(2)^{n}\times \text {GF}(2)^{n}\rightarrow \text {GF}(2)^{n}\times \text {GF}(2)^{n}\) defined by

$$ R_k(x,y)=(y\oplus f(x)\oplus k,\,\, x), $$

where \( f(x) = (Sx\, \& \,S^8x)\,\oplus \, S^2x\) and \(k\) is the round key. The key-dependent Speck \(2n\) round function is the map \(R_{k}:\text {GF}(2)^{n}\times \text {GF}(2)^{n}\rightarrow \text {GF}(2)^{n}\times \text {GF}(2)^{n}\) defined by

$$ R_k(x,y)=((S^{-\alpha }x + y)\oplus k,\,\, S^\beta y\oplus (S^{-\alpha }x + y)\oplus k), $$

with rotation amounts \(\alpha =7\) and \(\beta =2\) if \(n=16\) (block size \(=32\)) and \(\alpha =8\) and \(\beta =3\), otherwise. A description of the key schedules, not necessary for this paper, can be found in the Simon and Speck specification paper [2].

3 AVR Implementations of Simon and Speck

We have implemented Simon and Speck on the Atmel ATmega128, a low-power 8-bit microcontroller with 128K bytes of programmable flash memory, 4K bytes of SRAM, and thirty-two 8-bit general purpose registers. To achieve optimal performance, we have coded Simon and Speck in assembly. The simplicity of the algorithms makes this easy to do, with the help of the Atmel AVR 8-bit instruction set manual [1].

Our assembly code was written and compiled using Atmel Studio 6.0. Cycle counts, code sizes and RAM usage were also determined using this tool. Our complete set of performance results is provided in Appendix A.

For Simon and Speck, three implementations were developed, none of which included the decryption algorithm.Footnote 2

  1. (1)

    RAM-minimizing implementations. These implementations avoid the use of RAM to store round keys by including the pre-expanded round keys in the flash program memory. No key schedule is included for updating this expanded key, making these implementations suitable for applications where the key is static.

  2. (2)

    High-throughput/low-energy implementations. These implemen tations include the key schedule and unroll enough copies of the round function in the encryption routine to achieve a throughput within about 3% of a fully-unrolled implementation. The key, stored in flash, is used to generate the round keys which are subsequently stored in RAM.

  3. (3)

    Flash-minimizing implementations. The key schedule is included here. Space limitations mean we can only provide an incomplete description of these implementations. However, it should be noted that the previous two types of implementations already have very modest code sizes.

In almost all cases, the registers hold the intermediate ciphertext, the currently used round key, and any data needed to carry out the computation of a round. For the algorithms with the 128-bit block and 192-bit or 256-bit key, there may not be enough registers to hold all of this information, and so it may be necessary to store one or two 64-bit words of round key in RAM. Additional modifications were necessary for the 128-bit Simon block ciphers.

We describe our various assembly implementations of Simon 64/128 and Speck 64/128 on an AVR 8-bit microcontroller using pseudo assembly code. When it is not obvious, we show how to translate the pseudocode into actual assembly instructions. Although our implementations were aimed at the ATmega128 microcontroller, the same techniques are applicable to other AVR microcontrollers, e.g., the ATtiny line (except when noted).

4 Simon AVR Implementations

We describe various implementations of Simon 64/128 on the ATmega128 microcontroller. Implementations of the other algorithms in the family are similar.

The ATmega128 has thirty-two 8-bit registers. For the purpose of exposition, we regard a sequence of four such registers as a 32-bit register, denoted by a capital letter such as \(X\), \(Y\), \(K\), etc. For instance, \(X = (X_3,X_2,X_1,X_0)\) identifies the 32-bit register \(X\) as a sequence of four 8-bit registers. We denote the contents of these 32-bit registers by lower case letters \(x\), \(y\), \(k\), etc.

At the start of the encryption process, we assume a 64-bit plaintext pair \((x,y)\) resides in RAM, and is immediately loaded into an \((X,Y)\) register pair for processing. After 44 encryption rounds, the resulting ciphertext is stored in RAM. The encryption cost is the number of cycles needed to transform the plaintext into the ciphertext, including any loading of plaintext/ciphertext into or out of RAM.Footnote 3 Our performance numbers do not count the cost of RAM used to hold the plaintext, ciphertext, or key but do include RAM used for temporary storage (e.g., stack memory) and RAM to hold the round keys.

Except for the small code size implementations of Simon, developers should avoid the use of any loops or branching within a round, since this can significantly degrade overall performance and does not greatly reduce code size.

4.1 A Minimal RAM Implementation of Simon

Here we assume the round keys have been pre-expanded (by an external device) and stored in flash. When a round key is required, it is loaded from flash directly into a register. Loading a byte from flash consumes three cycles.

We begin by describing the Rotate operation, which performs a left circular shift by one. Let \(X=(X_3,X_2,X_1,\) \(X_0)\) and let \(Z_0=0\). The 8-bit register \(Z_0\) should be cleared and reserved at the start of encryption. The 1-bit rotation of the 32-bit register \(X\) is carried out using AVR’s logical shift left (LSL), rotate left through carry (ROL), and add with carry (ADC) instructions, as follows.

figure a

In general, this standard approach to performing a rotation requires \(n+1\) cycles on an \(n\) byte word. Rotations by more than one bit can be achieved by repeated one-bit rotations, though this is not always the most efficient approach.

Note that the rotation by 8 is free, because it’s just a byte permutation.Footnote 4 The rotations by 1 and 2 are also inexpensive. Our pseudo instruction Move is shorthand for two AVR MOVW instructions, each of which copies two 8-bit words from one register to another in a single cycle. In order to use the MOVW instruction, the 8-bit registers must be properly aligned [1]. Our other mnemonics are also shorthand for readily apparent AVR instructions. Table 2 describes how to implement a round of Simon.

Table 2. Low-RAM software implementation of the Simon round function.

The basic code for a round executes in 42 cycles. For the minimal RAM implementation, this code is put in a loop which has a 3 cycle per round overhead. Since Simon 64/128 requires 44 rounds, an encryption will take about \(44\cdot (42+3)=1980\) cycles. There are ten cycles needed for setup, a subroutine call, and a return plus an additional 32 cycles to load the plaintext into registers (2 cycles/byte) and load the resulting ciphertext back into RAM (2 cycles/byte). Altogether, Simon 64/128 has an encryption cost of 253 cycles/byte.

In terms of code size, each instruction, with the exception of the Load instruction, takes twice as many bytes to store in flash as the number of cycles it takes to execute. The four byte Load instruction, in Table 2, consumes 8 bytes of flash. The expanded key is stored in \(44\cdot 4=176\) bytes of flash. So the total amount of flash used is around \(68 + 44\cdot 4=244\) bytes. More bytes are required for the counter, loading plaintext, storing ciphertext and other miscellaneous overhead so the actual value is \(290\) bytes. No RAM was used for this implementation.

4.2 A High-Throughput/Low-Energy Implementation of Simon

High-throughput implementations are useful when encrypting multiple blocks of data. The round keys for such an implementation can initially be stored in flash, as with our low-RAM implementations, or they can be generated using the key schedule. In either case, the encryption process begins by placing all of the round keys into RAM, where they are held until all of the blocks in a stream of data have been encrypted. For our implementations, we used the key schedule to generate the round keys, but for our timings we have not included the time to generate and store these since this setup time is assumed to be small compared to the time required to encrypt the data stream.

For Simon 64/128, we unrolled four rounds of the code and iterated this 11 times to carry out the 44 round encryption. Because the key is now loaded from RAM, the Load requires only 8 cycles instead of the 12 which were needed when loading from flash. Due to the larger code size, a four cycle overhead for each update of the loop counter (as opposed to three cycles) is required, together with an additional ten cycles as described earlier and 32 more cycles for loading plaintext in registers and storing the ciphertext in RAM. The amount of unrolling was calibrated to achieve throughput to within 3 % of the fastest (fully unrolled) implementation, avoiding large increases in code size with minor improvements in throughput. The cost of this implementation is 221 cycles/byte.

The code size for the encryption algorithm is about \(68\cdot 4=272\) bytes. Together with the key schedule and other overhead, this implementation of Simon 64/128 uses 436 bytes of flash. Because all of the round keys are placed in RAM, \(44\cdot 4=176\) bytes of RAM are also required.

4.3 A Minimal Flash Implementation of Simon

To reduce the already small size of Simon, we implemented several frequently used operations as subroutines. These included the XOR, the 1-bit Rotate and the Move instructions. Doing this, we saved a dozen bytes for Simon 64, and more than 50 bytes for Simon 128. For Simon 32, these techniques ended up not saving any space and degraded throughput, so we did not use them.

Simon 64/128, in particular, can be implemented using 240 bytes of flash and with \(4\cdot 44 = 176\) bytes of RAM to store the round keys. We note that for an additional 28 bytes of flash, decrytpion capability can be added: To decrypt, one uses the round keys in reverse order, loads the swapped ciphertext words into the encryption round function, and reads out the plaintext words with a similar swap. The swapped loading and output (together with the regular loading and output) can be done with 8 additional bytes of code each, and the round keys can be generated and stored in normal or reverse order with 12 additional bytes of code, for a total of \(8+8+12=28\) bytes.

5 Speck AVR Implementations

In this section, we describe the same types of implementations that we previously developed for the Simon 64/128 block cipher. We stress that most of the implementation techniques described here apply immediately to the other members of the Speck family.

Using the same notation as in Simon, we assume a 64-bit plaintext pair \((x,y)\) resides in RAM. This is immediately loaded into the 64-bit register pair \((X,Y)\). After 27 encryption rounds, the resulting ciphertext is loaded into RAM. The encryption cost is the number of cycles required to load the plaintext into registers, transform the plaintext into the ciphertext, and store the ciphertext in RAM.

For all of our implementations, we avoid the use of any loops or branching within a round. This has the effect of making our code slightly larger but significantly improves the overall performance.

For Simon, our main tool for trading off code size for throughput was to partially unroll loops. It turns out that for Speck, there is more opportunity to tune the implementation (for example using multiply instructions to accomplish the rotation by 3 or by allowing a round to end with plaintext and key words in the wrong places) to make it smaller or faster. Consequently, we will spend a little more time describing Speck implementations.

Table 3. Low-RAM software implementation of the Speck round function.

5.1 A Low-RAM Speck Implementation

Here we assume the round keys have been pre-expanded and stored in flash. When a round key \(k\) is required, it is loaded from flash directly into a register with a cost of 12 cycles. Table 3 describes how to implement the Speck round.

Note that the rotation by 8 is free, as we noted in the Simon discussion. The Rotate and Move instructions were described earlier for Simon. To be clear, we describe the Add operation in greater detail. If \(X=(X_3,X_2,X_1,X_0)\) and \(Y=(Y_3,Y_2,Y_1,Y_0)\), then addition, Add, is carried out using the following AVR instructions in the given order.

figure b

Our basic code for a round executes in 41 cycles, and this code is iterated 27 times in a loop to produce the ciphertext. The loop has an overhead of 3 cycles per round, and since Speck 64/128 requires 27 rounds, an encryption takes about \(27\cdot (41+3)=1188\) cycles. An additional \(10\) more cycles for overhead (\(3\) cycles for setup, \(3\) for a subroutine call, and \(4\) for a return) plus 32 cycles for loading plaintext from RAM into registers and storing ciphertext back into RAM, brings the total number of cycles for an encryption to 1230, which translates to \(1230/8\approx 154\) cycles/byte.

As we noted in the Simon discussion, each instruction, with the exception of the Load instruction, takes twice as many bytes to store in flash as the number of cycles it takes to execute. The Load instruction requires 8 bytes of flash.

The Speck round keys consume \(27\cdot 4=108\) bytes of flash. The total amount of flash required for the round and expanded key is \(66 + 108=174\) bytes. Additional bytes, required for the counter, loading and storing plaintext and ciphertext and other overhead, brings the total to \(218\) bytes. No RAM was required for this implementation.

5.2 A Faster Low-RAM Speck Implementation

If a higher-throughput/lower-energy implementation is required, we can easily modify the implementation that we just described to obtain one which encrypts at a rate of \(142\) cycles/byte using around \(342\) bytes of flash and no RAM.

This is done by iterating two rounds multiple times. Two rounds can be implemented without the use of the Move that appeared in Table 3, thereby saving two cycles per round. The two rounds are iterated in a loop 13 times (for a total of 26 rounds) and a final single round of code as described in Table 3 is used for the 27th round. For members of the family requiring an even number of rounds (like Speck 64/96) this extra code for the final round is not required. For Speck 64/128, the number of cycles for a complete encryption is around \(2\cdot 39\cdot 13 + 41 + 13\cdot 3 + 10 + 32=1136\), or about \(142\) cycles/byte. About \(342\) bytes of flash are required.

The pseudocode for the two rounds is shown in Table 4. There, \(x_1\) and \(y_1\) are the values of the input to the second round, stored in the \(K\) and \(Y\) registers and \(l\) is a new round key which is loaded into the \(X\) register. The output to the second round is again stored in the proper registers, i.e., the \(X\) and \(Y\) registers.

Table 4. A faster, low-RAM software implementation of two rounds of Speck.

5.3 A High-Throughput/Low-Energy Speck Implementation

As in the corresponding Simon implementations, the Speck encryption process begins by placing all of the round keys into RAM and holding them there until the data stream has been encrypted. Although we used the key schedule to generate the round keys, the round keys could also be pre-computed and stored in flash before loading them into RAM. For our timings, we have not included the time to generate the round keys and store them in RAM.

The easiest way to obtain a fast encryption algorithm is just to modify the fast, low-RAM implementation described previously using the code found in Table 4, but now loading the round keys from RAM instead of from flash. The cost of loading four bytes from RAM is 8 cycles, as opposed to 12 if we load from flash. So the total cost to encrypt a block of data is \(2\cdot 35\cdot 13 + 37 + 13\cdot 4 + 10 + 32=1041\) cycles, or about \(130\) cycles/byte. Not including the key schedule, the code will occupy about \(232\) bytes. Taking into account the key schedule, the code size is about 352 bytes and the implementation uses \(108\) bytes of RAM.

We could speed this up even more, at the expense of using more flash, by unrolling four rounds of code instead of just two. If we unroll all of the rounds we get a heavy implementation, in terms of flash, that encrypts a block in just \(123\) cycles/byte. If this sort of implementation is appealing but the flash usage is not, there is another way to get the same throughput using significantly less flash. However, the technique only works on those AVR microcontrollers, such as the ATmega128, which include the AVR unsigned multiplication instruction MUL. The ATtiny line does not have the MUL instruction.

We now describe how to do the 3-bit rotation operation Rotate in 14 cycles using 20 bytes of flash with the AVR MUL instruction. This should be compared to the in-place rotation used earlier, which required 15 cycles and 30 bytes.

The MUL instruction operates on two 8-bit registers containing unsigned numbers and produces the 16-bit unsigned product in 2 cycles. The result is always placed in the AVR register pair \((R_1,R_0)\), the low bits in \(R_0\) and the high bits in \(R_1\). Letting \(Y=(Y_3,Y_2,Y_1,Y_0)\) and \(X=(X_3,X_2,X_1,X_0)\), the Rotate operation, \(X \leftarrow S^3(Y)\), is implemented as follows:

figure c
figure d

Here, \(8\) represents an 8-bit register with the value 8 stored in it. This register should be initialized and reserved at the start of the encryption process. It should be noted that this rotate algorithm is not in-place. Additionally, for a \(w\)-byte word, the performance can be worse than for the standard approach: For \(w=2j\), the running time is \(7j\) cycles for this technique versus \(3\cdot (2j+1)=6j+3\) cycles for the standard approach. Thus, for the highest speed implementations, the technique is advantageous for Speck 48 and Speck 64. For Speck 96 it doesn’t increase throughput, but it can still significantly decrease the flash usage, and so we adopt it. For Speck 128, the method will actually decrease throughput. The code for a round of Speck is provided in Table 5.

Table 5. A high-throughput/low-power round implementation of Speck

The output of the first round is stored in the \((K,X)\) register pair. In order to get the output in the proper, \(X\) and \(Y\) registers, we need to move \(X\) into \(Y\) and \(K\) into \(X\). However, if we don’t want to incur additional cycles by doing this, then we can proceed in a manner similar to the fast, low-RAM implementation by iterating three consecutive rounds, all identical up to a relabeling of registers. The final output after the third round will be back in the \((X,Y)\) register pair. The code for doing this is shown in Table 6.

Table 6. A high-throughput/low-energy three round implementation of Speck

If we iterate three rounds in a loop nine times as just described, Speck 64/128 performs a full encryption in around \(34\cdot 27 + 4\cdot 9 + 10 + 32=996\) cycles, or about \(125\) cycles/byte and, including the key schedule, uses \(316\) bytes of flash. The high-throughput data for Speck 64/128, found in Table 7 of Appendix A, was obtained by unrolling 6 rounds instead of 3 in order to get within 3% of the fastest, fully unrolled, implementation. That gave us a 122 cycles/byte implementation but required about twice the flash.

5.4 A Small Flash Speck Implementation

Speck can be implemented to have small code size. We have not attempted to minimize the code size without regard to the throughput, so smaller implementations are possible. The primary savings in space was achieved by implementing the Speck round function as a subroutine. In this way, both the key schedule and encryption function could use it without having to duplicate code.

6 Cipher Comparisons

Although it is not the primary purpose of this paper, we shall compare the performance of Simon and Speck with several other block ciphers, and with a few stream ciphers. Comparisons along these lines can already be found in the Simon and Speck specification paper [2]. We have not endeavored to implement all of the discussed ciphers from scratch since this was outside the scope of our paper. Instead, when possible, we (or others) modified the best implementations available from existing sources so that they fit within our framework. Our comparison data and a further discussion of our methodology can be found in Appendix B.

For brevity, we only included algorithms with an encryption cost of 1000 cycles/byte or less. For this reason, or because the data was unavailable, well-known lightweight block ciphers such as Present, Clefia, Katan, etc., are not listed. We use a performance efficiency measure, rank, that is similar to a commonly used metric (see [16, 18, 21]), proportional to throughput divided by memory usage.

Among all block ciphers, Speck ranks in the top spot for every block and key size which it supports. Except for the 128-bit block size, Simon ranks second for all block and key sizes. Among the 64-bit block ciphers, Hight has respectable performance on this platform, ranking higher than AES. Although Twine ranks lower than AES, its performance is reasonable and, additionally, it has very good hardware performance, making it one of the better lightweight designs. Some software-based designs, like ITUbee, IDEA and TEA have poorer performance than AES and are not compensated by particularly lightweight hardware implementations. Klein, a hardware-based design, is slow and has the least competitive overall performance among the 64-bit block ciphers in our comparison.

Not surprisingly, AES-128 has very good performance on this platform, although for the same block and key size, Speck has about twice the performance. For the same key size but with a 64-bit block size, Simon and Speck achieve two and four times better overall performance, respectively, than AES. A few of the block ciphers in our comparison could not outperform AES, even though they had smaller block and key sizes.

If an application requires high speed, and memory usage is not a priority, AES has the fastest implementation (using 1912 bytes of flash, 432 bytes RAM) among all block ciphers with a 128-bit block and key that we are aware of, with a cost of just \(125\) cycles/byte [8]. The closest AES competitor is Speck 128/128, with a cost of 138 cycles/byte for a fully unrolled implementation. Since speed is correlated with energy consumption, AES-128 may be a better choice in energy critical applications than Speck 128/128 on this platform.Footnote 5 However, if a 128-bit block is not required, as we might expect for many applications on an 8-bit microcontroller, then a more energy efficient solution (using 628 bytes of flash, 108 bytes RAM) is Speck 64/128 with the same key size as AES-128 and an encryption cost of just 122 cycles/byte, or Speck 64/96 with a cost of 118 cycles/byte.

We have also compared three of the four Profile I (software) eSTREAM competition finalists, Salsa 20/12 [4], Sosemanuk [3], and HC-128 [24] since there is a general perception that well-designed stream ciphers must have better performance than block ciphers.Footnote 6 Interestingly, all of the stream ciphers are less efficient than the majority of the block ciphers listed. For high-speed/low-energy applications, if memory is not a concern, only Sosemanuk can beat the fastest implementations of AES and Speck.