1 Introduction

Cryptography is the study of methods that transform information from its original, comprehensible form into a scrambled, incomprehensible form, such that its content can only be disclosed to authorized parties. In the past, cryptography helped ensure secrecy in important communications, such as those of spies, military leaders, and diplomats. In recent decades, it has expanded in two main ways: (1) it provides mechanisms for more than just keeping secrets, through schemes such as digital signatures and digital cash; (2) it is used by almost all computer users, as it is embedded into the infrastructure for computing and telecommunications. Cryptography secures communications by providing confidentiality, integrity, authenticity and non-repudiation.

Cryptography has evolved over the years from Julius Caesar's cipher, which simply shifts the letters of the words by a fixed number of positions, to the sophisticated RSA algorithm, which was invented by Ronald L. Rivest, Adi Shamir and Leonard M. Adleman, and the elegant AES cipher (Advanced Encryption Standard), which was invented by Joan Daemen and Vincent Rijmen.

Cryptographic algorithms used by today's cryptosystems fall into two main categories: symmetric-key algorithms [11, 13] and asymmetric-key algorithms [9]. Symmetric-key ciphers use the same key for encryption and decryption or, to be more precise, the key used for decryption is computationally easy to derive from the key used for encryption. Cryptography using symmetric ciphers is also called private-key cryptography [16].

Symmetric-key ciphers, in turn, fall into two categories: block ciphers and stream ciphers. Stream ciphers encrypt the plaintext one bit at a time, in contrast to block ciphers, which operate on a block of bits of a predefined length. The most popular block ciphers are DES, IDEA [8] and AES [18], and the most popular stream cipher is RC4 [19].

Using symmetric-key cryptography, two parties who want to communicate confidentially must both have access to the same private key. This is a limiting aspect of this category of cryptography. In contrast, in asymmetric-key algorithms the key used during encryption is distinct from that used during decryption. The encryption key is made public while the decryption key is kept secret. Within this scheme, two parties can communicate securely as long as it is computationally hard to deduce the private key from the public one. This is the case in today's asymmetric-key, or simply public-key, algorithms such as RSA, which relies on the difficulty of integer factorization [10, 12, 14, 15]. The future of cryptography may reside in systems based on elliptic curves, a kind of public-key algorithm that may offer efficiency gains over other schemes.

The Advanced Encryption Standard (AES) is a block cipher, adopted as the new encryption standard to replace its predecessor, the Data Encryption Standard (DES) [17]. The main scrambling computation of AES is performed on a fixed block size of 128 bits with a key size of 128, 192 or 256 bits. This core computation is iterated for many rounds. The number of rounds depends on the key size. Currently, it is set to 10, 12 and 14 for the cited key sizes respectively. The resilience of AES against attacks depends entirely on the number of rounds used. So far, the best known attacks are on 7 rounds for 128-bit keys, 8 rounds for 192-bit keys, and 9 rounds for 256-bit keys [4]. The small margin between these round numbers and the actual ones is a concern for the cryptographic community.

It is worth pointing out that there are many FPGA-based AES implementations [2, 5–7, 20–22]. However, there is no implementation based on real-time data encryption and decryption to be used in a fast communication interface.

The need for fast but secure cryptographic systems keeps growing. Therefore, dedicated hardware for cryptography is becoming a key issue for designers. With the spread of reconfigurable hardware such as FPGAs, embedded cryptographic hardware has become cost-effective. Nevertheless, it is worth noting that even hardwired cryptographic algorithms are not safe. Attacks based on power consumption and electromagnetic analysis, such as SPA, DPA and EMA, have been successfully used to retrieve secret information stored in cryptographic devices. Besides performance in terms of area and throughput, designers of embedded cryptographic hardware must worry about the information leaked by their implementations.

In this paper, we propose a novel hardware implementation of AES-128. The architecture performs the core computation of the algorithm in a pipelined manner. The throughput of the cryptographic hardware exceeds 5 Gbit/s (see Sect. 4). A single hardware unit is used for both encryption and decryption. The pipelined encryption and decryption allows an increase of the number of rounds without much loss of efficiency. Recall that increasing the number of rounds applied increases the resistance of the AES algorithm.

The rest of this paper is organized in four sections. First, in Sect. 2, we give a brief description of the AES encryption and decryption algorithms as well as the modified versions of these two algorithms, which are the basis of the proposed hardware architecture. Thereafter, in Sect. 3, we describe, in a structured manner, the pipelined hardware architecture of AES-128 for encryption and decryption. Subsequently, in Sect. 4, we present some experimental results and compare our implementation to existing ones. Finally, in Sect. 5, we draw some conclusions and introduce some directions for future work.

2 Advanced Encryption Standard

The AES [18] is an elegant and, so far, secure cipher. The encryption and decryption processes are each performed through the repetition of four main stages, applied in a slightly different way in each case. However, both processes can be modified so that the main stages are equivalent, in the sense that for each stage the computational process is the same but some parameters, such as the S-box used and the key schedule exploited, are different. In the remaining part of this section, we describe the algorithms used by AES in the encryption and decryption processes as well as their respective modified versions, which allowed us to obtain a versatile hardware that can be used for both computations.

2.1 Encryption with AES

Encryption using AES proceeds as described in Algorithm 1, wherein functions SubBytes, ShiftRows, MixColumns and AddRoundKey are defined later in this section.

figure a

For hardware efficiency reasons, we modified the AES cipher algorithm as in Algorithm 2. Note that Algorithms 1 and 2 are equivalent and yield the same output.

figure b

2.1.1 Function SubBytes

The function yields a new state simply by substituting each of the 16 bytes of the state using a substitution box (S-box). The four most significant bits of the byte in question are used as the S-box row index while the remaining four bits are used as the S-box column index, as shown in Fig. 1.

Fig. 1 Illustration of SubBytes state transformation
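
The row/column indexing described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the hardware described in this paper: the helper names (`gf_mul`, `rotl8`, `build_sbox_table`, `sub_bytes`) are hypothetical, and the S-box is rebuilt here from its standard definition (multiplicative inverse in GF\((2^{8})\) followed by an affine map) rather than read from a ROM.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:          # reduce as soon as the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return product


def rotl8(byte, shift):
    """Rotate an 8-bit value left by `shift` positions."""
    return ((byte << shift) | (byte >> (8 - shift))) & 0xFF


def build_sbox_table():
    """Return the S-box as a 16 x 16 table indexed by [row][column]."""
    flat = []
    for value in range(256):
        # value^254 is the multiplicative inverse (and maps 0 to 0).
        inverse = 1
        for _ in range(254):
            inverse = gf_mul(inverse, value)
        affine = inverse
        for shift in range(1, 5):
            affine ^= rotl8(inverse, shift)
        flat.append(affine ^ 0x63)
    return [flat[16 * row:16 * row + 16] for row in range(16)]


def sub_bytes(state, table):
    """Substitute each byte: high nibble selects the row, low nibble the column."""
    return [table[byte >> 4][byte & 0x0F] for byte in state]
```

For instance, byte {53} selects row 5 and column 3 of the table, which holds {ED}.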

2.1.2 Function ShiftRows

The function obtains a new state by cyclically shifting the state rows. The bytes of row i are shifted i times, where \(0 \le i \le 3\), as shown in Fig. 2.

Fig. 2 Illustration of ShiftRows state transformation
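
As a minimal sketch of this transformation, assuming the state is held as a list of four rows of four bytes (a representation chosen here for illustration only, with a hypothetical function name), row i is rotated to the left by i positions:

```python
def shift_rows(state):
    """Cyclically shift row i of a 4 x 4 state to the left by i bytes."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]


# Example state whose entries are simply numbered 0..15, row by row.
example = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
shifted = shift_rows(example)
```

Row 0 is left untouched, while the first byte of row 1 moves to the end of that row, and so on.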

2.1.3 Function MixColumns

The function operates on the columns of the state. The four bytes of a given column are used as the coefficients of a polynomial over GF\((2^{8})\). The formed polynomial is multiplied by a fixed polynomial P(x) modulo \(x^{4} + 1\), wherein polynomial P(x) is defined as in (1):

$$\begin{aligned} P(x)= \{03\} x^{3}+\{01\}x^{2}+\{01\} x + \{02\} \end{aligned}$$
(1)

The details of the multiplication operation can be found in [1, 18]. The transformation performed by MixColumns is illustrated in Fig. 3.

Fig. 3 Illustration of MixColumns state transformation
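
The multiplication by P(x) modulo \(x^{4} + 1\) amounts to a fixed combination of the four bytes of a column, in which only products by {01}, {02} and {03} appear. The following sketch, with hypothetical function names, assumes a column is given as a list of four bytes and uses the xtime operation of [18] for the product by {02}:

```python
def xtime(byte):
    """Multiply a byte by {02} in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B
    return byte


def mix_column(column):
    """Apply MixColumns to one column [a0, a1, a2, a3]."""
    a0, a1, a2, a3 = column
    # {03}.a is computed as xtime(a) xor a.
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(columns):
    """Apply MixColumns to a state given as a list of four columns."""
    return [mix_column(column) for column in columns]
```

As a quick check, the column {DB}, {13}, {53}, {45} is mapped to {8E}, {4D}, {A1}, {BC}.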

2.1.4 Function AddRoundKey

The function computes the new state as the xor of the bytes of the state columns and the subkeys of the current round taken from the key schedule. The transformation performed by this function is depicted in Fig. 4.

Fig. 4 Illustration of AddRoundKey state transformation

Before the cipher operation takes place, a key schedule is generated. Four subkeys, each a 32-bit word, are required for each round of the cipher algorithm. The subkeys for the first round are the four words of the private cipher key, which is provided by the user. For a given round, the first subkey is obtained by rotating the last subkey of the previous round by one byte, then substituting each of its bytes using the S-box used by function SubBytes, then xoring the result with a round-dependent constant, and finally xoring the result with the first subkey of the previous round. Each subsequent subkey of the current round is computed as the xor of the preceding subkey in the current round and the subkey in the same position in the previous round. Of course, the whole key schedule required by the entire encryption process can be generated beforehand and stored for later use by function AddRoundKey.
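
The key schedule generation just described can be sketched as follows for a 128-bit key. This is a software illustration under stated assumptions, not the KeyExpansion component of the proposed hardware: the function names are hypothetical, the S-box is recomputed from its standard definition, and the round constants are the usual powers of {02} in GF\((2^{8})\).

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def build_sbox():
    """Return the S-box as a flat list of 256 bytes."""
    sbox = []
    for value in range(256):
        inverse = 1
        for _ in range(254):                 # value^254 = inverse, 0 -> 0
            inverse = gf_mul(inverse, value)
        affine = inverse
        for shift in range(1, 5):
            affine ^= ((inverse << shift) | (inverse >> (8 - shift))) & 0xFF
        sbox.append(affine ^ 0x63)
    return sbox


ROUND_CONSTANTS = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def expand_key_128(key):
    """Return the 44 subkeys (4-byte words) derived from a 16-byte key."""
    sbox = build_sbox()
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:                        # first subkey of a round
            temp = temp[1:] + temp[:1]        # rotate by one byte
            temp = [sbox[b] for b in temp]    # substitute each byte
            temp[0] ^= ROUND_CONSTANTS[i // 4 - 1]
        words.append([t ^ w for t, w in zip(temp, words[i - 4])])
    return words
```

Calling `expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))` returns 44 words, the first four of which are the cipher key itself.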

2.2 Decryption with AES

The decryption of a text that was ciphered using AES can be performed by Algorithm 3. Comparing Algorithms 1 and 3, one can note that each function was replaced by its inverse. However, the application sequence of these functions is slightly different. In order to have a unique versatile hardware for encryption and decryption, this algorithm was modified as in Algorithm 4, wherein functions InvSubBytes, InvShiftRows and InvMixColumns are defined in the following subsections. Function AddRoundKey is kept unchanged.

figure c

Algorithm 3 and Algorithm 4 are equivalent, as operations InvSubBytes and InvShiftRows commute. Moreover, function InvMixColumns is linear, so InvMixColumns(x xor y) is equal to InvMixColumns(x) xor InvMixColumns(y). Recall that operation AddRoundKey is a xor of its arguments. Using these two facts, we can swap operations AddRoundKey and InvMixColumns, provided that the columns of the decryption key schedule are modified using operation InvMixColumns. Note that functions SubBytes and InvSubBytes perform the same process but using distinct S-boxes.
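
Both facts can be checked numerically with a short sketch. The function names below are hypothetical, and the byte substitution table is only a stand-in (an arbitrary bytewise permutation, not the actual InvS-Box), since the commutation argument holds for any bytewise substitution:

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def inv_mix_column(column):
    """Apply InvMixColumns to one column [a0, a1, a2, a3]."""
    coefficients = [0x0E, 0x0B, 0x0D, 0x09]
    return [
        gf_mul(column[0], coefficients[(0 - r) % 4])
        ^ gf_mul(column[1], coefficients[(1 - r) % 4])
        ^ gf_mul(column[2], coefficients[(2 - r) % 4])
        ^ gf_mul(column[3], coefficients[(3 - r) % 4])
        for r in range(4)
    ]


def inv_shift_rows(state):
    """Cyclically shift row i of a 4 x 4 state to the right by i bytes."""
    return [row[4 - i:] + row[:4 - i] for i, row in enumerate(state)]


def substitute(state, table):
    """Apply a bytewise substitution table to every byte of the state."""
    return [[table[b] for b in row] for row in state]


def xor_columns(x, y):
    return [a ^ b for a, b in zip(x, y)]


# Stand-in bytewise permutation (assumption: NOT the real InvS-Box).
STAND_IN_TABLE = [(7 * b + 3) % 256 for b in range(256)]

state = [[4 * r + c for c in range(4)] for r in range(4)]
commutes = (substitute(inv_shift_rows(state), STAND_IN_TABLE)
            == inv_shift_rows(substitute(state, STAND_IN_TABLE)))

x, key = [0x8E, 0x4D, 0xA1, 0xBC], [0x2B, 0x7E, 0x15, 0x16]
linear = (inv_mix_column(xor_columns(x, key))
          == xor_columns(inv_mix_column(x), inv_mix_column(key)))
```

The second equality is what allows AddRoundKey to be moved past InvMixColumns when the subkey column `key` is replaced by `inv_mix_column(key)`.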

figure d

2.2.1 Function InvSubBytes

The function operates in the same manner as function SubBytes does but the S-box used is different and is usually called InvS-Box as shown in Fig. 5.

Fig. 5 Illustration of InvSubBytes state transformation

2.2.2 Function InvShiftRows

The function yields a new state by cyclically shifting the state rows. The shifting is done in the opposite direction with respect to function ShiftRows. As before, the bytes of row i are shifted i times, where \(0 \le i \le 3\), as shown in Fig. 6.

Fig. 6 Illustration of InvShiftRows state transformation

2.2.3 Function InvMixColumns

The function operates in the same way function MixColumns does but with a different matrix. The formed polynomial is multiplied by a fixed polynomial P(x) modulo \(x^{4} + 1\), wherein P(x) is defined as in (2):

$$\begin{aligned} P(x) = \{0B\} x^{3}+\{0D\}x^{2}+\{09\} x + \{0E\} \end{aligned}$$
(2)

The transformation performed by InvMixColumns is illustrated in Fig. 7.

Fig. 7 Illustration of InvMixColumns state transformation
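
That the polynomial of (2) undoes the effect of the polynomial of (1) can be verified by multiplying the two modulo \(x^{4} + 1\): the product reduces to the constant {01}. The sketch below, with hypothetical function names, stores a polynomial as the list of its coefficients from degree 0 to degree 3 and relies on the fact that \(x^{4} \equiv 1\) modulo \(x^{4} + 1\):

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def poly_mul_mod(p, q):
    """Multiply two degree-3 polynomials over GF(2^8) modulo x^4 + 1."""
    result = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            # x^(i+j) wraps around because x^4 = 1 modulo x^4 + 1.
            result[(i + j) % 4] ^= gf_mul(p[i], q[j])
    return result


P_MIX = [0x02, 0x01, 0x01, 0x03]        # polynomial (1), degree 0 first
P_INV_MIX = [0x0E, 0x09, 0x0D, 0x0B]    # polynomial (2), degree 0 first
identity = poly_mul_mod(P_MIX, P_INV_MIX)
```

Here `identity` holds the coefficients of the constant polynomial {01}.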

3 Pipelined AES Hardware

The overall architecture of the AES hardware mirrors the structure of Algorithms 2 and 4. It is a synchronous implementation of both the cipher and the decipher processes. It uses four 128-bit registers. At every clock transition, these registers are loaded, except Register \(_{3}\), which is loaded when an input state is completely ciphered. In the encryption/decryption process, Register \(_{0}\) is loaded with the input data or the partially encrypted/decrypted plaintext/ciphertext; Register \(_{1}\) with the result of the AddRoundKey component; Register \(_{2}\) with the state after applying function SubBytes (using the appropriate S-box) and subsequently ShiftRows. The block architecture of the AES cipher/decipher hardware is shown in Fig. 8.

Fig. 8 Overall architecture for the AES hardware

3.1 Component Synchronization

For component synchronization purposes, the architecture includes a controller. Among other actions, the controller determines when to reset the cipher hardware, accept input data, and register output results. As the execution of function MixColumns/InvMixColumns is conditional (see Algorithm 2), the controller decides when the result obtained by the associated component can be used or must be ignored. Recall that the hardware allows both encryption and decryption. When data is being deciphered, the key schedule generated by component KeyExpansion must be ordered differently [18]. The AES hardware of Fig. 8 takes advantage of component Mix to schedule the subkeys in the required order. The controller also synchronizes this operation. The controller is structured as in Fig. 9.

Fig. 9 Controller architecture

The included combinational logic converts the 5-bit count into a single bit that triggers the state transition. The state machine includes six states. As long as control signal keyExpand is set, the machine remains in state \(S_{0}\). As soon as this signal is reset by component KeyExpansion, which means that the key schedule generation is complete, the machine transits to state \(S_{1}\), wherein it stays for 3 clock cycles, which is the time required to complete the processing of one 128-bit state. During this period, the data input signal is active, which allows the hardware to accept the three states that will be ciphered/deciphered in a pipelined manner. Synchronously with the fourth clock transition, the machine transits to state \(S_{2}\), which deactivates the data input signal and waits until the three accepted states are almost processed, i.e. until only the last AddRoundKey remains to be performed to complete the encryption/decryption process. At the 30th clock transition, the machine state changes to \(S_{3}\) to activate the output result signal, which is maintained for the two subsequent clock periods. At the 33rd clock transition, the encryption/decryption of the three accepted states is completed and control therefore returns to state \(S_{1}\), wherein the data input signal is reactivated to allow more data to be entered and processed. The state machine transition diagram is shown in Fig. 10.

Fig. 10 State machine transition diagram

3.2 Component Mix

Function MixColumns is implemented by a massively parallel component that computes all the bytes of the new state in a single clock cycle. It uses four components of the same architecture. This basic component produces one column of the new state. Its architecture is described in Fig. 11, wherein component mult yields the GF\((2^{8})\) product of a given byte from the state times {01}, {02}, {03}, {09}, {0B}, {0D} or {0E} (see [1, 18] for details on the operation). The architecture of component mult is presented in Fig. 12. Component xtime computes the xtime operation as defined in [18] and its architecture is given in Fig. 13.

Fig. 11 Basic element in component Mix

Fig. 12 Architecture of the basic component mult

Fig. 13 Architecture of the basic component xtime
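
The relation between components mult and xtime can be illustrated in software. The sketch below is an assumption about how such a product may be decomposed, not a description of the circuits of Figs. 12 and 13: the product by a constant is obtained by xoring the repeated xtime images of the byte selected by the bits of the constant, e.g. {0B} = {08} xor {02} xor {01}. The function names mirror the component names for readability only.

```python
def xtime(byte):
    """Multiply a byte by {02} in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B
    return byte


def mult(byte, constant):
    """Multiply a byte by a constant using only xtime and xor."""
    product = 0
    power = byte                  # byte times {01}, {02}, {04}, {08}, ...
    while constant:
        if constant & 1:
            product ^= power
        power = xtime(power)
        constant >>= 1
    return product


# First byte of InvMixColumns applied to the column {8E}, {4D}, {A1}, {BC}.
first_byte = (mult(0x8E, 0x0E) ^ mult(0x4D, 0x0B)
              ^ mult(0xA1, 0x0D) ^ mult(0xBC, 0x09))
```

For the constants listed above, at most three chained xtime stages are needed, since none of them exceeds {0E}.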

3.3 Component Substitute/Shift

The component implementing function SubBytes uses 16 S-boxes (8 for ciphering and 8 for deciphering) stored in a Read-Only Memory (ROM). The obtained state is row-shifted before its storage in \(\textit{Register}_{2}\). The component architecture is given in Fig. 14. The component that implements function AddRoundKey is simply a net of xor gates that adds, in \(GF(2^{8})\), the key schedule to the current state.

Fig. 14 The structure of component Substitute/Shift

3.4 Component KeyExpansion

As explained before, the entire key schedule required by the encryption/decryption process is generated before the process itself starts. When the hardware is used for decryption, the key schedule generated by component KeyExpansion must be ordered differently [18]. Without any increase in area requirements, the proposed versatile AES hardware of Fig. 8 takes advantage of component Mix to schedule the subkeys in the required order. The subkeys are then stored in a look-up table. Also, the controller allows the appropriate set of subkeys to be provided for each round of the encryption/decryption process. The architecture of component KeyExpansion is shown in Fig. 15.

Fig. 15 Architecture of component KeyExpansion

4 Experimental Results

The pipelined execution of the AES cipher using the architecture of Fig. 8 is illustrated in Fig. 16.

Fig. 16 Pipelined execution of the AES algorithm using the hardware of Fig. 8

The generated semi-custom implementation was prototyped on a Xilinx Virtex-7 VC709 Connectivity Kit board, which includes an XC7VX690T FPGA [24]. This FPGA has a very high logic density. It offers more than 400,000 LUTs (Look-Up Tables), more than 800,000 registers and almost 1,500 memory slices.

The hardware was synthesized for the aforementioned board using the Xilinx Vivado development kit [23]. This synthesis allowed it to run at an operating frequency of 500 MHz and thus with a 2 ns clock cycle. Every 33 clock cycles, the hardware can yield an encrypted data stream of \(3\times 128\) bits. The throughput, say Tp, can then be calculated as in (3). The throughput is a little more than 5 Gb/s, i.e. about 5.8 Gb/s, or 5548.65 Mb/s when 1 Mb is taken as \(2^{20}\) bits.

$$\begin{aligned} Tp = \frac{3\times 128}{33\times \textit{clock\;cycle}} = \frac{128\,\hbox {b}}{11\times 2\,\hbox {ns}} \approx 5.82\,\hbox {Gb/s} \approx 5548.65\times 2^{20}\,\hbox {b/s} \end{aligned}$$
(3)
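
The arithmetic of (3) can be reproduced with a few lines of Python. The constant and function names are hypothetical, and the reading of the figure 5548.65 as a count of megabits of \(2^{20}\) bits is an inference from the numbers rather than something stated explicitly with the equation:

```python
BLOCK_BITS = 128          # AES state size in bits
BLOCKS_PER_BATCH = 3      # states processed together in the pipeline
CYCLES_PER_BATCH = 33     # clock cycles needed to output one batch
CLOCK_HZ = 500e6          # 500 MHz, i.e. a 2 ns clock cycle


def throughput_bits_per_second(blocks=BLOCKS_PER_BATCH,
                               block_bits=BLOCK_BITS,
                               cycles=CYCLES_PER_BATCH,
                               clock_hz=CLOCK_HZ):
    """Bits produced per second when `blocks` states leave every `cycles` cycles."""
    return blocks * block_bits * clock_hz / cycles


tp = throughput_bits_per_second()
tp_gigabits = tp / 1e9            # decimal gigabits per second
tp_binary_megabits = tp / 2**20   # megabits of 2^20 bits per second
```

With the values above, `tp_gigabits` is about 5.82 and `tp_binary_megabits` is about 5548.65.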

As far as the authors know, a versatile hardware implementation of the AES algorithm that performs both encryption and decryption is novel. We compared our implementation to the ones from [7] and [21]. Note that these implementations are for the cipher algorithm only. In contrast, our implementation allows both ciphering and deciphering with the same hardware. One may therefore consider that the proposed implementation and those from [2, 7, 21] are not directly comparable; they are cited here for reference only. It is also worth pointing out that we did not implement the pipelined system, as described here, in software, as this would entail a great deal of parallel programming. Instead, we compare the pipelined execution of AES to the sequential one, both running in the FPGA. The throughput, expressed in Mbps, as well as the hardware area required, expressed in number of CLBs, are given in Table 1.

Table 1 Performance comparison

Recall that the resilience of AES-based encryption against cryptanalysis attacks depends entirely on the number of rounds used. The pipelined implementation we propose throughout this paper can be easily adapted to a higher round number. To increase the number of rounds, component KeyExpansion needs to generate a longer key schedule, and therefore the delay introduced therein increases with the number of rounds. The throughput, say Tp, can be expressed in terms of the round number, say rn, as in (4) when no pipelining is used and as in (5) when the pipelined design is used.

$$\begin{aligned} Tp(rn)= & {} \frac{128}{(11+rn)\times \textit{clock\;cycle}} \end{aligned}$$
(4)
$$\begin{aligned} Tp(rn)= & {} \frac{128}{11(1+rn)\times \textit{clock\;cycle}} \end{aligned}$$
(5)

The chart of Fig. 17 shows that increasing the number of rounds can be done without much loss in efficiency. In the case of the designs published in [2, 7, 21], we only have access to the throughput when no rounds are applied. The main aim of this chart is to show that the pipelined design always provides a better performance than the sequential one (i.e. without pipelining), except when no rounds are used. It also quantifies the improvement achieved by the pipelined design over the sequential one. Note that the throughput is expressed in b/ns.

Fig. 17 The impact of increase in the round number

5 Conclusion

In this paper, we propose a novel pipelined hardware implementation of AES-128 that can be used for both encryption and decryption. Besides, we show that if the required number of rounds must increase to defeat attackers, the proposed implementation stays efficient. The proposed hardware is massively parallel and executes the four main steps of the algorithm in a pipelined manner, which allows a throughput of almost 6 Gb/s. Compared to existing implementations of the cipher algorithm alone, this throughput may be considered somewhat low. However, considering the 2-in-1 aspect of the hardware, which allows both encryption and decryption, it comes in handy for devices with restricted hardware area.

In future research work, we intend to investigate further the proposed implementation, with the hope to improve the throughput without much increase in required hardware area.