Keywords

1 Introduction

The recent advancements in Internet of Things (IoT) leads to large number of resource constrained devices to connect to the Internet. To ensure security of these devices, the traditional algorithms are not suitable and hence there is a new class of cryptographic algorithms developed for these applications under the category of lightweight ciphers. Among the different lightweight ciphers developed, Sony Corporation’s CLEFIA [1] is standardized by ISO/IEC 29192-2 in 2012 [2]. Further CLEFIA is a recommended cipher for lightweight cryptographic technology in e-government applications by the Japanese Cryptographic Research and Evaluation Committee (CRYPTREC) in 2013 [3]. Hence we have implemented CLEFIA in our work.

CLEFIA is a 128-bit block cipher with three different key lengths 128, 192 and 256-bits. Two types of keys namely round keys (RK) and whitening keys (WK) are used for encryption and decryption. The number of rounds of operation varies as 18, 22 and 26 rounds with the key length. The encryption/decryption architecture is based on Generalized Feistel Network (GFN) where a 4-branch Feistel structure is employed with two F-functions F0 and F1. Each of the two F-functions uses two Substitution boxes (S-box) S0 and S1 and two diffusion matrices M0 and M1. The key generation module is also based on the same 4-branch Feistel structure for 128-bit and 8-branch Feistel structure for 192/256-bit key. Decryption is the inverse of encryption where the round keys and whitening keys have to be given in reverse order. More detailed explanation of the round functions and key generation can be found in [1]. While most of the existing works have proposed iterative architectures for CLEFIA-128, the authors of [4] have reported three different 8-bit serialized ASIC implementations for the cipher on 130 nm technology.

The vulnerability of the CLEFIA cipher to Differential Attacks [5] and Fault Attacks [6] has been studied in detail. Side Channel Attacks (SCAs) are one class of attacks that exploits the implementation vulnerabilities of the cryptographic devices to extract the secret key. By monitoring the power consumption/electromagnetic radiations/time taken for execution, the key value can be monitored. These attacks are non-invasive and hence they cannot be detected. Among the different types of side channels, power analysis attacks (PAAs) are vast studied in literature. Authors of [7] have conducted a simulation based Differential Power Analysis (DPA) on the first round of CLEFIA to extract the first round key using a hamming weight model (HWM). The work in [8] demonstrates a Correlation Power Analysis (CPA) and electromagnetic attack on the CLEFIA implementation on 8-bit AVR processor based smart card. The attack is carried out on the last 3 rounds of encryption. Hence there is a strong need for efficient countermeasures to prevent PAA.

The PAA countermeasures can be implemented at different abstraction levels. We aim to implement it at RTL level in-order to implement and validate the same on a SCA standard SAKURA-G board. Two major countermeasures at RTL are Hiding and Masking. The work in [10] has shown first, second and third order masking technique for CLEFIA on FPGA and Intel platform, however there is no evaluation of PAA resistance by the authors. The Masking technique leaks information due to glitches. Threshold Implementation (TI), a variant of masking technique was proposed in [9] to overcome the leakage due to glitches and is proved to be the strongest countermeasure against PAA till date. As per the author’s knowledge, there is no work in literature for a TI implementation of CLEFIA. The major contributions of this work are:

  • A 16-bit novel serial architecture is proposed for CLEFIA-128 with a Composite Field Architecture (CFA) based S1 box and Algebraic Normal Form (ANF) based S0 box. The proposed architecture has achieved a very low area and a reasonable throughput.

  • A novel 2-share based TI implementation is derived for CLEFIA S0 and S1 box for the first time in literature and integrated to derive a two-shared TI based CLEFIA architecture.

  • The proposed first-order protected CLEFIA architecture is validated for PAA resistance against (i) Evaluation Style categories – DPA, CPA, Mutual Information Analysis (MIA) and (ii) Conformance Style categories – non-specific Test Vector Leakage Assessment (TVLA). The results show that the implementation is resistant against first-order PAAs.

The paper is organized as follows: Sect. 2 describes the novelty in CLEFIA architecture along with the result comparison. Section 3 explains in detail on the proposed TI implementation derivation and Sect. 4 presents the results of PAA evaluation. Section 5 presents the conclusion followed by references.

2 Proposed CLEFIA Architecture

We have proposed a serial CLEFIA architecture with a 16-bit data-path to perform encryption/decryption as shown in Fig. 1. The synthesis results are performed with Semi-Conductor Laboratory (SCL) [8] 180 nm technology using Cadence Encounter RTL Compiler. For comparison purposes, the area is reported in Gate Equivalence (GE) which is obtained by dividing the area reported in µm2 by the area of the lowest driving strength 2-input NAND gate of that corresponding technology library.

Fig. 1.
figure 1

Proposed 16-bit serial data-path of CLEFIA architecture

The data-path is formed by using area efficient S0/S1 boxes as explained in Sect. 2.1, followed by M0/M1 matrix realizations using x2, x4 and x8 Galois Field (GF) multipliers. The register files R0–R7 are of 16-bit size and each line carries 16-bit data except the ones explicitly shown as 8-bits. The addition of R8 register file in our proposed serial architecture has reduced the critical path delay from 5.1 ns to 3.5 ns. Further the ‘16’ 2x1 MUXes required for the product addition in M0/M1 matrix Multipliers are reduced to ‘2’ MUXes by placing them before the ‘R8’ register. These optimization techniques along with the efficient realizations of the individual blocks to obtain a 16-bit CLEFIA data-path is one of our contribution.

2.1 ANF Based CLEFIA S0 Box

CLEFIA S0 box construction is based on 4 random S-boxes namely SS0, SS1, SS2 and SS3 which are further combined using GF(24) multiplier. In this work, ANF based implementation is chosen for the S0 box realization. ANF representation of a Boolean function is useful to evaluate the algebraic degree and the linearity property of any function. ANF expressions are mainly useful to find shares of the Boolean function, when we adopt the countermeasure against SCA namely TI. The ANF implementation of S0 box requires only 115 GE, hence we have adopted the same for our serial architecture.

2.2 CFA Based CLEFIA S1 Box

CLEFIA S1 box construction is based on GF(28) inversion using the polynomial z8 + z4 + z3 + z2 + 1 [2]. The CFA based realization of S-boxes are proved to achieve the lowest area in literature. The core of the CFA computation is the multiplicative inverse block preceded and followed by the affine transformation f and g respectively as shown in Fig. 2. Gate-level implementations of the GF(24) arithmetic based blocks namely multiplier, squarer, constant multiplier, inverter, isomorphic and inverse isomorphic mapping combined with affine f and g are adopted in this work. The CFA based S1 box architecture consumes an area of about 207 GE which is 50% less compared to the LUT approach. Hence we have chosen this for the top level serial architecture.

Fig. 2.
figure 2

Proposed CFA architecture of CLEFIA S1 box

The proposed serial architecture after integration of the sub-blocks occupies 1345 GE and consumes a power of 1.3 mW. The performance comparison of the proposed serial CLEFIA architecture with the existing serial architecture [4] is shown in Table 1. It can be seen that, there is about 19% decrease in total area and a two-fold improvement in throughput calculated at 100 kHz. For comparison with [4], we have chosen the operating frequency of 100 kHz. The area decrease of the proposed architecture is attributed to low cost implementation of S0 and S1 boxes. The decrease in number of clock cycles which has resulted in throughput improvement is because of the optimal selection of data-path size of 16 bits against the 8-bit size in [4]. Each round computation can be completed in 7 clock cycles, an additional 8 clock cycles is required for input loading and output retrieval, thereby a total of 142 clock cycles for encryption/decryption. Also, the proposed architecture can be used for both encryption and decryption.

Table 1. Synthesis results comparison

The proposed serial CLEFIA architecture occupies the lowest area and has a very low power consumption and hence suitable for resource constrained applications like edge nodes in IoT devices, RFID Tags, smart cards etc.

3 Proposed CLEFIA Threshold Implementation

The proposed serial architecture has comparatively less algorithmic noise that arises from parallel computation of data and hence has less resistance to PAA. This motivates us to derive a low-cost RTL countermeasure for the proposed serial CLEFIA architecture which is explained in the subsequent section.

3.1 Threshold Implementation of CLEFIA

The objective of the countermeasures is to nullify the effect of data that the circuit processes on the power consumption. At the RTL level, the most efficient technique proposed so far till date is Threshold Implementation (TI), originally proposed by Nikova et al. in 2006 [9]. It involves splitting of the data into two or more shares and the computation data-path of each share runs in parallel. There are three basic requirements that any TI implementation should satisfy: Correctness, Non-completeness and Uniformity. The key features and advantages of TI techniques are (i) The intermediate results do not leak any information about the secret value since each share is independent of at least one input (ii) Even in the presence of glitches, sufficient first-order resistance is obtained. (iii) The technique is compatible with the standard RTL Design flow.

In general for a ‘d’-th order security of any arbitrary function with algebraic degree ‘t’, the TI scheme uses a minimum of (td + 1) shares. The work in [11] is to further reduce the number of shares from (td + 1) to (d + 1) provided that the input shares are independent. Hence the output share is independent of at least one share of input variable. In-order to satisfy the non-completeness property, the design has to be pipelined to sufficient number of stages. To ensure uniformity in all these cases, sufficient bits of randomness is introduced at each pipelined intermediate computation stage. The work in [12] has realized a 3-stage pipelined AES S-box with 2 shares with each stage requiring about 12, 24 and 28 bits of randomness. The authors claim to have achieved less area compared to the other AES TI implementations.

Based on this idea, ours is the first work to derive a TI implementation for a low-cost CLEFIA architecture. The TI implementation of the proposed CFA S1 box and ANF S0 boxes are derived and using the same, the two-share top level CLEFIA architecture is implemented. In this work, the shares of the inputs a, b are denoted as {a0, a1} {b0, b1}; the shares of the output f[i], i = 0, 1, 2, 3 are denoted as {f3[i], f2[i], f1[i], f0[i]} respectively.

TI Implementation of CLEFIA S1 box:

The CFA Architecture of S1 box is shown in Sect. 2.2. The CFA architecture consists of linear blocks namely affine transformation f with isomorphic mapping, affine transformation g with inverse isomorphic mapping, square-scaler, addition blocks which are replicated directly for processing the two-shares. The non-linear blocks are the GF(24) multiplier and inverter. Based on the work in [11], for the first-order security (d = 1), number of required input shares is d + 1=2, provided that the input shares are independent and hence the number of output shares is (d + 1)t = 2t, t is the algebraic degree of the function. It is observed that the algebraic degree of the GF(24) multiplier is 2 and GF(24) inverter is 3. Hence the TI multiplier implementation takes 2-input and produces 4-output shares as shown in (1), inverter implementation takes 2-input and produces 8-output shares. The 4-output shares of multiplier are further reduced to two by xoring with sufficient bits of randomness. Similarly the 8-output shares of inverter are reduced to two. Also, the number of pipeline stages is 3 similar to [12]. Due to space limitations, only the multiplier sharing expressions are given in (1). To satisfy the non-uniformity property, the 3-pipelining stages require 12, 28 and 24 bits of randomness each. The TI implementation of CFA S1 box occupies 1085 GE and consumes 6.2 mW power.

$$ \begin{aligned} & f[3] = (a[3] * b[3]) \oplus (a[3] * b[0]) \oplus (a[0] * b[3]) \oplus (a[2] * b[1]) \oplus (a[1] * b[2]) \\ & f[2] = (a[3] * b[3]) \oplus (a[3] * b[2]) \oplus (a[2] * b[3]) \oplus (a[2] * b[0]) \oplus (a[0] * b[2]) \oplus (a[1] * b[1]) \\ & f[1] = (a[3] * b[2]) \oplus (a[2] * b[3]) \oplus (a[3] * b[1]) \oplus (a[1] * b[3]) \oplus (a[2] * b[2]) \oplus (a[0] * b[1]) \oplus (a[1] * b[0]) \\ & f[0] = (a[3] * b[1]) \oplus (a[1] * b[3]) \oplus (a[2] * b[2]) \oplus (a[0] * b[0]) \\ \end{aligned} $$
$$ \begin{aligned} f0[3] = (a0[3] * b0[3]) \oplus (a0[3] * b0[0]) \oplus (a0[0] * b0[3]) \oplus (a0[2] * b0[1]) \oplus (a0[1] * b0[2]) \hfill \\ f1[3] = (a0[3] * b1[3]) \oplus (a0[3] * b1[0]) \oplus (a0[0] * b1[3]) \oplus (a0[2] * b1[1]) \oplus (a0[1] * b1[2]) \hfill \\ f2[3] = (a1[3] * b0[3]) \oplus (a1[3] * b0[0]) \oplus (a1[0] * b0[3]) \oplus (a1[2] * b0[1]) \oplus (a1[1] * b0[2]) \hfill \\ f3[3] = (a1[3] * b1[3]) \oplus (a1[3] * b1[0]) \oplus (a1[0] * b1[3]) \oplus (a1[2] * b1[1]) \oplus (a1[1] * b1[2]) \hfill \\ \end{aligned} $$
$$ \begin{aligned} f0[2] = (a0[3] * b0[3]) \oplus (a0[3] * b0[2]) \oplus (a0[2] * b0[3]) \oplus (a0[2] * b0[0]) \oplus (a0[0] * b0[2]) \oplus (a0[1] * b0[1]) \hfill \\ f1[2] = (a0[3] * b1[3]) \oplus (a0[3] * b1[2]) \oplus (a0[2] * b1[3]) \oplus (a0[2] * b1[0]) \oplus (a0[0] * b1[2]) \oplus (a0[1] * b1[1]) \hfill \\ f2[2] = (a1[3] * b0[3]) \oplus (a1[3] * b0[2]) \oplus (a1[2] * b0[3]) \oplus (a1[2] * b0[0]) \oplus (a1[0] * b0[2]) \oplus (a1[1] * b0[1]) \hfill \\ f3[2] = (a1[3] * b1[3]) \oplus (a1[3] * b1[2]) \oplus (a1[2] * b1[3]) \oplus (a1[2] * b1[0]) \oplus (a1[0] * b1[2]) \oplus (a1[1] * b1[1]) \hfill \\ \end{aligned} $$
$$ \begin{aligned} f0[1] = (a0[3] * b0[2]) \oplus (a0[2] * b0[3]) \oplus (a0[3] * b0[1]) \oplus (a0[1] * b0[3]) \oplus (a0[2] * b0[2]) \oplus (a0[0] * b0[1]) \oplus (a0[1] * b0[0]) \hfill \\ f1[1] = (a0[3] * b1[2]) \oplus (a0[2] * b1[3]) \oplus (a0[3] * b1[1]) \oplus (a0[1] * b1[3]) \oplus (a0[2] * b1[2]) \oplus (a0[0] * b1[1]) \oplus (a0[1] * b1[0]) \hfill \\ f2[1] = (a1[3] * b0[2]) \oplus (a1[2] * b0[3]) \oplus (a1[3] * b0[1]) \oplus (a1[1] * b0[3]) \oplus (a1[2] * b0[2]) \oplus (a1[0] * b0[1]) \oplus (a1[1] * b0[0]) \hfill \\ f3[1] = (a1[3] * b1[2]) \oplus (a1[2] * b1[3]) \oplus (a1[3] * b1[1]) \oplus (a1[1] * b1[3]) \oplus (a1[2] * b1[2]) \oplus (a1[0] * b1[1]) \oplus (a1[1] * b1[0]) \hfill \\ \end{aligned} $$
$$ \begin{aligned} f0[0] = (a0[3] * b0[1]) \oplus (a0[1] * b0[3]) \oplus (a0[2] * b0[2]) \oplus (a0[0] * b0[0]) \hfill \\ f1[0] = (a0[3] * b1[1]) \oplus (a0[1] * b1[3]) \oplus (a0[2] * b1[2]) \oplus (a0[0] * b1[0]) \hfill \\ f2[0] = (a1[3] * b0[1]) \oplus (a1[1] * b0[3]) \oplus (a1[2] * b0[2]) \oplus (a1[0] * b0[0]) \hfill \\ f3[0] = (a1[3] * b1[1]) \oplus (a1[1] * b1[3]) \oplus (a1[2] * b1[2]) \oplus (a1[0] * b1[0]) \hfill \\ \end{aligned} $$
(1)

TI Implementation of CLEFIA S0 box:

The ANF expressions of the CLEFIA S0 box has an algebraic degree of 3. Hence the SS0, SS1, SS2 and SS3 sub-blocks are shared with 2-input and 8-outputs (further reduced to 2) each and are combined using GF(2) multipliers to arrive at its TI implementation. Such an implementation has occupied about 581 GE with a power consumption value of 5.2 mW. Due to page limitations, ANF expressions of only SS0 and its corresponding two shared TI implementation expressions are shown in (2).

$$ \begin{aligned} & f0 = 1 \oplus a[0] \oplus (a[3] * a[2]) \oplus (a[3] * a[1]) \oplus (a[2] * a[1]) \oplus (a[1] * a[0]) \oplus (a[3] * a[2] * a[1]) \oplus (a[2] * a[1] * a[0]) \\ & f1 = 1 \oplus a[3] \oplus a[2] \oplus (a[3] * a[1]) \oplus (a[2] * a[0]) \oplus (a[1] * a[0]) \oplus (a[3] * a[2] * a[1]) \oplus (a[2] * a[1] * a[0]) \\ & f2 = 1 \oplus a[2] \oplus a[1] \oplus (a[3] * a[0]) \oplus (a[2] * a[0]) \oplus (a[1] * a[0]) \oplus (a[3] * a[2] * a[1]) \\ & f3 = a[3] \oplus (a[3] * a[1]) \oplus (a[2] * a[0]) \oplus (a[3] * a[2] * a[1]) \oplus (a[3] * a[2] * a[0]) \\ \end{aligned} $$
$$ \begin{aligned} & f0[0] = 1 \oplus a0[0] \oplus (a0[3] * a0[2]) \oplus (a0[3] * a0[1]) \oplus (a0[2] * a0[1]) \oplus (a0[1] * a0[0]) \oplus (a0[3] * a0[2] * a0[1]) \oplus (a0[2] * a0[1] * a0[0]) \\ & f1[0] = a1[0] \oplus (a0[3] * a1[2]) \oplus (a0[3] * a1[1]) \oplus (a1[2] * a1[1]) \oplus (a1[1] * a1[0]) \oplus (a0[3] * a1[2] * a1[1]) \oplus (a1[2] * a1[1] * a1[0]) \\ & f2[0] = (a1[3] * a0[2]) \oplus (a1[3] * a1[1]) \oplus (a0[2] * a1[1]) \oplus (a1[1] * a0[0]) \oplus (a1[3] * a0[2] * a1[1]) \oplus (a0[2] * a1[1] * a0[0]) \\ & f3[0] = (a1[3] * a1[2]) \oplus (a1[3] * a0[1]) \oplus (a1[2] * a0[1]) \oplus (a0[1] * a1[0]) \oplus (a1[3] * a1[2] * a0[1]) \oplus (a1[2] * a0[1] * a1[0]) \\ & f4[0] = (a0[3] * a0[2] * a1[1]) \oplus (a0[2] * a1[1] * a1[0]) \\ & f5[0] = (a0[3] * a1[2] * a0[1]) \oplus (a1[2] * a0[1] * a0[0]) \\ & f6[0] = (a1[3] * a0[2] * a0[1]) \oplus (a0[2] * a0[1] * a1[0]) \\ & f7[0] = (a1[3] * a1[2] * a1[1]) \oplus (a1[2] * a1[1] * a0[0]) \\ \end{aligned} $$
$$ \begin{aligned} & f0[1] = 1 \oplus a0[3] \oplus a0[2] \oplus (a0[3] * a0[1]) \oplus (a0[2] * a0[0]) \oplus (a0[1] * a0[0]) \oplus (a0[3] * a0[2] * a0[1]) \oplus (a0[2] * a0[1] * a0[0]) \\ & f1[1] = a1[3] \oplus a1[2] \oplus (a1[3] * a0[1]) \oplus (a1[2] * a1[0]) \oplus (a0[1] * a1[0]) \oplus (a1[3] * a1[2] * a0[1]) \oplus (a1[2] * a0[1] * a1[0]) \\ & f2[1] = (a0[3] * a1[1]) \oplus (a0[2] * a1[0]) \oplus (a1[1] * a1[0]) \oplus (a0[3] * a0[2] * a1[1]) \oplus (a0[2] * a1[1] * a1[0]) \\ & f3[1] = (a1[3] * a1[1]) \oplus (a1[2] * a0[0]) \oplus (a[1] * a0[0]) \oplus (a1[3] * a1[2] * a1[1]) \oplus (a1[2] * a1[1] * a0[0]) \\ & f4[1] = (a0[3] * a1[2] * a0[1]) \oplus (a1[2] * a0[1] * a0[0]) \\ & f5[1] = (a1[3] * a0[2] * a0[1]) \oplus (a0[2] * a0[1] * a1[0]) \\ & f6[1] = (a0[3] * a1[2] * a1[1]) \oplus (a1[2] * a1[1] * a1[0]) \\ & f7[1] = (a1[3] * a0[2] * a1[1]) \oplus (a0[2] * a1[1] * a0[0]) \\ \end{aligned} $$
$$ \begin{aligned} & f0[2] = 1 \oplus a0[2] \oplus a0[1] \oplus (a0[3] * a0[0]) \oplus (a0[2] * a0[0]) \oplus (a0[1] * a0[0]) \oplus (a0[3] * a0[2] * a0[1]) \\ & f1[2] = a1[2] \oplus a1[1] \oplus (a0[3] * a1[0]) \oplus (a1[2] * a1[0]) \oplus (a1[1] * a1[0]) \oplus (a0[3] * a1[2] * a1[1]) \\ & f2[2] = (a1[3] * a0[0]) \oplus (a1[2] * a0[0]) \oplus (a1[1] * a0[0]) \oplus (a1[3] * a1[2] * a1[1]) \\ & f3[2] = (a1[3] * a1[0]) \oplus (a0[2] * a1[0]) \oplus (a0[1] * a1[0]) \oplus (a1[3] * a0[2] * a0[1]) \\ & f4[2] = (a0[3] * a0[2] * a1[1]) \\ & f5[2] = (a0[3] * a1[2] * a0[1]) \\ & f6[2] = (a1[3] * a0[2] * a1[1]) \\ & f7[2] = (a1[3] * a1[2] * a0[1]) \\ \end{aligned} $$
$$ \begin{aligned} & f0[3] = a0[3] \oplus (a0[3] * a0[1]) \oplus (a0[2] * a0[0]) \oplus (a0[3] * a0[2] * a0[1]) \oplus (a0[3] * a0[2] * a0[0]) \\ & f1[3] = a1[3] \oplus (a1[3] * a0[1]) \oplus (a0[2] * a1[0]) \oplus (a1[3] * a0[2] * a0[1]) \oplus (a1[3] * a0[2] * a1[0]) \\ & f2[3] = (a0[3] * a1[1]) \oplus (a1[2] * a0[0]) \oplus (a0[3] * a1[2] * a1[1]) \oplus (a0[3] * a1[2] * a0[0]) \\ & f3[3] = (a1[3] * a1[1]) \oplus (a1[2] * a1[0]) \oplus (a1[3] * a1[2] * a1[1]) \oplus (a1[3] * a1[2] * a1[0]) \\ & f4[3] = (a0[3] * a0[2] * a1[1]) \oplus (a0[3] * a0[2] * a1[0]) \\ & f5[3] = (a0[3] * a1[2] * a0[1]) \oplus (a0[3] * a1[2] * a1[0]) \\ & f6[3] = (a1[3] * a0[2] * a1[1]) \oplus (a1[3] * a0[2] * a0[0]) \\ & f7[3] = (a1[3] * a1[2] * a0[1]) \oplus (a1[3] * a1[2] * a0[0]) \\ \end{aligned} $$
(2)

The two-share S0 and S1 boxes are then integrated to form the top level two-share architecture by replicating the linear blocks namely key XORs, registers, x1, x4, x8 multipliers and multiplexers twice for processing the two input shares. The two shares of the plain text are derived by using a LFSR-based Pseudo-Random Number Generator (PRNG) block which generates a 16-bit random number. Thereby the top-level serial based two-shared CLEFIA Architecture occupies around 3955 GE which is about three times more than the unprotected implementation with a power consumption of 30 mW. The throughput decreases by 20% because of the extra clock cycles required due to the pipelined S0 and S1 boxes. Based on our survey, since there is no TI work on CLEFIA in the existing literature, we compare our results with that of AES cipher. A recent low-cost AES TI with similar number of shares have reported an area of 6053 GE on a byte serial architecture [12]. However the proposed lightweight CLEFIA cipher with an efficient TI implementation has 34% lesser area, making it still suitable for resource constrained applications. Since the power consumption is technology dependent, we do not attempt to compare the same with other works.

4 PAA Resistance Analysis

PAAs are Known Plain Text/Cipher Text attacks that recover the secret key byte-by-byte. The first part of this section explains the setup developed to perform the PAA while the second part describes the results obtained for protected and unprotected implementations.

4.1 PAA Setup

  • Threat model: It is assumed that the attacker has physical access to the cryptographic device implementing the encryption algorithm and can control the data inputs. The attacker can also observe the outputs of the performed operation, without the knowledge of the secret key.

  • Attack Data-path: The selection function in our work is the output of the 8-bit S-boxes S0/S1 of first round since S-boxes are responsible to induce the confusion property of the cipher. The data-path computes and stores one S-box computation per clock cycle.

  • Predictions Phase: For ‘m’ plain-text values, the possible key values for an 8-bit attack path is ks(i) = 0 to 255. The data-path is evaluated using MATLAB code and the S-box output values are obtained. The size of the S-box output matrix is m x 256. The DPA attack work on bit models (any bit of the S-box) and CPA/MIA can be performed using HWM or Hamming Distance Model (HDM) or Signed Bit Model (SBM).

  • Measurement Phase: The implementation of the attack data-path is done on SCA Standard Evaluation Board (SASEBO) namely SAKURA-G which is embedded with two FPGAs: (i) The main FPGA namely Cryptographic FPGA – Spartan 6 xc6slx75 and (ii) The secondary FPGA namely Control FPGA – Spartan 6 xc6slx9. The amplified and de-noised supply current of the design implemented in Main FPGA is tapped through SMA coaxial cables. The current traces are captured using KeySight Mixed Signal Oscilloscope MSO6014A. The oscilloscope captures the trigger signal and the current trace from the FPGA board.

  • Pre-processing and Attack Phase: The efficient pre-processing phase of our work consists of the following features: (i) The measured current samples contain very little noise since all the capacitors on Vcore line are removed and voltage drop (power trace) caused by the 1-ohm shunt resistor inserted between the cryptographic FPGA and the Vcore line is captured. (ii) The measured current samples contain a sufficient current level for the attack since they are obtained from the in-built amplifier on the board with a bandwidth of 360 MHz and a gain of +20 dB. (iii) The required features of the measured current samples are extracted using Python. The pre-processed samples are then re-arranged into a trace matrix m x n. The pre-processed traces are then analyzed using DPA-CPA/MIA/TVLA attack scripts in MATLAB.

4.2 Attack Results

With our developed attack setup, we have performed two styles of PAA: (i) Evaluation style attacks- which involves forming a theoretical model and evaluating the implementation against different attack strategies. The analysis types like DPA [13], CPA [14] and MIA [15] fall under this category. DPA uses difference of means (DoM) method to predict the key while CPA uses a statistical distinguisher namely Pearson’s Correlation Coefficient (CC) to correlate between the measurements obtained and predictions made. The idea of using information theory metrics to quantify the dependence between the leakage and power consumption model was proposed by Benedikt et al., as MIA in [15]. The advantage of this method is that it relaxes the strict requirement of linear dependency between the measurements and predictions as in CPA and exploits comparatively more information than DPA. Hence the MIA based attacks are independent of the attacking platform and the variables need not be strictly linear. In our work, the Mutual Information (MI) between the random variable X (power consumption) and the leakage L (predictions) is calculated in bits using MATLAB code. For each secret key value, MI function processes the inputs column by column and returns an ‘m x n’ matrix, where ‘m’ corresponds to the number of traces and ‘n’ corresponds to the sampling points. The maximum value of MI is subsequently recorded, the index of which returns the key. Our work reports a metric named Minimum Time to Disclosure (MTD) that describes the cross-over point of the correct Correlation Coefficient (CC) with the CC curves of the other key guesses. We have used this metric to quantify the CPA and MIA attacks in our work. (ii) Conformance style attacks- which involves evaluating the leakages independently, to determine if there is a SCA vulnerability, rather than estimating the correct key. The very popular TVLA [16] falls under this category which performs t-tests using the statistical mean and variance parameters. Based on the literature, acceptable t-value across all the sampling points is ≤±4.5. If t-value exceeds the threshold, the implementation is considered to be vulnerable to PAA. This methodology does not determine the exact secret key value or the number of traces required for successful key retrieval. In certain applications it is required to strongly determine the implementation vulnerability of a system and this methodology provides its usefulness in those scenarios.

Table 2 shows the metrics comparison of protected and unprotected implementations of a few keys. While the un-protected implementations have shown successful key recoveries with less than 768 traces for all the attack categories, the protected implementation does not reveal the secret key even with 100,000 traces for first-order attacks. Hence sufficient PAA protection is achieved using the proposed two-share CLEFIA architecture. Figures 3, 4 and 5 shows the attack results of CPA and TVLA.

Table 2. PAA resistance analysis
Fig. 3.
figure 3

(a) CPA-HWM CC plot (b) MTD plot for key byte = 165 on unprotected CLEFIA S0 box

Fig. 4.
figure 4

(a) CPA-HWM CC plot (b) MTD plot for key byte = 165 on protected CLEFIA S0 box

Fig. 5.
figure 5

Non-specific TVLA results for (a) un-protected (b) protected implementation

5 Conclusion

CLEFIA is an ISO/IEC standard lightweight cryptographic algorithm suitable for resource constrained applications. We have proposed a 16-bit novel serial CLEFIA architecture with a low-cost CFA based S1 box and ANF based S0 box. The proposed serial architecture has achieved a very low area of 1345 GE with a power consumption of 1.3 mW. Since PAA is shown to be a stronger threat to the cryptographic implementations, we have developed a novel two-share TI implementation for the proposed CLEFIA architecture to mitigate the same. The proposed TI implementation occupies an area of 3955 GE, which is 34% smaller than a similar AES TI implementation. We have briefed the PAA setup created using SAKURA-G board and performed the Evaluation and Conformance style attacks. While the unprotected implementation is easily attacked with less than 768 traces, the protected implementation shows resistance to PAA using 100,000 traces.

The first-order PAA resistant CLEFIA architecture is suitable for resource constrained applications. The higher order PAA resistance of the proposed TI implementation will be analyzed in future.