Keywords

1 Introduction

Most devices in IoT environments are miniature and resource-constrained. Therefore, lightweight cryptography, which is an active area of research, is essential. Some lightweight block ciphers such as PRESENT [22] and CLEFIA [44] have been standardized by ISO/IEC. Additionally, a lightweight cryptography standardization project is ongoing at NIST. A variety of lightweight block ciphers have been introduced in the literature, many of which are hardware-friendly systems [9, 22, 24, 33, 43]. Others focus on software performance [10, 28, 31] or both hardware and software performance [8, 12, 14, 32, 49].

In 1996, Paul Kocher first proposed side-channel attacks, which extract secret information by analyzing side-channel information [37]. Since secure designs for mathematical cryptanalysis cannot guarantee security against side-channel attacks, various countermeasures have been studied. With side-channel attacks becoming more advanced and the associated equipment cost decreasing [47], the application of side-channel countermeasures to ciphers has become critical. Recently, various studies have been actively conducted on efficient implementations of side-channel countermeasures, especially on efficient masked implementations. To minimize performance overhead, the focus has been on reducing the number of nonlinear operations. Several lightweight block ciphers, whose design goal is a low nonlinear operation count, have been proposed [2, 28, 31]. These are intended for use in side-channel protected environments.

However, most existing lightweight block ciphers are unsuitable for at least one type of hardware, software, or side-channel protected implementations in low resource environments. Consequently, it is challenging to design general-purpose lightweight block ciphers capable of operating in any resource-constrained environment.

In this paper, we propose a new lightweight versatile block cipher PIPO. During the PIPO design process, the focus was on minimizing the number of nonlinear operations because this is the most significant factor for efficient higher-order masking implementations. To construct an 8-bit S-box with a small number of nonlinear operations, we devised a unbalanced-Bridge structure that accepts one 3-bit and two 5-bit S-boxes and produces 8-bit S-boxes. This structure allows us to construct a new 8-bit S-box that offers good cryptographic properties and an efficient bitsliced implementation including only 11 nonlinear bitwise operations. We also present theorems applied to the unbalanced-Bridge structure, which show the conditions that the both differential and linear branch numbers of the S-boxes constructed through the structure are greater than 2. We investigated the linear layer with the highest security against differential and linear cryptanalyses when combined with the new S-box, through which we could reduce the number of rounds of PIPO. Consequently, PIPO achieves fast higher-order masking implementations, and its execution time increases less with the number of shares (i.e., the masking order) compared with other lightweight 64-bit block ciphers with 128-bit keys. Additionally, PIPO records excellent performance on 8-bit microcontrollers and competitive round-based hardware implementations.

The following notation and definitions are used throughout this paper.

DDT

Difference Distribution Table of an n-bit S-box whose

\((\varDelta \alpha , \varDelta \beta )\) entry is \(\#\{x \in \mathbb {F}^{n}_{2}|S(x) \oplus S(x \oplus \varDelta \alpha ) = \varDelta \beta \}\),

where \(\varDelta \alpha , \varDelta \beta \in \mathbb {F}^{n}_{2}\)

LAT

Linear Approximation Table of an n-bit S-box whose \((\lambda _\alpha ,\lambda _\beta )\)

entry is \(\#\{x \in \mathbb {F}^{n}_{2}| \lambda _\alpha \bullet x = \lambda _\beta \bullet S(x)\} - 2^{n-1}\),

where \(\lambda _\alpha , \lambda _\beta \in \mathbb {F}^{n}_{2}\), and the symbol \(\bullet \) denotes the canonical

inner product in \(\mathbb {F}^{n}_{2}\)

Differential uniformity

\(\underset{\varDelta \alpha \ne 0, \varDelta \beta }{\text{ max }}\#\{x \in \mathbb {F}^{n}_{2}|S(x) \oplus S(x \oplus \varDelta \alpha ) = \varDelta \beta \}\)

Non-linearity

\(2^{n-1} - 2^{-1}\times \underset{\lambda _\alpha , \lambda _\beta \ne 0}{\text{ max }}|\Phi (\lambda _\alpha , \lambda _\beta )|\), where \(\Phi (\lambda _\alpha , \lambda _\beta )\)

\(=\underset{x \in \mathbb {F}^{n}_{2}}{\sum } (-1)^{\lambda _\beta \bullet S(x) \oplus \lambda _\alpha \bullet x}\)

DBN

Differential Branch Number of an S-box defined as

\(\underset{a,b\ne a}{\text{ min }}(wt(a \oplus b)+wt(S(a)\oplus S(b)))\)

LBN

Linear Branch Number of an S-box defined as

\(\underset{a,b,\Phi (a,b)\ne 0}{\text{ min }}(wt(a)+wt(b))\)

2 Specification of PIPO

The PIPO block cipher accepts a 64-bit plaintext and either a 128 or 256-bit key, generating a 64-bit ciphertext. It performs 13 rounds for a 128-bit key and 17 rounds for a 256-bit key. Each round is composed of a nonlinear layer denoted as the S-layer, a linear layer denoted as the R-layer, and round key and constant XOR additions. The overall structure of PIPO is depicted on the left side of Fig. 1. Here, \(RK_0\) is a whitening key and \(RK_1,RK_2, \cdots , RK_{r}\) are round keys, where \(r=13\) (128-bit key) or 17 (256-bit key). The i-th round constant \(c_{i}\) is i (the round counter) which is XORed with \(RK_i\). During the enciphering process, the intermediate state is regarded as an \(8 \times 8\) array of bits, as shown on the right side of Fig. 1, where X[i] represents the i-th row byte for \(i=0\sim 7\). The S-layer executes eight identical 8-bit S-boxes (denoted as \(S_{8}\)) in parallel. The \(S_{8}\) is applied to each column of the \(8 \times 8\) array of bits, where the uppermost bit is the least significant. The \(S_{8}\) is shown in Table 7 of Appendix C.1. The R-layer rotates the bits in each row by a given offset (Fig. 2).

The key schedule of PIPO is very simple. We first split a 128-bit master key K into two 64-bit subkeys \(K_0\) and \(K_1\), i.e., \(K=K_1 || K_0\). The whitening and round keys are then defined as \(RK_i = K_{i~mod~2}\), where \(i=0,1,2,\cdots ,13\). Similarly, a 256-bit master key K is divided into four 64-bit subkeys \(K_0\), \(K_1\), \(K_2\), and \(K_3\), i.e., \(K=K_3 || K_2 || K_1 || K_0\). In this case, \(RK_i = K_{i~mod~4}\) where \(i=0,1,2,\cdots ,17\). Some test vectors for PIPO are provided in Appendix A. Note that resistance to related-key attacks was not considered when designing the PIPO cipher.

Fig. 1.
figure 1

Overall structure (left) and intermediate state (right) of PIPO

Fig. 2.
figure 2

R-layer

3 Design Rationale of PIPO

3.1 S-Layer

Overall Structure. We focused on the following three criteria when designing our 8-bit S-box, \(S_8\).

  1. 1.

    It should offer an efficient bitsliced implementation including 11 or fewer nonlinear operations.

  2. 2.

    Its differential and linear branch numbers (DBN and LBN) should both be greater than 2.

  3. 3.

    Its differential uniformity should be 16 or less, and its non-linearity should be 96 or more.

Criterion 1 minimizes the number of nonlinear operations required by PIPO, which allows for efficient higher-order masking implementations. Criteria 2 and 3 ensure the cryptographic strengths of the \(S_8\) against differential cryptanalysis (DC) and linear cryptanalysis (LC). Any inferior criteria will lead to the implementation of more rounds to achieve acceptable security against these attacks, eventually resulting in a weak proposal. The thresholds of the criteria were selected based on the properties of the existing lightweight 8-bit S-boxes (refer to Table 1).

The Bridge structure was first proposed in [36], and revisited in [15]. In order to construct an \(S_8\) satisfying all the aforementioned three criteria, we employed the unbalanced-Bridge structure depicted in Fig. 3, where \(S_i^j\) represents the j-th and i-bit S-box in the structure. This structure has the following three characteristics. First, it uses 3-bit and 5-bit S-boxes instead of 4-bit S-boxes. We observe that 8-bit S-box constructions using three 4-bit S-boxes would have difficulty satisfying criterion 1, even though they conform to criteria 2 and 3. Second, all eight output bits are generated from at least two smaller S-boxes (to meet criterion 3). Finally, at least one non-bijective smaller S-box can be adopted to increase the number of possible combinations of smaller S-boxes.

Fig. 3.
figure 3

The unbalanced-Bridge structure

The notation used in this section is introduced below.

\(\rho _c:\mathbb {F}^{5}_{2}\rightarrow \mathbb {F}^{5}_{2}, \,\,\,\rho _c(x||y)=y||x,\,\,\,\text {for}\,\,x\in \mathbb {F}^{3}_{2},\,y\in \mathbb {F}^{2}_{2},\)

\(\tau _n:\mathbb {F}^{5}_{2}\rightarrow \mathbb {F}^{n}_{2},\,\,\,\tau _n(x||y)=x,\,\,\,\text {for}\,\,x\in \mathbb {F}^{n}_{2},\,y\in \mathbb {F}^{5-n}_{2},\,n\in \{1,2,3,4\},\)

\(\tau _n':\mathbb {F}^{5}_{2}\rightarrow \mathbb {F}^{n}_{2},\,\,\,\tau _n'(x||y)=y,\,\,\,\text {for}\,\,x\in \mathbb {F}^{5-n}_{2},\,y\in \mathbb {F}^{n}_{2},\,n\in \{1,2,3,4\},\)

\(\mathfrak {F}^1_A:\mathbb {F}^{3}_{2}\rightarrow \mathbb {F}^{5}_{2},\,\,\,\mathfrak {F}^1_A(X)=(S^{1}_5)^{-1}(X||A)\,\,\,\text {for}\,\,A\in \mathbb {F}^{2}_{2},\)

\(\mathfrak {F}^2_A:\mathbb {F}^{3}_{2}\rightarrow \mathbb {F}^{5}_{2},\,\,\,\mathfrak {F}^2_A(X)=S^{2}_5(X||A)\,\,\,\text {for}\,\,A\in \mathbb {F}^{2}_{2}.\)

Now we can define an 8-bit S-box constructed by the unbalanced-Bridge. Let \(S_{8}(X_{L}||X_{R})=C_{L}(X_{L}, X_{R}) ||\) \(C_{R}(X_{L}, X_{R})\), where \(X_{L}\) and \(X_{R}\) represent the input variables of the \(S_8\) which are in \(\mathbb {F}^{5}_{2}\) and \(\mathbb {F}^{3}_{2}\), respectively. Then \(C_{L}(X_{L},X_{R})=\tau _3(S^{1}_{5}(X_{L}))\oplus S_{3}(X_{R})\) and \(C_{R}(X_{L},X_{R})=\rho _c(S^{2}_{5}(S^{1}_{5}(X_{L})\oplus (S_{3}(X_{R})||0^{(2)})))\oplus \) \((0^{(2)}||S_{3}(X_{R}))\) with \(C_{L}:\mathbb {F}^{5}_{2}\times \mathbb {F}^{3}_{2}\rightarrow \mathbb {F}^{3}_{2}\) and \(C_{R}:\mathbb {F}^{5}_{2}\times \mathbb {F}^{3}_{2}\rightarrow \mathbb {F}^{5}_{2}\). Proposition 1 shows the conditions for the 8-bit S-box constructed by the unbalanced-Bridge to be bijective.

Proposition 1

The 8-bit S-box constructed using the unbalanced-Bridge is bijective if and only if the following three conditions are all satisfied:

  1. i)

    \(S_{3}\) is bijective.

  2. ii)

    \(S^{1}_{5}\) is bijective.

  3. iii)

    For all \(y\in \mathbb {F}^{3}_{2}\), \(f_{y}(x)=\tau _2'(S^{2}_{5}(y||x))\) is a bijective function with \(f_{y}:\mathbb {F}^{2}_{2}\rightarrow \mathbb {F}^{2}_{2}\).

Proof

Refer to Appendix B.1.

Construction of 8-Bit S-Boxes with DBN\(\varvec{>2}\) and LBN\(\varvec{>2}\) and Our \(\varvec{S}_{\varvec{8}}\) Selection. We present here how to construct 8-bit S-boxes with DBN\(>2\) and LBN\(>2\). Our framework is to eliminate all the input and output differences (masks) where the sum of their Hamming weights is 2. During this elimination process, we can obtain some conditions of smaller S-boxes. Theorems 1 and 2 present the necessary and sufficient conditions of smaller S-boxes so that the 8-bit S-boxes constructed by Fig. 3 have both differential and linear branch numbers greater than 2.

Theorem 1

The DBN of bijective 8-bit S-boxes constructed using the unbalanced-Bridge is greater than 2 if and only if conditions i), ii), and iii) are all satisfied (\(\varDelta \alpha \) and \(\varDelta \beta \) below represent arbitrary differences where \(wt(\varDelta \alpha )=wt(\varDelta \beta )=1\)):

  1. i)

    For each \(\varDelta \alpha ,\varDelta \beta \in \mathbb {F}^3_2\), at least one of the entry \((\varDelta \alpha , \varDelta \beta )\) in DDT of \(S_3\) and the entry \((\varDelta \beta ||0^{(2)}, \varDelta \beta ||0^{(2)})\) in DDT of \(S^2_5\) is 0,

  2. ii)

    For each \(\varDelta \alpha ,\varDelta \beta \in \mathbb {F}^5_2\), for each \(A,B(\ne A)\in \mathbb {F}^2_2\), at least one of \(\mathfrak {F}^{1}_{A}(X)\oplus \mathfrak {F}^{1}_{B}(X)=\varDelta \alpha \) and \(\mathfrak {F}^{2}_{A}(X)\oplus \mathfrak {F}^{2}_{B}(X)=\varDelta \beta \) has no solution X, where \(X\in \mathbb {F}^3_2\),

  3. iii)

    For each \(\varDelta \alpha \in \mathbb {F}^3_2\) and \(\varDelta \beta \in \mathbb {F}^5_2\), for each \(A,B\in \mathbb {F}^2_2\), at least one of \(\mathfrak {F}^{1}_{A}(X)\oplus \mathfrak {F}^{1}_{B}(X\oplus \varDelta \alpha )=\varDelta \beta \) and \(\mathfrak {F}^{2}_{A}(X)\oplus \mathfrak {F}^{2}_{B}(X\oplus \varDelta \alpha )=\varDelta 0\) has no solution X, where \(X\in \mathbb {F}^3_2\).

Proof

Refer to Appendix B.2.

The following theorem concerning the LBN can be similarly obtained.

Theorem 2

The LBN of bijective 8-bit S-boxes constructed using the unbalanced-Bridge is greater than 2 if and only if conditions i), ii), and iii) are all satisfied (\(\lambda _\alpha \) and \(\lambda _\beta \) below represent arbitrary masks where \(wt(\lambda _\alpha )=wt(\lambda _\beta )=1\)):

  1. i)

    For each \(\lambda _\alpha ,\lambda _\beta \in \mathbb {F}^3_2\), at least one of the entry \((\lambda _\alpha ,\lambda _\beta )\) in LAT of \(S_3\) and the entry \((0,\lambda _\beta ||0^{(2)})\) in LAT of \(S^2_5\) is 0,

  2. ii)

    For each \(\lambda _\alpha \in \mathbb {F}^5_2\) and \(\lambda _\beta \in \mathbb {F}^3_2\), \(\sum _{A\in \mathbb {F}_2^2} X\cdot Y=0\) where X is the entry \((\lambda _\beta ,\lambda _\alpha )\) in LAT of \(\mathfrak {F}^1_A\) and Y is the entry \((\lambda _\beta ,\lambda _\beta ||0^{(2)})\) in LAT of \(\mathfrak {F}^2_A\),

  3. iii)

    For each \(\lambda _\alpha ,\lambda _\beta \in \mathbb {F}^5_2\) satisfying \(\tau _3(\lambda _\beta )=0\), \(\sum _{A\in \mathbb {F}_2^2} X\cdot Y=0\) where X is the entry \((0,\lambda _\alpha )\) in LAT of \(\mathfrak {F}^1_A\) and Y is the entry \((0,\lambda _\beta )\) in LAT of \(\mathfrak {F}^2_A\).

Proof

Refer to Appendix B.3.

Our \(S_8\) search process is outlined as follows. First, we generated 3-bit and 5-bit S-box sets; for 3-bit S-boxes we ran an exhaustive search with AND, OR, XOR, and NOT instructions while restricting the number of nonlinear (resp. linear) operations to 3 (resp. 4), and for 5-bit S-boxes we ran an exhaustive search with AND, OR, and XOR instruction while restricting the number of nonlinear (resp. linear) operations to 4 (resp.7) with a differential uniformity of 8 or less. Second, we classified two 5-bit S-boxes and one 3-bit S-box that satisfy the conditions of Proposition 1, Theorems 1 and 2. During this process, the search space for the \(S_8\) could be significantly reduced because the early abort technique was used to select \(S_3\), \(S_1^5\), and \(S_2^5\). Third, we randomly chose the combination of \(S_3\), \(S_5^1\), and \(S_5^2\) to verify whether the corresponding 8-bit S-box satisfies criterion 3, and no fixed point. During the search, we found more than 8,000 candidates for the \(S_8\). We selected the one that leads to the best resistance to differential and linear attacks when combined with the linear layer of PIPO (refer to section 3.2). The final selected input/output values of the \(S_8\) are presented in Table 7; its bitsliced implementation is given in Appendix C.2.

Table 1 compares the cryptographic properties and operations with those of other 8-bit S-boxes built from smaller 3 S-boxes.

Table 1. Comparison of bitslice 8-bit S-boxes with respect to cryptographic properties and number of operations

3.2 R-Layer

To ensure efficient hardware and software implementations, we chose the R-layer to be a bit permutation which only uses bit-rotations in bytes. Its bitsliced implementation is given in Listing 1.1. During the design of the R-layer, the following criteria were considered.

  1. 1.

    The number of rounds to achieve full diffusion – through which any input bit can affect the entire output bits – should be minimized.

  2. 2.

    Combining the R-layer with the S-layer should enable the cipher to have the best resistance to DC and LC (among all bit permutations satisfying the first criterion).

To meet the first criterion, we adopted a bit permutation that enables PIPO to achieve full diffusion in two rounds by using rotation offsets \(0\sim 7\) for all rows. The second criterion was taken into account when deciding which rotation to use for which row. We applied all 5,040(=7!) R-layers (except for all rotation equivalences) to the S-layer and selected one with the lowest probabilities of 6 and 7-round best differential and linear trails. Our analysis found that the selected combination of the S and R layers provides superior resistance to DC and LC than any other combinations even when other S-boxes among the aforementioned candidates were chosen. Note that most combinations of S and R layers candidates could not provide best 7-round differential and linear trails with less than probability \(2^{-64}\).

figure a

4 Security Evaluation of PIPO

Table 2 shows the maximum numbers of rounds of characteristics and key recovery attacks that we found for each attack [3, 18,19,20, 40, 42, 46]. In addition to the cryptanalysis shown in Table 2, we conducted algebraic attack [23], integral attack [48], statistical saturation attack [25], invariant subspace attack [38, 39], nonlinear invariant attack [45] and slide attack [21], but they were not applied more effectively than DC or LC.

Table 2. The numbers of rounds of the best characteristics for each cryptanalysis

One of the major design considerations for PIPO is to adopt a compact number of rounds (not enough rounds to guarantee security that is (too) high) based on thorough security analyses. We discovered that the best attacks applied to PIPO are DC and LC. An exhaustive search (based on the branch and bound technique [41]) for the DC and LC distinguishers was performed, in which the best reaches 6 rounds. Our analyses could recover the key of up to 9 and 11 rounds of PIPO-64/128 and PIPO-64/256, respectively.

5 Performance Evaluation of Higher-Order Masking Implementations of PIPO

Bitsliced implementations, initially proposed by Biham [17], are known to be efficient when applying Boolean masking, since secure S-box computations can be carried out in parallel [29,30,31, 34]. Thus, we used an S-box that can be efficiently implemented in this way, and only involves 11 nonlinear bitwise operations. The number of nonlinear operations is very important for Boolean masking schemes, since they have a quadratic complexity, i.e., \(O(d^2)\), compared with the linear complexity, i.e., O(d), for other operations.

Table 3. Comparison of required ROM (bytes) for round constant, number of nonlinear bitwise operations, and permutation layers of round functions
Fig. 4.
figure 4

Execution times of one-block encryptions according to the number of shares in an Atmel AVR XMEGA128 (1 means unprotected)

We constructed PIPO using higher-order masked S-layer and R-layer. There are several variations of ISW-AND [6, 7, 16], however, in this paper, we apply original ISW-AND. Since logical OR of two inputs a and b satisfies \(a \vee b = (a \wedge b) \oplus a \oplus b\), thus, ISW-OR can be calculated by replacing logical AND with ISW-AND.

We compare our proposed PIPO with 64-bit block ciphers with 128-bit keys as shown in Table 3. All the ciphers compared were implemented using bitslice techniques, and only round constants were precomputed. There is no need to precompute round constants of PIPO, RoadRunneR, and PRESENT, because they are the i or \(NR-i\) for \(i=0,1,\cdots ,NR-1\), where NR is the number of rounds. Therefore, the required ROM for round constants is shown in Table 3. Only CRAFT used an additional 16-byte diffusion table for generating tweakeys. The same secure logical operations of PIPO were applied to implement higher-order masking structures.

Figure 4 shows the execution times for different numbers of shares on an 8-bit AVR processor. Especially, it shows that the more nonlinear operations, the greater increase in execution time with the number of shares, refer to Table 3. PIPO has the smallest number of nonlinear operations.

6 Performance Evaluation of Software and Hardware Implementations of PIPO

6.1 Software Implementations

The PIPO block cipher consists of permutation (R-layer) and S-box (S-layer) computations. The permutation routine is performed in 8-bit rotation operations, and 22 XOR, 6 AND, 5 OR, 1 COM and 24 MOV instructions are used to compute the S-box. This uses a total of 21 general-purpose registers: six for temporal storage, one for a zero constant, eight for a plaintext, four for address pointers and two for counter variables.

The developers of SIMON and SPECK have proposed a new metric to measure overall performance on low-end devices, namely RANK [11]. This is calculated as follows:

$$RANK = (10^6/CPB)/(ROM + 2 \times RAM).$$

In this metric, higher values of RANK correspond to better performance. Table 4 compares results for several block ciphers on an 8-bit AVR platform. Here, we used Atmel Studio 6.2, and compiled all implementations with optimization level 3. The target processor was an ATmega128 running at 8 MHz [4]. PIPO requires 320 bytes of code, 31 bytes of RAM and an execution time of 197 CPB. We used the RANK metric to compare the ciphers’ overall performances, finding that PIPO achieved the highest score among block ciphers with the same parameter lengths.

Table 4. Comparison of block ciphers on 8-bit AVR*

6.2 Hardware Implementations

We implemented PIPO-64/128 and PIPO-64/256 in Verilog, and synthesized the proposed architectures using the Synopsys Design Compiler with 130 nm CMOS technology. Figure 5 shows the datapath of an area-optimized encryption-only PIPO block cipher, which performs one round per clock cycle (i.e., uses a 64-bit-wide datapath). The S-layer uses the same 8-bit S-box 8 times, whereas the R-layer is implemented in wiring. For lightweight key generation, we obtain the round key from the master key, directly. This feature avoids including the key storage. Our implementations require 13 and 17 clock cycles to encrypt a 64-bit plaintext with 128-bit and 256-bit keys, respectively.

Fig. 5.
figure 5

Datapath of an area-optimized version of PIPO

Table 5. Area requirement of PIPO-64/128 and PIPO-64/256.

Table 5 shows the areas required by PIPO-64/128 and PIPO-64/256. Most of the areas are taken up by the S-layer, in order to compute eight 8-bit S-boxes in parallel. The flip-flops are used for storing plaintext and counter, and the other areas consist of MUX and other logical operations.

Table 6 compares the results for several different block ciphers implemented as ASICs. Compared with the other block ciphers using the same parameter lengths, PIPO needs more gates than CRAFT, Piccolo and SIMON but its cycles per block are much lower, resulting in the highest figure of merit FOM (nano bits per clock cycle per GE squared [5, 32]). It is obvious that the high FOM of PIPO requires less energy and battery consumption.

Table 6. Comparison of round-based and area optimized implementations for block ciphers using 130 nm ASIC library.

7 Conclusion

In this paper, we proposed a new lightweight versatile block cipher PIPO suitable for diverse resource-constrained environments. In particular, PIPO exhibits excellent performance in both side-channel protected and unprotected environments on 8-bit microcontrollers, and fast round-based hardware implementations as well. Furthermore, a thorough security analysis of PIPO was conducted.