1 Introduction

Historically, the field of lightweight cryptography has focused on designing algorithms that leave as small a footprint as possible when manufactured in silicon. Small area implicitly results in low power consumption, which is another, equally important, optimization target.

Hitting these two targets comes at a price: the performance and energy consumption of lightweight cryptographic primitives are far from competitive and often fail to meet the requirements of online applications. Only a handful of designs consider latency and energy consumption among their main design goals; PRINCE [3] and Midori [2] are two prominent examples.

The inherent vulnerability to physical attacks is a serious threat that the field of lightweight cryptography has been exposed to since its creation. Side-channel analysis is one of the most powerful examples of such attacks. To resist an adversary that has access to up to d wires inside the circuit [27], the secret value has to be split into at least \(d+1\) random shares using a masking technique. An example of such a scheme is Boolean masking, where the secret is shared using Boolean addition and the shares are then processed independently, in a way that prevents revealing the secret information.
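As an illustration, splitting a secret into Boolean shares and recombining them can be sketched as follows. This is our own minimal example (64-bit values assumed), not a side-channel secure implementation by itself:

```python
import secrets

def share(secret: int, n_shares: int) -> list[int]:
    """Split a 64-bit secret into n_shares random Boolean shares."""
    shares = [secrets.randbits(64) for _ in range(n_shares - 1)]
    # The last share is chosen so that the XOR of all shares equals the secret.
    last = secret
    for s in shares:
        last ^= s
    return shares + [last]

def reconstruct(shares: list[int]) -> int:
    """Recombine shares with Boolean addition (XOR)."""
    acc = 0
    for s in shares:
        acc ^= s
    return acc
```

Any proper subset of the shares is uniformly random and independent of the secret; only the XOR of all \(d+1\) shares reveals it.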

In order to attack a masked implementation, attackers need to extract and combine the secret information from several shares, i.e., they need to employ a higher-order attack of degree at least d. These attacks are harder to mount because they are sensitive to the amount of noise collected during trace acquisition. Higher-order SCA protection incurs penalties in silicon area, execution time, power consumption and the number of random bits required for secure execution. The increased cost comes from the number of shares that are required. When protecting a nonlinear function, the number of output shares grows significantly and depends on the number of input shares, the algebraic degree of the function, the number of nonlinear terms the function has and the security order that needs to be achieved. For example, the number of output shares in the consolidated masking scheme [41] grows exponentially with the algebraic degree of the function, the number of shares being lower bounded by \((d+1)^t\), with t being the algebraic degree.

The challenge of designing secure cryptographic circuits becomes significantly harder once strict requirements have to be met for at least one of the following metrics: silicon area, latency, power or energy consumption. In the context of this paper, and as stated in [29], we consider latency to be the total time needed to execute a single cryptographic operation. Latency can be minimized by increasing the frequency at which the circuit can operate or by reducing the clock cycle count of the operation. Hence, one design outperforms another with regard to latency if the product of the number of clock cycles and the minimal clock period is smaller for that design.

Recently, multiple examples targeting both low latency and side-channel protection emerged, producing a variety of protected implementations of AES [22, 44, 46], KECCAK [1] and PRINCE [33]. Their results indicate that low-latency side-channel design is a significantly more difficult problem than designing a countermeasure optimized for area or the amount of randomness, which are the design criteria typically addressed by the scientific community. Therefore, designing side-channel countermeasures for low-latency or low-energy implementations is considered to be an important open problem.

Threshold implementation (TI) [36] is a provably secure masking scheme specifically designed to counter side-channel leakage caused by the presence of glitches in hardware. The countermeasure removes the dependency between the number of nonlinear terms and the number of input shares, which is a big advantage over classical masking schemes. In [6], the authors extended the approach of TI to counter higher-order (uni-variate) attacks. The theory suggests the use of at least \(td + 1\) input shares in order to make a Boolean function of algebraic degree t secure against a dth-order side-channel attack. That is why the TI scheme introduced in [6] is often referred to as \(td+1\) TI. In 2015, the authors of [41] proposed a consolidated masking scheme (CMS) and reduced the number of input shares needed to resist a dth-order attack to \(d+1\), regardless of the algebraic degree of the shared function. Recall that this is the theoretical lower bound on the number of input shares with respect to the security order d. Recently, more schemes using \(d+1\) shares emerged, such as domain-oriented masking (DOM) and the unified masking approach (UMA) [23, 24], where the essential difference with CMS lies in the way the refreshing of the output shares is performed. Since the security of CMS, DOM, UMA, HPC [14], and other descendant schemes relies on the TI principles [19], in this paper we refer to all these schemes as \(d+1\) TI.

The goal of generic low-latency masking (GLM) [22] is to be a generalized concept for low-latency masking, applicable to any implementation and protection order. However, the authors have applied their concept to designs that are not low-latency, and therefore it is difficult to evaluate their approach. We stress that achieving minimal latency is not equivalent to merely executing in fewer cycles since, at the same time, the complexity of the circuit grows, resulting in a longer critical path. In other words, one obtains a design that executes in fewer cycles but also at a lower maximum frequency. It has been pointed out in [31] that all these generalized concepts have to use another re-sharing technique since the original ones have flaws for \(d > 2\).

While the established theory of TI guarantees that the number of input shares grows linearly with the order of protection d, it does not provide efficient means of keeping the exponential explosion of the number of output shares under control. The state of the art is a lower bound of \({ (d+1) }^t\) given in [41], while in [6] the authors described a method to obtain a TI sharing with \(\left( {\begin{array}{c}td+1\\ t\end{array}}\right) \) output shares. The latter work also notes that the number of output shares can sometimes be reduced by using more than \(td+1\) input shares. Aside from the formula for the lower bound in [41], there has not been much work on applying \(d+1\) TI to functions of degree higher than 2. The only exception is the AES implementations of [46, 47], where \(d+1\) TI is applied to the inversion in \(GF(2^4)\), which is a function of algebraic degree 3. However, even for this particular case, the first attempt [47] resulted in a sharing with the minimal number of output shares that did not satisfy the non-completeness property of TI. Only in the follow-up publication [46] was the sharing both correct and minimal. Also, for the particular case of a cubic function, it is fairly easy to find the minimal first-order sharing of eight output shares by an exhaustive trial-and-error approach.

1.1 Our contribution

In this paper, we introduce three techniques for optimizing threshold implementations, making low-latency implementations of side-channel secure designs practical. In particular, we first provide a generic construction for \(d+1\) TI that achieves the optimal number of output shares for any n-input Boolean function of degree \(t=n-1\) and any security order d. Second, for \(t<n-1\), we present a methodology based on discrete optimization techniques that provides optimal or near-optimal sharings of several classes of Boolean functions of any degree with up to eight variables, for first- and second-order TI. Third, for \(td+1\) TI, we present a heuristic for any n-input Boolean function of any degree t and up to second-order protection, which yields a number of output shares significantly lower than \(\left( {\begin{array}{c}td+1\\ t\end{array}}\right) \). Last but not least, we optimize the schedule of a secure serial AES implementation to achieve maximum throughput.

Next, we investigate the energy consumption of the different approaches, an important design factor that is often overlooked in the literature. To demonstrate the feasibility of the above methods, we apply the optimized \(d+1\) TI and \(td+1\) TI to a hardware implementation of PRINCE. We then demonstrate how to reduce the latency to achieve the fastest known TI-protected implementation of PRINCE, and we also show the most energy-efficient round-based secure implementation of PRINCE using \(d+1\) TI sharing.

Finally, we would like to point out that the method of minimizing the number of output shares is of general interest since it can equally well be applied to any cryptographic implementation and any design optimization criteria.

2 Preliminaries

We use small letters to represent elements of the finite field \(\mathcal {F}_2^n\). Subscripts are used to specify each bit of an element or each coordinate function of a vectorial Boolean function, i.e., \(x=(x_1, \ldots , x_n)\), where \(x_i \in \mathcal {F}_2\) and \(S(x)=(S_1(x), \ldots , S_m(x))\), where S is defined from \(\mathcal {F}_2^n\) to \(\mathcal {F}_2^m\) and the \(S_i\)'s are defined from \(\mathcal {F}_2^n\) to \(\mathcal {F}_2\). We omit subscripts if \(n=1\) or \(m=1\). We also use subscripts to represent shares of one-bit variables; the reader should be able to distinguish from the context whether we are referring to specific bits of an unshared variable or to specific shares of a variable. We denote Hamming weight, concatenation, cyclic right shift, right shift, composition, multiplication and addition by wt(.), ||, \(\ggg \), \(\gg \), \(\circ \), . and \(+\), respectively.

Every Boolean function S can be represented uniquely by its Algebraic Normal Form (ANF): \(S(x) = \sum _{i = (i_1, \ldots , i_n) \in \mathcal {F}_2^n} a_{i} x_1^{i_1}x_2^{i_2}\cdots x_n^{i_n}\). Then, the algebraic degree of a Boolean function S is \(\deg (S) = \max \{wt(i): i \in \mathcal {F}_2^n, a_i \ne 0\}\). The algebraic degree of a vectorial Boolean function S is equal to the highest algebraic degree of its coordinate functions \(S_j\).
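The ANF coefficients \(a_i\), and hence the algebraic degree, can be computed from a truth table with the fast binary Möbius transform. A small sketch with our own helper names:

```python
def moebius(tt: list[int]) -> list[int]:
    """Binary Moebius transform: truth table -> ANF coefficients a_i."""
    a = list(tt)
    n = len(a).bit_length() - 1
    for k in range(n):
        step = 1 << k
        for i in range(len(a)):
            if i & step:
                # a_i becomes the XOR of f over all subsets of i.
                a[i] ^= a[i ^ step]
    return a

def degree(tt: list[int]) -> int:
    """Algebraic degree: max Hamming weight of an index i with a_i = 1."""
    return max((bin(i).count("1") for i, c in enumerate(moebius(tt)) if c),
               default=0)
```

For instance, for \(f(x_1,x_2,x_3) = x_1x_2 + x_3\) (with \(x_1\) as the least significant index bit), the only nonzero coefficients are \(a_{011}\) and \(a_{100}\), giving degree 2.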

Two permutations S and \(S'\) are affine equivalent if and only if there exist affine permutations C and D satisfying \(S' = C \circ S\circ D\). We refer to C as the output and D as the input transformation.

2.1 Threshold implementations

We will refer to a TI sharing designed to protect against a dth-order attack simply as a dth-order TI. The most important property ensuring the security of TI even in the presence of glitches is non-completeness. The dth-order non-completeness property requires any combination of up to d component functions to be independent of at least one input share. As shown in [19], non-completeness is a necessary condition for the security of masking in the presence of glitches. When cascading multiple nonlinear functions, the first-order sharing must also satisfy uniformity: a sharing is uniform if and only if the sharing of the output preserves the distribution of the unshared output. In other words, for a given unmasked value, all possible combinations of output shares representing that value are equally likely. Finding a uniform sharing for an arbitrary vectorial Boolean function is still an open problem, although several heuristics exist, such as sharing with correction terms or adding virtual shares [8, 11]. Fortunately, when no uniform sharing is available, uniformity can still be achieved by refreshing the output shares.

Given the shares \(x_1, \ldots , x_n\), a (first- and second-order) refreshing can be realized by mapping \((x_1, \ldots , x_n)\) to \((y_1, \ldots , y_n)\) using n random values \(r_1, \ldots , r_n\) as follows:

$$\begin{aligned} y_1 =&x_1 + r_1 + r_n \nonumber \\ y_i =&x_i + r_{i-1} + r_i,\qquad i \in \{2, \ldots , n\}. \end{aligned}$$
(1)

This refreshing scheme is called ring re-masking. A simpler refreshing using \(n-1\) random values exists, especially for first-order secure implementations, as we can re-mask the shares \(x_1, \ldots , x_n\) in the following way:

$$\begin{aligned} y_i =&x_i + r_i, \qquad i \in \{1, \ldots ,n-1\},\nonumber \\ y_n =&x_n + r_1 + \cdots + r_{n-1}. \end{aligned}$$
(2)
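Both refreshings are straightforward to express in code. The sketch below (our naming; 64-bit shares assumed) implements Eqs. (1) and (2); in both cases every fresh random value cancels in the overall XOR, so the recombined value is preserved:

```python
import secrets

def ring_remask(x: list[int]) -> list[int]:
    """Ring re-masking of Eq. (1): n fresh random values r_1..r_n."""
    n = len(x)
    r = [secrets.randbits(64) for _ in range(n)]
    y = [x[0] ^ r[0] ^ r[n - 1]]           # y_1 = x_1 + r_1 + r_n
    for i in range(1, n):
        y.append(x[i] ^ r[i - 1] ^ r[i])   # y_i = x_i + r_{i-1} + r_i
    return y

def simple_remask(x: list[int]) -> list[int]:
    """First-order refreshing of Eq. (2): n-1 fresh random values."""
    n = len(x)
    r = [secrets.randbits(64) for _ in range(n - 1)]
    y = [x[i] ^ r[i] for i in range(n - 1)]
    acc = x[n - 1]
    for ri in r:                            # y_n = x_n + r_1 + ... + r_{n-1}
        acc ^= ri
    y.append(acc)
    return y
```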

An improvement in the number of random bits used when a multiplication gate is shared has been achieved in [24], where the amount of randomness required is halved compared to CMS. In [23], the authors have shown that the amount of randomness for sharing a multiplication gate can be further reduced to one-third, although this comes at a significant performance cost. Since our goal is to build low-latency side-channel secure implementations, we do not take the UMA approach. Instead, we choose CMS/DOM for the \(d+1\) TI designs. In this paper, we use the terms mask refreshing and re-masking interchangeably. We would also like to mention an alternative mask refreshing technique presented by Daemen [17], which achieves a uniform \(td+1\) sharing by reusing shares of adjacent S-box inputs to create a statewide permutation, automatically making the entire S-box layer uniform without any external randomness.

In order to prevent glitch propagation when cascading nonlinear functions, TI requires registers to be placed between the nonlinear operations. Otherwise, the non-completeness property may be violated, and leakage of the secret internal state is likely to manifest.

When sharing a nonlinear function using TI in a single-cycle implementation, the number of output shares is typically larger than the number of input shares. This is likely to occur when applying \(td+1\) TI, in which constructions like the one provided in [6] produce sharings of Boolean functions of degree t with \(\left( {\begin{array}{c}td+1\\ t\end{array}}\right) \) output shares and \(td+1\) input shares. The number of output shares in \(d+1\) TI for nonlinear functions is always larger than the fixed \(d+1\) input shares and is lower bounded by \((d+1)^t\), where t is the algebraic degree of the Boolean function. In order to minimize the number of output shares, we need to refresh and recombine (compress) some shares by adding several of them together. To prevent glitches from revealing unmasked values, decreasing the number of shares can only be done after storing these output shares in a register. The output shares that are recombined together still need to be carefully chosen so that they do not reveal any unmasked value.

A sharing of a nonlinear function can also be realized without output share explosion, albeit at the cost of an evaluation taking several cycles, using the already mentioned decomposition approach or techniques such as the Toffoli gate [20].

When using \(d+1\) TI, the relation between the input shares needs to obey a stronger requirement: the shared input variables need to be independent [41]. This can be achieved in various ways—for example, by refreshing some of the inputs or by using a technique proposed in [24].

2.2 Minimizing implementation overheads using S-box decomposition

Similar to other side-channel countermeasures, the area overhead of applying TI increases polynomially with the security order and exponentially with the algebraic degree of the function we are trying to protect. To keep the large overheads caused by the exponential dependency under control, designers often decompose higher-degree functions into several lower-degree functions. This approach was originally demonstrated in [40], where the authors implemented a TI-protected PRESENT block cipher [10] by decomposing its cubic S-box into two simpler quadratic S-boxes. Finally, the decomposition of cubic 4-bit S-boxes into chains of smaller quadratic S-boxes was given in [11], which eventually enabled compact, side-channel secure implementations of all 4-bit S-boxes.

Although decomposition of nonlinear functions into several simpler functions of lower algebraic degree is the proper approach for area reduction of TI-protected implementations, its side effect is increased latency of the S-box evaluation and hence of the entire implementation. Recall that TI requires registers to be placed between the nonlinear operations in order to prevent glitch propagation, which in turn increases the latency. We will not use this approach since our goal is to achieve low latency.

2.3 A note on latency and energy efficiency

As already mentioned, most of the effort the scientific community has spent on designing secure implementations has focused on reducing area overheads. Another important metric that has been given much attention is the amount of randomness used in protected implementations. While both of these metrics are important, the performance and energy consumption of secure implementations have been unjustly treated as less significant. It has been widely accepted that performance is the metric to sacrifice in order to achieve the lowest possible gate count. Contrary to this view, most practical applications nowadays require (very) fast execution, and it is often the latency of the actual implementation that matters rather than the throughput. Energy consumption is another equally important metric and, unlike power consumption, it cannot be well controlled by keeping the area low while sacrificing performance. Optimizing for energy consumption is, in fact, one of the most difficult optimization problems in (secure) circuit design, since the perfect balance between the circuit's power consumption and its execution speed needs to be struck.

The absolute latency is directly proportional to the number of clock cycles a certain operation takes to execute. At the same time, it is inversely proportional to the clock frequency the system is running at. While the clock frequency is determined by taking into account multiple factors from the whole system, the most important of which is the overall power/energy consumption, the number of clock cycles a certain algorithm takes to execute is under the complete control of the designer. Especially when considering embedded devices, the tendency is to keep the clock frequency as low as possible while still meeting the performance requirements. That is why minimizing the number of clock cycles of an algorithm is the most important strategy for minimizing its overall latency.

Although the majority of results available in the public literature deal with area-efficient hardware architectures, there are still a few notable examples in which latency reduction has been the main target. In [33], the authors explore the extreme case of single-clock-cycle side-channel secure implementations of PRINCE and Midori. Moreover, they conclude that designing a low-latency side-channel secure implementation of cryptographic primitives remains an open problem.

2.4 PRINCE

As a proof of concept, we apply different flavors of TI to PRINCE [3], a block cipher designed for low-latency hardware implementations. PRINCE has a 64-bit block size and a 128-bit key, which is used to derive three internally used 64-bit keys \(k_0, k_0^\prime \) and \(k_1\). Its \(\alpha \)-reflection property allows reuse of the same circuitry for both encryption and decryption.

Although not designed to be efficient in software, a bit-sliced software implementation of PRINCE is fast and can even execute in fewer clock cycles than other lightweight block ciphers such as PRESENT and KATAN [38].

Here, we give a brief overview of PRINCE; for a more detailed explanation of the cipher, we refer the reader to the original paper [3]. The block size is 64 bits, and the key has a length of 128 bits. The key is split into two 64-bit parts \(k_0 || k_1\) and expanded to \(k_0 || k_0^\prime || k_1\) as follows:

$$\begin{aligned} (k_0 || k_0^\prime || k_1) = k_0 || ((k_0 \ggg 1) + (k_0 \gg 63)) || k_1. \end{aligned}$$
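Since \(+\) here denotes addition in \(\mathcal {F}_2^{64}\) (XOR), the expansion can be sketched as follows (the function names are ours):

```python
MASK64 = (1 << 64) - 1

def rotr64(v: int, s: int) -> int:
    """Cyclic right shift on 64 bits."""
    return ((v >> s) | (v << (64 - s))) & MASK64

def expand_key(k: int) -> tuple[int, int, int]:
    """Split the 128-bit key into k0 || k1 and derive k0'."""
    k0 = k >> 64
    k1 = k & MASK64
    # k0' = (k0 >>> 1) + (k0 >> 63), with '+' being XOR
    k0p = rotr64(k0, 1) ^ (k0 >> 63)
    return k0, k0p, k1
```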

As depicted in Fig. 1, \(k_0\) and \(k_0^\prime \) are used as whitening keys at the start and at the end of the cipher, and \(k_1\) is used as the round key in \(\text{ PRINCE}_{core}\), which consists of 12 rounds: six rounds, followed by the middle involution layer, followed by the six inverse rounds.

Fig. 1: PRINCE cipher

S-box layer The S-box is a 4-bit permutation of algebraic degree 3, and its look-up table is \(S(x) = [B, F, 3, 2, A, C, 9, 1, 6, 7, 8, 0, E, 5, D, 4]\). The S-box inverse is in the same affine equivalence class as the S-box itself. Moreover, input and output transformations are the same:

$$\begin{aligned} S^{-1} = A_{io} \circ S \circ A_{io}. \end{aligned}$$
(3)

The affine transformation \(A_{io}\) is given as \(A_{io}(x) = [5, 7, 6, 4, F, D, C, E, 1, 3, 2, 0, B, 9, 8, A]\).
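Equation (3) can be verified directly from the two look-up tables; a quick self-check in Python (the table names are ours):

```python
S   = [0xB, 0xF, 0x3, 0x2, 0xA, 0xC, 0x9, 0x1,
       0x6, 0x7, 0x8, 0x0, 0xE, 0x5, 0xD, 0x4]
AIO = [0x5, 0x7, 0x6, 0x4, 0xF, 0xD, 0xC, 0xE,
       0x1, 0x3, 0x2, 0x0, 0xB, 0x9, 0x8, 0xA]

# Inverse of S as a look-up table.
S_INV = [S.index(v) for v in range(16)]

# Eq. (3): S^{-1} = A_io o S o A_io, i.e. S_inv(x) = A_io(S(A_io(x))).
assert all(AIO[S[AIO[x]]] == S_INV[x] for x in range(16))
```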

Linear layer Matrices M and \(M^\prime \) define the diffusion layer of PRINCE. \(M^\prime \) is an involution, and the matrix M can be obtained from \(M^\prime \) by adding a shift-rows operation SR, so that \(M = \hbox {SR} \circ M^\prime \). Recall that SR is a linear operation that permutes the nibbles of the PRINCE state.

\(\hbox {RC}_i\) addition is the addition of a 64-bit round constant. The round constants \(\hbox {RC}_0 \ldots \hbox {RC}_{11}\) are chosen such that \(\hbox {RC}_i + \hbox {RC}_{11-i} = \alpha \), where \(\alpha \) is a 64-bit constant. This property, called the \(\alpha \)-reflection property, together with the construction of the \(\text{ PRINCE}_{core}\) rounds, makes the decryption of \(\text{ PRINCE}_{core}\) the same as encryption with the key \(k_1 + \alpha \).

2.5 PRINCE S-box decomposition

The PRINCE S-box has algebraic degree three and belongs to class \(\mathcal{C}_{231}\) [11]. According to [11] and the tables given in [34], there are several hundred decompositions into three quadratic S-boxes and four affine transformations.

We choose a decomposition in which all three quadratic S-boxes are the same, belonging to class \(Q_{294}\), for its small area footprint. The look-up table of the \(Q_{294}\) class is given in Eq. (4), while its ANF is given in Eq. (5). Decompositions of the PRINCE S-box into three quadratic S-boxes using other classes of 4-bit quadratic S-boxes exist, but in none of them is the same quadratic S-box used in all three stages. Our choice allows an implementation that reuses the same quadratic circuit, with only the extra cost of a multiplexer to ensure correct evaluation during the three stages. Additionally, using a decomposition containing a class with no known uniform sharing (\(Q_{300}\), as stated in [11]) would mandate mask refreshing. Increased mask refreshing puts more burden on the PRNG used, making the entire system more complex, power inefficient and large [8].

$$\begin{aligned} Q_{294}(x)= & {} [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, B, A, E, F, D, C] \end{aligned}$$
(4)
$$\begin{aligned} o_1= & {} x \nonumber \\ o_2= & {} y \nonumber \\ o_3= & {} xy + z \nonumber \\ o_4= & {} xz + w. \end{aligned}$$
(5)

Decomposition leads to lower area and randomness requirements, as these depend on the algebraic degree of the function when applying TI. On the other hand, performance is penalized. The PRINCE S-box ANF \((o_1,o_2,o_3,o_4) = F(x,y,z,w)\) is given by:

$$\begin{aligned} o_1= & {} 1 + wz + y + zy + wzy + x + wx + yx \nonumber \\ o_2= & {} 1 + wy + zy + wzy + zx + zyx \nonumber \\ o_3= & {} w + wz + x + wx + zx + wzx + zyx \nonumber \\ o_4= & {} 1 + z + zy + wzy + x + wzx + yx + wyx. \end{aligned}$$
(6)
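The ANF in Eq. (6) can be checked against the S-box look-up table. In the sketch below we assume the input nibble is read MSB-first as \((x,y,z,w)\) and the output is assembled with \(o_1\) as the least significant bit; with that (assumed) ordering, the ANF reproduces the table:

```python
S = [0xB, 0xF, 0x3, 0x2, 0xA, 0xC, 0x9, 0x1,
     0x6, 0x7, 0x8, 0x0, 0xE, 0x5, 0xD, 0x4]

def sbox_anf(v: int) -> int:
    # Input nibble read MSB-first: v = 8x + 4y + 2z + w (assumed ordering).
    x, y, z, w = (v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1
    o1 = 1 ^ (w & z) ^ y ^ (z & y) ^ (w & z & y) ^ x ^ (w & x) ^ (y & x)
    o2 = 1 ^ (w & y) ^ (z & y) ^ (w & z & y) ^ (z & x) ^ (z & y & x)
    o3 = w ^ (w & z) ^ x ^ (w & x) ^ (z & x) ^ (w & z & x) ^ (z & y & x)
    o4 = 1 ^ z ^ (z & y) ^ (w & z & y) ^ x ^ (w & z & x) ^ (y & x) ^ (w & y & x)
    # Output assembled with o1 as the least significant bit (assumed ordering).
    return o1 | (o2 << 1) | (o3 << 2) | (o4 << 3)

assert all(sbox_anf(v) == S[v] for v in range(16))
```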

The S-box and S-box inverse decompositions used in our implementation are given in Eq. (7).

$$\begin{aligned} S= & {} A_1 \circ Q_{294} \circ A_2 \circ Q_{294} \circ A_3 \circ Q_{294} \circ A_4 \nonumber \\ S^{-1}= & {} A_5 \circ Q_{294} \circ A_2 \circ Q_{294} \circ A_3 \circ Q_{294} \circ A_6. \end{aligned}$$
(7)

Here, \(A_1\) to \(A_6\) are affine transformations and their respective look-up tables are: \(A_1(x) = [C, E, 7, 5, 8, A, 3, 1, 4, 6, F, D, 0, 2, B, 9]\), \(A_2(x) = [6, D, 9, 2, 5, E, A, 1, B, 0, 4, F, 8, 3, 7, C]\), \(A_3(x) = [0, 8, 4, C, 2, A, 6, E, 1, 9, 5, D, 3, B, 7, F]\), \(A_4(x) = [A, 1, 0, B, 2, 9, 8, 3, 4, F, E, 5, C, 7, 6, D]\), \(A_5(x) = [B, 8, E, D, 1, 2, 4, 7, F, C, A, 9, 5, 6, 0, 3]\), \(A_6(x) = [9, 3, 8, 2, D, 7, C, 6, 1, B, 0, A, 5, F, 4, E]\).
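Both decompositions in Eq. (7) can be verified by composing the look-up tables (composition here is right-to-left, matching the \(\circ \) notation):

```python
Q294 = [0x0,0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8,0x9,0xB,0xA,0xE,0xF,0xD,0xC]
A1 = [0xC,0xE,0x7,0x5,0x8,0xA,0x3,0x1,0x4,0x6,0xF,0xD,0x0,0x2,0xB,0x9]
A2 = [0x6,0xD,0x9,0x2,0x5,0xE,0xA,0x1,0xB,0x0,0x4,0xF,0x8,0x3,0x7,0xC]
A3 = [0x0,0x8,0x4,0xC,0x2,0xA,0x6,0xE,0x1,0x9,0x5,0xD,0x3,0xB,0x7,0xF]
A4 = [0xA,0x1,0x0,0xB,0x2,0x9,0x8,0x3,0x4,0xF,0xE,0x5,0xC,0x7,0x6,0xD]
A5 = [0xB,0x8,0xE,0xD,0x1,0x2,0x4,0x7,0xF,0xC,0xA,0x9,0x5,0x6,0x0,0x3]
A6 = [0x9,0x3,0x8,0x2,0xD,0x7,0xC,0x6,0x1,0xB,0x0,0xA,0x5,0xF,0x4,0xE]
S  = [0xB,0xF,0x3,0x2,0xA,0xC,0x9,0x1,0x6,0x7,0x8,0x0,0xE,0x5,0xD,0x4]
S_INV = [S.index(v) for v in range(16)]

def compose(*tables):
    """Right-to-left composition of look-up tables: compose(f, g)(x) = f(g(x))."""
    out = list(range(16))
    for t in reversed(tables):
        out = [t[v] for v in out]
    return out

assert compose(A1, Q294, A2, Q294, A3, Q294, A4) == S
assert compose(A5, Q294, A2, Q294, A3, Q294, A6) == S_INV
```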

The ANF of the \(Q_{294}\) function, \((x,y,z,w) = F(a,b,c,d)\), is given in Eq. (8):

$$\begin{aligned} x= & {} a \nonumber \\ y= & {} b \nonumber \\ z= & {} ab + c \nonumber \\ w= & {} ac + d. \end{aligned}$$
(8)

We recall that, for a secure implementation with this decomposition, nonlinear operations need to be separated by registers, making the evaluation of a single S-box take three clock cycles.

As discussed in Sect. 2.2, we explore the implementation of the decomposed S-box in order to address low-area and low-power applications; in what follows, we also explore the sharing of the non-decomposed S-box to address low latency and low energy.

3 Efficiency techniques

Finding a \(td + 1\) or \(d+1\) sharing for a quadratic vectorial Boolean function is fairly straightforward in general, and especially easy for functions with a simple ANF, e.g., a quadratic function with a single higher-degree term. However, finding an efficient sharing for a vectorial Boolean function of algebraic degree three or higher, with several higher-degree terms, is far less evident, and the work required to find the minimal number of output shares becomes increasingly difficult. In this section, we propose methods to deal with this complexity, and we explicitly describe an optimal solution for the \(d+1\) sharing for any security order d.

3.1 Efficient first- and second-order \(td + 1\) sharing: \((td+1, \forall {n}, \forall {t}, d \le 2)\)

To obtain a \(td + 1\) TI implementation, where t is the algebraic degree of the function and d is the security order, one needs to go through the following two computational phases:

  (a) The expansion phase, in which the shared function f uses \(s_\mathrm{{in}} \ge td+1\) input shares and results in \(s_\mathrm{{out}}\) output shares. The output share functions \(f_i\) are referred to as component functions.

  (b) The compression phase, in which the re-masked \(s_\mathrm{{out}}\) output shares stored in a register are combined again into \(s_\mathrm{{in}}\) shares.

This process takes two clock cycles, except when \(s_\mathrm{{in}} = s_\mathrm{{out}}\). For example, for first-order TI the first phase alone suffices. Additionally, if the TI sharing is uniform, the refreshing step can be removed.

The dth-order TI (more specifically, its non-completeness property) requires that any combination of up to d component functions \(f_i\) is missing at least one share of each of the input variables. The method presented in [6] demonstrates how to find a sharing with the minimum number of input shares, i.e., \(s_\mathrm{{in}} = td + 1\), which results in \(s_\mathrm{{out}} = \left( {\begin{array}{c}s_\mathrm{{in}}\\ t\end{array}}\right) \) output shares. However, this approach does not guarantee that \(s_\mathrm{{out}}\) is the theoretical minimum. Moreover, there are examples showing that increasing \(s_\mathrm{{in}}\) can decrease \(s_\mathrm{{out}}\).
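The gap between the \(td+1\) construction of [6] and the \(d+1\) lower bound of [41] is easy to tabulate; `share_counts` below is our own helper name:

```python
from math import comb

def share_counts(t: int, d: int) -> tuple[int, int]:
    """(td+1)-TI output shares per the construction of [6],
    vs the (d+1)-TI lower bound (d+1)^t of [41]."""
    return comb(t * d + 1, t), (d + 1) ** t
```

For instance, `share_counts(3, 2)` returns `(35, 27)`: for a cubic function at second order, the construction of [6] yields 35 output shares where the \(d+1\) lower bound is 27. For a first-order cubic, the picture reverses: `share_counts(3, 1)` gives `(4, 8)`, matching the eight output shares of the minimal first-order \(d+1\) sharing mentioned earlier.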

We use \(s_\mathrm{{out}}\) as a figure of merit since the number of registers required to store the output shares and the number of random bits required for refreshing increase with the number of output shares. Further on, we describe a way to find a \(td+1\) sharing with small \(s_\mathrm{{out}}\). It should be noted that the number of input shares \(s_\mathrm{{in}}\) also contributes to the total number of registers used, so it would also be interesting to investigate \(s_\mathrm{{out}} + s_\mathrm{{in}}\) as a figure of merit when investigating \(td+1\) TI sharings. However, increasing \(s_\mathrm{{in}}\) significantly increases the combinatorial logic needed in a hardware implementation, which can negatively impact area and power consumption.

For every output share, there is a subset of input shares of all variables that are permitted to be part of that share’s equation. Definition 1 introduces the notions of output sets and output sharing sets, which will be used to argue TI properties of correctness and non-completeness for \(td+1\) TI.

Definition 1

Given a \(td+1\) TI sharing of a function f with \(s_\mathrm{{in}}\) input and \(s_\mathrm{{out}}\) output shares, we enumerate the input shares as elements of a set \(\mathcal {I} = \{0, \ldots , s_\mathrm{{in}}-1\}\). With each output share \(f_o\), we associate a set \(\mathcal {O} \subseteq \mathcal {I}\) containing the indices of the input shares that are allowed to appear in the ANF of \(f_o\). We refer to \(\mathcal {O}\) as the output set of the oth output share of f. The set \(\mathcal {S}\) whose elements are all output sets of a sharing is the output sharing set.

The cardinality of an output set determines the maximum degree the corresponding output share component function can have. Further on, we will refer to a set with k elements as a k-set. It should be noted that output sets impose no special restrictions on any particular input variable: each input variable may appear with any of the allowed input shares in the output share.

Equation (9) shows an example of how a second-order secure sharing of the function \(xy + z\) can be obtained using six input shares and seven output shares. The sharing is additionally illustrated by the output sets shown on the left.

$$\begin{aligned}&\{0,1,2\} \qquad o_1 = x_0y_1 + x_1y_0 + x_0y_2 + x_1y_2+x_2y_1 \nonumber \\&\{0,3,4\} \qquad o_2 = z_4 + x_4y_4 + x_0y_3 + x_0y_4 + x_3y_4 + x_4y_3 \nonumber \\&\{1,3,5\} \qquad o_3 = z_3 + x_3y_3 + x_1y_3 + x_3y_1 + x_1y_5\nonumber \\&\qquad \qquad \qquad \qquad + x_3y_5 + x_5y_3\nonumber \\&\{2,4,5\} \qquad o_4 = z_5 + x_5y_5 + x_2y_4 + x_4y_2 + x_2y_5\nonumber \\&\qquad \qquad \qquad \qquad + x_5y_2 + x_4y_5 + x_5y_4\nonumber \\&\{0,1,4\} \qquad o_5 = z_0 + x_0y_0 + x_4y_0 + x_1y_4 + x_4y_1\nonumber \\&\{0,1,5\} \qquad o_6 = z_1 + x_1y_1 + x_0y_5 + x_5y_0 + x_5y_1\nonumber \\&\{0,2,3\} \qquad o_7 = z_2 + x_2y_2 + x_2y_0 + x_3y_0 + x_2y_3 + x_3y_2. \end{aligned}$$
(9)

As can be seen, the output sets dictate which indices of variables are allowed in the corresponding output share. For example, for \(o_1\) only input shares (i.e., indices) 0, 1 and 2 are allowed. That requirement is indeed fulfilled by the formula describing \(o_1\). Note that these sets do not uniquely define the sharing. We could, for example, remove the term \(x_0y_1\) from \(o_1\) and assign it to \(o_5\) or \(o_6\), as \(\{0,1\}\) is a subset of both \(\{0,1,4\}\) and \(\{0,1,5\}\). The output sets would still be the same, and the sharing would still be a correct second-order \(td+1\) sharing. A good heuristic for assigning monomials to output shares is that all output shares should have about the same number of terms. That way, the implementation becomes balanced and the critical path is roughly equal among all output shares.
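The correctness of Eq. (9) can be checked mechanically by recombining the output shares: for any sharing of x, y and z, the XOR of \(o_1, \ldots , o_7\) must equal \(xy+z\). A small self-check (our own code):

```python
import secrets

def shared_and(x, y, z):
    """The seven output shares of Eq. (9) for six input shares of x, y, z."""
    o1 = (x[0]&y[1]) ^ (x[1]&y[0]) ^ (x[0]&y[2]) ^ (x[1]&y[2]) ^ (x[2]&y[1])
    o2 = z[4] ^ (x[4]&y[4]) ^ (x[0]&y[3]) ^ (x[0]&y[4]) ^ (x[3]&y[4]) ^ (x[4]&y[3])
    o3 = (z[3] ^ (x[3]&y[3]) ^ (x[1]&y[3]) ^ (x[3]&y[1]) ^ (x[1]&y[5])
          ^ (x[3]&y[5]) ^ (x[5]&y[3]))
    o4 = (z[5] ^ (x[5]&y[5]) ^ (x[2]&y[4]) ^ (x[4]&y[2]) ^ (x[2]&y[5])
          ^ (x[5]&y[2]) ^ (x[4]&y[5]) ^ (x[5]&y[4]))
    o5 = z[0] ^ (x[0]&y[0]) ^ (x[4]&y[0]) ^ (x[1]&y[4]) ^ (x[4]&y[1])
    o6 = z[1] ^ (x[1]&y[1]) ^ (x[0]&y[5]) ^ (x[5]&y[0]) ^ (x[5]&y[1])
    o7 = z[2] ^ (x[2]&y[2]) ^ (x[2]&y[0]) ^ (x[3]&y[0]) ^ (x[2]&y[3]) ^ (x[3]&y[2])
    return [o1, o2, o3, o4, o5, o6, o7]

def xor(vals):
    acc = 0
    for v in vals:
        acc ^= v
    return acc

# Correctness: recombining the output shares yields xy + z for any sharing.
for _ in range(1000):
    xs = [secrets.randbits(1) for _ in range(6)]
    ys = [secrets.randbits(1) for _ in range(6)]
    zs = [secrets.randbits(1) for _ in range(6)]
    assert xor(shared_and(xs, ys, zs)) == (xor(xs) & xor(ys)) ^ xor(zs)
```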

We can reason about the properties of a given \(td+1\) sharing by inspecting its output sets. Consider a sharing of a function of degree t and the set \(S_t\) that contains all \(\left( {\begin{array}{c}s_\mathrm{{in}}\\ t\end{array}}\right) \) different t-sets of input shares. Correctness is satisfied if and only if every t-set in \(S_t\) is a subset of at least one output set. Indeed, if that is not the case, there would exist a cross-product of degree t that could not be part of any output share, violating the correctness of the sharing. For Eq. (9), given that the unshared function is of degree 2, we can check and confirm that all \(\left( {\begin{array}{c}6\\ 2\end{array}}\right) \) 2-sets are contained in output sets of the sharing. The non-completeness property, which ensures dth-order security, requires that no union of d output sets equals the set of all input shares \(\{0,\ldots ,s_\mathrm{{in}}-1\}\); equivalently, no combination of d output shares contains all input shares. In Eq. (9), this holds for \(d=2\), as no union of two output sets gives the set of all input shares \(\{0,1,2,3,4,5\}\).
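These two set-level conditions are easy to test programmatically. The sketch below checks the seven output sets of Eq. (9):

```python
from itertools import combinations

FULL = frozenset(range(6))                      # input shares 0..5
OUTPUT_SETS = [frozenset(s) for s in
               [{0,1,2}, {0,3,4}, {1,3,5}, {2,4,5},
                {0,1,4}, {0,1,5}, {0,2,3}]]

# Correctness: every 2-set of input shares is a subset of some output set.
t_sets = [frozenset(c) for c in combinations(range(6), 2)]
assert all(any(ts <= os for os in OUTPUT_SETS) for ts in t_sets)

# Second-order non-completeness: no union of two output sets covers all shares.
assert all(a | b != FULL for a, b in combinations(OUTPUT_SETS, 2))
```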

An interesting question when searching for the minimal number of shares is the size of the output sets, as defined in Definition 1. Since any subset of output shares that contains all possible t-sets also contains all possible k-sets with \(k < t\), and these smaller output sets do not contribute to the generation of a correct sharing of a degree-t function, we can conclude that k-sets with k smaller than t are redundant and can be omitted. On the other hand, the size of each output share set cannot be larger than \(s_\mathrm{{in}} - (t(d-1)+1)\). Assume otherwise, i.e., that there is an output set of size at least \(s_\mathrm{{in}} - t(d-1)\). To cover the set of all input shares, we need to add the missing \(t(d-1)\) input shares to it from other output sets. Since the cardinality of an output set is at least t, we can do this with at most \(d-1\) other output sets. But then non-completeness would be violated, since there are d output shares which together cover all input shares.

Now, we are ready to describe the greedy algorithm (Fig. 2), which finds a sharing with a small number of output shares.

Fig. 2 Greedy algorithm for efficient \(td+1\) sharing

Step 2b of the algorithm described in Fig. 2 involves randomly choosing a k-set o among all possible k-sets with the desired property. This nondeterministic behavior leads to different output sharings with different cardinalities. Hence, we iterate the greedy algorithm multiple times and choose the output sharing \(S_o\) with the smallest cardinality among all executions. After 100 iterations, which complete on a regular PC within seconds, we have observed that the number of output shares the algorithm provides remains the same even if we increase the number of iterations to 10,000, while the execution time increases. Thus, for the results presented here we have run the greedy algorithm 100 times and chosen the smallest sharing among all executions.
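The randomized pass and its iteration can be sketched as follows. This is our reading of the procedure in Fig. 2, not a reproduction of the figure: the function name is ours, and the pruning rule is derived from the non-completeness discussion above.

```python
import itertools, random

def greedy_td1(s_in, d, t, seed=0):
    """One randomized pass of the greedy search: repeatedly pick the
    k-set of input-share indexes covering the most still-uncovered
    t-sets, pruning candidates that would break dth-order
    non-completeness."""
    rng = random.Random(seed)
    k = s_in - (t * (d - 1) + 1)        # maximal allowed output-set size
    full = frozenset(range(s_in))
    needed = {frozenset(c) for c in itertools.combinations(range(s_in), t)}
    cands = [frozenset(c) for c in itertools.combinations(range(s_in), k)]
    chosen = []
    while needed:
        if not cands:
            return None                 # this pass got stuck; caller retries
        score = {c: sum(1 for s in needed if s <= c) for c in cands}
        best = max(score.values())
        if best == 0:
            return None
        pick = rng.choice([c for c in cands if score[c] == best])
        chosen.append(pick)
        needed = {s for s in needed if not s <= pick}
        # prune sets that, together with d-1 chosen sets, cover all inputs
        cands = [c for c in cands if c != pick and not any(
            c.union(*combo) == full
            for combo in itertools.combinations(chosen, d - 1))]
    return chosen

# iterate the randomized pass and keep the smallest result, as in the text
runs = [r for r in (greedy_td1(6, 2, 2, seed=s) for s in range(100)) if r]
best = min(runs, key=len)
```

The outer loop mirrors the 100-iteration strategy described above; for \(s_\mathrm{{in}}=6\), \(d=2\), \(t=2\) the result satisfies both correctness and second-order non-completeness.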

An example of a single pass of the greedy algorithm is given in Table 1 for \(s_\mathrm{{in}} = 6\), \(d=2\) and \(t=2\), so that the set O contains all sets of size \(k = s_\mathrm{{in}} - (t(d -1)+1)=3\). In each step of the greedy algorithm, these sets are scored according to step 2b. In bold, we mark the set chosen to be added to the output sharing, and in light gray, we highlight the sets that must be removed because, if they remained in the next steps of the algorithm, they could violate the non-completeness of the constructed sharing. The left column in each table is the score of a given k-set, i.e., the number of t-sets it contains that are not yet present in \(S_o\). The right column lists all remaining k-sets that do not violate non-completeness if added to \(S_o\); in the same column, above the horizontal line is the partially constructed \(S_o\), represented by the k-sets that have been added to it. If we take the fourth table, k-set \(\{0,1,4\}\) has a score of 1, as \(\{1,4\}\) is the only new t-set it would add, given that \(\{0,1\}\) and \(\{0,4\}\) are already subsets of output shares \(\{0,1,2\}\) and \(\{0,3,4\}\). On the other hand, k-set \(\{2,4,5\}\) has a score of 3, since none of the t-sets \(\{2,4\}\), \(\{2,5\}\) and \(\{4,5\}\) are present in any of the output shares. For this particular order of the k-sets, we end up with an output sharing that contains seven shares.

Table 1 Example execution of the greedy algorithm for the case \(s_\mathrm{{in}} = 6\), \(d=2\), \(t=2\)

3.2 Optimal \(d+1\) sharing for functions with degree \(n-1\): \((d+1, \forall {n}, t=n-1, \forall {d})\)

To achieve dth-order security using \(d+1\) sharing for a single term of degree t, i.e., a product of t variables, one gets exactly \({(d+1)}^t\) shares for the product [41]. Alternatively, for \(s_\mathrm{{in}}=d+1\) input shares and a product of t variables one gets \(s_\mathrm{{out}}={(d+1)}^t\) output shares.

The main difference from \(td+1\) sharing is how the non-completeness property is interpreted. Unlike in \(td+1\) TI sharing, in \(d+1\) TI sharing each output share may contain only one share per input variable; in other words, if an output share contains two shares of the same input variable, dth-order non-completeness is violated. Recall that for \(td+1\) TI, using more shares per input variable is possible since the number of input shares is larger. We can observe this in Eqs. (9) and (11): in the first output share of Eq. (9) we see three input shares of x, namely \(x_0\), \(x_1\) and \(x_2\), while in the \(d+1\) sharing of Eq. (11) the first output share contains only one input share of x, namely \(x_0\).

Therefore, in the \(d+1\) case it is enough for non-completeness that only one share of each input variable is present in any given output share. We will assume that the independence of input shares is always satisfied in the \(d+1\) case.

Correctness of the sharing in the \(d+1\) case is verified by checking that each monomial of the shared version of every term (product) of the unshared function f is present in one of the output shares.

Consider again the function \(xy + z\). One possible first-order \(d+1\) sharing of it is given in Eq. (10).

$$\begin{aligned}&(x,y,z) \qquad \nonumber \\&(0,0,0) \qquad o_1 = x_0y_0 + z_0\nonumber \\&(0,1,*) \qquad o_2 = x_0y_1\nonumber \\&(1,0,*) \qquad o_3 =x_1y_0\nonumber \\&(1,1,1) \qquad o_4 = x_1y_1 + z_1. \end{aligned}$$
(10)

The sharing can also be represented with a table, as shown on the left side of Eq. (10). Each output share is a row of the table, and each column represents the shares of a different input variable. The entry in row i and column j is the input share of the jth input variable allowed in the ith output share.

The columns represent the variables x, y and z, respectively. Compared to the \(td+1\) set representation, the table representation restricts the input shares of each variable separately, whereas the output sets impose the same restriction on all input variables.

The asterisk values indicate that we do not care which input share of z appears there, because the sharing of the linear term z is ensured by combining rows 1 and 4 of the table. This also shows that the table representation of the sharing does not uniquely determine the exact formula for each output share; there is a certain freedom in deciding where to insert the input shares.

For example, we can use the table of Eq. (10) to share the function \(x + y + xy + z\). There are two options for the terms \(x_0\) and \(x_1\): rows 1 and 2, and rows 3 and 4, respectively. The same holds for the terms \(y_0\) and \(y_1\): \(y_0\) can be in output share 1 or 3, and \(y_1\) in output share 2 or 4.

Properties of non-completeness and correctness can be easily argued from the table representation. Since in every table row each column entry can represent only one input share of that column's variable, first-order non-completeness is automatically satisfied. For row 3 of the table in Eq. (10), by fixing the entries representing x to 1 and y to 0, we ensure that only \(x_1\) and \(y_0\) can occur in that output share. Hence, there is no way that \(x_0\) or \(y_1\) can be a part of that particular output share, which is the only way to violate non-completeness in \(d+1\) sharing. Correctness of the table can be verified by checking correctness for every monomial of the unshared function f individually. If the combined columns representing the variables of the monomial contain all possible combinations of share indexes, the sharing is correct. Indeed, if this is the case, all terms of the shared product for each monomial can be present in the output sharing. Following the example of Eq. (10), for monomial xy we see that all four combinations \(\{(0,0),(0,1),(1,0),(1,1)\}\) are present in the two columns representing variables x and y. Hence, all of the terms of the shared product \(xy = (x_0 + x_1)(y_0 + y_1) = x_0y_0 + x_0y_1 + x_1y_0 + x_1y_1\) can be present in at least one output share. The same holds for \(z=z_0 + z_1\), as both combinations (0) and (1) are present in the output table of Eq. (10). It is also easy to see that the number of rows in a correct sharing table is lower-bounded by \({ (d+1) }^t\) when the degree of the function is t.
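The table-based correctness check can be sketched in code; a minimal Python sketch (helper name ours), which skips don't-care entries per monomial as discussed above:

```python
import itertools

def is_correct_table(table, monomials, d):
    """d+1 correctness check: for each monomial (a tuple of column
    indexes), the corresponding columns of the table must contain every
    combination of share indexes 0..d; '*' (don't-care) rows are
    skipped for that monomial."""
    for cols in monomials:
        seen = {tuple(row[c] for c in cols) for row in table
                if all(row[c] != '*' for c in cols)}
        needed = set(itertools.product(range(d + 1), repeat=len(cols)))
        if not needed <= seen:
            return False
    return True

# table from Eq. (10): rows are output shares, columns are x, y, z
table_10 = [(0, 0, 0), (0, 1, '*'), (1, 0, '*'), (1, 1, 1)]
# f = xy + z: monomial xy uses columns (0, 1), the linear term z column (2,)
assert is_correct_table(table_10, [(0, 1), (2,)], d=1)
```

The same check confirms the 6-row table of Eq. (11) for \(xy + xz + yz\), whose monomials use column pairs (0, 1), (0, 2) and (1, 2).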

Now, consider the function \(xy + xz + yz\). One possible first-order \(d+1\) sharing and its table are given in Eq. (11), with the table entries on the left. The columns represent x, y and z, respectively.

$$\begin{aligned} (0,0,0) \qquad o_1&= x_0y_0 +x_0z_0 + y_0z_0\nonumber \\ (0,1,1) \qquad o_2&= x_0y_1 +x_0z_1 + y_1z_1\nonumber \\ (1,0,0) \qquad o_3&= x_1y_0 +x_1z_0 \nonumber \\ (1,1,1) \qquad o_4&= x_1y_1 + x_1z_1 \nonumber \\ (*,0,1) \qquad o_5&= y_0z_1\nonumber \\ (*,1,0) \qquad o_6&= y_1z_0. \end{aligned}$$
(11)

The table in Eq. (11) now has 6 rows representing different output shares, which is more than the theoretical minimum of four shares; it is also what one easily obtains when deriving the sharing by hand. The naive approach is to start by sharing xy into four shares. Next, we try to incorporate xz into these four shares by setting all indexes of z equal to those of y. The problem arises when we then try to add the sharing of yz: in the existing four output shares, z and y have the same indexes, so we are forced to add two more shares for the terms \(y_0z_1\) and \(y_1z_0\).

In the following, we show that for any function with n input variables of degree \(t = n-1\) it is possible to find a \(d+1\) sharing with the minimal number of \({ (d+1) }^t\) shares.

Definition 2

A table with n columns representing an output sharing of a function of degree t with n input variables is referred to as a \(D^n\)-table. The number of rows of the table is the number of output shares of the given sharing. If the output sharing is correct, the \(D^n\)-table is a t-degree correct \(D^n\)-table. A t-degree correct \(D^n\)-table with a minimal number of rows is called an optimal \(D^n\)-table. An optimal \(D^n\)-table that has \({ (d+1) }^t\) rows is called an ideal \(D^n\)-table, denoted \(D^n_t\)-table.

Obviously, for \(t=n\) the ideal \(D^n_n\)-table is simply the table that contains all \({ (d+1) }^n\) different index n-tuples of the input variables occurring in the terms of the shared product when sharing a function of degree n. We can also consider each row of a \(D^n\)-table as an ordered tuple of size n: the ith value in such a tuple represents the ith input variable, and its value is the allowed input share of that variable in the output share represented by the tuple. All tuple entries take values from the set \(\{0,\ldots ,d\}\).

Fig. 3 Algorithm for optimal \(d+1\) sharing

Definition 3

A \(D^t\)-table \(D_1\) is a t-subtable of a \(D^n\)-table \(D_2\) for given t columns if \(D_2\) reduced to these t columns is equal to \(D_1\).

We have shown with the sharing in Eq. (10) how one can check the correctness of the table. Now, we generalize this by showing how to check whether a given \(D^n\)-table can be used for the sharing of any function of degree t. It turns out that it is sufficient to check correctness only for the terms of degree t, since if we are able to share a product of t variables with a number of output shares, we can always share any product of a subset of these t variables using the same output shares.

It is easy to see that a \(D^n\)-table D can be used to share any function of degree t if and only if, for any combination of t columns, the \(D^t\)-table formed by the chosen t columns contains all possible \({ (d+1) }^t\) ordered tuples of size t. In other words, the t-subtable of D for any t columns is a t-degree correct \(D^t\)-table.

This comes from the fact that a \(D^t\)-table that contains all possible \({ (d+1) }^t\) ordered t-tuples represents a correct sharing for functions of degree t. If this holds for any combination of t columns of D, we can correctly share any combination of products of size t of the n input variables.

An example is given in Table 2, where the \(D^3\)-table on the left can be used for a first-order sharing of any function of degree 2, since all three \(D^2\)-tables obtained from it contain all four possible ordered 2-tuples (0, 0), (0, 1), (1, 0) and (1, 1) among their rows.

Table 2 \(D^3\)-table and its 3 2-subtables

Next, we show how one can construct an ideal \(D^n\)-table for given n, d and \(t=n-1\). To recap, we first build a \({ (d+1) }^t \times n\) table D, where every row is a tuple of indexes (in a single row, every variable is assigned exactly one share index) and the t-subtable of D for any t columns is a t-degree correct \(D^t\)-table. Since \(t=n-1\), we can view t-subtable generation as the removal of a column from D. Such a \(D^n\)-table D is then equivalent to a sharing which fulfills the correctness and non-completeness properties of TI. Constructing an ideal \(D^n_n\)-table is trivial by enumerating all ordered index n-tuples; the number of rows in it is \({ (d+1) }^n\).

Showing that a particular \(D^n\)-table with \({ (d+1) }^{n-1}\) rows is a \(D^n_{n-1}\)-table becomes equivalent to proving that removal of any single column (restriction to \(n-1\) columns or, equivalently, variables) from the \(D^n\)-table yields a \(D^{n-1}_{n-1}\)-table. Alternatively, any \((n-1)\)-subtable of \(D^n_{n-1}\)-table is a \(D^{n-1}_{n-1}\)-table.

Here, we show how to build the \(D^n_{t}\)-table for the case \(t=n-1\). For any given \(D^n_{n-1}\)-table and security order d, we prove the existence of d other \(D^n_{n-1}\)-tables such that no n-tuple occurs in more than one table; in other words, no two tables contain equal rows. We call such \(d+1\) \(D^n_{n-1}\)-tables conjugate tables, and the sharings produced from them conjugate sharings. Having all rows different implies that these \(d+1\) \(D^n_{n-1}\)-tables cover \((d+1){ (d+1) }^{n-1} = { (d+1) }^n\) index n-tuples, i.e., all possible index n-tuples. Therefore, these \(d+1\) \(D^n_{n-1}\)-tables together form a \(D^n_n\)-table.

We build the \(d+1\) conjugate \(D^n_{n-1}\)-tables inductively. For a given d, we build \(d+1\) conjugate \(D^2_1\)-tables; then, assuming \(d+1\) conjugate \(D^n_{n-1}\)-tables exist, we construct \(d+1\) conjugate \(D^{n+1}_n\)-tables.

The initial step is simple: \(D^2_1\) has two columns (for the variables x and y) and in each row i (enumerated from 0 to d) of each conjugate table j (enumerated from 0 to d) we set the value in the first column to be i, and the value of the second column to be \((i+j) \mod (d+1)\), hence obtaining the \((d+1)\) conjugate D-tables with \(d+1\) rows. Indeed, both columns of any of the constructed \(D^2_1\)-tables contain all values between 0 and d; so, by removing either column we always obtain a correct \(D^1_1\)-table. Also, this construction ensures that the second column never has the same index value in one row for different tables; therefore, no two rows for different tables are the same, ensuring that formed tables are indeed conjugate.

Induction step Assume we have \(d+1\) conjugate \(D^n_{n-1}\)-tables. Using them, we now build \(d+1\) conjugate \(D^{n+1}_n\)-tables following the procedure given in Fig. 3.

The example of the iterative step from Algorithm 3 is given in Fig. 4.
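Since Fig. 3 is not reproduced here, the following sketch shows one construction that is consistent with the base case and with the proof of Lemma 1; the exact form of the induction step in Fig. 3 may differ, so treat the `extend` rule as an assumption on our part: a row taken from old conjugate table i and placed into new conjugate table j receives \((i+j) \bmod (d+1)\) as its last entry.

```python
import itertools

def base_tables(d):
    """Base case: d+1 conjugate D^2_1-tables; row i of table j is
    (i, (i + j) mod (d + 1)), as described in the text."""
    return [[(i, (i + j) % (d + 1)) for i in range(d + 1)]
            for j in range(d + 1)]

def extend(tables, d):
    """Induction step (our reconstruction, consistent with Lemma 1):
    a row from old conjugate table i placed into new conjugate table j
    gets the last entry (i + j) mod (d + 1)."""
    return [[row + ((i + j) % (d + 1),)
             for i, tab in enumerate(tables) for row in tab]
            for j in range(d + 1)]

def is_ideal(table, n, d):
    """A D^n_{n-1}-table: every (n-1)-subtable contains all
    (d+1)^(n-1) distinct tuples."""
    return all(
        len({tuple(r[c] for c in cols) for r in table}) == (d + 1) ** (n - 1)
        for cols in itertools.combinations(range(n), n - 1))

d = 1
tables = base_tables(d)              # conjugate D^2_1-tables
for n in range(2, 5):                # check, then build D^3_2, D^4_3, ...
    assert all(is_ideal(t, n, d) for t in tables)
    rows = [r for t in tables for r in t]
    assert len(rows) == len(set(rows))   # conjugacy: no shared rows
    tables = extend(tables, d)
```

Equality of the last entries forces two rows to come from the same old table, which is exactly the property used in the proof of Lemma 1.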

Fig. 4 Generating conjugate \(D^3_2\)-tables from \(D^2_1\)-tables

Lemma 1

Given \(d+1\) conjugate \(D^n_{n-1}\)-tables the algorithm described in Fig. 3 constructs \(d+1\) conjugate \(D^{n+1}_n\)-tables.

Proof

First, let us show that the constructed \(d+1\) \(D^{n+1}_n\)-tables are conjugate, i.e., there is no \((n+1)\)-tuple which belongs to more than one of them. Let us assume there exists an \((n+1)\)-tuple which belongs to two \(D^{n+1}_n\)-tables. This implies the existence of an n-tuple which belongs to two of the initial \(d+1\) \(D^{n}_{n-1}\)-tables, contradicting the fact that these initial tables are conjugate.

Finally, any restriction to a particular set of columns has to contain all combinations of index n-tuples, i.e., satisfy the correctness property. In fact, it is sufficient to prove that any set of n columns in any of the new conjugate tables contains all possible n-tuples. Indeed, if we remove the last column in any of the so-constructed tables, we get the union of the original \(d+1\) \(D^{n}_{n-1}\)-tables, forming one \(D^{n}_{n}\)-table; by definition, a \(D^{n}_{n}\)-table satisfies this property. We are left with the other case of removing one of the first n columns, which results in a table of dimensions \({ (d+1) }^n \times n\). If we prove there are no duplicates among the \({ (d+1) }^n\) tuples within this table, all combinations will be present in the table, making it again a \(D^{n}_{n}\)-table. Consider two equal n-tuples; their last indexes are then also equal. By the design of Algorithm 3, equality of the last indexes (these are in the \((n+1)\)-st column) implies that the two rows originate from the same starting conjugate \(D^{n}_{n-1}\)-table. However, restricted to the remaining columns, the rows of one starting \(D^n_{n-1}\)-table contain no duplicates by assumption; hence the two considered n-tuples cannot be equal. \(\square \)

Theorem 1

Any of the constructed conjugate \(D^{n}_{n-1}\)-tables by algorithm in Fig. 3 provides optimal sharing for given n, d and \(t=n-1\).

Proof

The algorithm is applied inductively for the number of variables from 2 up to n. Since one \(D^{n}_{n-1}\)-table contains exactly \({ (d+1) }^{n-1}\) rows, we conclude that it is optimal, as this is the theoretical lower bound on the number of output shares for the case \(t=n-1\). \(\square \)

Recall that, aside from the formula for the lower bound in [41], there has been little other work applying \(d+1\) TI to functions of degree higher than 2, with one exception: the AES implementations of [46, 47], where \(d+1\) TI was applied to the inversion in \(GF(2^4)\), a function of algebraic degree 3. When we tried to derive a \(d+1\) TI for the PRINCE S-box of algebraic degree 3 by hand, prior to the discovery of Algorithm 3, we only managed to find output sharings for the most significant bit of the S-box with 12 and 44 output shares for the first-order and second-order \(d+1\) TI, respectively. The optimal solutions are 8 and 27 output shares for these two cases, and they are easily found using the approach described here.

Another benefit of an algorithmic solution is that it can easily be automated, removing the possibility of human error, which becomes ever more likely as the ANF grows more complex.

It is well known that a balanced Boolean function of n variables has degree at most \(n-1\). Therefore, all \(n \times n\) S-boxes which are permutations have degree at most \(n-1\). Indeed, nearly all bijective S-boxes used in symmetric ciphers are chosen to have the maximum degree \(n-1\). In particular, inversion in the field always has the maximum degree \(n-1\), the most notable example of its usage being the AES S-box. In the particular case of the AES inversion, applying the algorithm shown here produces the minimal number of shares, which is 128. This is, however, too large for any practical application.

The most notable exception where a low-degree function is used is KECCAK's [4] \(\chi \)-function, which is a \(5 \times 5\) S-box of degree 2. A sharing with eight shares can easily be found for \(\chi \) by hand, while a conjugate \(D^5\)-table has 16 rows, which corresponds to the optimal sharing for degree 4.

3.3 Optimal \(d + 1\) sharing for functions of up to 8 bits: \((d+1, n \le 8, t<n-1, \forall {d})\)

As the KECCAK example already shows, in the case \(t < n-1\) the sharing obtained using Algorithm 3 does not have the minimal number of output shares; in other words, the method presented in the previous section is not optimal when the degree of the function is lower than \(n-1\). Finding the optimal sharing for functions of degree lower than \(n-1\) therefore needs to be performed in a different manner. In order to find a solution for this particular case, we must first reformulate our problem.

Recall that, using sharing tables, correctness of a given output sharing of a function F is equivalent to all of the terms in the ANF representation of F being correctly shared. In other words, for each term l of t variables, the columns representing output shares should contain all \({ (d+1) }^t\) different combinations with repetitions. We enumerate the shares of l from 1 to \({ (d+1) }^t\) in lexicographic order and say that output share S covers the ith share of term l if the columns of S representing the variables of l form the ith share of t variables.

The problem of finding a correct sharing can further be reduced to the set covering problem (SCP). Each of the \({ (d+1) }^n\) different output shares is considered as a different set, and the family of these sets is referred to as \(\mathcal {D}\). The universe \(\mathcal {U}\) of all elements to be covered is created by going through the ANF and, for each term l, adding \({ (d+1) }^t\) elements, where t is the degree of l, representing the different shares of l. In other words, each share of each term is a separate element to be covered. A set S from \(\mathcal {D}\) contains an element e from \(\mathcal {U}\) if the output share represented by S covers the term share of F represented by e. Given \(\mathcal {D}\) and \(\mathcal {U}\), we need to find a subfamily \(\mathcal {C} \subseteq \mathcal {D}\) of minimal cardinality such that the union of the sets from \(\mathcal {C}\) is \(\mathcal {U}\), i.e., for every element e from \(\mathcal {U}\) there exists at least one set S from \(\mathcal {C}\) that contains e.

We can further represent the SCP in terms of decision variables. With each of the \({ (d+1) }^n\) possible output shares S from \(\mathcal {D}\), we associate a \(\{0,1\}\) variable \(x_S\) denoting whether share S is chosen. Our goal of finding a correct and non-complete minimal sharing can then be formulated as:

$$\begin{aligned}&\text{ minimize } \sum _{S \in \mathcal {D}} x_S \end{aligned}$$
(12)
$$\begin{aligned}&\text{ subject } \text{ to } \sum _{S:e \in S} x_S \ge 1,\ (\forall e, e \in \mathcal {U}). \end{aligned}$$
(13)

Expression (12) is referred to as the objective function, while the inequalities given by (13) are called constraints.
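For a small case, the SCP instance can be built directly from this description. The sketch below (Python, names ours) constructs \(\mathcal {U}\) and \(\mathcal {D}\) for the 3-variable degree-2 function of Eq. (11) and checks that an ideal \(D^3_2\)-table (one valid choice of such a table) is a feasible cover meeting the trivial lower bound of \({ (d+1) }^t = 4\):

```python
import itertools

# SCP instance for n = 3, t = 2, d = 1 with all three degree-2 terms
n, t, d = 3, 2, 1
monomials = list(itertools.combinations(range(n), t))   # xy, xz, yz
# universe U: one element per (monomial, share-index combination)
U = {(m, c) for m in monomials
     for c in itertools.product(range(d + 1), repeat=t)}
# family D: every possible output share, i.e., every full index n-tuple
D = list(itertools.product(range(d + 1), repeat=n))

def covers(share, elem):
    """Set membership: share S contains element e = (monomial, combo)
    if S restricted to the monomial's variables equals the combo."""
    m, c = elem
    return tuple(share[v] for v in m) == c

# an ideal D^3_2-table is a feasible cover of size 4
C = [(0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
assert all(any(covers(S, e) for S in C) for e in U)
assert len(C) == (d + 1) ** t    # meets the trivial lower bound
```

This also illustrates why the hand-derived 6-share table of Eq. (11) is not minimal: the SCP formulation admits a 4-set cover for the same universe.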

The set covering problem is a well-known discrete optimization problem which appears in various applications, for example in logic minimizer design, mobile network base station placement, etc. Hence, having reduced the problem of finding a minimal sharing to set covering, we can try to use discrete optimization methods to find good solutions to it.

3.3.1 Constraint programming optimization techniques

We have used four different techniques to solve the underlying set covering problem: constraint programming, mixed integer programming, randomized greedy with restarts and simulated annealing with a greedy heuristic.

Constraint programming (CP) [42] focuses on finding a feasible solution given a number of constraints. Its original use is to determine whether a problem is satisfiable, but it can also be used for minimization: after finding a feasible solution with objective cost N, a new constraint is added requiring the objective function to be smaller than N, and the process is restarted. The cycle is repeated until the problem becomes infeasible, and the last objective value for which a feasible solution was found is the minimal one. We have used the freely available MiniZinc [37] software to solve our set covering problem.

Mixed integer linear programming (MILP) [45] relaxes the problem so that the decision variables become continuous real values \(x_S \in [0,1]\) instead of binary, and then solves the underlying linear program [18] to establish a lower bound on the objective function. Afterward, it tries to find the smallest solution such that all decision variables are integer, satisfying the original problem constraints. Like CP, MILP is able to prove the optimality of the solution. We have used the Gurobi 9.0 [26] solver for this part.

The randomized greedy heuristic with restarts, or iterated greedy (IG), is a technique where a solution is constructed in a greedy manner: in each step, we take the set S that covers the most elements not covered so far, stopping when all elements are covered. All ties are broken randomly, i.e., if multiple sets cover an equal number of still uncovered elements, the algorithm randomly chooses one of them to add to the solution being built. We loop this approach multiple times and in the end take the solution from the iteration with the smallest number of sets. Since ties are broken randomly, solutions differ from run to run. The optimality of this technique cannot be proven, since the greedy algorithm finds local minima of the SCP, not the global one. In practice, however, the solutions it finds are often close to optimal.
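The IG loop can be sketched as follows (Python, our own naming); on a small \((n=3, t=2, d=1)\) instance with all degree-2 terms it recovers the minimal cover of four shares:

```python
import itertools, random

def iterated_greedy(U, D, covers, runs=200, seed=0):
    """Greedy set cover with random tie-breaking, restarted `runs`
    times; returns the smallest cover found over all restarts."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        uncovered, sol = set(U), []
        while uncovered:
            gain = {S: sum(1 for e in uncovered if covers(S, e)) for S in D}
            top = max(gain.values())
            pick = rng.choice([S for S in D if gain[S] == top])  # random tie-break
            sol.append(pick)
            uncovered = {e for e in uncovered if not covers(pick, e)}
        if best is None or len(sol) < len(best):
            best = sol
    return best

# the small (n=3, t=2, d=1) instance: all degree-2 terms of 3 variables
n, t, d = 3, 2, 1
monomials = list(itertools.combinations(range(n), t))
U = {(m, c) for m in monomials
     for c in itertools.product(range(d + 1), repeat=t)}
D = list(itertools.product(range(d + 1), repeat=n))

def covers(S, e):
    return tuple(S[v] for v in e[0]) == e[1]

best = iterated_greedy(U, D, covers)
assert all(any(covers(S, e) for S in best) for e in U)
```

Each restart is independent, so the runs can also be trivially parallelized if needed.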

Simulated annealing [28] is a meta-heuristic where a neighbor solution with a higher objective cost is accepted with a probability that gradually decreases over the execution time. Intuitively, accepting worse solutions allows the search to explore more of the search space and escape local minima. Lowering the acceptance probability over time guides the search more and more toward good solutions, while the earlier rounds explore the large search space in a more indiscriminate fashion. The probability parameter is called the temperature. We utilized the implementation approach given in [9, 30], where execution is separated into multiple rounds. After each round, the temperature is decreased by a constant factor cool. In each round, a number I of neighbors is explored. A neighbor is constructed by removing some of the sets from the solution, then constructing a new solution using the remaining sets as a starting point and adding new ones until all elements are covered again. We accept the new solution if it is better than the previous one; if not, we still accept it with probability \(\exp (-\delta /\hbox {temp})\), where \(\delta \) is the difference between the objective values of the new and the current solution.

The implementation details from [30] are given in the following table, while the simulated annealing algorithm for SCP is given in Algorithm 1.

Parameter: Description

A: Data structure providing the relation of which sets cover which elements, typically given as a \(\{0,1\}\) matrix

cool: Temperature reduction ratio between rounds, \(temp_{r+1} = temp_r \times cool\)

\(\rho \): Fraction of shares to be dropped from the current solution during neighbor construction

R: Number of rounds to run

I: Number of neighbors examined in each round

rand(): Function that returns a uniformly random value in the range [0, 1]

temp: Temperature parameter used to determine the probability of accepting a bad neighbor

Construct(): Greedy heuristic method used to construct a new covering after a fraction of shares has been removed from the solution

Perturb(): Neighbor-defining function; a fraction \(\rho \) of shares is stripped from the current solution, removing the shares with the fewest uniquely covered elements

RemoveRedundant(): Executed after a new solution has been constructed; removes chosen shares that are redundant because all of the elements they cover are already covered by other shares in the candidate solution

Algorithm 1 Simulated annealing for SCP
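A compact sketch of the SA heuristic follows (Python; the parameter defaults are illustrative and smaller than the \(R=150\), \(I=100\) used in the experiments, and the helper functions mirror the parameter table above rather than reproduce Algorithm 1 verbatim):

```python
import itertools, math, random

def sa_scp(U, D, covers, init, R=30, I=20, temp=100.0, cool=0.91,
           rho=0.2, seed=0):
    """Simulated annealing for SCP following the round/temperature
    scheme described in the text."""
    rng = random.Random(seed)

    def construct(sol):
        # Construct(): greedily add sets until all elements are covered
        uncovered = {e for e in U if not any(covers(S, e) for S in sol)}
        while uncovered:
            pick = max(D, key=lambda S: sum(1 for e in uncovered
                                            if covers(S, e)))
            sol.append(pick)
            uncovered = {e for e in uncovered if not covers(pick, e)}
        return sol

    def remove_redundant(sol):
        # RemoveRedundant(): drop sets whose elements are covered elsewhere
        i = 0
        while i < len(sol):
            rest = sol[:i] + sol[i + 1:]
            if all(any(covers(S, e) for S in rest) for e in U):
                sol = rest
            else:
                i += 1
        return sol

    cur = best = remove_redundant(construct(list(init)))
    for _ in range(R):
        for _ in range(I):
            # Perturb(): strip the rho fraction of shares with the fewest
            # uniquely covered elements, then rebuild greedily
            def uniq(S):
                return sum(1 for e in U if covers(S, e)
                           and sum(covers(T, e) for T in cur) == 1)
            drop = max(1, int(rho * len(cur)))
            cand = sorted(cur, key=uniq, reverse=True)[:-drop]
            cand = remove_redundant(construct(cand))
            delta = len(cand) - len(cur)
            # accept improvements always, worse neighbors probabilistically
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cur = cand
            if len(cur) < len(best):
                best = cur
        temp *= cool
    return best

# illustrative run on the small (n=3, t=2, d=1) instance
n, t, d = 3, 2, 1
monomials = list(itertools.combinations(range(n), t))
U = {(m, c) for m in monomials
     for c in itertools.product(range(d + 1), repeat=t)}
D = list(itertools.product(range(d + 1), repeat=n))

def covers(S, e):
    return tuple(S[v] for v in e[0]) == e[1]

result = sa_scp(U, D, covers, init=[])
assert all(any(covers(S, e) for S in result) for e in U)
```

Passing an empty `init` makes the internal greedy construction provide the starting solution; in the experiments described above, the starting solution comes from IG instead.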

3.3.2 Sharing solutions

We have focused on applying the four discrete optimization techniques to a set of different Boolean functions of up to eight bits, whose sharings can be used for any Boolean function with the same number of input bits and the same algebraic degree. Since optimal sharing has already been explored for the case of degree \(t \ge n-1\) with n input bits, we have investigated Boolean functions where \(t < n-1\). In order for the solution to be generic enough, we further assume the most extreme case: if we have n variables and degree t, we search for a sharing of a function that has all \(\binom{n}{t}\) t-degree terms present in its ANF. A sharing of such a function can be used for any n-bit function of degree t. A more efficient sharing for a given n-bit function F of degree t might exist, however, depending on the number of t-degree terms present in the ANF and their structure, and the same discrete optimization methodology can be applied to F to find a possibly better sharing.

First, we have applied CP using MiniZinc [37] with the Chuffed 0.10.4 [16] and Google OR-Tools [39] solvers. The resulting numbers of shares using the CP solvers are given in Table 3. For \(d=1\), both options are able to find the optimal sharing in all cases except for 8-bit functions of degree 5, where a solution with 52 shares is found after a few hours, but without proof of optimality even after running the CP solver for several days on a regular PC. Optimality for the other cases is proven within several tens of minutes. For the second-order case \(d=2\), the CP solver is only able to prove optimality for the simplest case of 4 bits and degree 2. It is able to find some solutions when \(n=5\), but without proof of optimality. For \(n > 5\), the solver is unable to provide any solutions within a few minutes, so we deem it unsuitable for those cases due to the size of the search space. We did not see much difference in solution times between the Chuffed and OR-Tools solvers, although Chuffed seems to finish the search slightly faster.

Table 3 Number of shares found using CP solver

Next, we have tried the MILP approach, in particular the Gurobi 9.0 solver [26]. The resulting numbers of shares using the MILP solver are given in Table 4. For \(d=1\), the solver was able to find the same solutions and prove optimality in less time. However, for the case of 8 bits and degree 5, a solution with 52 shares was again found without proof of optimality. When \(d=2\), the MILP solver was more successful than the CP one, finding optimal solutions for degree-2 functions for all of \(n=4,5,6,7\), and for degree-3 functions of 5 and 6 bits. However, more complex cases quickly become difficult for the solver.

Table 4 Number of shares found using MILP solver

From Tables 3 and 4, it becomes apparent that the CP and MILP solvers struggle with \(d=2\), particularly for 8-bit functions of degree 4 or more. The CP solver struggles much more, as it is unable to find solutions for second-order sharings of functions with six or more input bits. This is not surprising due to the exponential increase in the number of decision variables. Hence, for the more difficult cases, heuristics are the only practical way to find good solutions. The iterated greedy approach used 100,000 runs, choosing the best solution among them; it finishes in just a matter of seconds even for the harder cases. The program is sometimes able to match the optimal solutions of the CP and MILP approaches, and even when it does not, the solutions it finds are within 30% of the minimum, where a minimal solution is known. IG run results are given in Table 5.

Table 5 Number of shares found using IG heuristic

As a way to improve on the results of the IG heuristic, we have also tested the simulated annealing technique. SA run results are given in Table 6. First, the IG technique was used with 20,000 runs to provide an initial solution, and then SA was run with \(R=150\) rounds and \(I=100\) iterations in each round. The program ran extremely fast on a regular PC; it was slowest for 8-bit functions of degree 6, where it finished in about 5 min. The SA program found the same results for \(d=1\) as the CP and MILP solvers, obtaining optimal solutions much faster. For \(d=2\), it was able to improve on some of the instances compared to the IG approach, but the solution quality was at most 10% better in instances where CP and MILP were unable to provide solutions in a reasonable amount of time. In some instances with 8 bits and degrees 3, 4 and 5, it was unable to improve upon the solution provided by the IG approach. Modifying the SA annealing parameters had a limited impact on the quality of the solution. We found a cooling factor of 0.91, a starting temperature of 100 and a removal coefficient \(\rho \) of 0.2 to work well in almost all cases. \(\rho \) had the most impact on the improvements; we noticed that smaller values, around 0.1, are more beneficial for larger values of t, while values around 0.3 work better for smaller t.

Table 6 Number of shares found using SA heuristic
Table 7 Best sharing using all four approaches

Finally, we can combine the solutions of all four approaches to collect the best ones. Aggregate results are presented in Table 7. Examining the best found solutions, we can see that the optimal sharing does not give a large increase in the number of output shares compared to the trivial bound of \({ (d+1) }^t\). For \(d=1\), the increase is up to 50%, except in the hardest case of 8-variable, degree-5 functions, where we have 52 shares compared to the bound of 32 shares, an increase of a little over 60%. A similar situation occurs for \(d=2\), where the found solutions are within a 70% increase. The set covering problem arising from sharing is somewhat pathological, in that all candidate shares cover an equal number of elements; this makes it difficult for the solvers to find optimal solutions as the total number of shares increases, while good solutions are still discovered relatively easily using the greedy heuristic.

A comparison of our results with the recent ones presented in Sect. 3.2 and in [48] is given in Table 8. The method presented in Sect. 3.2, although achieving optimality when \(t=n-1\), is ineffective in cases where \(t<n-1\). The greedy heuristic given in [48] finds solutions whose sizes are multiples of \({ (d+1) }^t\). However, the authors only presented the solution for the specific case of an 8-bit degree-3 function, and for the cases \((n=4, t=2, d=2),\ (n=4, t=2, d=3),\ (n=5, t=2, d=4),\ (n=5, t=3, d=3),\ (n=6, t=3, d=3)\) they report that an optimal solution is found without providing further information. Hence, in Table 8 we indicate the smallest number of output shares the algorithm presented in [48] could potentially find. Our method provides better results in all cases for the first and second security order.

Table 8 Comparison to previously known results

Concrete sharings are given in Tables 11, 12, 13 and 14 in “Appendix A.” If we examine the found solutions for \(d=1\), we can see that many of them have a symmetric structure. One might assume that optimal solutions always have sharings with such structure, but this is apparently not the case. For example, if we add this symmetry as an additional constraint to the CP solver, it no longer finds an optimal sharing of 7-bit functions of degree 2 with 12 shares, but only with 14. The symmetric structure produced by the CP solver is likely an artifact of the heuristic it uses to traverse the search space.

3.3.3 Sharing for AES cubic power functions

As an example of an application of the generic method presented here, we demonstrate an improved sharing of the power functions \(x^{26}\) and \(x^{49}\) that are used to build the AES inversion in the work of [35, 48]. Both are cubic functions, and in [48] a sharing with 16 output shares is used to produce a first-order secure \(d+1\) implementation. The sharing is only valid for the lowest bit of the power functions, but using the rotational symmetry of power functions in a normal basis, we can use the same sharing for the other output bits as well. The normal basis generator chosen in [48] is \(\beta =205\). Table 7 gives a sharing with 12 shares, which is 4 shares or 25% less, for a generic cubic function. However, if we apply the smallest sharing to the exact ANF of the cubic power functions, we can improve upon this result. Using the MILP set covering program, we were able to find a first-order sharing for both power functions \(x^{26}\) and \(x^{49}\) with ten shares each, six shares or 37.5% less than the original sharing. Furthermore, we have applied the same method to a second-order \(d+1\) sharing of the same power functions and normal basis generator. Surprisingly, both functions admit a sharing with \({(d+1)}^t=27\) shares, the theoretical minimum. In order to make sure ten shares is the best first-order sharing, we have investigated seven other pairs of cubic functions that yield inversion when composed together: \((13, 98), (19, 161), (38, 208), (52, 152), (67, 137), (76, 104), (134, 196)\). In addition, we have expanded the search over all 128 normal basis generators. The exhaustive search showed that ten shares is the smallest number of shares in every case. The other generators that produce a sharing with ten shares of the cubic power functions are \(\{36, 96, 117, 124, 140, 199, 202\}\). Interestingly, for all the pairs of cubic power functions that produce inversion, the generators that yield ten output shares are always the same.
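That these exponent pairs compose to the AES inversion can be checked arithmetically: since \((x^a)^b = x^{ab \bmod 255}\) on the multiplicative group of GF(\(2^8\)) and \(x^{254} = x^{-1}\), each pair must satisfy \(ab \equiv 254 \pmod{255}\), and each exponent must have binary Hamming weight 3 for the power function to be cubic:

```python
# Each pair (a, b) of cubic exponents should satisfy a*b ≡ 254 (mod 255),
# since (x^a)^b = x^(a*b mod 255) on GF(2^8)* and x^254 = x^(-1).
pairs = [(26, 49), (13, 98), (19, 161), (38, 208),
         (52, 152), (67, 137), (76, 104), (134, 196)]
for a, b in pairs:
    assert (a * b) % 255 == 254

# Cubic means the exponent has Hamming weight 3 in binary:
# the algebraic degree of x^k equals the weight of k.
assert all(bin(a).count("1") == 3 and bin(b).count("1") == 3 for a, b in pairs)
```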
To keep the notation succinct, we enumerate the chosen shares of the power functions \(x^{26}\) and \(x^{49}\) with normal basis generator 205 in their lexicographical order, the first share having index 0 and the last share having index \({(d+1)}^n - 1\). In other words, we view each possible share as an n-digit word in base \(d+1\). For example, if we had a sharing with \(d=2, n=4, t=2\) given as [2, 12, 25, 31, 44, 45, 60, 64, 77], the actual nine shares would be

$$\begin{aligned}&(0, 0, 0, 2) \ (0, 1, 1, 0)\ (0, 2, 2, 1) \ (1, 0, 1, 1)\ (1, 1, 2, 2)\\&\quad (1, 2, 0, 0)\ (2, 0, 2, 0)\ (2, 1, 0, 1)\ (2, 2, 1, 2). \end{aligned}$$
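The decoding of this index notation can be sketched as follows (the helper name is illustrative); the most significant base-\((d+1)\) digit corresponds to the first input variable:

```python
def decode_share(index, n, d):
    """Decode a share index into its n-digit base-(d+1) representation,
    most significant digit first (one digit per input variable)."""
    digits = []
    for _ in range(n):
        index, r = divmod(index, d + 1)
        digits.append(r)
    return tuple(reversed(digits))

# The example sharing for d=2, n=4 from the text:
indices = [2, 12, 25, 31, 44, 45, 60, 64, 77]
shares = [decode_share(i, n=4, d=2) for i in indices]
```

For first-order designs (\(d=1\)) the same decoding reduces to reading off the binary representation of the index, e.g. `decode_share(30, n=8, d=1)` gives `(0, 0, 0, 1, 1, 1, 1, 0)`.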

A shortcut to obtaining the actual output shares for first-order designs, where the allowed share indices are 0 and 1, is to read the binary representation of the index value: each bit gives the allowed share of the corresponding input variable. It should be noted that this representation of the sharing, while succinct, is not unique with respect to the actual distribution of ANF terms, as low-degree monomials have multiple possible output shares they can be a part of. For the first-order sharing of an 8-bit function, there are \(2^8 = 256\) possible shares, while for the second-order sharing there are \(3^8 = 6561\) total shares. The chosen shares for the first-order sharing are given as follows:

$$\begin{aligned} \hbox {shares}(x^{26}) =&[0, 30, 43, 93, 114, 154, 181, 195, 232, 246]\nonumber \\ \hbox {shares}(x^{49}) =&[0, 30, 79, 115, 124, 165, 187, 214, 217, 234].\nonumber \end{aligned}$$

The chosen shares for the second-order sharing are given as follows:

$$\begin{aligned} \begin{aligned} \hbox {shares}(x^{26}) =&[0, 439, 626, 796, 1145, 1338, 1511, 1938,\\&2044, 2388,2575, 2690, 3103, 3290,3474, \\&3863, 3975, 4162, 4533, 4639, 5069, 5212, \\&5408, 5754, 5927, 6111, 6550]\\ \hbox {shares}(x^{49}) =&[0, 338, 673, 930, 1025, 1324, 1617,\\&1919, 2011, 2218, 2553, 2882, 3148, 3231, \\&3542, 3745, 4053, 4148, 4436, 4762, 5097, \\&5276, 5368, 5676, 5963, 6271, 6354]. \end{aligned} \end{aligned}$$

3.4 Optimizing secure AES schedule for maximum throughput

The design of side-channel secure AES schemes has traditionally been optimized for very low area, most commonly using a single S-box instance to compute both SubBytes and the round key update. While this approach is justified, as there are many applications, such as RFID devices, where area and power are tightly constrained, it fares rather poorly with regard to latency and throughput. In the case of AES-128, the AES variant most prominent in the academic literature, this approach limits the execution time to an asymptotic minimum of 200 cycles per encryption, with each round performing 20 S-box operations. To ease the notation, from now on we refer to AES-128 simply as AES.

Adding the required ShiftRows, MixColumns and key schedule operations, as well as the several-cycle latency of the S-box in side-channel protected implementations, increases the total latency even further. Several side-channel protected AES implementations [7, 15, 25, 32] require at least 246 cycles to complete one AES encryption. Recently, several approaches have been proposed that achieve a throughput of one encryption per 200 cycles with a latency of 216 cycles [25, 47]. However, they are only feasible for S-boxes that compute their output in five or fewer cycles, which is a restrictive requirement: most existing side-channel secure AES S-box implementations take six or more cycles, going as far as nine cycles in the work presented in [21]. Here, we propose a new scheduling method that achieves a throughput of 20 cycles per round for S-box latencies of up to 11 cycles, allowing for a serialized AES implementation with the best possible latency and throughput. First, we state the set of data dependencies that need to be respected in order to have a correct AES round implementation using a serialized S-box:

  • All state bytes that are to be overwritten by the S-box output need either to be in the S-box pipeline or to have finished the S-box operation.

  • All bytes of a MixColumn input column need to have finished the S-box operation. The last byte, however, can be collected straight from the S-box output instead of from the state registers.

  • A state byte in the next round can only be used as an S-box input if the MixColumn operation for that byte’s column has been performed.

  • A key byte must not be updated before it is used in the current round.

  • The first four bytes of the key can only be updated once the corresponding S-box output has been completed.

  • Except for the first column of the key, each byte in the subsequent columns can only be updated if the corresponding byte from the previous column has been updated.

We can express these rules in the MiniZinc constraint modeling language, with the latency of the S-box as a parameter. MiniZinc then tries to satisfy all constraints and outputs a solution if one exists; if it cannot find a solution after exploring the entire search space, it reports that the problem is unsatisfiable. We have run our constraint model for different latencies, and it always found a schedule for latencies of up to 11 clock cycles. The model is unsatisfiable for an S-box latency of 12, proving that no schedule exists for any S-box latency of 12 or more cycles.

We illustrate our solution on the nine-cycle latency S-box, which can, for example, be used to improve the scheduling of M&M implementation [21].

Fig. 5
figure 5

S-box pipeline schedule

Fig. 6
figure 6

Timing diagram of S-box usage in the pipeline schedule

All other solutions are presented in “Appendix B.” The state and key update schedule is shown in Fig. 5, while the corresponding timing diagram is shown in Fig. 6. In our design, the ShiftRows operation is performed together with the S-box output write-back. That is, the output of the S-box is written to the state byte position it would occupy after ShiftRows. Another way to look at it is that each column is written back diagonally, wrapping around the state matrix, rather than back to itself. This is clearly shown in Figs. 5 and 6. The MixColumn operation is performed as soon as all of the column bytes are ready, i.e., in the 28th, 24th, 29th and 30th cycle, respectively. The result of MixColumn is ready in the following cycle. The key addition is only performed after the MixColumn operation, except in the last round, where it is performed before ShiftRows. The initial key addition is performed during the loading stage. From the timing diagram in Fig. 6, we see that the actual output for all the state bytes is ready 10 cycles after the beginning of the following round.

It should be noted that the trade-off of our approach is an additional multiplexer, since we are required to read from and write to arbitrary bytes of the state and key matrices, which slightly increases the area occupied by the implementation. However, we can achieve speed-ups of roughly 20–35% for existing implementations with an S-box latency greater than 6 cycles.

4 Implementations, results and evaluation

4.1 Implementations

In this section, we provide eight implementations of PRINCE: first- and second-order protection, using \(td+1\) and \(d+1\) sharing, each with and without S-box decomposition.

4.1.1 PRINCE unprotected implementation

Figure 7 represents the architecture of unprotected round-based PRINCE. Here, we evaluate the S-box in one cycle. One encryption is performed in 12 clock cycles. We utilize the fact that the S-box and its inverse are in the same equivalence class to reuse the same circuitry for both first and last rounds, with the exception that affine transformation circuits \(A_{io}\) are used during the evaluation of \(S^{-1}\). By adding an extra multiplexer, we can perform decryption as well. To minimize the overhead of adding decryption, we use the round counter design as explained in [33].

Fig. 7
figure 7

Unprotected PRINCE round-based architecture

Following Fig. 7, when evaluating the S-box, the data travel through multiplexers \(\alpha _1 - \beta _2 - \delta _1\), except in the first round where the path is \(\alpha _1 - \beta _1 - \delta _1\). Similarly, when evaluating the inverse S-box, active path is \(\alpha _2 - \gamma _1 - \delta _2\), except in the last round where we use \(\alpha _1 - \gamma _2 - \delta _2\).

Next, in Sects. 4.1.2–4.1.6 we first present the TIs of the S-box and its building block \(Q_{294}\), and then, in Sect. 4.1.7, we present the actual secure implementations of PRINCE.

4.1.2 TIs of \(Q_{294}\)

We have implemented \(td + 1\) and \(d+1\) variants of TI for both the first- and the second-order \(Q_{294}\) implementations. We use the first-order \(td+1\) direct TI sharing [11] and the second-order \(td+1\) sharing with five input shares and ten output shares, as explained in [6]. For the \(d+1\) first- and second-order implementations, we use the sharing given in [41]. Compression is applied to the second-order \(td+1\) and to both \(d+1\) versions. A detailed description of the hardware architecture is given in “Appendix B.”

4.1.3 First-order secure \(td+1\) TI of PRINCE S-box

For the first-order \(td + 1\) design, we generated a sharing with four input and output shares following [11].

Output bits are refreshed using Eq. (2), requiring 12 random bits per S-box. The exact sharing does not fit within the margins of this paper and will be provided as Verilog code online [12].

4.1.4 Second-order secure \(td+1\) TI of PRINCE S-box

To create a second-order secure masking of the PRINCE S-box with \(d=2\), we have used the iterated greedy algorithm described in Sect. 3.1. This algorithm provides a solution with 17 output shares and 8 input shares. Compared to the solution given in [6], which has 35 output shares and 7 input shares, we have reduced the total number of shares by almost half. All output bits are refreshed using the ring re-masking from Eq. (1), requiring 68 random bits per S-box. Since the rest of the PRINCE core uses three shares (see Sect. 4.1.7), we generate five extra shares before the S-box input, which consumes an extra 20 random bits. Therefore, the whole S-box evaluation uses 88 random bits. Again, the exact sharing will be provided as online Verilog code [12].

4.1.5 First-order secure \(d+1\) TI of PRINCE S-box

To implement the first-order secure masking of the PRINCE S-box with \(d=1\), we use the algorithm described in Sect. 3.2 to obtain a conjugate \(D^4_3\)-table. This table represents an optimal solution with two input shares and eight output shares for each input/output bit of the S-box. Recall that the PRINCE S-box is a \(4 \times 4\)-bit S-box of degree 3.

All output bits are refreshed using the mask refreshing given in Eq. (2), requiring 7 bits of randomness per output bit, or 28 bits per S-box in total. The optimal sharing is given in Eq. (14) as a conjugate \(D^4_3\)-table. The exact sharing will again be provided as online Verilog code [12].

$$\begin{aligned} (x,y,z,w) \nonumber \\ (0,0,0,0) \nonumber \\ (1,1,0,0) \nonumber \\ (0,1,1,0) \nonumber \\ (1,0,1,0) \nonumber \\ (0,0,1,1) \nonumber \\ (1,1,1,1) \nonumber \\ (0,1,0,1) \nonumber \\ (1,0,0,1). \end{aligned}$$
(14)

As an example, consider the first coordinate function of the PRINCE S-box as given in Eq. (6): \(o^1 = 1 \oplus wz \oplus y \oplus zy \oplus wzy \oplus x \oplus wx \oplus yx\). Then, the optimal sharing is obtained as follows:

$$\begin{aligned} o_1^1 =&1 + w_0z_0 + y_0 + z_0y_0 + w_0z_0y_0 + x_0 + w_0x_0 + y_0x_0 \nonumber \\ o_2^1 =&y_1 + z_0y_1 + w_0z_0y_1 + x_1 + w_0x_1 + y_1x_1 \nonumber \\ o_3^1 =&w_0z_1 + z_1y_1 + w_0z_1y_1 + y_1x_0 \nonumber \\ o_4^1 =&z_1y_0 + w_0z_1y_0 + y_0x_1 \nonumber \\ o_5^1 =&w_1z_1 + w_1z_1y_0 + w_1x_0 \nonumber \\ o_6^1 =&w_1z_1y_1 + w_1x_1 \nonumber \\ o_7^1 =&w_1z_0 + w_1z_0y_1 \nonumber \\ o_8^1 =&w_1z_0y_0. \end{aligned}$$
(15)
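The correctness of such a sharing can be verified exhaustively: for every assignment of the input shares, the XOR of the eight output shares must equal \(o^1\) evaluated on the recombined inputs. A small sketch of this check, under the convention \(x = x_0 \oplus x_1\) and likewise for the other variables:

```python
from itertools import product

def o1(x, y, z, w):
    # First coordinate function: o^1 = 1 + wz + y + zy + wzy + x + wx + yx over GF(2)
    return 1 ^ (w & z) ^ y ^ (z & y) ^ (w & z & y) ^ x ^ (w & x) ^ (y & x)

def o1_shares(x0, x1, y0, y1, z0, z1, w0, w1):
    # The eight output shares of Eq. (15)
    return [
        1 ^ (w0 & z0) ^ y0 ^ (z0 & y0) ^ (w0 & z0 & y0) ^ x0 ^ (w0 & x0) ^ (y0 & x0),
        y1 ^ (z0 & y1) ^ (w0 & z0 & y1) ^ x1 ^ (w0 & x1) ^ (y1 & x1),
        (w0 & z1) ^ (z1 & y1) ^ (w0 & z1 & y1) ^ (y1 & x0),
        (z1 & y0) ^ (w0 & z1 & y0) ^ (y0 & x1),
        (w1 & z1) ^ (w1 & z1 & y0) ^ (w1 & x0),
        (w1 & z1 & y1) ^ (w1 & x1),
        (w1 & z0) ^ (w1 & z0 & y1),
        (w1 & z0 & y0),
    ]

# Exhaustive check over all 2^8 assignments of the input shares.
for bits in product((0, 1), repeat=8):
    x0, x1, y0, y1, z0, z1, w0, w1 = bits
    total = 0
    for s in o1_shares(*bits):
        total ^= s
    assert total == o1(x0 ^ x1, y0 ^ y1, z0 ^ z1, w0 ^ w1)
```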

Continuing with the second bit’s algebraic function \(o^2 = 1 + yw + yz + xz + yzw + xyz\), the optimal sharing is:

$$\begin{aligned} o_1^2 =&1 + y_0w_0 + y_0z_0 + x_0z_0 + y_0z_0w_0 + x_0y_0z_0 \nonumber \\ o_2^2 =&y_1z_0w_0 + x_1y_1z_0 \nonumber \\ o_3^2 =&y_1w_0 + y_1z_1 + y_1z_1w_0 + x_0y_1z_1 \nonumber \\ o_4^2 =&x_1z_1 + y_0z_1w_0 + x_1y_0z_1 \nonumber \\ o_5^2 =&y_0w_1 + y_0z_1 + x_0z_1 + y_0z_1w_1 + x_0y_0z_1 \nonumber \\ o_6^2 =&y_1z_1w_1 + x_1y_1z_1 \nonumber \\ o_7^2 =&y_1w_1 + y_1z_0 + y_1z_0w_1 + x_0y_1z_0 \nonumber \\ o_8^2 =&x_1z_0 + y_0z_0w_1 + x_1y_0z_0. \end{aligned}$$
(16)

The optimal sharing for the third bit, with algebraic function \(o^3 = w + x + zw + xw + xz + xzw + xyz\), is:

$$\begin{aligned} o_1^3 =&w_0 + x_0 + z_0w_0 + x_0w_0 + x_0z_0 + x_0z_0w_0 + x_0y_0z_0 \nonumber \\ o_2^3 =&x_1z_0w_0 + x_1y_1z_0 \nonumber \\ o_3^3 =&z_1w_0 + x_0z_1w_0 + x_0y_1z_1 \nonumber \\ o_4^3 =&x_1w_0 + x_1z_1 + x_1z_1w_0 + x_1y_0z_1 \nonumber \\ o_5^3 =&w_1 + z_1w_1 + x_0w_1 + x_0z_1 + x_0z_1w_1 + x_0y_0z_1 \nonumber \\ o_6^3 =&x_1z_1w_1 + x_1y_1z_1 \nonumber \\ o_7^3 =&z_0w_1 + x_0z_0w_1 + x_0y_1z_0 \nonumber \\ o_8^3 =&x_1 + x_1w_1 + x_1z_1 + x_1z_0w_1 + x_1y_0z_0. \end{aligned}$$
(17)

Finally, for the fourth bit of the PRINCE S-box and its function \(o^4 = 1 + z + x + yz + xy + yzw + xzw + xyw\), the optimal sharing is:

$$\begin{aligned} o_1^4 =&1 + z_0 + x_0 + y_0z_0 + x_0y_0 + y_0z_0w_0 + x_0z_0w_0 + x_0y_0w_0 \nonumber \\ o_2^4 =&x_1y_1 + y_1z_0w_0 + x_1z_0w_0 + x_1y_1w_0 \nonumber \\ o_3^4 =&y_1z_1 + y_1z_1w_0 + x_0z_1w_0 + x_0y_1w_0 \nonumber \\ o_4^4 =&y_0z_1w_0 + x_1z_1w_0 + x_1y_0w_0 \nonumber \\ o_5^4 =&z_1 + y_0z_1 + y_0z_1w_1 + x_0z_1w_1 + x_0y_0w_1 \nonumber \\ o_6^4 =&y_1z_1w_1 + x_1z_1w_1 + x_1y_1w_1 \nonumber \\ o_7^4 =&y_1z_0 + x_0y_1 + y_1z_0w_1 + x_0z_0w_1 + x_0y_1w_1 \nonumber \\ o_8^4 =&x_1 + x_1y_0 + y_0z_0w_1 + x_1z_0w_1 + x_1y_0w_1. \end{aligned}$$
(18)

Note that the sharing of the cubic terms is unique, while there are multiple options for the sharings of the lower-degree terms, which is why one needs to avoid repetitions.

4.1.6 Second-order secure \(d+1\) TI of PRINCE S-box

For the second-order secure masking of the PRINCE S-box, with \(d=2\), we again use the algorithm described in Sect. 3.2 to obtain the conjugate \(D^4_3\)-table. This table represents an optimal solution with 3 input shares and 27 output shares for each input/output bit of the S-box. All output bits are refreshed using the ring re-masking given in Eq. (1), requiring 108 random bits for the whole S-box. The optimal sharing is given in Eq. (19) as a conjugate \(D^4_3\)-table, where the same rules apply as in the previous section for the \(d=1\) case. The exact sharing will be provided as online Verilog code [12].

$$\begin{aligned}&(x,y,z,w)\qquad&&\nonumber \\&(0,0,0,0)\qquad&(0,0,1,1)\qquad&(0,0,2,2) \nonumber \\&(1,1,0,0)\qquad&(1,1,1,1)\qquad&(1,1,2,2) \nonumber \\&(2,2,0,0)\qquad&(2,2,1,1)\qquad&(2,2,2,2) \nonumber \\&(0,1,1,0)\qquad&(0,1,2,1)\qquad&(0,1,0,2) \nonumber \\&(1,2,1,0)\qquad&(1,2,2,1)\qquad&(1,2,0,2) \nonumber \\&(2,0,1,0)\qquad&(2,0,2,1)\qquad&(2,0,0,2) \nonumber \\&(0,2,2,0)\qquad&(0,2,0,1)\qquad&(0,2,1,2) \nonumber \\&(1,0,2,0)\qquad&(1,0,0,1)\qquad&(1,0,1,2) \nonumber \\&(2,1,2,0)\qquad&(2,1,0,1)\qquad&(2,1,1,2). \end{aligned}$$
(19)

4.1.7 Protected implementations of the PRINCE cipher

Fig. 8
figure 8

TI PRINCE round-based architecture with decomposition

Figure 8 depicts the data path of the four protected round-based hardware implementations of PRINCE that use the S-box decomposition. All data lines have a width of \(64 \times s\) bits, where s is the number of input shares. The only exception is the RC constant output, which has a width of 64 bits. The shared nonlinear layer, followed by the linear layer, re-masking and compression, is denoted as the NLRC block in Fig. 8. Hardware implementations of the NLRC layers of \(Q_{294}\) are discussed in “Appendix B.”

In order to support both encryption and decryption, the input and output whitening keys \(k_{wi}\) and \(k_{wo}\) are either \(k_0\) or \(k_0^\prime \), depending on which operation is being executed. We only require one extra multiplexer to implement this feature. When evaluating the S-box, the data path of the multiplexers is \(\alpha _1 - \beta _2 - \delta _1\) in the first, \(\alpha _2 - \delta _2\) in the second and \(\alpha _2 - \delta _3\) in the third clock cycle, except in the first round where the third cycle path is \(\alpha _1 - \beta _1 - \delta _1\). Similarly, when evaluating the inverse S-box, the active inputs of the multiplexers are \(\alpha _3 - \gamma _1 - \delta _4\) in the first, \(\alpha _2 - \delta _2\) in the second, and \(\alpha _2 - \delta _3\) in the third clock cycle, except in the last round where the path during the third cycle is \(\alpha _2 - \gamma _2 - \delta _4\).

For \(td+1\) implementations, we use three and five shares, respectively, for the affine operations in order to reduce the amount of randomness required for the execution. This incurs an additional penalty in the area occupied by the implementation. Recall that the output of the S-box component functions for \(td+1\) TI is shared with three and ten shares, respectively, for the first- and the second-order secure implementations. Re-masking and compression are done only for the second-order \(td+1\) TI. The \(d+1\) implementations use two and three shares for the first- and the second-order secure implementation, respectively. The output of the S-box component functions is shared with four and nine shares, respectively, for the first- and the second-order secure implementations. Re-masking and compression are required in both cases.

The round constant is added to only one of the shares. The key is shared with the same number of shares as the plaintext. Unlike most of the secure implementations available in the literature, in this paper we focus on the round-based implementation instead of the serialized one. This greatly reduces the execution time, at the expense of increased area and an increased amount of randomness required per clock cycle. In order to decrease area, we employ multiplexers to avoid instantiating additional registers for the three stages of the S-box evaluation. Since PRINCE has 12 rounds, and the S-box evaluations of each round take 3 stages (for the \(td+1\) TI with \(d=1\)) or \(2\times 3\) stages (for the \(d+1\) TIs, and for the \(td+1\) TI with \(d>1\)), the total execution takes 36 or 72 clock cycles.

Fig. 9
figure 9

TI PRINCE round-based architecture without decomposition

Figure 9 represents the architecture of the four protected round-based implementations of PRINCE without S-box decomposition. The architecture is almost the same as for the unprotected design. The implementations of the NLRC layers of the PRINCE S-box are discussed in Sects. 4.1.3–4.1.6.

When evaluating the S-box, the data path of the multiplexers is \(\alpha _1 - \beta _2 - \delta _1\), except in the first round, where the path in the first clock cycle is \(\alpha _1 - \beta _1 - \delta _1\). Similarly, when evaluating the inverse S-box, the active inputs of the multiplexers are \(\alpha _2 - \gamma _1 - \delta _2\), except in the first cycle of the last round, where the path is \(\alpha _1 - \gamma _2 - \delta _2\). Unlike in the unprotected version, the S-box evaluation takes two cycles (the S-box layer and the linear layer are separated into two cycles); hence, one encryption/decryption operation takes 24 cycles. The exception is the first-order \(td+1\) implementation, where the S-box evaluation takes one cycle, making the encryption/decryption latency 12 cycles.

For \(td+1\) implementations, we use four and three shares, respectively, for the affine operations. Recall that the output of the S-box component functions for \(td+1\) TI is shared with 4 and 17 shares, respectively, for the first- and the second-order secure implementations. Compression is required only for the second-order \(td+1\) implementation, while re-masking is applied for both of them. The \(d+1\) implementations use two and three shares for the first- and the second-order secure implementation, respectively. Recall that the output of the S-box component functions is shared with 8 and 27 shares, respectively. Re-masking and compression are required in both cases.

4.1.8 Randomness reduction

The resharing of the first-order secure implementation is performed according to the DOM [24] rules, in which complementary domains are re-masked using the same randomness and output shares containing only one domain are not re-masked. It can be seen from Eq. (14) that output shares \(o_1, o_2, o_3, o_4\) have the complementary domains of shares \(o_6, o_5, o_8, o_7\), respectively. Considering eight output shares of 4-bit length, the re-masking is given by Eq. (20), where \(o_i\) and \(ro_i\) are the S-box outputs before and after re-masking and the \(r_i\) are random 4-bit values, requiring 12 random bits in total. Recombination is achieved by adding shares \(ro_1, ro_2, ro_3, ro_4\) into one recombined share and \(ro_5, ro_6, ro_7, ro_8\) into another.

$$\begin{aligned} ro_1&= o_1 \qquad ro_2 = o_2 + r_1 \qquad ro_3 = o_3 + r_2 \qquad ro_4 = o_4 + r_3\nonumber \\ ro_5&= o_5 + r_1 \qquad ro_6 = o_6 \qquad ro_7 = o_7 + r_3 \qquad ro_8 = o_8 + r_2. \end{aligned}$$
(20)
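A quick sketch confirms that this re-masking preserves the secret: every \(r_i\) is added to exactly two shares, so it cancels in the XOR of all eight re-masked shares, and likewise in the XOR of the two recombined shares:

```python
import random

def remask(o, r):
    """Eq. (20): re-mask eight 4-bit output shares o[0..7] with three
    4-bit random values r[0..2], pairing complementary domains."""
    o1, o2, o3, o4, o5, o6, o7, o8 = o
    r1, r2, r3 = r
    return [o1, o2 ^ r1, o3 ^ r2, o4 ^ r3,
            o5 ^ r1, o6, o7 ^ r3, o8 ^ r2]

rng = random.Random(1)
for _ in range(100):
    o = [rng.randrange(16) for _ in range(8)]   # eight 4-bit shares
    r = [rng.randrange(16) for _ in range(3)]   # 12 random bits in total
    ro = remask(o, r)
    unmasked = 0
    for s in o:
        unmasked ^= s
    # Recombine into two shares, then check each r_i has cancelled out.
    recombined = (ro[0] ^ ro[1] ^ ro[2] ^ ro[3]) ^ (ro[4] ^ ro[5] ^ ro[6] ^ ro[7])
    assert recombined == unmasked
```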

Let us take a closer look at the PRINCE round structure. As explained in Sect. 2.4, the mixing layer consists of the matrices M, \(M^\prime \) or \(M^{-1}\). Recall that M can be obtained from \(M^\prime \) using the nibble shuffling operation SR, i.e., \(M = \hbox {SR} \circ M^\prime \). The involutive \(64 \times 64\) matrix \(M^\prime \) is constructed as a block diagonal matrix with entries \((M_0, M_1, M_1, M_0)\), where \(M_0\) and \(M_1\) are \(16 \times 16\) matrices.

This structure implies that 16-bit chunks of the state are processed independently. Therefore, we can use the same randomness for all four 16-bit blocks, for attackers of order \(d=1\) and \(d=2\).

Namely, assuming the PRINCE state is composed of 16 nibbles enumerated from 0 to 15, the following four groups can be formed: (0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11) and (12, 13, 14, 15). When evaluating the S-boxes in a given group, the randomness required can be reused for the evaluation of the S-boxes in the other groups. It can also be observed that the nibble shuffling

$$\begin{aligned}&\hbox {SR}: (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)\\ {}&\quad \rightarrow (0,5,10,15,4,9,14,3,8,13,2,7,12,1,6,11)\\&\hbox {SR}^{-1}: (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)\\ {}&\quad \rightarrow (0,13,10,7,4,1,14,11,8,5,2,15,12,9,6,3) \end{aligned}$$

does not cause mixing of the S-box outputs obtained with the same randomness. Hence, using this structure in a round-based implementation reduces the amount of randomness by a factor of four. The probing model does not accurately reflect hardware specifics such as glitches and the parallelism of an implementation; note that the former increases the attacker’s capabilities, while the latter diminishes them. We will nevertheless use this attacker model in the arguments below.
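This no-mixing property can be checked mechanically. Under the assumption that nibbles whose indices agree modulo 4 (i.e., nibbles i, i+4, i+8, i+12) reuse the same randomness, no output column of SR or \(\hbox {SR}^{-1}\) contains two nibbles that share randomness:

```python
# SR and its inverse, read as "output position j receives input nibble P[j]".
SR     = [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11]
SR_INV = [0, 13, 10, 7, 4, 1, 14, 11, 8, 5, 2, 15, 12, 9, 6, 3]

# Sanity check: the two permutations are mutually inverse.
assert all(SR_INV[SR[j]] == j for j in range(16))

# Assumption: two nibbles share randomness iff their indices agree mod 4.
# No-mixing: within every output column (4 consecutive positions), the
# four source nibbles must all use distinct randomness.
for perm in (SR, SR_INV):
    for col in range(4):
        sources = perm[4 * col: 4 * col + 4]
        assert len({i % 4 for i in sources}) == 4
```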

When we consider a first-order attacker, he can probe one share out of two in a given cycle; thus, the reuse of randomness is not exploitable. A second-order attacker is able to obtain either (a) two shares out of three of one nibble or (b) one share each of two nibbles that use the same randomness in a given cycle. In case (a), the attacker again cannot exploit the reuse of randomness, since he learns nothing about the second value that could be combined with his knowledge of the first. In case (b), the attacker is unable to mount a bivariate attack using points from different rounds (and hence cycles), due to the re-masking after each operation and the key addition (all performed in the same cycle), and since the nibble shuffling does not mix the S-box outputs within a round.

4.2 Results

To demonstrate our results, we first use the 90-nm CMOS library provided by TSMC and consider the worst power-voltage-temperature (PVT) corner case (a temperature of +125\(^{\circ }\)C and a supply voltage of 1.0 V). Note that the worst corner case is used in almost all industrial applications. Most scientific publications, however, report a typical corner case, giving an optimistic estimate of what will eventually be manufactured in silicon. In order to have a fair comparison, we have also synthesized our designs, as well as the previously existing TI PRINCE implementation [33], using the TSMC 90-nm library at the typical case of +25\(^{\circ }\)C. The authors of [33] provided us with their implementations, allowing for an apples-to-apples comparison of the best designs using the same compiler and library, as the synthesis results for the design presented in [33] differ from the original paper.

For synthesis, we use Cadence Encounter RTL Compiler version 14.20-s034 to evaluate the proposed architectures. The designs are synthesized at an operating frequency of 10 MHz, and the power consumption is estimated by simulating a back-annotated post-synthesis netlist with 100 random test vectors, using Cadence Incisive Enterprise Simulator version 15.10.006. Energy is calculated for one complete encryption/decryption operation. Table 9 shows the area, power and energy consumption, the number of random bits required per clock cycle and the maximum frequency for all the hardware implementations, measured at the worst PVT corner case. All the designs have their unconstrained critical paths well below 100 ns; hence, collecting area figures and power/energy consumption at a frequency of 10 MHz guarantees a fair comparison.

It should be noted that the area, power and energy consumption of the PRNG are not included in Tables 9 and 10, which biases the obtained results in favor of solutions requiring more randomness. In practice, one must take the impact of the PRNG into account, and higher-throughput PRNGs are expected to consume more area, power and energy. However, in most security applications, the PRNG is a component shared between multiple resources, making its impact on the overall area, power and energy consumption limited.

Table 9 Area/power/energy/randomness/ latency/max frequency comparison at worst-case PVT
Table 10 Area/power/energy/randomness/ latency/max frequency comparison at normal case PVT

As expected, the first-order \(d+1\) TI design with S-box decomposition occupies the smallest area of all the secure implementations. Compared to the first-order \(td+1\) TI architecture with S-box decomposition, this comes at the cost of extra randomness.

We make an interesting observation when comparing the energy consumption of the different architectures. The smallest energy consumption, 238 pJ, is achieved by the first-order secure \(d+1\) TI architecture without S-box decomposition presented here. This is closely followed by the design from [33] with S-box decomposition, at 264 pJ. We attribute the low power consumption of that design to the absence of randomness needed for resharing, despite the area of both versions of the first-order \(td+1\) TI architecture with S-box decomposition being larger than that of several other designs in Table 9. The absence of randomness greatly reduces the switching activity of the circuit, lowering the power consumption considerably. Still, we observe that the first-order \(d+1\) TI architecture without S-box decomposition consumes only 5% more energy, while being 33% faster. Another interesting observation is that the first-order secure designs consume considerably less energy than the second-order designs.

For second-order designs, those without decomposition incur large area overheads (particularly in the \(td+1\) scenario) and require a high number of random bits compared to simpler designs. We conclude that the \(d+1\) designs remain interesting implementation choices if enough randomness can be provided. The second-order \(td+1\) design, on the other hand, is quite impractical due to its large area overhead and its high power and energy consumption.

One can see that all protected designs except the first-order \(td+1\) version without S-box decomposition have maximum frequencies within 20% of each other. The first-order \(td+1\) design without S-box decomposition has a lower maximum frequency because it lacks a register in front of the S-box. The implementation from [33] also has a shorter critical path than our designs. For all implementations, the critical path runs from the round counter to the S-box input register; in the first-order \(td+1\) case without S-box decomposition there is no S-box input register, which makes the critical path longer. Still, even with this limitation, the first-order \(td+1\) version has a smaller total latency than the other designs.

Compared to our designs, the design described in [33] stores the key in plain, requiring less area for key storage. We leave aside the question of whether this is a good approach from the security point of view, but it certainly has an impact on the area, power and energy consumption. In addition, the authors of [33] proposed different affine transformations of the decomposed S-boxes, and their architecture has a simplified interface and control logic. This is why their design is considerably smaller and has lower power consumption than the first-order \(td+1\) version of our proposal with S-box decomposition. When energy consumption is compared, however, the two designs perform similarly, with the design of [33] being 2.3% more efficient. This is because our first-order \(td+1\) design with S-box decomposition is 10% faster in terms of the required number of clock cycles.

Another interesting observation is that our first-order \(td+1\) design with S-box decomposition has lower power consumption than four other designs in Table 9, despite its larger area. As discussed previously, this is because no additional randomness is required during the encryption/decryption process. We have investigated the impact of mask refreshing further and concluded that adding (or removing) mask refreshing can change the power consumption by up to a factor of two. An example is the first-order \(d+1\) TI without S-box decomposition, whose power/energy consumption drops by 40% if the random inputs are set to zero. Hence, when low power/energy consumption is the main requirement, using a uniform sharing is the best approach.

Table 9 also clearly shows the difference between power and energy consumption. The most extreme example is the comparison between the first-order \(td+1\) design without S-box decomposition and the \(d+1\) design with S-box decomposition. Although the \(td+1\) design without S-box decomposition has almost six times the power consumption, it has a slightly smaller energy consumption, as it takes six times fewer clock cycles to complete.
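This trade-off follows directly from energy being the product of average power and execution time. A minimal sketch of the calculation (the power and cycle figures below are illustrative placeholders, not the exact values from Table 9):

```python
# Energy per operation = average power x execution time,
# where execution time = clock cycles / clock frequency.
F_CLK = 10e6  # synthesis frequency used in the evaluation, 10 MHz

def energy_pj(power_uw, cycles, f_clk=F_CLK):
    """Energy in picojoules for one complete operation."""
    time_s = cycles / f_clk
    return power_uw * 1e-6 * time_s * 1e12

# Illustrative figures: a design with almost 6x the power but 6x fewer
# cycles ends up with slightly *lower* energy per operation.
e_fast = energy_pj(power_uw=58.0, cycles=6)    # fast, power-hungry
e_slow = energy_pj(power_uw=10.0, cycles=36)   # slow, low-power
assert e_fast < e_slow
```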

As the reported figures show, adding side-channel countermeasures increases the size of the unprotected PRINCE by at least a factor of 2.5. Note that all designs except the first-order \(td+1\) version without S-box decomposition also pay a penalty in extra clock cycles.

The fastest unprotected PRINCE with the worst-case PVT synthesis takes 42.2 ns, followed by the first-order \(td+1\) TI without decomposition at 58.8 ns, i.e., a 39% latency increase; next is the second-order \(d+1\) TI without decomposition at 82.2 ns, i.e., an additional 41% latency increase. Moreover, all designs without S-box decomposition have significantly smaller latency than the implementation presented in [33], a 1.4x to 2x performance increase.

Table 10 shows the area, power and energy consumption, the number of random bits required per clock cycle and the maximum frequency for three hardware implementations, one by Moradi [33] and two newly proposed ones, all measured at the typical PVT corner.

At the maximum frequency, our first-order design surpasses the previous state of the art by reducing latency by almost a third. The energy consumption of our first-order design at a frequency of 10 MHz is lower by almost 10%. On the other hand, the implementation from [33] beats our version with respect to area, power consumption, maximum operating frequency and required randomness. It could also potentially achieve higher throughput: with small modifications to the finite-state machine, it can process three messages at once. Given that our goal was to minimize implementation latency and energy, these results are not surprising.

Table 11 Sharing indices of best shares for security order \(d=1\)
Table 12 Sharing indices of best shares for security order \(d=2\), part 1
Table 13 Sharing indices of best shares for security order \(d=2\), part 2
Table 14 Sharing indices of best shares for security order \(d=2\), \(n=8\), \(t=6\)
Fig. 10 Example power trace waveform used to perform the t-test on first-order PRINCE

Fig. 11 Leakage detection test results on first-order PRINCE. PRNG off (left) and PRNG on (right). First-order (top) and second-order (bottom) t-test results

4.3 Side-channel evaluation

We first evaluate the first-order PRINCE without S-box decomposition using the optimal \(d+1\) sharing; the design was programmed onto a Xilinx Spartan-6 FPGA on a Sakura-G board. The design is split across two FPGAs to minimize noise: one performs the PRINCE encryption, while the second handles the I/O and the start signal. Our core runs at 3.072 MHz, and the sampling rate is 500 million samples per second. Since one trace consists of 2500 points, we are able to cover the first seven rounds of the execution. The power waveform is given in Fig. 10.

We apply a non-specific leakage detection test [13] on the input plaintext following the standard methodology [43]; the resulting t-test graphs are shown in Fig. 11. First, we turn the PRNG off to verify the validity of the setup, and leakage is detected with 1 million traces. The left-hand side of Fig. 11 shows a strong first-order leakage during the loading of the plaintext and the key. This is expected: with the PRNG off, one share of both the key and the plaintext equals the unshared value, while the other share is zero. Another strong peak occurs during the first S-box execution, where the correlation with the input is still high. Leakage is present in later rounds as well due to the lack of additional randomness, although it becomes smaller, likely due to jitter and misalignment in later rounds. Second-order leakage can also be observed when the masks are off. With the PRNG on, no first-order leakage is detected after 100 million traces, while second-order leakage is observed, as expected.
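The core of such a non-specific (fixed-vs-random) leakage test is a per-sample Welch's t-statistic computed over the two trace sets, with the conventional \(\pm 4.5\) threshold. The sketch below is a simplified stand-in for the actual evaluation scripts, using synthetic Gaussian traces with one artificially injected leak:

```python
import numpy as np

def welch_t(fixed, random_):
    """Per-sample Welch's t-statistic between the fixed-input and
    random-input trace sets, each of shape (n_traces, n_samples)."""
    m_f, m_r = fixed.mean(axis=0), random_.mean(axis=0)
    v_f, v_r = fixed.var(axis=0, ddof=1), random_.var(axis=0, ddof=1)
    n_f, n_r = fixed.shape[0], random_.shape[0]
    return (m_f - m_r) / np.sqrt(v_f / n_f + v_r / n_r)

# Synthetic traces standing in for measured power waveforms
# (2500 points per trace, as in the FPGA setup above).
rng = np.random.default_rng(1)
fixed = rng.normal(size=(5000, 2500))
rand = rng.normal(size=(5000, 2500))
fixed[:, 100] += 0.2  # inject an artificial "leak" at sample point 100

t = welch_t(fixed, rand)
# Sample points with |t| > 4.5 are flagged as potentially leaking.
leaky_points = np.flatnonzero(np.abs(t) > 4.5)
```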

Fig. 12 Leakage detection test results on second-order PRINCE. PRNG off (left) and PRNG on (right). First-, second- and third-order (top, middle, bottom) t-test results

Due to the size and the amount of randomness needed, the second-order design did not fit onto the same FPGA board. Instead, the design is tested against simulated power traces. We estimated the power consumption by running a post-synthesis simulation with a back-annotated netlist. Input-to-output timing delays and the current consumption of every gate in the netlist were taken into account and modeled as specified by the technology liberty timing file. In our simulations, one clock cycle is represented by 50 sample points, and we cover the first seven rounds of the execution. One million traces were obtained with the PRNG switched on, and two thousand traces with the PRNG off. Simulated traces are perfectly aligned and contain no measurement noise, and the numerical noise of the samples is minimized by the 32-bit floating-point representation, compared to the 8-bit resolution obtained from the FPGA setup.

The t-test results for the second-order implementation are shown in Fig. 12. We notice that with the PRNG off, leakage occurs in all orders with only two thousand traces. With the PRNG on, the design exhibits no leakage in the first- and second-order t-tests, while several points leak in the third order. More precisely, third-order leakage occurs during the writing of the S-box output to the register every other cycle.

5 Conclusion and outlook

As discussed in [33], designing low-latency side-channel protection in general, and for the PRINCE block cipher in particular, has been identified as an open problem. In this work, we have presented the fastest round-based first- and second-order secure implementations of PRINCE, using \(td+1\) and \(d+1\) TI sharing, respectively, as well as the most energy-efficient round-based first-order secure implementation of PRINCE using \(d+1\) TI sharing. We have introduced several methods for optimizing threshold implementations, which make low-latency, low-energy and higher-throughput side-channel-secure designs practical. We have then investigated several trade-offs that occur in side-channel-secure designs. In particular, we discuss the energy consumption of the given implementations, an important factor in several applications, such as battery-powered devices.

On the methods for optimized TI sharings: First, we provided an algorithm that produces a \(d+1\) TI sharing with the optimal (minimum) number of output shares for any n-input Boolean function of degree \(t=n-1\) and for any security order d. Second, for \(t<n-1\), we presented a discrete-optimization-based methodology that can be used to find good, and in many cases optimal, sharings of Boolean functions of up to 8 bits. Third, we presented a heuristic for minimizing the number of output shares for higher-order \(td+1\) TI. We highlight that these contributions are of general interest, since the method of minimizing the number of output shares can be applied to any cryptographic design. The last optimization concerns the secure AES scheduling, which achieves maximum throughput for a serial implementation. The proposed algorithms should be of use during the design of low-latency side-channel-protected hardware implementations of many symmetric designs, since almost all symmetric algorithms in use incorporate S-boxes with no more than eight input bits. We would like to add that improving the heuristics to provide sharing solutions for security orders higher than two would be beneficial, especially for software implementations, which are typically realized at higher security orders [5].
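To make the share-count discussion concrete, consider a first-order (\(d=1\)) \(d+1\) sharing of a single 2-input AND gate, the simplest nonlinear function of degree \(t=2\): each input is split into \(d+1=2\) shares, the expansion produces \((d+1)^t = 4\) cross-product terms, and fresh randomness reshares the cross-domain terms before compression. This is a toy software model of the principle only, not one of the optimized sharings generated by our algorithms (in hardware, a register stage separates expansion from compression):

```python
import secrets

def share(x):
    """Split a 1-bit secret into d+1 = 2 Boolean shares."""
    r = secrets.randbits(1)
    return (x ^ r, r)

def unshare(shares):
    """Recombine Boolean shares by XOR."""
    result = 0
    for s in shares:
        result ^= s
    return result

def and_d1(a, b, fresh):
    """First-order d+1 sharing of AND: (d+1)^2 = 4 cross-products,
    cross-domain terms refreshed with one fresh random bit, then
    compressed back to 2 output shares."""
    a0, a1 = a
    b0, b1 = b
    t00 = a0 & b0
    t01 = (a0 & b1) ^ fresh   # resharing with fresh randomness
    t10 = (a1 & b0) ^ fresh
    t11 = a1 & b1
    return (t00 ^ t01, t10 ^ t11)

# Correctness check: the fresh mask cancels and the shares recombine to x & y.
for x in (0, 1):
    for y in (0, 1):
        assert unshare(and_d1(share(x), share(y), secrets.randbits(1))) == (x & y)
```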

Next, we reported, evaluated and compared hardware figures for eight different TI-protected round-based versions of the PRINCE cipher, namely \(d+1\) and \(td+1\) TI versions, first- and second-order secure, with or without S-box decomposition. The \(td+1\) TI versions tend to consume less randomness, the \(d+1\) TI versions with decomposition achieve lower area and power consumption, and the first-order designs without decomposition have favorable energy consumption. The comparison with the state of the art showed that our designs have more than 30% lower latency than the architecture presented in [33], while the energy consumption is lower by about 10%. It should, however, be noted that the design presented in [33] still has the highest power efficiency reported in the literature.

To summarize, the generic algorithm for achieving the minimal number of output shares is a necessary, but not sufficient, ingredient when designing for low-latency and low-energy applications. Applying TI to higher-degree functions reduces the total clock count, in effect reducing the latency and the energy consumed during one operation. However, the increased circuit complexity enlarges the area and the critical path of the design, which negatively impact energy consumption and latency, respectively. A circuit designer should take all these parameters into consideration, since the optimal design choice heavily depends on the algorithm in question, alongside the constraints imposed upon the design. In the case of the PRINCE block cipher, our work shows that for achieving low latency it is more efficient not to perform S-box decomposition.