Keywords

1 Introduction

ARX, standing for Addition/Rotation/XOR, is a class of symmetric-key algorithms designed using only the following simple operations: modular addition, bitwise rotation and exclusive-OR. In contrast to S-box-based designs, where the only non-linear elements are the substitution tables (S-boxes), ARX designs rely on modular addition as the only source of non-linearity. Notable representatives of the ARX class include the stream ciphers Salsa20 [1] and ChaCha20 [2], the SHA-3 finalists Skein [3] and BLAKE [4] as well as several lightweight block ciphers such as TEA, XTEA [5], etc. Dinu et al. recently reported [6] that the most efficient software implementations on small processors belonged to ciphers from the ARX class: Chaskey-cipher [7] by Mouha et al., speck [8] by the American National Security Agency (NSA) and LEA [9] by the South Korean Electronic and Telecommunications Research Institute.Footnote 1

For the mentioned algorithms, the choice of using the ARX paradigm was based on three observationsFootnote 2. First, getting rid of the table look-ups, associated with S-Box based designs, increases the resilience against side-channel attacks. Second, this design strategy minimizes the total number of operations performed during an encryption, allowing particularly fast software implementations. Finally, the computer code describing such algorithms is very small, making this approach especially appealing for lightweight block ciphers where the memory requirements are the harshest.

Despite the widespread use of ARX ciphers, the following problem has remained open up until now.

Open Problem

Is it possible to design an ARX cipher that is provably secure against single-trail differential and linear cryptanalysis by design?

To the best of our knowledge, there has only been one attempt at tackling this issue. In [10] Biryukov et al. have proposed several ARX constructions for which it is feasible to compute the exact maximum differential and linear probabilities over any number of rounds. However, these constructions are limited to 32-bit blocks. The general case of this problem, addressing any block size, has still remained without a solution.

More generally, the formal understanding of the cryptographic properties of ARX is far less satisfying than that of, for example, S-Box-based substitution-permutation networks (SPN). Indeed, the wide-trail strategy [11] (WTS) and the wide-trail argument [12] provide a way to design S-box based SPNs with provable resilience against differential and linear attacks. It relies on bounding the number of active S-Boxes in a differential (resp. linear) trail and deducing a lower bound on the best expected differential (resp. linear) probability.

Our Contribution. We propose two different strategies to build ARX-based block ciphers with provable bounds on the maximum expected differential and linear probabilities, thus providing a solution to the open problem stated above.

The first strategy is called the Long Trail Strategy (LTS). It borrows the idea of counting the number of active S-Boxes from the wide-trail argument but the overall principle is actually the opposite to the wide-trail strategy as described in [11]. While the WTS dictates the spending of most of the computational resources in the linear layer in order to provide good diffusion between small S-boxes, the LTS advocates the use of large and comparatively expensive S-Boxes in conjunction with cheaper and weaker linear layers. We formalize this method and describe the long-trail argument that can be used to bound the differential and linear trail probabilities of a block cipher built using this strategy.

Using this framework, we build a family of lightweight block ciphers called Sparx. All three instances in this family can be entirely specified using only three operations: addition modulo \(2^{16}\), 16-bit rotations and 16-bit XOR. These ciphers are, to the best of our knowledge, the first ARX-based block ciphers for which the probability of both differential and linear trails are bounded. Furthermore, while one may think that these provable properties imply a performance degradation, we show that it is not the case. On the contrary, Sparx ciphers have very competitive performance on lightweight processors. In fact, the most lightweight version – Sparx-64 is in the top 3 for 16-bit micro-controllers according to the classification method presented in [6].

Finally, we propose the LAX construction, where bit rotations are replaced with a more general linear permutation. The bounds on the differential probability are expressed as a function of the branching number of the linear layer. We note that the key insight behind this construction has been published in [13], but its realization has been left as a challenge.

Outline. First, we introduce the notations and concepts used throughout the paper in Sect. 2. In Sect. 3, we describe how an ARX-based cipher with provable bounds can be built using an S-Box-based approach and how the method used is a particular case of the more general Long Trail Strategy. Section 4 contains the specification of the Sparx family of ciphers, the description of its design rationale and a discussion about the efficiency of its implementation on microcontrollers. The LAX structure is presented in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Preliminaries

We use \(\mathbb {F}_2\) to denote the set \(\{0,1\}\). Let \(f: \mathbb {F}^n_2 \rightarrow \mathbb {F}^n_2\), \((a,b) \in \mathbb {F}^n_2 \times \mathbb {F}^n_2\) and \(x \in \mathbb {F}^n_2\). We denote the probability of the differential trail \((a \overset{d}{\rightarrow }b)\) by \(\text {Pr}[f(x) \oplus f(x \oplus a) = b]\) and the correlation of the linear approximation \((a \overset{\ell }{\rightarrow }b)\) by \(\big (2~\text {Pr}[a \cdot x = b \cdot f(x)] - 1\big )\) where \(y \cdot z\) is the scalar product of y and z.

In an iterated block cipher, not all differential (respectively linear) trails are possible. Indeed, they must be coherent with the overall structure of the round function. For example, it is well known that a 2-round differential trail for the AES with less than 4 active S-Boxes is impossible. To capture this notion, we use the following definition.

Definition 1 (Valid Trail)

Let f be an n-bit permutation. A trail \(a_0 \rightarrow ... \rightarrow a_r\) for r rounds of f is a valid trail if \(\text {Pr}[a_i \rightarrow a_{i+1}] > 0\) for all i in \([0,r-1]\). The set of all valid r-round differential (respectively linear) trails for f is denoted \(\mathcal {V}_{\delta }(f)^{r}\) (resp. \(\mathcal {V}_{\ell }(f)^{r}\)).

We use the acronyms MEDCP and MELCC to denote resp. maximum expected differential characteristic probability and maximum expected linear characteristic correlation – a signature introduced earlier in [14]. The \(\textsc {MEDCP}{}\) of the keyed function \(f_{k_i} : x \mapsto f(x \oplus k_i)\) iterated over r rounds is defined as follows:

$$\begin{aligned} \textsc {MEDCP}(f^{r}) = \max _{(\varDelta _0 \rightarrow ... \varDelta _r) \in \mathcal {V}_{\delta }(f)^{r}} \prod _{i=0}^{r-1} \text {Pr}[\varDelta _i \overset{d}{\rightarrow }\varDelta _{i+1}], \end{aligned}$$

where \(\text {Pr}[\varDelta _i \overset{d}{\rightarrow }\varDelta _{i+1}]\) is the expected value of the differential probability of \(\varDelta _i \overset{d}{\rightarrow }\varDelta _{i+1}\) for the function \(f_k\) when k is picked uniformly at random. \(\textsc {MELCC}(f^{r})\) is defined analogously. Note that \(\textsc {MEDCP}(f^{r})\) and \(\big (\textsc {MEDCP}(f^{1})\big )^r\) are not equal.

As designers, we thrive to provide upper bounds for both \(\textsc {MEDCP}(f^{r})\) and \(\textsc {MELCC}(f^{r})\). Doing so allows us to compute the number of rounds f needed in a block cipher for the probability of all trails to be too low to be usable. In practice, we want \(\textsc {MEDCP}(f^{r}) \ll 2^{-n}\) and \(\textsc {MELCC}(f^{r}) \ll 2^{-n/2}\) where n is the block size.

While this strategy is the best known, the following limitations must be taken into account by algorithm designers.

  1. 1.

    The quantities \(\textsc {MEDCP}(f^{r})\) and \(\textsc {MELCC}(f^{r})\) are relevant only if we make the Markov assumption, meaning that the differential and linear probabilities are independent in each round. This would be true if the subkeys were picked uniformly and independently at random but, as the master key has a limited size, it is not the case.

  2. 2.

    These quantities are averages taken over all possible keys: it is not impossible that there exists a weak key and a differential trail T such that the probability of T is higher than \(\textsc {MEDCP}(f^{r})\) for this particular key. The same holds for the linear probability.

  3. 3.

    These quantities deal with unique trails. However, it is possible that several differential trails share the same input and output differences, thus leading to a higher probability for said differential transition. This so-called differential effect can be leveraged to decrease the data complexity of differential attack. The same holds for linear attacks where several approximations may form a linear hull.

Still, this type of bound is the best that can be achieved in a generic fashion (to the best of our knowledge). In particular, this is the type of bound provided by the wide-trail argument used in the AES.

3 ARX-Based Substitution-Permutation Network

In this section, we present a general design strategy for building ARX-based block ciphers borrowing techniques from SPN design. The general idea is to build a SPN with ARX-based S-boxes instead of with S-boxes based on look-up tables (LUT). The proofs for the bound on the MEDCP and MELCC are inspired by the wide-trail argument introduced in the design of the AES [12]. However, because of the use of large S-Boxes, the method used relies on a different type of interaction between the linear and non-linear layers. We call the corresponding design strategy the long trail strategy. It is quite general and could be also applied in other contexts e.g. for non-arx constructions.

First, we present possible candidates for the ARX-based S-Box and, along the way, identify the likely reason behind the choice of the rotation constants in SPECK-32. Then, we describe the long trail strategy in more details. Finally, we present two different algorithms for computing a bound for the MEDCP and MELCC of block ciphers built using a LT strategy. We also discuss how to ensure that the linear layer provides sufficient diffusion.

3.1 ARX-Boxes

Definition 2

( ARX -box ). An ARXbox is a permutation on m bits (where m is much smaller than the block size) which relies entirely on addition, rotation and XOR to provide both non-linearity and diffusion. An arx-box is a particular type of S-Box.

Possible constructions for arx-boxes can be found in a recent paper by Biryukov et al. [10]. A first one is based on the MIX function of Skein [3] and is called Marx-2. The rotation amounts, namely \(\{1, 2, 7, 3\}\), were chosen so as to minimize the differential and linear probabilities. The key addition is done over the full state. The second construction is called Speckey and consists of one round of Speck-32 [8] with the key added to the full state instead of only to half the state as in the original algorithm. The two constructions Marx-2 and Speckey are shown in Fig. 1a and b. The differential and linear bounds for them are given in Table 1. While it is possible to choose the rotations used in Speckey in such a way as to slightly decrease the differential and linear boundsFootnote 3, such rotations are more expensive on small microcontrollers which only have instructions implementing rotations by 1 and by 8 (in both directions). We infer, although we cannot prove it, that the designers of Speck-32 made similar observations.

Fig. 1.
figure 1

Key addition followed by the candidate 32-bit ARX-boxes, Marx-2 and Speckey. The branch size is 8 bits for Marx-2, 16 bits for Speckey.

Table 1. Maximum expected differential characteristic probabilities (MEDCP) and maximum expected absolute linear characteristic correlations (MELCC) of Marx-2 and Speckey (\(\log _2\) scale); r is the number of rounds.

3.2 Naive Approaches and Their Limitations

A very simple method to build ARX-based ciphers with provable bounds on MEDCP and MELCC is to use a SPN structure where the S-boxes are replaced by ARX operations for which we can compute the MEDCP and MELCC. This is indeed the strategy we follow but care must be taken when actually choosing the ARX-based operations and the linear layer.

Let us for example build a 128-bit block cipher with an S-Box layer consisting in one iteration of Speckey on each 32-bit word and with an MDS linear layer, say a multiplication with the MixColumns matrix with elements in \(GF(2^{32})\) instead of \(GF(2^8)\). The MEDCP bound of such a cipher, computed using a classical wide-trail argument, would be equal to 1! Indeed, there exists probability 1 differentials for 1-round Speckey so that, regardless of the number of active S-Boxes, the bound would remain equal to 1. Such an approach is therefore not viable.

As the problem identified above stems from the use of 1-round Speckey, we now replace it with 3-round Speckey where the iterations are interleaved with the addition of independent round keys. The best linear and differential probabilities are no longer equal to 1, meaning that it is possible to build a secure cipher using the same layer as before provided that enough rounds are used. However, such a cipher would be very inefficient. Indeed, the MDS bound imposes that 5 arx-boxes are active every 2 rounds, so that the MEDP bound is equal to \(p_d^{5r/2}\) where r is the number of rounds and \(p_d\) is the best differential probability of the arx-box (3-rounds Speckey). To push the bound below \(2^{-128}\) we need at least 18 SPN rounds, meaning 54 parallel applications of the basic arx-round! We will show that, with our alternative approach, we can obtain the same bounds with much fewer rounds.

3.3 The Long Trail Design Strategy

Informed by the shortcomings of the naive design strategies described in the previous section, we devised a new method to build ARX-based primitives with provable linear and differential bounds. It is based on the following observation.

Observation 1

(Impact of Long Trails). Let d(r) and \(\ell (r)\) be the MEDCP and MELCC of some arx-box iterated r times and interleaved with the addition of independent subkeys. Then, in most cases:

$$\begin{aligned} d(qr) \ll d(r)^q \text { and } \ell (qr) \ll \ell (r)^q. \end{aligned}$$

In other words, in order to diminish the MEDCP and MELCC of a construction, it is better to allow long trails of arx-boxes without mixing.

For example, if we look at Speckey, the MEDCP for 3 rounds is \(2^{-3}\) and that of 6 rounds is \(2^{-15}\) which is far smaller than \((2^{-3})^2 = 2^{-6}\) (see Table 1). Similarly, the MELCC for 3 rounds is \(2^{-1}\) and after 6 rounds it is \(2^{-7} \ll (2^{-1})^2\).

In fact, a similar observation has been made by Nikolić when designing the CAESAR candidate family Tiaoxin [15]. It was later generalized to larger block sizes in [16], where Jean and Nikolić present, among others, the AES-based \(\mathcal {A}^{2}_{\oplus }\) permutation family. It uses a partial S-Box layer where the S-Box consists of 2 AES rounds and a word-oriented linear layer in such a way that some of the S-Box calls can be chained within 2-round long trails. Thus, they may use the 4-round bound on the number of active 8-bit AES S-Boxes, which is 25, rather than twice the 2-round bound, which would be equal to 10 (see Table 2). Their work on this permutation can be interpreted as a particular case of the observation above.

Table 2. Bound on the number of active 8-bit S-Boxes in a differential (or linear) trail for the AES.

Definition 3 (Long Trail)

We call Long Trail (LT) an uninterrupted sequence of calls to an arx-box interleaved with key additions. No difference can be added into the trail from the outside. Such trails can happen for two reasons.

  1. 1.

    A Static Long Trail occurs with probability 1 because one output word of the linear layer is an unchanged copy of one of its input words.

  2. 2.

    A Dynamic Long Trail occurs within a specific differential trail because one output word of the linear layer consists of the XOR of one of its input words with a non-zero difference and a function of words with a zero difference. In this way the output word of the linear layer is again equal to the input word as in a Static LT, but here this effect has been obtained dynamically.

Definition 4 (Long Trail Strategy)

The Long Trail Strategy is a design guideline: when designing a primitive with a rather weak but large S-Box (say, an ARX-based permutation), it is better to foster the existence of long trails rather than to have maximum diffusion in each linear layer.

This design principle has an obvious caveat: although slow, diffusion is necessary! Unlike the WTS, in this context it is better to trade some of the power of the diffusion layer in favor of facilitating the emergence of long trails.

The long trail strategy is a method for building secure and efficient ciphers using a large but weak S-Box S such that we can bound the MEDCP (and MELCC) of several iterations of \(x \mapsto S(x \oplus k)\) with independent round keys. In this paper, we focus on the case where S consists of arx operations but this strategy could have broader applications such as, as briefly discussed above, the design of block ciphers operating on large blocks using the AES round function as a building block.

In a way, this design method is the direct opposite of the wide trail strategy as it is summarized by Daemen and Rijmen in [11] (emphasis ours):

Instead of spending most of the resources on large S-boxes, the wide trail strategy aims at designing the round transformation(s) such that there are no trails with a low bundle weight. In ciphers designed by the wide trail strategy, a relatively large amount of resources is spent in the linear step to provide high multiple-round diffusion.

The long trail approach minimizes the amount of resources spent in the linear layer and does spend most of the resources on large S-Boxes. Still, as discussed in the next section, the method used to bound the MEDCP and MELCC in the long trail strategy is heavily inspired by the one used in the wide trail strategy.

A Cipher Structure for the LT Strategy. We can build block ciphers based on the long trail strategy using the following two-level structure. First, we must choose an S-Box layer operating on w words in parallel. The composition of a key addition in the full state and the application of this S-Box layer is called a round. Several rounds are iterated and then a word-oriented linear mixing layer is applied to ensure diffusion between the words. The composition of r rounds followed by the linear mixing layer is called a step Footnote 4, as described in Fig. 2. The encryption thus consists in iterating such steps. We used this design strategy to build a block cipher family, Sparx, which we describe in Sect. 4.

Fig. 2.
figure 2

A cipher structure for the LT strategy.

Long Trail-Based Bounds. In what follows we only discuss differential long trails for the sake of brevity. Linear long trails are treated identically.

Definition 5 (Truncated LT Decomposition)

Consider a cipher with a round function operating on w words. A truncated differential trail is a sequence of values of \(\{0,1\}^w\) describing whether an S-Box is active at a given round. The LT Decomposition of a truncated differential trail is obtained by grouping together the words of the differential trails into long trails and then counting how many active long trails of each length are present. It is denoted \(\{t_i \}_{i \ge 1}\) where \(t_i\) is equal to the number of truncated long trails with length i.

Example 1

Consider a 64-bit block cipher using a 32-bit S-Box, one round of Feistel network as its linear layer and 4 steps without a final linear layer. Consider the differential trail \((\delta ^L_0, \delta ^R_0) \rightarrow (\delta ^L_1, \delta ^R_1) \rightarrow (0, \delta ^R_2) \rightarrow (\delta ^L_3, 0)\) (see Fig. 3 where the zero difference is dashed). Then this differential trail can be decomposed into 3 long trails represented in black, blue and red: the first one has length 1 and \(\delta _0^R\) as its input; the second one has length 2 and \(\delta _0^L\) as its input; and the third one has length 3 and \(\delta _1^L\) as its input so that the LT decomposition of this trail is \(\{t_1 = 1, t_2 = 1, t_3 = 1\}\). Using the terminology introduced earlier, the first two trails are Static LT, while the third one is Dynamic LT.

Fig. 3.
figure 3

An example of active LT decomposition.

Theorem 1 (Long Trail Argument)

Consider a truncated differential trail T covering r rounds consisting of an S-Box layer with S-Box S interleaved with key additions and some linear layer. Let \(\{t_i \}_{i \ge 1}\) be the LT decomposition of T. Then the probability \(p_D\) of any fully specified differential trail fitting in T is upper-bounded by

$$\begin{aligned} p_D \le \prod _{i \ge 1} \big (\textsc {MEDCP}(S^{i})\big )^{t_i} \end{aligned}$$

where \(\textsc {MEDCP}(S^{i})\) is an upper-bound on the probability of a differential trail covering i iterations of S.

Proof

Let \(\varDelta _{i, s} \overset{d}{\rightarrow }\varDelta _{j, s+1}\) denote any differential trail occurring at the S-Box level in one step, so that the S-Box with index i at step s sees the transition \(\varDelta _{i, s} \overset{d}{\rightarrow }\varDelta _{j, s+1}\). By definition of a long trail, we have in each long trail a chain of differential trails \(\varDelta _{i_0, s_0} \overset{d}{\rightarrow }\varDelta _{i_1, s_0+1} \overset{d}{\rightarrow }... \overset{d}{\rightarrow }\varDelta _{i_t, s_0+t}\) which, because of the lack of injection of differences from the outside, is a valid trail for t iterations of the S-Box. This means that the probability of any differential trail following the same sequence of S-boxes as in this long trail is upper-bounded by \(\textsc {MEDCP}(S^{t})\). We simply bound the product by the product of the bounds to derive the theorem.\(\square \)

3.4 Choosing the Linear Layer: Bounding the MEDCP and MELCC while Providing Diffusion

In order to remain as general as possible, in this section we do not consider the details of a specific S-Box but instead we focus on fleshing out design criteria for the linear layer. All the information for the S-Box that is necessary to follow the explanation is the MEDCP and MELCC of its r-fold iterations including the key additions e.g. the data provided in Table 1 for our arx-box candidates.

As the linear layers we consider may be weaker than usual designing spn, it is also crucial that we ensure that ciphers built using such a linear layer are not vulnerable to integral attacks [18], in particular those based on the division property [19]. Incidentally, this gives us a criteria quantifying the diffusion provided by several steps of the cipher.

In this section, we propose two methods for bounding the MEDCP and MELCC of several steps of a block cipher. The first one is applicable to any linear layer but is relatively inefficient, while the second one works only for a specific subset of linear layers but is very efficient.

When considering truncated differential trails, it is hard to bound the probability of the event that differences in two or more words cancel each other in the linear layer i.e. the event that a Dynamic LT occurs. Therefore, for simplicity we assume that such cancellations happen for free i.e. with probability 1. Due to this simplification, we expect our bounds to be higher (i.e. looser) than the tight bounds. In other words, we underestimate the security of the cipher. Note that we also exclude the cases where the full state at some round has zero difference as the latter is impossible due to the cipher being a permutation.

Algorithms for Bounding MEDCP and MELCC of a Cipher. In this sub-section we propose generic approaches that do not depend on the number of rounds per step. In fact, to fully avoid the confusion between rounds and steps in what follows we shall simply refer to SPN rounds.

One way to bound the MEDCP and MELCC of a cipher is as follows:

  1. 1.

    Enumerate all possible truncated trails composed of active/inactive S-boxes.

  2. 2.

    Find an optimal decomposition of each trail into long trails (LT).

  3. 3.

    Bound the probability of each trail using the product of the MEDCP (resp. MELCC) of all active long trails i.e. by applying the Long Trail Argument (see Theorem 1) on the corresponding optimal trail decomposition.

  4. 4.

    The maximum bound over all trails is the final upper bound.

This approach is feasible only for a small number of rounds, because the number of trails grows exponentially. The algorithm is based on a recursive dynamic programming approach and has time complexity \(O(wr^2)\), where w is the number of S-Boxes applied in parallel in each S-Box layer and r is the number of rounds.

As noted, the most complicated step in the above procedure is finding an optimal decomposition of a given truncated trail into long trails. The difficulty arises from the so-called branching: situation in which a long trail may be extended in more than one way. Recall that our definition of LT (cf. Definition 3) relies on the fact that there is no linear transformation on a path between two S-Boxes in a LT. The only transformations allowed are some XORs. Therefore, branching happens only when some output word of the linear layer receives two or more active input words without modifications. In order to cut off the branching effect (and thus to make finding the optimal decomposition of a LT feasible), we can put some additional linear functions that will modify the contribution of (some of) the input words. Equivalently, when choosing a linear layer we simply do not consider layers which cause branching of LTs. As we will show later, this restriction has many advantages.

To simplify our study of the linear layer, we introduce a matrix representation for it. In a block cipher operating on w words, a linear layer may be expressed as a \(w\times w\) block matrix. We will denote zero and identity sub-matrices by 0 and 1 respectively and an unspecified (arbitrary) sub-matrices by L. This information is sufficient for analyzing the high-level structure of a cipher. Using this notation, the linear layers to which we restrict our analysis have matrices where each column has at most one 1.

For the special subset of linear layers outlined above, we present an algorithm for obtaining MEDCP and MELCC bounds, that is based on a dynamic programming approach. Since there is no LT branching, any truncated trail consists of disjoint sequences of active S-Boxes. By Observation 1, we can treat each such sequence as a LT to obtain an optimal decomposition. Because of this simplification, we can avoid enumerating all trails by grouping them in a particular way.

We proceed round by round and maintain a set of best trails up to an equivalence relation, which is defined as follows. For all S-Boxes at the current last round s, we assign a number, which is equal to the length of the LT that covers this S-Box, or zero if the S-Box is not active. We say that two truncated trails for s steps are equivalent if the tuples consisting of those numbers (current round s and length of LT) are the same for both trails. This equivalence captures the possibility to replace some prefix of a trail by an equivalent one without breaking the validity of the trail or its LT decomposition. The total probability, however, can change. The key observation here is that from two equivalent trails we can keep only the one with the highest current probability. Indeed, if the optimal truncated trail for all r rounds is an extension of the trail for s rounds with lower probability, we can take the first s rounds from the trail with higher probability without breaking anything and obtain a better trail, which contradicts the assumed optimality.

Due to page limit constraints, the pseudo-code for the algorithm is given in the full version of this paper [20].

This algorithm can be used to bound the probability of linear trails. Propagation of a linear mask through some linear layer can be described by multiplying the mask by the transposed inverse of the linear layer’s matrix. In our matrix notation we can easily transpose the matrix but inversion is harder. However, we can build the linear trails bottom-up (i.e. starting from the last round): in this case we need only the transposed initial matrix. Our algorithm does not depend on the direction, so we obtain bounds on linear trails probabilities by running the algorithm on the transposed matrix using the linear bounds for the iterated S-box.

Ensuring Resilience Against Integral Attacks. As illustrated by the structural attack against SASAS and a recent generalization [21] to ciphers with more rounds, a spn with few rounds may be vulnerable to integral attacks. This attack strategy has been further improved by Todo [19] who proposed the so-called division property as a means to track which bit should be fixed in the input to have a balanced output. He also described an algorithm allowing an attacker to easily find such distinguishers.

We implemented this algorithm to search for division-property-based integral trails covering as many rounds as possible. With it, for each matrix candidate we compute a maximum number of rounds covered by such a distinguisher. This quantity can then be used by the designer of the primitive to see if the level of protection provided against this type of attack is sufficient or not.

Tracking the evolution of the division property through the linear layer requires special care. In order to do this, we first make a copy of each word and apply the required XORs from the copy to the original words. Due to such state expansion, the algorithm requires both a lot of memory and time. In fact, it is even infeasible to apply on some matrices. To overcome this issue, we ran the algorithm with reduced word size. During our experiments, we observed that such an optimization may only result in longer integral characteristics and that this side effect occurs only for very small word sizes (4 or 5 bits). In light of this, we conjecture that the values obtained in these particular cases are upper bounds and are very close to the values which could be obtained without reducing the word size.

4 The Sparx Family of Ciphers

In this Section, we describe a family of block ciphers built using the framework laid out in the previous section. The instance with block size n and key size k is denoted Sparx-\(n/k\).

4.1 High Level View

The plaintexts and ciphertexts consist of \(w = n/32\) words of 32 bits each and the key is divided into \(v = k/32\) such words. The encryption consists of \(n_s\) steps, each composed of an arx-box layer of \(r_a\) rounds and a linear mixing layer. In the arx-box layer, each word of the internal state undergoes \(r_a\) rounds of Speckey, including key additions. The v words in the key state are updated once \(r_a\) arx-boxes have been applied to one word of the internal state. The linear layers \(\lambda _{w}\) for \(w=2,4\) provide linear mixing for the w words of the internal state.

This structure is summarized by the pseudo-code in Algorithm 1. The structure of one round is represented in Fig. 4, where A is the 32-bit arx-box consisting in one unkeyed Speck-32 round. We also use \(A^a\) to denote a rounds of Speckey with the corresponding key additions (see Fig. 5a).

figure a
Fig. 4.
figure 4

A high level view of step s of Sparx.

The different versions of Sparx all share the same definition of A. However, the permutations \(\lambda _{w}\) and \(K_v\) depend on the block and key sizes. The different members of the Sparx-family are specified below. The round keys can either be derived on the fly by applying \(K_v\) on the key state during encryption or they can be precomputed and stored. The first option requires less RAM, while the second is faster. The only operations needed to implement any instance of Sparx are:

  • addition modulo \(2^{16}\), denoted \(\boxplus \),

  • 16-bit exclusive-or (XOR), denoted \(\oplus \), and

  • 16-bit rotation to the left or right by i, denoted respectively \(x \lll i\) and \(x \ggg i\).

We claim that no attack using less than \(2^k\) operations exists against Sparx-\(n/k\) in neither the single-key nor in the related-key setting. We also faithfully declare that we have not hidden any weakness in these ciphers. Sparx is free for use and its source code is available in the public domainFootnote 5.

4.2 Specification

Table 3 summarizes the different Sparx instances and their parameters. The quantity \(\min _{\text {secure}}(n_s)\) corresponds to the minimum number of steps for which we can prove that the MEDCP is below \(2^{-n}\), that the MELCC is below \(2^{-n/2}\) for the number of rounds per step chosen and for which we cannot find integral distinguishers covering this amount of steps.

Table 3. The different Sparx instances.

Sparx-64/128. The lightest instance of Sparx is Sparx-\(64/128\). It operates on two words of 32 bits and uses a 128-bit key. There are 8 steps and 3 rounds per step. As it takes 5 steps to achieve provable security against linear and differential attacks, our security margin is at least equal to \(37\,\%\) of the rounds. Furthermore, while our long trail argument proves that 5 steps are sufficient to ensure that there are no single-trail differential and linear distinguishers, we do not expect this bound to be tight.

The linear layer \(\lambda _{2}\) simply consists of a Feistel round using \(\mathcal {L}\) as a Feistel function. The general structure of a step of Sparx-\(64/128\) is provided in Fig. 5b. The 128-bit permutation used in the key schedule has a simple definition summarized in Fig. 6, where the counter r is initialized to 0. It corresponds to the pseudo code given in Algorithm 2, where \((z)_L\) and \((z)_R\) are the 16-bit left and right halves of the 32-bit word z.

Fig. 5.
figure 5

A high level view of Sparx-\(64/128\). Branches have a width of 16 bits (except for the keys in the step structure).

The \(\mathcal {L}\) function is borrowed from Noekeon [22] and can be defined using 16- or 32-bit rotations. It is defined as a Lai-Massey structure mapping a 32-bit value x||y to \(x \oplus \big ((x \oplus y) \lll 8 \big ) || y \oplus \big ((x \oplus y) \lll 8 \big )\). Alternatively, it can be seen as a mapping of a 32-bit value z to \(z \oplus (z \lll ^{32} 8) \oplus (z \ggg ^{32} 8)\) where the rotations are over 32 bits.

Fig. 6.
figure 6

\(K_{4}^{64}\) (used in Sparx-\(64/128\)).

figure b

Sparx-128/128 and Sparx-128/256. For use cases in which a larger block size can be afforded, we provide Sparx instances with a 128-bit block size and 128- or 256-bit keys. They share an identical step structure which is fairly similar to Sparx-\(64/128\). Indeed, the linear layer relies again on a Feistel function except that \(\mathcal {L}\) is replaced by \(\mathcal {L^{\prime }}\), a permutation of \(\{0,1\}^{64}\). Both Sparx-\(128/128\) and Sparx-\(128/256\) use 4 rounds per step but the first uses 8 steps while the last uses 10.

Fig. 7.
figure 7

The step structure of both Sparx-\(128/128\) and Sparx-\(128/256\).

Fig. 8.
figure 8

The 128-bit permutation \(K_{4}^{128}\) used in Sparx-\(128/128\).

figure c

The Feistel function \(\mathcal {L^{\prime }}\) can be defined as follows. Let a||b||c||d be a 64-bit word where each a, ..., d is 16-bit long. Let \(t = (a \oplus b \oplus c \oplus d) \lll 8\). Then \(\mathcal {L^{\prime }}(a || b || c || d) = c \oplus t ~||~ b \oplus t ~||~ a \oplus t ~||~ d \oplus t\). This function can also be expressed using 32-bit rotations. Let x||y be the concatenation of two 32-bit words and \(\mathcal {L^{\prime }}_b\) denote \(\mathcal {L^{\prime }}\) without its final branch swap. Let \(t = \big ((x \oplus y) \ggg ^{32} 8\big ) \oplus \big ((x \oplus y) \lll ^{32} 8\big )\), then \(\mathcal {L^{\prime }}_b(x || y) = x \oplus t || y \oplus t\). Alternatively, we can use \(\mathcal {L}\) to compute \(\mathcal {L^{\prime }}_b\) as follows: \(\mathcal {L^{\prime }}_b(x || y) = y \oplus \mathcal {L}(x \oplus y) || x \oplus \mathcal {L}(x \oplus y)\).

These two ciphers, Sparx-\(128/128\) and Sparx-\(128/256\), differ only by their number of steps and by their key schedule. The key schedule of Sparx-\(128/128\) needs a 128-bit permutation \(K_4^{128}\) described in Fig. 8 and Algorithm 3 while Sparx-\(128/256\) uses a 256-bit permutation \(K_4^{256}\), which is presented in both Fig. 9 and Algorithm 4.

Fig. 9.
figure 9

The 256-bit permutation \(K_{8}^{256}\) used in Sparx-\(128/256\).

figure d

4.3 Design Rationale

Choosing the Arx- box We chose the round function of Speckey/Speck-32 over Marx-2 because of its superior implementation properties. Indeed, its smaller total number of operations means that a cipher using it needs to do fewer operations when implemented on a 16-bit platform. Ideally, we would have used an arx-box with 32-bit operations but, at the time of writing, no such function has known differential and linear bounds (cf. Table 1) for sufficiently many rounds.

We chose to evaluate the iterations of the arx-box over each branch rather than in parallel because such an order decreases the number of times each 32-bit branch must be loaded in CPU registers. This matters when the number of registers is too small to contain both the full key and the full internal state of the cipher and does not change anything if it is not the case.

Mixing Layer, Number of Steps and Rounds per Step. Our main approach for choosing the mixing layer was exhaustive enumeration of all matrices suitable for our long trail bounding algorithm from Sect. 3.4 and selecting the final matrix according to various criteria, which we will discuss later.

For Sparx-\(64/128\), there is only one linear layer fulfilling our design criteria: one corresponding to a Feistel round. For such a structure, we found that the best integral covers 4 steps (without the last linear layer) and that, with 3 rounds per step, the MEDCP and MELCC are bounded by \(2^{-75}\) and \(2^{-38}\). These quantities imply that no single trail differential or linear distinguisher exists for 5 or more steps of Sparx-\(64/128\).

For Sparx instances with 128-bit block we implemented an exhaustive search on a large subset of all possible linear layers. After some filtering, we arrived at roughly 3000 matrices. For each matrix we ran our algorithm from Sect. 3.4 to obtain bounds on MEDCP and MELCC for different values of the number of rounds per step (\(r_a\)). We also ran the algorithm for searching integral characteristics described in Sect. 3.4.

Then, we analyzed the best matrices and found that there is a matrix which corresponds to a Feistel-like linear layer with the best differential/linear bound for \(r_a=4\). This choice also offered good compromise between other parameters, such as diffusion, strength of the ARX-box, simplicity and easiness/efficiency of implementation. It also generalizes elegantly the linear layer of Sparx-\(64/128\). We thus settled for this Feistel-like function.

For more details on the selection procedure and other interesting candidates for the linear layer we refer the reader to the full version of this paper [20].

The Linear Feistel Functions. The linear layer obtained using the steps described above is only specified at a high level, it remains to define the linear Feistel functions \(\mathcal {L}\) and \(\mathcal {L^{\prime }}\). The function \(\mathcal {L}\) that we have chosen has been used in the Lai-Massey round constituting the linear layer of Noekeon [22]. We reuse it here because it is cheap on lightweight processors as it only necessitates one rotation by 8 bits and 3 XORs. It also provides some diffusion as it has branching number 3. Its alternative representation using 32-bit rotations allows an optimized implementation on 32-bit processors.

Used for a larger block size, the Feistel function \(\mathcal {L^{\prime }}\) is a generalization of \(\mathcal {L}\): it also relies on a Lai-Massey structure as well as a rotation by 8 bits. The reason behind these choices are the same as before: efficiency and diffusion. Furthermore, \(\mathcal {L^{\prime }}\) must also provide diffusion between the branches. While this is achieved by the XORs, we further added a branch swap in the bits of highest weight. This ensures that if only one 32-bit branch is active at the input of \(\mathcal {L^{\prime }}\) then two branches are active in its output. Indeed, there are two possibilities: either the output of the rotation is non-zero, in which case it gets added to the other branch and spreads to the whole state through the branch swap. Otherwise, the output is equal to 0, which means that the two 16-bit branches constituting the non-zero 32-bit branch hold the same non-zero value. These will then be spread over the two output 32-bit branches by the branch swap. The permutation \(\mathcal {L^{\prime }}\) also breaks the 32-bit word structure, which can help prevent the spread of integral patterns.

Key Schedule. The key schedules of the different versions of Sparx have been designed using the following general guidelines.

First, we look at criteria related to the implementation. To limit code size, components from the round function of Sparx are re-used in the key-schedule itself. To accommodate cases where the memory requirements are particularly stringent, we allow an efficient on-the-fly computation of the key.

We also consider cryptographic criteria. For example, we need to ensure that the keys used within each chain of 3 or 4 arx-boxes are independent from one another. As we do not have enough entropy from the master key to generate truly independent round keys, we must also ensure that the round-keys are as different as possible from one another. This implies a fast mixing of the master key bits in the key schedule. Furthermore, in order to prevent slide attacks [23], we chose to have the round keys depend on the round index. Finally, since the subkeys are XOR-ed in the key state, we want to limit the presence of high probability differential pattern in the key update. Diffusion in the key state is thus provided by additions modulo \(2^{16}\) rather than exclusive-or. While there may be high probability patterns for additive differences, these would be of little use because the key is added by an XOR to the state.

As with most engineering tasks, some of these requirements are at odds against each other. For example, it is impossible to provide extremely fast diffusion while also being extremely lightweight. Our designs are the most satisfying compromises we could find.

4.4 Security Analysis

Single Trail Differential/Linear Attack. By design and thanks to the long trail argument, we know that there is no differential or linear trail covering 5 steps (or more) with a useful probability for any instance of Sparx. Therefore, the 8 steps used by Sparx-\(64/128\) and Sparx-\(128/128\) and the 10 used by Sparx-\(128/256\) are sufficient to ensure resilience against such attacks.

Attacks Exploiting a Slow Diffusion. We consider several attacks in this category, namely impossible and truncated differential attacks, meet-in-the middle attacks as well as integral attacks.

When we chose the linear layers, we ensured that they prevented division-property-based integral attacks, meaning that they provide good diffusion. Furthermore, the Feistel structure of the linear layer makes it easy to analyse and increases our confidence in our designs. In the case of 128-bit block sizes, the Feistel function \(\mathcal {L^{\prime }}\) has branching number 3 in the sense that if only one 32-bit branch is active then the two output branches are active. This prevents attacks trying to exploit patterns at the branch level. Finally, this Feistel function also breaks the 32-bit word structure through a 16-bit branch swap which frustrates the propagation of integral characteristics.

Meet-in-the-middle attacks are further hindered by the large number of key additions. This liberal use of the key material also makes it harder for an attacker to guess parts of it to add rounds at the top or at the bottom of, say, a differential characteristic.

Best Attacks. The best attacks we could find are integral attacks based on Todo’s division property. The attack against Sparx-\(64/128\) covers 15/24 rounds and recovers the key in time \(2^{101}\) using \(2^{37}\) chosen plaintexts and \(2^{64}\) blocks of memory. For 22-round Sparx-\(128/128\), we can recover the key in time \(2^{105}\) using \(2^{102}\) chosen plaintexts and \(2^{72}\) blocks of memory. Finally, we attack 24-round Sparx-\(128/256\) in time \(2^{233}\), using \(2^{104}\) chosen plaintexts and \(2^{202}\) blocks of memory. A description of these attacks as well as the description of some time/data tradeoffs are provided in the full version of this paper [20].

4.5 Software Implementation

Next we describe how Sparx can be efficiently implemented on three resource constrained microcontrollers widely used in the Internet of Things (IoT), namely the 8-bit Atmel ATmega128, the 16-bit TI MSP430, and the 32-bit ARM Cortex-M3. We support the described optimization strategies with performance figures extracted from assembly implementations of Sparx-\(64/128\) and Sparx-\(128/128\) using the FELICS open-source benchmarking framework [24]. We use the same tool to get the most suitable implementations of Sparx for the two IoT-specific usage scenarios described in [6]. The first scenario uses a block cipher to encrypt 128 bytes of data using CBC mode, while the second encrypts 128 bits of data using a cipher in CTR mode. The most suitable implementation for a given usage scenario is selected using the Figure of Merit (FOM) defined in [6]:

$$\begin{aligned} \mathrm {FOM}(i_1,i_2,i_3)= \frac{p_{i_1,AVR} + p_{i_2,MSP} + p_{i_3,ARM}}{3}, \end{aligned}$$

where the performance parameter \(p_{i,d}\) aggregates the code size, the RAM consumption, and the execution time for implementation i according to the requirements of the usage scenario. The smaller the FOM value of an implementation in a certain use case, the better (more suitable) is the implementation for that particular use case. Finally, we compare the results of our implementations with the results available on the tool’s website.Footnote 6

Table 4. Performance characteristics of the main components of Sparx

Implementation Aspects. In order to efficiently implement Sparx on a resource constrained embedded processor, it is important to have a good understanding of its instruction set architecture (ISA). The number of general-purpose registers determines whether the entire cipher’s state can be fitted into registers or whether a part of it has to be spilled to RAM. Memory operations are generally slower than register operations, consume more energy and increase the vulnerability of an implementation to side channel attacks [25]. Thus, the number of memory operations should be reduced as much as possible. Ideally the state should only be read from memory at the beginning of the cryptographic operation and written back at the end. Concerning the three targets we implemented Sparx for, they have 32 8-bit, 12 16-bit, and 13 32-bit general-purpose registers, which result in a total capacity of 256 bytes, 192 bytes, and 416 bytes for AVR, MSP, and ARM, respectively.

The Sparx family’s simple structure consists only of three components: the arx-box A and its inverse \(A^{-1}\), the linear layer \(\lambda _{2}\) or \(\lambda _{4}\) (depending on the version), and the key addition. The key addition (bitwise XOR) does not require additional registers and its execution time is proportional to the ratio between the operand width and the target device’s register width. The execution time in cycles and the number of registers required to perform A, \(A^{-1}\), \(\lambda _{2}\), and \(\lambda _{4}\) on each target device are given in Table 4.

The costly operation in terms of both execution time and number of required registers is the linear layer. The critical point is reached for the 128-bit linear layer \(\lambda _{4}\) on MSP, which requires 13 registers. Since this requirement is above the number of available registers, a part of the state has to be saved onto the stack. Consequently, the execution time increases by 5 cycles for each pushpop instruction pair.

A 2-step implementation uses a simplified linear layer without the most resource demanding part – the branch swaps. It processes the result of the left branch after the first step as the right branch of the second step and similarly the result of the right branch after the first step as the left branch of the second step. This technique reduces the number of required registers and improves the execution time at the cost of an increase in code size. The performance gain is a factor of 2 on AVR, 2.7 on MSP, and 1.3 on ARM.

Table 5. Different trade-offs between the execution time and code size for encryption of a block using Sparx-\(64/128\) and Sparx-\(128/128\). Minimal values are given in bold.

The linear transformations \(\mathcal {L}\) and \(\mathcal {L^{\prime }}\) exhibit interesting implementation properties. For each platform there is a different optimal way to perform them. The optimal way to implement the linear layers on MSP is using the representations from Figs. 5c and 7b. On ARM the optimal implementation performs the rotations directly on 32-bit values. The function \(\mathcal {L}\) can be executed on AVR using 12 XOR instructions and no additional registers. On the other hand, the optimal implementation of \(\mathcal {L^{\prime }}\) on AVR requires 2 additional registers and takes 24 cycles.Footnote 7

The linear layer performed after the last step of Sparx can be dropped without affecting the security of the cipher, but it turns out that it results in poorer overall performances. The only case when this strategy helps is when top execution time is the main and only concern of an implementation. Thus we preferred to keep the symmetry of the step function and the overall balanced performance figures.

The salient implementation-related feature of Sparx family of ciphers is given by the simple and flexible structure of the step function depicted in Fig. 4, which can be implemented using different optimization strategies. Depending on specific constraints, such as code size, speed, or energy requirements to name a few, the rounds inside the step function can be rolled or unrolled; one or two step functions can be computed at once. The main possible trade-offs between the execution time and code size are explored in Table 5.

Except for the 1-step implementation of Sparx-\(128/128\) on MSP, which needs RAM memory to save the cipher’s state, all other RAM requirements are determined only by the process of saving the context onto the stack at the begging of the measured function. Thus, the RAM consumption of a pure assembly implementation would be zero, except for the 1-step rolled and unrolled implementations of Sparx-\(128/128\) on MSP.

Due to the 16-bit nature of the cipher, performing A and \(A^{-1}\) on a 32-bit platform requires a little bit more execution time and more auxiliary registers than performing the same operations on a 16-bit platform. The process of packing and unpacking a state register to extract and store back the two 16-bit branches of A or \(A^{-1}\) adds a performance penalty. The cost is amplified by the fact that the flexible second operand can not be used with a constant to extract the least or most significant 16 bits of a 32-bit register. Thus an additional masking register is required.

The simple key schedules of Sparx-\(64/128\) and Sparx-\(128/128\) can be implemented in different ways. The most efficient implementation turns out to be the one using the 1-iteration rolled strategy. Another interesting approach is the 4-iterations unrolled strategy, which has the benefit that the final permutation is achieved for free by changing the order in which the registers are stored in the round keys. This strategy increases the code size by up to a factor of 4, while the execution time is on average 25 % better.

Although we do not provide performance figures for Sparx-\(128/256\), we emphasize that the only differences with respect to implementation aspects between Sparx-\(128/256\) and Sparx-\(128/128\) are the key schedules and the different number of steps.

Evaluation and Comparison. We evaluate the performance of our implementations of Sparx using FELICS in the two aforementioned usage scenarios. The key performance figures are given in the full version of this paper [20]. The balanced results are achieved using the 1-step implementations of Sparx-\(64/128\) and Sparx-\(128/128\).

Table 6. Top 10 best implementations in Scenario 1 (encryption key schedule + encryption and decryption of 128 bytes of data using CBC mode) ranked by the Figure of Merit (FOM) defined in FELICS. The results for all ciphers are the current ones from the Triathlon Competition at the moment of submission. The smaller the FOM, the better the implementation.

Then we compare the performance of Sparx with the current results available on the Triathlon Competition at the time of submission.Footnote 8 As can be seen in Table 6 the two instances of Sparx perform very well across all platforms and rank very high in the FOM-based ranking. The forerunners are the NSA designs Simon and Speck, Chaskey, RECTANGLE and LEA, but, apart from RECTANGLE, none of them provides provable bounds against differential and linear cryptanalysis.

Besides the overall good performance figures in the two usage scenarios, the following results are worth mentioning:

  • the execution time of Sparx-\(64/128\) on MSP is in the top 3 of the fastest ciphers in both scenarios thanks to its 16-bit oriented operations;

  • the code size of the 1-step rolled implementations of Sparx-\(64/128\) and Sparx-\(128/128\) on MSP is in the top 5 in both scenarios as well as in the small code size and RAM table for scenario 2;

  • the 1-step rolled implementation of Sparx-\(64/128\) breaks the previous minimum RAM consumption record on AVR in scenario 2;

  • the execution time of the 2-steps implementation of Sparx-\(64/128\) in scenario 2 is in the top 3 on MSP, in the top 5 on AVR, and in the top 7 on ARM; it also breaks the previous minimum RAM consumption records on AVR and MSP.

Given its simple and flexible structure as well as its very good overall ranking in the Triathlon Competition of lightweight block ciphers, the Sparx family of lightweight ciphers is suitable for applications on a wide range of resource constrained devices. The absence of look-up tables reduces the memory requirements and provides, according to [25], some intrinsic resistance against power analysis attacks.

5 Replacing Rotations with Linear Layers: The LAX Construction

In this section we outline an alternative strategy for designing an ARX cipher with provable bounds against differential and linear cryptanalysis. It is completely independent from the Long Trail Strategy outlined in the previous sections and uses the differential properties of modular addition to derive proofs of security.

5.1 Motivation

In his Master thesis [13] Wallén posed the challenge to design a cipher that uses only addition modulo-2 and \(\mathrm {GF}(2)\)-affine functions, and that is provably resistant against differential and linear cryptanalysis [13, Sect. 5]. In this section we partially solve this challenge by proposing a construction with provable bounds against single-trail differential cryptanalysis (DC).

5.2 Theoretical Background

Definition 6

( \(\mathrm {xdp}^{+}\) ). The XOR differential probability (DP) of addition modulo \(2^{n}\) is defined as:

$$\begin{aligned} {\mathrm {xdp}^{+}}({\alpha }, {\beta } \rightarrow {\gamma }) = 2^{-2n} \cdot {\#\{(x,y) : ((x \oplus {\alpha }) + (y \oplus {\beta })) \oplus (x + y) = {\gamma }\}}, \end{aligned}$$

where \(\alpha \), \(\beta \) and \(\gamma \) are n-bit XOR differences and x and y are n-bit values.

The XOR linear correlation of addition modulo \(2^{n}\) (\(\mathrm {xlc}^{+}\)) is defined in a similar way. Efficient algorithms for the computation of \(\mathrm {xdp}^{+}\) and \(\mathrm {xlc}^{+}\) have been proposed resp. in [2629]. These results also reveal the following property. The magnitude of both \(\mathrm {xdp}^{+}\) and \(|\mathrm {xlc}^{+}|\) is inversely proportional to the number of bit positions at which the input/output differences (resp. masks) differ. For \(\mathrm {xdp}^{+}\), this fact is formally stated in the form of the following proposition.

Proposition 1

(Bound on \(\mathrm {xdp}^{+}\) ). The differential probability \(\mathrm {xdp}^{+}\) is upper-bounded by \(2^{-k}\), where k is the number of bit positions, excluding the MSB, at which the bits of the differences are not equal:

$$\begin{aligned} \mathrm {xdp}^{+}(\alpha ,\beta \rightarrow \gamma ) \le 2^{-k}: k = \#\{i: \lnot (\alpha [i] = \beta [i] = \gamma [i]), 0\le i \le w-2\} \end{aligned}$$

Proof. Follows from [26, Alg. 2, Sect. 4].

A similar proposition also holds for \(|\mathrm {xlc}^{+}|\) (see e.g. [10]). Proposition 1 provides the basis of the design strategy described in the following section.

5.3 The LAX Construction

LAX is a block cipher construction with 2n-bit block and n-bit words. We investigate three instances of LAX designated by the block size: LAX-16, LAX-32 and LAX-64. A brief description of the round function of LAX-2n, shown in Fig. 10 (left), is given below.

Fig. 10.
figure 10

Left: the round function of LAX; Right: three round differential of LAX.

Let L be an \(n \times n\) binary matrix that is (a) invertible and (b) has branch number \(d > 2\). With \(\ell (x)\) is denoted the multiplication of the n-bit vector x by the matrix L: \(~\ell (x) = L x\). Note that due to condition (b) it follows that \(\forall x \ne 0: h(x) + h(\ell (x)) \ge d\), where h(x) is the Hamming weight of x.

The round function \(\mathcal {A}(\cdot )\) of LAX-2n maps a pair of n-bit words \((x_{\mathrm {L}}, x_{\mathrm {R}})\) to a pair of n-bit words \((y_{\mathrm {L}}, y_{\mathrm {R}})\) as follows (see Fig. 10 (left)):

$$\begin{aligned} (y_{\mathrm {L}}, y_{\mathrm {R}}) = \mathcal {A}(x_{\mathrm {L}}, x_{\mathrm {R}}) = ( \ell (x_{\mathrm {R}}),~\ell (x_{\mathrm {L}} \boxplus x_{\mathrm {R}})). \end{aligned}$$

The matrix L is chosen as the non-identity part of the generator matrix G of a systematic [2nnd] linear code over \(\mathrm {GF}(2)\) such that \(G = [I~L]\). More specifically, the matrices L for LAX-16, LAX-32 and LAX-64 are derived from the following codes respectively: [16, 8, 5], [32, 16, 8] and [64, 32, 10]. Note that the matrix of LAX-32 is the same as the one used in block cipher ARIA [30].

5.4 Bounds on the Differential Probability of LAX

Lemma 1

For all differences \(\alpha \ne 0\), the differential \((\alpha ,\alpha \rightarrow \alpha )\) is impossible.

Proof

Let \(\mathrm {xdp}^{+}(\alpha ,\beta \rightarrow \gamma ) \ne 0\) for some differences \(\alpha \), \(\beta \) and \(\gamma \). The statement of the lemma follows from the following two properties of \(\mathrm {xdp}^{+}\) [26]. First, it must hold that \(\alpha [0]\oplus \beta [0]\oplus \gamma [0]=0\). Second, if \(\alpha [i]=\beta [i]=\gamma [i]\) for some \(0 \le i \le n-2\), then it must hold that \(\alpha [i+1]\oplus \beta [i+1]\oplus \gamma [i+1]=\alpha [i]\). Since we want that \(\alpha = \beta = \gamma \), from the first property it follows that \(\alpha [0] = \beta [0] = \gamma [0]=0\). Given that, due to the second property it follows that \(\alpha [i] = \beta [i] = \gamma [i] = 0\), \(\forall i \ge 1\). Therefore the only value of \(\alpha \) for which \(\mathrm {xdp}^{+}(\alpha ,\beta \rightarrow \gamma ) \ne 0\) and \(\alpha = \beta = \gamma \) is \(\alpha = 0\).    \(\square \)

Theorem 2

(Differential bound on 3 rounds of LAX-2n ). The maximum differential probability of any trail on 3 rounds of LAX-2n is \(2^{-(d-2)}\), where d is the branch number of the matrix L.

Proof

Let \((\alpha _{i-1}, \beta _{i-1}, \gamma _{i-1})\), \((\alpha _{i}, \beta _{i}, \gamma _{i})\) and \((\alpha _{i+1}, \beta _{i+1}, \gamma _{i+1})\) be the input/output differences of the addition operations in three consecutive rounds of LAX-2n and let \(p_{k} = \mathrm {xdp}^{+}(\alpha _{k}, \beta _{k}\rightarrow \gamma _{k})\) for \(k \in \{i-1, i, i+1\}\) (see Fig. 10 (right)). We have to show that \(p_{i-1} p_{i} p_{i+1} \le 2^{-(d-2)}\) or, equivalently, that \(\log _2 p_{i-1} + \log _2 p_{i} + \log _2 p_{i+1} \le -(d-2)\). Denote with h(x) the Hamming weight of the word x and with \(h^{*}(x)\) the Hamming weight of x, excluding the MSB. Note that \(h^{*}(x) \le h(x) - 1\). We consider two cases:

Case 1: \(\beta _{i-1} \ne \gamma _{i-1}\) . By Proposition 1 we have that \(\log _2 p_{i-1} \le -h^{*}(\beta _{i-1}\oplus \gamma _{i-1})\) and \(\log _2 p_{i} \le -h^{*}(\alpha _{i}\oplus \beta _{i})\). Since \(\beta _i = \ell (\gamma _{i-1})\) and \(\alpha _i = \ell (\beta _{i-1})\) (see Fig. 10 (right)) and using the linearity of \(\ell (\cdot )\) we have that \(-h^{*}(\alpha _{i}\oplus \beta _{i}) = -h^{*}(\ell (\beta _{i-1}\oplus \gamma _{i-1}))\). As \(\beta _{i-1} \ne \gamma _{i-1}\) it follows that \(h^{*}(\beta _{i-1}\oplus \gamma _{i-1}) \ne 0\) and \(h^{*}(\ell (\beta _{i-1}\oplus \gamma _{i-1})) \ne 0\). Thus we derive:

$$\begin{aligned} \log _2 p_{i-1} + \log _2 p_{i} \le -h^{*}(\beta _{i-1}\oplus \gamma _{i-1}) -h^{*}(\ell (\beta _{i-1}\oplus \gamma _{i-1})). \end{aligned}$$

From the properties of L it follows that \(-h(\beta _{i-1}\oplus \gamma _{i-1}) -h(\ell (\beta _{i-1}\oplus \gamma _{i-1})) \le -d\) and so \(-h^{*}(\beta _{i-1}\oplus \gamma _{i-1}) -h^{*}(\ell (\beta _{i-1}\oplus \gamma _{i-1})) \le -(d-2)\). Therefore:

$$\begin{aligned} \log _2 p_{i-1} + \log _2 p_{i} \le -(d-2). \end{aligned}$$

Case 2: \(\beta _{i-1} = \gamma _{i-1} \ne 0\) . In this case \(\alpha _i = \beta _i = \ell (\beta _{i-1}) = \ell (\gamma _{i-1})\). Due to Lemma 1 it follows that \(\gamma _{i} \ne \beta _{i}\). Therefore we can apply the argument from Case 1 on rounds i and \(i+1\) to derive the statement of the theorem in this case.    \(\square \)

5.5 Experimental Results

We have implemented the search algorithm proposed in [10] in order to find the probabilities of the best differential trails in LAX-16 and LAX-32. In Table 7, we compare the results to the theoretical bounds computed using Theorem 2.

Table 7. Best differential probabilities and best absolute linear correlations (\(\log _2\) scale) for up to 12 rounds of LAX.

Clearly the bound from Theorem 2 does not hold for the linear case. The problem is the “three-forked branch” in the LAX round function that acts as an XOR when the inputs are linear masks rather than differences. Thus, LAX only provides differential bounds and the full solution to the Wallén challenge still remains an open problem.

6 Conclusion

In this paper we presented, for the first time, a general strategy for designing ARX primitives with provable bounds against differential (DC) and linear cryptanalysis (LC) – a long standing open problem in the area of ARX design. The new strategy, called the Long Trail Strategy (LTS) advocates the use of large and computationally expensive S-boxes in combination with very light linear layers (the so-called Long Trail Argument). This makes the LTS to be the exact opposite of the Wide Trail Strategy (WTS) on which the AES (and many other SPN ciphers) are based. Moreover, the proposed strategy is not limited to ARX designs and can easily be applied also to S-box based ciphers.

To illustrate the effectiveness of the LTS we have proposed a new family of lightweight block ciphers, called SPARX, designed using the new approach. The family has three instances depending on the block and key sizes: Sparx-\(64/128\), Sparx-\(128/128\) and Sparx-\(128/256\). With the help of the Long Trail Argument we prove resistance against single-trail DC and LC for each of the three instances of Sparx. In addition, we analyze the new constructions against a wide range of attacks such as impossible and truncated differentials, meet-in-the-middle and integral attacks. Our analysis did not find an attack covering 5 or more rounds of any of the three instances. The latter ensures a security margin of about 37 % of Sparx.

Beside (provable) security the members of the Sparx family are also very efficient. We have implemented them in software on three resource constrained microcontrollers widely used in the Internet of Things (IoT), namely the 8-bit Atmel ATmega128, the 16-bit TI MSP430, and the 32-bit ARM Cortex-M3. According to the FELICS open-source benchmarking framework our implementations of Sparx-\(64/128\) and Sparx-\(128/128\) rank respectively 6 and 7 in the list of top 10 most software efficient lightweight ciphers. In addition, the execution time of Sparx-\(64/128\) on MSP is in the top 3 of this list. To the best of our knowledge, this paper is the first to propose a practical ARX design that has both arguments for provable security and competitive performance.

A secondary contribution of the paper is the proposal of an alternative strategy for ARX design with provable bounds against differential cryptanalysis. It is independent of the LTS and uses the differential properties of modular addition to derive proofs of security. As an illustration of this approach, the LAX family of constructions is described. The provable security of LAX against linear cryptanalysis is left as an open problem.