1 Introduction

Symmetric cryptographic primitives play major roles in virtually any cryptographic scheme and security-related application. The main reason for this massive deployment of symmetric primitives, i.e., (tweakable) block ciphers, stream ciphers, hash functions, or cryptographic permutations, is their significant performance advantage. Symmetric primitives usually outperform other cryptographic schemes by up to several orders of magnitude.

One class of design of symmetric primitives that is inherently motivated by (software) efficiency is the ARX-based design. ARX is short for addition (modulo a power of two), word-wise rotation and XOR. Indeed, ciphers following this framework are composed of those operations and avoid the computation of smaller S-boxes through look-up tables. As CPUs may have these operations implemented on the hardware level, particularly an addition unit and a barrel shifter, executing them on such CPUs based on a suitable register size is inherently fast.

The block cipher FEAL [3] was probably the first ARX cipher presented in the literature, and by now, several state-of-the-art ciphers follow this approach. One of the most important (families of) ARX ciphers is certainly the one formed by Salsa20, ChaCha and their variants [4, 5]. Designed by Bernstein, these ciphers are now the default replacement for RC4 in TLS due to their high efficiency and the simplicity of their implementations and are thus some of the most widely used ciphers in practice. Besides being used in TLS, ChaCha is also deployed in several other products, and in particular, it is used as a building block in the popular hash functions Blake and Blake2 [6, 7].

The ARX-based design approach is not restricted to stream ciphers, as it can also be used in the design of efficient block ciphers (e.g., Sparx [8]), cryptographic permutations (e.g., Sparkle [9]), and message authentication codes (MACs). For the latter, Chaskey [10] is among the most prominent examples.

Besides the advantage of having efficient implementations, there are also good reasons for ARX-based designs regarding security. The algebraic degree of ARX ciphers is usually high after only a very few rounds, as the carry bit within one modular addition already has an almost maximal degree. Structural attacks like integral [11] or invariant attacks [12] are less of a concern and rotational cryptanalysis [13], originally invented for ARX ciphers, can be efficiently prevented in most cases by the XOR of constants.

When it comes to differential [14] and linear attacks [15], ARX-based designs often show a peculiar behavior. For a small number of rounds, i.e., only very few modular additions, the differential probabilities (resp., absolute linear correlations) are very high. In particular, for a single modular addition, those are equal to 1 due to the linear behavior of the least significant and, in the case of differentials, most significant bits. Moreover, for a single modular addition, the differential probabilities and linear correlations are well understood, and we have at hand nice and efficient formulas for their computation [16, 17]. In the case of (dependent) chains of modular additions, bitwise rotations, and XORs, the situation is different, and experimentally checking the probabilities is often the best way to evaluate the behavior.

While a few rounds are thus very weak, for a well-crafted ARX scheme, the probabilities of differentials and the absolute correlations of linear approximations decrease very quickly as the number of rounds increases. Indeed, this property led to the long-trail strategy for designing ARX-based ciphers [8].

Now, for symmetric primitives, the existence of strong differentials and linear approximations for a few rounds with a rapid decrease of probabilities (resp. absolute correlations) is exactly the situation in which considering differential-linear attacks [18] is promising. In a nutshell, differential-linear attacks combine a differential with probability p for the first r rounds of the cipher and a linear approximation with correlation q for the next t rounds into a linear approximation for \(r+t\) rounds with correlation \(pq^2\) that can be turned into an attack with data complexity of roughly \(p^{-2}q^{-4}\).

Indeed, it is not surprising that the best attacks against many ARX constructions, including ChaCha and Chaskey, are differential-linear [19,20,21]. Our work builds upon those ideas and improves differential-linear attacks on ARX ciphers along several dimensions.

1.1 Our Contribution

Table 1: (Partial) key-recovery attacks on Chaskey and ChaCha

In this paper, we present the best known attacks on Chaskey and ChaCha. Our improvements over prior work are based on improvements in the differential part, the linear part, the LLR statistic, and the key-recovery part of differential-linear attacks.

Differential part

For the differential part, our observation is both simple and effective. Recall that, for a differential-linear attack, one needs many (roughly \(q^{-4}\)) pairs fulfilling the difference in the first part of the cipher, that is, many right pairs for the differential. Now, imagine that an attacker could construct many right pairs with probability (close to) one, given only a single right pair. This would immediately reduce the data complexity of the attack by a factor of \(p^{-1}\). As we will see, this situation is rather likely to occur for a few rounds of many ARX ciphers, particularly for ChaCha and Chaskey. The details of those improvements are presented in Sect. 5.

Linear part

For the linear part, our first observation is that it is often beneficial to not restrict to a single mask but rather consider multiple linear approximations. As we detail in Sect. 6, this nicely combines with an improved version of the partitioning technique for ARX ciphers [19, 29], which splits the space of ciphertexts into subsets to increase the correlation of linear approximations. The starting point of our attacks is a new way of partitioning the ciphertexts, summarized in Sect. 3. Note that, although we use multiple linear masks in the attack, because of partitioning the ciphertexts, we basically use only a single linear mask for each ciphertext. This way, we avoid possible dependencies that would be hard to analyze otherwise.

LLR statistic

Our advanced partitioning technique for the linear part exploits linear approximations with different correlations for every ciphertext: one ciphertext pair might yield a high absolute correlation, while another yields a lower one, and it is not appropriate to treat such pairs in the same manner. We use the log-likelihood ratio (LLR) statistic to solve this problem. According to the Neyman–Pearson lemma [30], the LLR test is the most powerful statistical test and, as such, has been used as a cryptanalytic tool (see, e.g., [31, 32]). In our case, the use of the LLR statistic is beneficial because we can exploit all partitions which were discarded by the original partitioning technique in [1]. The details of the LLR-based technique are presented in Sect. 7.

Key recovery

Related to the improvements in the linear part and the LLR statistic, we present a significant speed-up in the key-recovery part. Here, the main observation is that after considering multiple masks and the partitioning technique, several key bits appear only linearly in the approximations. In particular, their value does not affect the absolute value of the correlation but only its sign. Instead of guessing those key bits individually as done in previous attacks, this observation allows us to recover them by applying the fast Walsh–Hadamard transform (FWHT). Similar ideas have already been described in [33]. Details of this approach are given in Sect. 8.

Putting these improvements into one framework and applying this framework to round-reduced variants of ChaCha and Chaskey results in significantly reduced attack complexities. Our attacks and their corresponding complexities are summarized in Table 1, together with a comparison to the best attacks published so far. It is important to note that, as these attacks are on round-reduced variants of the ciphers only, they do not pose any threat to the full-round versions of ChaCha or Chaskey. Rather, these attacks strengthen our trust in the design. We expect that our improvements have applications to other ciphers, especially ARX-based designs.

2 Preliminaries

By \(\oplus \), we denote the XOR operation, i.e., addition in \({\mathbb {F}}_2^n\), and by \(+\), we either denote the addition in \({\mathbb {Z}}\), or the modular addition mod \(2^n\) (we identify elements of \({\mathbb {F}}_2^n\) with elements of \({\mathbb {Z}}\) by regarding them as binary representations), depending on the context. For \(x \in {\mathbb {F}}_2^n\), we denote by \({\bar{x}}\) the bitwise complement of x. Note that we have \(-x={\bar{x}}+1\). Given a non-empty set \({\mathcal {S}} \subseteq {\mathbb {F}}_2^n\) and a Boolean function \(f :{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2\), we define

$$\begin{aligned} \mathbf {Cor}_{x \in {\mathcal {S}}}\left[ f(x)\right] :=\frac{1}{|{\mathcal {S}}|}\sum _{x \in {\mathcal {S}}}(-1)^{f(x)}. \end{aligned}$$

If for (another) Boolean function \(g :{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2\), we have \(\mathbf {Cor}_{x \in {\mathcal {S}}}\left[ g(x) \oplus f(x) \right] = c\), we say that \(g(x) \approx f(x)\) holds with correlation c for \(x \in {\mathcal {S}}\).

We denote the ith unit vector of a binary vector space by [i] and the sum of unit vectors \([i_1] \oplus [i_2] \oplus \dots \oplus [i_t]\) by \([i_1,i_2,\dots ,i_t]\). Given a vector \(x \in {\mathbb {F}}_2^n\), x[i] denotes the ith bit of x, and \(x[i_1,i_2,\dots ,i_t]\) denotes \(\bigoplus _{j=1}^t x[i_j]\). For \(\gamma ,x \in {\mathbb {F}}_2^n\), we define the inner product by \(\langle \gamma , x \rangle = \bigoplus _{i=0}^{n-1} \gamma [i]x[i]\). In particular, \(x[i_1,i_2,\dots ,i_t]=\langle x, [i_1,i_2,\dots ,i_t] \rangle \).
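To make this notation concrete, the following small sketch (plain Python; all helper names are ours and purely illustrative) computes the empirical correlation, the bit selection \(x[i_1,\dots ,i_t]\), and the inner product defined above.

# Illustrative helpers for the notation above (names are ours, not from the paper).

def cor(f, S):
    """Cor_{x in S}[f(x)] = (1/|S|) * sum_{x in S} (-1)^f(x)."""
    return sum((-1) ** f(x) for x in S) / len(S)

def bit(x, i):
    """x[i]: the i-th bit of the integer x (bit 0 is the least significant bit)."""
    return (x >> i) & 1

def bits(x, positions):
    """x[i_1,...,i_t]: the XOR of the selected bits of x."""
    out = 0
    for i in positions:
        out ^= bit(x, i)
    return out

def inner(gamma, x):
    """<gamma, x>: the inner product of two n-bit vectors given as integers."""
    return bin(gamma & x).count("1") & 1

# Example: over all 5-bit inputs, f(x) = x[3] is balanced, so its correlation is 0.
S = range(1 << 5)
assert cor(lambda x: bit(x, 3), S) == 0.0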

In the remainder of this paper, we assume that, when a randomly chosen sampling set \({\mathcal {S}} \subseteq {\mathbb {F}}_2^n\) is a (sufficiently large) subset of \({\mathbb {F}}_2^n\), \(\mathbf {Cor}_{x \in {\mathcal {S}}}\left[ f(x) \right] \) is a good approximation for \(\mathbf {Cor}_{x \in {\mathbb {F}}_2^n}\left[ f(x) \right] \). In other words, we assume that the empirical correlations obtained by sampling for a sufficiently large number of messages closely match the actual correlations.

We denote by \({\mathcal {N}}(\mu ,\sigma ^2)\) the normal distribution with mean \(\mu \) and variance \(\sigma ^2\). By \(\Phi \), we denote the cumulative distribution function of the standard normal distribution \({\mathcal {N}}(0,1)\). Thus if \(X \sim {\mathcal {N}}(\mu ,\sigma ^2)\), it holds that

$$\begin{aligned} \mathbf {Pr}\left[ X \le \Theta \right] =\Phi \left( \frac{\Theta - \mu }{\sigma } \right) . \end{aligned}$$

2.1 Differential-Linear Attacks

Fig. 1: The structure of a classical differential-linear distinguisher

Fig. 2: A differential-linear distinguisher with experimental evaluation of the middle correlation r

We first recall the basic variant of differential-linear cryptanalysis as introduced by Langford and Hellman [18], and the enhancement by Biham, Dunkelman, and Keller [34]. Figure 1 shows the overview of the distinguisher. The entire cipher E is divided into two subciphers \(E_1\) and \(E_2\), such that \(E = E_2 \circ E_1\), and a differential distinguisher and a linear distinguisher are applied to the first and second part, respectively.

In particular, assume that the differential \(\Delta _{\mathrm {in}}\overset{E_1}{\rightarrow } \Delta _m\) holds with probability

$$\begin{aligned} \mathbf {Pr}_{x \in {\mathbb {F}}_2^n}\left[ E_1(x) \oplus E_1(x\oplus \Delta _{\mathrm {in}}) = \Delta _m\right] = p. \end{aligned}$$

Let us further assume that the linear approximation \(\Gamma _m \overset{E_2}{\rightarrow } \Gamma _{\mathrm {out}}\) has correlation \(\mathbf {Cor}_{x \in {\mathbb {F}}_2^n}\left[ \langle \Gamma _m, x\rangle \oplus \langle \Gamma _{\mathrm {out}}, E_2(x)\rangle \right] = q\). The differential-linear distinguisher exploits the fact that, under the assumption that \(E_1(x)\) and \(E_2(x)\) are independent random variables, we have

$$\begin{aligned} \mathbf {Cor}_{x\in {\mathbb {F}}_2^n}\left[ \langle \Gamma _{\mathrm {out}}, E(x) \rangle \oplus \langle \Gamma _{\mathrm {out}}, E(x \oplus \Delta _{\mathrm {in}}) \rangle \right] = pq^2. \end{aligned}$$
(1)

Therefore, by preparing \(\epsilon p^{-2}q^{-4}\) pairs of chosen plaintexts \((x, {\tilde{x}})\), for \({\tilde{x}} = x \oplus \Delta _{\mathrm {in}}\), where \(\epsilon \in {\mathbb {N}}\) is a small constant, one can distinguish the cipher from a pseudorandom permutation.

In practice, there might be a problem with the assumption that \(E_1(x)\) and \(E_2(x)\) are independent, resulting in wrong estimates for the correlation. To provide a better justification of this independence assumption (and in order to improve attack complexities), adding a middle part is a simple solution and is usually done in recent attacks (including ours). Here, the cipher E is divided into three subciphers \(E_1, E_m\) and \(E_2\) such that \(E = E_2 \circ E_m \circ E_1\) and the middle part \(E_m\) is experimentally evaluated. In particular, let

$$\begin{aligned} r = \mathbf {Cor}_{x \in {\mathcal {S}}}\left[ \langle \Gamma _m,E_m(x)\rangle \oplus \langle \Gamma _m, E_m(x \oplus \Delta _m) \rangle \right] , \end{aligned}$$

where \({\mathcal {S}}\) denotes the set of samples over which the correlation is computed. Then, the total correlation in Eq. 1 can be estimated as \(prq^2\). Recently, as a theoretical support for this approach, the Differential-Linear Connectivity Table (DLCT) [35] has been introduced. The overall attack framework is depicted in Fig. 2 and we will use this description in the remainder of the paper.
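To illustrate how the middle correlation r is obtained in practice, the sketch below estimates it by sampling; the toy function toy_E_m, the parameter names, and the chosen masks are ours and only stand in for the real \(E_m\), \(\Delta _m\) and \(\Gamma _m\).

import random

def estimate_r(E_m, delta_m, gamma_m_bits, n_bits, num_samples):
    """Empirical estimate of r = Cor_x[<Gamma_m, E_m(x)> xor <Gamma_m, E_m(x xor Delta_m)>]."""
    parity = lambda x: sum((x >> i) & 1 for i in gamma_m_bits) & 1
    acc = 0
    for _ in range(num_samples):
        x = random.getrandbits(n_bits)
        acc += (-1) ** (parity(E_m(x)) ^ parity(E_m(x ^ delta_m)))
    return acc / num_samples

# Toy middle part: one 4-bit modular addition of the two halves of an 8-bit state.
def toy_E_m(x):
    a, b = x & 0xF, x >> 4
    return ((a + b) & 0xF) | (b << 4)

# The lsb of a modular addition is linear, so this particular toy choice gives r = -1 exactly.
print(estimate_r(toy_E_m, delta_m=0x01, gamma_m_bits=[0], n_bits=8, num_samples=1 << 14))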

Finally, in order to better understand Eq. (1), we denote the differential-linear correlation (known as the auto-correlation in the theory of Boolean functions) on E by

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_E( \Delta _{\mathrm {in}},\alpha ,\alpha ') :=2^{-n} \sum _{x \in {\mathbb {F}}_2^n} (-1)^{\langle \alpha , E(x) \rangle \oplus \langle \alpha ', E(x \oplus \Delta _{\mathrm {in}})\rangle }, \end{aligned}$$

where Eq. (1) is the special case with \(\alpha = \alpha ' = \Gamma _{\mathrm {out}}\).

2.2 Partitioning Technique for ARX-based Designs

Partitioning allows increasing the correlation of the differential-linear distinguisher by deriving linear equations which hold conditioned on ciphertext and key bits. We recall the partitioning technique as used in [19]. Let \(a,b \in {\mathbb {F}}_2^m\) and let \(z = a + b\). For \(i=0\) (the lsb), the modular addition is linear in bit i, i.e., \(z[0] = a[0] \oplus b[0]\). Of course, for \(i>0\), the ith output bit of the modular addition is not a linear function of the inputs. However, by restricting (a, b) to a specific subset, we might obtain other linear relations. In previous work, the following formula for z[i] was derived.

Lemma 1

[19] Let \(a,b \in {\mathbb {F}}_2^m\) and \(z = a+b\). For \(i \ge 2\), we have

$$\begin{aligned} z[i] = {\left\{ \begin{array}{ll} a[i] \oplus b[i] \oplus a[i-1] &{} \mathrm{if}~ a[i-1] = b[i-1]\\ a[i] \oplus b[i] \oplus a[i-2] &{} \mathrm{if}~ a[i-1] \ne b[i-1] ~\mathrm{and}~ a[i-2] = b[i-2]. \end{array}\right. } \end{aligned}$$
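For small word sizes, Lemma 1 can be checked exhaustively; the following verification sketch (our own code, not from the paper) confirms both cases for \(m = 8\).

# Exhaustive check of Lemma 1 for m = 8 (verification sketch).
m = 8
bit = lambda x, j: (x >> j) & 1

for a in range(1 << m):
    for b in range(1 << m):
        z = (a + b) & ((1 << m) - 1)
        for i in range(2, m):
            if bit(a, i - 1) == bit(b, i - 1):
                assert bit(z, i) == bit(a, i) ^ bit(b, i) ^ bit(a, i - 1)
            elif bit(a, i - 2) == bit(b, i - 2):
                assert bit(z, i) == bit(a, i) ^ bit(b, i) ^ bit(a, i - 2)
print("Lemma 1 holds for all 8-bit pairs (a, b)")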

3 New Partitioning Technique

Before introducing our new attack framework for differential-linear attacks, we first introduce some new partitioning techniques.

The original idea of the partitioning technique [29] is to divide all the data into several partitions and to use only those partitions that decrease the data complexity. The generalized partitioning technique in [19] has the same feature, i.e., when a single modular addition is analyzed, all the data are divided into four partitions and one out of those four is discarded.

Our new partitioning techniques are twofold. First, we introduce linear masks for partitions that were originally discarded; our new FWHT-based key recovery using the LLR statistic allows us to use such partitions efficiently. Second, we additionally introduce a partitioning technique to compute \(z[i] \oplus z[i-1]\) at the same key-guessing cost as the evaluation of z[i]. ARX ciphers typically admit multiple linear trails with high correlation, and this new partitioning is useful when we dynamically change the available linear trails for each partition.

Fig. 3: Two examples of the partitioning technique. In the first case (left picture), we are interested in an approximation for \(z_0[i]\). In the second case (right picture), we are interested in an approximation for \(z_0[i] \oplus z_0[i-1]\). The corresponding partitions and approximations are given in Lemmas 2 and 3, respectively

Let us consider two m-bit words \(z_0\) and \(z_1\) and a modular addition operation

$$\begin{aligned} {\mathbb {F}}_2^{2m} \rightarrow {\mathbb {F}}_2^{2m}, \quad (z_1,z_0) \mapsto (y_1,y_0) = (z_1, z_0 + z_1), \end{aligned}$$

as depicted in Fig. 3. In the attacks we present later, we are interested in the value \(z_0[i] = (y_0 - y_1)[i] = (y_0 + {{\bar{y}}}_1 + 1)[i]\). Notice that for \(i \le 2\) there exist trivial relations for \(z_0[i]\). The following lemma deals with the case \(i \ge 3\), which is relevant for our applications.

Lemma 2

Let \(s = y_0 \oplus {{\bar{y}}}_1\) and let \(i \ge 3\). Let \({\mathcal {S}}_{b_0b_1} :=\{ (y_1,y_0) \in {\mathbb {F}}_2^{2m} \mid s[i-1] = b_0 \text { and } s[i-2] = b_1\}\). We have

$$\begin{aligned} z_0[i] \approx {\left\{ \begin{array}{ll} y_0[i] \oplus y_1[i] \oplus y_0[i-1], ~\mathrm{with~cor.}~{-1}, &{} \mathrm{if}~ (y_1,y_0) \in {\mathcal {S}}_{\mathtt {0}*}, \\ y_0[i] \oplus y_1[i] \oplus y_0[i-2], ~\mathrm{with~cor.}~{-1}, &{} \mathrm{if}~ (y_1,y_0) \in {\mathcal {S}}_{\mathtt {10}}, \\ y_0[i] \oplus y_1[i] \oplus y_0[i-3], ~\mathrm{with~cor.}~{-2^{-1}}, &{} \mathrm{if}~ (y_1,y_0) \in {\mathcal {S}}_{\mathtt {11}}, \end{array}\right. } \end{aligned}$$
(2)

where \({\mathcal {S}}_{\mathtt {0}*} = {\mathcal {S}}_{\mathtt {00}} \cup {\mathcal {S}}_{\mathtt {01}}\).

Proof

Figure 4 represents the computation of \(z_0[i]\), where \(z_0 = y_0 - y_1 = y_0 + {{\bar{y}}}_1 + 1\). In fact, if c[i] denotes the carry occurring at bit position i and we set \(c[-1] :=1\), we have that \(z_0[i] = y_0[i] \oplus y_1[i] \oplus 1 \oplus c[i-1]\) for all \(i \ge 0\).

Let us first assume that \((y_1,y_0) \in {\mathcal {S}}_{\mathtt {0}*}\), i.e., \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 0\). Then, \(c[i-1] = y_0[i-1]\) if \(i\ge 1\). Thus,

$$\begin{aligned} z_0[i] = y_0[i] \oplus y_1[i] \oplus 1 \oplus y_0[i-1], \end{aligned}$$

for all \(i\ge 1\), and we obtain the first equality.

Next, let us assume that \((y_1,y_0) \in {\mathcal {S}}_{\mathtt {10}}\), i.e., \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 1\) and \(y_0[i-2] \oplus {{\bar{y}}}_1[i-2] = 0\). Then, \(c[i-1] = c[i-2]\), and \(c[i-2] = y_0[i-2]\) if \(i\ge 2\). Thus,

$$\begin{aligned} z_0[i] = y_0[i] \oplus y_1[i] \oplus 1 \oplus y_0[i-2], \end{aligned}$$

for all \(i\ge 2\), and we obtain the second equality.

Finally, let us assume that \((y_1,y_0) \in {\mathcal {S}}_{\mathtt {11}}\), i.e., \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 1\) and \(y_0[i-2] \oplus {{\bar{y}}}_1[i-2] = 1\). Then, \(c[i-1] = c[i-2]\), and \(c[i-2] = c[i-3]\) if \(i \ge 3\). Thus,

$$\begin{aligned} z_0[i] = y_0[i] \oplus y_1[i] \oplus 1 \oplus c[i-3] \end{aligned}$$

holds for all \(i\ge 3\). The carry \(c[i-3]\) is the output of the majority function as \(c[i-3] = maj(c[i-4], y_0[i-3], {{\bar{y}}}_1[i-3])\), and a linear approximation, \(c[i-3] \approx y_0[i-3]\), holds with correlation \(2^{-1}\). Thus, we have

$$\begin{aligned} z_0[i] \approx y_0[i] \oplus y_1[i] \oplus y_0[i-3] \end{aligned}$$

with correlation \(-2^{-1}\), and we obtain the final approximation. \(\square \)

Fig. 4: Representation for \(z_0 = y_0 + {{\bar{y}}}_1 + 1\)

The representations for the partitions \({\mathcal {S}}_{\mathtt {0}*}\) and \({\mathcal {S}}_{\mathtt {10}}\) are the same as in Lemma 1. We additionally introduce a linear approximation for the partition \({\mathcal {S}}_{\mathtt {11}}\), which was discarded in the original partitioning technique. Note that the cost of determining the partition is not increased compared to the previous partitioning technique, because we simply reuse the previously discarded partition. The only additional cost comes from the new bit \(y_0[i-3]\), but thanks to our FWHT-based key recovery technique this overhead is negligible. We discuss this in detail in Sect. 8.
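The partitions and correlations of Lemma 2 can also be verified exhaustively for small parameters; the sketch below (our own verification code, for \(m = 8\) and \(i = 5\)) reproduces the correlations \(-1\), \(-1\) and \(-2^{-1}\).

# Exhaustive check of Lemma 2 for m = 8 and i = 5 (verification sketch).
m, i = 8, 5
mask = (1 << m) - 1
bit = lambda x, j: (x >> j) & 1

acc = {"0*": [0, 0], "10": [0, 0], "11": [0, 0]}    # partition -> [sum of (-1)^(...), count]
for y0 in range(1 << m):
    for y1 in range(1 << m):
        z0 = (y0 - y1) & mask                       # z0 = y0 + NOT(y1) + 1 mod 2^m
        s = y0 ^ y1 ^ mask                          # s = y0 xor NOT(y1)
        if bit(s, i - 1) == 0:
            part, approx = "0*", bit(y0, i) ^ bit(y1, i) ^ bit(y0, i - 1)
        elif bit(s, i - 2) == 0:
            part, approx = "10", bit(y0, i) ^ bit(y1, i) ^ bit(y0, i - 2)
        else:
            part, approx = "11", bit(y0, i) ^ bit(y1, i) ^ bit(y0, i - 3)
        acc[part][0] += (-1) ** (bit(z0, i) ^ approx)
        acc[part][1] += 1

for part, (total, count) in acc.items():
    print(part, total / count)                      # prints -1.0, -1.0 and -0.5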

Fig. 5: Two linear trails with correlation \(2^{-1}\)

Due to the propagation rules for linear trails over modular addition, we may end up with multiple linear trails that are closely related to each other. As an example, Fig. 5 shows two possible trails, where [i] and \([i,i-1]\) denote the corresponding linear masks. The partitioning technique described above evaluates \(z_0[i]\), but we can expect that there is a highly biased linear trail in which \(z_0[i] \oplus z_0[i-1]\) needs to be evaluated instead of \(z_0[i]\). With the trivial approach, we would apply the partitioning technique of Lemma 2 to \(z_0[i]\) and \(z_0[i-1]\) separately, which requires the knowledge of 3 bits of information from y in total. Our new partitioning method allows us to determine the partition with the knowledge of only the same 2 bits of information as needed for evaluating \(z_0[i]\), namely \((y_0[i-1] \oplus y_1[i-1])\) and \((y_0[i-2] \oplus y_1[i-2])\). This is especially helpful if y consists of the ciphertext XORed with the key, since we then need to guess fewer key bits to evaluate the partition.

Lemma 3

Let \(s = y_0 \oplus {{\bar{y}}}_1\) and let \(i \ge 3\). Let \({\mathcal {S}}_{b_0b_1} :=\{ (y_1,y_0) \in {\mathbb {F}}_2^{2m} \mid s[i-1] = b_0 \text { and } s[i-2] = b_1\}\). We have

$$\begin{aligned}&z_0[i] \oplus z_0[i-1]\\&\quad \approx {\left\{ \begin{array}{ll} y_0[i] \oplus y_1[i], \qquad \qquad \qquad \qquad \qquad ~\mathrm{with~cor.}~1, &{} \mathrm{if}~ (y_1,y_0) \in {\mathcal {S}}_{\mathtt {1}*}, \\ y_0[i] \oplus y_1[i] \oplus y_0[i-1] \oplus y_0[i-2], ~\mathrm{with~cor.}~{-1}, &{}\mathrm{if}~(y_1,y_0) \in {\mathcal {S}}_{\mathtt {00}}, \\ y_0[i] \oplus y_1[i] \oplus y_0[i-1] \oplus y_0[i-3], ~\mathrm{with~cor.}~{-2^{-1}}, &{}\mathrm{if}~(y_1,y_0) \in {\mathcal {S}}_{\mathtt {01}}, \end{array}\right. } \end{aligned}$$

where \({\mathcal {S}}_{\mathtt {1} *} = {\mathcal {S}}_{\mathtt {10}} \cup {\mathcal {S}}_{\mathtt {11}}\).

Proof

By evaluating the modular addition \(z_0 = y_0 + {{\bar{y}}}_1 + 1\), we have

$$\begin{aligned} z_0[i] \oplus z_0[i-1] = y_0[i] \oplus {{\bar{y}}}_1[i] \oplus c[i-1] \oplus y_0[i-1] \oplus {{\bar{y}}}_1[i-1] \oplus c[i-2],\end{aligned}$$

where \(c[i-1]\), resp., \(c[i-2]\) denotes the carry occurring at bit position \(i-1\), resp., \(i-2\). As before, we define \(c[-1]:=1\).

Let us first assume that \((y_1,y_0) \in {\mathcal {S}}_{\mathtt {1}*}\), i.e., \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 1\). Then, clearly \(c[i-1] = c[i-2]\). Thus,

$$\begin{aligned} z_0[i] \oplus z_0[i-1] = y_0[i] \oplus y_1[i], \end{aligned}$$

and we obtain the first equality.

Next, let us assume that \((y_1,y_0) \in {\mathcal {S}}_{\mathtt {00}}\), i.e., \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 0\) and \(y_0[i-2] \oplus {{\bar{y}}}_1[i-2] = 0\). Since \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 0\), we have \(c[i-1] = {{\bar{y}}}_1[i-1]\). Since \(y_0[i-2] \oplus {{\bar{y}}}_1[i-2] = 0\), we have \(c[i-2] = y_0[i-2]\). Thus,

$$\begin{aligned} z_0[i] \oplus z_0[i-1] = y_0[i] \oplus y_1[i] \oplus 1 \oplus y_0[i-1] \oplus y_0[i-2], \end{aligned}$$

and we obtain the second equality.

Finally, let us assume that \((y_1,y_0) \in {\mathcal {S}}_{\mathtt {01}}\), i.e., \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 0\) and \(y_0[i-2] \oplus {{\bar{y}}}_1[i-2] = 1\). Since \(y_0[i-1] \oplus {{\bar{y}}}_1[i-1] = 0\), we have \(c[i-1] = {{\bar{y}}}_1[i-1]\). Since \(y_0[i-2] \oplus {{\bar{y}}}_1[i-2] = 1\), we have \(c[i-2] = c[i-3]\). Thus,

$$\begin{aligned} z_0[i] \oplus z_0[i-1] = y_0[i] \oplus y_1[i] \oplus 1 \oplus y_0[i-1] \oplus c[i-3]. \end{aligned}$$

The carry \(c[i-3]\) is the output of the majority function as \(c[i-3] = maj(c[i-4], y_0[i-3], {{\bar{y}}}_1[i-3])\), and a linear approximation, \(c[i-3] \approx y_0[i-3]\), holds with correlation \(2^{-1}\). Thus,

$$\begin{aligned} z_0[i] \oplus z_0[i-1] \approx y_0[i] \oplus y_1[i] \oplus y_0[i-1] \oplus y_0[i-3] \end{aligned}$$

holds with correlation \(-2^{-1}\), and we obtain the final approximation. \(\square \)
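Analogously to the check of Lemma 2 above, the correlations of Lemma 3 can be reproduced exhaustively; the following sketch (our own code, again for \(m = 8\) and \(i = 5\)) prints the three correlations 1, \(-1\) and \(-2^{-1}\).

# Exhaustive check of Lemma 3 for m = 8 and i = 5 (verification sketch).
m, i = 8, 5
mask = (1 << m) - 1
bit = lambda x, j: (x >> j) & 1

acc = {"1*": [0, 0], "00": [0, 0], "01": [0, 0]}
for y0 in range(1 << m):
    for y1 in range(1 << m):
        z0 = (y0 - y1) & mask
        target = bit(z0, i) ^ bit(z0, i - 1)
        s = y0 ^ y1 ^ mask
        if bit(s, i - 1) == 1:
            part, approx = "1*", bit(y0, i) ^ bit(y1, i)
        elif bit(s, i - 2) == 0:
            part, approx = "00", bit(y0, i) ^ bit(y1, i) ^ bit(y0, i - 1) ^ bit(y0, i - 2)
        else:
            part, approx = "01", bit(y0, i) ^ bit(y1, i) ^ bit(y0, i - 1) ^ bit(y0, i - 3)
        acc[part][0] += (-1) ** (target ^ approx)
        acc[part][1] += 1

for part, (total, count) in acc.items():
    print(part, total / count)                      # prints 1.0, -1.0 and -0.5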

In practice, we need to consider more complicated partitions. For example, we sometimes consider the case in which the approximated function consists of multiple modular subtractions and rotations. A summary of our partitioning is given in “Appendix A”.

4 High-Level Overview of the New Attack Framework

Fig. 6: The new attack framework

In this section, we introduce a high-level overview of our new attack framework for differential-linear attacks with the partitioning technique. The framework consists of several novel techniques, which are: (1) amplifying the probability of the differential part by carefully choosing an appropriate linear subspace \({\mathcal {U}}\) for generating good pairs, (2) choosing the linear masks dynamically depending on each partition, and (3) an FWHT-based technique for improving the key recovery part when using partitions.

Figure 6 shows the high-level description of the general structure. Here F corresponds to the part of the cipher that we are going to cover using our improved key-guessing strategy. Our aim is to recover parts of the last whitening key k by using a differential-linear distinguisher given by s (multiple) linear approximations \(\langle \Gamma _{\mathrm {out}}^{(p_i)},z \rangle \oplus \langle \Gamma _{\mathrm {out}}^{(p_j)},{\tilde{z}} \rangle \). In the following, we assume that the ciphertext space \({\mathbb {F}}_2^n\) is split into a direct sum \({\mathcal {P}}\oplus {\mathcal {R}}\) with \(n_{{\mathcal {P}}} :=\dim {\mathcal {P}}\) and \(n_{{\mathcal {R}}} :=\dim {\mathcal {R}}= n-n_{{\mathcal {P}}}\), so that the partitions will be given by the cosets \({\mathcal {T}}_{p_i} = p_i \oplus {\mathcal {R}}\) for any \(p_i \in {\mathcal {P}}\) (i.e., \(p_i\) represents a set of the partition). Notice that, without loss of generality, we can also assume that \(c \in {\mathbb {F}}_2^n\) is divided into two parts \(c_{\mathcal {P}}\in {\mathbb {F}}_2^{n_{\mathcal {P}}} = {\mathcal {P}}\) and \(c_{\mathcal {R}}\in {\mathbb {F}}_2^{n - n_{\mathcal {P}}} = {\mathcal {R}}\), and \(c = (c_{\mathcal {P}}, c_{\mathcal {R}})\); similarly, we can write \(k = (k_{\mathcal {P}},k_{\mathcal {R}})\) and \(y = (y_{\mathcal {P}}, y_{\mathcal {R}})\). Then, for any \(p_i \in {\mathcal {P}}\), the partition \({\mathcal {T}}_{p_i} \subset {\mathbb {F}}_2^n\) is defined as \({\mathcal {T}}_{p_i} = \{y \in {\mathbb {F}}_2^n \mid y_{{\mathcal {P}}} = p_i \}\). Since both formalizations are equivalent, we will use either interchangeably depending on the context.

4.1 The Differential Part

The first step of the attack is to collect many \((x, x \oplus \Delta _{\mathrm {in}})\) satisfying \(E_1(x) \oplus E_1(x \oplus \Delta _{\mathrm {in}}) = \Delta _m\), which are called right pairs. Let \({\mathcal {X}}\) be the set defined as

$$\begin{aligned} {\mathcal {X}} = \{ x \in {\mathbb {F}}_2^n \mid E_1(x) \oplus E_1(x \oplus \Delta _{\mathrm {in}}) = \Delta _m \}. \end{aligned}$$

If pairs are used from \({\mathcal {X}}\), the probability of the differential part becomes 1 and the correlation of the differential-linear distinguisher also increases. To collect many such pairs efficiently, we use a linear subspace \({\mathcal {U}}\). In the simplest case, this subspace is chosen such that for any \(x \in {\mathcal {X}}\) and any \(u \in {\mathcal {U}}\) it holds that \(x \oplus u \in {\mathcal {X}}\), i.e., \((x \oplus u, x \oplus u \oplus \Delta _{\mathrm {in}})\) is a right pair as well. However, this strict requirement restricts the size of \({\mathcal {U}}\) significantly and for the attack it is sufficient if this implication is true for most elements in \({\mathcal {X}}\). To capture this precisely, we define a subset \(\mathcal {X'}\) of the set of right pairs \({\mathcal {X}}\) as

$$\begin{aligned} \mathcal {X'} = \{ x \in {\mathcal {X}} \mid x \oplus u \in {\mathcal {X}} \text{ for } \text{ all } u \in {\mathcal {U}}\}. \end{aligned}$$

Once we find an \(x \in {\mathcal {X}}'\), we can find \(2^{\dim {\mathcal {U}}}\) right pairs for free. The differential probability p is in fact defined as \(p = |{\mathcal {X}}|/2^n\), which means we can reduce the data complexity by the factor \(p^{-1}\) when \({\mathcal {X}}' = {\mathcal {X}}\). We would like to remark that this idea has already been used in other contexts, e.g., the differential attack [36], but to the best of our knowledge, it has not been applied to differential-linear attacks. We discuss the differential part in detail in Sect. 5.

4.2 The Linear Part

The idea is to identify several tuples \(({\mathcal {T}}_{p_i},\Gamma _{\mathrm {out}}^{(p_i)},\gamma ^{(p_i)})\), \(i \in \{1,\dots ,s\}\), where \(\gamma ^{(p_i)} \in {\mathcal {R}}\), for which we can observe a high absolute correlation

$$\begin{aligned} \varepsilon _{i} :=\mathbf {Cor}_{y \in {\mathcal {T}}_{p_i}}\left[ \langle \Gamma _{\mathrm {out}}^{(p_i)},z \rangle \oplus \langle \gamma ^{(p_i)},y \rangle \right] . \end{aligned}$$

In the simplest case, we would have \(\varepsilon _{i} = 1\), i.e.,

$$\begin{aligned} y \in {\mathcal {T}}_{p_i} \quad \Rightarrow \quad \left( \langle \Gamma _{\mathrm {out}}^{(p_i)},z\rangle = \langle \gamma ^{(p_i)},y \rangle = \langle \gamma ^{(p_i)},c \rangle \oplus \langle \gamma ^{(p_i)},k \rangle \right) . \end{aligned}$$

In other words, by considering only a specific subset of the ciphertexts (defined by \({\mathcal {T}}_{p_i}\)) we obtain linear relations in the key with a high correlation.

The correlation of the differential-linear distinguisher for \(E_2 \circ E_m \circ E_1\), which is denoted by \(q_{i,j}\), is defined as

$$\begin{aligned} q_{i,j} :=\mathbf {Cor}_{\begin{array}{c} x \in {\mathcal {X}} \text { such that}\\ (y, {{\tilde{y}}}) \in {\mathcal {T}}_{p_i} \times {\mathcal {T}}_{p_j} \end{array}} \left[ \langle \Gamma _{\mathrm {out}}^{(p_i)},z\rangle \oplus \langle \Gamma _{\mathrm {out}}^{(p_j)},{\tilde{z}} \rangle \right] . \end{aligned}$$

In practice, it is not feasible to compute \(q_{i,j}\) for all partitions indexed by (ij) when each correlation is low and the number of partitions is large. Therefore, we introduce the following assumption.

Assumption 1

The correlation \(q_{i,j}\) is independent of \({\mathcal {T}}_{p_i} \times {\mathcal {T}}_{p_j}\). In other words, we assume

$$\begin{aligned} q_{i,j} = \mathbf {Cor}_{\begin{array}{c} x \in {\mathcal {X}} \end{array}} \left[ \langle \Gamma _{\mathrm {out}}^{(p_i)},z\rangle \oplus \langle \Gamma _{\mathrm {out}}^{(p_j)},{\tilde{z}} \rangle \right] \text { for all } i,j. \end{aligned}$$

We finally observe the following correlation for \((y, {{\tilde{y}}})\) by guessing the secret key, and the final correlation is defined as

$$\begin{aligned} \rho _{i,j}&:=\mathbf {Cor}_{\begin{array}{c} x \in {\mathcal {X}} \text { such that}\\ (y, {{\tilde{y}}}) \in {\mathcal {T}}_{p_i} \times {\mathcal {T}}_{p_j} \end{array}} \left[ \langle \gamma ^{(p_i)},y\rangle \oplus \langle \gamma ^{(p_j)},{\tilde{y}} \rangle \right] . \end{aligned}$$

In addition to Assumption 1, we use the following assumption in order to estimate the final correlation.

Assumption 2

The correlations \(q_{i,j}\), \(\varepsilon _i\), and \(\varepsilon _j\) are independent of each other, and \(\rho _{i,j}\) can be estimated as

$$\begin{aligned} \rho _{i,j} = \varepsilon _i \varepsilon _j q_{i,j}. \end{aligned}$$
(3)

Unfortunately, Assumption 2 does not hold in general because it ignores the impact of the auto-correlation-linear hull effect. Namely, for a more precise evaluation, we need to consider multiple differential-linear trails with tuples \(({\mathcal {T}}_{p_i}, *,\gamma ^{(p_i)})\), where \(*\) represents arbitrary linear masks, and \(\rho _{i,j}\) can be computed as the sum of these correlations. In other words, we need to consider multiple \(\Gamma _{\mathrm {out}}^{(p_i)}\) for each fixed \(({\mathcal {T}}_{p_i}, \gamma ^{(p_i)})\). We later discuss the auto-correlation-linear hull in Sect. 6. There it is also shown that, when considering the auto-correlation-linear hull, Assumptions 1 and 2 are replaced by the assumption that the hull is dominated by a single trail in order to justify Eq. (3).

How to identify the partition a given text belongs to is very important, and it depends highly on the specification of the target cipher. The applications to Chaskey and ChaCha are too involved to illustrate this behavior, so we provide some simpler cases in “Appendix B” for an easier understanding.

4.3 LLR-Based Statistical Test

According to the Neyman–Pearson Lemma, the LLR test is the most powerful statistical test and as such has been used as a cryptanalysis tool (see, e.g., [31, 32]). When multiple linear trails with different correlations are used, the LLR test is more appropriate than a simple sum that ignores the different correlations of the individual linear approximations.

Let us consider our differential-linear attacks using N pairs. An important remark is that each of the N pairs contributes differently to the correlations. Therefore, we need to consider the contribution to the theoretical correlation of each of them. Let \((y, {{\tilde{y}}})\) be the \(\ell \)th pair and let us assume that y and \({{\tilde{y}}}\) belong to the ith and jth partitions, respectively, i.e., \(y_{\mathcal {P}}= p_i\) and \({{\tilde{y}}}_{\mathcal {P}}= p_j\) (for ease of notation, we do not make the dependency of \(y, {\tilde{y}}, i, j\) on \(\ell \) explicit). We then get the 1-bit representation

$$\begin{aligned} w_\ell = \langle y, \gamma ^{(p_{i})} \rangle \oplus \langle {{\tilde{y}}}, \gamma ^{(p_{j})} \rangle \end{aligned}$$

and consider the probability

$$\begin{aligned} \pi _\ell = \mathbf {Pr}_{y, {\tilde{y}}}\left[ w_\ell = 0 \right] . \end{aligned}$$

We refer to the theoretical correlation as \(\rho _{i,j}\) when the ith and jth partitions are used. Namely, \(\pi _\ell = \frac{\rho _{i,j} + 1}{2}\). For simplicity, let \(C_\ell \) denote this correlation for the \(\ell \)th pair (and the ith and jth partitions are used for this pair), i.e., \(C_\ell = \rho _{i,j} = 2 \pi _\ell - 1\).

Let \(D_0\) and \(D_1\) be the random vector distributions

$$\begin{aligned} D_0 : \left( {\mathcal {B}}(1, \pi _1), \ldots , {\mathcal {B}}(1, \pi _N) \right) , \quad D_1 : \left( {\mathcal {B}}(1, 1/2), \ldots , {\mathcal {B}}(1, 1/2) \right) , \end{aligned}$$

where \({\mathcal {B}}(n,\pi _{\ell })\) are independent binomial distributions with n trials and success probability \(\pi _{\ell }\), and where the \(\pi _{\ell }\) are not necessarily distinct.

Our goal is to distinguish whether \(w :=\left( w_1, \dotsc , w_N\right) \) is the result of sampling from \(D_0\) (i.e., the real distribution) or \(D_1\) (i.e., the random distribution). Let \(q_0\) and \(q_1\) be the probability that \(w :=\left( w_1, \dotsc , w_N\right) \) is the result of sampling from \(D_0\) or \(D_1\), respectively. Thus,

$$\begin{aligned} q_0&= \mathbf {Pr}\left[ X = w \mid X \sim D_0\right] ,&q_1&= \mathbf {Pr}\left[ X = w \mid X \sim D_1\right] . \end{aligned}$$

The LLR statistic is defined as \(\ln (q_0/q_1)\), and it is computed as

$$\begin{aligned} \ln \left( \frac{q_0}{q_1}\right) = \frac{1}{2} \sum _{\ell =1}^N \ln {\left( 1-C_{\ell }^2\right) } + \frac{1}{2}\sum _{\ell =1}^N \ln {\left( \frac{1 + C_{\ell }}{1-C_{\ell }}\right) }{(-1)}^{w_{\ell }}. \end{aligned}$$

Let us assume that the LLR statistic follows normal distributions \({\mathcal {N}}(\mu _0, \sigma _0^2)\) and \({\mathcal {N}}(\mu _1, \sigma _1^2)\) when the correct and wrong keys are guessed, respectively. We have

$$\begin{aligned} \mu _0 = -\mu _1 = \frac{1}{2} \sum _{\ell =1}^N C_{\ell }^2 = \frac{N}{2} {\overline{C}},&\sigma _0^2 = \sigma _1^2 = \sum _{\ell =1}^N C_{\ell }^2 = N {\overline{C}}, \end{aligned}$$

where by \({\overline{C}}\) we denote the average of the squared correlations, i.e., \({\overline{C}} :=\frac{1}{N} \sum _{\ell =1}^N C_{\ell }^2\). We derive the formulas above in Sect. 7.
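The LLR statistic and the decision rule derived from the two normal distributions can be computed directly from the per-pair correlations \(C_\ell \) and observed bits \(w_\ell \); the following sketch (our own code, with illustrative parameter values) simply restates the formulas above.

import math

def llr_statistic(w, C):
    """LLR = ln(q0/q1) for observed bits w[l] in {0,1} and per-pair correlations C[l]."""
    const = 0.5 * sum(math.log(1 - c * c) for c in C)
    data = 0.5 * sum(math.log((1 + c) / (1 - c)) * (-1) ** wl for wl, c in zip(w, C))
    return const + data

def llr_parameters(C):
    """(mu0, mu1, sigma^2) of the LLR under the real (D0) and random (D1) distributions."""
    n = len(C)
    c_bar = sum(c * c for c in C) / n
    return n * c_bar / 2, -n * c_bar / 2, n * c_bar

# Example: 2^16 pairs, half with correlation 2^-6 and half with 2^-7 (illustrative numbers).
C = [2 ** -6] * (1 << 15) + [2 ** -7] * (1 << 15)
mu0, mu1, var = llr_parameters(C)
threshold = (mu0 + mu1) / 2   # here 0: accept the "real" hypothesis when llr_statistic(w, C) > 0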

4.4 WHT-Based Key Recovery Technique

We use \((y, {{\tilde{y}}})\) to identify partitions and compute the LLR statistic. Note that \(y \in {\mathcal {T}}_{p_i} \Leftrightarrow c \in {\mathcal {T}}_{p_i} \oplus k_{{\mathcal {P}}}\), so we need to guess \(n_{{\mathcal {P}}}\) bits of k to partition the ciphertexts into the corresponding \({\mathcal {T}}_{p_i}\). Note that, since we require \(\gamma ^{(p_i)} \in {\mathcal {R}}\), we obtain linear relations only on \(k_{{\mathcal {R}}}\).

Let \((c, {{\tilde{c}}})\) be the pair of ciphertexts. The partition the pair belongs to is determined by \(y_{\mathcal {P}}= c_{\mathcal {P}}\oplus k_{\mathcal {P}}\). Therefore, part of the key \(k_{\mathcal {P}}\) is guessed to identify the partition. After guessing \(k_{\mathcal {P}}\), we get the following approximation.

$$\begin{aligned} \langle y, \gamma ^{(y_{\mathcal {P}})} \rangle \approx \langle {{\tilde{y}}}, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle&\Rightarrow \langle c \oplus k, \gamma ^{(y_{\mathcal {P}})} \rangle \approx \langle {{\tilde{c}}} \oplus k, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle \\&\Rightarrow \langle c, \gamma ^{(y_{\mathcal {P}})} \rangle \oplus \langle {{\tilde{c}}}, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle \approx \langle k, \gamma ^{(y_{\mathcal {P}})} \oplus \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle \end{aligned}$$

Since the left-hand side is known, we get the parity of k with respect to the linear mask \(\gamma ^{(y_{\mathcal {P}})} \oplus \gamma ^{({{\tilde{y}}}_{\mathcal {P}})}\). For the linear subspace W defined by \(W :=\text {Span}\{ \gamma ^{(p_i)} \oplus \gamma ^{(p_j)} \mid i,j \in \{1,\ldots ,s\} \}\), the approximations above involve \(\dim W\) bits of information about k. By using the fast Walsh–Hadamard transform (FWHT), we do not need to guess these \(\dim W\) bits for every pair of texts, and the time complexity is estimated as \(2^{n_{\mathcal {P}}} (2 N + \dim W \cdot 2^{\dim W})\), where \(n_{\mathcal {P}}\) is the bit length of \(k_{\mathcal {P}}\). We discuss this procedure in detail in Sect. 8.
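To illustrate the FWHT-based key recovery, the sketch below (our own illustrative code, using a simple unweighted counting statistic instead of the full LLR) accumulates a counter table indexed by the coordinates of \(\gamma ^{(y_{\mathcal {P}})} \oplus \gamma ^{({{\tilde{y}}}_{\mathcal {P}})}\) with respect to a basis of W, and then evaluates all \(2^{\dim W}\) guesses of the corresponding key parities with a single fast Walsh–Hadamard transform.

import random

def fwht(a):
    """In-place fast Walsh-Hadamard transform of a list of length 2^d."""
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for j in range(start, start + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def all_key_statistics(v, t, d):
    """For each guess kappa of the d key parities, return
       stat(kappa) = sum_l (-1)^(v[l] xor <kappa, t[l]>),
    where v[l] is the observed bit of pair l and t[l] encodes the mask difference
    gamma^(y_P) xor gamma^(y~_P) in a basis of W."""
    table = [0] * (1 << d)
    for vl, tl in zip(v, t):
        table[tl] += (-1) ** vl
    return fwht(table)           # one FWHT instead of 2^d separate guesses

# Toy usage: d = 3 key parities, correct value kappa = 0b101, noisy observations.
d, kappa, n_pairs = 3, 0b101, 10000
t = [random.randrange(1 << d) for _ in range(n_pairs)]
v = [(bin(kappa & tl).count("1") & 1) ^ (random.random() < 0.45) for tl in t]
stats = all_key_statistics(v, t, d)
assert max(range(1 << d), key=lambda g: stats[g]) == kappa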

5 The Differential Part—Finding Many Right Pairs

Let us be given a permutation \(E_1:{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2^n\) and a differential \(\Delta _{\mathrm {in}}\overset{E_1}{\rightarrow } \Delta _m\) that holds with probability p. In other words,

$$\begin{aligned} |\{x \in {\mathbb {F}}_2^n \mid E_1(x) \oplus E_1(x \oplus \Delta _{\mathrm {in}}) = \Delta _m\}|= p\cdot 2^n. \end{aligned}$$

In a usual differential-linear attack on a permutation \(E = E_2 \circ E_m \circ E_1\) as explained in Sect. 2.1, the internal structure of \(E_1\) could be in general arbitrary and we would consider randomly chosen \(x \in {\mathbb {F}}_2^n\) to observe the ciphertexts of the plaintext pairs \((x,x \oplus \Delta _{\mathrm {in}})\). For each of those pairs, the differential over \(E_1\) is fulfilled with probability p, which results in a data complexity of roughly \(\epsilon p^{-2}r^{-2}q^{-4}\) for the differential-linear attack. In other words, we did not exploit the particular structure of \(E_1\). In particular, it would be helpful to know something about the distribution of right pairs \((x,x\oplus \Delta _{\mathrm {in}}) \in {\mathbb {F}}_2^n \times {\mathbb {F}}_2^n\) which fulfill the above differential.

Let us denote by \({\mathcal {X}}\) the set of all values that define right pairs for the differential, i.e.,

$$\begin{aligned} {\mathcal {X}} = \{x \in {\mathbb {F}}_2^n \mid E_1(x) \oplus E_1(x\oplus \Delta _{\mathrm {in}}) = \Delta _m \}. \end{aligned}$$

To amplify the correlation of a differential-linear distinguisher, instead of choosing random plaintexts from \({\mathbb {F}}_2^n\), we could consider only those that are in \({\mathcal {X}}\). In particular, we have

$$\begin{aligned} \mathbf {Cor}_{x \in {\mathcal {X}}}\left[ \langle \Gamma _{\mathrm {out}}, E(x) \rangle \oplus \langle \Gamma _{\mathrm {out}}, E(x \oplus \Delta _{\mathrm {in}}) \rangle \right] = rq^2. \end{aligned}$$

Since the set \({\mathcal {X}}\) might have a rather complicated structure and is key-dependent, we cannot use this directly for an arbitrary permutation \(E_1\). However, if \({\mathcal {X}}\) presents a special structure such that, given one element \(x \in {\mathcal {X}}\), we can generate many other elements in \({\mathcal {X}}\) for free, independently of the secret key, we can use this to reduce the data complexity in a differential-linear attack. For example, if \({\mathcal {X}}\) contains a large affine subspace \({\mathcal {A}} = {\mathcal {U}} \oplus a\), given \(x \in {\mathcal {A}}\), we can generate (roughly) \(2^{\dim {\mathcal {U}}}\) elements in \({\mathcal {X}}\) for free, namely all elements \(x \oplus u\), for \(u \in {\mathcal {U}}\). In order to obtain an effective distinguisher, we must be able to generate enough plaintext pairs to observe the correlation of the differential-linear approximation. In particular, we require \(|{\mathcal {U}}|> \epsilon r^{-2}q^{-4}\).

This will be exactly the situation we find in ChaCha. Here the number of rounds covered in the differential part is so small that it can be described by the independent application of two functions (see Sect. 5.1).

If \(|{\mathcal {U}}|\) is smaller than the threshold of \(\epsilon r^{-2}q^{-4}\), we cannot generate enough right pairs for free to obtain a distinguisher by this method and we might use a probabilistic approach (see Sect. 5.2).

In Sect. 5.3, we show how the conditional differential framework can be used to efficiently find bigger sets \({\mathcal {U}}\), as well as to recover some information on the key, and provide some ideas to adapt it to the ARX scenario.

5.1 Fully Independent Parts

Let \(E_1 :{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2^n\) with \(n=2m\) be a parallel application of two block ciphers \(E_1^{(i)}:{\mathbb {F}}_2^m \rightarrow {\mathbb {F}}_2^m\), \(i \in \{0,1\}\) (for a fixed key), i.e.,

$$\begin{aligned} E_1:(x^{(1)},x^{(0)}) \mapsto (E_1^{(1)}(x^{(1)}), E_1^{(0)}(x^{(0)})). \end{aligned}$$

Suppose that \(E_1^{(0)}\) has a differential \(\alpha \overset{E_1^{(0)}}{\rightarrow }\beta \) with probability p. We consider the differential \(\Delta _{\mathrm {in}}\overset{E_1}{\rightarrow } \Delta _m\) with \(\Delta _{\mathrm {in}}= (0,\alpha )\) and \(\Delta _m = (0,\beta )\), which also holds with probability p. Given one element \((x^{(1)},x^{(0)}) \in {\mathcal {X}}\), any \((x^{(1)} \oplus u, x^{(0)})\) for \(u \in {\mathbb {F}}_2^m\) is also contained in \({\mathcal {X}}\), and we can thus generate \(2^m\) right pairs for free.

If \(2^m > \epsilon r^{-2}q^{-4}\), a differential-linear distinguisher on \(E = E_2 \circ E_m \circ E_1\) would work as follows:

  1. Choose \(a = (a^{(1)},a^{(0)}) \in {\mathbb {F}}_2^n\) uniformly at random.

  2. Empirically compute

     $$\begin{aligned} \mathbf {Cor}_{x \in a \oplus ({\mathbb {F}}_2^m \times \{0\})}\left[ \langle \Gamma _{\mathrm {out}}, E(x) \rangle \oplus \langle \Gamma _{\mathrm {out}}, E(x \oplus \Delta _{\mathrm {in}}) \rangle \right] .\end{aligned}$$

  3. If we observe a correlation of \(rq^2\) using \(\epsilon r^{-2}q^{-4}\) many x, the distinguisher succeeded. If not, start over with Step 1.

With probability p, we choose an element \(a \in {\mathcal {X}}\) in Step 1. In that case, the distinguisher succeeds in Step 3. Therefore, the data complexity of the distinguisher is \( \epsilon p^{-1}r^{-2}q^{-4}\), compared to \(\epsilon p^{-2}r^{-2}q^{-4}\) as in the classical differential-linear attack.
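A possible implementation of this distinguisher, treating E as a black box and assuming the parameters m, r and q are known, is sketched below; the function and parameter names are ours and only illustrate the three steps above.

import random

def dl_distinguisher(E, delta_in, gamma_out_bits, m, n, num_samples, threshold):
    """Sketch of the distinguisher of Sect. 5.1: repeat Steps 1-3 until the sampled
    coset a xor (F_2^m x {0}) consists of right pairs, detected via the correlation."""
    parity = lambda x: sum((x >> i) & 1 for i in gamma_out_bits) & 1
    while True:
        a = random.getrandbits(n)                              # Step 1
        acc = 0
        for _ in range(num_samples):                           # Step 2
            x = a ^ (random.getrandbits(m) << m)               # vary only x^(1) (high m bits)
            acc += (-1) ** (parity(E(x)) ^ parity(E(x ^ delta_in)))
        if abs(acc / num_samples) >= threshold:                # Step 3: correlation about |r q^2|?
            return a                                           # a is (very likely) an element of X

With num_samples of roughly \(\epsilon r^{-2}q^{-4}\) and a threshold chosen between 0 and \(|rq^2|\), the outer loop is expected to run about \(p^{-1}\) times, matching the data complexity stated above.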

5.2 Probabilistic Independent Parts

Since the previous decomposition is not always possible or \(2^m\) might not be big enough, we are also interested in the situations in which the differential part cannot be simply written as the parallel application of two functions. Again, the goal is, given one element \(x \in {\mathcal {X}}\), to be able to generate \(\epsilon r^{-2} q^{-4}\) other elements in \({\mathcal {X}}\), each one with a much lower cost than \(p^{-1}\). Suppose that \({\mathcal {U}} \subseteq {\mathbb {F}}_2^n\) is a subspace with \(|{\mathcal {U}}|> \epsilon r^{-2} q^{-4}\) and suppose that \(\mathbf {Pr}_{u \in {\mathcal {U}}}\left[ x \oplus u \in {\mathcal {X}} \mid x \in {\mathcal {X}}\right] = p_1\), where \(p_1\) is much larger than p. The data complexity of the improved differential-linear distinguisher would then be \( \epsilon p^{-1}p_1^{-2}r^{-2}q^{-4}\). Note that the probability \(p_1\) may also depend on x. In particular, there might be \(x \in {\mathcal {X}}' \subseteq {\mathcal {X}}\) for which \(p_1\) is (almost) 1, but the probability to draw such an initial element x from \({\mathbb {F}}_2^n\) is \(p'\), which is smaller than p. Then, the data complexity would be \(\epsilon p'^{-1} p_1^{-2} r^{-2} q^{-4}\). For instance, this will be the case for the attack on Chaskey (Sect. 9), where we have \(p_1 \approx 1\) and \(p' = p \times 222/256\).

In such situations, we propose an algorithmic way to experimentally detect suitable structures in the set of right pairs. The idea (see Algorithm 1 for the pseudocode) is to detect canonical basis vectors within the subspace \({\mathcal {U}}\). Running this algorithm for enough samples will return estimates of the probability \(\gamma _i\) that a right pair \(x \in {\mathcal {X}}\) stays a right pair when the ith bit is flipped, i.e.,

$$\begin{aligned} \gamma _i = \mathbf {Pr}\left[ x\oplus [i] \in {\mathcal {X}} \mid x \in {\mathcal {X}}\right] . \end{aligned}$$

When applied to a few rounds of ARX ciphers, it can be expected that there are some bits that will always turn a right pair into a right pair, i.e., \(\gamma _i=1\). Moreover, due to the property of the modular addition that the influence of bits on distant bits degrades quickly, high values of \(\gamma _i\ne 1\) can also be expected. As we will detail in Sect. 9, this will be the case for the application to Chaskey.

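Since the pseudocode of Algorithm 1 is not reproduced here, the following sketch is our reconstruction from the description above: it samples right pairs by rejection and estimates, for every bit position i, the probability \(\gamma _i\) that flipping bit i preserves membership in \({\mathcal {X}}\).

import random

def estimate_gammas(E1, delta_in, delta_m, n, num_right_pairs):
    """Estimate gamma_i = Pr[x xor [i] in X | x in X] for all i, where X is the set of
    right pairs of the differential delta_in -> delta_m over E1 (reconstruction of
    Algorithm 1 from its description; not the authors' original code)."""
    is_right = lambda x: E1(x) ^ E1(x ^ delta_in) == delta_m
    hits = [0] * n
    found = 0
    while found < num_right_pairs:
        x = random.getrandbits(n)
        if not is_right(x):                  # rejection sampling of elements of X
            continue
        found += 1
        for i in range(n):                   # flip each bit of x and test again
            if is_right(x ^ (1 << i)):
                hits[i] += 1
    return [h / num_right_pairs for h in hits]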

5.3 Using the Conditional Differential Framework for Finding Better Subspaces

A practical way of producing a set of pairs of data whose elements all verify a certain differential path reflects a scenario similar to the one considered in conditional differential attacks. In [36], in the context of NLFSR-based primitives, the elements of the basis of \({\mathcal {U}}\) are called freebits and involve more complex relations derived from the differential paths than just simple single-bit relations. In Sect. 3.2 of [36], three types of conditions are presented: type 0 conditions, which only involve public bits (which is common in NLFSR initialization, but not in ARX or SPN constructions); type 1 conditions, which involve both secret and public bits; and type 2 conditions, which are conditions directly on the keybits. The freebits are actually the ones that do not affect type 1 conditions. Using these definitions we can improve previous attacks in two ways: increasing for free the number of keybits recovered by the differential-linear attack thanks to type 2 conditions, and increasing the size of \({\mathcal {U}}\) by using the freebits as defined in [36].

In this section, we provide some hints and general ideas on how to use this framework for improving the analysis in ARX constructions. In “Appendix C” we provide a detailed example on how to determine additional keybits with type 2 conditions and on how to increase the number of freebits with evolved relations for Chaskey.

Conditional differential framework for differential-linear attacks

Using the definitions from [36, Sect. 3.2], it is easy to see how to improve the differential part of some attacks (quite straightforwardly for ARX) in three main ways:

  1. We can increase the size of \({\mathcal {U}}\) by exhausting the input values that keep the conditions of type 1 fixed to a certain value (as was done in the applications in that paper), as if these conditions were true, they would stay true for all the sampling set. These exhausted bits might not be completely free as type 1 conditions need to remain constant.

  2. When considering a particular set of plaintexts to check if it is the one verifying the differential path, some information on the value of some associated keybits or keybit relations can be presupposed (given directly by conditions of type 2 and indirectly by conditions of type 1). This means that, in all cases, we can assume some information on the key to be known. This information might be used to recover more bits and, more importantly, could reduce the complexity of the key-search part in the final rounds.

  3. A combination of both: guessing some keybits that might be useful for the linear part and that simultaneously might allow us to detect sampling-bit relations that follow the path with probability 1.

Main ideas for exploiting the conditions on ARX

We now present some general ideas for exploiting conditions on ARX constructions. Even though some of them might seem trivial, it is nonetheless helpful to set them as rules to follow.

We can define some rules that apply when flipping the parity of differences. Instead of using only single non-active bit flipping for defining the freebits, we can study the effect of flipping the parity of the differences as additional sampling bits when possible. We can identify several relevant cases, and we present here four cases of particular interest: (i) If a pair of differences is going to be erased after a modular addition (which implies they have a different parity), changing the parity of one will need changing the parity of the other. (ii) If a bit difference is staying at the same position (and not propagating further) after a modular transition, changing its parity will not affect the transition. (iii) If two active bits at position i will produce a difference after the modular addition at position \(i+1\) (move the difference), flipping both active bits at the same time will change the parity of the output at \(i+1\). (iv) If two words are added with a difference in position i and in positions i and \(i+1\), respectively, and we want to absorb the differences after the modular additions, the carries of the previous positions will not affect the bits after position i. We can also change the parity of the three bits simultaneously, and the differences will still be absorbed, and the values will stay the same. Of course, all this might have an effect on further rounds, which will have, in turn, to be taken into account.

It is also useful to keep in mind that, when we identify several input bits that only influence the differential transitions through a XOR, flipping an even number of these will not alter the verification of the path.

When dealing with carries, they might affect transitions with low probability. It is interesting to keep in mind that, when there is a sum of two zeros at position i, the value of all the bits at lower positions will not affect the carries at any higher positions. That might imply that a small guess (for instance 2 keybits for fixing two bits to zero) can generate many more bits for the sampling part with probability one if they only affected the differential path through these carries.

6 Auto-Correlation-Linear Hulls and Partitioning

In this section, we want to better understand how to compute the correlations \(\rho _{i,j}\) when the ith and jth partitions are used. This will allow us to shed light on Assumptions 1 and 2.

For this, we will derive general formulas which express the differential-linear correlation with restricted output of a function composed of two parts. Note that in the following we do not make any assumptions on the independence of these parts. Furthermore, the notion of an auto-correlation-linear hull developed below will allow us to improve upon the attacks by considering multiple intermediate masks.


We start by considering the unrestricted variant. Given two functions \(G_1, G_2:{\mathbb {F}}_2^n\rightarrow {\mathbb {F}}_2^n,\) an input difference \(\Delta \) and two output-masks \(\gamma \) and \(\gamma '\), let \(H:=G_2 \circ G_1\) and

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_H(\Delta ,\gamma ,\gamma ') := 2^{-n} \sum _{x \in {\mathbb {F}}_2^n} (-1)^{\langle \gamma , H(x) \rangle \oplus \langle \gamma ', H(x \oplus \Delta )\rangle } \end{aligned}$$

be the differential-linear correlation on H.

We are interested in how to compute the differential-linear correlation when considering intermediate masks \(\Gamma \) and \(\Gamma '\). In a second step, the outputs will be restricted to coming from a set M and a set N, respectively.

Note that this approach is different from the considerations in [37]: there, the question was how to compute the auto-correlation by correctly connecting the differential and the linear part, while here we extend a differential-linear correlation by a second linear approximation of the two parts.

This auto-correlation can be expressed as

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_H(\Delta ,\gamma ,\gamma ')= & {} \sum _{u\in {\mathbb {F}}_2^n} {\widehat{H}}(u,\gamma ) {\widehat{H}}(u,\gamma ') (-1)^{\langle u,\Delta \rangle } , \end{aligned}$$
(4)

where we denote by

$$\begin{aligned} {\widehat{H}}(u,\gamma ) = 2^{-n}\sum _{x \in {\mathbb {F}}_2^n} (-1)^{\langle \gamma , H(x) \rangle + \langle u,x\rangle } \end{aligned}$$

the correlation of the approximation of H with input and output masks u and \(\gamma \). This follows from the connection of the Walsh–Hadamard transform and the convolution of functions (see, e.g., [38, Proposition 11]), but can also be verified directly.

In our attack framework, \(G_1\) corresponds to \(E_2 \circ E_m \circ E_1\) and \(G_2\) to F. We experimentally estimate the auto-correlation of \(G_1\) and multiply it with the correlations of \(G_2\) for input masks \(\Gamma \) and \(\Gamma '\) and output masks \(\gamma \) and \(\gamma '\), i.e., we estimate

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_H(\Delta ,\gamma ,\gamma ')\approx & {} {\widehat{G}}_2(\Gamma ,\gamma ) {\widehat{G}}_2(\Gamma ',\gamma ') {{\,\mathrm{Aut}\,}}_{G_1}(\Delta ,\Gamma ,\Gamma '). \end{aligned}$$
(5)

This is of course only an approximation and we now want to get an explicit expression of the hull effect, i.e., of all the parts we ignore in the above expression without making any assumptions. Furthermore, we have to take the partitioning of the outputs into account.

For this, we use the fact (see [39]) that we can split the correlation of H into

$$\begin{aligned} {\widehat{H}}(u,\gamma ) = \sum _{\Gamma \in {\mathbb {F}}_2^n} {\widehat{G}}_1(u,\Gamma ){\widehat{G}}_2(\Gamma ,\gamma ) \end{aligned}$$

with an intermediate mask \(\Gamma \), i.e., the linear hull. Putting this back in the definition of the auto-correlation, we get the following.

Proposition 1

(Auto-Correlation-Linear Hull) Let \(G_1,G_2 :{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2^n\). For any \(\Delta ,\gamma ,\gamma ' \in {\mathbb {F}}_2^n\) we then have

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_{G_2 \circ G_1}(\Delta ,\gamma ,\gamma ')=\sum _{\Gamma ,\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G}}_2(\Gamma ,\gamma ) {\widehat{G}}_2(\Gamma ',\gamma ') {{\,\mathrm{Aut}\,}}_{G_1}(\Delta ,\Gamma ,\Gamma '). \end{aligned}$$

Proof

Let \(H = G_2 \circ G_1\). Then

$$\begin{aligned}&{{\,\mathrm{Aut}\,}}_{G_2 \circ G_1}(\Delta ,\gamma ,\gamma ') \\&\quad = \sum _{u\in {\mathbb {F}}_2^n} {\widehat{H}}(u,\gamma ) {\widehat{H}}(u,\gamma ') (-1)^{\langle u,\Delta \rangle } \\&\quad = \sum _{u\in {\mathbb {F}}_2^n} \left( \sum _{\Gamma \in {\mathbb {F}}_2^n} {\widehat{G}}_1(u,\Gamma ){\widehat{G}}_2(\Gamma ,\gamma ) \right) \left( \sum _{\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G}}_1(u,\Gamma '){\widehat{G}}_2(\Gamma ',\gamma ') \right) (-1)^{\langle u,\Delta \rangle } \\&\quad = \sum _{\Gamma ,\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G}}_2(\Gamma ,\gamma ) {\widehat{G}}_2(\Gamma ',\gamma ') \sum _{u\in {\mathbb {F}}_2^n} {\widehat{G}}_1(u,\Gamma ) {\widehat{G}}_1(u,\Gamma ')(-1)^{\langle u,\Delta \rangle } \\&\quad = \sum _{\Gamma ,\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G}}_2(\Gamma ,\gamma ) {\widehat{G}}_2(\Gamma ',\gamma ') {{\,\mathrm{Aut}\,}}_{G_1}(\Delta ,\Gamma ,\Gamma '). \end{aligned}$$

\(\square \)

So, as could be expected, the linear hull theorem has a natural extension to an auto-correlation-linear hull theorem and the approximation in Eq. (5) corresponds to focusing on a single \((\Gamma ,\Gamma ')\) while actually all pairs \((\Gamma ,\Gamma ')\) have to be considered. It remains to see how restricting the input, i.e., partitioning, affects this expression.

6.1 Impact of Partitioning on the Correlation

We again consider a function \(H:{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2^n\), an input difference \(\Delta \), output masks \(\gamma \) and \(\gamma '\), and two non-empty subsets \(M,N\) of \({\mathbb {F}}_2^n\). We are interested in

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_H^{(M,N)}(\Delta ,\gamma ,\gamma ') :=\frac{2^n}{|M||N|} \sum _{\begin{array}{c} x \in {\mathbb {F}}_2^n \\ H(x) \in M , H(x \oplus \Delta ) \in N \end{array}} (-1)^{\langle \gamma , H(x) \rangle \oplus \langle \gamma ', H(x \oplus \Delta )\rangle } .\end{aligned}$$

One would hope that Eq. (4) still holds with minor modifications, essentially by replacing the correlation of H by its restricted version. To capture this, for a function \(F :{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2^n\) and a non-empty set \(S \subseteq {\mathbb {F}}_2^n\) we denote by

$$\begin{aligned} {\widehat{F\vert _S}}(a,b):=\frac{1}{|S|} \sum _{x \in S}(-1)^{\langle b,F(x) \rangle \oplus \langle a,x \rangle } \end{aligned}$$

the correlation when the input is restricted to the set S. Later, it will actually be the output that is restricted, which will be handled by considering the inverse of the function (and swapping input and output masks). Using Lemmas 4 and 5, we can state the main insight of this section as Proposition 2.

Lemma 4

We have

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_H^{(M,N)}(\Delta ,\gamma ,\gamma ') =\sum _{u\in {\mathbb {F}}_2^n} {\widehat{H^{-1}\vert _M}} (\gamma ,u) {\widehat{H^{-1}\vert _N}}(\gamma ',u) (-1)^{\langle u,\Delta \rangle } . \end{aligned}$$

Proof

We start by expanding the right-hand side of the equation, denoted by L, as follows

$$\begin{aligned}&L= \sum _{u\in {\mathbb {F}}_2^n} {\widehat{H^{-1}\vert _M}} (\gamma ,u) {\widehat{H^{-1}\vert _N}}(\gamma ',u) (-1)^{\langle u,\Delta \rangle } \\&\quad = \frac{1}{|M ||N|} \sum _{u\in {\mathbb {F}}_2^n} \left( \sum _{y \in M} (-1)^{\langle u,H^{-1}(y) \rangle \oplus \langle \gamma , y \rangle }\right) \left( \sum _{y' \in N} (-1)^{\langle u,H^{-1}(y') \rangle \oplus \langle \gamma ', y' \rangle }\right) (-1)^{\langle u,\Delta \rangle } \\&\quad = \frac{1}{|M ||N|} \sum _{y \in M , y' \in N} (-1)^{\langle \gamma ,y \rangle \oplus \langle \gamma ',y' \rangle } \sum _{u\in {\mathbb {F}}_2^n} (-1)^{\langle u, H^{-1}(y) \oplus H^{-1}(y') \oplus \Delta \rangle } \\&\quad =\frac{2^n}{|M ||N|} \sum _{\begin{array}{c} y \in M, y' \in N \\ H^{-1}(y')=H^{-1}(y) \oplus \Delta \end{array}} (-1)^{\langle \gamma ,y \rangle \oplus \langle \gamma ',y' \rangle }. \end{aligned}$$

We now define x as \(H^{-1}(y)\) and we get

$$\begin{aligned} L= & {} \frac{2^n}{|M ||N|} \sum _{\begin{array}{c} x \in {\mathbb {F}}_2^n \\ H(x) \in M, H(x \oplus \Delta ) \in N \end{array}} (-1)^{\langle \gamma ,H(x) \rangle \oplus \langle \gamma ',H(x \oplus \Delta ) \rangle } \end{aligned}$$

which is equal to \({{\,\mathrm{Aut}\,}}_H^{(M,N)}(\Delta ,\gamma ,\gamma ')\) by definition. \(\square \)

In order to get the restricted version of the auto-correlation-linear hull, we have to understand the linear hull of a restriction first.

Lemma 5

Let \(H = G_2 \circ G_1 :{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2^n\) and let \(S \subseteq {\mathbb {F}}_2^n\) be a non-empty set. Then

$$\begin{aligned} {\widehat{ H\vert _S}}(\gamma ,\Gamma ) = \sum _{\mu \in {\mathbb {F}}_2^n} {\widehat{G}}_2(\mu ,\Gamma ) {\widehat{{G_1}\vert _S}}(\gamma ,\mu ) \end{aligned}$$

Proof

We have

$$\begin{aligned}&\sum _{\mu \in {\mathbb {F}}_2^n} {\widehat{G}}_2(\mu ,\Gamma ) {\widehat{{G_1}\vert _S}}(\gamma ,\mu ) \\&\quad = \frac{1}{2^n|S|} \sum _{\mu \in {\mathbb {F}}_2^n} \left( \sum _{y \in {\mathbb {F}}_2^n} (-1)^{\langle \Gamma , G_2(y) \rangle \oplus \langle \mu , y \rangle } \right) \left( \sum _{x \in S} (-1)^{\langle \mu , G_1(x) \rangle \oplus \langle \gamma , x \rangle }\right) \\&\quad = \frac{1}{2^n|S|} \sum _{y \in {\mathbb {F}}_2^n,x \in S} (-1)^{\langle \Gamma , G_2(y) \rangle \oplus \langle \gamma , x \rangle } \sum _{\mu \in {\mathbb {F}}_2^n} (-1)^{\langle \mu , y \oplus G_1(x) \rangle } \\&\quad = \frac{1}{|S|}\sum _{x \in S, y=G_1(x)} (-1)^{\langle \Gamma , G_2(y) \rangle \oplus \langle \gamma , x \rangle } \\&\quad = \frac{1}{|S|}\sum _{x \in S} (-1)^{\langle \Gamma , G_2(G_1(x)) \rangle \oplus \langle \gamma , x \rangle } = {\widehat{ H\vert _S}}(\gamma ,\Gamma ). \end{aligned}$$

\(\square \)

Proposition 2

(Auto-Correlation-Linear Hull with Restriction) Let \(G_1,G_2 :{\mathbb {F}}_2^n \rightarrow {\mathbb {F}}_2^n\) and let \(M,N \subseteq {\mathbb {F}}_2^n\) be non-empty sets. For any \(\Delta ,\gamma ,\gamma ' \in {\mathbb {F}}_2^n\) we then have

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_{G_2 \circ G_1}^{(M,N)}(\Delta ,\gamma ,\gamma ') = \sum _{\Gamma ,\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G_2^{-1}\vert _M}}(\gamma , \Gamma ) {\widehat{G_2^{-1}\vert _N}}(\gamma ',\Gamma ') {{\,\mathrm{Aut}\,}}_{G_1}(\Delta ,\Gamma ,\Gamma '). \end{aligned}$$

Proof

Starting with Lemma 4, we express the restricted auto-correlation as

$$\begin{aligned}T={{\,\mathrm{Aut}\,}}_H^{(M,N)}(\Delta ,\gamma ,\gamma ') =\sum _{u\in {\mathbb {F}}_2^n} {\widehat{H^{-1}\vert _M}} (\gamma ,u) {\widehat{H^{-1}\vert _N}}(\gamma ',u) (-1)^{\langle u,\Delta \rangle } , \end{aligned}$$

and substitute the correlations by using Lemma 5 as

$$\begin{aligned} {\widehat{H^{-1}\vert _M}} (\gamma ,u) = \sum _{\Gamma \in {\mathbb {F}}_2^n} {\widehat{G_1^{-1}}}(\Gamma ,u) {\widehat{G_2^{-1}\vert _M}}(\gamma ,\Gamma ) \end{aligned}$$

and

$$\begin{aligned} {\widehat{H^{-1}\vert _N}} (\gamma ',u) = \sum _{\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G_1^{-1}}}(\Gamma ',u) {\widehat{G_2^{-1}\vert _N}}(\gamma ',\Gamma '). \end{aligned}$$

Doing this we get

$$\begin{aligned} T= & {} \sum _{u \in {\mathbb {F}}_2^n} \left( \sum _{\Gamma \in {\mathbb {F}}_2^n} {\widehat{G_1^{-1}}}(\Gamma ,u) {\widehat{G_2^{-1}\vert _M}}(\gamma ,\Gamma ) \right) \left( \sum _{\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G_1^{-1}}}(\Gamma ',u) {\widehat{G_2^{-1}\vert _N}}(\gamma ',\Gamma ') \right) (-1)^{\langle u,\Delta \rangle } \\= & {} \sum _{\Gamma ,\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G_2^{-1}\vert _M}}(\gamma ,\Gamma ) {\widehat{G_2^{-1}\vert _N}}(\gamma ',\Gamma ') \sum _{u \in {\mathbb {F}}_2^n} {\widehat{G_1^{-1}}}(\Gamma ,u) {\widehat{G_1^{-1}}}(\Gamma ',u) (-1)^{\langle u,\Delta \rangle }. \end{aligned}$$

Using the fact that, when considering the correlation of the inverse, input and output masks get swapped, we can rewrite this as

$$\begin{aligned} T = \sum _{\Gamma ,\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G_2^{-1}\vert _M}}(\gamma ,\Gamma ) {\widehat{G_2^{-1}\vert _N}}(\gamma ',\Gamma ') \sum _{u \in {\mathbb {F}}_2^n} {\widehat{G}}_1(u,\Gamma ) {\widehat{G}}_1(u,\Gamma ') (-1)^{\langle u,\Delta \rangle } .\end{aligned}$$

The last part, according to Eq. (4), is nothing else than the auto-correlation of \(G_1\) and thus we conclude

$$\begin{aligned} T= \sum _{\Gamma ,\Gamma ' \in {\mathbb {F}}_2^n} {\widehat{G_2^{-1}\vert _M}}(\gamma ,\Gamma ) {\widehat{G_2^{-1}\vert _N}}(\gamma ',\Gamma ') {{\,\mathrm{Aut}\,}}_{G_1}(\Delta ,\Gamma ,\Gamma ') \end{aligned}$$

as claimed. \(\square \)
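
As a numerical sanity check of Proposition 2, the following sketch verifies the statement on small random permutations; the block size \(n=6\), the permutations, the restriction sets M, N, and all masks are arbitrary toy choices.

```python
import random

n = 6
SIZE = 1 << n
random.seed(2)

# two random permutations G1, G2 on n bits; H = G2 o G1
G1 = list(range(SIZE)); random.shuffle(G1)
G2 = list(range(SIZE)); random.shuffle(G2)
H = [G2[G1[x]] for x in range(SIZE)]
G2_inv = [0] * SIZE
for x in range(SIZE):
    G2_inv[G2[x]] = x

def dot(a, b):
    return bin(a & b).count("1") & 1

def corr_restricted(F, S, a, b):
    # \widehat{F|_S}(a, b): input mask a, output mask b, inputs restricted to S
    return sum((-1) ** (dot(b, F[x]) ^ dot(a, x)) for x in S) / len(S)

def aut(F, delta, g, gp):
    # unrestricted auto-correlation of F
    return sum((-1) ** (dot(g, F[x]) ^ dot(gp, F[x ^ delta])) for x in range(SIZE)) / SIZE

# arbitrary toy restriction sets, input difference and output masks
M = [y for y in range(SIZE) if y & 0x3 == 0x1]
N_out = [y for y in range(SIZE) if y & 0x1 == 0x0]
Mset, Nset = set(M), set(N_out)
delta, gamma, gamma_p = 0x15, 0x0B, 0x2C

# left-hand side: restricted auto-correlation of H = G2 o G1
lhs = (SIZE / (len(M) * len(N_out))) * sum(
    (-1) ** (dot(gamma, H[x]) ^ dot(gamma_p, H[x ^ delta]))
    for x in range(SIZE) if H[x] in Mset and H[x ^ delta] in Nset)

# right-hand side: the auto-correlation-linear hull of Proposition 2
rhs = sum(corr_restricted(G2_inv, M, gamma, G) *
          corr_restricted(G2_inv, N_out, gamma_p, Gp) *
          aut(G1, delta, G, Gp)
          for G in range(SIZE) for Gp in range(SIZE))

assert abs(lhs - rhs) < 1e-6
```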

Relation to Assumptions 1 and 2

Proposition 2 provides a more precise interpretation of Eq. (3). Recall that \(G_1\) corresponds to \(E_2 \circ E_m \circ E_1\) and \(G_2\) to F. The parts \(\varepsilon _i\) and \(\varepsilon _j\) correspond directly to \({\widehat{G_2^{-1}\vert _M}}(\gamma ,\Gamma )\) and \({\widehat{G_2^{-1}\vert _N}}(\gamma ',\Gamma ')\), while the value \(q_{i,j}\) is now replaced by \({{\,\mathrm{Aut}\,}}_{G_1}(\Delta ,\Gamma ,\Gamma ')\). Our main observation is that we can still consider the unrestricted auto-correlations of \(G_1\) in this hull, which means that Assumption 1 is actually not needed, and Assumption 2, about the independence of the parts, can be replaced by the assumption that the hull is dominated by a single trail, that is, a single choice of intermediate masks \(\Gamma ,\Gamma '\).

Moreover, using Proposition 2 allows us to improve upon the correlation by considering multiple intermediate masks, as we demonstrate in the application to Chaskey in Sect. 9.

7 LLR-based Statistical Test

Let us consider our differential-linear attacks using N pairs. Following the notation introduced in Sect. 4.3, we let \((y, {{\tilde{y}}})\) be the \(\ell \)th pair and we assume that y and \({{\tilde{y}}}\) belong to the ith and jth partitions, respectively (we recall that, for ease of notation, we do not make the dependency of \(y, {\tilde{y}}, i, j\) on \(\ell \) explicit). Then, let us consider

$$\begin{aligned} w_\ell = \langle y, \gamma ^{(p_{i})} \rangle \oplus \langle {{\tilde{y}}}, \gamma ^{(p_{j})} \rangle \end{aligned}$$

for the \(\ell \)th pair \((y, {{\tilde{y}}})\), and consider the probability

$$\begin{aligned} \pi _\ell = \mathbf {Pr}_{y, {{\tilde{y}}}}\left[ w_\ell = 0\right] . \end{aligned}$$

Note that \(\pi _\ell = \frac{\rho _{i,j} + 1}{2}\), and for simplicity, let \(C_\ell \) denote this correlation for the \(\ell \)th pair (and the ith and jth partitions are used for this pair), i.e., \(C_\ell = \rho _{i,j} = 2 \pi _\ell - 1\).

Let \(D_0\) and \(D_1\) be the distributions

$$\begin{aligned} D_0 : \left( {\mathcal {B}}(1, \pi _1), \ldots , {\mathcal {B}}(1, \pi _N) \right) , \quad D_1 : \left( {\mathcal {B}}(1, 1/2), \ldots , {\mathcal {B}}(1, 1/2) \right) , \end{aligned}$$

where \({\mathcal {B}}(n,\pi _{\ell })\) denotes the binomial distribution with n trials, each having success probability \(\pi _{\ell }\); the values \(0 \le \pi _{\ell } \le 1\) are not necessarily distinct.

Let \(q_0\) and \(q_1\) be the probabilities that \(w :=\left( w_1, \dotsc , w_N\right) \) is the result of sampling from \(D_0\) (i.e., from the real distribution) or \(D_1\) (i.e., the random distribution), respectively. Thus

$$\begin{aligned} q_0&= \mathbf {Pr}\left[ w = X \mid X \sim D_0\right] = \prod _{\ell =1}^{N} \pi _{\ell }^{w_{\ell }} (1 - \pi _{\ell })^{1-w_{\ell }}, \\ q_1&= \mathbf {Pr}\left[ w = X \mid X \sim D_1\right] = \prod _{\ell =1}^{N} 2^{-1} = 2^{-N}. \end{aligned}$$

The LLR statistic is defined as \(\ln (q_0/q_1)\). The likelihood of \(D_0\) is larger than that of \(D_1\) when \(\ln (q_0/q_1)>0\). Then the LLR statistic can be rewritten as

$$\begin{aligned} \ln \left( \frac{q_0}{q_1}\right)&= \ln \left( 2^{N} \times \prod _{\ell =1}^{N} \pi _{\ell }^{w_{\ell }} (1 - \pi _{\ell })^{1-w_{\ell }} \right) \\&= N \ln (2) + \sum _{\ell =1}^{N} w_{\ell } \ln (\pi _{\ell }) + \sum _{\ell =1}^{N} (1-w_{\ell }) \ln (1-\pi _{\ell }) \\&= N \ln (2) + \sum _{\ell =1}^{N} \ln (1-\pi _{\ell }) + \sum _{\ell =1}^{N} \ln \left( \frac{\pi _{\ell }}{1- \pi _{\ell }}\right) w_{\ell }. \end{aligned}$$

Note that the first term is constant, but the second and third terms depend on the value of the guessed key bits, which affects the partition of the pairs. With a slight abuse of notation, we treat \(q_i\) as a random variable.

Note that, when we use \(C_\ell \) instead of \(\pi _\ell \), the LLR statistic is rewritten as

$$\begin{aligned} \mathrm {LLR}&= N \ln (2) + \sum _{\ell =1}^N \ln {\left( 1 - \pi _{\ell }\right) } + \sum _{\ell =1}^N \ln \left( \frac{\pi _{\ell }}{1 - \pi _{\ell }}\right) w_{\ell } \\&= N \ln (2) + \sum _{\ell = 1}^N \ln \left( \frac{1 - C_{\ell }}{2}\right) + \sum _{\ell = 1}^N\ln \left( \frac{1 + C_{\ell }}{1-C_{\ell }}\right) \frac{1 - {(-1)}^{w_{\ell }}}{2}\\&= \sum _{\ell =1}^N\underbrace{\ln \left( \sqrt{1 - C_{\ell }}\sqrt{1+{C_{\ell }}}\right) }_{0.5\ln {\left( 1-C_{\ell }^2\right) }} + \frac{1}{2}\sum _{\ell =1}^N \ln {\left( \frac{1 - C_{\ell }}{1+C_{\ell }}\right) }{(-1)}^{w_{\ell }}. \end{aligned}$$
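
The two forms of the LLR statistic above are algebraically identical; the following small sketch checks this numerically with random toy values for \(C_\ell \) and \(w_\ell \) (no cryptographic meaning, names are ours).

```python
import math
import random

random.seed(3)
N = 1000
C = [random.uniform(-0.2, 0.2) for _ in range(N)]       # toy per-pair correlations
pi = [(1 + c) / 2 for c in C]                            # pi_l = (1 + C_l) / 2
w = [random.randint(0, 1) for _ in range(N)]             # toy observations w_l

# LLR in terms of pi_l (first form above)
llr_pi = (N * math.log(2)
          + sum(math.log(1 - p) for p in pi)
          + sum(math.log(p / (1 - p)) * wl for p, wl in zip(pi, w)))

# LLR in terms of C_l (last line of the derivation above)
llr_c = (0.5 * sum(math.log(1 - c * c) for c in C)
         + 0.5 * sum(math.log((1 - c) / (1 + c)) * (-1) ** wl for c, wl in zip(C, w)))

assert abs(llr_pi - llr_c) < 1e-9
```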

We can now determine the means and variances of the LLR statistic under \(D_0\) and \(D_1\) in terms of the correlations \(C_\ell \).

Proposition 3

Let \({\mathcal {W}}\) and \({\mathcal {R}}\) be the LLR statistics when w is chosen from \(D_1\) and \(D_0\), respectively. Then, the means \({\mathbb {E}}[{\mathcal {W}}]\) and \({\mathbb {E}}[{\mathcal {R}}]\) are estimated as

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {W}} \right] \approx - \frac{1}{2} \sum _{\ell =1}^{N} C_{\ell }^{2},&{\mathbb {E}}\left[ {\mathcal {R}} \right] \approx \frac{1}{2} \sum _{\ell =1}^{N} C_{\ell }^{2}. \end{aligned}$$

Proof

Assuming that w is chosen from \(D_1\), the average value of each \(w_{\ell }\) is 1/2.

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {W}}\right]&={\mathbb {E}}\left[ \ln \left( \frac{q_0}{q_1}\right) \right] = N \ln (2) + \sum _{\ell =1}^{N} \ln (1-\pi _{\ell }) + \sum _{\ell =1}^{N} 2^{-1} \ln \left( \frac{\pi _{\ell }}{1- \pi _{\ell }}\right) \\&= N \ln (2) + \sum _{\ell =1}^{N} \ln (1-\pi _{\ell }) + \sum _{\ell =1}^{N} 2^{-1} \ln (\pi _{\ell }) - \sum _{\ell =1}^{N} 2^{-1} \ln (1-\pi _{\ell }) \\&= N \ln (2) + \sum _{\ell =1}^{N} 2^{-1} \ln (\pi _{\ell }(1-\pi _{\ell })). \end{aligned}$$

Recall that \(C_{\ell }=2\pi _{\ell }-1\), so that \(\pi _{\ell }(1-\pi _{\ell })=\frac{1+C_{\ell }}{2} \cdot \frac{1-C_{\ell }}{2}=\frac{1-C_{\ell }^2}{4}\). Therefore,

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {W}}\right]&= N \ln (2) + \sum _{\ell =1}^{N} 2^{-1} \ln \left( \frac{1-C_{\ell }^2}{4}\right) \\&= N \ln (2) + \sum _{\ell =1}^{N} 2^{-1} \ln (1-C_{\ell }^2) - \sum _{\ell =1}^{N} \ln (2) \\&= \frac{1}{2} \sum _{\ell =1}^{N} \ln (1-C_{\ell }^2). \end{aligned}$$

Using the Taylor series of \(\ln (1-C_{\ell }^2)\), we can approximate this expression with \(-C_{\ell }^2\) when \(C_{\ell }^2\) is close to 0. Therefore

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {W}} \right]&\approx - \frac{1}{2} \sum _{\ell =1}^{N} C_{\ell }^{2}. \end{aligned}$$

Next, assuming that \(w_{\ell }\) is chosen from \(D_0\), the average value of \(w_{\ell }\) is \(1/2 + C_{\ell }/2\). Therefore

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {R}}\right]&= - \frac{1}{2} \sum _{\ell =1}^{N} C_{\ell }^{2} + \sum _{\ell =1}^{N} \frac{C_{\ell }}{2} \ln \left( \frac{\pi _{\ell }}{1-\pi _{\ell }}\right) . \end{aligned}$$

Since \(\pi _{\ell }/(1-\pi _{\ell })=\frac{1+C_{\ell }}{1-C_{\ell }}\), we can rewrite the second term as

$$\begin{aligned} \sum _{\ell =1}^{N} \frac{C_{\ell }}{2} \ln \left( \frac{\pi _{\ell }}{1-\pi _{\ell }}\right)&= \sum _{\ell =1}^{N} \frac{C_{\ell }}{2} \ln \left( \frac{1+C_{\ell }}{1-C_{\ell }}\right) \\&= \sum _{\ell =1}^{N} \frac{C_{\ell }}{2} \left( \ln (1+C_{\ell }) - \ln (1-C_{\ell }) \right) \approx \sum _{\ell =1}^{N} C_{\ell }^2, \end{aligned}$$

where we have used again a Taylor approximation of \(\ln (1 + z)\) for the last step. We conclude that

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {R}}\right]&\approx \frac{1}{2} \sum _{\ell =1}^{N} C_{\ell }^{2}. \end{aligned}$$

\(\square \)

Proposition 4

Let \({\mathcal {W}}\) and \({\mathcal {R}}\) be the LLR statistics when w is chosen from \(D_1\) and \(D_0\), respectively. Then, the variances \(\mathrm {Var}[{\mathcal {W}}]\) and \(\mathrm {Var}[{\mathcal {R}}]\) are estimated as

$$\begin{aligned} \mathrm {Var}\left[ {\mathcal {W}}\right] \approx \mathrm {Var}\left[ {\mathcal {R}}\right] \approx \sum _{\ell =1}^{N} C_{\ell }^{2}. \end{aligned}$$

Proof

In order to compute the variance of \( \ln \left( \frac{q_0}{q_1}\right) \), for simplicity we treat the term \(\pi _{\ell }\) as a constant. We have experimentally verified that this is reasonable. With this in mind, we can write

$$\begin{aligned} \mathrm {Var}\left[ \ln \left( \frac{q_0}{q_1}\right) \right] \approx \mathrm {Var}\left[ \sum _{\ell =1}^{N} w_{\ell } \ln \left( \frac{\pi _{\ell }}{1-\pi _{\ell }}\right) \right] . \end{aligned}$$

Assuming that w is chosen from \(D_1\), we know that the variance of \(w_{\ell }\) is 1/4. As before, since \(\pi _{\ell }/(1-\pi _{\ell })=\frac{1+C_{\ell }}{1-C_{\ell }}\), we obtain

$$\begin{aligned} \sum _{\ell =1}^{N} \frac{1}{4} \left[ \ln \left( \frac{1+C_{\ell }}{1-C_{\ell }}\right) \right] ^2 = \sum _{\ell =1}^{N} \frac{1}{4} \left( \ln (1+C_{\ell }) - \ln (1-C_{\ell }) \right) ^2 \end{aligned}$$

and using Taylor approximation:

$$\begin{aligned} \mathrm {Var}\left[ {\mathcal {W}}\right] \approx \sum _{\ell =1}^{N} \frac{1}{4} \left( \ln (1+C_{\ell }) - \ln (1-C_{\ell }) \right) ^2 \approx \sum _{\ell =1}^N C_{\ell }^2 \end{aligned}$$

Similarly, assuming that w is chosen from \(D_0\), the variance of \(w_{\ell }\) is \(1/4 - C_{\ell }^2/4\). Therefore, we want to compute

$$\begin{aligned} \mathrm {Var}\left[ {\mathcal {R}}\right] \approx \sum _{\ell =1}^{N} \frac{1}{4}\left( 1-C_{\ell }^2\right) \left[ \ln \left( \frac{1+C_{\ell }}{1-C_{\ell }}\right) \right] ^2 \end{aligned}$$

which is approximated with Taylor to

$$\begin{aligned} \sum _{\ell =1}^{N} \frac{1}{4}\left( 1-C_{\ell }^2\right) \left[ \ln \left( \frac{1+C_{\ell }}{1-C_{\ell }}\right) \right] ^2 \approx \sum _{\ell =1}^N C_{\ell }^2 - C_{\ell }^4. \end{aligned}$$

Therefore, the variance in both cases is approximately

$$\begin{aligned} \sum _{\ell =1}^{N} C_{\ell }^{2}. \end{aligned}$$

\(\square \)
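
The approximations of Propositions 3 and 4 can also be checked by simulation. The sketch below samples the LLR statistic under \(D_0\) and \(D_1\) for toy correlations and compares the empirical means and variances with \(\pm \frac{1}{2}\sum C_\ell ^2\) and \(\sum C_\ell ^2\); all parameters are illustrative and unrelated to any cipher.

```python
import math
import random

random.seed(4)
N, TRIALS = 512, 2000
C = [random.uniform(-0.15, 0.15) for _ in range(N)]      # toy correlations
pi = [(1 + c) / 2 for c in C]
coef = [math.log(p / (1 - p)) for p in pi]
const = N * math.log(2) + sum(math.log(1 - p) for p in pi)

def llr(ws):
    return const + sum(a * w for a, w in zip(coef, ws))

def sample(ps):
    return [1 if random.random() < p else 0 for p in ps]

real = [llr(sample(pi)) for _ in range(TRIALS)]           # samples from D_0
rand = [llr(sample([0.5] * N)) for _ in range(TRIALS)]    # samples from D_1

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

s = sum(c * c for c in C)
print("E[R]:", mean(real), "expected ~", s / 2)
print("E[W]:", mean(rand), "expected ~", -s / 2)
print("Var :", var(real), var(rand), "expected ~", s)
```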

7.1 Distinguishing Between the Distributions

Our experiments indicate that the LLR statistic is normally distributed both in the random and in the real case. While this could potentially be treated theoretically, using, e.g., some variant of the central limit theorem, we prefer to back this up by experiments in the applications. For now let \({\mathcal {N}}(\mu _0, \sigma _0^2)\) and \({\mathcal {N}}(\mu _1, \sigma _1^2)\) be the (assumed) normal distributions for the LLR statistics when the correct and wrong keys are guessed, respectively. The previous computation has shown that

$$\begin{aligned} \mu _0 = -\mu _1 = \frac{1}{2} \sum _{\ell =1}^N C_{\ell }^2 = \frac{N}{2} {\overline{C}},&\sigma _0^2 = \sigma _1^2 = \sum _{\ell =1}^N C_{\ell }^2 = N {\overline{C}}, \end{aligned}$$

where by \({\overline{C}}\) we denote the average of the squared correlation, i.e., \({\overline{C}} :=\frac{1}{N} \sum _{\ell =1}^N C_{\ell }^2\).

In order to distinguish between the two distributions, we are interested in the ratio of \(\mu _0 - \mu _1\) to \(\sigma _0\,(=\sigma _1)\):

$$\begin{aligned} \frac{\mu _0 - \mu _1}{\sigma _1} = \frac{N {\overline{C}}}{\sqrt{N {\overline{C}}}} = \sqrt{N {\overline{C}}} = \sqrt{\sum _{\ell =1}^N C_{\ell }^2}. \end{aligned}$$
(6)

Therefore, the larger \(\sum _{\ell =1}^N C_{\ell }^2\) is, the bigger the gap. This implies that, in order to maximize this gap, no data should be discarded; that is, there should be no partition with correlation zero. This is different from what happened in [1], where the same gap can approximately be represented as \( \frac{\sum _{\ell =1}^N |C_{\ell }|}{\sqrt{N}} = \sqrt{N {\overline{c}}^2}\), where \({\overline{c}} = \frac{1}{N}\sum _{\ell =1}^N |C_{\ell }|\). In other words, the squared gap is proportional to the square of the average absolute correlation (\({\overline{c}}\)) in [1], while for the LLR statistic it is proportional to the average of the squared correlations (\({\overline{C}}\)). By Jensen's inequality, the latter is always at least as large as the former, as expected when using the LLR.
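
As a small numerical illustration of this comparison (toy correlations, no cryptographic meaning):

```python
import math

# toy per-pair absolute correlations (four partitions, equally likely)
C = [2 ** -5.9, 2 ** -6.7, 2 ** -8.6, 2 ** -9.0]
N = 2 ** 20

C_bar = sum(c * c for c in C) / len(C)     # average of the squared correlations
c_bar = sum(abs(c) for c in C) / len(C)    # average of the absolute correlations

gap_llr = math.sqrt(N * C_bar)             # Eq. (6), LLR statistic
gap_cnt = math.sqrt(N * c_bar ** 2)        # counting-style statistic of [1]
print(gap_llr, gap_cnt)                    # gap_llr >= gap_cnt since C_bar >= c_bar^2
```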

8 WHT-based Key Recovery Technique

We use the LLR statistic to recover the secret key. Recall that the LLR statistic is calculated as

$$\begin{aligned} \mathrm {LLR} = \frac{1}{2} \sum _{\ell =1}^N \ln {\left( 1-C_{\ell }^2\right) } + \frac{1}{2}\sum _{\ell =1}^N \ln {\left( \frac{1 - C_{\ell }}{1+C_{\ell }}\right) }{(-1)}^{w_{\ell }}, \end{aligned}$$

where \(w_\ell = \langle y, \gamma ^{(y_{\mathcal {P}})} \rangle \oplus \langle {{\tilde{y}}}, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle \) and \((y,{{\tilde{y}}})\) is the \(\ell \)th of the N pairs. Only the ciphertext pair \((c, {{\tilde{c}}})\) can be observed by the attacker. Which partition the pair belongs to is determined by \(y_{\mathcal {P}}\) and \({{\tilde{y}}}_{\mathcal {P}}\). Therefore, the key part denoted by \(k_{\mathcal {P}}\) is guessed to identify the partition, and \(y_{\mathcal {P}}= c_{\mathcal {P}}\oplus k_{\mathcal {P}}\). After guessing \(k_{\mathcal {P}}\), we get the following 1-bit representation:

$$\begin{aligned} w_\ell&= \langle y, \gamma ^{(y_{\mathcal {P}})} \rangle \oplus \langle {{\tilde{y}}}, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle \\&= \langle c, \gamma ^{(y_{\mathcal {P}})} \rangle \oplus \langle {{\tilde{c}}}, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle \oplus \langle k, \gamma ^{(y_{\mathcal {P}})} \oplus \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle . \end{aligned}$$

We need not only \(k_{\mathcal {P}}\) but also \(\langle k, \gamma ^{(y_{\mathcal {P}})} \oplus \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle \) to compute \(w_\ell \). Let \(k_{\mathcal {R}}\) denote the involved key bits; their number is \(\dim W\), where the linear subspace W is defined by \(W :=\text {Span}\{ \gamma ^{(p_i)} \oplus \gamma ^{(p_j)} \mid i,j \in \{1,\ldots ,s\} \}\). The trivial procedure would require guessing \(k_{\mathcal {P}}\) and \(k_{\mathcal {R}}\) for every pair, for a time complexity of \(2N \times 2^{n_{\mathcal {P}}+ \dim W}\), where \(n_{\mathcal {P}}\) is the bit length of \(k_{\mathcal {P}}\).

In this section, we introduce a more advanced procedure, where the fast Walsh–Hadamard transform (FWHT) is applied instead of guessing \(k_{\mathcal {R}}\) for every pair. As a result, the time complexity is reduced to \(2^{n_{\mathcal {P}}} (2 N + \dim W \cdot 2^{\dim W})\).

8.1 Using the FWHT for Key Recovery

The first step in the key recovery procedure is guessing \(k_{\mathcal {P}}\) to identify partitions. Once we have guessed these key bits, the first term of the LLR statistic, i.e., \(\alpha :=\frac{1}{2} \sum _{\ell =1}^N \ln {\left( 1-C_{\ell }^2\right) }\), is constant and independent of \(w_\ell \). Thus, in order to determine the LLR, we compute \(\mathrm {LLR}' = \mathrm {LLR} - \alpha \) as

$$\begin{aligned} \mathrm {LLR}'&= \frac{1}{2}\sum _{\ell =1}^N \ln {\left( \frac{1 - C_{\ell }}{1+C_{\ell }}\right) }{(-1)}^{w_{\ell }} \\&= \frac{1}{2}\sum _{\ell =1}^N \ln {\left( \frac{1 - C_{\ell }}{1+C_{\ell }}\right) }{(-1)}^{\langle c, \gamma ^{(y_{\mathcal {P}})} \rangle \oplus \langle {{\tilde{c}}}, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle } \times (-1)^{\langle k, \gamma ^{(y_{\mathcal {P}})} \oplus \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle }. \end{aligned}$$

Then, by using an array \(\beta \), whose element \(\beta (\gamma )\) is defined as

$$\begin{aligned} \beta (\gamma ) :=\frac{1}{2} \sum _{\ell = 1: \gamma = \gamma ^{(y_{\mathcal {P}})} \oplus \gamma ^{({{\tilde{y}}}_{\mathcal {P}})}}^{N} \ln {\left( \frac{1 - C_{\ell }}{1+C_{\ell }}\right) }{(-1)}^{\langle c, \gamma ^{(y_{\mathcal {P}})} \rangle \oplus \langle {{\tilde{c}}}, \gamma ^{({{\tilde{y}}}_{\mathcal {P}})} \rangle }, \end{aligned}$$

\(\mathrm {LLR}'\) is computed as

$$\begin{aligned} \mathrm {LLR}'&= \sum _{\gamma \in W} \beta (\gamma ) \times (-1)^{\langle k, \gamma \rangle }. \end{aligned}$$

Given a real-valued function \(f :{\mathbb {F}}_{2}^n \rightarrow {\mathbb {R}}\), the Walsh–Hadamard transform evaluates the function

$$\begin{aligned} {\widehat{f}} : {\mathbb {F}}_{2}^n \rightarrow {\mathbb {R}}, \quad x\mapsto \sum _{y \in {\mathbb {F}}_{2}^n} f(y) \times (-1)^{\langle x, y \rangle }. \end{aligned}$$

A naive computation needs \({\mathcal {O}}(2^{2n})\) steps (additions and evaluations of f), i.e., for each \(x \in {\mathbb {F}}_{2}^n\), we compute \((-1)^{\langle x, y \rangle }f(y)\) for each \(y \in {\mathbb {F}}_2^n\). The Fast Walsh–Hadamard transform is a well-known recursive divide-and-conquer algorithm that evaluates the Walsh–Hadamard transform in \({\mathcal {O}}(n 2^n)\) steps. We refer to, e.g., [38, Section 2.3] for the details.
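
A minimal in-place implementation of the FWHT is sketched below (unnormalized, i.e., it computes exactly the sum above); it is the standard butterfly recursion and runs in \({\mathcal {O}}(n 2^n)\) additions.

```python
def fwht(table):
    """In-place unnormalized Walsh-Hadamard transform of a list of length 2^n.

    After the call, table[x] = sum_y f(y) * (-1)^{<x, y>},
    where f(y) is the original content of table[y].
    """
    h = 1
    while h < len(table):
        for start in range(0, len(table), 2 * h):
            for j in range(start, start + h):
                a, b = table[j], table[j + h]
                table[j], table[j + h] = a + b, a - b
        h *= 2
    return table

# small sanity check against the naive O(2^{2n}) evaluation
if __name__ == "__main__":
    import random
    random.seed(5)
    n = 6
    f = [random.uniform(-1, 1) for _ in range(1 << n)]
    naive = [sum(f[y] * (-1) ** bin(x & y).count("1") for y in range(1 << n))
             for x in range(1 << n)]
    fast = fwht(f[:])
    assert all(abs(a - b) < 1e-9 for a, b in zip(naive, fast))
```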

Algorithm 2 Key recovery using the FWHT

Algorithm 2 shows the attack procedure using the FWHT. We first collect N ciphertext pairs, which requires 2N queries to E (the data complexity). We next guess \(k_{\mathcal {P}}\) and prepare a real number \(\alpha \) and an array of real numbers \(\beta \) to compute the LLR statistic. For every stored ciphertext pair, we identify the partitions, obtain the corresponding correlation \(\rho _{i,j}\) and linear mask \(\gamma \), and update \(\alpha \) and \(\beta \) accordingly. We finally apply the FWHT to \(\beta \), and the LLR statistics are computed as \(\alpha + {\hat{\beta }}(k_{\mathcal {L}})\). The overall running time can be estimated as \(2^{n_{{\mathcal {P}}}}(2N+\dim W\cdot 2^{\dim W})\).
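
Since the float for Algorithm 2 is not reproduced here, the following heavily simplified sketch illustrates the core of the procedure for a single guess of \(k_{\mathcal {P}}\). It is not a transcription of Algorithm 2; the helpers `partition_index` and `mask_and_corr` stand for cipher-specific components and are hypothetical, and `fwht` refers to the routine sketched above.

```python
import math

def parity(x):
    # <a, b> over F_2 for integers: parity of the popcount
    return bin(x).count("1") & 1

def llr_prime_for_one_kp(pairs, dim_w, partition_index, mask_and_corr, fwht):
    """Core of the FWHT-based key recovery for one guess of k_P (sketch).

    pairs           : list of ciphertext pairs (c, c_tilde) given as integers
    partition_index : c -> partition index under the current k_P guess (hypothetical helper)
    mask_and_corr   : (i, j) -> (gamma_i, gamma_j, w_index, rho_ij), where w_index
                      in [0, 2^dim_w) encodes gamma_i ^ gamma_j with respect to a
                      fixed basis of the subspace W (hypothetical helper)
    fwht            : in-place unnormalized Walsh-Hadamard transform
    Returns (alpha, beta_hat); the statistic for a guess k_R is alpha + beta_hat[k_R].
    """
    alpha = 0.0
    beta = [0.0] * (1 << dim_w)
    for c, ct in pairs:
        i, j = partition_index(c), partition_index(ct)
        gamma_i, gamma_j, w_index, rho = mask_and_corr(i, j)
        if rho == 0.0:                 # partition combination without usable correlation
            continue
        alpha += 0.5 * math.log(1.0 - rho * rho)
        sign = (-1) ** (parity(c & gamma_i) ^ parity(ct & gamma_j))
        beta[w_index] += 0.5 * math.log((1.0 - rho) / (1.0 + rho)) * sign
    beta_hat = fwht(beta)              # one FWHT replaces 2^{dim W} separate guesses of k_R
    return alpha, beta_hat
```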

8.2 Success Probability of Algorithm 2

In the following, we assume that the distributions involved can be well estimated by normal approximations. This significantly simplifies the analysis. Note that we opted for a rather simple statistical model ignoring, in particular, the effect of the wrong key distribution by assuming the simple randomization hypothesis and ignoring the way we sample our plaintexts (i.e., known vs. chosen vs. distinct plaintext). Those effects might have a major impact on the performance of attacks when the data complexity is close to the full codebook and the success probability and the gain are limited. However, none of this is the case for our parameters. In our concrete applications, we have verified the behavior experimentally wherever possible.

For the statistical model for the right key, this implies that the counter can be expected to approximately follow a normal distribution with parameters

$$\begin{aligned} {\mathcal {C}}(k_{{\mathcal {P}}},k_{{\mathcal {L}}}) \sim {\mathcal {N}} \left( \frac{N}{2} {\overline{C}}, N {\overline{C}} \right) , \end{aligned}$$

where \({\overline{C}} = \frac{1}{N} \sum _{\ell = 1}^{N} C_\ell ^2\). The wrong key counters (under the simple randomization hypothesis) are approximately normally distributed with parameters

$$\begin{aligned} {\mathcal {C}}(k'_{{\mathcal {P}}},k'_{{\mathcal {L}}}) \sim {\mathcal {N}} \left( - \frac{N}{2} {\overline{C}}, N {\overline{C}} \right) . \end{aligned}$$

With this, we can deduce the following proposition.

Proposition 5

After running Algorithm 2 \(p^{-1}\) times, the probability that the correct key is among the key candidates (i.e., the key guesses whose statistic exceeds the threshold \(\Theta \)) is

$$\begin{aligned} p_{\mathrm{success}} \ge \frac{1}{2} \mathbf {Pr}\left[ {\mathcal {C}}(k_{{\mathcal {P}}},k_{{\mathcal {L}}}) \ge \Theta \right] = \frac{1}{2} \left( 1 - \Phi \left( \frac{\Theta - \frac{N}{2} {\overline{C}}}{\sqrt{N {\overline{C}}}}\right) \right) .\end{aligned}$$

The expected number of wrong keys is \(\frac{2^{n_{\mathcal {P}}+ \dim W}}{p} \times \left( 1 - \Phi \left( \frac{ \Theta + \frac{N}{2} {\overline{C}}}{\sqrt{N {\overline{C}}}} \right) \right) \).
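
For reference, the two quantities in Proposition 5 can be evaluated as in the following sketch, where \(\Phi \) is computed via the complementary error function; the function and parameter names are ours.

```python
import math

def normal_upper_tail(z):
    # Pr[ N(0,1) >= z ], via erfc to avoid cancellation for large z
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def success_probability_lower_bound(theta, n_cbar):
    # Proposition 5: p_success >= 1/2 * Pr[ C(k_P, k_L) >= Theta ], with n_cbar = N * C_bar
    return 0.5 * normal_upper_tail((theta - n_cbar / 2.0) / math.sqrt(n_cbar))

def expected_wrong_keys(theta, n_cbar, n_p, dim_w, p):
    # expected number of wrong-key candidates whose statistic exceeds Theta
    return 2.0 ** (n_p + dim_w) / p * normal_upper_tail((theta + n_cbar / 2.0) / math.sqrt(n_cbar))
```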

9 Application to Chaskey

Fig. 7 The round function of Chaskey

Chaskey [10] is a lightweight MAC algorithm whose underlying primitive is an ARX-based permutation in an Even-Mansour construction, i.e., Chaskey-EM. The permutation operates on four 32-bit words, i.e., the block size is 128 bits. In the version proposed in [10], the permutation employs 8 rounds of the form depicted in Fig. 7. In [40], the author proposed a version with an increased number of rounds (from 8 to 12), and this version is currently standardized in ISO/IEC 29192-6:2019. The designers claim security up to \(2^{80}\) computations as long as the data are limited to \(2^{48}\) blocks.

Let \((v_0^r, v_1^r, v_2^r, v_3^r)\) be the input of the rth round function, and let \((w_0^r, w_1^r, w_2^r, w_3^r)\) denote the state after applying the half round to \((v_0^r, v_1^r, v_2^r, v_3^r)\). Refer to Fig. 7 for the indices of the branches.
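
For reference, a Python sketch of one round of the Chaskey permutation follows, based on the specification in [10]; since Fig. 7 is not reproduced here, the rotation amounts and the exact placement of the half-round boundary should be cross-checked against it.

```python
MASK = 0xFFFFFFFF

def rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK

def half_round(v0, v1, v2, v3):
    # first half of the round; the result corresponds to (w_0^r, ..., w_3^r)
    v0 = (v0 + v1) & MASK; v1 = rotl(v1, 5) ^ v0; v0 = rotl(v0, 16)
    v2 = (v2 + v3) & MASK; v3 = rotl(v3, 8) ^ v2
    return v0, v1, v2, v3

def second_half(v0, v1, v2, v3):
    # second half of the round; the result corresponds to (v_0^{r+1}, ..., v_3^{r+1})
    v0 = (v0 + v3) & MASK; v3 = rotl(v3, 13) ^ v0
    v2 = (v2 + v1) & MASK; v1 = rotl(v1, 7) ^ v2; v2 = rotl(v2, 16)
    return v0, v1, v2, v3

def chaskey_round(state):
    return second_half(*half_round(*state))
```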

9.1 Overview of Our Attack

We first give a high-level overview of our attack. Similar to the previous differential-linear attack from [19], we divide the cipher into three sub-ciphers. For the 7-round attack, we use \(E_1\) covering 1.5 rounds, \(E_m\) covering 4 rounds, and \(E_2\) covering 0.5 rounds, while the key-recovery function F covers 1 round. We also present a 7.5-round attack, where \(E_2\) covers 1 round instead of 0.5 rounds.

The differential characteristic and the linear trail are applied to \(E_1\) and \(E_2\), respectively, while the experimental differential-linear distinguisher is applied to the middle part \(E_m\). Note that, since the differential-linear distinguisher over \(E_m\) is constructed experimentally, its absolute correlation must be high enough to be detectable by using a relatively small sampling space. Moreover, since it is practically infeasible to check all input differences and all output linear masks, we restricted ourselves to the case of an input difference of Hamming weight 1 and linear masks of the form [i] or \([i,i+1]\), i.e., 1-bit or consecutive 2-bit linear masks. As a result, when there is a non-zero difference only in the 31st bit (msb) of \(w^1_0\), i.e., \(\Delta _m = (([]),([]),([31]),([]))\), we observed the following two differential-linear distinguishers with correlations \(2^{-5.1}\):

$$\begin{aligned}&{{\,\mathrm{Aut}\,}}_{E_m}(\Delta _m, ([],[],[20],[]), ([],[],[20],[])) \approx 2^{-5.1}, \end{aligned}$$
(7)
$$\begin{aligned}&{{\,\mathrm{Aut}\,}}_{E_m}(\Delta _m, ([],[],[20,19],[]), ([],[],[20,19],[])) \approx 2^{-5.1}. \end{aligned}$$
(8)

These correlations are estimated using a set of \(2^{26}\) random samples of \(w^1\). This is significant enough, since the corresponding counter has standard deviation \(2^{13}\) under a normal approximation, while the observed bias corresponds to roughly \(2^{26} \cdot 2^{-5.1} \approx 2^{20.9}\). Note that we do not focus on the theoretical justification of this 4-round experimental differential-linear distinguisher in this paper; the analysis of \(E_1\) and \(E_2\) starts in the following subsection.

9.2 Using Conditional Differentials

Table 2 Probability that adding one basis element affects the output difference

Before we discuss the improved basis we have found using the conditional differential technique, we first recall the basis of \({\mathcal {U}}\) introduced in [1] (see the first two row blocks in Table 2). Here, the probability threshold is relaxed from 0.95 to 0.93, and \(v_3[30]\) (in red) is newly added to the basis. The conditional differential technique provides us with four further basis elements with probability 1, which cannot be found by Algorithm 1 [1] (see the third row block in Table 2). A linear subspace \({\mathcal {U}}\) of dimension \(18+4=22\), formed by elements that do not affect the output difference with probability 1, is finally used to attack 7-round Chaskey. In addition, all \(18+8+4=30\) basis elements are used to attack 7.5-round Chaskey.

In “Appendix C”, we provide the details on how to obtain these relations. We use the conditional differential framework and Fig. 18 to recover the value of some key bits for free and to find additional bits of information for sampling, increasing the dimension of \({\mathcal {U}}\) from 18 as given in [1] (involving exclusively one-bit relations) to 22, or to 23 if a one-bit relation on the key is known. The proposed set of free bits (i.e., relations with probability 1) is optimal; no further such relations exist.

9.2.1 Keybits That Are Obtained for Free

If we find a set of inputs that verifies the differential path, we can directly deduce the following linear relations on the key bits, due to the conditions under which differences are absorbed during the first modular additions (or, the other way round, for each guess of these values we can build sets of inputs that verify the 6 related conditions): \(k_1[8] \oplus k_0[8]\), \(k_1[21] \oplus k_0[21]\), \(k_1[30] \oplus k_0[30]\) and \( k_2[26]\oplus k_3[26]\), \( k_2[21]\oplus k_3[21]\), \( k_2[26]\oplus k_3[27]\). Note that these relations can only be exploited after mounting the concrete attacks shown in Sects. 9.3 and 9.4; thus, they do not contribute to accelerating our key-recovery attacks.

9.2.2 Additional Space for Sampling

Compared with the linear subspace \({\mathcal {U}}\) shown in [1], the dimension of \({\mathcal {U}}\) increases by 4 when the vectors listed in the third row block of Table 2 are added to the basis. In order to find these relations, we have used the rules presented in Sect. 5.3; more detailed explanations, based on Fig. 18, can be found in “Appendix C” for the interested reader. We summarize this in the following lemma:

Lemma 6

There is a set \(\mathcal {X'} \subseteq {\mathbb {F}}_2^{128}\) of size \(2^{128-17}\) and a 22-dimensional linear subspace \({\mathcal {U}}\), such that for any element \(x \in \mathcal {X'}\) and any \(u \in {\mathcal {U}}\) it holds that \(E_1(x \oplus u) \oplus E_1(x \oplus u \oplus \Delta _{\mathrm {in}}) = \Delta _m\), where \(E_1\) denotes 1.5 rounds of Chaskey.

Our improved 7-round attack uses this linear subspace.

One additional probability-one relation can be obtained by flipping the bit \(v_2[27]\) and at the same time replacing \(v_2[29]\) by \(v_2[29]\oplus v_2[28] \oplus v_3[28]\). The issue is that this relation depends on the key bit \(k_2[28]\oplus k_3[28]\) (guessing this bit of information would, for instance, give an extra sampling bit), so it is not used in the attack.

In addition to the probability-one relations, we can consider a larger linear subspace by adding relations with very high probabilities.

Lemma 7

There is a set \({\mathcal {X}}'\subseteq {\mathbb {F}}_2^{128}\) whose size is about \(2^{128-17.28}\) and a 30-dimensional linear subspace \({\mathcal {U}}\), such that for any element \(x \in {\mathcal {X}}'\) and any \(u \in {\mathcal {U}}\) it holds that \(E_1(x \oplus u) \oplus E_1(x \oplus u \oplus \Delta _{\mathrm {in}}) = \Delta _m\), where \(E_1\) denotes 1.5 rounds of Chaskey.

We can build a 30-dimensional linear subspace such that all its elements verify simultaneously the differential with probability \(2^{-17.28}\). For this, we consider the 22-dimensional linear subspace of Lemma 6 and add to its basis the 7 vectors from [1] and \(v_3[30]\). Our 7.5-round attack uses this linear subspace.

9.3 The 7-Round Attack

9.3.1 List of Differential-Linear Distinguishers

As shown in Eqs. (7) and (8), we have two differential-linear distinguishers with correlation \(2^{-5.1}\), with the two output linear masks ([], [], [20], []) and ([], [], [20, 19], []). By extending each of these masks by 0.5 rounds, we obtain four linear masks (see Fig. 8). When both texts of a pair use one of \(\psi ^{(0)}\), \(\psi ^{(1)}\), or both use one of \(\psi ^{(2)}\), \(\psi ^{(3)}\), the correlation is \(\pm 2^{-6.42}\).

Fig. 8 Four 0.5-round linear trails for the 7-round attack

We have other linear masks whose absolute correlation is relatively high but lower than \(2^{-6.43}\). Table 3 summarizes 12 output masks.

Table 3 List of output linear masks after 6 rounds

For any \((i,j) \in \{0,1\} \times \{0,1\}\) or \((i,j) \in \{2,3\} \times \{2,3\}\), the correlations of the differential-linear distinguishers are estimated for each combination of two output masks as follows:

$$\begin{aligned} {{\,\mathrm{Aut}\,}}(\beta _0^{(i)}, \beta _0^{(j)} )&= \delta ^{(i)}_0 \cdot \delta ^{(j)}_0 \cdot 2^{-6.42}, \quad {{\,\mathrm{Aut}\,}}(\beta _0^{(i)}, \beta _1^{(j)} ) = \delta ^{(i)}_0 \cdot \delta ^{(j)}_1 \cdot 2^{-7.70}, \\ {{\,\mathrm{Aut}\,}}(\beta _0^{(i)}, \beta _2^{(j)} )&= \delta ^{(i)}_0 \cdot \delta ^{(j)}_2 \cdot 2^{-8.76}, \quad {{\,\mathrm{Aut}\,}}(\beta _1^{(i)}, \beta _1^{(j)} ) = \delta ^{(i)}_1 \cdot \delta ^{(j)}_1 \cdot 2^{-8.95}, \\ {{\,\mathrm{Aut}\,}}(\beta _1^{(i)}, \beta _2^{(j)} )&= \delta ^{(i)}_1 \cdot \delta ^{(j)}_2 \cdot 2^{-10.01}, \quad {{\,\mathrm{Aut}\,}}(\beta _2^{(i)}, \beta _2^{(j)}) = \delta ^{(i)}_2 \cdot \delta ^{(j)}_2 \cdot 2^{-11.06}, \end{aligned}$$

where \({{\,\mathrm{Aut}\,}}(\beta ^{(i)}, \beta ^{(j)}) = {{\,\mathrm{Aut}\,}}_{E_2 \circ E_m}(\Delta _m, \beta ^{(i)}, \beta ^{(j)})\) and \(\delta ^{(i)}_h \in \{1,-1\}\) is defined by the \(\delta \) column in Table 3. Each correlation is estimated by using \(2^{35}\) pairs. Considering that the lowest correlation is \(2^{-11.06}\), an estimation with \(2^{35}\) pairs is reliable enough. These differential-linear distinguishers are finally used to estimate the theoretical correlation by considering the auto-correlation-linear hull.

9.3.2 Theoretical Correlations with Auto-Correlation-Linear Hull

To understand how to estimate the theoretical correlation, we provide an example. We observe a pair of ciphertexts \((c, {{\tilde{c}}})\) and guess key bits to identify the partition.

Table 4 List of partition points for the attack against 7-round Chaskey

Table 4 summarizes the partition points for the 7-round attack. To identify the partition, we need to know

$$\begin{aligned}&s^R[15], s^R[14], v_3[18] \oplus v_2[9,17], s^L[10], s^L[9], s^L[18], s^L[17], \\&v_3[11] \oplus v_2[2,10], s^L[3], s^L[2], s^L[11], (s^L[10]), \end{aligned}$$

and 11-bit key guessing is enough, where \(s^L = k_1 \oplus k_2\) and \(s^R = k_0 \oplus k_3\). After we guess the 11-bit key, we assume that \(\zeta _1 \ni p_i \cong (0,0)\), \(\zeta _2 \ni p_i \cong (0,0,0,1,0)\), and \(\zeta _3 \ni p_i \cong (0,0,0,1,0)\) for c. We now consider the case that the linear trail \(\psi ^{(3)}\) is used for both texts in a pair. When \(\beta ^{(3)}_0\) is used, available linear masks and corresponding correlation are computed as follows:

  • To compute \(w^6_0[16]\), \(\gamma = 11100\) is used with correlation \(-1\).

  • To compute \(v^6_2[11]\), \(\gamma = 11111011010\) is used with correlation \(-1\).

  • To compute \(v^6_2[4]\), \(\gamma = 11111011010\) is used with correlation \(-1\).

Note that the partition shown in Fig. 16 is directly available to evaluate \(v_2^6[11]\). For other bits, e.g., \(v_2^6[4]\), corresponding correlations must be reevaluated because the 11th bit and 4th bit provide slightly different correlations. Assuming all partition points are independent, the correlation is

$$\begin{aligned} {\widehat{F^{-1}\vert _{{\mathcal {T}}_{p_i}}}}(\gamma ^{p_i}, \beta _0^{(3)}) = -1 \times -1 \times -1 = -1 \end{aligned}$$

due to the piling-up lemma [15].

We also assume that \(\zeta _1 \ni p_i \cong (0,0)\), \(\zeta _2 \ni p_i \cong (0,0,1,0,0)\), and \(\zeta _3 \ni p_i \cong (0,0,0,1,0)\) for \({{\tilde{c}}}\). When \(\beta _0^{(3)}\) is used, available linear masks and corresponding correlation are computed as follows:

  • To compute \({{\tilde{w}}}^6_0[16]\), \(\gamma = 11100\) is used with correlation \(-1\).

  • To compute \({{\tilde{v}}}^6_2[11]\), \(\gamma = 11111011100\) is used with correlation \(2^{-0.263}\).

  • To compute \({{\tilde{v}}}^6_2[4]\), \(\gamma = 11111011100\) is used with correlation \(-1\).

Again, assuming all partition points are independent, the correlation is

$$\begin{aligned} {\widehat{F^{-1}\vert _{{\mathcal {T}}_{p_j}}}}(\gamma ^{p_j}, \beta _0^{(3)}) = -1 \times 2^{-0.263} \times -1 = 2^{-0.263}. \end{aligned}$$

Thus, when \(\beta _0^{(3)}\) and \(\beta _0^{(3)}\) are used for c and \({{\tilde{c}}}\), respectively, the correlation (with one trail) is estimated as

$$\begin{aligned}&{\widehat{F^{-1}\vert _{{\mathcal {T}}_{p_i}}}}(\gamma ^{p_i}, \beta _0^{(3)}) \times {\widehat{F^{-1}\vert _{{\mathcal {T}}_{p_j}}}}(\gamma ^{p_j}, \beta _0^{(3)}) \times {{\,\mathrm{Aut}\,}}_{E_2 \circ E_m}(\Delta _m, \beta _0^{(3)}, \beta _0^{(3)} ) \\&\quad = -1 \times 2^{-0.263} \times (\delta _0^{(3)} \times \delta _0^{(3)} \times 2^{-6.42}) = - 2^{-6.683}. \end{aligned}$$

We now take the auto-correlation-linear hull into account. Instead of \(\beta _0^{(3)}\) for c, we use \(\beta _1^{(3)}\) and compute the correlation when the same linear mask \(\gamma \) is used.

  • To compute \(v^6_2[4,3,2]\), \(\gamma = 11111011010\) is used with correlation \(2^{-0.677}\).

Therefore,

$$\begin{aligned} {\widehat{F^{-1}\vert _{{\mathcal {T}}_{p_i}}}}(\gamma ^{p_i}, \beta _1^{(3)}) = -1 \times -1 \times 2^{-0.677} = 2^{-0.677}. \end{aligned}$$

Therefore, when \(\beta _1^{(3)}\) and \(\beta _0^{(3)}\) are used for c and \({{\tilde{c}}}\), respectively, the correlation (with one trail) is estimated as

$$\begin{aligned}&{\widehat{F^{-1}\vert _{{\mathcal {T}}_{p_i}}}}(\gamma ^{p_i}, \beta _1^{(3)}) \times {\widehat{F^{-1}\vert _{{\mathcal {T}}_{p_j}}}}(\gamma ^{p_j}, \beta _0^{(3)}) \times {{\,\mathrm{Aut}\,}}_{E_2 \circ E_m}(\Delta _m, \beta _1^{(3)}, \beta _0^{(3)} ) \\&\quad = 2^{-0.677} \times 2^{-0.263} \times (\delta _1^{(3)} \times \delta _0^{(3)} \times 2^{-7.70}) = - 2^{-8.64}. \end{aligned}$$

We estimate \(3 \times 3 = 9\) correlations and sum them up (taking the sign into account). As a result, when \(\psi ^{(3)}_1\) and \(\psi ^{(3)}_0\) are used, the absolute correlation increases to \(2^{-5.90893}\). We similarly estimate the correlations when different linear trails are used, but in fact, using \(\psi ^{(3)}_1\) and \(\psi ^{(3)}_0\) yields the highest absolute correlation on this partition. Remark that once the indicator is given, the best linear mask and corresponding correlation are computed. The complexity is about \(2^{2k_{\mathcal {P}}}\), which is negligible in comparison with the time complexity of the whole attack.

9.3.3 Experimental Reports

The absolute correlation of each partition is high enough so that we experimentally verify our attack procedure. In our experiments, we used the right pair and the correct key to observe the LLR statistic for the correct case. On the other hand, the right pair is not used for the wrong case.

Fig. 9 Comparison with LLR statistics to attack 7-Round Chaskey

The LLR statistic depends on the sum of the squared correlations \(N {\overline{C}} = \sum _{\ell =1}^{N} C_\ell ^2\). We estimated \({\overline{C}} \approx 2^{-14.711}\), i.e., \(N {\overline{C}} \approx 39.1\) when \(N = 2^{20}\) pairs are used. Figure 9 compares the LLR statistics, where the theoretical distributions are normal distributions with mean \(N {\overline{C}}/2\) (correct case) and \(-N {\overline{C}}/2\) (wrong case) and standard deviation \(\sqrt{N {\overline{C}}}\). By repeating our attack procedure 1024 times, two experimental histograms are drawn (see Fig. 9). A slight gap is observed between the theoretical distribution and the experimental histogram in the correct case; the experimental one is more biased than the theoretical estimate. We expect that this is caused by additional auto-correlation-linear hull terms that we do not take into account.

We finally estimate the data and time complexities. To identify the partition, we need to guess the 11-bit secret key. We also enumerated the elements of the linear subspace W and computed its basis using Gaussian elimination; the dimension of W is 10. Because of Lemma 6, \(2^{17}\) iterations are required to find a right pair. Thus, we need to remove \(2^{11+10+17}=2^{38}\) wrong cases. When \(2^{21}\) pairs are used, we have \(N {\overline{C}} \approx 78.2\). With a success probability of 90%, we can construct a 45.5-bit filter, which is enough to remove the \(2^{38}\) wrong cases. We estimate the time complexity using the following formula:

$$\begin{aligned} T&= p^{-1} \times 2^{n_{\mathcal {P}}} \times \left( 2N + \dim W 2^{\dim W} \right) \\&= 2^{17} \times 2^{11} \times \left( 2 \times 2^{21} + 10 \times 2^{10} \right) \approx 2^{50.00}. \end{aligned}$$
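
The numbers above can be reproduced with a few lines of Python; the threshold choice \(\Theta = N\overline{C}/2 - 1.28\sqrt{N\overline{C}}\) for 90% right-key survival is our illustrative reading of the success condition, and all names are ours.

```python
import math

def normal_upper_tail(z):
    # Pr[ N(0,1) >= z ]
    return 0.5 * math.erfc(z / math.sqrt(2.0))

N      = 2 ** 21
C_bar  = 2 ** -14.711
nc     = N * C_bar                                   # ~78.2
theta  = nc / 2 - 1.2816 * math.sqrt(nc)             # right key survives with prob. ~90%
filter_bits = -math.log2(normal_upper_tail((theta + nc / 2) / math.sqrt(nc)))
print(round(nc, 1), round(filter_bits, 1))           # ~78.2 and ~45.5-bit wrong-key filter

p_inv, partition_guesses, dim_w = 2 ** 17, 2 ** 11, 10
T = p_inv * partition_guesses * (2 * N + dim_w * 2 ** dim_w)
print(round(math.log2(T), 2))                        # ~50.0
```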

9.4 The 7.5-Round Attack

Table 5 List of output linear masks after 6.5 rounds

We further extend the four 0.5-round linear trails to eight 1-round linear trails. For each of these trails, there are two further trails whose absolute correlations are slightly lower. Table 5 shows the 24 resulting output masks. For any \((i,j) \in \{0,1,2,3\} \times \{0,1,2,3\}\) or \((i,j) \in \{4,5,6,7\} \times \{4,5,6,7\}\), the correlations of the differential-linear distinguishers are estimated for each combination of two output masks as follows:

$$\begin{aligned} {{\,\mathrm{Aut}\,}}(\beta _0^{(i)}, \beta _0^{(j)} )&= \delta ^{(i)}_0 \cdot \delta ^{(j)}_0 \cdot 2^{-9.72}, \quad {{\,\mathrm{Aut}\,}}(\beta _0^{(i)}, \beta _1^{(j)} ) = \delta ^{(i)}_0 \cdot \delta ^{(j)}_1 \cdot 2^{-10.99}, \\ {{\,\mathrm{Aut}\,}}(\beta _0^{(i)}, \beta _2^{(j)} )&= \delta ^{(i)}_0 \cdot \delta ^{(j)}_2 \cdot 2^{-12.04}, \quad {{\,\mathrm{Aut}\,}}(\beta _1^{(i)}, \beta _1^{(j)} ) = \delta ^{(i)}_1 \cdot \delta ^{(j)}_1 \cdot 2^{-12.26}, \\ {{\,\mathrm{Aut}\,}}(\beta _1^{(i)}, \beta _2^{(j)} )&= \delta ^{(i)}_1 \cdot \delta ^{(j)}_2 \cdot 2^{-13.32}, \quad {{\,\mathrm{Aut}\,}}(\beta _2^{(i)}, \beta _2^{(j)} ) = \delta ^{(i)}_2 \cdot \delta ^{(j)}_2 \cdot 2^{-14.40}, \end{aligned}$$

where \({{\,\mathrm{Aut}\,}}(\beta ^{(i)}, \beta ^{(j)}) = {{\,\mathrm{Aut}\,}}_{E_2 \circ E_m}(\Delta _m, \beta ^{(i)}, \beta ^{(j)})\) and \(\delta ^{(i)}_h \in \{1,-1\}\) is defined by the \(\delta \) column in Table 5. These correlations are estimated by using \(2^{40}\) pairs. Considering that the lowest absolute correlation is \(2^{-14.6}\), an estimation with \(2^{40}\) pairs is reliable enough. We use the same method as in the attack against 7-round Chaskey to determine a linear mask and estimate the corresponding correlation.

Table 6 summarizes the partition points for the 7.5-round attack. To identify the partition, we need to know

$$\begin{aligned}&s^R[22], s^R[21], s^R[20], s^R[19], s^R[18], s^L[24], s^L[23], \\&v_3[28] \oplus v_0[27,14], s^L[15], s^L[14], s^L[28], s^L[27], \\&v_1[25] \oplus v_2[8,1], s^R[2], s^R[1], s^R[9], s^R[8], \\&v_1[18] \oplus v_2[26,1], s^R[27], s^R[26], (s^R[2]), (s^R[1]), \\&v_1[10] \oplus v_2[25,18], (s^R[19]), (s^R[18]), (s^R[26]), s^R[25], \\&s^L[30], s^L[29], (s^L[28]), (s^L[27]) \end{aligned}$$

and 24-bit key guessing is enough, where \(s^L = k_0 \oplus k_1\) and \(s^R = k_2 \oplus k_3\).

Table 6 List of partition points for the attack against 7.5-round Chaskey

9.4.1 Experimental Reports

Each absolute correlation is lower than in the 7-round attack, but it is still possible to verify our attack procedure experimentally by using about \(2^{28}\) pairs. As in the 7-round attack, we used a right pair and the correct key to observe the LLR statistic for the correct case, while no right pair is used for the wrong case.

Fig. 10 Comparison with LLR statistics to attack 7.5-Round Chaskey

We estimated \({\overline{C}} \approx 2^{-24.37}\), i.e., \(N {\overline{C}} \approx 12.38\) when \(N = 2^{28}\) pairs are used. Figure 10 compares the LLR statistics, where the theoretical distributions are normal distributions with mean \(N {\overline{C}}/2\) (correct case) and \(-N {\overline{C}}/2\) (wrong case) and standard deviation \(\sqrt{N {\overline{C}}}\). By repeating our attack procedure 256 times, two experimental histograms are drawn. Similar to the 7-round attack, a slight gap between the theoretical distribution and the experimental histogram is observed in the correct case. We again expect that this is caused by further auto-correlation-linear hull terms that we do not consider.

We now estimate the data and time complexities. To identify the partition, we need to guess the 24-bit secret key. We also enumerated the elements of the linear subspace W and computed its basis using Gaussian elimination; the dimension of W is 21. To find a right pair, we need \(2^{17.28}\) iterations because of Lemma 7. Thus, we need to remove \(2^{24+21+17.28}=2^{62.28}\) wrong cases. Since Chaskey allows at most \(2^{48}\) data, the number of available pairs is at most \(2^{48-17.28-1} = 2^{29.72}\). Then, \(N {\overline{C}} \approx 40.78\). With a success probability of 90%, we can construct a 22.5-bit filter, which is insufficient to remove all wrong cases. Considering the \(2^{17.28}\) iterations needed to find a right pair, the effective filter on wrong keys decreases to 5.22 bits. We finally estimate the time complexity as

$$\begin{aligned} T&= p^{-1} \cdot 2^{n_{\mathcal {P}}} \cdot \left( 2N + \dim W 2^{\dim W} \right) \\&= 2^{17.28} \cdot 2^{24} \cdot \left( 2 \cdot 2^{30} + 21 \cdot 2^{21} \right) \approx 2^{72.28}. \end{aligned}$$

9.4.2 Using Multiple Linear Approximations Every Partition

A filter of only 5.22 bits is not always enough to attack 7.5-round Chaskey. To recover the unique key under the restriction of \(2^{48}\) data, we use an extended attack in which multiple linear approximations are used for every partition. In the 7.5-round attack, there are \(2 \times 4 \times 4 = 32\) linear approximations, and so far we chose only the approximation with the highest absolute correlation. There is, however, no reason not to use the other 31 approximations as well: doing so allows us to reduce the data complexity significantly. Of course, this technique is somewhat delicate because the approximations cannot be assumed to be independent. Fortunately, since our attack can be verified experimentally, we implemented it under this assumption.

Fig. 11 Comparison with LLR statistics to attack 7.5-Round Chaskey when multiple linear masks are used every partition

We estimated \({\overline{C}} \approx 2^{-22.86}\). When \(N = 2^{28}\) pairs are used, \(N {\overline{C}} \approx 35.26\), which increases from 12.38. Figure 11 compares the LLR statistics, where the theoretical distributions are normal distributions with mean \(N {\overline{C}}/2\) (correct case) and \(-N {\overline{C}}/2\) (wrong case) and standard deviation \(\sqrt{N {\overline{C}}}\). By repeating our attack procedure 256 times, two experimental histograms are drawn. Despite the delicate independence assumption, our theoretical estimate matches the experimental results nicely. Therefore, for the application to 7.5-round Chaskey, we conclude that using multiple linear approximations independently does not cause any issues.

We estimate the data and time complexities. Again, since only \(2^{29.72}\) pairs are available, \(N {\overline{C}} \approx 116.16\). With a success probability of 90%, we can construct a 69.6-bit filter, enough to remove the \(2^{62.28}\) wrong cases. The term 2N in the time complexity is multiplied by the number of approximations, 32. We finally estimate the time complexity as

$$\begin{aligned} T&= p^{-1} \times 2^{n_{\mathcal {P}}} \times \left( 2N \times 32 + \dim W 2^{\dim W} \right) \\&= 2^{17.28} \times 2^{24} \times \left( 2 \times 2^{29.72} \times 32 + 21 \times 2^{21} \right) \approx 2^{77.00}. \end{aligned}$$

10 Application to ChaCha

The internal state of ChaCha is represented by a \(4 \times 4\) matrix whose elements are 32-bit vectors. In this section, the input state for the rth round function is represented as

$$\begin{aligned} \begin{pmatrix} v_0^r &{} v_1^r &{} v_2^r &{} v_3^r \\ v_4^r &{} v_5^r &{} v_6^r &{} v_7^r \\ v_8^r &{} v_9^r &{} v_{10}^r &{} v_{11}^r \\ v_{12}^r &{} v_{13}^r &{} v_{14}^r &{} v_{15}^r \end{pmatrix}. \end{aligned}$$

The QR (an abbreviation for quarterround) function is applied in odd and even rounds on every column and diagonal, respectively. We also introduce the notion of a half round, in which the QR function is divided into two sub-functions depicted in Fig. 12. Let \(w^r\) be the internal state after applying a half round on \(v^r\). Moreover, we use the term branches for a, b, c, and d, as shown in Fig. 12.
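
For reference, a Python sketch of the QR function and of the half-round split follows; the rotation constants 16, 12, 8, 7 are the standard ChaCha ones [4, 5]. Since Fig. 12 is not reproduced here, the exact placement of the half-round boundary should be cross-checked against it; below we split after the first two ARX steps.

```python
MASK = 0xFFFFFFFF

def rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK

def qr_first_half(a, b, c, d):
    a = (a + b) & MASK; d = rotl(d ^ a, 16)
    c = (c + d) & MASK; b = rotl(b ^ c, 12)
    return a, b, c, d

def qr_second_half(a, b, c, d):
    a = (a + b) & MASK; d = rotl(d ^ a, 8)
    c = (c + d) & MASK; b = rotl(b ^ c, 7)
    return a, b, c, d

def qr(a, b, c, d):
    return qr_second_half(*qr_first_half(a, b, c, d))

# odd rounds apply qr to the columns, even rounds to the diagonals of the 4x4 state
COLUMNS   = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]
```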

In the initial state of ChaCha, a 128-bit constant is loaded into the first row, a 128- or 256-bit secret key is loaded into the second and third rows, and a 64-bit counter and 64-bit nonce are loaded into the fourth row. In other words, the first three rows in \(v^0\) are fixed. For r-round ChaCha, the odd and even round functions are iteratively applied, and the feed-forward values \(v^0_i \boxplus v^r_i\) are given as the key stream for all i. Note that we can compute \(v^r_i\) for \(i \in \{0,1,2,3,12,13,14,15\}\) because the corresponding \(v^0_i\) is known.

Fig. 12 The odd and even round functions of ChaCha

10.1 Overview of Our Attack

We use the same attack strategy as for Chaskey. To attack 6 rounds, the cipher is divided into the sub-ciphers \(E_1\) covering 1 round, \(E_m\) covering 2.5 rounds, and \(E_2\) covering 1.5 rounds, and the key recovery (the function F) is applied to the last round. One difference to Chaskey is the domain space that the attacker can control. In particular, we cannot control branches a, b, and c because fixed constants and the fixed secret key are loaded into these states. Thus, only branch d can be varied. This implies that active bit positions for input differences are limited to branch d, and a difference \(\Delta _m\) with Hamming weight 1 after \(E_1\) is not achievable due to the structure of the round function. Therefore, we first need to generate a consistent \(\Delta _m\) whose Hamming weight is minimized. The following shows such a differential characteristic over one QR function:

$$\begin{aligned} \Delta _{\mathrm {in}}= (([]),([]),([]),([i])) \ \rightarrow \ \Delta _{m} = (&([i+28]),([i+31,i+23,i+11,i+3]),\\&([i+24,i+16,i+4]),([i+24,i+4])). \end{aligned}$$

The probability that pairs with input difference \(\Delta _{\mathrm {in}}\) satisfy this characteristic is \(2^{-5}\) on average. We discuss the properties of this differential characteristic in Sect. 10.2 in more detail.
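
Under the assumptions that the QR sketch above is accurate, that the column ordering is \((a,b,c,d) = (v_0, v_4, v_8, v_{12})\), and that bit 0 denotes the lsb, the average probability of this characteristic can be estimated by Monte Carlo sampling as in the following sketch (toy sample size); if these assumptions hold, the estimate should land in the ballpark of the stated \(2^{-5}\).

```python
import random

MASK = 0xFFFFFFFF
def rotl(x, r): return ((x << r) | (x >> (32 - r))) & MASK

def qr(a, b, c, d):
    a = (a + b) & MASK; d = rotl(d ^ a, 16)
    c = (c + d) & MASK; b = rotl(b ^ c, 12)
    a = (a + b) & MASK; d = rotl(d ^ a, 8)
    c = (c + d) & MASK; b = rotl(b ^ c, 7)
    return a, b, c, d

def bits(positions):
    return sum(1 << (p % 32) for p in positions)

i = 6
d_in  = (0, 0, 0, bits([i]))
d_out = (bits([i + 28]),
         bits([i + 31, i + 23, i + 11, i + 3]),
         bits([i + 24, i + 16, i + 4]),
         bits([i + 24, i + 4]))

random.seed(6)
samples, hits = 1 << 18, 0
for _ in range(samples):
    x  = [random.getrandbits(32) for _ in range(4)]
    y  = qr(*x)
    y2 = qr(*(xi ^ di for xi, di in zip(x, d_in)))
    if all((a ^ b) == d for a, b, d in zip(y, y2, d_out)):
        hits += 1
print(hits / samples)        # rough estimate of the average probability (~2^-5)
```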

We next evaluate an experimental differential-linear distinguisher for the middle part \(E_m\). When the Hamming weight of \(\Gamma _m\) is 1 and the active bit is the lsb, the loss in absolute correlation of the linear trails for \(E_2\) is kept low. For \(i=6\), i.e., \(\Delta _{m} = (([2]),([5,29,17,9]),([30,22,10]),([30,10]))\), we find the following four differential-linear distinguishers:

$$\begin{aligned} {{\,\mathrm{Aut}\,}}_{E_m}(\Delta _{m,j}, \alpha _j, \alpha _j) = 2^{-8.3} \end{aligned}$$

for \(j \in \{0,1,2,3\}\), where \(\Delta _{m,j}\) is a difference such that \(\Delta (v^1_{j}, v^1_{j+4}, v^1_{j+8}, v^1_{j+12}) = \Delta _m\) (and the other branches are constant), and \(\alpha _j\) is a linear mask such that the lsb of the branch \(w^3_{(j+1) ~ \mathrm {mod}~4}\) is 1 (and the others are 0). When this experimental distinguisher is combined with the differential characteristic for \(E_1\), it covers 3.5 rounds with a 1-bit output linear mask \(\Gamma _m\). This improves on the previous differential-linear distinguisher with a 1-bit output linear mask by 0.5 rounds (see [20, 22]).

10.2 Differential Part

The QR function is applied independently to each column in the first round. Therefore, when the output difference of one QR function is restricted by \(\Delta _m\), the inputs of the other three QR functions are trivially independent of the output difference. This implies that we have 96 independent bits, and we can easily amplify the probability of the differential-linear distinguisher. On the other hand, we face a different problem: the probability of the differential characteristic \((\Delta _{\mathrm {in}}, \Delta _m)\) highly depends on the value of the secret key. For example, for \(\Delta v_{12}^0[6] = 1\), we expect that there is a pair \((v_{12}^0, v_{12}^0 \oplus \mathtt {0x00000040})\) satisfying \(\Delta (v_0^1, v_4^1, v_8^1, v_{12}^1) = \Delta _m\), but whether this is the case depends on the constant \(v_{0}^0\) and the key values \(v_{4}^0\) and \(v_{8}^0\). In our experiments, we could not find such a pair for 292 out of 1024 randomly generated keys. On the other hand, when such a pair exists, i.e., for 732 out of 1024 keys, the average probability of satisfying \(\Delta (v_0^1, v_4^1, v_8^1, v_{12}^1) = \Delta _m\) is \(2^{-4.5}\). This experiment implies the existence of “strong keys” against our attack. However, note that we can vary the column in which we put a difference, which involves different key words. Since the fraction of “strong keys” is not too high, i.e., 292/1024, we can assume that, with very high probability, there is at least one column for which the key is not “strong”.

To determine the factor p, for 1024 randomly generated keys, we evaluated \(p^{-1}\) randomly chosen nonces and counters, where the branch in which we induce the difference is also randomly chosen. As a result, we can find a right pair for 587 of the keys with \(p^{-1}=2^5\) iterations. Therefore, with \(p=2^{-5}\), we assume that we can find a right pair with probability 1/2 in this stage of the attack.

In the following, we explain our attack for the case that \(v_{12}^{0}\) is active and \(\Delta (v^1_0, v^1_4, v^1_8, v^1_{12}) = \Delta _m\). Note that the analysis for the other three cases follows the same argument.

10.3 Linear Part for the 6-Round Attack

Fig. 13: Two linear trails for 1.5-round ChaCha

To attack 6-round ChaCha, we first construct a 5-round differential-linear distinguisher by appending 1.5-round linear trails (i.e., the \(E_2\) part) to the 3.5-round experimental differential-linear distinguisher from the previous section. We have two 1.5-round linear trails, given by

$$\begin{aligned}&\mathbf {Cor}[w_1^3[0] \oplus \psi ^{(1)}] = 2^{-1},&\mathbf {Cor}[w_1^3[0] \oplus \psi ^{(0)}] = -2^{-1}, \end{aligned}$$

where \(\psi ^{(1)} = \psi \oplus v_{10}^5[6]\) and \(\psi ^{(0)} = \psi \oplus v_{14}^5[6]\), and

$$\begin{aligned} \psi&= (v_{5}^5[19,7] \oplus v_{10}^5[19,7] \oplus v_{15}^5[8,0]) \oplus (v_{1}^5[0] \oplus v_{6}^5[26] \oplus v_{11}^5[0])\\&\quad \oplus (v_{13}^5[0]) \oplus (v_{3}^5[0] \oplus v_{9}^5[12] \oplus v_{14}^5[7]). \end{aligned}$$

Figure 13 shows the two 1.5-round linear trails. Since the trail correlations are \(\pm 2^{-1}\) and the trail is applied to both texts of a pair, the correlation of the 3.5-round distinguisher is multiplied by \((2^{-1})^2\); hence we have \(2 \times 2\) differential-linear distinguishers on 5 rounds whose correlations are \(\pm 2^{-8.3} \cdot 2^{-2} = \pm 2^{-10.3}\). Note that the sign of each correlation is determined by the output linear mask.

10.4 Key Recovery for the 6-Round Attack

Fig. 14: Key recovery for 6-round ChaCha

Our 6-round attack uses these 5-round differential-linear distinguishers, and the 1-round key recovery is shown in Fig. 14. Let \(\mathbf {c} = (c_0, \ldots , c_{15})\) be the corresponding output, and let \(\mathbf {v} = (v_0, \ldots , v_{15})\) be the sixteen 32-bit values before the secret key is added. Note that the secret key is added to only half of the state, while public values are added to the other half. Therefore, we simply regard \(v_i = c_i\) for \(i \in \{0,15,1,12,2,13,3,14\}\).

First, we partially extend the two linear masks through the last round so that they can be computed linearly. Figure 14 summarizes the extended linear masks, where the bits we need to compute are labeled in red. Moreover, for simplicity, we introduce \(t_0\), \(t_{10}\), \(t_{11}\), and \(t_3\) as depicted in Fig. 14.

Each bit in \(\mathbf {v}\) to which the secret key is not added can be computed for free. For the other bits, we need to guess some key bits. We first explain the simple case, i.e., computing \(v_i[j]\) from \(c_i\). As an example, we focus on \(v_7[7]\), which involves \(k_7\) nonlinearly. We apply the partition technique to compute this bit. By guessing \(k_7[6]\) and \(k_7[5]\) (recall that \(k_7[7]\) cancels out in the differential-linear approximation), 3/4 of the data is available with correlation \(\pm 1\), and the remaining 1/4 of the data is available with correlation \(-2^{-1}\). Since \(v_i[0]\) is computed linearly from \(c_i[0]\), there are 13 simple partition points at which we need to guess key bits. In total, we need to guess 26 key bits.
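To make the partition argument for this simple case concrete, the following sketch (an illustration with arbitrary parameters, not the exact partition definition used in the attack) checks that guessing \(k_7[6]\) and \(k_7[5]\) determines the borrow into bit 7 of \(v_7 = c_7 \boxminus k_7\) whenever at least one of the indicator bits \((c_7 \oplus k_7)[6]\), \((c_7 \oplus k_7)[5]\) is non-zero, i.e., on 3/4 of the data; the remaining quarter, which the attack still uses with correlation \(\pm 2^{-1}\), is skipped here.

import random

MASK = 0xFFFFFFFF
bit = lambda x, i: (x >> i) & 1

random.seed(1)
k = random.getrandbits(32)               # unknown key word k_7
hits = covered = 0
TOTAL = 1 << 16

for _ in range(TOTAL):
    c = random.getrandbits(32)           # observed word c_7
    v = (c - k) & MASK                   # v_7 = c_7 - k_7 mod 2^32
    # Guessed key bits k[6], k[5]; partition on (c ^ k)[6] and (c ^ k)[5].
    if bit(c, 6) ^ bit(k, 6):
        borrow = bit(k, 6)               # borrow into bit 7 equals k[6]
    elif bit(c, 5) ^ bit(k, 5):
        borrow = bit(k, 5)               # borrow propagates unchanged from bit 5
    else:
        continue                         # remaining 1/4 of the data
    covered += 1
    # v[7] = c[7] ^ k[7] ^ borrow; k[7] cancels between the two texts of a
    # pair, so it is included here only to compare against the true value.
    hits += (bit(c, 7) ^ bit(k, 7) ^ borrow) == bit(v, 7)

print(f"covered fraction: {covered / TOTAL:.3f}  (expected 0.750)")
print(f"accuracy on covered data: {hits / covered:.3f}  (expected 1.000)")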

Computing bits in \(\mathbf {v}^5\) and \(\mathbf {t}\) is slightly more complicated than the simple case above. For example, consider \(v_9^5[12]\); this bit can be computed as

$$\begin{aligned} v_9^5[12]&= (c_9 \boxminus k_9 \boxminus c_{14} \boxminus (c_3 \oplus (v_{14} \ggg 8)))[12] \\&= ((c_9 \boxminus c_{14} \boxminus (c_3 \oplus (v_{14} \ggg 8)))\boxminus k_9 )[12]. \end{aligned}$$

Since we can compute \((c_9 \boxminus c_{14} \boxminus (c_3 \oplus (v_{14} \ggg 8)))\) for free, this case is equivalent to the simple case. We also use this equivalent transformation for \(t_{10}\), \(t_{11}\), and \(v_{10}^5\). In total, we have 6 such partition points, and some partition points can share the same key bits; e.g., the 2-bit key \(k_{10}[18]\) and \(k_{10}[17]\) is already guessed to compute \(v_{10}[19]\). Guessing 4 additional key bits is enough to compute each bit. Since we have two linear masks \(\psi ^{(0)}\) and \(\psi ^{(1)}\), we dynamically change the applied linear mask according to the data such that the correlation for computing \(v_{10}^5[7]\) (resp. \(v_{10}^5[7,6]\)) becomes \(\pm 1\).
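The equivalent transformation only reorders the modular subtractions so that the key word is subtracted last; a short check with arbitrary placeholder values illustrates this:

M = 1 << 32
c9, c14, k9, x = 0x12345678, 0x9ABCDEF0, 0x0F0F0F0F, 0x13579BDF  # placeholder values; x stands for c_3 xor (v_14 >>> 8)
lhs = (c9 - k9 - c14 - x) % M             # subtraction order as written
rhs = ((c9 - c14 - x) % M - k9) % M       # key-free part first, key word last
assert lhs == rhs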

We cannot use the equivalent transformation to compute the bits in \(t_0\) and \(t_3\). Instead, we further extend the linear mask at a cost of correlation \(2^{-1}\). For example, we have the following approximations

$$\begin{aligned} t_0[8]&\approx v_0[8,7] \oplus v_5[15] \oplus v_{10}[8] \oplus 1, \quad t_0[8] \approx v_0[8] \oplus v_5[15,14] \oplus v_{10}[8,7], \end{aligned}$$

for \(t_0[8]\), each with correlation \(2^{-1}\), and we can use whichever approximation is preferable depending on the data. Namely, we first guess \(k_{10}[7]\) and determine which linear approximation is available. Then, we guess \(k_{5}[14]\) and \(k_{5}[13]\) and compute \(v_5[15]\) (resp. \(v_5[15,14]\)). In other words, by guessing 3 key bits, 3/4 of the data is available with correlation \(\pm 2^{-1}\), and 1/4 of the data is available with correlation \(\pm 2^{-2}\). We use the same technique for \(t_3[7]/t_3[7,6]\).

10.4.1 Estimating the Average of the Squared Correlation

Based on the analysis above, we estimate the average of the squared correlation. To compute this average, we assume that partitioning points are independent whenever their indicators use different bits.

We start evaluating the function involving the 1st diagonal.

  • The indicator to compute \(t_0[8]\) is \((c_5 \oplus k_5)[14]\), \((c_5 \oplus k_5)[13]\), and \((c_{10} \oplus k_{10})[7]\). As discussed before, the average of the squared correlation is \(3/4 \times (2^{-1})^2 + 1/4 \times (2^{-2})^2\).

  • The indicator to compute \(v_5[26]\) is the 25th and 24th bits in \((c_5 \oplus k_5)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_5[7,6]\) is the 6th and 5th bits in \((c_5 \oplus k_5)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{10}^5[19]\) is the 18th and 17th bits in \(((c_{10} \boxminus c_{15} \boxminus (c_0 \oplus (c_{15} \ggg 8))) \oplus k_{10})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{10}^5[7]\) (or \(v_{10}^5[7,6]\)) is the 6th bit in \(((c_{10} \boxminus c_{15} \boxminus (c_0 \oplus (c_{15} \ggg 8))) \oplus k_{10})\). Here, we change the applied linear mask according to the observed ciphertext such that correlations become 1. Thus, the average of the squared correlation is 1.

  • The indicator to compute \(t_{10}[19]\) is the 18th and 17th bits in \(((c_{10} \boxminus c_{15}) \oplus k_{10})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(t_{10}[7]\) is the 6th and 5th bits in \(((c_{10} \boxminus c_{15}) \oplus k_{10})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{10}[31]\) is the 30th and 29th bits in \((c_{10} \oplus k_{10})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{10}[19]\) is the 18th and 17th bits in \((c_{10} \oplus k_{10})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

In total, we guess 6 key bits in \(k_5\) and 7 key bits in \(k_{10}\). The average of the squared correlation is the product of the averages over these nine partitioning points, i.e., about \(2^{-4.396}\).

We similarly evaluate the function involving the 2nd, 3rd, and 4th diagonals.

  • The indicator to compute \(v_6[19]\) is the 18th and 17th bits in \((c_6 \oplus k_6)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_6[13]\) is the 12th and 11th bits in \((c_6 \oplus k_6)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_6[7]\) is the 6th and 5th bits in \((c_6 \oplus k_6)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(t_{11}[26]\) is the 25th and 24th bits in \(((c_{11} \boxminus c_{12}) \oplus k_{11})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{11}[12]\) is the 11th and 10th bits in \((c_{11} \oplus k_{11})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{11}[6]\) is the 5th and 4th bits in \((c_{11} \oplus k_{11})\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

For the 2nd diagonal, we guess 6 key bits in \(k_6\) and 6 key bits in \(k_{11}\). The average of the squared correlation is about \(2^{-1.797}\).

  • The indicator to compute \(v_7[7]\) is the 6th and 5th bits in \((c_7 \oplus k_7)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

For the 3rd diagonal, we guess 2 key bits in \(k_7\). The average of the squared correlation is about \(2^{-0.300}\).

  • The indicator to compute \(t_3[7,6]\) or \(t_3[7]\) is \((c_4 \oplus k_4)[13]\), \((c_4 \oplus k_4)[12]\), and \((c_{9} \oplus k_9)[6]\). As discussed before, the average of the squared correlation is \(3/4 \times (2^{-1})^2 + 1/4 \times (2^{-2})^2\).

  • The indicator to compute \(v_4[19]\) is the 18th and 17th bits in \((c_4 \oplus k_4)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_4[7]\) is the 6th and 5th bits in \((c_4 \oplus k_4)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{9}^5[12]\) is the 11th and 10th bits in \(((c_{9} \boxminus c_{14} \boxminus (c_3 \oplus (c_{14} \ggg 8))) \oplus k_9)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

  • The indicator to compute \(v_{9}[12]\) is the 11th and 10th bits in \((c_{9} \oplus k_9)\). As discussed before, the average of the squared correlation is \(3/4 \times (1)^2 + 1/4 \times (2^{-1})^2\).

For the 4th diagonal, we guess 6 key bits in \(k_4\) and 3 key bits in \(k_9\). The average of the squared correlation is about \(2^{-3.498}\).

Therefore, we guess 36 key bits in total, and the average of the squared correlation (for the linear part applied to one text of each pair) is \(2^{-9.991}\). The average of the squared correlation for the whole approximation is estimated by

$$\begin{aligned} {\overline{C}} = (2^{-10.3})^2 \times (2^{-9.991}) \times (2^{-9.991}) \approx 2^{-40.582}. \end{aligned}$$
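For reference, the numbers above can be reproduced with the following short computation; it only redoes the arithmetic with the per-partition averages derived above.

from math import log2

good = 3/4 * 1 + 1/4 * (2**-1)**2             # = 13/16, a standard partition point
ext  = 3/4 * (2**-1)**2 + 1/4 * (2**-2)**2    # = 13/64, for t_0[8] and t_3[7]/t_3[7,6]

diag1 = ext * good**7 * 1                     # nine points; v_10^5[7] contributes a factor 1
diag2 = good**6
diag3 = good
diag4 = ext * good**4
per_text = diag1 * diag2 * diag3 * diag4
print(f"2^{log2(diag1):.3f} 2^{log2(diag2):.3f} 2^{log2(diag3):.3f} 2^{log2(diag4):.3f}")
print(f"per-text average: 2^{log2(per_text):.3f}")                  # ~ 2^-9.991
Cbar = (2**-10.3)**2 * per_text**2
print(f"overall average squared correlation: 2^{log2(Cbar):.3f}")   # ~ 2^-40.582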

Note that, unlike for Chaskey, once these key bits are correctly guessed, all linearly involved bits are either determined or cancel out when the two texts of a pair are XORed. This implies \(\dim (W)=0\), and we do not need to apply the FWHT.

10.4.2 Data and Time Complexities and Success Probability

Based on the LLR statistic, we estimate the data complexity and the corresponding success probability.

To find a right pair, we repeat Algorithm 2 \(2^{5}\) times. If we use a right pair and guess the correct key, the LLR statistic follows the normal distribution \({\mathcal {N}}( \frac{N}{2} {\overline{C}}, N {\overline{C}} )\). On the other hand, we assume that it follows \({\mathcal {N}}( -\frac{N}{2} {\overline{C}}, N {\overline{C}} )\) when either a wrong pair is used or the key is guessed incorrectly.

We need to filter out the \((5+36)\)-bit wrong guesses using this difference between the two normal distributions. By using Proposition 5, the expected number of wrong keys is less than 1 when

$$\begin{aligned} \Theta \ge \sqrt{N {\overline{C}}} \cdot \Phi ^{-1}\!\left( 1 - 2^{-41}\right) - \frac{N}{2}{\overline{C}}. \end{aligned}$$

When we use \(N = 2^{47}\) pairs, \(\Theta \approx 23.303\) and \(p_{\mathrm{success}} = 0.491\). For this success probability, the data complexity is \(2^{1+47+5} = 2^{53}\). Since we guess \(2^{36}\) keys for each text, the required time complexity is \(2^{53+36}=2^{89}\).
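The threshold and the success probability can be reproduced as follows; this is only a sketch of the arithmetic, where the factor 1/2 accounts for the probability of having found a right pair (cf. Sect. 10.2).

from math import sqrt
from statistics import NormalDist

Phi = NormalDist()

def threshold_and_success(n_pairs, cbar, filter_bits=41):
    # Wrong guesses follow N(-mu, sigma^2); the threshold keeps their expected number below 1.
    mu, sigma = n_pairs / 2 * cbar, sqrt(n_pairs * cbar)
    theta = sigma * Phi.inv_cdf(1 - 2**-filter_bits) - mu
    p_success = 0.5 * Phi.cdf((mu - theta) / sigma)   # 0.5: probability of a right pair
    return theta, p_success

print(threshold_and_success(2**47, 2**-40.582))       # ~ (23.3, 0.49)

Calling the same function with \(N^* = 2^{35.4}\) available pairs and \({\overline{C}} = 2^{-28.6}\) reproduces \(\Theta \approx 19.7\) and \(p_{\mathrm{success}} \approx 0.5\) for the variant attack in Sect. 10.5 below.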

10.5 Another 6-Round Attack

The aforementioned attack is the straightforward application of our attack framework, and it may be optimal with respect to the data complexity. Interestingly, there is another strategy that achieves a lower time complexity at the cost of a higher data complexity.

In the aforementioned attack, we guessed many key bits for each ciphertext. Now, instead of guessing many key bits, we deduce, for each observed ciphertext, the values of \(k_{\mathcal {P}}\) for which the absolute correlations become 1 (except for \(t_0[8]\) and \(t_3[7]\) or \(t_3[7,6]\)). For example, to compute \(v_5[26]\), we first check \((c_5[25] \Vert c_5[24])\). For exactly one of the four guesses of \((k_5[25] \Vert k_5[24])\), the indicator falls into the partition whose absolute correlation is lower than 1; for the other three guesses, we have representations whose correlation is \(\pm 1\). We skip that one guess for this ciphertext only, which reduces the time complexity.

We have 21 partitioning points, and 3/4 of the data is available at each point. Only for one point, i.e., \(v_{10}^5[7]/v_{10}^5[7,6]\), do we avoid reducing the available data, by dynamically changing the applied linear mask. In summary, the fraction of available partitions is \((3/4)^{20} \approx 2^{-8.3}\). Both texts in each pair must belong to an available partition, so the fraction of available pairs is \(2^{-16.6}\). The final correlation is \(2^{-10.3} \times 2^{-2} \times 2^{-2} = 2^{-14.3}\), and the average of the squared correlation is estimated as \({\overline{C}} = 2^{-28.6}\).
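The counting above amounts to the following short computation (arithmetic only):

from math import log2
per_text = (3/4) ** 20                  # fraction of texts falling into an available partition
per_pair = per_text ** 2                # both texts of a pair must be available
corr     = 2**-10.3 * 2**-2 * 2**-2     # one factor 2^-2 per text, from t_0[8] and t_3[7]/t_3[7,6]
print(f"available texts: 2^{log2(per_text):.1f}")            # ~ 2^-8.3
print(f"available pairs: 2^{log2(per_pair):.1f}")            # ~ 2^-16.6
print(f"correlation: 2^{log2(corr):.1f}, Cbar: 2^{log2(corr**2):.1f}")  # 2^-14.3, 2^-28.6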

To find a right pair, we repeat Algorithm 2 \(2^{5}\) times. By using Proposition 5, the expected number of wrong keys is less than 1 when

$$\begin{aligned} \Theta \ge \sqrt{N^*{\overline{C}}} \cdot \Phi ^{-1}\!\left( 1 - 2^{-41}\right) - \frac{N^*}{2}{\overline{C}}, \end{aligned}$$

where \(N^*= N \times 2^{-16.6}\). When we use \(N = 2^{52}\) pairs, we have \(N^*= 2^{35.4}\), \(\Theta \approx 19.693\), and \(p_{\mathrm{success}} \approx 0.5\). For this success probability, the data complexity is \(2^{1+52+5} = 2^{58}\).

In this attack, we do not need to guess \(2^{36}\) keys for all \(2^{58}\) texts. For each text, we guess only the available key bits. Therefore, the time complexity is estimated as

$$\begin{aligned} 1/p \times (2N + 2N^{*} \times 2^{n_{{\mathcal {P}}}}) = 2^5 \times (2^{53} + 2^{36.4} \times 2^{36}) \approx 2^{77.4}. \end{aligned}$$

10.6 The 7-Round Attack

Unfortunately, the linear part of 7-round ChaCha is too complicated for our technique to be applied. On the other hand, thanks to our contribution to the differential part, we find a new differential-linear distinguisher that is improved by 0.5 rounds. Therefore, to confirm the effect of our contribution to the differential part, we use a known technique, namely the probabilistic neutral bits (PNB) approach, for the key-recovery attack against 7-round ChaCha. The PNB-based key recovery is a fully experimental approach; we refer to [22] for the details and summarize the technique as follows:

  • Let the correlation in the forward direction (i.e., of the differential-linear distinguisher) after r rounds be \(\epsilon _d\).

  • Let n be the number of PNBs for a given correlation \(\gamma \); namely, even if we flip one of the PNBs, we still observe a correlation of \(\gamma \).

  • Let the correlation in the backward direction, where all PNB bits are fixed to 0 and non-PNB bits are fixed to the correct ones, be \(\epsilon _a\).

Then, the time complexity of the attack is estimated as \(2^{256-n} N + 2^{256 - \alpha }\), where the data complexity N is given as

$$\begin{aligned} N = \left( \frac{\sqrt{\alpha \log (4)} + 3\sqrt{1 - \epsilon _a^2 \epsilon _d^2}}{\epsilon _a \epsilon _d} \right) ^2, \end{aligned}$$

where \(\alpha \) is a parameter that the attacker can choose.

In our case, we use a 4-round differential-linear distinguisher with correlation \(\epsilon _d = 2^{-8.3}\). Using pairs generated by the technique shown in Sect. 10.2, we determined the PNBs experimentally. With \(\gamma =0.35\), we found 74 PNBs, where the non-zero bits of the following bit-vectors indicate the PNB positions:

$$\begin{aligned} v_4&: \mathtt{0x00098080} \\ v_5&: \mathtt{0x8CFFE7FC } \\ v_6&: \mathtt{0xF8087FC0 } \\ v_7&: \mathtt{0x0000403C } \\ v_8&: \mathtt{0x80000100 } \\ v_9&: \mathtt{0xF8198183 } \\ v_{10}&: \mathtt{0x80700007 } \\ v_{11}&: \mathtt{0xF8000000}. \end{aligned}$$
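As a quick sanity check, the Hamming weights of these bit-vectors indeed sum to 74:

pnb_masks = {
    "v4": 0x00098080, "v5": 0x8CFFE7FC, "v6": 0xF8087FC0, "v7": 0x0000403C,
    "v8": 0x80000100, "v9": 0xF8198183, "v10": 0x80700007, "v11": 0xF8000000,
}
counts = {w: bin(m).count("1") for w, m in pnb_masks.items()}
print(counts, "total:", sum(counts.values()))   # total: 74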

The correlation in the backward direction is then \(\epsilon _a = 2^{-10.6769}\). With \(\alpha =36\), we have \(N=2^{43.83}\), and the time complexity is \(2^{225.86}\). Again, since we need to repeat this procedure \(p^{-1}\) times, the data and time complexities are \(2^{48.83}\) and \(2^{230.86}\), respectively.

11 Conclusion and Future Work

We presented new ideas for differential-linear attacks and, in particular, the best attacks on ChaCha, one of the most widely used ciphers in practice, and on Chaskey. We hope that our framework finds more applications. In particular, we consider it promising future work to apply our ideas to other ARX designs.

Besides the direct application of our framework to more primitives, our work raises several more fundamental questions. As explained in the experimental verification, we sometimes observe higher LLR statistics than expected, which makes the attacks more efficient than estimated. The gap presumably comes from the difficulty of accurately estimating the correlations of all partitions. Our paper does not address how to estimate these correlations accurately and efficiently.

Another important open question concerns the 7-round attack on ChaCha. We applied the partitioning technique to the 6-round attack, and the result outperforms the PNB technique, an experiment-based key-recovery method. Unfortunately, applying the partitioning technique to 7-round ChaCha is too complicated. A simpler and more powerful key-recovery procedure exploiting the partitioning technique might beat the PNB-based key recovery, as in the 6-round attack, but this remains an open question.