
1 Introduction

In 2019, Gohr [15] proposed differential-neural cryptanalysis, employing neural networks as superior distinguishers and exploiting them to perform efficient key recovery attacks. Impressively, the differential-neural distinguisher (\(\mathcal{N}\mathcal{D}\)) outperformed the traditional pure differential distinguishers using full differential distribution tables (DDT). However, interpreting these neural network-based distinguishers remains challenging, hindering the comprehension of the additional knowledge learned by differential-neural distinguishers.

Despite the intricate nature of neural network interpretability, researchers have made preliminary progress in understanding the differential-neural distinguisher's inner workings. In EUROCRYPT 2021, Benamira et al. [7] proposed that Gohr's neural distinguisher effectively approximates the cipher's DDT during the learning phase. Moreover, the distinguisher relies on both the differential distribution of ciphertext pairs and that of the penultimate and antepenultimate rounds. Yet, the specific form of the additional information remains undisclosed.

In AICrypt 2023, Gohr et al. [16] showed that the differential-neural distinguisher for Simon32/64 can use only differential features and still achieve the same accuracy as pure differential ones. Applying the same neural network to Speck and Simon thus yields different conclusions: the network learns features beyond the full DDT for one cipher but not for the other. These intriguing findings motivate us to delve deeper into the neural network's mechanisms, aiming to identify the specific features underpinning its conclusions for each cipher and to further improve and exploit the neural distinguishers whenever additional features are captured.

Our Contributions. In this work, we conclude that \(\mathcal {N}\mathcal {D}\)s’ advantage over pure DDT-based distinguishers is in exploiting the differential distribution under the partially known value input to the last non-linear operation. Specifically, \(\mathcal {N}\mathcal {D}\)s exploit the correlation between the ciphertexts’ partial value, ciphertext pair’s differences, and intermediate states’ differences. Furthermore, our work shows that differential-neural cryptanalysis in the related-key (\(\mathcal{R}\mathcal{K}\)) setting can attack more rounds than in the single-key setting, which was not apparent before. The concrete contributions include the following.

  • Improving full DDT-based distinguisher. We observe that, apart from the information of differences, one knows the partial value of the inputs, denoted by y, to the last modular addition of Speck, leveraging which one can improve DDT-based distinguishers. We show that the differential probability conditioned on a fixed value of y can differ from the average differential probability over all possible y. This insight enables more accurate classification based on the ciphertext pair's differences and the ciphertexts' partial value. The high-level idea is to consider conditional probabilities and specific cases where the fulfillment of the differential constraints can be predicted based on the value of y. The results indicate that it is highly likely that \(\mathcal {N}\mathcal {D}\)s rely on these specific cases to outperform pure DDT-based distinguishers.

  • Optimizing the performance and training process of \(\mathcal{N}\mathcal{D}\)s. Addressing the challenge of training high-round, especially 8-round, \(\mathcal{N}\mathcal{D}\) of Speck32/64, we introduce the Freezing Layer Method. By freezing all convolutional layers in a pre-trained 7-round \(\mathcal{N}\mathcal{D}\), we efficiently train an 8-round \(\mathcal{N}\mathcal{D}\) using simple basic training with unaltered hyperparameters. This method matches Gohr’s accuracy but cuts training time and data.

  • Exploring differential-neural attacks in the related-key setting. The conclusion that \(\mathcal {N}\mathcal {D}\)s can efficiently capture features beyond full DDT encourages further exploration of \(\mathcal {N}\mathcal {D}\)-based attacks. We observed that control over the differential propagation is vital for achieving effective high-round \(\mathcal{N}\mathcal{D}\)s. Hence, we introduce related-key (\(\mathcal{R}\mathcal{K}\)) differences to slow down the diffusion of differences, aiding in training \(\mathcal{N}\mathcal{D}\) for higher rounds. As a result, we achieve a 14-round key recovery attack on Speck32/64 using related-key neural distinguishers (\(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s). Results are in Table 1. Furthermore, we constructed various distinguishers under various \(\mathcal{R}\mathcal{K}\) differential trails and conducted comprehensive comparisons, reinforcing \(\mathcal {N}\mathcal {D}\) explainability.

Table 1. Summary of key recovery attacks on Speck32/64

Organization. The paper's structure is as follows: Sect. 2 provides preliminaries. Section 3 provides insights on the \(\mathcal {N}\mathcal {D}\) explainability. Section 4 provides enhancements on the \(\mathcal {N}\mathcal {D}\) training. Section 5 details related-key differential-neural cryptanalysis. The conclusion is presented in Sect. 6.

2 Preliminary

2.1 Notations

Denote by \(C = (C_{n-1}, \ldots , C_0)\) the binary vector of n bits, where \(C_i\) is the bit at position i and \(C_0\) is the least significant. Define n as the word size in bits and 2n as the state size. Let \((C_{L}^{r}, C_{R}^{r})\) represent left and right state branches after r rounds, and \(k^{r}\) the r-round subkey. Bitwise XOR is denoted by \(\oplus \), addition modulo \(2^n\) by \(\boxplus \), bitwise AND by \(\odot \), and bitwise right/left rotation by \(\ggg /\lll \).

2.2 Brief Description of Speck32/64

In 2013, the National Security Agency (NSA) proposed the Speck and Simon block ciphers, aiming to ensure security on resource-constrained devices [5]. By 2018, both ciphers were standardized by ISO/IEC for air interface communication. The Speck cipher uses a Feistel-like ARX design, deriving its non-linearity from modular addition and leveraging XOR and rotation for linear mixing. Speck32/64 is the smallest Speck variant [5]. Its round function, iterated over 22 rounds, takes a 16-bit subkey \(k^{i}\) and a state of two 16-bit words, \((C_{L}^i, C_{R}^i)\). Its key schedule reuses the round function to generate round keys. With K as the master key and \(k^i\) the i-th round key, \(K=(l^2,l^1,l^0,k^0)\). The round function's details are in Fig. 1.

Fig. 1. The round function and key schedule algorithm of Speck32/64
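For concreteness, the following is a minimal Python sketch of the Speck32/64 round function and key schedule, assuming the standard specification from [5] (rotation amounts 7 and 2); the variable names are illustrative and this is not the authors' implementation.

```python
MASK = 0xFFFF  # 16-bit words

def rol(x, r):
    return ((x << r) | (x >> (16 - r))) & MASK

def ror(x, r):
    return ((x >> r) | (x << (16 - r))) & MASK

def round_function(cl, cr, k):
    # One round: rotate, add modulo 2^16, XOR the subkey, rotate, XOR.
    cl = ((ror(cl, 7) + cr) & MASK) ^ k
    cr = rol(cr, 2) ^ cl
    return cl, cr

def expand_key(l2, l1, l0, k0, rounds=22):
    # The key schedule reuses the round-function structure on (l, k).
    ls, ks = [l0, l1, l2], [k0]
    for i in range(rounds - 1):
        l_new = ((ks[i] + ror(ls[i], 7)) & MASK) ^ i
        ks.append(rol(ks[i], 2) ^ l_new)
        ls.append(l_new)
    return ks

def encrypt(pl, pr, round_keys):
    for k in round_keys:
        pl, pr = round_function(pl, pr, k)
    return pl, pr
```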

2.3 Overview of Differential-Neural Cryptanalysis

The differential-neural distinguisher operates as a supervised model, distinguishing whether ciphertext pairs originate from plaintext pairs with a defined input difference or from random pairs. Given m plaintext pairs \(\{(P_j,P'_j), j \in [0, m-1]\}\), the corresponding ciphertext pairs \(\{(C_j,C'_j), j \in [0, m-1]\}\) constitute a sample (In [15], \(m = 1\)). Each training sample is associated with a label Y defined as:

$$Y=\left\{ \begin{array}{l} 1, \text{ if } P_j \oplus P'_j=\varDelta , j \in [0, m-1] \\ 0, \text{ if } P_j \oplus P'_j \ne \varDelta , j \in [0, m-1] \end{array}\right. $$
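As an illustration of this labeling convention (with \(m = 1\) as in [15]), the following sketch generates a balanced set of positive and negative samples; it reuses the hypothetical `encrypt`/`expand_key` helpers from the sketch in Sect. 2.2 and is not the authors' data pipeline.

```python
import numpy as np

def make_dataset(n_samples, n_rounds, delta=(0x0040, 0x0000)):
    rng = np.random.default_rng()
    data = np.zeros((n_samples, 4), dtype=np.uint16)  # ciphertext words (A, B, C, D)
    labels = rng.integers(0, 2, size=n_samples, dtype=np.uint8)
    for i in range(n_samples):
        key = [int(w) for w in rng.integers(0, 1 << 16, size=4)]
        rks = expand_key(*key, rounds=n_rounds)
        pl, pr = int(rng.integers(0, 1 << 16)), int(rng.integers(0, 1 << 16))
        if labels[i] == 1:   # positive: second plaintext differs by delta
            ql, qr = pl ^ delta[0], pr ^ delta[1]
        else:                # negative: an unrelated random plaintext pair
            ql, qr = int(rng.integers(0, 1 << 16)), int(rng.integers(0, 1 << 16))
        data[i, 0:2] = encrypt(pl, pr, rks)
        data[i, 2:4] = encrypt(ql, qr, rks)
    return data, labels
```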

The \(\mathcal {N}\mathcal {D}\) architecture from [15] uses the prevalent ResNet. It comprises an initial input block, several residual blocks, and a prediction output layer.

In [15], three training schemes are proposed: a) Basic training for short-round distinguishers. b) An enhanced method using the KeyAveraging simulation and an \((r-1)\)-round distinguisher, achieving the optimal 7-round \(\mathcal {N}\mathcal {D}\) for Speck. c) A staged training approach evolving a pre-trained \((r-1)\)-round distinguisher to an r-round one in stages, yielding the most extended \(\mathcal {N}\mathcal {D}\) on Speck, covering 8 rounds. In [15], Gohr also showed how to combine a neural distinguisher with a classical differential and use a Bayesian-optimized key-guessing strategy for key recovery. Later, in [16], the authors provided general guidelines for optimizing Gohr's neural network and diverse optimization approaches across different ciphers, highlighting its efficacy and versatility. The authors also clarified for which kinds of ciphers the neural network cannot learn features beyond differential ones.

3 Explicitly Explain Knowledge Beyond Full DDT

Studies show that differential-based neural distinguishers often outperform DDT-based ones on certain ciphers [3, 15, 16]. However, what specific knowledge these neural distinguishers learn beyond the DDT remains elusive. Prior research suggests that these distinguishers rely on differential distributions in the last two rounds and differential-linear (DL) properties [7, 11]. In [15], a “Real Differences Experiment” was conducted to observe how well neural networks could detect real differences beyond the DDT. The experiment used randomized ciphertext pairs with a blinding value R introduced to obscure information beyond the difference. Results showed that neural networks could detect real differences without explicit training, and that ciphertext pairs have non-uniform distributions within their difference equivalence classes. However, when blinding values of the form \(R = (a, a)\) (with a being any 16-bit word) were used, the distinguishers failed (henceforth referred to as Gohr's aaaa-blinding experiment). This underlines that the neural distinguishers are not exploiting the key schedule, and that they can make finer distinctions than mere difference equivalence classes. These insights are crucial to explicitly explaining \(\mathcal {N}\mathcal {D}\)'s superior classification mechanism. Based on these studies, this section takes a further step towards fully interpreting the knowledge that an \(\mathcal {N}\mathcal {D}\) has captured beyond the full differential distribution.

We begin by locating the root of the performance improvement, then deduce the specific pattern that causes the improvement, and finally use this pattern to improve the pure DDT-based distinguisher.

3.1 Locating Information Used by \(\mathcal {N}\mathcal {D}\)s of Speck Beyond DDT

In the following, we start with a generalized definition of information that the differential-neural distinguisher might use.

Generalized Definition of XOR Information. In Gohr’s differential-neural distinguishers, given Speck ’s Feistel-like structure, samples are split into four words: \(\mathcal {A}, \mathcal {B}\) (forming the first ciphertext) and \(\mathcal {C}, \mathcal {D}\) (forming the second), as depicted in Fig. 2. In subsequent discussions, a symbol’s superscript denotes the number of encryption rounds. The absence of a superscript implies r rounds.

Fig. 2. Definition of XOR information

Traditional differential distinguishers focus solely on the difference of ciphertext pairs. Yet, as indicated in prior research [13, 22, 27], internal differentials can also be pivotal in cryptanalytic tasks.

We broaden the focus to include the XOR interactions among \(\mathcal {A}\), \(\mathcal {B}\), \(\mathcal {C}\), and \(\mathcal {D}\). For brevity, XOR combinations like \(\mathcal {A}\oplus \mathcal {B} \oplus \mathcal {C} \oplus \mathcal {D}\) are shortened to \(\mathcal {ABCD}\). In other words, beyond the traditionally focused differences like \(\mathcal{A}\mathcal{C}\) and \(\mathcal{B}\mathcal{D}\), we explore under-emphasized XORs such as \(\mathcal{A}\mathcal{B}\), \(\mathcal{C}\mathcal{D}\), \(\mathcal{A}\mathcal{D}\), \(\mathcal{B}\mathcal{C}\), and \(\mathcal {ABCD}\). For clarity, we classify these XORs as: Inter-XOR (\(\mathcal{A}\mathcal{C}\), \(\mathcal{B}\mathcal{D}\)), Intra-XOR (\(\mathcal{A}\mathcal{B}\), \(\mathcal{C}\mathcal{D}\)), Cross-XOR (\(\mathcal{A}\mathcal{D}\), \(\mathcal{B}\mathcal{C}\)), and Total-XOR (\(\mathcal {ABCD}\)).

In Speck, Intra-XOR and Total-XOR relate to values and differences from the prior round. Specifically, Intra-XOR helps deduce the right-half values, and Total-XOR deduces the right-half differences of the preceding round.
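For clarity, the following small sketch (word names as defined above; purely illustrative) computes the four kinds of XOR information from a ciphertext pair.

```python
def xor_information(A, B, C, D):
    # A, B: words of the first ciphertext; C, D: words of the second.
    return {
        "Inter-XOR": (A ^ C, B ^ D),   # the classical ciphertext-pair differences
        "Intra-XOR": (A ^ B, C ^ D),   # relates to right-half values of the prior round
        "Cross-XOR": (A ^ D, B ^ C),
        "Total-XOR": A ^ B ^ C ^ D,    # relates to the right-half difference of the prior round
    }
```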

Table 2. Experimental results detailing the information harnessed by \(\mathcal {N}\mathcal {D}\)s. Each set comprises both positive and negative samples. The notation (\(\mathcal {A}, \mathcal {B}, \mathcal {C}, \mathcal {D}\)) denotes ciphertext pairs derived from plaintext pairs with an input difference of (0040,0000), while Random signifies pairs generated from random values. \(\mathcal {R}_1\) refers to a random value.

Is XOR Information the Sole Basis for the Differential-Neural Distinguisher's Decision Making? Using a mechanical method to determine relations between information sets, we show that focusing solely on the specified XOR information suffices to locate the source of the information that \(\mathcal {N}\mathcal {D}\)s exploit beyond the difference information.

Determine Relations Between Information Sets Mechanically. Consider a pair of ciphertexts from a round-reduced Speck, denoted as \(C_0 = ({C_0}_L, {C_0}_R)\) and \(C_1 = ({C_1}_L, {C_1}_R)\). Each ciphertext splits into two parts, with \({C_i}_J \in \mathbb {F}_2^{b}\) for \(i\in \{0,1\}\) and \(J \in \{L, R\}\). For Speck32/64, \(b = 16\). Let K be the last round key, with \(K \in \mathbb {F}_2^{b}\). For each \(C_i\), let \({M_i}_L\) and \({M_i}_R\) represent the state value immediately preceding the XOR with key K and before the XOR between the left and right branches for \(i\in \{0,1\}\). That is

$$ {C_0}_L = {M_0}_L \oplus K, \quad {C_0}_R = {M_0}_L \oplus {M_0}_R \oplus K, \quad {C_1}_L = {M_1}_L \oplus K, \quad {C_1}_R = {M_1}_L \oplus {M_1}_R \oplus K. $$

The method to determine relations between information sets can be outlined in the following steps: Let \(\mathcal {R}_1\) and \(\mathcal {R}_2\) be two random values in \(\mathbb {F}_2^{b}\).

  1. Setup:

     (a) Set up a vector space \(\mathcal {V}\) over the field \(\mathbb {F}_2\) with dimension 7.

     (b) Define various basis vectors for \(\mathcal {V}\), acting as linear masks whose non-zero bits indicate the variable selection from the vector \([{M_0}_L, {M_0}_R, {M_1}_L, {M_1}_R, K, \mathcal {R}_1, \mathcal {R}_2]\). Concretely,

     $$ \begin{array}{llll} \varGamma _{{M_0}_L} &= \texttt {[1,0,0,0,0,0,0]} & \varGamma _{{M_1}_L} &= \texttt {[0,0,1,0,0,0,0]} \\ \varGamma _{{M_0}_R} &= \texttt {[0,1,0,0,0,0,0]} & \varGamma _{{M_1}_R} &= \texttt {[0,0,0,1,0,0,0]} \\ \varGamma _{ K} &= \texttt {[0,0,0,0,1,0,0]} & \varGamma _{\mathcal {R}_1} &= \texttt {[0,0,0,0,0,1,0]} \\ & & \varGamma _{\mathcal {R}_2} &= \texttt {[0,0,0,0,0,0,1]} \\ \end{array} $$

     Accordingly, \([{C_0}_L, {C_0}_R, {C_1}_L, {C_1}_R]\) can be obtained using the following masks:

     $$\begin{array}{ll} \varGamma _{{C_0}_L} := \varGamma _{\mathcal {A}} = \varGamma _{{M_0}_L} \oplus \varGamma _{ K} &= \texttt {[1,0,0,0,1,0,0]}, \\ \varGamma _{{C_0}_R} := \varGamma _{\mathcal {B}} = \varGamma _{{M_0}_L} \oplus \varGamma _{{M_0}_R} \oplus \varGamma _{ K} &= \texttt {[1,1,0,0,1,0,0]}, \\ \varGamma _{{C_1}_L} := \varGamma _{\mathcal {C}} = \varGamma _{{M_1}_L} \oplus \varGamma _{ K} &= \texttt {[0,0,1,0,1,0,0]}, \\ \varGamma _{{C_1}_R} := \varGamma _{\mathcal {D}} = \varGamma _{{M_1}_L} \oplus \varGamma _{{M_1}_R} \oplus \varGamma _{ K} &= \texttt {[0,0,1,1,1,0,0]}. \\ \end{array} $$

    Besides, we have \(\varGamma _{\mathcal{X}\mathcal{Y}} = \varGamma _{\mathcal {X}} \oplus \varGamma _{\mathcal {Y}}\) for \(\mathcal {X}, \mathcal {Y} \in \{ \mathcal {A}, \mathcal {B}, \mathcal {C}, \mathcal {D}, \mathcal{A}\mathcal{C}, \mathcal{B}\mathcal{D}, \mathcal{A}\mathcal{B}, \mathcal {R}_1, \mathcal {R}_2\}\).

  2. Subspace Generation: Create the subspaces from the given vectors and combinations:

     • Set-1-1: span of \(\{\varGamma _{\mathcal {A}}, \varGamma _{\mathcal {B}}, \varGamma _{\mathcal {C}}, \varGamma _{\mathcal {D}}\}\).

     • Set-1-2: span of \(\{\varGamma _{\mathcal {A}\mathcal {R}_1}, \varGamma _{\mathcal {B}\mathcal {R}_1}, \varGamma _{\mathcal {C}\mathcal {R}_1}, \varGamma _{\mathcal {D}\mathcal {R}_1}\}\).

     • Set-1-X: span of \(\{\varGamma _{\mathcal{A}\mathcal{C}}, \varGamma _{\mathcal{B}\mathcal{D}}, \varGamma _{\mathcal{A}\mathcal{B}}\}\).

     • Set-2-1: span of \(\{\varGamma _{\mathcal {A}\mathcal {R}_1}, \varGamma _{\mathcal {B}\mathcal {R}_2}, \varGamma _{\mathcal {C}\mathcal {R}_1}, \varGamma _{\mathcal {D}\mathcal {R}_2}\}\).

     • Set-2-2: span of \(\{\varGamma _{\mathcal {A}\mathcal {R}_1}, \varGamma _{\mathcal {B}\mathcal {R}_2}, \varGamma _{\mathcal {C}\mathcal {R}_2}, \varGamma _{\mathcal {D}\mathcal {R}_1}\}\).

     • Set-2-3: span of \(\{\varGamma _{\mathcal {ABCD}}\}\).

     Note that \(\texttt {Set-1-2}\) is the setting of Gohr's aaaa-blinding experiment.

  3. Remove randomness: In light of the observations from [15], where it's determined that \(\mathcal {N}\mathcal {D}\)s in the single-key attack setting don't leverage the key schedule, we can adapt the Speck32/64 key schedule to employ independent subkeys. This means we treat K along with \(\mathcal {R}_1\) and \(\mathcal {R}_2\) as random variables. Consequently, any vector that has a component of \(\varGamma _{ K}\), \(\varGamma _{ \mathcal {R}_1}\), or \(\varGamma _{ \mathcal {R}_2}\) is deemed random, and hence, devoid of information. For example, \(\varGamma _{{C_i}_J}\) has a linear component \(\varGamma _{ K}\); thus, a standalone \({C_i}_J\) lacks information, where \(i\in \{0,1\}\) and \(J \in \{L, R\}\). Accordingly, we do as follows.

     (a) After creating each subspace, randomness is removed from each subspace according to whether a vector has a component from \(\varGamma _{ K}\), \(\varGamma _{\mathcal {R}_1}\), or \(\varGamma _{\mathcal {R}_2}\). Without ambiguity, the sanitized sets are also denoted by Set-i-j for \(i\in \{\texttt {1,2}\}\) and \(j \in \{\texttt {1,2,3,X}\}\).

  4. Comparison: The sanitized sets are then compared against each other to determine if one set equals or is a subset of the other.

The result shows that \(\texttt {Set-1-1}\) equals \(\texttt {Set-1-2}\) and \(\texttt {Set-1-X}\), meaning that the combination of Inter-XOR and Intra-XOR is exactly what an information-theoretically optimal distinguisher accepting ciphertext pairs can use under the assumption that it does not use key-schedule.
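The mechanical comparison itself is straightforward to reproduce; a minimal sketch follows, representing the 7-dimensional masks as 7-bit integers (the bit ordering and names are illustrative, not the authors' code).

```python
M0L, M0R, M1L, M1R, K, R1, R2 = (1 << i for i in range(7))
A, B = M0L ^ K, M0L ^ M0R ^ K          # Gamma_A, Gamma_B
C, D = M1L ^ K, M1L ^ M1R ^ K          # Gamma_C, Gamma_D
RANDOM = K | R1 | R2                   # components that make a mask information-free

def span(generators):
    vecs = {0}
    for g in generators:
        vecs |= {v ^ g for v in vecs}
    return vecs

def sanitize(vecs):
    # Drop every mask touching K, R1, or R2 (it selects random bits only).
    return frozenset(v for v in vecs if v & RANDOM == 0)

set_1_1 = sanitize(span([A, B, C, D]))
set_1_2 = sanitize(span([A ^ R1, B ^ R1, C ^ R1, D ^ R1]))
set_1_X = sanitize(span([A ^ C, B ^ D, A ^ B]))
print(set_1_1 == set_1_2 == set_1_X)   # True: the three sanitized sets coincide
```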

As we proceed, we delve deeper to ascertain the specific XOR information that holds significance.

Table 3. Experimental results of \(\mathcal {N}\mathcal {D}\)s leveraging selected XOR information.

Which of the XOR Information is Significant for Differential-Neural Distinguisher? To isolate the pivotal XOR information, we conducted experiments where a differential-neural distinguisher was given access to only selected XOR data.

All our subsequent experiments were conducted on a 6-round Speck32/64 with an input difference of (0040,0000), adhering to the configurations presented in Table 17 in [4]. The differential-neural distinguishers trained as per Table 2 Set.1-1 to Set.1-3 serve as baselines (Set.1-2 and Set.1-3 correspond to Gohr's aaaa-blinding experiment). In the sequel, we use Set.i-j to refer to the experimental setup, while Set-i-j represents the associated information set for the positive samples, where \(i \in \{1,2\}\) and \(j \in \{1,2,3\}\).

Defining \(\mathcal {R}_1\) and \(\mathcal {R}_2\) as two distinct random values, Set.2-1 in Table 3 retains only Inter-XOR and Total-XOR, while Set.2-2 keeps only Cross-XOR and Total-XOR. Set.2-3, on the other hand, exclusively considers Total-XOR. Firstly, our mechanical analysis of sanitized subspaces reveals the following relations:

[Relations among the sanitized sets Set-2-1, Set-2-2, and Set-2-3 (figure omitted)]

In Table 3 Set.2-1, the differential-neural distinguisher's access is limited to Inter-XOR and Total-XOR – equivalent to what the DDT distinguisher utilizes. Its accuracy aligns closely with the 6-round DDT's accuracy of 0.758, without any noticeable enhancement. This underscores that the differential-neural distinguisher's advantage over the DDT arises from its access to extra information. From this observation, we reinforce the subsequent conclusion.

Conclusion 1

The differential-neural distinguisher \(\mathcal {N}\mathcal {D}^{\textsc {Speck}_{rR}}\) 's superiority over \(\mathcal {D}\mathcal {D}^{\textsc {Speck}_{rR}}\) is mainly due to its exploitation of Intra-XOR and Cross-XOR.

This conclusion naturally prompts a more intricate query: How does the differential-neural distinguisher effectively exploit Intra-XOR and Cross-XOR? Upon closer inspection, we can further dismiss the significance of Cross-XOR alone. Given that \(\texttt {Set-2-3} \subset \texttt {Set-2-2}\), it is evident that Set-2-3 inherently provides less data than Set-2-2. Yet, while in Set.2-2 combining Total-XOR with either Intra-XOR or Cross-XOR results in a valid distinguisher, solely using Total-XOR in Set.2-3 yields an accuracy identical to that of the distinguisher in Set.2-2. From this, we conclude that Cross-XOR on its own lacks significance. The differential-neural distinguisher likely uncovers new patterns by melding Inter-XOR with either Intra-XOR or Cross-XOR. This line of reasoning culminates in the following conclusion.

Conclusion 2

Unlike Inter-XOR, neither Intra-XOR nor Cross-XOR independently offers useful information. The differential-neural distinguisher relies on combinations of Inter-XOR with either Intra-XOR or Cross-XOR.

Remark 1

(On \(\mathcal {N}\mathcal {D}\) exploiting the key schedule). Gohr’s study in [15] indicates that \(\mathcal {N}\mathcal {D}\)s, in a single-key attack on Speck, do not exploit the key schedule. It naturally raises the question: Do \(\mathcal {N}\mathcal {D}\)s behave similarly in related-key scenarios? Motivated by this, we conduct comparison experiments similar to Gohr’s aaaa-blinding experiment (comparing \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s in Set.1-1 and Set.2-1), investigating whether \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s use the same ciphertext equivalence classes as the single-key \(\mathcal {N}\mathcal {D}\)s by [15]. In Sect. 5.1, we delve deep into our \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s and present an interesting observation reinforcing our following \(\mathcal {N}\mathcal {D}\) explainability in Sect. 3.2.

3.2 Explicit Rules to Exploit the Information Beyond Full DDT: From a Cryptanalytic Perspective

This section delves into the exact patterns harnessed by the differential-neural distinguisher. Our exploration commences with an intriguing observation from Experiment A, as described in [7]. The experiment unfolds as follows:

  1. For each 5-round ciphertext pair difference, \(\delta \), which results in extreme scores surpassing 0.9 (indicative of a good score) and exhibiting a high frequency of occurrence:

     (a) Generate a set of \(10^4\) random 32-bit numbers.

     (b) Utilize the difference \(\delta \) to construct a dataset encompassing \(10^4\) data pairs, each bearing the difference \(\delta \).

     (c) Feed the dataset to the differential-neural distinguisher and count the predicted labels.

While DDT-based distinguishers would predict Experiment A's entire data as positive, the differential-neural distinguisher \(\mathcal {N}\mathcal {D}\) does not. For \(\mathcal {N}\mathcal {D}\), the proportion of each difference is consistently at 0.75 (refer to Table 19 in the full version [4]), suggesting that the \(\mathcal {N}\mathcal {D}\) employs criteria beyond simple differential probability in its classifications. The consistent proportion of 0.75 also implies a discernible pattern linked to two specific bits. If a ciphertext pair aligns with this two-bit pattern, it is classified as negative, regardless of high output difference probabilities. This observation prompts an investigation into the potential two-bit pattern, motivating us to look into properties of addition modulo \(2^n\) (\(\boxplus \)) from a cryptanalytic perspective.

Enhancing DDT-Based Distinguishers via Conditional Probabilities. In r-round Speck32/64, denote the input and output differences of the last \(\boxplus \) by (\(\alpha \), \(\beta \), \(\gamma \)), and their respective values by (x, y, z) and (\(x'\), \(y'\), \(z'\)). For each output pair \(((C_L, C_R), (C'_L, C'_R))\), one knows the following information: \(\gamma = C_L \oplus C'_L\), \(\beta = (C_L \oplus C_R \oplus C'_L \oplus C'_R)^{\ggg 2}\), and \(y = (C_L \oplus C_R)^{\ggg 2}\). Namely, apart from knowing two differences (i.e., \(\beta \) and \(\gamma \)), one knows a value (i.e., y) around the last \(\boxplus \). Besides, the input difference \(\alpha \) is unknown but might be biased among positive samples and thus is predictable. Concretely, the attributes of the information around the last \(\boxplus \) are as follows:

  \(\alpha \): unknown but biased,  x: unknown and balanced;
  \(\beta \): known,  y: known;
  \(\gamma \): known,  z: unknown and balanced.

The knowledge of y, which is one of two inputs of the last \(\boxplus \), provides additional information apart from the differences. The concrete analysis is as follows.
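The extraction of these quantities from an r-round output pair is a one-liner each; a small sketch follows (`ror` as in the Speck sketch of Sect. 2.2; names are illustrative).

```python
def last_addition_view(cl, cr, cl2, cr2):
    gamma = cl ^ cl2                    # output difference of the last modular addition
    beta = ror(cl ^ cr ^ cl2 ^ cr2, 2)  # difference of the known input y
    y = ror(cl ^ cr, 2)                 # the known input value to the last modular addition
    return gamma, beta, y
```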

When conditioned on a fixed y, the differential probability can differ from the average probability over all possible y. For a valid differential propagation \((\alpha , \beta \mapsto \gamma )\) through \(\boxplus \), consider each bit position i where \(0 \le i < n - 1\): If \(\texttt{eq} (\alpha , \beta , \gamma )_i = 1\), the difference propagation at the \((i+1)\)-th position is deterministic, as elucidated in [20]; conversely, for \(\texttt{eq} (\alpha , \beta , \gamma )_i = 0\), the \((i+1)\)-th bit's difference propagation is probabilistic; for the \((i+1)\)-th bit difference to be fulfilled, the input values at the i-th position (namely, \(x_i\), \(y_i\), \(c_i\) – the carry's i-th bit) must satisfy a certain linear constraint, detailed in Observation 1.

Observation 1

([10]). Let \(\delta = (\alpha , \beta \mapsto \gamma )\) be a possible XOR-differential through addition modulo \(2^n\) (\(\boxplus \)). Let (x, y) and \((x\oplus \alpha , y \oplus \beta )\) be a conforming pair of \(\delta \); then x and y satisfy the following. For \(0 \le i < n - 1\), if \(\texttt{eq} (\alpha , \beta , \gamma )_i = 0\),

$$ \left. \begin{array}{ll} x_i \oplus y_i = \texttt{xor}(\alpha , \beta , \gamma )_{i+1} \oplus \alpha _i, & \quad \text {if } \alpha _i \oplus \beta _i = 0, \\ \left. \begin{array}{ll} x_i \oplus c_i = \texttt{xor}(\alpha , \beta , \gamma )_{i+1} \oplus \alpha _i, &\quad \text {if } \alpha _i \oplus \texttt{xor}(\alpha , \beta , \gamma )_i = 0, \\ y_i \oplus c_i = \texttt{xor}(\alpha , \beta , \gamma )_{i+1} \oplus \beta _i, &\quad \text {if } \alpha _i \oplus \texttt{xor}(\alpha , \beta , \gamma )_i = 1, \end{array}\right\} &\quad \text {if } \alpha _i \oplus \beta _i = 1, \\ \end{array}\right\} $$

where \(c_i\) is the i-th carry bit, \(x \boxplus y = z\), \(\texttt{eq} (a, b, d) = (\lnot a \oplus b) \wedge (\lnot a \oplus d)\) (i.e., \(\texttt{eq} (a, b, d) = 1\) if and only if \(a = b = d\)), and \(\texttt{xor}(a, b, d) =a\oplus b \oplus d\).

In other words, at bit positions i and \(i+1\), a valid difference tuple \((\alpha _{i+1,i}, \beta _{i+1,i}, \gamma _{i+1,i})\) that satisfies \(\texttt{eq} (\alpha _i, \beta _i, \gamma _i) = 0\) imposes a 1-bit linear constraint on the tuple \((x_i, y_i, c_i)\). As \(c_i\) is determined by lower bits, the freedom for conforming to the constraint comes exclusively from the i-th bits of x and y, independent of constraints at other bit positions. Accordingly, the constraints on (\(x_i\), \(y_i\)), (\(x_i\), \(c_i\)), or (\(y_i\), \(c_i\)) as listed in Observation 1 are necessary and sufficient. Therefore, when the constraint at a bit position is fulfilled, the conditional probability \(\tilde{p}\) of a differential whose unconditional probability is p should be calculated as \(2\cdot p\); when unfulfilled, it is 0. In comparison, the conditional probability for random pairs is still at most \(2^{-n}\). Hence, leveraging conditional probability for classification amplifies the advantage.
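As a quick sanity check of this conditional-probability effect, the following brute-force sketch uses a toy 8-bit modular addition and an illustrative differential exhibiting case Cyc at bit 0 (so the constraint depends only on \(y_0\)); the printed numbers match the predicted "doubled or zero" behavior. The chosen differential is not taken from the paper.

```python
N = 8
M = (1 << N) - 1
alpha, beta, gamma = 0x01, 0x00, 0x03  # illustrative differential through an 8-bit modular addition

def conforms(x, y):
    return (((x + y) & M) ^ (((x ^ alpha) + (y ^ beta)) & M)) == gamma

avg = sum(conforms(x, y) for x in range(1 << N) for y in range(1 << N)) / (1 << (2 * N))
p_y0_is_1 = sum(conforms(x, 0x01) for x in range(1 << N)) / (1 << N)
p_y0_is_0 = sum(conforms(x, 0x00) for x in range(1 << N)) / (1 << N)
print(avg, p_y0_is_1, p_y0_is_0)  # 0.25, 0.5, 0.0: doubled when the constraint holds, zero otherwise
```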

To clarify when the fulfilment of the constraints at the last \(\boxplus \) can be effectively predicted, we catalog the cases from Observation 1 in Table 4, naming them Cxy\(_{(i+1, i)}\), Cxc\(_{(i+1, i)}\), and Cyc\(_{(i+1,i)}\). As analyzed above, in Speck32/64 's last \(\boxplus \), among the tuple (x, y, c) (with \(c = z \oplus x \oplus y\) and unknown z), only y is known. Hence, exploiting knowledge of y requires examining bit positions whose differential constraints fall under case Cyc\(_{(i+1, i)}\) in Table 4.

Table 4. Necessary and sufficient conditions for a one-bit difference from Observation 1

In the Cyc\(_{(i+1, i)}\) case, the constraint is on \(y_i \oplus c_i\). While \(c_i\) may seem unknown, it is determined by lower bits: \(c_i = x_{i-1}y_{i-1} \oplus (x_{i-1} \oplus y_{i-1})c_{i-1}\). The value of \(c_i\) might be inferred if the \((i-1)\)-th bit differences meet the condition \(\texttt{eq} (\alpha _{i-1}, \beta _{i-1}, \gamma _{i-1}) = 0\), as per Observation 1. For example, when \({\left\{ \begin{array}{ll} (\alpha _{i}, \beta _{i}, \gamma _{i}) = (0, 1, 0),\\ (\alpha _{i-1}, \beta _{i-1}, \gamma _{i-1}) = (1, 1, 0) \end{array}\right. }\), one knows that \({\left\{ \begin{array}{ll} \texttt{eq} (\alpha , \beta , \gamma )_{i-1} = 0, \\ \alpha _{i-1} \oplus \beta _{i-1} = 0, \\ \texttt{xor}(\alpha , \beta , \gamma )_{i} \oplus \alpha _{i-1} = 0. \end{array}\right. } \)

From Table 4, one has \(x_{i-1} \oplus y_{i-1} = 0\). Thus, \(c_i = x_{i-1}y_{i-1} \oplus (x_{i-1} \oplus y_{i-1})c_{i-1} = y_{i-1}\). Therefore, \(y_i \oplus c_i = y_i \oplus y_{i-1}\). As a consequence, one can predict the fulfilment of constraint in case Cyc\(_{(i+1, i)}\) by observing whether \(y_i \oplus y_{i-1} = \texttt{xor}(\alpha , \beta , \gamma )_{i+1} \oplus \beta _i\). Table 5 lists more cases where \(c_i\) might be known.

Table 5. Cases for deducing the i-th carry bit \(c_i\)

Incorporating observations from Table 4 and Table 5, one gets Table 6, which lists various cases where the knowledge of y can be used to determine the satisfaction of differential constraints.

Table 6. Cases where the knowledge on y can be used to check the fulfilment of the differential constraints

Note that apart from the general cases (C3 and C4) at the i-th bit, special cases (C1 and C2) emerge at the two least significant bits due to the carry bit \(c_0\) being 0. For example,

  1. at the 0th bit position, observing \(\beta _0 = 0\) and \(\gamma _0 = 1\) determines \(\alpha _0 = 1\) based on Alg. 3 in [4]. From case Cyc\(_{(i+1, i)}\) in Table 4 and given \(c_0 = 0\), one knows that \(\texttt{xor}(\alpha , \beta , \gamma )_1 \oplus \beta _0 = y_0 \oplus c_0 = y_0\);

  2. at the 1st bit position, \(c_1 = x_0y_0 \oplus (x_0 \oplus y_0)c_0 = x_0y_0\). Given an observed \(y_0=0\), one knows \(c_1 = 0\). Consequently, in case Cyc\(_{(2,1)}\) and \(y_0=0\), one knows \(\texttt{xor}(\alpha , \beta , \gamma )_{2} \oplus \beta _1 = y_1 \oplus c_1 = y_1\);

  3. in general case C3, based on Table 5, \(c_i\) is determined as \(y_{i-1}\), leading to the use of \(y_i \oplus y_{i-1}\);

  4. in general case C4, applying Table 5 to the \((i-1, i-2)\)-th bit position, it is inferred that \(c_i = c_{i-1} = y_{i-2}\), leading to the use of \(y_i \oplus y_{i-2}\);

  5. for cases where \(c_{i-1}=c_{i-2}\), one can further observe differences at the \((i-2)\)-th bit position and continue deducing \(c_{i-2}\) by observing bit differences at the \((i-3)\)-th position.

Table 7 lists some concrete examples of differential patterns where the observation of y enables prediction of whether differential constraints are met.

Remark 2

These constraints on values for valid differential propagation resonate with established concepts. Specifically, insights derived from Table 6 align with findings on multi-bit constraints from [18, 19], quasi-differential trails in [8], and extended differential-linear approximations in [11]. Table 7 exhibits the correspondence between examples of cases in Table 6 and these established concepts. For instance, given a differential propagation \((\alpha _{i+1,i,i-1}, \beta _{i+1,i,i-1} \mapsto \gamma _{i+1,i,i-1}) = (\texttt {*01}, \texttt {*11} \mapsto \texttt {*00}) \) (for \(0 < i < n - 1\)),

  1. using the 1.5-bit constraints concept and the finite state machines representing the differential properties of modular addition from [18, 19], one can get a new constraint and refine the propagation to (where the notations \(\{\texttt {-}, \texttt {x}, \texttt {>}, \texttt {<}, \texttt {=}, \texttt {!}\}\) are explained below Table 7); more generally, C3 cases correspond to the 1.5-bit constraints \(\{\texttt {>}, \texttt {<}, \texttt {=}, \texttt {!}\}\) in [18, 19];

  2. using the quasi-differential trail concept from [8], the differential trail \((\texttt {001}, \texttt {011} \mapsto \texttt {000})\) comprises a non-trivial quasi-differential trail with a mask of \((\texttt {000}, \texttt {011} \mapsto \texttt {000})\). The non-trivial quasi-differential trail has correlation \(-2^{-1}\) (i.e., additional weight of 0). Consequently, the “fixed-y” probability of this differential trail is \((1 - (-1)^{y_i \oplus y_{i-1}}) \cdot 2^{-1}\), i.e., the probability equals 1 when \(y_i \oplus y_{i-1} = 1\) and 0 in the opposite case;

  3. using the extended differential-linear connectivity table (EDLCT) concept from [11], assessing the constraint aligns with gauging the bias of the linear approximation that corresponds to selecting bits [\(x_{i+1}\), \(y_{i+1}\), \(z_{i+1}\), ] and [\(x'_{i+1}\), \(y'_{i+1}\), \(z'_{i+1}\), ].

  As noted in [7], \(\mathcal{N}\mathcal{D}\)s rely on differential-linear (DL) properties. We note that pure DL properties do not provide additional information beyond the full DDT; the differential-linear distribution can be directly derived from the full differential distribution. It is the extended differential-linear distribution [11] (which includes the selection of ciphertext values apart from differences) that contains additional information.

Table 7. Concrete examples of differential patterns where one can predict the fulfilment of the differential constraints by observing the value of y

To directly exploit these observations for r-round Speck32/64, a prerequisite is to effectively predict the input difference \(\alpha \) at the last \(\boxplus \), which equals \({({(\delta _R^{r-2})}^{\lll 2} \oplus \delta _R^{r-1})}^{\ggg 7}\). Given the known \(\delta _R^{r-1}\) from the r-round outputs, the focus shifts to predicting \({(\delta _R^{r-2})}^{\lll 2}\). Notably, for \(r \le 7\) and input difference \(\texttt {(0040, 0000)}\), some bits of \({(\delta _R^{r-2})}^{\lll 2}\) exhibit bias, as detailed in Table 8, enabling predictions of \(\alpha \) for positive samples.

Table 8. Bit bias towards ‘0’ of \({(\delta _R^{r-2})}^{\lll 2}\) for \(4\le r \le 7\), where the input difference of the plaintext is (0040,0000). A positive (resp. negative) value indicates a bias towards ‘0’ (resp. ‘1’).

A Simple Procedure to Improve the DDT-Based Distinguisher. To improve a DDT-based distinguisher for an r-round Speck32/64 using its \(\textrm{DDT} _{\texttt {(0040, 0000)}}\), we proceed as follows, resulting in distinguishers named \(\mathcal {Y}\mathcal {D}^{\textsc {Speck}_{rR}} \):

  1. Compute the bias (towards 0) of each bit of \({{(\delta _R^{r-2})}^{\lll 2}}\);

  2. Predict bit values for \({{(\delta _R^{r-2})}^{\lll 2}}\) based on their biases: assign a value of 0 if the bias is \(\ge 0\) and 1 otherwise;

  3. Define the absolute bias of the i-th bit of \({({(\delta _R^{r-2})}^{\lll 2})}^{\ggg 7}\) as \(\epsilon _{\alpha }(i)\);

  4. For each output pair of r-round Speck32/64, use Alg. 1 to predict its classification.

[Algorithm 1]

Results of Improving the DDT-Based Distinguisher. Table 9 presents the performance of \(\mathcal {Y}\mathcal {D}^{\textsc {Speck}_{rR}} \) distinguishers, derived from the described enhancement of \(\mathcal {D}\mathcal {D}^{\textsc {Speck}_{rR}} \). For rounds \(4 \le r \le 7\), \(\mathcal {Y}\mathcal {D}^{\textsc {Speck}_{rR}} \) typically shows improvement. In contrast, when applying a similar method to adjust the \(\mathcal {N}\mathcal {D}^{\textsc {Speck}_{rR}}\) score Z (converting the score Z to a probability p using \(p = Z/(1-Z) \cdot 2^{-n}\)), the accuracy does not improve. It is unchanged for \(\mathcal {N}\mathcal {D}^{\textsc {Speck}_{4R}}\) and marginally degrades for rounds \(5 \le r \le 7\) since the threshold \(\tau \) is set below 0.5. This suggests that the additional information useful for improving DDT-based distinguishers does not help improve the \(\mathcal {N}\mathcal {D}\)s; thus, the \(\mathcal {N}\mathcal {D}\)s might have already maximally utilized this information. Thus, we conclude as follows.

Conclusion 3

By utilizing conditional differential distributions when the input and/or output values of the last nonlinear operation are observable, a distinguisher can surpass pure DDT-based counterparts. Accordingly, if these conditional distributions differ greatly from the averaged differential distribution, and the satisfaction of the conditions is either observable or effectively predictable, then r-round \(\mathcal {N}\mathcal {D}\)s can outperform r-round DDT-based distinguishers.

For Speck, one of the two inputs of the last non-linear operation (\(\boxplus \)) is observable. If conditioned on this input, the conditional differential distribution can diverge significantly from the averaged one. Therefore, an optimal distinguisher can obviously outperform a pure DDT-based counterpart. A similar analysis applies to Simon. In Simon, the values that go through the last nonlinear operation are fully observable. Consequently, it is interpretable that in the case of Simon, an r-round \(\mathcal {N}\mathcal {D}\) can achieve an accuracy close to the \((r-1)\)-round DDT [3].

This conclusion can be further supported by the following experimental result: In a modified r-round Speck32/64 where the last key XORing is omitted, revealing both z and y (equating to full awareness of the satisfaction of the last round’s differential constraints given a predictable input difference \(\alpha \)), a well-trained r-round \(\mathcal {N}\mathcal {D}\) achieves an accuracy close to the \((r-1)\)-round \(\mathcal {D}\mathcal {D}\). Interestingly, subsequent observations on \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s reinforce our conclusion, while the conclusion itself aids in interpreting those observations.

3.3 Distinguishers Using Systematic Computation of Conditional Differential Probability Under Known y

The simple process in Algorithm 1 is fast, but it requires evaluating the bias of each bit of the difference on the right branch of round \(r - 2\) to estimate the input difference \(\alpha \) for the last modular addition. The differential probability can only be adjusted if the estimated bias of the corresponding bit of \(\alpha \) exceeds a certain threshold. As a result, it does not make the most of the information in y. Therefore, we further designed a process, described in Algorithm 2, to systematically calculate the differential probability conditioned on the known value of y and predict based on the \((r-1)\)-round DDT.

In essence, the systematic process involves using \(\beta \), \(\gamma \), and y to determine all possible \(\alpha \)s and the conditional differential probabilities of the last round. It combines this information with the probabilities of the previous \((r-1)\) rounds to calculate the conditional differential probability for r rounds under the known value of y. Finally, it uses the systematically computed conditional probability for prediction.

More concretely, in the process, we have the following procedures:

  1. Precomputation: We generate three b-bit conditional DDTs, denoted as \(\textbf{A}_0\), \(\textbf{A}_{\textrm{next}}\), and \(\textbf{A}_{\textrm{next}}^{c}\), of the single modular addition operation \(\boxplus \). These resemble Dinur's b-bit filter in [12] (a rough brute-force sketch of \(\textbf{A}_0\) is given after this list):

     (a) \(\textbf{A}_0\) lists all valid b-bit values of \(\alpha \) with their associated probability pr for given b-bit inputs \(\beta \), \(\gamma \), and y at the first b least significant bits (LSB), where the first carry bit is zero.

     (b) \(\textbf{A}_{\textrm{next}}\) lists all valid 1-bit values of \(\alpha _{\textrm{next}}\) with their associated probability pr for given b-bit inputs \(\beta \), \(\gamma \), y, and a \((b-1)\)-bit \(\alpha \) at intermediate consecutive b bits where the LSB of the carry is undetermined.

     (c) \(\textbf{A}_{\textrm{next}}^{c}\) is similar to \(\textbf{A}_{\textrm{next}}\) but serves scenarios with known carry LSBs.

  2. Initialization: From a received ciphertext pair, we derive the output difference \(\gamma \), input difference \(\beta \), and input value y; we initialize the to-be-calculated probability p and the last round's probability factor q with 0 and 1, respectively.

  3. Generate candidate LSB b bits of \(\alpha \):

     (a) Using table \(\textbf{A}_0\), we obtain candidates for the LSB b bits of \(\alpha \) based on the LSB b bits of \(\beta \), \(\gamma \), and y, and update q with the associated pr.

     (b) For each valid LSB b bits of \(\alpha \), we invoke ‘ComputeCarryNextBit’ to determine the carry bits wherever possible according to Table 5.

  4. Iterative Calculation: For each valid LSB b bits of \(\alpha \),

     (a) starting from the \((b-1)\)-th bit, we invoke ‘ComputeAlphaPrNextBit’ to sequentially determine \(\alpha \)'s later bits and the respective augmentation of the probability factor q; alongside, we use ‘ComputeCarryNextBit’ to determine the carry bits wherever possible, preparing them to be used to derive later bits of \(\alpha \) in case of Cyc or to look up \(\textbf{A}_{\textrm{next}}^{c}\).

     Within procedure ‘ComputeAlphaPrNextBit’:

     (a) Once \(\alpha \) is fully assigned, we calculate the output difference of the penultimate round and use it to look up the \((r-1)\)-round DDT. The resultant value, multiplied by the last round's probability factor q, yields a contribution term to the final probability p.

     (b) At an intermediate bit position i, when the three input/output bit differences are equal, the subsequent \(\alpha \) bit is directly determined.

     (c) When the input/output bit differences at positions \((i+1, i)\) conform to the Cyc\(_{(i+1, i)}\) condition with a determined value for \(c_{i}\), the subsequent \(\alpha \) bit is deduced using \(y_i \oplus c_i\). After determining \(\alpha _{i+1}\), we invoke ‘ComputeCarryNextBit’ to determine the carry bit \(c_{i+1}\) wherever possible.

     (d) Otherwise (in the absence of conformity or a determined \(c_{i}\) value), \(\alpha _{i+1}\) is enumerated using either \(\textbf{A}_{\textrm{next}}\) or \(\textbf{A}_{\textrm{next}}^{c}\), depending on whether the carry bit b positions below the \((i+1)\)-th bit is determined.

     (e) After obtaining \(\alpha _{i+1}\) and its probability pr, we continue to determine the next \(\alpha \) bit, updating the probability factor by multiplying q by pr.
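The following is a rough brute-force sketch of how a table in the spirit of \(\textbf{A}_0\) could be tabulated for a toy chunk size b = 4; the exact table layout and probability convention are not specified above, so this interpretation (probability taken over the unknown input x, with zero initial carry) is an assumption, not the authors' implementation.

```python
from collections import defaultdict

b = 4
Mb = (1 << b) - 1

def build_A0():
    # A0[(beta, gamma, y)] -> list of (alpha, pr) over the b least significant bits,
    # where pr is the fraction of x values conforming to the b-bit differential.
    A0 = defaultdict(list)
    for beta in range(1 << b):
        for gamma in range(1 << b):
            for y in range(1 << b):
                for alpha in range(1 << b):
                    hits = sum(
                        (((x + y) & Mb) ^ (((x ^ alpha) + (y ^ beta)) & Mb)) == gamma
                        for x in range(1 << b)
                    )
                    if hits:
                        A0[(beta, gamma, y)].append((alpha, hits / (1 << b)))
    return A0
```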

The resulting procedure is slower than the simple one; however, the resulting distinguishers, named “\(\mathcal {A}\mathcal {D}_{\textbf{YD}}\) ”, have accuracy exceeding not only that of the DDT-based distinguishers \(\mathcal {D}\mathcal {D}\)s but also that of the neural distinguishers \(\mathcal {N}\mathcal {D}\)s, and comparable to the \((r-1)\)-round DDT-based key-averaging distinguishers \(\mathcal {A}\mathcal {D}_{\textbf{KD}}\)s [2] (refer to Table 9 and Table 20 in [4]), providing an accuracy benchmark for \(\mathcal {N}\mathcal {D}\)s.

[Algorithm 2]
Table 9. Performance of the improved DDT-based distinguishers (\(\mathcal {Y}\mathcal {D}\)s and \(\mathcal {A}\mathcal {D}_{\textbf{YD}}\)s) on Speck32/64 and comparisons with pure DDT-based distinguishers (\(\mathcal {D}\mathcal {D}\)s), neural distinguishers (\(\mathcal {N}\mathcal {D}\)s), and DDT-based key-averaging distinguishers (\(\mathcal {A}\mathcal {D}_{\textbf{KD}}\)s)

3.4 Discussion on \(\mathcal {N}\mathcal {D}\) ’s Advantages

Based on the above observations and experiments, we can conclude that \(\mathcal {N}\mathcal {D}\) 's advantage over pure differential-based distinguishers comes from exploiting the conditional differential distribution under the partially known value, obtained from the ciphertexts, of the input to the last non-linear operation. More specifically, \(\mathcal {N}\mathcal {D}\)s exploit the correlation between the ciphertexts' partial value, the ciphertext pair's differences, and the intermediate states' differences. In particular, when some of the inputs and outputs of the last-round nonlinear operation are known (i.e., not XORed with independently randomized key bits), a distinguisher can achieve higher distinguishing accuracy than an r-round pure differential-based distinguisher.

Table 10. The accuracy of differential-neural distinguishers using distinct differences obtained by (0040, 0000) after i rounds of propagation. Prob. represents the probability of the highest probability differential (0040,0000) \(\rightarrow \) “Diff.”.

These findings apply not only to Speck but also to other block ciphers, such as Simon and Gift (refer to Appendix D.1 in [4]), and demonstrate the ability of neural networks to capture and utilize complex relationships between ciphertext values and intermediate state differences. Note that the neural distinguishers are not aware of the specific details of the ciphers, including their non-linear components and structure. Therefore, these neural distinguishers can be used for ciphers that have unknown components.

On the Performance of Various Distinguishers. Experiments showed that \(\mathcal {N}\mathcal {D}\)s can be more efficient while achieving comparable accuracy to sophisticated manual methods (Alg. 2). Please refer to Table 9 for detailed benchmarks. Note that in the benchmarks listed in Table 9, all \(\textrm{DDT}\)-based distinguishers are implemented in C++, whereas \(\mathcal {N}\mathcal {D}\)-based distinguishers are implemented in Python with TensorFlow. Although C++ implementations might be inherently faster than their Python counterparts, \(\mathcal {N}\mathcal {D}^{\textsc {Speck}_{*R}}\)s in Python are still more efficient than \(\mathcal {A}\mathcal {D}_{\textbf{YD}}^{{\textsc {Speck}}_{*R}}\) and \(\mathcal {A}\mathcal {D}_{\textbf{KD}}^{{\textsc {Speck}}_{*R}}\) in C++ (all restricted to run in a single CPU thread). Therefore, we can conclude that the neural network-based distinguishers provide a good trade-off between efficiency and accuracy.

4 Insights and Improvements on Training Differential-Neural Distinguisher

4.1 Relations Between Distinguisher Accuracy and Differential Distribution

Traditional differential cryptanalysis predominantly utilizes high-probability differentials as distinguishers. However, differential-neural cryptanalysis exploits all output differences for distinguishing while fixing the input difference of plaintext pairs. In EUROCRYPT 2021, Benamira et al. [7] argued that the differential-neural distinguisher inherently builds a very good approximation of the DDT during the learning phase.

Our study delves into the relation between the accuracy of the differential-neural distinguisher and the differential distribution of ciphertext pairs. We modify the input difference of plaintext pairs, inspired by Gohr’s staged training method [15]. In [15], while the basic training method can produce a valid 7-round distinguisher, an 8-round distinguisher must be trained using the staged training approach. The core of the staged training method is training a pre-trained 7-round distinguisher to learn 5-round Speck32/64 ’s output pairs with the input difference (8000,804a) (the most likely difference to appear three rounds after the input difference (0040,0000)). Employing such plaintext pairs aims to concentrate the difference distribution of ciphertext pairs, escalating the output difference’s likelihood and simplifying the distinguisher’s learning task.

In our work, we first introduce the 4-round highest-probability differential trail starting from (0040,0000):

$$\texttt {(0040,0000)} \rightarrow \texttt {(8000,8000)} \rightarrow \texttt {(8100,8102)} \rightarrow \texttt {(8000,840a)} \rightarrow \texttt {(850a,9520)}$$

Our experiments (see Table 10) initially employ a 4-round high-probability differential trail starting from (0040,0000), leading to (850a,9520).

By default, we use (0040,0000) as the input difference of the plaintext pair to generate the ciphertext pair. Here, in Table 10, we use the difference of the highest probability of (0040,0000) after \( i \ (1 \le i \le 4) \) rounds of propagation as the input difference of the plaintext pair, respectively.

From Table 10, we can observe that the larger i is, the higher the accuracy of the differential-neural distinguisher. As i increases, the difference distribution in the ciphertext becomes more concentrated, and the probability of each difference increases. Therefore, the more distinguishable the ciphertext pairs are from random pairs, the higher the accuracy achieved by the differential-neural distinguisher.

To more comprehensively demonstrate the relation between the accuracy of the differential-neural distinguisher and the differential distribution of the ciphertext pairs, we conducted some experiments from another perspective. We fixed the number of rounds of the differential but chose multiple 2-round differences with gradually decreasing probabilities. In Table 11, we notice that the higher the probability of the fixed differential, the higher the accuracy of the resulting differential-neural distinguisher. In other words, a lower probability means that after i rounds of encryption, the differential distribution of the ciphertext is more dispersed and more difficult for the neural network to learn, resulting in a decrease in both the number of rounds and the accuracy of the differential-neural distinguisher.

Table 11. The accuracy of the differential-neural distinguisher using distinct differences obtained from (0040, 0000) after 2 rounds of propagation. Prob. represents the probability of the differential (0040,0000) \(\rightarrow \) “Diff.”. Round \(2+i\) indicates that the positive samples of the training set are the ciphertext pairs obtained by encrypting, for i rounds, plaintext pairs that satisfy this difference.

In conclusion, controlling differential propagation is imperative to enhance the differential-neural distinguisher's accuracy and its number of rounds. We thus propose a method to control the differential propagation and reduce the diffusion of features, thereby increasing the number of rounds of the differential-neural distinguisher. However, before the formal introduction, we introduce one method that can simplify the training process of high-round distinguishers.

4.2 Freezing Layer Method

In existing experiments on Speck32/64, especially with an input difference of (0040,0000), there has been a notable limitation. Researchers have been able to directly train a differential-neural distinguisher for up to only 7 rounds. Direct training for higher rounds from scratch has been challenging. A potential avenue that has garnered attention is the utilization of various network fine-tuning strategies. Specifically, continuing the training phase from pre-trained models has been proposed to potentially overcome these limitations and expand the distinguisher’s round capability. Examples include the staged training method in [15] and the staged pipeline method in [6].

The inability to directly train the 8-round distinguisher likely stems from the feature diffusion associated with the input difference (0040,0000) over increasing rounds. This makes the 8-round features considerably more challenging for the distinguisher to learn directly from limited data, as compared to lower rounds. One approach is to either mitigate feature diffusion or narrow the distinguisher's solution space. While a technique to constrain feature diffusion is discussed in the subsequent section, here we employ a classic network fine-tuning strategy, the freezing layer method, to limit the solution space.

Our distinguishers consist of two parts: the convolutional layers and the fully connected layers. In the field of artificial intelligence, the convolutional layers are viewed as a feature extractor, while the fully connected layers are viewed as a classifier. We argue that the feature extractor can be reused, and that the classifiers are relatively similar in adjacent rounds. Therefore, to train an 8-round distinguisher for Speck32/64, we can simply load a well-trained 7-round model and freeze all its convolutional layers, meaning that only parameters in the fully connected layers can be updated. Then, we obtain an 8-round distinguisher with accuracy identical to the ones in [6, 15], keeping all hyperparameters in the training process unchanged.
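A minimal sketch of the freezing layer method in Keras follows; the file name and hyperparameters are illustrative assumptions, and `make_dataset` refers to a data-generation routine such as the one sketched in Sect. 2.3.

```python
import tensorflow as tf

# Load a pre-trained 7-round distinguisher (hypothetical file name).
model = tf.keras.models.load_model("nd_speck32_7round.h5")

# Freeze the feature extractor: only the fully connected layers remain trainable.
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.Conv1D):
        layer.trainable = False

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])

# X8, Y8: 8-round samples and labels (bit-decomposition of the words into the
# network's input format is omitted here).
X8, Y8 = make_dataset(10**7, n_rounds=8)
model.fit(X8, Y8, epochs=20, batch_size=5000, validation_split=0.1)
```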

Relative to the staged training method [15], our approach maintains the same hyperparameters and does not require more samples in the final stage. In comparison with the method in [6], we only need two training rounds instead of multiple rounds in a row as required by the simple training pipeline in [6]. Besides, the simple training pipeline [6] did not produce \(\mathcal {N}\mathcal {D}\)s with the same accuracy as Gohr's on 8-round Speck32/64; it needs a further polishing step to achieve similar accuracy, demanding more time and data. Our freezing layer method also speeds up the training process due to the reduction of trainable parameters. Therefore, we recommend trying the freezing layer method once the number of rounds of the distinguisher is too high to train directly.

5 Related-Key Differential-Neural Cryptanalysis

The \(\mathcal{N}\mathcal{D}\) explainability concept serves as a fundamental theoretical underpinning when aiming to enhance and leverage its capabilities. With the outcome being that \(\mathcal{N}\mathcal{D}\)s can effectively capture additional features and provide a better trade-off between efficiency and accuracy, there is substantial motivation for us to continue refining and exploiting their potential.

In this section, we introduce the related-key into differential-neural cryptanalysis, enabling control over differential propagation and facilitating the training of high-round \(\mathcal{N}\mathcal{D}\)s. Furthermore, we enhance the DDT-based distinguisher under the \(\mathcal{R}\mathcal{K}\) setting by employing the analytical methods and conclusions outlined in Sect. 3. As a result of these advancements, we successfully implement a 14-round key recovery attack for Speck32/64 using the proposed \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s.

5.1 Related-Key Differential-Neural Distinguisher for Speck32/64

Here we present the related-key differential-neural distinguishers on Speck32/64 obtained in this work.

The Choice of the Input Difference. The input difference is a crucial and central component of differential-neural cryptanalysis, and numerous papers delve into the study of the input difference, such as [3, 6, 15, 16, 21]. To maximize the number of rounds for both \(\mathcal{N}\mathcal{D}\) and \(\mathcal{C}\mathcal{D}\), and to make the weak key space as large as possible so as to perform the longest key recovery attack, we use the SMT-based method to search for appropriate \(\mathcal{R}\mathcal{K}\) differentials or differential trails. It is important to note that the largest weak key space does not necessarily come with the longest \(\mathcal{N}\mathcal{D}\) or \(\mathcal{C}\mathcal{D}\), thus requiring a compromise between the three factors. In this paper, the choice of the best input difference is given under different compromises. Table 12 lists the \(\mathcal{R}\mathcal{K}\) differential trails used to constrain the key space in Speck32/64, where we label each distinguisher with an ID. Specifically, \(\text {ID}_1\) is used to restrict the weak key space for the 13-round attack, while \(\text {ID}_2\) and \(\text {ID}_3\) are used for the 14-round attack. Note that part of the \(\text {ID}_2\)/\(\text {ID}_3\) \(\mathcal{R}\mathcal{K}\) differences (2nd to 11th round) are the same as the 10-round optimal \(\mathcal{R}\mathcal{K}\) differential trail for Speck32/64 given in Table 9 of [23]. In addition, round-reduced versions of the trails are used to restrict the weak key space for shorter rounds; e.g., \(\text {ID}_2\) and \(\text {ID}_3\) are used to restrict the weak key space for 13 rounds starting from the second round.

Table 12. Related-key differential trails used to constrain the key space in Speck32/64 where we label each distinguisher with an ID. For example, \(\text {ID}_1\) represents the 13-round \(\mathcal{R}\mathcal{K}\) differential trail for the key schedule algorithm with \((\varDelta l^2, \varDelta l^1, \varDelta l^0, \varDelta k^0) = \texttt {(0044,0011,4000,0080)}\)

Network Architecture. Given the success of the neural network combining Inception blocks with a residual network on Speck, Simon, and Simeck [25, 26], as well as its superior performance as a differential-neural distinguisher, we use the neural network proposed in [26] to train the \(\mathcal{R}\mathcal{K}\) differential-neural distinguishers, with some modifications to the architecture. In deep learning, odd numbers such as 3, 5, and 7 are commonly used as convolution kernel sizes. However, guided by the cyclic shifts in the round function of Speck32/64, we choose 2 and 7 as kernel sizes; using 2 instead of 3 also makes the model’s accuracy converge faster. In [26], the kernel size keeps growing as the depth of the residual network increases. Enlarging the kernel to widen the network’s receptive field is reasonable, but it cannot grow indefinitely; we therefore limit the kernel size to at most 7.
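To illustrate the structure described above, here is a rough Keras sketch of an Inception-style residual block using kernel sizes 2 and 7. It is an illustrative approximation only: the number of blocks, filter counts, and input formatting are assumptions, not the exact architecture of [26].

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_residual_block(x, filters=32):
    # Two parallel 1-D convolutions with kernel sizes 2 and 7, concatenated.
    b1 = layers.Conv1D(filters, kernel_size=2, padding="same", activation="relu")(x)
    b2 = layers.Conv1D(filters, kernel_size=7, padding="same", activation="relu")(x)
    y = layers.Concatenate()([b1, b2])
    y = layers.BatchNormalization()(y)
    # 1x1 convolution to match the channel count of the residual connection.
    y = layers.Conv1D(x.shape[-1], kernel_size=1, padding="same", activation="relu")(y)
    return layers.Add()([x, y])

def build_toy_nd(num_blocks=5, words=4, word_size=16, filters=32):
    # Input: one ciphertext pair, viewed as `words` 16-bit words split into bits.
    inp = tf.keras.Input(shape=(words * word_size,))
    x = layers.Reshape((words, word_size))(inp)
    x = layers.Permute((2, 1))(x)                      # word_size positions, words channels
    x = layers.Conv1D(filters, kernel_size=1, activation="relu")(x)
    for _ in range(num_blocks):
        x = inception_residual_block(x, filters)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inp, out)
```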

The Training of Related-Key Differential-Neural Distinguisher. This work still uses the basic training method to train short-round distinguishers. When the basic training method fails, we train the r-round distinguisher from the \((r-1)\)-round distinguisher using the freezing layer method. Please refer to Appendix F in [4] for the detailed training procedure.

Performance Evaluation of the Distinguisher. In artificial intelligence, a model’s accuracy is the most critical evaluation indicator. In differential-neural cryptanalysis, whether a guessed key is correct is judged based on the score of the distinguisher. Therefore, we evaluate the performance of the differential-neural distinguishers in terms of both accuracy and score.

  • Test accuracy. We summarize the accuracy of the differential-neural distinguishers in Table 13. The 8- and 9-round distinguishers were trained using the basic training method, while the 10-round distinguishers were trained using the freezing layer method. For more insight on related-key differential-neural distinguishers, please refer to Appendix F.2 in [4].

    Table 13. The summary of related-key differential-neural distinguishers on Speck32/64, where the plaintext difference is (0000,0000).
  • Wrong key response profile (WKRP). In [15], the key search policy relies on the observation that a distinguisher’s response to wrong-key decryption varies with the bitwise difference between the guessed and real key. Instead of exhaustive trial decryption, it recommends specific subkeys and scores them. Figure 3 shows the mean response for varying Hamming distances between guessed and actual keys for \(\text {ID}_1\). Notably, high scores emerge when the difference between the keys is small, especially when the difference belongs to {16384, 32768, 49152}. This indicates that errors in the 14th and 15th bits of the subkey barely affect the score, allowing the key guessing space to be reduced; this accelerated key recovery in [15]. For the WKRPs of \(\text {ID}_2\) and \(\text {ID}_3\), see Appendix B.2 in [4].

Fig. 3. Wrong key response profile of \(\text {ID}_1\)
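As a rough illustration of how such a profile can be estimated (following the procedure of [15], but simplified), the sketch below decrypts one round under every key \(k \oplus \delta\) and records the mean distinguisher response per difference \(\delta\). The trained model nd and the input-formatting helper to_input are assumed to exist, and the ciphertext words are assumed to be 16-bit values stored in uint32 NumPy arrays.

```python
import numpy as np

MASK = 0xFFFF
def ror(x, r): return ((x >> r) | (x << (16 - r))) & MASK
def rol(x, r): return ((x << r) | (x >> (16 - r))) & MASK

def dec_one_round(cl, cr, k):
    # Inverse of the Speck32/64 round function.
    y = ror(cl ^ cr, 2)
    x = rol(((cl ^ k) - y) & MASK, 7)
    return x, y

def wrong_key_response_profile(c0l, c0r, c1l, c1r, true_key, nd, to_input):
    # c0*/c1*: arrays of ciphertext pairs obtained under the same last-round key.
    means = np.zeros(2**16)
    for delta in range(2**16):
        a = dec_one_round(c0l, c0r, true_key ^ delta)
        b = dec_one_round(c1l, c1r, true_key ^ delta)
        means[delta] = nd.predict(to_input(a, b), verbose=0).mean()
    return means
```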

Table 14. Experiments detailing the information harnessed by \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s using the 9-round \(\text {ID}_{(2,\texttt {9182})}\), \(\text {ID}_{(2,\texttt {9382})}\), and \({\text {ID}_{(3,\texttt {9082})}}\), with settings similar to those in Table 2.

On \({\boldsymbol{\mathcal {R}}}{\boldsymbol{\mathcal {K}}}\text {-}{\boldsymbol{\mathcal {N}}}{\boldsymbol{\mathcal {D}}}{} \mathbf{'s}\) Explainability. Beyond constructing and comparing various \(\mathcal{R}\mathcal{K}\) distinguishers (see Appendix F.2 in [4]), we further undertook experiments analogous to Gohr’s \(\texttt {aaaa}\)-blinding experiment (a generic blinding-style check is sketched after the list below). Some \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\)s behaved similarly to single-key \(\mathcal {N}\mathcal {D}\)s, while others did not; see \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(2,\texttt {9182})}}}\) and \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(2,\texttt {9382})}}}\) in Table 14 as examples of the former and latter case, respectively, where the differential trail \(\text {ID}_{(2,\texttt {9182})}\) differs from \(\text {ID}_{(2,\texttt {9382})}\) only at the last round key, and \(\text {ID}_{(2,\texttt {9382})}\) is \(\text {ID}_2\) from round 4 to 12. Notably, the behavior of \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(3,\texttt {9082})}}}\) presented intriguing phenomena (\(\text {ID}_{(3,\texttt {9082})}\) is \({\text {ID}_{3}}\) from round 4 to 12):

  1. \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(3,\texttt {9082})}}}\) performed differently on \(\texttt {Set-1-1} := \{\varGamma _{\mathcal {A}}, \varGamma _{\mathcal {B}}, \varGamma _{\mathcal {C}}, \varGamma _{\mathcal {D}}\}\) and \(\texttt {Set-1-2} := \{\varGamma _{\mathcal{A}\mathcal{R}_1}, \varGamma _{\mathcal{B}\mathcal{R}_1}, \varGamma _{\mathcal{C}\mathcal{R}_1}, \varGamma _{\mathcal{D}\mathcal{R}_1}\}\), which, under the assumption of a random last-round key K, define the same information per Sect. 3.1 (please refer to Table 14).

  2. \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{{\text {ID}_{(3,\texttt {9082})}}}}\) showed superior performance over \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(2,\texttt {9382})}}}\) (0.7726 vs. 0.7535, refer to Table 22 in [4]). Theoretically, however, if no information on the key is revealed beyond the key difference, \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{{\text {ID}_{(3,\texttt {9082})}}}}\) should perform exactly the same as \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(2,\texttt {9382})}}}\), since the two differential trails differ only in the last round key difference and thus the two output difference distributions are affine-equivalent.

  3. Surprisingly, \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}}}_{{\text {ID}_{(3,\texttt {9082})}}}\) even outperformed our manually enhanced distinguisher \({\mathcal {R}\mathcal {K}\text {-}\mathcal {A}\mathcal {D}^{{\textsc {Speck}}_{9R}}_{{\textbf {YD}}}}\) (0.7726 vs. 0.7574, refer to Table 22 in [4]).
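The blinding-style check mentioned above can be realized, in one generic variant, by XORing the same fresh random mask into both ciphertexts of a pair, so that all difference information is preserved while the value information is destroyed; comparing the distinguisher’s accuracy on masked versus unmasked data then indicates whether value information is being used. The sketch below illustrates this idea only; the concrete experiment sets of Table 14 are defined in Sect. 3.1 and differ in detail.

```python
import numpy as np

def blind_pairs(cl0, cr0, cl1, cr1, seed=0):
    """Apply the same random 16-bit masks to both members of each ciphertext pair,
    keeping differences intact while randomizing values. Inputs are uint16 arrays."""
    rng = np.random.default_rng(seed)
    rl = rng.integers(0, 1 << 16, size=cl0.shape, dtype=np.uint16)
    rr = rng.integers(0, 1 << 16, size=cr0.shape, dtype=np.uint16)
    return cl0 ^ rl, cr0 ^ rr, cl1 ^ rl, cr1 ^ rr
```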

Upon closer examination of the differential trail of \(\text {ID}_{(3,\texttt {9082})}\), we identified the causative factor. Denote the input/output differences and values around the last \(\boxplus \) in the key schedule producing the 8-round key \({k^8}\) (round counting starts from 0) by \({\alpha , \beta , \gamma , x, y, z}\). Then, from the differential trail \({\text {ID}_{(3,\texttt {9082})}}\), focusing specifically on rounds 7 and 8, we have \({\left\{ \begin{array}{ll} \alpha = \texttt {0x8002}^{\ggg 7} &{}= \texttt {0b~0000~0101~0000~0000}, \\ \beta = \texttt {0x8480} &{}= \texttt {0b~1000~0100~1000~0000}, \\ \gamma = \texttt {0x8280} &{}= \texttt {0b~1000~0010~1000~0000}. \\ \end{array}\right. } \) According to Tables 4 and 5, we have the following.

  1. The (8, 7)-th bit position is in case Cxc1\(_{(8,7)}\), so we have \(c_8 = y_7\).

  2. The (9, 8)-th bit position is in case Cxc0\(_{(9,8)}\), so we have \(c_9 = c_8\).

  3. The (10, 9)-th bit position is in case Cxy1\(_{(10,9)}\), so we have \(x_9 \oplus y_9 = z_9 \oplus c_9 = 1\).

Consequently, we have \(z_9 \oplus y_7 = 1\), which implies that the 9th bit of the last round key is constantly 1. This constant key bit therefore does not obscure one bit of information about the output of the last \(\boxplus \) in the encryption path, allowing the resulting distinguisher to achieve better accuracy. This explains all the oddities observed for \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{{\text {ID}_{(3,\texttt {9082})}}}}\).
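The relation above can also be probed empirically: sampling random addition inputs and keeping only pairs that conform to \((\alpha, \beta, \gamma)\), one can estimate the distribution of \(z_9 \oplus y_7\). The sketch below does this under the assumption that \(\alpha\) is the difference on the addend x and \(\beta\) the difference on the addend y, following the ordering used above.

```python
import numpy as np

ALPHA, BETA, GAMMA = 0x0500, 0x8480, 0x8280   # 0x0500 = 0x8002 rotated right by 7
MASK = 0xFFFF

rng = np.random.default_rng(1)
N = 1 << 22
x = rng.integers(0, 1 << 16, N, dtype=np.uint32)
y = rng.integers(0, 1 << 16, N, dtype=np.uint32)

z  = (x + y) & MASK
z2 = ((x ^ ALPHA) + (y ^ BETA)) & MASK
conforming = (z ^ z2) == GAMMA                 # right pairs for the key-schedule addition

bit = ((z >> 9) ^ (y >> 7)) & 1
print("conforming pairs:", int(conforming.sum()))
print("Pr[z_9 xor y_7 = 1 | conforming]:", float(bit[conforming].mean()))
```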

Additionally, for \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(2,\texttt {9382})}}}\), the 10th bit of the last-round key conforming to the round difference is biased towards 0 (it equals 0 with probability 3/4), which could explain its slightly different accuracy on \(\texttt {Set-1-1}\) and \(\texttt {Set-1-2}\) (refer to Table 14). After fixing the 10th bit to 0 and re-training the distinguisher, it achieves almost the same accuracy as \({\mathcal {R}\mathcal {K}\text {-}\mathcal {N}\mathcal {D}^{{\textsc {Speck}}_{9R}} _{\text {ID}_{(3,\texttt {9082})}}}\). Analyzing the probability of related-key pairs under these conditions, we deduced that restricting the 10th bit for \({\text {ID}_{(2,\texttt {9382})}}\) still yields a larger weak-key space than \({{\text {ID}_{(3,\texttt {9082})}}}\) while achieving the same high \(\mathcal {R}\mathcal {K}\text {-}{\mathcal {N}\mathcal {D}}\) accuracy.

5.2 Key Recovery Attack on Round-Reduced Speck32/64

This subsection describes the implementation of \(\mathcal{R}\mathcal{K}\) differential-neural cryptanalysis using the trained distinguishers. The key recovery framework is similar to [3, 15, 26]. Since the whole attack is in the \(\mathcal{R}\mathcal{K}\) setting, we need to specify the difference between the subkeys of each round. Specifically, it is unclear how to perform a key recovery attack if a difference is applied only to the master key without specifying the differences in the round-key states: a guess of one last-round key then does not directly determine the other last-round key of the related pair, since the difference between the last-round keys is not specified.

We first introduce some preparatory work before implementing the key recovery attack.

Generalized Neutral Bits. We prepend a \(\mathcal{C}\mathcal{D}\) to the \(\mathcal{N}\mathcal{D}\) to increase the number of rounds covered by the key recovery attack. Furthermore, to enhance predictive performance, we use the distinguisher to estimate the scores of multiple ciphertext pairs that follow the same distribution (a ciphertext structure) and combine them to obtain the score of the guessed subkey. However, the \(\mathcal{C}\mathcal{D}\) is probabilistic, and a randomly generated plaintext structure does not retain the same distribution after encryption. Hence, we require neutral bits to generate the plaintext structures, which we encrypt to obtain the ciphertext structures needed for a successful key recovery attack. Therefore, the \(\mathcal{C}\mathcal{D}\) should have a high probability and a sufficient number of neutral bits. Appendix B.3 in [4] lists the NBs/SNBSs we used to perform the key recovery attack.
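As an illustration of how a plaintext structure is generated from neutral bits, the sketch below flips every subset of a given list of (simultaneous-)neutral-bit masks over a base plaintext. The masks, base plaintext, and input difference shown are placeholders, not the actual NBs/SNBSs from Appendix B.3 in [4].

```python
import numpy as np

def plaintext_structure(p0, diff, nb_masks):
    """Build one structure of 2^len(nb_masks) plaintext pairs from 32-bit neutral-bit masks."""
    pts = [p0]
    for m in nb_masks:                 # flipping each mask doubles the structure
        pts += [p ^ m for p in pts]
    pts0 = np.array(pts, dtype=np.uint32)
    pts1 = pts0 ^ np.uint32(diff)      # the paired plaintexts at the chosen input difference
    return pts0, pts1

# Placeholder base plaintext, input difference, and neutral-bit masks:
pts0, pts1 = plaintext_structure(0x12345678, 0x00400000,
                                 [1 << 20, 1 << 21, (1 << 5) | (1 << 28)])
```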

The Parameters for Key Recovery Attack. The attacks follow the framework of the improved key recovery attacks in [15]. An r-round main and an \((r-1)\)-round helper \(\mathcal{N}\mathcal{D}\) are employed, and an s-round \(\mathcal{C}\mathcal{D}\) is prepended. The key guessing procedure applies a simple reinforcement learning mechanism: the last and second-to-last subkeys are recovered without exhaustively trying all candidate values in one-round decryptions; instead, a Bayesian key search employing the wrong key response profile is used. We count a key guess as successful if the last round key is guessed correctly and the second-to-last round key is at Hamming distance at most two from the real key. The parameters used to recover the last two subkeys are given below.

Parameter | Definition
\(n_{cts}\) | The number of ciphertext structures
\(n_{b}\) | The number of ciphertext pairs in each ciphertext structure, that is, \(2^{|\text {NB}|}\)
\(n_{it}\) | The total number of iterations in the ciphertext structures
\(c_1, c_2\) | The cutoffs with respect to the scores of the recommended last subkey and second-to-last subkey, respectively
\(n_{byit1/2}\) | The number of iterations, the default value is 5
\(n_{cand1/2}\) | The number of key candidates within each iteration, the default value is 32
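As a reminder of how the cutoffs \(c_1\) and \(c_2\) are used, the sketch below combines the per-pair distinguisher outputs of one decrypted ciphertext structure into a score for a guessed subkey, as in [15]; the function and variable names are ours, not from a specific implementation.

```python
import numpy as np

def combined_key_score(nd_outputs, eps=1e-12):
    """Sum of log-likelihood ratios over the ciphertext pairs of one structure."""
    p = np.clip(nd_outputs, eps, 1 - eps)
    return float(np.sum(np.log2(p / (1 - p))))

def structure_passes(best_key_score, cutoff):
    """A structure is promoted (e.g., to guessing the next subkey) only above the cutoff."""
    return best_key_score > cutoff
```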

Complexity Evaluation of Key Recovery Attack. The experiments are conducted with Python 3.7.15 and TensorFlow 2.5.0 under Ubuntu 20.04, on a machine with two Intel Xeon E5-2680 v4 CPUs at 2.40 GHz, 256 GB RAM, and seven NVIDIA RTX 3080 Ti GPUs (12 GB each). To reduce the experimental error, we perform 210 key recovery attacks for each parameter setting, take the average running time rt as the running time of one experiment, and take the number of successful experiments divided by the total number of experiments as the success rate sr of the key recovery attack.

  1. Data complexity. The data complexity of the experiment is calculated using the formula \(n_{b}\times n_{cts} \times 2\), which is a theoretical value. In the actual experiments, when the accuracy of the differential-neural distinguisher is high, the key can be recovered quickly and successfully; not all data are used, so the actual data complexity is lower than the theoretical one.

  2. Time complexity. We use \(2^{32}\) data to test the speed of encryption and decryption on our device; each core can perform \(2^{26.814}\) rounds of decryption operations per second for Speck32/64. The time complexity in our experiments is thus calculated as \(2^{26.814}\times rt\) (a small worked example follows this list).
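A small worked example of these formulas, using hypothetical parameter values rather than the actual ones from Table 15, is given below.

```python
import math

n_b, n_cts = 2**6, 2**7      # hypothetical: pairs per structure, number of structures
rt = 120.0                   # hypothetical average running time of one attack, in seconds

data = n_b * n_cts * 2       # theoretical data complexity in plaintexts: 2^14
time = 2**26.814 * rt        # time complexity in Speck32/64 round decryptions

print(f"data = 2^{math.log2(data):.1f} plaintexts, time = 2^{math.log2(time):.2f}")
```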

The Result of Key Recovery Attacks. We list the results of key recovery attacks in multiple differential modes in Table 15. We calculate the corresponding weak-key space wks according to the probabilities of \(\text {ID}_1\), \(\text {ID}_2\), and \({\text {ID}_{(3,\texttt {9082})}}\). Adv. denotes the advantage over the time complexity of brute force. The time and data complexity can be reduced by decreasing \(n_{cts}\) and \(n_{it}\), but the success rate sr decreases accordingly. The primary goal of our experiments is to reduce the time complexity.

Table 15. Summary of key recovery attacks on Speck32/64

Remark 3

(The profiling information of the key-recovery attack). To pinpoint the attack’s bottleneck, we profiled a 14-round key-recovery attack using \(\text {ID}_3\). The main result is detailed in Table 16. From the profiling results, the performance of our implementation is mostly limited by the speed of neural network evaluation (the proportion taken by \(\mathcal {N}\mathcal {D}\) predictions is 79.18% + 5.17% = 84.35%). The next limiting factor is the speed of computing the weighted Euclidean distance against the wrong key response profile.

Table 16. Profiling information of the key-recovery attack
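One simple way to obtain such a breakdown is Python’s built-in profiler; the sketch below wraps a hypothetical attack entry point attack_14_rounds with cProfile and prints the functions with the largest cumulative time.

```python
import cProfile
import pstats

cProfile.run("attack_14_rounds()", "attack.prof")   # attack_14_rounds is hypothetical
stats = pstats.Stats("attack.prof")
stats.sort_stats("cumulative").print_stats(15)      # top 15 entries by cumulative time
```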

Remark 4

(Efficiency measures in symmetric-key cryptanalysis attacks). Assessing the efficiency of distinguishers and key recovery attacks in symmetric-key cryptanalysis poses intricate challenges, particularly when deriving computational complexities from real-time attack timings and then extrapolating these to equivalent primitive evaluations, as done in both \(\mathcal {N}\mathcal {D}\)-based and traditional attacks in [12, 14, 24] (listed in Table 1).

Factors influencing these complexities include architecture compatibility and algorithmic suitability, varying computation intensity and operation costs across platforms, memory constraints and flexible trade-offs, and implementation factors. Given these complexities, it is advisable to use secondary metrics for comparison, for instance, power consumption and cost efficiency (please refer to Appendix E in [4] for detailed discussions). While there is a pressing need for universal metrics, formulating such benchmarks is challenging, warranting caution when interpreting comparison results and calling for further exploration.

6 Conclusion

This paper provides explicit rules that a distinguisher can use, beyond the full differential distribution table, to achieve better distinguishing performance. These rules are based on strong correlations between bit values in right pairs of differential propagation through addition modulo \(2^n\). By leveraging the value-dependent differential probability, which is not typically used in traditional differential distinguishers, we can equip DDT-based distinguishers with additional knowledge and enhance their accuracy. These rules, or an equivalent form of them, are likely the additional features beyond the full DDT that the neural distinguishers exploit. While these rules are not difficult to derive with careful analysis, they rely on non-trivial relations that traditional distinguishers often overlook. This indicates that neural networks help break the limitations of traditional cryptanalysis, and studying this unorthodox model can provide new opportunities to understand cryptographic primitives better.

Another investigation in this paper revealed that controlling the differential propagation is crucial for enhancing the accuracy of differential-neural distinguishers. It is typically believed that introducing differences into the keys provides chances to cancel differences in the encryption states, thus resulting in stronger differential propagation. However, unlike traditional differential attacks, differential-neural attacks do not specify the output difference and are thus not limited to a single differential trail. Therefore, it was unclear whether a difference in the key is helpful in differential-neural attacks, and how resistant Speck is against differential-neural attacks in the \(\mathcal{R}\mathcal{K}\) setting. This work confirmed that differential-neural cryptanalysis in the \(\mathcal{R}\mathcal{K}\) setting can be more powerful than in the single-key setting by conducting a 14-round key recovery attack on Speck32/64.