1 Introduction

The Keccak hash function [3] was a submission to the SHA-3 competition [22] in 2008. After 4 years of intensive evaluation, it was selected as the winner of the competition in 2012. It was then formally standardized in 2015 by the National Institute of Standards and Technology (NIST) of the USA as Secure Hash Algorithm-3 (SHA-3) [27]. The SHA-3 family has four instances with fixed digest lengths, namely SHA3-224, SHA3-256, SHA3-384 and SHA3-512, which correspond to Keccak[c\(\triangleq \) Keccak[\(r=1600-c,c\)] where \(c\in \{448,512,768,1024\}\). The SHA-3 family also has two extendable-output functions (XOFs) SHAKE128 and SHAKE256. To promote cryptanalysis of the Keccak hash function, the Keccak designers proposed variants with lower security levels in the Keccak Crunchy Crypto Collision and Pre-image Contest (the Keccak contest for short) [2], for which the digest lengths are 80 and 160 bits for pre-image and collision resistance, respectively. For clarity, these variants are denoted by Keccak\([r,c,n_r,d]\) with parameters \(\{r,c,n_r,d\}\) to be specified later.

Since the Keccak hash function was made public in 2008, there has been intensive cryptanalysis from the public research community [1, 9,10,11,12,13,14, 16, 17, 21]. In this paper, we mainly investigate collision attacks on the Keccak hash function, in particular, those with practical attack complexities. In collision attacks, the aim is to find two distinct messages which lead to the same hash digest. Up to date, the best practical collision attacks against Keccak[448]/Keccak[512] are of 4 (out of 24) rounds found by Dinur et al. [10] in 2012 and later furnished in the journal version [12]. These 4-round collisions were generated by combining a 1-round connector and a 3-round differential trail. The same authors presented practical collision attacks on 3-round Keccak[768]/Keccak[1024], and theoretical collision attacks (with complexities beyond the reach of practical resources) on 5/4-round Keccak[512]/Keccak[768] in [11] using internal differentials. For the Keccak contest, the best solutions reached up to 4 rounds in [12, 18]. To the best of our knowledge, there exist neither practical collision attacks against 5-round Keccak[448]/Keccak[512]/ Keccak[768]/Keccak[1024], nor for any 5-round instances of the Keccak contest.

Our contributions Following the framework of Dinur et al. [10], we develop a new algebraic and differential hybrid method to launch collision attacks on Keccak family and present seven real collisions on six Keccak variants, including 5-round SHAKE128, 5-round SHA3-224, 5-round SHA3-256, and 5-round as well as 6-round instances of the Keccak collision contest.

These results are obtained by combining a differential trail and a connector which links the initial state of Keccak to the input difference of the trail. Previously, connectors of Keccak covered one round [10], while in this paper we propose new connectors of up to three rounds. The improvements on connectors are possible mainly due to S-box linearizations of Keccak, i.e., the linearization of the nonlinear layer of Keccak. Specifically, we linearize the first round for constructing 2-round connectors for Keccak. When the degree of freedom of a 2-round connector is sufficiently large, it can be extended to three rounds by further linearizing (part of) the second round.

In this paper, two types of S-box linearizations are proposed, i.e., full linearization and non-full linearization, which are achieved based on several crucial observations. First, we observe that the Keccak S-box can be re-expressed with linear transformations when the input is restricted to certain affine subspaces. Daemen et al. [4, 8] and Dinur et al. [10] have already noted that when the input and output differences of the Keccak S-box are fixed, the solution set forms an affine subspace of dimension up to 3. In this work, we show that the maximum subspaces allowing linearization of the S-box are of dimension 2 and any 2-dimensional affine subspace allows S-box linearization. For affine subspaces of dimension 3, six 2-dimensional affine subspaces out of it could allow the linearization. However, we do not have to fully linearize all S-boxes of the first round if only partial output bits of them need to be linear for constructing 2-round connectors. We further observe that non-full S-box linearizations help to save some degrees of freedom. More specifically, when linearizing part (not all) of the output bits of a non-active S-box (defined by whether there is any difference in the input and output), at most 2 degrees of freedom are consumed by specifying 2 binary linear equations over the input bits. For an active S-box whose entry in the differential distribution table (DDT) is 8, 4 out of 5 output bits are already linear when the input is chosen from the solution set. Note that to restrict the input to the solution set for such an S-box, 2 binary linear equations should be specified over the input bits. Therefore, for both non-active and active S-boxes, two or fewer degrees of freedom may be enough to linearize part of the output bits, while fully linearizing an S-box consumes at least 3 degrees of freedom.

With these properties in mind, we linearize S-boxes of the first round such that their output bits which affect the input difference of the trail that our connector tries to link to become linear. Therefore, the first round function of the Keccak permutation is transformed into a (partial) linear one. Combining with an inversion method of the S-box layer of the second round, we convert the problem of finding 2-round connectors into that of solving a system of linear equations. Solving the system produces sufficiently many solutions so that at least one pair of them will follow the differential trails over the last rounds. To extend the connector to 3 rounds, we first construct a 2-round connector where the first round is fully linearized by enforcing full S-box linearizations. Then upon the 2-round connector, 3-round connectors are obtained through constructing 2-round connectors over the second and third round, where non-full S-box linearizations are applied to the second round.

A side effect of S-box linearization is a quick reduction in the degree of freedom which in turn determines the existence of 2-round or 3-round connectors. To address this problem, we aim to find differential trails that impose the least possible conditions on the connectors. We design dedicated strategies to find suitable differential trails of 3 and 4 rounds. Our GPU implementations confirmed the correctness of this idea and successfully found desirable differential trails for our collisions attacks.

Results obtained in this paper are listed in Table 1, compared with the best previous practical collision attacks and related theoretical attacks. In brief, we obtain actual collisions on three 5-round instances of SHA-3, i.e., SHAKE128, and SHA3-224, SHA3-256, and three instances of Keccak contest. The number of practically attacked rounds of Keccak instances now is increased to 6. For the instance of contest Keccak[1440, 160, 6, 160], an inner collision [3] on the last 160-bit of the output state is mounted to construct collisions from messages of any block length.

Table 1 Summary of our attacks and comparison with related works

Organization The rest of the paper is organized as follows: In Sect.  2, a brief introduction to the SHA-3 hash function is given. Section 3 first presents the notations used in this paper and subsequently provides a synopsis of the collision attacks against SHA-3 and Keccak instances. The properties of S-box linearization and non-full linearization are illustrated in Sect. 4, followed by detailed explanations over how the 2-round and 3-round connectors are constructed. Section 5 outlines the search for differential trails as well as the highly efficient GPU implementations of Keccak. In Sect. 6, the experimental results of collision attacks are given. We conclude the paper in Sect. 7. Details of the differential trails and actual collisions are postponed to Appendix.

2 Description of SHA-3

2.1 The Sponge Function

The sponge construction is a framework for constructing hash functions from permutations, as depicted in Fig. 1. The construction consists of three components: an underlying b-bit permutation f, a parameter r called rate and a padding rule. The value \(c = b - r\) is called capacity. A hash function following this construction takes in a message M as input and outputs a digest of d bits. Given a message M, it is first padded and split into r-bit blocks. The b-bit state is initialized to be all zeros. The sponge construction then proceeds in two phases. In the absorbing phase, each message block is XORed into the first r bits of the state, followed by application of the permutation f. This process is repeated until all message blocks are processed. Then, the sponge construction switches to the squeezing phase. In this phase, each iteration returns the first r bits of the state as (part of) the output and then applies the permutation f to the current state. This repeats until all d bits digest are obtained.

Fig. 1
figure 1

Sponge construction [3]

2.2 The Keccak Hash Function

The Keccak hash function follows the sponge construction. The underlying permutation of Keccak is chosen from a set of seven Keccak-f permutations, denoted by Keccak-f[b], where \(b\in \{25, 50, 100, 200, 400, 800, 1600\}\) is the width of the permutation in bits. The default Keccak employs Keccak-f[1600]. The 1600-bit state can be viewed as a 3-dimensional \(5\times 5 \times 64\) array of bits, denoted as A[5][5][64]. Let \(0\le i,j<5,\) and \(0\le k<64\); A[i][j][k] represents one bit of the state at position (ijk). Defined by the designers of Keccak, \(A[*][j][k]\) is called a row, \(A[i][*][k]\) a column, and \(A[i][j][*]\) a lane. Note the lane size is \(\frac{b}{25}\) which varies according to b.

The Keccak-f permutation has \(12+2l\) rounds (\(l=\mathrm {log}_2\frac{b}{25}\)), each of which consists of five mappings \(R=\iota \circ \chi \circ \pi \circ \rho \circ \theta \).

\(\theta \)::

\(A[i][j][k]\leftarrow A[i][j][k]+ \sum _{j'=0}^4 A[i-1][j'][k]+ \sum _{j'=0}^4 A[i+1][j'][k-1]\).

\(\rho \)::

\(A[i][j]\leftarrow A[i][j]\lll T(i,j),\text {where T(i,j)s are constants}\).

\(\pi \)::

\(A[j][2i+3j]\leftarrow A[i][j]\).

\(\chi \)::

\(A[i][j][k]\leftarrow A[i][j][k]+(A[i+1][j][k]+ 1)\cdot A[i+2][j][k]\).

\(\iota \)::

\(A[0][0]\leftarrow A[0][0]+ RC_{i_r},\)where \(RC_{i_r}\) is the \(i_r\)th round constant.

Here, the additions and multiplications between the terms are in \(\mathsf {GF}(2)\). As \(\iota \) plays no essential role in our attacks, we will ignore it in the rest of the paper unless otherwise stated.

2.3 Instances of Keccak and SHA-3

There are four instances of Keccak, denoted by Keccak[c\(\triangleq \) Keccak[\(r=1600-c,c\)], where the capacity \(c\in \{448,512,768,1024\}\) and the digest length d is half of c. To promote cryptanalysis against Keccak, the Keccak design team also proposed variants with lower security levels in the Keccak contest, where \(b \in \{1600,800,400,200\}\), \((d= 80, c=160)\) for pre-image contest and \((d=160,c=160)\) for collision contest. In this paper, we follow the designers’ notation Keccak\([r,c,n_r,d]\) for the instances in the contest, where r is the rate, \(c=b-r\) is the capacity, d is the digest size, and \(n_r\) is the number of rounds the underlying permutation Keccak-f is reduced to.

The Keccak hash function uses the multi-rate padding rule which appends to the original message M a single bit 1 followed by the minimum number of bits 0 and a single bit 1 such that the length of the resulting message is a multiple of the block length r. Namely, the padded message \(\overline{M}\) is \(M \Vert 10^*1\).

The SHA-3 standard adopts the four Keccak instances and names them SHA3-224, SHA3-256, SHA3-384 and SHA3-512, respectively. In these four instances of SHA-3, the message is appended ‘01’ first. After that, the multi-rate padding is applied. The SHA-3 standard also contains two extendable-output functions named SHAKE128 and SHAKE256, which are defined from two instances of Keccak with the capacity c being 256 and 512 respectively and the digest of any length. For SHAKE, a four-bit suffix ‘1111’ is added to the original message M before applying the multi-rate padding.

3 Notations and Overview

3.1 Notations

We summarize the major notations to be used in this paper here.

c:

Capacity of a sponge function

r:

Rate of a sponge function

b:

Width of a Keccak permutation in bits, \(b=r+c\)

d:

Length of the digest in bits

\(d^{\dashv }\):

Length of collision on the last bits

p:

Number of fixed bits in the initial state due to padding

\(n_r\):

Number of rounds

\(\theta ,\rho ,\pi ,\chi ,\iota \):

The five mappings that comprise a round. A subscript i denotes the mapping at the ith round, e.g., \(\chi _i\) denotes the \(\chi \) layer at the ith round for \(i = 0, 1, 2,\ldots \).

L:

Composition of \(\theta ,\rho ,\pi \) and its inverse denoted by \(L^{-1}\)

\(RC_i\):

Round constant for the ith round, \(i= 0,1,2,\ldots \)

\(\mathtt{R}^i(\cdot )\):

Keccak permutation reduced to the first i rounds

\(\mathtt{S}(\cdot )\):

5-bit S-box operating on each row of Keccak state

\(\delta _{\mathrm{in}},\delta _\mathrm{out}\):

5-bit input and output differences of an S-box,

\(\mathtt{DDT} \):

Differential distribution table, and \(\mathtt{DDT} (\delta _{\mathrm{in}},\delta _\mathrm{out})=~|\{x:\mathtt{S}(x)+\mathtt{S}(x+\delta _{\mathrm{in}})=\delta _\mathrm{out}\}|~\), where \(|\cdot |\) denotes the size of a set.

\(\alpha _i\):

Input difference of the ith round function, \(i= 0,1,2,\ldots \)

\(\beta _i\):

Input difference of \(\chi \) in the ith round, \(i= 0,1,2,\ldots \)

\(w_i\):

Weight of the ith round, \(w_i\), \(i= 0,1,2,\ldots \)

\({\mathtt{DF}} \):

Degree of freedom of the solution space of connectors

\(\overline{M}\):

Padded message of M. Note that \(\overline{M}\) is of one block in our attacks.

\(M_1||M_2\):

Concatenation of strings \(M_1\) and \(M_2\)

3.2 Overview of the Attack

In this subsection, we give an overview of our collision attacks. Following the framework by Dinur et al. [10], as well as many other collision attacks utilizing differential trails, our collision attacks consist of two parts, i.e., a high-probability differential trail and a connector linking the differential trail with the initial state, as depicted in Fig. 2. Let \(\varDelta S_{I}\) and \(\varDelta S_{O}\) denote the input and output differences of the differential trail, respectively. A connector covering \(n_{r_1}\) rounds produces message pairs \((M,M')\) such that

$$\begin{aligned} \mathtt{R}^{n_{r_1}}({\overline{M}} || 0^{c}) + \mathtt{R}^{n_{r_1}}(\overline{M'}||0^{c}) = \varDelta {S_{I}} \end{aligned}$$

always holds, i.e., the difference after \(n_{r_1}\) rounds is always \(\varDelta S_{I}\). The differential trail is then fulfilled probabilistically with many such message pairs, and collisions can be found if the first d bits of \(\varDelta S_{O}\) are zero.

Fig. 2
figure 2

Overview of (\(n_{r_1}+n_{r_2}\))-round collision attacks

Given an \(n_{r_2}\)-round differential of probability \(2^{-w}\), our (\(n_{r_1}+n_{r_2}\))-round collision attack proceeds in two stages:

  • Stage 1 — Connecting stage: Construct an \(n_{r_1}\)-round connector and get a subspace of messages bypassing the first \(n_{r_1}\) rounds.

  • Stage 2 — Brute-force searching stage: By brute force, find a colliding pair following the \(n_{r_2}\)-round differential trail from the subspace found in stage 1. The time complexity of this stage is \(2^w\).

The brute-force searching stage is simple, though it may be time-consuming. Therefore, the core steps of the attack are finding good differentials and constructing connectors. In [10], Dinur et al. explored a method, which they call target difference algorithm, to construct 1-round connectors, namely to find message pairs \((M, M')\) such that the output difference after one-round permutation is \(\varDelta S_{I}\). In the next section, we show an algebraic method to extend this connector to two and three rounds, followed by details of the differential trail search in Sect. 5.

Without further specification, we assume in this paper the messages used are of one block after padding. To fulfill the Keccak padding rule, one needs to fix the last bit of the padded message to be ‘1’; hence, the first \(r-1\) bits of the state are under the full control of the attacker through the message bits, and the last c bits of the state are fixed to zero as in the initial state specified by Keccak. When applied to SHA3-d, \(d\in \{224,256,384,512\}\) (resp. SHAKE), there are \(r-4\) (resp. \(r-6\)) free bits under control, by setting the last 4 (resp. 6) bits of the padded message to be ‘1110’ (resp. ‘111111’) so to be compatible with the specific padding rule. For convenience, the number of extra fixed bits is denoted by p, and only \(r-p\) bits are under control. Overall, the constraints of \(n_{r_{1}}\)-connectors are that the last \(c+p\) bits of the initial state are fixed, and that the output difference after \(n_{r_1}\) rounds is given and fixed. (This is determined by the differential trail to be used.) We are to utilize the degree of freedom from the first \(r-p\) bits of the initial state to find solutions efficiently. As we are aiming at low-complexity attacks, finding solutions of connectors should be practical. For the same reason, the probability of the differential trail should be sufficiently high.

4 S-box Linearization and Connectors

We observe that the number of degrees of freedom for Keccak instances is \(r-p\) which is large for instances like Keccak[1440, 160, 5, 160]. One can then choose some message subsets with special properties such that the starting difference of a given trail is fulfilled deterministically. Since we intend to extend the connector to more than one round, it is natural to consider linearizing the nonlinear mapping \(\chi \) of the first round by exploiting degrees of freedom, i.e., the expression of each S-box in the first round can be rewritten as a linear transformation when the input is restricted to certain subsets. Once \(\chi \) is linearized, the entire first round becomes linear and then 2-round connectors are possible by applying the core ideas of Dinur et al.’s target difference algorithm [10] to the second round.

In this section, we elaborate on techniques for linearizing the Keccak S-box, both fully and partially, based on which 2-round and 3-round connectors are achieved.

4.1 S-box Linearization

It is obvious that the Keccak S-box is nonlinear when the entire \(2^5\)-sized input space is considered. However, affine subspaces of size up to 4, as to be shown below, could be found so that the S-box can be linearized. Note that the S-box is the only nonlinear mapping of the Keccak round function. Hence, the entire round function becomes linear when all S-boxes are restricted to such subspaces. Formally, we define:

Definition 1

(Linearizable affine subspace) Linearizable affine subspaces are affine input subspaces on which S-box substitution is equivalent to a linear transformation. If V is a linearizable affine subspace of an S-box operation \(\mathtt{S}(\cdot )\), then \(\forall x\in V, \mathtt{S}(x)=A\cdot x+b\), where A is a \(5 \times 5\) matrix and b is a 5-bit constant vector.

For example, when input is restricted to the affine subspace \(\{\mathtt{00000,00001,00100}\), \(\mathtt{00101}\}\) (\(\{\mathtt{00,01,04,05}\}\) in hex), where the bits are written as \(x_4x_3x_2x_1x_0\), the corresponding output set of the Keccak S-box is \(\{\mathtt{00000,01001,00101,01100}\}\) (\(\{\mathtt{00,09,05,0C}\}\) in hex), and the expression of the S-box can be rewritten as a linear transformation:

$$\begin{aligned}y= \begin{pmatrix} 1&{}\quad 0&{}\quad 1&{}\quad 0&{}\quad 0 \\ 0&{}\quad 1&{}\quad 0&{}\quad 0&{}\quad 0\\ 0&{}\quad 0&{}\quad 1&{}\quad 0&{}\quad 0\\ 1&{}\quad 0&{}\quad 0&{}\quad 1&{}\quad 0\\ 0&{}\quad 0&{}\quad 0&{}\quad 0&{}\quad 1 \end{pmatrix} \cdot x \end{aligned}$$

where x and y are bit vector representations of input and output values of the Keccak S-box with \(x_0\) at the top. By rotation symmetry, four more linearizable affine subspaces can be deduced from the one above.

Exhaustive search for the linearizable affine subspaces of the Keccak S-box shows:

Observation 1

Out of the entire 5-dimensional input space,

  1. a.

    There are totally 80 2-dimensional linearizable affine subspaces, as listed in Table 7 in Appendix A.

  2. b.

    There does not exist any linearizable affine subspace with dimension 3 or more.

For completeness, any 1-dimensional subspace is always linearizable.

Since the affine subspaces are to be used together with differential trails, we are especially interested in linearizable affine subspaces with fixed input and output differences, which is more relevant with the differential distribution table (DDT) of the S-box. We observe:

Observation 2

Given a 5-bit input difference \(\delta _\mathrm{in}\) and a 5-bit output difference \(\delta _\mathrm{out}\) such that \(\mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})\ne 0\), i.e., the solution set \(V=\{x:\mathtt{S}(x)+\mathtt{S}(x+\delta _\mathrm{in})=\delta _\mathrm{out}\}\) is not empty,

we have

  1. a.

    if \(\mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})=2\) or 4, then V is a linearizable affine subspace.

  2. b.

    if \(\mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})=8\), then there are six 2-dimensional subsets \(V_i\subset V,i=0,1,\ldots ,5\) such that \(V_i (i=0,1,\ldots ,5)\) are linearizable affine subspaces.

It is interesting to note the 2-dimensional linearizable affine subspaces obtained from the analysis of DDT cover all the 80 cases in Observation 1. It is already noted in [15] there is one-to-one correspondence between linearizable affine subspaces and entries with value 2 or 4 in DDT. As for the DDT entries of value 8, there are 6 choices of 2-dimensional linearizable affine subspaces. As an example, the 3-dimensional affine subspace corresponding to \(\mathtt{DDT} (\mathtt{01,01})\), i.e., with both input and output differences being 01, is \(\{\mathtt{10, 11, 14, 15, 18, 19, 1C, 1D}\}\) and the six 2-dimensional linearizable affine subspaces from it are

$$\begin{aligned} \begin{aligned}&\{\mathtt{10, 11, 14, 15}\}, \{\mathtt{10, 11, 1C, 1D}\}, \{\mathtt{14, 15, 18, 19}\}, \\&\{\mathtt{10, 11, 18, 19}\}, \{\mathtt{18, 19, 1C, 1D}\}, \{\mathtt{14, 15, 1C, 1D}\}.\\ \end{aligned} \end{aligned}$$

When projected to the whole Keccak state, the direct product of affine subspaces of each individual S-box forms affine subspaces of the entire state with larger dimensions. In other words, when all the S-boxes in the round function are linearized, the entire round function becomes linear.

4.2 Non-full S-box Linearization

Suppose we are to construct a connector of two rounds. Given the input difference \(\varDelta S_I\) of the differential trail, i.e., the output difference of the connector \(\alpha _2\), it may be the case that some output bits of the S-boxes in the first round of the connector do not influence \(\varDelta S_I\). The S-boxes with such output bits do not need full linearizations. That is, it is enough to restrict the input to certain subsets such that only the output bits of the first round affecting \(\varDelta S_I\) are to be expressed with a linear relation with input bits. Such linearization is called non-full S-box linearization and may contribute to a lower consumption of degree of freedom, as to be shown in Observations 3 and 4 below.

Before presenting the two observations, let us introduce some notations. Let the input and output value of \(\chi _0\) (the \(\chi \) mapping of the first round) be x and y, respectively. Let \(u=(u_0, u_1,\ldots ,u_{b-1})\) be a flag vector where \(u_i = 1\)\((0\le i< b)\) if \(y_i\) influences \(\varDelta S_I\); otherwise, \(u_i=0\). Let \(U=(U_0,U_1,\ldots ,U_{b/5-1})\) where \(U_i=u_{5i}u_{5i+1}u_{5i+2}u_{5i+3}u_{5i+4}\), \(0\le i< b/5\). For the ith S-box of \(\chi _0\), if \(U_i\) is not 00000, aka. some bits of the corresponding S-box will influence \(\varDelta S_I\), this S-box should be fully or partially linearized. The value of u is determined by \(\beta _1\) and \(\alpha _2\) (\(\varDelta S_I\)), which will be further illustrated in the next subsection. In this subsection, we suppose u is already known and focus on non-full S-box linearizations.

Observation 3

For a non-active Keccak S-box, when \(U_i\) is not 11111,

  1. a.

    if \(U_i=00000\), it does not require any linearization;

  2. b.

    if \(U_i\in \{{\mathtt{00001, 00010, 00100, 01000, 10000, 00011, 00110, 01100,}}\,{\mathtt{11000, 100}}\,{\mathtt{01}}\}\), at least 1 degree of freedom is consumed to linearize the output bit(s) of the S-box marked by \(U_i\);

  3. c.

    otherwise, at least 2 degrees of freedom are consumed to linearize the output bits of the S-box marked by \(U_i\).

This observation comes from the algebraic relation between the input and output of \(\chi \). Suppose the 5-bit input of the S-box is \(x_4x_3x_2x_1x_0\) and the 5-bit output \(y_4y_3y_2y_1y_0\). Then the algebraic normal forms of the S-box are as follows.

$$\begin{aligned} y_{0}&=x_{0} + (x_{1}+1)\cdot x_{2},\\ y_{1}&=x_{1} + (x_{2}+1)\cdot x_{3},\\ y_{2}&=x_{2} + (x_{3}+1)\cdot x_{4},\\ y_{3}&=x_{3} + (x_{4}+1)\cdot x_{0},\\ y_{4}&=x_{4} + (x_{0}+1)\cdot x_{1}. \end{aligned}$$

Take \(U_i=\mathtt{00001}\) as an example. It indicates that \(y_0\) should be linearized. As can be seen, the only nonlinear term in the expression of \(y_0\) is \(x_1\cdot x_2\). Fixing the value of either \(x_1\) or \(x_2\) makes \(y_0\) linear. Without loss of generality, the value of \(x_1\) is assumed to be fixed to 0 or 1. When \(x_1=0\), we have \(y_0=x_0+x_2\); otherwise, \(y_0=x_0\). When \(U_i=\mathtt{01111}\), it maps to 4 output bits \(y_0, y_1, y_2, y_3\) which should be linearized. We can fix the value of two bits \(x_2\) and \(x_4\) only. Once \(x_2\) and \(x_4\) are fixed, the nonlinear terms in the algebraic form of all \(y_0, y_1, y_2, y_3\) will disappear. Other cases work similarly. If \(U_i=\mathtt{11111}\), a full linearization is required by fixing the value of any three input bits which are not cyclically continuous, e.g., \((x_0, x_2, x_4)\).

As noted in Observations 1 and 2, it costs at least three degrees of freedom to fully linearize an S-box, even if it has \(\mathtt{DDT} \) entry of 8. However, Observation 4 shows that two degrees of freedom may be enough to partially linearize an S-box of \(\mathtt{DDT} \) entry 8, and thus 1 bit degree of freedom could be saved.

Observation 4

For a 5-bit input difference \(\delta _\mathrm{in}\) and a 5-bit output difference \(\delta _\mathrm{out}\) such that \(\mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})= 8\), 4 out of 5 output bits are already linear if the input is chosen from the solution set \(V=\{x~:~\mathtt{S}(x)+\mathtt{S}(x+\delta _\mathrm{in})=\delta _\mathrm{out}\}\).

Take \(\mathtt{DDT} (\mathtt{01,01})= 8\) as an example (see Table 6 of [24]). The solution set is \(V=\{\mathtt{10,11,14,15,18,19,} \mathtt{1C,1D}\}\). We rewrite these solutions in the following 5-bit strings where the bits are written as \(x_4x_3x_2x_1x_0\).

It is easy to see for the values from this set, \(x_1=0\) and \(x_4=1\) always hold, making \(y_0,y_2,y_3,y_4\) linear since their algebraic forms could be rewritten as

$$\begin{aligned} y_{0}&=x_{0} + x_{2},\\ y_{1}&=(x_{2}+1)\cdot x_{3},\\ y_{2}&=x_{2} + x_{3}+1,\\ y_{3}&=x_{3},\\ y_{4}&=1. \end{aligned}$$

If the only nonlinear bit \(y_1\) does not influence \(\varDelta S_I\), setting \(x_1=0\) and \(x_4=1\) (consuming two degrees of freedom) is enough for linearizing the remaining four bits.

Therefore, for an S-box, both active and non-active, it may consume less than three degrees of freedom for non-full linearizations, which allows to producing relatively large message spaces for the brute-force collision searching stage.

4.3 2-Round Connector

The core idea of our 2-round connector is to convert the problem to that of solving a system of linear equations. Note the first two rounds of Keccak permutation can be expressed as \(\chi _1\circ L \circ \chi _0 \circ L\) (omitting the \(\iota \)), as depicted in Fig. 3. Using the techniques discussed above, \(\chi _0\) which has fixed input and output differences can be (partially) linearized, i.e., the first three operations \( L \circ \chi _0 \circ L\) become (partially) linear. For convenience, the differential transition of the ith round is denoted by \(\alpha _{i-1}\xrightarrow {L}\beta _{i-1}\xrightarrow {\chi }\alpha _i\), where \(i=1,2,\ldots \). We will give details of the method how input and output differences of \(\chi _0\), i.e., \(\beta _0\) and \(\alpha _1\), are selected later. Now, we show how the \(\chi _1\) can be inverted by adding more linear equations of constraints.

Fig. 3
figure 3

First two rounds of Keccak permutation

In the setting of a 2-round connector, the output difference of \(\chi _1\), i.e., \(\alpha _2\), is given as \(\varDelta S_I\)—the input difference of the differential trail. It is not necessary that all S-boxes of \(\chi _1\) are active, i.e., with a nonzero difference. Here only active S-boxes are concerned, and each of them is inverted by randomly choosing an input difference with a nonzero number of solutions which is called compatible input difference. Formally, given the output difference \(\delta _\mathrm{out}\) for the Keccak S-box, its compatible input differences are from the set \(\{\delta _\mathrm{in}: \mathtt{DDT} (\delta _\mathrm{in}, \delta _\mathrm{out}) \ne 0\}\). As noted previously in [4, 8, 10], for any pair of \((\delta _\mathrm{in}, \delta _\mathrm{out})\), the solution set \(V = \{x : \mathtt{S}(x) +\mathtt{S}(x+\delta _\mathrm{in}) = \delta _\mathrm{out}\}\) forms an affine subspace. In other words, V can be deduced from the set \(\{0,1\}^5\) by setting up i constraints that turn out to be binary linear equations, when the size of the solution set V is \(2^{5-i}\). For example, corresponding to \(\mathtt{DDT} (\mathtt{03,02})\) is the 2-dimensional affine subspace \(\{\mathtt{14, 17, 1C, 1F}\}\) which can be formulated by the following three linear equations as the constraints:

$$\begin{aligned} \begin{pmatrix} 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0\\ 1 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \end{pmatrix} \cdot x = \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix}. \end{aligned}$$

It is important to note, under the i linear equations or set V, \(\delta _\mathrm{in}\) propagates to \(\delta _\mathrm{out}\) deterministically. Hence, each active S-box in \(\chi _1\) is inverted by a choice of compatible input difference together with the corresponding i linear equations on the input values. Once the input difference and linear equations for all active S-boxes of \(\chi _1\) are enforced and fulfilled together with the linearization of the first round, solutions of the 2-round connector are found. Note that a compatible input difference of \(\chi _1\) is a choice of \(\beta _1\), and \(\alpha _1\) can be uniquely determined by the relation \(\alpha _1 = L^{-1}(\beta _1)\). In the remaining part of this subsection, more details on the implementation of this idea are given.

Generation of Linear Equations As depicted in Fig. 3, the variables of the equation system are the bit values before \(\chi _0\) denoted by vector x. Additionally, y and z are bit vectors of intermediate values for further interpretation where y represents the output after \(\chi _0\) and z the bits before \(\chi _1\). The main task is to derive all constraints on differences and affine subspaces into that on the variables x. Suppose \(\beta _1\) and \(\beta _0\) (details will be given in Sect. 4.6) are fixed, and \(\varDelta S_I\) (aka.\(\alpha _2\)) is given, we show how the system of equations could be set up.

With the input difference \(\beta _1\) and output difference \(\alpha _2\) of \(\chi _1\), all the linear equations on the input affine subspaces of the active S-boxes in the second round can be derived and expressed as

$$\begin{aligned} B\cdot z=t_B, \end{aligned}$$
(1)

where B is a block diagonal matrix in which each diagonal block together with corresponding constants in \(t_B\) formulates the equations of one active S-box.

A similar procedure can be done for input affine subspaces of the first round given \(\beta _0\) and \(\alpha _1\), and the corresponding linear equations on x are denoted by

$$\begin{aligned} A_1\cdot x=t_{A_1}. \end{aligned}$$
(2)

Note an additional constraint x needs to fulfill is that the last \((c+p)\)-bit of initial state are pre-fixed, which can be expressed as

$$\begin{aligned} A_2\cdot x=t_{A_2}. \end{aligned}$$
(3)

where \(A_2\) is a submatrix of \(L^{-1}\) and \(t_{A_2}\) is the preset value. Apart from Eqs. (2) and (3), some extra equations are imposed to linearize \(\chi _0\). To derive these equations, we need to specify the flag vector u identifying the output bits of \(\chi _0\) which affect \(\varDelta S_I\) and should be linearized. In fact, the value of u is determined by Eq. (1) which can be re-expressed as

$$\begin{aligned} B\cdot L \cdot (y+RC_0)=t_B, \end{aligned}$$
(4)

since \(z = L\cdot (y+RC_0)\). Then \(u_i=1\) if \(y_i\) appears in Eq. (4) with a nonzero coefficient; otherwise, \(u_i=0\). Recall that \(U=(U_0,U_1,\ldots ,U_{b/5-1})\) where \(U_i=u_{5i}u_{5i+1}u_{5i+2}u_{5i+3}u_{5i+4}\), \(0\le i< b/5\). According to Observation 2, S-boxes with DDT entry being 2 or 4 are already linearized due to the fact that the input is confined to the solution set \(V = \{x : \mathtt{S}(x) +\mathtt{S}(x+\delta _\mathrm{in}) = \delta _\mathrm{out}\}\) by linear equations in (2). So we only need to handle non-active S-boxes and active S-boxes with DDT entry 8.

Full S-box linearization.\(U_i=\mathtt{11111}\) indicates that fully linearizing the ith S-box is required.

  • For non-active S-boxes, randomly choose any 2-dimensional linearizable affine subspace from 80 such subspaces by imposing 3 linear equations on x.

  • For active S-boxes with DDT entry 8, randomly choose any 2-dimensional linearizable affine subspace from 6 such subspaces by imposing an extra linear equation on x. [Two linear equations are already included in Eq. (2).]

Non-full S-box linearization.\(U_i\ne \mathtt{11111}\) indicates that only part of the output bits of the ith S-box needs to be linearized.

  • For non-active S-boxes, linearize the outputs marked by u by imposing a least possible number of equations according to Observation 3. At most 2 linear equations are required.

  • For each active S-box with DDT 8, if its only nonlinear output bit is not marked by U, then we need not handle it. Otherwise, impose an extra linear equation on x in a similar way as what we do for full S-box linearizations.

We denote these extra linear equations for linearizing \(\chi _0\) by

$$\begin{aligned} A_3\cdot x=t_{A_3}. \end{aligned}$$
(5)

Suppose bits in y marked by u now can be computed from x linearly through

$$\begin{aligned} y = L_{\chi _0}\cdot x + t_{L_{\chi _0}}, \end{aligned}$$
(6)

and further an equivalent system of Eq. (4) can be obtained as

$$\begin{aligned} B\cdot L \cdot L_{\chi _0} \cdot x + B\cdot L \cdot (t_{L_{\chi _0}}+RC_0) = t_B. \end{aligned}$$
(7)

We merge the four systems of linear equations on x together, i.e., Eqs. (2), (3), (5) and (7), and denote the resulting system of linear equations by \(E_M\):

$$\begin{aligned} A\cdot x=t_{A}. \end{aligned}$$
(8)

Then, solutions fulfilling \(E_M\) will be solutions of the 2-round connector.

figure a

Algorithm of the 2-Round Connector Since Eq. (3) remains static, and given differences \((\beta _0,\alpha _1,\beta _1,\alpha _2)\), Eqs. (2) and (1) are determined, the core part is the generation of Eq. (5), i.e., linearizing \(\chi _0\), which is sketched in the procedure basicLinearization (see Algorithm 2). The procedure mainLinearization (see Algorithm 1) invokes basicLinearization and generates the final system of linear equations \(E_M\) whose solutions are exactly the solutions of the 2-round connector.

The inputs to mainLineariztion are \(\beta _0,\alpha _1, \beta _1, \alpha _2 (\varDelta S_I)\) and \(E_M\), where \(E_M\) is initialized with Eqs. (2) and (3). Similarly, basicLinearization takes as input the same \(\beta _0,\alpha _1\) and \(E_M\), as well as a flag vector U generated in mainLinearization. If basicLinearization succeeds, it will return to mainLinearization the (partial) linear mapping of \(\chi _0\), i.e., \((L_{\chi _0}, t_{L_{\chi _0}})\), and the updated \(E_M\). Further, the success of mainLinearization will give rise to the final system of equations \(E_M\).

In basicLinearization, a procedure named preProcess is called on Line 1 which is described more at length in Algorithm 3. preProcess identifies all possible S-boxes of \(\chi _0\) whose output bits marked by U are already linear under the initial \(E_M\), so the number of equations in Eq. (5) will be minimized. In fact, the conditions of Lines 11 and 16 would never hold if \(E_M\) does not contain Eq. (3), and they hold more likely for large Eq. (3), i.e., large \(c+p\). Therefore, preProcess is crucial to the success of solving Keccak instances with a relatively large capacity, such as SHA3-256.

figure b

Note that Algorithms 1 and 2 do not succeed all the time. To overcome this problem, we repeat random picks of compatible input differences \(\beta _1\) (\(\beta _0\) changes accordingly) for a given \(\varDelta S_I\), until the main procedure succeeds.

4.4 3-Round Connector

We construct 3-round connectors upon a 2-round connector. Since almost all output bits of the first round affect \(\alpha _3\) (the difference after the third round), full linearizations are enforced on the first nonlinear layer \(\chi _0\) while constructing a 2-round connector for the first two rounds. Once a 2-round connector succeeds, the resulting system of equations \(E_M\) on x can be re-expressed as an equivalent system of equations \(E'_M\) on z, i.e., the input value of the second round. It can be considered that the first round disappeared. Next, given \(E'_M\), a 2-round connector is to construct for the second and the third rounds where the second round is partially linearized. Once the second 2-round connector succeeds, its solutions will bypass the first three rounds. That is to say, a 3-round connector is obtained.

Due to a great consumption of degree of freedom, the solution space of a 3-round connector may have a dimension smaller than the weight of the following differential, making it hard to find a real collision with one solution space. However, this problem can be solved by constructing the second 2-round connectors repeatedly. In these 3-round connectors, the degrees of freedom for linearizing the second round can be reused and hence are not consumed. Thus, multiple solution spaces can be generated successively if one is not enough. Such successively generated 3-round connectors are called adaptive 3-round connectors, for which a detailed description can be found in Section 7.2 of [26].

4.5 Analysis of Degrees of Freedom

Let the degree of freedom of the final \(E_M\) for the connector be \({\mathtt{DF}} \). In our collision attacks, \({\mathtt{DF}} \) is a key factor on success. A solution space with \({\mathtt{DF}} \) larger than the weight of the differential trail is possible to suggest a message pair with digest collision. After linearizing the first round, the degree of freedom is \(\sum _{i=0}^{b/5-1}{\mathtt{DF}} ^{(1)}_i\) in which \({\mathtt{DF}} ^{(1)}_i\) is the degree of freedom of 5-bit input space of the ith S-box in the first round and calculated according to rules in Table 2. This concerns not only the transition from \(\beta _0\) to \(\alpha _1\), but also the S-box linearizations.

Table 2 Value of \({\mathtt{DF}} ^{(1)}_i\)

The constraints on the initial state reduce \((c+p)\) degrees of freedom for the pre-fixed \((c+p)\)-bit. Another decrement on the degree of freedom is due to the constraints on the input values of the S-box layer in the second round. The definition of \({\mathtt{DF}} ^{(2)}_i\), the degree of freedom of 5-bit input values to S-boxes in the second round, is

$$\begin{aligned} {\mathtt{DF}} ^{(2)}_i= {\left\{ \begin{array}{ll} 1, &{}\quad \mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})=2,\\ 2, &{}\quad \mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})=4,\\ 3, &{}\quad \mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})=8,\\ 5, &{}\quad \mathtt{DDT} (\delta _\mathrm{in},\delta _\mathrm{out})=32, \end{array}\right. } \end{aligned}$$
(9)

where \(\delta _\mathrm{in}\) and \(\delta _\mathrm{out}\) are the input and output differences of the ith S-box in the second round. For the ith S-box in the second round, we add \((5-{\mathtt{DF}} ^{(2)}_i)\) equations to \(E_M\) and suppose to deduce the degree of freedom by this amount.

Therefore, the degree of freedom of the final \(E_M\) of our 2-round connector is estimated as

$$\begin{aligned} {\mathtt{DF}} =\sum _{i=0}^{b/5-1}{\mathtt{DF}} ^{(1)}_i-(c+p)-\sum _{i=0}^{b/5-1}(5-{\mathtt{DF}} ^{(2)}_i). \end{aligned}$$
(10)

Large \({\mathtt{DF}} \) benefits our search for collisions in rounds beyond the second round, and sufficiently large \({\mathtt{DF}} \)s (say > 140) may allow 3-round connectors.

4.6 How to Choose \(\beta _2\), \(\beta _1\) and \(\beta _0\)

Choosing\(\beta _1\)in the 2-Round Connector Recall that we randomly choose compatible input differences \(\beta _1\) according to \(\varDelta S_I\) (\(\alpha _2\)) until the 2-round connector succeeds. As the number of active S-boxes in the second round is large enough (range from tens to hundreds in our attacks), there is a huge number of compatible choices of \(\beta _1\) so that we can only choose those \(\beta _1\) such that \(\beta _1\rightarrow \alpha _2\) is of the best probability for the given \(\alpha _2\). This allows a speedup of 2-round connectors, as well as an increment of \({\mathtt{DF}} \). Interestingly, the differential probability of \(\beta _1\rightarrow \alpha _2\) does not need to be high because this transition is covered in our 2-round connector.

Choosing\(\beta _0\)in the 2-Round Connector So far we have not given details on how \(\beta _0\) can be selected. We follow Dinur et al.’s work [10] in a more general way to uniquely determine \(\beta _0\), namely the difference before \(\chi \) layer of the first round. Specifically, the so-called target difference algorithm is used, which consists of a difference phase and a value phase.

Given \(\varDelta S_{I}\), we have randomly chosen a compatible input difference \(\beta _1\). We then build two equation systems \(E_{\varDelta }\) and \(E_{M}\) accordingly. \(E_{\varDelta }\) is on differences of the message pairs (specifically, on \(\beta _0\)), and \(E_{M}\) is on the value of one message (specifically, on the value of the message before \(\chi _0\), i.e., x, as shown in Fig. 3). The initialization of \(E_{\varDelta }\) should abide by (1) the constraints implied by the \((c+p)\)-bit pre-fixed value that the corresponding difference should be 0, and (2) the input difference bits of non-active S-boxes in the first round equal to 0. The initialization of \(E_M\) should abide by the pre-fixed value of the last \(c+p\) bits. These rules are easy to be implemented as the variable vector x is the image of the initial vector by an invertible linear mapping. Therefore, in the initialization we equate the corresponding bits to their enforced values in \(E_{\varDelta }\) and \(E_M\).

For \(E_{\varDelta }\), we add additional equations so that its solution, i.e., \(\beta _0\), is compatible with \(\alpha _1\). Though the obvious way is to equate the 5 input difference bits to a specific value for each active S-box in the first round, this will restrict the solution space significantly. As suggested in [10], we chose one of the 2-dimensional affine subsets of input differences instead of a specific value for each active S-box. This is based on the fact that given any nonzero 5-bit output difference to a Keccak S-box, the set of possible input differences contains at least five 2-dimensional affine subspaces. After a consistent \(E_{\varDelta }\) system is constructed, the solution space is an affine subspace of candidates for \(\beta _0\). Then we continue to maintain \(E_{\varDelta }\) by iteratively adding two additional equations to uniquely specify a 5-bit input difference for the active S-boxes. For each active S-box, once the specific input difference is determined, we add equations to \(E_M\) system to enforce every active 5-bit of x (input bits to active S-box) to an affine subspace corresponding to the uniquely determined \(\delta _\mathrm{in}\) and \(\delta _\mathrm{out}\). In this way, we always find a compatible \(\beta _0\) for \(\alpha _1\) that fulfills the constraints imposed by the \((c+p)\)-bit pre-fixed value.

Choosing\(\beta _2\)in the 3-Round Connector\(\beta _2\) is chosen before running the 3-round connector. Its value fulfills two requirements: (1) Given \(\alpha _2=L^{-1}(\beta _2)\), a 2-round connector for the first two rounds can be obtained with a sufficiently large degree of freedom; and (2) \(\beta _2\rightarrow \alpha _3\) is of high probability where \(\alpha _3\) is the input difference of the following differential trail. While constructing 3-round connectors, specifically the second 2-round connector, \(\beta _2\) is fixed and does not change. In this way, the second 2-round connector can be constructed efficiently thanks to the high probability of \(\beta _2\rightarrow \alpha _3\).

5 Search for Differential Trails with GPU

In this section, we elaborate on searching differential trails of Keccak. The idea of searching differential trails greatly benefits from previous differential analysis of Keccak [9, 14, 19, 21]. The associated techniques developed in the previous works are reviewed first, followed by the considerations in finding differential trails that fit in our attacks. Subsequently, the exact searching algorithm that provides differential trails for practical collision attacks against Keccak[640, 160, 5, 160], 5-round SHAKE128, 5-round SHA3-224, 5-round SHA3 256, Keccak[1440, 160, 5, 160] and Keccak[\(1440,160,6,160\)] is illustrated. To speed up the searching process, we introduce GPU implementations of Keccak. The differential trails to be used in our collision attacks are listed at the end of this section.

5.1 Differential Properties of Keccak

In this subsection, we recall special properties of the linear and nonlinear layers of Keccak round function which have been identified in previous works. In the following paragraphs, we follow the notations from [9].

Key Properties used in Differential Analysis The concept of column parity, defined in [4], is the most critical property of \(\theta \) operation. Formally, the column parity (or parity for short) P(A) of a value (or a difference) A is defined as the parity of the columns of A, i.e., \(P(A)[i][k]=\varSigma _j A[i][j][k]\). A column is even, if its parity is 0; otherwise, it is odd. A state is in column parity kernel (CP-kernel, for short) if all of its columns are even. \(\theta \) adds a pattern, called the \(\theta \)-effect, to the state. The \(\theta \)-effect of a state A is \(E(A)[i][k]=P(A)[i-1][k]+P(A)[i+1][k-1]\). Thus, the effect of \(\theta \) depends only on column parities. Given a state A in CP-kernel, the Hamming weight of A remains unchanged after \(\theta \). Another interesting property is that \(\theta ^{-1}\) diffuses much faster than \(\theta \). To be precise, a single bit difference can propagate into half-state bits through \(\theta ^{-1}\) on average.

Regarding the nonlinear operation S-box, given a nonzero input difference, all compatible output differences occur with the same probability. Particularly, for input differences with one active bit, the S-box acts as identity with probability \(2^{-2}\). However, given an output difference of the S-box, the probabilities of compatible input differences vary even though the best probability is determined. When the best probability is only \(2^{-3}\), there are multiple input differences achieving this. Therefore, given an output difference of \(\chi \), usually there are multiple input differences fulfilling the best probability.

Representation of Trails and Their Weights As in previous sections, we denote the differences before and after the ith round by \(\alpha _{i-1}\) and \(\alpha _{i}\), respectively. Let \(\beta _i = L(\alpha _i)\), then an n-round differential trail starting from the first round is of the following form

$$\begin{aligned} \alpha _0\xrightarrow {L}\beta _0\xrightarrow {\chi }\alpha _1\xrightarrow {L}\cdots \alpha _{n-1}\xrightarrow {L}\beta _{n-1}\xrightarrow {\chi }\alpha _n. \end{aligned}$$

For simplicity, a trail can also be represented with only \(\beta _i\)’s or \(\alpha _i\)’s.

The weight of a differential \(\beta \rightarrow \alpha \) over a function f with domain \(\{0,1\}^b\) is defined as

$$\begin{aligned} w(\beta \rightarrow \alpha )=b-\mathrm {log}_2|\{x:f(x)\oplus f(x\oplus \beta )=\alpha \}|. \end{aligned}$$

In other words, the weight of a differential \(\beta \rightarrow \alpha \) is equal to \(-\mathrm {log}_2\mathrm {Pr}(\beta \rightarrow \alpha )\). \(\alpha \) and \(\beta \) are compatible when \(\mathrm {Pr}(\beta \rightarrow \alpha )>0\); otherwise, the weight of \(\beta \rightarrow \alpha \) is undefined.

The weight \(w(\beta _i \rightarrow \alpha _{i+1})\) is denoted by \(w_i\), and thus the weight of a trail is the sum of the weights of round differentials that constitute the trail. In addition, \(\#\mathrm {AS}(\alpha )\) is used to represent the number of active S-boxes in a state difference \(\alpha \). According to the properties of \(\chi \), given \(\beta _{i}\) the weight of (\(\beta _{i}\rightarrow \alpha _{i+1}\)) is determined; also, given \(\beta _{i}\) the minimum reverse weight of (\(\beta _{i-1}\rightarrow L^{-1}(\beta _i)\)) is fixed.

As in [4], \(n-1\) consecutive \(\beta _i\)’s, say \((\beta _1,\ldots ,\beta _{n-1})\), is called an n-round trail core which defines a set of n-round trails \(\alpha _0\xrightarrow {L}\beta _0\xrightarrow {\chi }\alpha _1\xrightarrow {L}\beta _1\cdots \xrightarrow {L}\beta _{n-1}\xrightarrow {\chi }\alpha _n\) where the first round is of the minimal weight determined by \(\alpha _1=L^{-1}(\beta _1)\), and \(\alpha _n\) is compatible with \(\beta _{n-1}\). The first step of mounting collision attacks against n-round Keccak is to find good \((n-1)\)-round trail cores.

5.2 Requirements for Differential Trails

Good trail cores are those satisfying all the requirements which will be explained as follows. Firstly, the difference of the output needs to be zero, i.e., \(\alpha _{n_r}^d=0\) (\(\alpha _{n_r}^d\) represents the first d bits of \(\alpha _{n_r}\)).

Secondly, the consumption of degree of freedom must be within budget. With the definition of weight, Eq. (10), i.e., the estimation of \({\mathtt{DF}} \) of our 2-round connectors, can be represented in an alternative way

$$\begin{aligned} {\mathtt{DF}} =\sum _{i=0}^{b/5-1}{\mathtt{DF}} ^{(1)}_i-(c+p)-w_1. \end{aligned}$$
(11)

The first term of the formula depends on the output bits of the first round that need to be linearized and DDT entries of active S-boxes as depicted in Table 2. Empirically, the inputs of most S-boxes are confined to 2-dimensional subsets. Therefore, we heuristically set \(\frac{b}{5}\times 2\) as a threshold for the first term in (11) and denote a threshold of the first two terms in (11) for further search conditions by

$$\begin{aligned} {\mathtt{TDF}} = \frac{b}{5}\times 2-(c+p). \end{aligned}$$

Note that this is a pessimistic estimation of the first two terms. The actual value is usually higher due to the technique of non-full S-box linearization and the fact that there may exist dependencies in the system of equations used for constructing connectors.

In order to mount collision attacks against Keccak\([r,c,n_r,d]\) with methods described in Sect. 4, it is necessary that

$$\begin{aligned} {\mathtt{TDF}} >w_1+\cdots +w_{n_r-2}+w_{n_r-1}^d \end{aligned}$$
(12)

where \(w_{n_r-1}^d\) is the portion of \(w_{n_r-1}\) that relates to the digest. The trail searching phase is performed to provide \(\varDelta S_I\) for the connector. However, the conditions for a good trail core is restrained by results of the connector, i.e., the degree of freedom of the solution space. So we take (12) as a heuristic condition for searching good trail cores which are promising for collision attacks.

Thirdly, the collision attack should be practical. Note that after we obtain a subspace of message pairs that fulfill the first two or three rounds, the complexity of searching a collision is \(2^w\), where \(w =w_{n_r-3}+\cdots +w_{n_r-1}^d\) is the weight of the differential trail over the last three rounds. To make our attacks practical, we restrict w to be small enough, say 55.

We summarize the requirements for differential trails as follows and list \({\mathtt{TDF}} \)s for different variants of Keccak\([r,c,n_r,d]\) in Table 3.

  1. (1)

    \(\alpha _{n_r}^d=0\), i.e., the difference of output must be zero.

  2. (2)

    \({\mathtt{TDF}} >w_1+\cdots +w_{n_r-1}^d\), i.e., the degree of freedom must be sufficient;

  3. (3)

    \(w=w_{n_r-3}+\cdots +w_{n_r-1}^d\le 55\); the complexity of finding a collision should be practical.

Table 3 \({\mathtt{TDF}} \)s of different versions of Keccak\([r,c,n_r,d]\)

5.3 Searching Strategies

In 5-round collision attacks on Keccak, we are to combine a 2-round connector and a 3-round differential trail. Given \(\alpha _2\), i.e., the input difference of the differential trail over the last three rounds, the minimal weight of the second round \(w_1\) is determined. Note that the second round is covered by the connector, and the smaller the \(w_1\), the fewer the constraints imposed to the connector. In order to make it practical to construct 2-round connectors, we aim to find 3-round differential trails that impose a least possible number of constraints to the connector. Namely, \(w_1\) should be as small as possible. Actually, this means we need to find good 4-round differential trails cores. Similarly, in 6-round collision attacks, a 3-round connector is combined with a 3-round differential trail. As shown in Sect. 4.4, \(\beta _2,\beta _3(\alpha _3)\) are known and fixed in the 3-round connector, so 5-round trails cores are needed, whose first two rounds will be covered by the connector.

Differential Trails with two Rounds in CP-Kernel The differential trails with two rounds in CP-kernel are searched to fulfill the requirements discussed above. The 2-round trail cores, denoted by \((\beta _3)\), that ensure \(\alpha _3\) and at least one choice of \(\alpha _4\) in CP-kernel are the starting point of our algorithm for searching differential trails. A detailed analysis of the reason to do so is postponed to Appendix C.1. By extensively making use of the KeccakTools [5] developed by the Keccak Team, we generate such 2-round trails cores, based on which trail cores of 4 or 5 rounds for our collision attacks are obtained in the following way.

Algorithm for Searching Differential Trails We sketch below our steps for finding 4-round differential trail cores of Keccak, upon which 5-round collision attacks are mounted, as illustrated in Fig. 4.

  1. 1.

    Using KeccakTools, find special \(\beta _3\)’s with a low Hamming weight, say 10.

  2. 2.

    For every \(\beta _3\) obtained, traverse all possible \(\alpha _4\) using a tree structure, compute \(\beta _4=L(\alpha _4)\) and test whether there exists a compatible \(\alpha _5\) where \(\alpha _5^d=0\) (Requirement (1)). If so, keep this \(\beta _3\) and record its forward extension; otherwise, discard it.

  3. 3.

    For the remaining \(\beta _3\)’s, traverse all possible \(\beta _2\)’s which are compatible with \(L^{-1}(\beta _3)\)’s. For the trail core (\(\beta _2,\beta _3,\beta _4\)), check Requirement (2) and (3). If these two requirements are satisfied, then output (\(\beta _2,\beta _3,\beta _4\)).

We provide a description in more detail in Appendix C.2. To mount collision attacks on 6-round Keccak, 5-round differential trail cores are needed. In this case, we just extend forward the 4-round trail cores by one more round, as shown in Fig. 5.

Fig. 4
figure 4

Search for differential trails and collision attacks on 5-round Keccak

Fig. 5
figure 5

Search for differential trails and collision attacks on 6-round Keccak

5.4 GPU Implementations of Keccak

Techniques for GPU implementations of Keccak are introduced to improve our computing capacity. While one could expect a speed of order \(2^{21}\)Keccak-f evaluations per second on a single CPU core, we show in this section this number could increase to \(2^{29}\) per second on NVIDIA GeForce GTX1070 graphic card. The significant speedup will benefit us in two stages: searching for differential trails among larger spaces and brute-force search of collisions following differential trails with lower probability.

Overview of GPU and CUDA GPUs (graphics processing units) are intended to process the computer graphics and image originally. With more transistors for data processing, a GPU usually consists of thousands of smaller but efficient ALUs (arithmetic logic units), which can be used to process parallel tasks efficiently. CUDA is a general-purpose parallel computing architecture and programming model that is used in NVIDIA GPUs [23]. One of the programming interfaces of CUDA is CUDA C/C++ which is based on standard C/C++. Here, we mainly focus on CUDA C++.

To better understand the techniques for implementations, some basic concepts are introduced. From the view of hardware architecture, a GPU is comprised of several SMs (streaming multiprocessors), which determine the parallelization capability of GPU. In Maxwell architecture, each SM owns 128 SPs (streaming processors)—the basic processing units. Warp is the basic execution unit in SM, and each warp consists of 32 threads. All threads in a warp execute the same instructions at the same time. Each thread will be mapped into a SP when it is executed.

Existing Implementations and our Implementations Guillaume Sevestre [25] implemented Keccak in a tree hash mode, the nature of which allows each thread to run a copy of Keccak. Unfortunately, there are no implementation details given. In [7], Pierre-Louis Gayrel et al. implemented Keccak-f[1600] with 25 threads that calculate all 25 lanes in parallel in a warp and these threads cooperate via the shared memory. One disadvantage of this strategy is bank conflict—concurrent access to the shared memory of the same bank by threads from the same warp will be forced to be sequential. Besides, there are two open-source softwares providing GPU implementations of Keccak: ccminer (ref. http://ccminer.org) and hashcat (ref. https://hashcat.net) in CUDA and OpenCL, respectively.

Having learned from the existing works and codes, we implemented Keccak following two different strategies: one thread for one Keccak or one warp for one Keccak. From experimental results, we find that one thread for one Keccak gives a better number of Keccak-f evaluations per second. So we adopt this strategy in this paper. More detailed techniques of implementation optimization are introduced in Appendix B.1.

Benchmark With all the optimization techniques in mind, we implemented Keccak-f[1600] in CUDA, and have it tested on NVIDIA GeForce GTX1070 and NVIDIA GeForce GTX970 graphics cards. The hardware specifications of GTX1070 and GTX970 are given in Appendix B.2.

Table 4 Benchmark of our Keccak implementations in CUDA

Table 4 presents the performance, where Keccak-f[1600]v1 and Keccak-f[1600]v2 are our implementations used to search for differential trails and to find real collisions in the brute-force stage, respectively. The difference between the two versions is: Keccak-f[1600]v1 copies all digests into global memory, and Keccak-f[1600]v2 only copies the digest into global memory when the resulting digest is equal to a given digest value. For both versions we did not include the data transfer time. It can be seen that roughly GTX1070 can be \(2^8\) times faster than a CPU core. The source codes of these two versions are available freely via http://team.crypto.sg/Keccak_GPU_V1andV2.zip.

5.5 Searching Results

Some of the best differential trail cores we obtained with the help of GPU are listed in Table 5. As can be seen, Trail cores No. \(1\sim 3\) are all suitable for collision attacks against Keccak[1440,160,5,160] and the 5-round SHAKE128. Trail core No. 4 is sufficiently good for collision attacks against Keccak[640, 160, 5, 160]. To mount collision attacks on 5-round SHA3-224, 5-round SHA3-256 and Keccak[1440,160,6,160], Trail core No. 3 and 5 are used, respectively. Trail core No. 6 is slightly different from the other cores as it leads to an inner collision of the last 160 state bits for Keccak[1440,160,6,160]. Details of these differential trail cores are provided in Appendix D.

6 Experiments and Results

6.1 Summary of Collision Attacks

In this section, we employ 4-round (5-round) trail cores in Table 5 to mount collision attacks against 5-round (6-round) Keccak. Recall that our collision attack consists of a connecting phase and a brute-force searching stage. Let \(T_c\) denote the time consumed by the connecting stage and \(T_b\) by the brute-force searching stage. After the connecting stage, a message space with \({\mathtt{DF}} \) degrees of freedom is returned by the connector. If \({\mathtt{DF}} \) is greater than the weight w of the differential trail over the last rounds beyond the connector, a colliding pair will be found in the brute-force searching stage with a very high probability.

Table 5 Differential trail cores of Keccak, where \(160^{\dashv }\) stands for an inner collision on the last 160 bits of the state

Table 6 summarizes seven collision attacks we obtained (for six Keccak variants), and the corresponding timings, \({\mathtt{DF}} \)s of the returned message space and w of the differential trail. With the attacks on Keccak[1440, 160, 5, 160], Keccak[640, 160, 5, 160] and Keccak[1440, 160, 6, 160], we provide the first solutions to three instances of Keccak contest [2]. The source codes for verifying the seven collisions are available via http://team.crypto.sg/VerifyCollisions.zip.

Table 6 Experimental details of our collision attacks

Collisions of 5-Round Instances Take the first target in Table 6 as an example. We apply Trail core No. 2 to the collision attack on Keccak[1440, 160, 5, 160]. After solving the 2-round connector in 9.6 s, the resulting solution space has a degree of freedom \({\mathtt{DF}} \) of 162 which is sufficiently larger than 36, i.e., the weight of the differential trail over the remaining 3 rounds. The time for searching a collision is 2.48 core hours. We give one example of collisions in Table 13. The attacks on other 5-round targets work similarly.

Collision on Keccak [1440, 160, 6, 160]. Our collision attacks on 6-round instances exploit 3-round connectors to the first three rounds, where a 2-round connector for the first two rounds is constructed first, based on which adaptive 3-round connectors are constructed. The attack on Keccak[1440, 160, 6, 160] uses Trail core No. 5. The 2-round connector of the first two rounds succeeds in 4.5 core hours and returns a message space with 174 degrees of freedom. Every time the second round connector outputs a subspace of messages (i.e., solution of 3-round connectors) with \({\mathtt{DF}} \in [40,45]\). In order to find one colliding pair, at least \(2^{47.81}\) pairs of messages are required. This could be achieved by repeating the second round connector for \(2^{2.81}\sim 2^{7.81}\) times. By running our CUDA implementation on three NVIDIA GeForce GTX970 GPUs, the first collision is found in 112 h, which is equal to \(2^{49.07}\) message pair evaluations.Footnote 1 An example of collision is given in Table 18.

Inner Collision on the Last 160-bit of Keccak [1440, 160, 6, 160] When an inner collision on the last 160 bits of the state is obtained (i.e., a collision on capacity bits), we could then construct state collisions of messages of any block length more than 2 by choosing the next message block properly. Specifically, we choose certain values for the next message block such that the r-bit difference of the state is canceled after absorption. Thus, in the next permutation the whole states become identical. This property maintains if all subsequent message blocks are identical, and in such situations we can obtain a collision on the final digest with certainty. With this in mind, a 6-round collision attack on the last 160 bits of Keccak[1440, 160, 6, 160] is mounted. Similar to the normal collision attack, a 3-round connector is constructed based on Trail core No. 6. The size of the solution space we obtained is around \(2^{58}\sim 2^{64}\) which exceeds the data complexity required. To compare with the efficiency of GPU implementations, this collision is obtained with 172 CPU cores consuming approximately 60,200 core hours in total. The first collision occurs after enumerating approximately \(2^{47.4}\) message pairs. An example of a colliding pair is given in Table 19.

6.2 Technical Analysis of the Results

Choice of\(\beta _1\) In our 2-round connector, \(\beta _1\) is randomly chosen such that \(\beta _1\rightarrow \alpha _2\) is of the best probability, except Keccak[1440,160,5,160] for which the connector randomly chooses any compatible \(\beta _1\). The \({\mathtt{DF}} \) of the message space returned by the connector for Keccak[1440,160,5,160] is 162. In comparison, the first 2-round connector for Keccak[1440,160,6,160] returns a message space with \({\mathtt{DF}} =174\), even though its weight of the second round \(w_1\) which is covered by the connector is much larger than that of Keccak[1440,160,5,160], as can be seen in Table 5. This confirms the effect of the choice of \(\beta _1\) on the resulting \({\mathtt{DF}} \).

S-box Linearization Full S-box linearizations are enforced in 2-round connectors for Keccak[1440, 160, 5, 160], Keccak[640, 160, 5, 160] and SHAKE128, as well as the first 2-round connector for Keccak[1440, 160, 6, 160]. Namely, the first round is fully linearized no matter whether it is necessary or not. For SHA3-224, SHA3-256, S-boxes of the first round are fully linearized only when necessary; otherwise, non-full S-box linearizations are used. It is the same case for S-boxes in the second round while constructing the second round connector for Keccak[1440, 160, 6, 160].

It is worth noting that our collision attacks on 5-round SHA3-256 and Keccak\([1440,160,6,160]\) are not possible without the techniques of non-full S-box linearization. Take the 2-round connector for SHA3-256 as an example. While partially linearizing the first round (besides the S-boxes with DDT entry 2 or 4), there are 26 S-boxes in the first round exempted from consuming extra degrees of freedom due to techniques of non-full S-box linearization. Finally, our 2-round connector returns a space of messages with \({\mathtt{DF}} \) almost as large as the weight of the differential trail. Moreover, in this space of messages, we find only one collision. This in turn proves the usefulness and effectiveness of the non-full S-box linearization techniques.

7 Conclusion

We observed that connectors of Keccak can be extended to multiple rounds using S-box linearizations. In this paper, two types of S-box linearization are proposed, i.e., full S-box linearization and non-full S-box linearization. By linearizing S-boxes in the first round, we extend Dinur et al.’s 1-round connector to two rounds. Further, by applying linearization to the first two rounds, 3-round connectors became possible. We combined our connectors with suitable differential trails searched with dedicated strategies and then obtained seven practical collisions on six round-reduced SHA-3 and Keccak variants, including 5-round SHAKE128, 5-round SHA3-224, 5-round SHA3-256, a 6-round instance of Keccak contest and a variant of it. So far, these are the best collision attacks on round-reduced SHA-3 and Keccak contest instances.