Keywords

1 Introduction

A digital signature scheme is an important cryptographic tool from the work of public-key cryptography. It is an essential technology for providing authenticity, integrity and non-repudiation of data and is widely used. Therefore, the design and implementation of secure and practical digital signature schemes is crucial for applications that require data integrity assurance and data origin authentication.

Nowadays, the most commonly used digital signature schemes are ECDSA [11], RSA [3] and DSA [8]. These schemes have their security based on the difficulty of factoring large integers and computing discrete logarithms. In 1994, Shor [19] introduced a quantum algorithm for factoring integers and computing discrete logarithms in the relevant groups in polynomial time, using simple operations on a quantum computer. Thus, resistant digital signature schemes against quantum computing are an active research field.

The one-time signature scheme [6] is based on a hash function and the security relies on the collision resistance of that function [14]. These one-time schemes are inadequate for the most practical situations since each key pair is only used for a single signature. Merkle [2] proposed a solution to this problem, which transforms the validity of an arbitrary and fixed number of one-time verification keys for one single public key, which is the root of the hash tree. The tree leaves are the one-time key pairs. The Merkle signature scheme (MSS) uses a hash function, and its security is based on the collision resistance of the hash function used.

The extended Merkle signature scheme (XMSS), a hash-based digital signature system, was proposed in [24, 25]. This scheme is based on the Merkle signature scheme and uses \(\mathrm{WOTS}^+\), a modified version of the Winternitz one-time signature scheme [26]. The security of XMSS does not rely on the collision resistance of the hash function used but on weaker properties. The variants GMSS [16] and \(\mathrm{XMSS}^\mathrm{MT}\) [18] increase the number of signatures and reduce the memory consumption. They also decrease the runtime of key generation and signing phases, but slightly increase the key sizes.

Our Contributions. We describe an efficient software implementation of MSS and its variants based on the vector instruction set AVX2 on an Intel’s Haswell processor. We show how to speed up key generation, signing and verification by taking advantage of the multi-buffer SHA2 algorithm implementation. In our implementation, we compute four hashes in parallel with a implementation of SHA2-384 using 256-bit vector registers, and eight hashes in parallel with a 256-bit version of SHA2-256. To illustrate the performance of our implementation, we select a set of parameters for 128-bit security against quantum computers, similar to the parameters used in LDWM [13] and XMSS [25] signature schemes. We report the performance results of LDWM [13] and XMSS [25] on a modern Intel Core i7 3.4 GHz. We also show several trade-offs among runtime and key sizes.

This paper is organized as follows. Section 2 presents some related works. Section 3 describes the Winternitz one-time signature scheme (WOTS) and the variant \(\mathrm{WOTS}^+\).The description of the signature scheme MSS and its variants are presented in Sect. 4. Some considerations about the security of MSS and XMSS schemes are given in Sect. 5. Section 6 presents our software implementation and performance results. Section 7 presents the conclusions.

2 Related Works

One-time signature schemes (OTS) first appeared in the work of Lamport [6]. Winternitz [23] proposed an improved Lamport one-time signature scheme, reducing the size of public and private keys. Merkle [23] introduced a many-time signature scheme that constructs a binary tree where the leaves are hashes of OTS public keys, and the root is the public key. Moreover, Merkle was the first to present an algorithm for computing the authentication path (tree traversal), which is used to validate the public key. The Fractal Merkle Tree Representation and Traversal [12] algorithms, presented in 2003, generate a sequence of leaves along with their associated authentication. In 2004, Szydlo [15] proposed a technique for Merkle Tree Traversal, which requires only logarithmic space and time. Buchmann et al. [17] showed an efficient algorithm that computes the authentication path based on the Szydlo approach.

Some MSS variants were proposed: CMSS [7], GMSS [16], XMSS [24] and \(\mathrm{XMSS}^{MT}\) [18]. CMSS [7] builds two chained trees reducing the generation runtime of key pair and signature, and allowing the signature of \(2^{40}\) documents. CMSS reduced the size of the private key using a pseudorandom generator (PRNG). In 2007, GMSS [16] allowed a significant large number of documents to be signed with one key pair and reduced the signature sizes and the signature generation costs. In 2011, XMSS was introduced, a provably forward-secure and practical signature scheme with minimal security requirements. This scheme uses a modified \(\mathrm{WOTS}^+\) version proposed in [26], which is existentially unforgeable under adaptative chosen message attacks when instantiated with a family of pseudorandom functions. Later, \(\mathrm{XMSS}^{MT}\) [18] was introduced which allows to sign a large fixed number of documents.

In 2014, McGrew and Curcio [13] proposed an Internet-Draft that describes a digital signature system based on cryptographic hash functions. This system, named LDWM, specifies a one-time signature scheme based on the work of Lamport [6], Diffie [1], Winternitz [23] and Merkle [23], and a general signature scheme, called Merkle tree signature (MTS). Moreover, it relates a modern security analysis of those algorithms. In 2015, Hülsing et al. [25] proposed an Internet-Draft that specifies XMSS: Extended hash-based signatures with the one-time signature scheme \(\mathrm{WOTS}^+\), a single-tree (XMSS) and a multi-tree \((\mathrm{XMSS}^{MT})\).

SPHINCS [22] is a practical stateless hash-based signature scheme and was designed to provide \(2^{128}\) security even against attackers equipped with quantum computers. This scheme introduces a new method to randomize tree-based stateless signatures and uses HORS with trees (HORST) [5] for signing messages.

3 Winternitz One-Time Signature Scheme (WOTS)

The Winternitz one-time signature scheme (WOTS) [23] is a modification of the Lamport one-time signature scheme (LOTS) [6]. WOTS uses a parameter w which is the number of bits to be signed simultaneously. This scheme produces smaller signatures than Lamport, but increases the number of one-way function evaluations from 1 to \(2^w-1\), in each element of the signing key. The larger w, the smaller the signature and longer the time for signing and verification. WOTS uses a one-way function \(f:\{0,1\}^{n} \rightarrow \{0, 1\}^{n}\) and a cryptographic hash function \(g:\{0, 1\}^* \rightarrow \{0, 1\}^m\), where n and m are positive integers. The chaining function \(f^e\) computes e iterations of f on input \(x \in \{0,1\}^n\) where \(e \in \mathbb {N}\). The chaining function is defined as:

$$f^e(x) = \left\{ \begin{array}{ll} x &{} \quad \text {if } e=0; \\ f(f^{e-1}(x))&{} \quad \text {if } e>0. \end{array} \right. $$

WOTS Key Pair Generation. A parameter \(w \in \) \(\mathbb {N}\) is given and \(t=t_1 + t_2\) is computed, where \(t_1= \lceil {m}/{w}\rceil \), \(t_2=\lceil (\lfloor log_2 t_1 \rfloor +1+w)/{w}\rceil .\) The private key \(sk_i\), for \(i=0,...,t-1\), is chosen uniformly at random. Each \(sk_i\) is a string of n bits. The verification key pk is computed as:

$$ pk=(pk_0,...,pk_{t-1}) \in \{0,1\}^{(n,t)} \qquad where \qquad pk_i=f^{2^{w}-1} (sk_i). $$

WOTS Signature Generation. To generate the signature, first compute the message digest \(d=g(M)\). If necessary, add zeros to the left of d, such that d be divisible by w. Then d is split into \(t_1\) binary blocks of size w, resulting in \(d=(m_{0} || ...|| m_{t_1-1})\), where || denotes concatenation. The \(m_i\) blocks are represented as integers in \(\{0,1,...,2^w-1\}\).

The checksum is computed as \(c=\sum _{i=0}^{t_1-1} (2^w-m_i).\) If necessary, add zeros to the left of the c until the extended string c is divisible by w. Then the extended string c can be divided into \(t_2\) blocks \(c=(c_0||\ldots ||c_{t_2-1})\) of length w. Let \(b=d||c\) be the concatenation of the extended string d with the extended string c. Thus \(b=(b_{0}||b_{1}||\ldots ||b_{t-1})=(m_{0} || \ldots || m_{t_1-1}||c_{0}|| \ldots || c_{t_2-1})\). The signature is:

$$ Sig=(sig_0,...,sig_{t-1})=(f^{b_0}(sk_0), \ldots ,f^{b_{t-1}}(sk_{t-1})). $$

WOTS Verification. To verify the signature \(Sig=(sig_{0},...,sig_{t-1})\) of the message M, we calculate \((b_{0},...,b_{t-1})\) in the same way as it was calculated during signature generation. Then, we compute: \(Sig'= (sig'_0,\ldots ,sig'_{t-1})=(f^{2^w-1-b_0}(sig_0),\ldots ,f^{2^w-1-b_{t-1}}(sig_{t-1})).\) If \(sig'_i = pk_i\) for \(i = 0,1,...,t-1\), then the signature is valid.

3.1 Modification on Winternitz One-Time Signature Scheme (\(\mathrm{WOTS}^+\))

Hülsing [26] proposed \(\mathrm{WOTS}^+\), a modification of WOTS that uses a chaining function \(f^e\) starting from random inputs. This modification allows for shorter signatures without lowering the security. \(\mathrm{WOTS}^+\) uses a family of hash functions \(F_n = \{f_K : \{0, 1\}^n \rightarrow \{0, 1\}^n |K \in \{0, 1\}^n \}.\)

Given \(f_K \in F_n\), the chaining function \(f^e\) computes e iterations of \(f_K\) on inputs \(x \in \{0,1\}^n\) and bitmask \(bm=(bm_1,...,bm_{w-1})\) with \(e \in \mathbb {N}\). The chaining function is defined as:

$$f^e(x,bm) = \left\{ \begin{array}{ll} x &{} \quad \text {if } e=0; \\ f_K(f^{e-1}(x,bm) \oplus bm_{e} )&{} \quad \text {if } e>0. \end{array} \right. $$

\(\mathrm{WOTS}^+\) is parameterized by a security parameter n, the binary message length m, and the Winternitz parameter \(w \in \mathbb {N}\), for \(w > 1\). The parameters m and w are used to compute: \(l_1 = \lceil {m}/({log_2(w)}) \rceil ,\) and \(l_2= \lfloor ({log_2(l_1(w-1))})/({log_2(w)}) \rfloor +1\), \(l= l_1 + l_2.\) These parameters are public known.

\(\mathbf{WOTS}^+\) Key Pair Generation. On input seed Seed and the bitmask \(bm \in \{0,1\}^{(n,w-1)}\), the key generation algorithm computes private keys \(sk=(sk_0,\ldots ,sk_{l-1})\). The public verification key is:

$$pk=(pk_0,\dots ,pk_{l-1})=(f^{w-1}(sk_0,bm),\ldots ,f^{w-1}(sk_{l-1},bm)) .$$

\(\mathbf{WOTS}^+\) Signature Generation. Given an m-bit message digest \(d=g(M)\) and the bitmask bm, the signature algorithm firstly computes a base-w representation of \(g(M)=d=(m_0,\ldots , m_{l_1-1}),\) \(m_i \in \{0,. . . , w-1\}\). After, the checksum value \(C = \sum _{i=0}^{l_1-1} (w - 1 - m_i)\) in base w representation and length \(l_2\) is appended to d, resulting in \(b=(b_0, ..., b_{l-1})\). The size of the signing key and private keys are l strings of n bits. The signature is:

$$Sig = (sig_0, ..., sig_{l-1}) = (f^{b_0}(sk_0,bm),\ldots , f^{b_{l-1}}(sk_{l-1},bm)). $$

\(\mathbf{WOTS}^+\) Verification. To check the signature, the verifier constructs the values \(b=(b_0, ..., b_{l-1})\) as the signature. If \((f^{w-1-b_0}(sig_0,bm),\ldots ,f^{w-1-b_{l-1}}(sig_{l-1},bm))=(pk_0,\dots ,pk_{l-1})\), then the signature is valid.

WOTS$ [24] is a modified version of WOTS that allows to eliminate the need for a collision resistant hash function family. For this modified version, the iterated evaluations of the function \(f_K^e(x)=f_K(\ldots (f_K(x))\) are a random walk through the function family \(F_n\), as described below:

$$f_K^e(x) = \left\{ \begin{array}{ll} K &{} \text {if } e=0; \\ f_{K^{\prime }}(x) &{} \text {if } e>0 \text { and } K^{\prime }=f_K^{e-1}(x), \end{array} \right. $$

where \(K, x \in \{0, 1\}^n\), \( e \in \) \(\mathbb {N}\), and \(f_{K} \in F_n\).

4 Merkle Signature Scheme (MSS)

MSS [23] is a digital signature scheme that consists of three algorithms: key generation, signing and verification. This scheme constructs a binary tree where the one-time signing and verification keys are the tree leaves, and the public key is the root. A tree with height h and \(2^h\) leaves will have \(2^h\) one-time key pairs.

MSS Key Pair Generation. In the Merkle key generation algorithm, the signer must select the tree height h \(\in \) N, \(h \ge 2\). Figure 1 shows this process. The left picture represents the WOTS key generation algorithm described in Sect. 3. This algorithm is executed to generate the WOTS key pairs. Merkle uses the cryptographic hash function \(g:\{0, 1\}^* \rightarrow \{0, 1\}^n\) where n is a positive integer. This key pair will be able to sign/verify documents, and the digest \(g(pk_0||...||pk_{t-1})\) will be a leaf of the Merkle tree.

Merkle Public Key Generation. Appendix A describes the treehash algorithm for generating the public key pub. The tree is built using a stack. Each time a leaf s is calculated, it is stacked. The treehash algorithm checks if the nodes at the top of the stack have the same heights. If they are of the same height, the two nodes will be unstacked and the hash value of the concatenation of both nodes will be pushed into the stack. The algorithm terminates when the tree root is found.

On the right of Fig. 1, we observe the order in which the nodes are stacked on the tree for generating the public key pub. The first node receives \(Y_0\). The third node is the hash of the concatenation of the first and second nodes. The gray nodes represent \(AUT=\{Aut_0,\dots ,Aut_h\}\), the first authentication path. This AUT is formed by the sibling right nodes, connecting the leaf up to the tree root, which are used to validate the public key. The authentication path AUT is saved during the execution of the treehash algorithm.

Fig. 1.
figure 1

Merkle key generation.

MSS Signature Generation. The signature generation consists of two steps: first, the signature of the message digest g(M) is generated using the WOTS signature algorithm and the corresponding secret key \(sk_s\) of the leaf s. Then the signature \(SIG=(s,sig_s,AUT)\) contains the index of the leaf s, the WOTS signature \(sig_s\) and the authentication path AUT. In the second step, the next authentication path AUT is generated. For generating the next authentication path, we use the BDS algorithm [17] which is a modification of the classic authentication path algorithm proposed by Merkle [2].

MSS Verification. The signature verification consists of two steps: first, the signature \(sig_s\) is used to recover a leaf of the tree \(Y_s\); second, the public key of the Merkle tree is validated as the following. The receiver can reconstruct the path \((p_0, ..., p_h)\) from leaf s to root. The index s is used to decide the order in which the authentication path is reconstructed. Initially, \(p_0 = Y_s\). For each \(i=1,2,\ldots , h\) is computed \(p_i\) using the condition (if \( \lfloor s/(2^{i-1})\rfloor \equiv 1 \ \mathrm {mod}\ 2 \)) and the recursive formula:

$$ p_i = \left\{ \begin{array}{ll} g( Aut_{i-1} \Vert p_{i-1} )&{} \text {if }\lfloor s/(2^{i-1})\rfloor \equiv 1 \ \mathrm {mod}\ 2 ; \\ g(p_{i-1} \Vert Aut_{i-1}) &{} \mathrm {otherwise.} \end{array} \right. $$

Finally, if the value \(p_h\) is equal to the public key pub, the signature is valid.

4.1 Merkle Signatures with Virtually Unlimited Signature Capacity (GMSS)

The GMSS algorithm was published in 2007 [16]. The authors proposed a change in the Merkle signature scheme, allowing the signature of \(2^{80}\) messages with one key pair. This modification is based on CMSS [7] that reduces the key pair and signature generation runtime and enable the signature of \(2^{40}\) documents with the construction of two chained trees. Moreover, it reduces the private keys using a pseudorandom generator (PRNG). The basic construction of GMSS consists of a tree with d layers of subtrees, for \(i=0,\ldots ,d-1\), where the lower layer is \(i=0\). Subtrees in different layers may have different heights and different Winternitz parameters w that enable producing smaller signatures. To reduce the runtime of one signature, GMSS distributes the computational cost of the signature generation across several previous signatures.

Appendix B illustrates the GMSS tree. The public key keeps the subtree root in layer \(i=d-1\). The WOTS key pairs of the tree in layer i sign the subtree root in layer \(i-1\), for \(i>0\). The message digest g(M) is signed using the tree leaves in layer \(i=0\). The signature keeps \(SIG=(s_i, sig_i, Auth_i)\) where \(s_i\) is the leaf of the subtree, \(sig_i\) is the WOTS signature and \(Auth_i\) is the authentication path.

4.2 Extended Merkle Signature Scheme (XMSS)

XMSS [24] is a modification of MSS. This scheme uses a slightly modified version of Winternitz \(\mathrm WOTS^+\) described in Sect. 3.1. To capture the formality of the standard definition of digital signature schemes, the authors of XMSS used the same definition of key evolving signature schemes [4]. XMSS is provably forward-secure and efficient when instantiated with two secure and efficient function families: one second-preimage resistant hash function family \(G_n\) and the other a pseudorandom function family \(F_n\), where \(G_n = \{g_K : \{0, 1\}^{2n} \rightarrow \{0, 1\}^n |K \in \{0, 1\}^n \}\) and \(F_n = \{f_K : \{0, 1\}^n \rightarrow \{0, 1\}^n |K \in \{0, 1\}^n \}\).

The parameters of XMSS are: \(n \in \mathbb {N}\), the security parameter; \(w \in \mathbb {N} (w > 1)\), the Winternitz parameter; \(m \in \mathbb {N}\), the message digest length; and \(h \in \mathbb {N}\), the height tree.

An XMSS binary tree is constructed to generate the public key pub. Appendix C shows the XMSS treehash algorithm and Appendix D shows the XMSS tree construction. The XMSS tree is a modification of the Merkle tree. A tree of height h has \(h+1\) levels. The nodes on level j are \(node_{i,j}\), for \(0 < j \le h\) and \(0 \le i < 2^{h-j}\). XMSS uses the hash function \(g_K\) and bitmask (bitmaskTree) \(bm \in \{0,1\}^{2n}\), chosen uniformly at random, where \(bm_{2i+2j}\) is the left bitmask and \(bm_{2i+2j+1}\) is the right bitmask. The bitmasks are the main difference among the others Merkle tree constructions, since they allow to replace the collision resistant hash function family by a second-preimage resistant hash function family. The nodes are computed as: \(node_{i,j}=g_K((node_{2i,j-1} \oplus bm_{2i+2j})||(node_{2i+1,j-1} \oplus bm_{2i+2j+1})).\)

To generate a leaf \(Y_s\) in the XMSS tree, an L-tree is used. The L-tree uses bitmasks (bitmaskLtree) in the same form as in the XMSS tree. The \(\mathrm WOTS^+\) public verification keys \((pk_0,\dots ,pk_{l-1})\) are the first l leaves of an L-tree. If l is not a power of 2, then there are not sufficiently leaves to build a binary tree. Therefore, a node that has a not right sibling is lifted to a higher level of the L-tree until it becomes the right sibling of another node.

4.3 Multi Tree XMSS (\({XMSS}^{MT}\))

The \(\mathrm{XMSS}^\mathrm{MT}\) algorithm proposed in [18] allows a virtually unlimited number of signatures while it preserves its desirable properties. \(\mathrm{XMSS}^\mathrm{MT}\) is existentially unforgeable under a chosen message attack (EU-CMA) secure in the standard model when instantiated with a pseudorandom function family and a second preimage resistant hash function family. In addition, \(\mathrm{XMSS}^\mathrm{MT}\) is forward-secure.

Many layers of trees are used in \(\mathrm{XMSS}^\mathrm{MT}\) as in GMSS  [16]. The number of layers is \(d \in N\) where \(0 \le i < d-1\). The leaves of the tree in layer \(i=0\) are used to sign the messages. The leaves of the trees in layers i, \(i>0\), are used to sign the roots of the trees in layer \(i-1\). The public key is the tree root in layer \(i=d-1\). \(\mathrm{XMSS}^\mathrm{MT}\) uses the \(\mathrm{WOTS}^+\) algorithm as in XMSS.

5 Security Considerations

In this section, we describe a summary of several formal security proofs of MSS and its variants. McGrew and Curcio [13] specifies that LDWM and MSS rely on the security of the hash function used. The security of MSS, according to [14], relies on the fact that the hash function is collision resistant. Furthermore, MSS is forward-secure if the PRNG used is a cryptographically secure pseudorandom generator.

Buchmann et al. [14] show that the existential unforgeability of MSS under an adaptive chosen message attack can be reduced to the collision resistance of the family G and the existential unforgeability of the underlying one-time signature scheme, where \(G=\{g_K:\{0,1\}^* \rightarrow \{0,1\}^n|k \in K\}\) is a family of hash functions. The security level of the signature schemes is the base 2 logarithm of the average attack complexity. The authors proved an attack of complexity \(2^n\) for classical computers and \(2^{n/2}\) for quantum computers on preimage and second-preimage resistances. For collision resistance is different, the bounds are \(2^{n/2}\) for classical computers and \(2^{n/3}\) for quantum computers [14].

XMSS with WOTS$ [24] is forward-secure and existentially unforgeable under chosen message attacks if the function \(g_K\) is a second preimage resistant hash function and \(f_K\) is a pseudorandom function. If a randomized message hashing is used, then the requirement is that the hash function must be second preimage resistant. If a plain hash is used, then the hash function must be collision resistant. XMSS with \(\mathrm{WOTS}^+\) [20] needs a function \(f_K\) to be second preimage resistant. In addition, a PRNG is required to generate the keys.

In summary, the security requirements for XMSS are minimal, because the existence of the two function families described above is known to be necessary for the existence of digital signature schemes [24]. As demonstrated in [20], if an n-bit hash is used, then n-bit security against classical computers and n / 2-bit security against quantum computers is obtained.

6 Implementation

In this section, we describe the details of our efficient implementations of the algorithms: MSS, GMSS, XMSS and \(\mathrm{XMSS}^\mathrm{MT}\). For key generation, we used the standard CTR_DRBG [21] as cryptographically secure pseudorandom number generator (PRNG) to generate the secret keys sk. According to [7], PRNG enables the recovery of all signature keys based only on the initial seed and avoids the storage of the full secret key. To implement the one-way function \(f_K\) and the cryptographic hash function \(g_K\), we used the SHA2 hash function. To compute the authentication path in the signature generation, we used the BDS algorithm [17]. We will provide the software described in this paper into the public domain.

6.1 Multi-buffer Hash Optimization

A leaf of the XMSS tree involves the following computations: first, the l secret key sk\(_i\) are computed using a PRNG. This secret key \(sk_i\) is generated in sequence. Then, for each private key, we can compute the corresponding public key \(pk_i\) by applying \(l(w-1)\) times the function \(f_K\). Unlike of secret key generation, the WOTS public keys can be generated in parallel. Finally, for each internal node of the L-tree, the function \(g_K\) is applied to a binary string computed by the children nodes, and such computation is independent among siblings nodes.

Therefore, the computation of the internal nodes of the XMSS tree is much faster than the leaves. Recovering a leaf from the Merkle tree impacts the performance of the whole algorithm. This is a bottleneck of MSS and XMSS schemes. The computation of hash functions occur over independent messages, and in this scenario the application of vector instructions have an important role in the acceleration of the scheme.

The vector instructions present on the Haswell processor are suitable for the computation of independent hash functions in parallel, where a SIMD processing takes place. The AVX2 instructions process bit operations over 256-bit vector registers, thus each vector register can be seen as either four 64-bit words or eight 32-bit words. In this setting, the operations used to compute the SHA2 function can be applied on vectors with the purpose of computing simultaneous hash values at the same time, which we will further refer as a multi-buffer implementation.

The SHA2-256 algorithm uses operations over words of 32 bits while the SHA2-384 algorithm operates over words of 64 bits. Then, using a multi-buffer approach we can compute either four or eight hash values for the SHA2-384 and SHA2-256, respectively. This parallel hashing technique can be applied to accelerate the operations on WOTS key generation and the generation of the nodes of the L-tree. Thus, to generate the l private key \(pk_i\), we can reduce the number of executions of the function \(f_K\) to \(\lceil (l(w - 1)/8)\rceil \) times with SHA2-256 using the vector instruction set of 256 bits. The number of SHA2-256 evaluations required to computing the L-tree root can also be reduced.

6.2 Optimization of the Function \(f^e\)

In this subsection, we highlight the development of an optimized implementation of the function \(f^e\) using SHA2 [9]. To reduce the runtime of the function \(f^e\), the core operation of the schemes, we implemented the function \(f^e\) by using the multi-buffer technique [27], which allows parallelizing message schedules.

As presented in MSS and XMSS, the generation of the WOTS verification keys involves many applications of the function \(f^e\) in each element of the private key. The chaining function used in WOTS is \(f^e(sk)\) and in \(\mathrm WOTS^+\) is \(f^e(sk,bm)\). Figures 2 and 3 show our optimized implementation of the function \(f^e\) using the SHA2 hash function with a fixed pad. The pad values are fixed because the block sizes of sk are the same for SHA2. Since these values have a fixed pad, we reduced the number of operations of the function \(f^e\) for getting a better performance. In WOTS, \(f^e\) is executed \(t(2^w-1)\) times and in \(\mathrm WOTS^+\) is executed \(l(w-1)\) times for the generation of the one-time verification keys.

Fig. 2.
figure 2

Function \(f^e(sk)\) in WOTS.

Fig. 3.
figure 3

Function \(f^e(sk, bm)\) in \(\mathrm WOTS^+\).

The multi-buffer C code SHA2 from [10] was modified to use 256-bit AVX2 instructions. We adapted this new code to reduce the runtime of the chaining function \(f^e\). Figure 4 shows our implementation of the function \(f^e\) for MSS with SHA2-384 which computes four instances in parallel, generating four public keys pk at the same time. We load four secret keys \(sk_i\) into the 256-bit vector registers, perform e iterations of the \(f^e\) function and then store four private keys \(pk_i\) on memory. We can compute the private keys \(pk_i\) in parallel because its generation is independent.

The parallel technique is also used in the signature generation to update the authentication path because some tree leaves need to be recovered. The use of SIMD instructions helped to reduce the runtime of the function \(f^e\), which are the computationally most expensive parts for key and signature generation.

Fig. 4.
figure 4

Implementation of the chaining function \(f^e(sk)\) in MSS with SHA2-384.

6.3 Optimization of the L-Tree Implementation

The L-tree from XMSS [24] is used to generate the leaves of XMSS. In this section, we show an optimization in the generation of the L-tree for improving the performance of generating each leaf of the XMSS tree. This optimization is similar to the one given in [22]. The function \(g_K\) is applied to each concatenation of children nodes to generate the parent node. Then, we modified the L-tree algorithm to perform eight evaluations of the function \(g_K\) at the same time. We generate eight internal nodes at the same time, from 16 children nodes, which are concatenated 2 by 2. If the amount of remaining nodes is not multiple of 16, we generate the next internal nodes one by one like the traditional way. If l is not a power of 2, then there are not sufficient leaves to build a binary tree. Therefore, a node that has not a right sibling is lifted to a higher level of the L-tree until it becomes the right sibling of another node. Figure 5 show the optimization performed in the generation of the L-tree for w=16 and l=67.

Fig. 5.
figure 5

Construction of the L-tree.

6.4 Choosing Parameters

In the following, we present our choice of parameters for MSS, GMSS, XMSS and \(\mathrm{XMSS}^\mathrm{MT}\). We selected a set of parameters to get a good trade-off among level security, runtime and signature size. These parameters were aligned to LDWM [13] and XMSS [25].

The chosen parameters are: \(w \in \mathbb {N} (w > 1)\), the Winternitz parameter; \(H \in \mathbb {N}\), the total height of the tree; d, the number of layers; m, the message length in bits, and n the output size of the functions \(f_K\) and \(g_K\). To implement \(f_K\) and \(g_K\), we use the hash function SHA2-384 in MSS and SHA2-256 in XMSS for 128-bit security against quantum computers.

In MSS with WOTS, each WOTS signature size is tn bits and in XMSS with \(\mathrm WOTS^+\) is ln bits. These parameters are determined according to the chosen parameter w. The value of w influences the runtime and the signature size. Higher w values imply on greater runtime, but smaller signature sizes. The values H and d directly affect the signature size. We set the maximum height \(H = 60\) and \(d=6\) to limit the signature size and \(w=64\) to have an acceptable runtime.

The MSS secret key \(SK=(s_0, Seed_0, \ldots , s_{d-1}, Seed_{d-1})\) contains one index s per layer and one Seed per layer. The public key is PK=(pub), which is the root of the layer \(d-1\) and the signature \(SIG=(s_0, sig_0, Auth_0, \ldots , s_{d-1},sig_{d-1},Auth_{d-1})\) contains the index s, one WOTS signature and one authentication path per layer.

In XMSS, the secret key \(SK=(Bitmask, s_0, Seed_0, \ldots , s_{d-1}, Seed_{d-1})\) contains the same elements as MSS, also \(Bitmask=(bitmaskTree, bitmaskLtree, bm)\). The public key is \(PK=(pub,Bitmasks)\) and the signature is \(SIG=(s, sig_0, Auth_0, \ldots , sig_{d-1},Auth_{d-1})\) as in MSS.

6.5 Performance Results

This section shows the experimental results for MSS and its variants of our implementation using AVX2 instructions. The results of our implementation were obtained by running benchmarks on a Haswell processor Core i7-4770 at 3.4 GHz, where the Intel Turbo Boost and the Intel Hyper-Threading technologies were disabled. Our implementation was written in C language and compiled using the GNU C Compiler v4.9.0.

First, we make a comparison between our results of the XMSS implementation and XMSS from [20] with WOTS$. Second, we show the performance results of XMSS with \(\mathrm{WOTS}^+\) using a single-buffer (with a implementation of SHA2 using 64-bit vector registers) and a multi-buffer (with a implementation of SHA2 using 256-bit vector registers). Third, we compare MSS and XMSS according to the parameters selected for 128-bit security against quantum computers. In our work, the runtimes for signing and verifying are calculated using the arithmetic average of the first one million signatures.

Table 1. Comparasion of key generation, signing and verification between XMSS from [20] and our XMSS implementation where both use WOTS$. We use a single-buffer (64-bit) implementation. The running times of XMSS [20] were obtained on the Intel(R) Core(TM) i5-2520M CPU 2.5 GHz.

We scale the running times of XMSS [20] from 2.5 to 3.4 GHz in order to estimate the running times for \(H=20\) and \(w=4\), so we have 638,711 ms for key generation, 5.60 ms for signing and 0.32 ms for verification. We compared these estimates with our results in Table 1. We have improved the performance achieving an speedup factor of 1.66x, 1.71x and 1.61x for key generation, signing and verification, respectively.

Table 2. Performance figures of XMSS with \(\mathrm{WOTS}^+\) for parameters \(H=20\), \(K=2\). The chaining function is implemented using SHA2-256. We compare our results using single-buffer and using a multi-buffer for different values of w.

On 256-bit messages, the single-buffer implementation of SHA2 takes 121.25 cycles per byte for eight hashes, while the multi-buffer implementation takes 29.31 cycles per byte for eight hashes, so we obtain an acceleration of 4.13x for hashing. However, this speed factor is reduced for the signature operations of XMSS. In Table 2, we show that the acceleration of the performance due to the multi-buffer optimization, ranges from 2.4x for smaller w to 3.7x for larger w for key generation and signing operations. For verification, the muti-buffer implementation slightly improves the single-buffer implementation, since the optimized implementation of the function \(f^e\) is not used in that operation.

Comparing the results of the single-buffer implementations of XMSS using WOTS$ shown in Table 1 with XMSS using \(\mathrm{WOTS}^+\) shown in Table 2, we observe that XMSS using \(\mathrm{WOTS}^+\) is around 44–49 % faster than XMSS using WOTS$.

Table 3. Results of our implementation of LDWM [13] with WOTS and XMSS [25] with \(\mathrm{WOTS}^+\) for 128-bit security against quantum computer. We use a multi-buffer (256-bit) implementation. \(\mathrm{XMSS}^\mathrm{MT}\) and GMSS with \(d=1\) correspond to XMSS and MSS, respectively.

Since the requirements for the security of XMSS are minimal [24], one could use a smaller hash function output. Then, for the same security level, the runtime of key generation, signature generation and verification of XMSS is faster and uses smaller signature size than MSS. However, the private and public keys of XMSS are larger than MSS, because XMSS stores the bitmasks, as it can be seen on Table 3.

Notice that for larger values of w, the size of the public key and signature are smaller, without a significant impact the runtime. On the other hand, by increasing the number of layers the runtime is reduced for key generation and signing; however, the runtime for verification increases because the signatures of all layers must be checked. Additionally, the size of the secret key and signature is larger, because they store information of each layer. Increasing the tree height allows to produce more signatures without impacting the performance of signing and verifying, but increasing the signature size.

7 Conclusion

In this paper, we addressed the efficient software implementation of hash-based signature schemes on an Intels Haswell processor. Our implementation benefits from the vector instruction set AVX2 and a multi-buffer implementation of the SHA2 hash function. In order to implement the GMSS and \(\mathrm{XMSS}^\mathrm{MT}\) schemes, we selected a set of parameters to maintain a balance among security level, key sizes, signature size, and running times. We achieved the same security level among the schemes implemented by using a different output hash size. For example, GMSS with SHA2-384 and \(\mathrm{XMSS}^\mathrm{MT}\) with SHA2-256 provide 128-bit security against quantum computers. Our performance results show that \(\mathrm{XMSS}^\mathrm{MT}\) [25] was about 40-50 % faster than GMSS [13] for key generation, signing and verification. In addition, the signature size of \(\mathrm{XMSS}^\mathrm{MT}\) is smaller than GMSS, but the public and private key of GMSS are shorter than \(\mathrm{XMSS}^\mathrm{MT}\).

Finally, with the advent of a hardware implementation of SHA2 on modern processors (Intel Skylake processor [28]), the performance of key generation, signing and verification of hash-based schemes will be even improved.