
1 Introduction

Hashing functions return a fixed-length bit string from an input bit string [22]. This functionality may be utilized, for example, in password storage, data integrity checking, or digital signature preparation. Three main features of hashing algorithms are:

  1. the process of hash calculation may involve more than one usage of the hashing algorithm,

  2. it should be impossible to retrieve the content of the original message from its hash,

  3. the probability of returning the same hash value from two different messages should be minimal.

Currently, the most popular hashing standards, which are also certified, are called SHA-2 and SHA-3 [14, 15]. These functions compute a fixed-length hash from a message. The available hash lengths vary from 224 bits to 512 bits, and the length determines the algorithm – e.g. SHA-256 will always return 256 bits of hash. Certain certified standards offer the possibility of choosing between two different hash lengths; for example, a 224-bit hash is created from the truncation of the SHA-256 digest. Beyond that, however, the user has no possibility to adjust the hash length. It can be concluded that it is either one fixed length that depends on the algorithm, or two possible lengths where the second one comes from the truncation of the hash of the first length. Hashing algorithms whose outputs are smaller, for example from 80 bits to 160 bits, are called lightweight cryptographic hash functions [5].

Since the late 1970s, researchers have proposed many different approaches to hash function construction. Most of the ideas are described in [19], for example hash functions based on block ciphers, cellular automata, the discrete logarithm problem, or the knapsack problem. Testing the strength and effectiveness of this kind of algorithm is still a problem. Existing test suites, such as the SHAVS presented in [13], are dedicated to the tested algorithm. Secondly, the National Institute of Standards and Technology (NIST) states in the cited report that [13]: ‘The SHAVS is designed to test conformance to SHA rather than provide a measure of a product’s security...’.

In this paper, the idea of hashing neural networks proposed in [19,20,21] is further developed, showing the performance of the ANNs used. The discussed networks have two layers: a hidden layer with sigmoid neurons and an output layer consisting of linear neurons. Two hash lengths were considered, that is 256 and 512 bits (to compare results with certified standards). For each tested hash length, seven networks were generated, differing in the number of neurons in the hidden layer. The Lorenz Attractor, which exhibits chaotic behavior under appropriate conditions, was utilized for training data preparation. The length of the returned hash could potentially be set with one-bit precision before the training process. Furthermore, the performance of the proposed networks is tested against the performance of the chosen certified standards (SHA-256 and SHA-512) in the MATLAB environment. The time of hash computation for data of different sizes was considered as the performance measure.

After introducing the idea of the approach and reviewing the state of the art, the paper focuses on the presentation of the performance testing of the ANN-based hashing functions, comparing the obtained models with certified ones.

2 Related Work

In this section, hashing algorithms based on chaotic systems are described. For each article, not only the chaotic model but also the core of the hashing algorithm is considered.

In [11], the authors proposed their own algorithm based on the Lorenz Attractor. However, they incorporated some functions from the SHA-2 algorithm, for example rotations. In their research, similarly to the research presented here, the time of computation was considered as a performance measure. The algorithm core consisted of four iterations that combined intermediate hash results and secret keys. The final results were compared with SHA-1. Even though the proposed algorithm was more efficient, SHA-1 is considered an outdated function, and there were no comparisons with current standards such as SHA-2 or SHA-3.

The algorithm presented in [10] was not a classical hashing scheme but enabled checking the integrity of data. The proposed procedure was based on large numbers and their powers over a finite field, which makes the whole idea similar to the RSA encryption scheme. The authors compared the efficiency of their solution with the Advanced Encryption Standard (AES) and concluded that the performance of their algorithm was slightly worse.

In [9], the authors incorporated operations similar to those described in [10], a sponge function that absorbed input data, and a hyper-chaotic Lorenz system. Because of the complexity of the algorithm, the authors noticed perturbations in the function's performance over time. The solution was tested for 256-bit and 512-bit hash lengths, but it could also return hashes of 1024 bits and more. The authors compared their proposal with the SHA-2 and SHA-3 standards; however, they did not test it for smaller hash lengths.

An innovative idea was presented in [1], where the authors proposed their own equation for input data absorption. Their solution was based on a three-dimensional chaotic map and was extensively tested. The proposed function could return 128, 160, 256, or 512 bits of hash. The results of the research were compared with SHA-1 and MD5, both of which are considered outdated.

Hashing algorithms may be created in many different ways. For example, in [8] some interesting hashing concepts based on evolutionary and genetic algorithms are presented. More information about hashing strategies is presented in [17].

Chaotic attractors may also be used in other cryptographic areas. For example, in [6] the authors proposed an image encryption scheme based on the Lorenz model. The core of the algorithm utilized crossover operations and sequences generated by the attractor. Even though the presented scheme was not a hashing function, the research conducted by the authors proved the usefulness of chaotic systems in cryptographic solutions.

In contrast to the described solutions, the idea presented in this paper enables the utilization of Artificial Neural Networks (ANNs) trained with the usage of the Lorenz Attractor as hashing models. The main advantage of the proposed scheme is the highly scalable ANN output: the length of the hash returned by those networks can be adjusted with a precision of one bit (before the training process). Furthermore, the performance of the ANNs was tested and the results were compared with two of the most popular Secure Hash Standards certified by the National Institute of Standards and Technology, that is SHA-256 and SHA-512. The efficiency comparison performed in the MATLAB environment tends to favor the presented networks. Further research will cover security tests of the networks and will also include a comparison with world standards.

3 The Chaotic Model Used

Chaotic equations are non-linear dynamic systems (models) that are highly sensitive to changes in their initial conditions [4]. The output of such models becomes non-deterministic over time. This phenomenon is also called deterministic chaos. One of the most iconic scientists investigating this topic was Edward Lorenz. He once described chaos as a situation [4]: when the present determines the future, but the approximate present does not approximately determine the future.

In our work, the Lorenz Attractor was used for the preparation of the ANN training data. Two sets of data were prepared: input data containing binary strings representing messages to be hashed, and output data containing hashes of those messages obtained with the usage of the Lorenz Attractor. The idea was to code a message into the attractor's initial conditions and then solve the model. The result was considered as the message hash.

The Lorenz Attractor is defined as presented in Eq. (1):

$$\begin{aligned} \left\{ \begin{array}{l} \frac{dx_1}{dt}=a(x_2-x_1)\\ \frac{dx_2}{dt}=cx_1 - x_2 - x_1x_3\\ \frac{dx_3}{dt}=x_1x_2-bx_3 \end{array} \right. \end{aligned}$$
(1)

This model becomes chaotic when \(a = 10\), \(b = 8/3\) and \(c = 28\) [16]. Each input binary string \(M = [m_1, m_2, m_3, ..., m_n]\) (where n denotes the desired hash length) was divided appropriately and converted into two real numbers that were used as the first two initial conditions (\(x_{1, 0}\) and \(x_{2, 0}\)). The algorithm for coding messages into initial parameters is presented below:

  1. If n = 256, do Steps 2–4.

  2. L11 = ctf(\([m_1, ..., m_{64}]\)), L21 = ctf(\([m_{65}, ..., m_{128}]\)).

  3. L31 = ctf(\([m_{129}, ..., m_{192}]\)), L41 = ctf(\([m_{193}, ..., m_{256}]\)).

  4. L1 = ctntf(L11, L21), L2 = ctntf(L31, L41).

  5. If n = 512, do Steps 6–11.

  6. L11 = ctf(\([m_1, ..., m_{64}]\)), L21 = ctf(\([m_{65}, ..., m_{128}]\)).

  7. L31 = ctf(\([m_{129}, ..., m_{192}]\)), L41 = ctf(\([m_{193}, ..., m_{256}]\)).

  8. L51 = ctf(\([m_{257}, ..., m_{320}]\)), L61 = ctf(\([m_{321}, ..., m_{384}]\)).

  9. L71 = ctf(\([m_{385}, ..., m_{448}]\)), L81 = ctf(\([m_{449}, ..., m_{512}]\)).

  10. L1 = ctntf(ctntf(L11, L21), ctntf(L31, L41)).

  11. L2 = ctntf(ctntf(L51, L61), ctntf(L71, L81)).

  12. \(x^{l}_{1, 0}\) = L1, \(x^{l}_{2, 0}\) = L2.

The construction of the ctf and ctntf functions is presented in Listing (1.1):

Listing 1.1. Construction of the ctf and ctntf functions.
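
As a rough illustration of the interface these functions expose, a minimal MATLAB sketch is given below. The bodies are assumptions made only for this sketch (a binary-fraction reading of the bits and a modular folding of two numbers); the exact construction used in Listing 1.1 may differ, and the variable names M, L1, L2, x0 are illustrative.

  % encode_demo.m - hypothetical sketch, not the original Listing 1.1.
  % Encodes a 256-bit message M into the first two Lorenz initial conditions
  % (Steps 2-4 of the algorithm above); the third condition is random (see text).
  M  = randi([0 1], 1, 256);                    % example message bits
  L1 = ctntf(ctf(M(1:64)),    ctf(M(65:128)));
  L2 = ctntf(ctf(M(129:192)), ctf(M(193:256)));
  x0 = [L1; L2; rand()];                        % initial conditions of the attractor

  function r = ctf(bits)
  % Assumed behavior: read a 64-bit vector as a binary fraction in [0, 1).
      r = sum(bits(:)' .* 2.^-(1:numel(bits)));
  end

  function r = ctntf(a, b)
  % Assumed behavior: fold two numbers from [0, 1] into a single number in [0, 1).
      r = mod(a + b, 1);
  end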

As can be seen, two hash lengths were considered, namely \(n = 256\) and \(n = 512\). These lengths are the most popular ones in the SHA-2 and SHA-3 certified hash function families. The third initial condition \(x^l_{3, 0}\) was a random real number from the range [0, 1]. The model described in Eq. (1) was solved with the usage of the Runge-Kutta 4th Order method, which can be represented as [18]:

$$\begin{aligned} k_1 = \varDelta t * f(t, x_i) \end{aligned}$$
(2)
$$\begin{aligned} k_2 = \varDelta t * f(t + \frac{\varDelta t}{2}, x_i + \frac{k_1}{2}) \end{aligned}$$
(3)
$$\begin{aligned} k_3 = \varDelta t * f(t + \frac{\varDelta t}{2}, x_i + \frac{k_2}{2}) \end{aligned}$$
(4)
$$\begin{aligned} k_4 = \varDelta t * f(t + \varDelta t, x_i + k_3) \end{aligned}$$
(5)
$$\begin{aligned} x_{i + 1} = x_i + \left( \frac{k_1 + 2k_2 + 2k_3 + k_4}{6}\right) \end{aligned}$$
(6)

where f represents the equations described in (1), \(x_i\) is a vector containing the solutions in all three dimensions (that is, \(x_i = [x_{1, i}, x_{2, i}, x_{3, i}]\)) in the \(i\)-th algorithm iteration, and t denotes the time at which the calculation is done:

$$\begin{aligned} t = t_0 + i * \varDelta t \end{aligned}$$
(7)

\(t_0\) is the moment when the computation starts and is equal to 0, and \(\varDelta t\) denotes the step (the assumed time interval between computations), which was equal to 0.1. The parameter i was an iterator over the interval [0, 39999]; thus, 40000 elements of the solution were always generated in all three dimensions. An example solution of the Lorenz Attractor for the following initial parameters: \(\varDelta t = 0.1\), \(x_{1, 0} = 0.4\), \(x_{2, 0} = 0.3\), and \(x_{3, 0} = 0.5\) is presented in Fig. 1.
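
A minimal MATLAB sketch of this solving procedure, written as a direct transcription of Eqs. (1)–(6) with illustrative variable names, is given below.

  function X = lorenz_rk4(x0, dt, steps)
  % Solve the Lorenz system (Eq. 1) with the classical Runge-Kutta 4th Order
  % scheme (Eqs. 2-6). x0 = [x_{1,0}; x_{2,0}; x_{3,0}]; in the text dt = 0.1
  % and steps = 40000.
      a = 10; b = 8/3; c = 28;                   % chaotic parameter set [16]
      f = @(x) [a*(x(2) - x(1));                 % dx1/dt
                c*x(1) - x(2) - x(1)*x(3);       % dx2/dt
                x(1)*x(2) - b*x(3)];             % dx3/dt
      X = zeros(3, steps);                       % solution in all three dimensions
      x = x0(:);
      for i = 1:steps
          k1 = dt * f(x);
          k2 = dt * f(x + k1/2);
          k3 = dt * f(x + k2/2);
          k4 = dt * f(x + k3);
          x  = x + (k1 + 2*k2 + 2*k3 + k4) / 6;  % Eq. (6)
          X(:, i) = x;
      end
  end

Calling X = lorenz_rk4([0.4; 0.3; 0.5], 0.1, 40000) corresponds to the example setup of Fig. 1.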

Fig. 1. Example solution of a Lorenz Attractor.

4 ANNs Training and Testing with Usage of Lorenz Attractor

In this section, the process of training and testing the Feed-Forward ANNs is described. The tested ANNs could be divided into two groups: those returning 256 bits of hash and those returning 512 bits of hash (\(n \in \{256, 512\}\)). The detailed algorithm of the whole process is presented below (with the assumption that the value of the parameter n was already chosen).

  1. Input data preparation. Two sets of data were generated:

    $$\begin{aligned} INPUT^{train}[i] = [b_1, b_2, ..., b_n], i = 1,...,10000. \end{aligned}$$
    (8)
    $$\begin{aligned} INPUT^{test}[i] = [b_1, b_2, ..., b_n], i = 1,...,5000. \end{aligned}$$
    (9)

    Both INPUT arrays represented bits of messages (denoted as \(b \in \{0, 1\}\)). In both cases, those bits were generated randomly and all created messages were unique (within the particular matrix as well as between matrices).

  2. Target data preparation. To use the Lorenz Attractor, i.e. the formula presented in Eq. (1), messages from the training set (\(INPUT^{train}\)) had to be encoded as its initial conditions. Details of the message compression process are described in Sect. 3. As a result, each message could be represented as two real numbers from the interval [0, 1]:

    $$\begin{aligned} IC[i] = [XL_i, YL_i], i = 1,...,10000; XL_i, YL_i \in \mathbb {R} \wedge XL_i, YL_i \in [0, 1]. \end{aligned}$$
    (10)

    IC denotes an initial condition array. The main advantage of such a solution is that it algorithmically bonds each message with the attractor's formula via its first two initial conditions. With the IC array prepared, the Lorenz Attractor formula was solved for each message from \(INPUT^{train}\) separately. That is, for the \(i\)-th message, the initial conditions were set to: \([x_{1, 0} = IC[i][0], x_{2, 0} = IC[i][1], x_{3, 0} = rand()]\), where rand() was a function returning a random real number from the range [0, 1]. Then, the attractor was solved with the usage of the Runge-Kutta 4th Order method (see Eqs. (2)–(6)). The solution array for the \(i\)-th message can be represented as:

    $$\begin{aligned} VS'_k[i] = [x_{k, 0}, x_{k, 1}, ..., x_{k, 39999}], k \in \{1, 2, 3\}. \end{aligned}$$
    (11)

    In all three dimensions, exactly 40000 elements of the solution were generated. To form a hash of the \(i\)-th message, the results had to be truncated. The truncated vectors are presented in Eq. (12).

    $$\begin{aligned} VS_k[i] = [x_{k, 1*step+1000}, x_{k, 2*step+1000}, ..., x_{k, n*step+1000}], k \in \{1, 2, 3\}, \end{aligned}$$
    (12)

    where:

    $$\begin{aligned} step = \lfloor \frac{40000 - 1000}{n}\rfloor . \end{aligned}$$
    (13)

    The first 1000 elements were skipped in all cases to avoid small distances (in the Euclidean sense) between solutions in 3D space. The step parameter was calculated to cover the whole solution space, and to avoid situations when two neighboring samples in a particular dimension are too close to each other. Neighboring samples might also form an ascending or descending slope, which was undesirable. After the truncation process, all vectors had to be binarized to form hashes. The general binarization formula is presented in Eq. (14).

    $$\begin{aligned} BS[i][j]_{k} = \left\{ \begin{array}{ll} 1 &{}\text {if}\; VS[i][j]_{k} \ge AVG_{k}[j] \\ 0 &{} \text {otherwise} \end{array}\right. \end{aligned}$$
    (14)

    where \(AVG_{k}[j]\) is the average value calculated from the \(VS_k\) array (\(k \in \{1, 2, 3\}\)) for each column \(j \in [1,2,...,n]\). At this stage of computation, every message from the \(INPUT^{train}\) array had three hash candidates (each of length equal to n) stored in arrays \(BS_1\), \(BS_2\) and \(BS_3\). To complete the target data preparation, only one array of hashes had to be chosen. To determine which one it should be, statistical tests described in [19] were performed on the arrays \(BS_{1, 2, 3}\), and after analysis of the results, only one was chosen (a sketch of the truncation and binarization steps is given after this list).

  3. ANNs training process. For every value of \(n \in \{256, 512\}\), 7 networks were created. The structure of each network was the same and could be represented as: I-HL-OL-O. I was an input layer of size n where the input messages were given. HL was a hidden layer containing sigmoid neurons; the number of neurons in this layer varied between the networks within one group. OL was an output layer with n linear output neurons. O was the network output that formed a hash. All ANNs were trained with the usage of the Scaled Conjugate Gradient method (SCG) [12]. The training set consisted of two arrays: \(INPUT^{train}\) and \(TARGET^{train}\). The first one, \(INPUT^{train}\), was described in Step 1. The target set was an array containing results obtained from the Lorenz Attractor that were truncated but not binarized (see Eq. (12)). The target array was selected through statistical analysis of the three truncated and binarized arrays (see Eq. (14)); that is, the B.P.T., S.T., and C.T. tests described in [19] were performed. Selection of the \(BS_i\) array in the statistical analysis process meant that \(VS_i\) was considered as the target set (\(TARGET^{train} = VS_i\)). Example results of such tests performed on a different set of networks are presented in [21]. A sketch of a possible network configuration is given after this list.

  4. Output data generation. Each trained network was used to generate output test data from the \(INPUT^{test}\) data. The \(OUTPUT^{test}\) data were used to prepare the binarization vectors used in the performance analysis process (see Sect. 5).

  5. ANNs evaluation process. The performance of the hashing networks was tested (in comparison with the certified standards). This was an independent stage in which different sets of data were used. All details and results are described in Sect. 5.

5 Analysis of Performance of the Hashing ANNs

In this section, the analysis of the performance of the hashing ANNs is presented, as well as a comparison of their performance with the MATLAB implementations of the SHA-256 and SHA-512 functions. The performance measure was the time of hash computation. Hashes were calculated from data that differed in size. According to [3], the computational cost of one feed-forward network pass is O(W), where W is the total number of weights, and the training cost is \(O(W^2)\). Hashing algorithms have cost O(1) for small messages and O(m) for longer messages, where m is the message length; this follows directly from their construction. The details of the experiments are presented below:

  • The MATLAB implementation of SHA-512 and SHA-256 presented in [7] was used. This is the only implementation of these certified hashing functions officially published on the MathWorks website (the official file exchange platform dedicated to MATLAB users).

  • The MATLAB functions were tested for the following data sizes: 512 b, 100 B, 1 kB, 10 kB, 50 kB, 100 kB, and 500 kB. As data, strings of appropriate length were considered here. Data representing 100 kB and 500 kB were not created directly as strings; instead, multiple hashing operations were performed on 50 kB data strings (2 times and 4 times, respectively).

  • The hashing networks were tested for the following data sizes: 512 b, 128 B, 1 kB, 10 kB, 50 kB, 100 kB, 500 kB, 1000 kB, 5000 kB, and 50000 kB. As data in this scenario, arrays of bits representing messages were considered. Every array had exactly n random bits in one row, and the appropriate number of rows. Hashing 1000 kB of data (and more) was tested as multiple hashing of the 500 kB array. Performance measurements included the binarization process, but in this scenario a precomputed binarization vector was used, which can be represented as:

    $$\begin{aligned} BV = [AVG_1, AVG_2, ..., AVG_n], \end{aligned}$$
    (15)

    where \(AVG_i\) is the average value calculated from the i-th column of the ANN's \(OUTPUT^{test}\) array (for each network, the vector BV was generated separately; see the sketch after this list).

  • Experiments were conducted in the MATLAB2020 environment on a personal computer with 16 GB RAM and an AMD Ryzen 5 2600 six-core processor (3.4 GHz).

  • The notation \(L\{n\}N\{HL\}\) denotes a hashing network trained with the usage of the Lorenz Attractor, returning n bits of hash and having HL neurons in the hidden layer.
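
For illustration, a possible way of applying a trained network net together with its binarization vector BV (Eq. (15)) is sketched below; the array name MSGS and the use of tic/toc for timing are assumptions of this sketch.

  % MSGS: n x k array of bits, k messages hashed in one pass.
  tic;
  out  = net(MSGS);                % raw network outputs, one column per message
  hash = double(out' >= BV);       % one n-bit hash per row, binarized with BV (Eq. 15)
  t = toc;                         % time of hash computation, the performance measure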

Results of the experiments are presented in Tables 1, 2, and 3.

Table 1. Performance of SHA-256 and SHA-512 implemented in MATLAB [7].
Table 2. Performance of Lorenz hashing networks returning 256 bits of hash.
Table 3. Performance of Lorenz hashing networks returning 512 bits of hash.

Results from Table 2 are visualized in Fig. 2, and results from Table 3 in Fig. 3. The measurements were also approximated with a Bezier curve to make them more readable. In both figures and for each network, it can be observed that up to a data size of about 50 kB the computation time grows only slightly, and beyond this threshold the growth takes the form of a linear function. In all cases, the computation time for networks with a larger number of neurons in the hidden layer grows faster than for networks with a smaller number of neurons (note that a logarithmic scale is used for both the OX and OY axes). Networks returning 512 bits of hash are also slower than networks returning 256 bits of hash.

Fig. 2. Performance of hashing networks returning 256 bits of hash.

Fig. 3. Performance of hashing networks returning 512 bits of hash.

In Fig. 4, the performance of the MATLAB implementations of SHA-256 and SHA-512 (presented in [7]) is compared with the performance of the two fastest hashing networks (one returning 256 bits of hash and the other returning 512 bits of hash). As can be seen, the networks are more efficient in this scenario. The computation time for the MATLAB SHA functions is also linear; however, the SHA-512 algorithm is less efficient than SHA-256.

Fig. 4. Comparison of performance of the fastest hashing networks returning 256 bits and 512 bits of hash, and the MATLAB implementations of SHA-256 and SHA-512.

Table 4 shows a comparison of the hashing efficiency of all networks and both MATLAB functions for popular data types. The assumed data sizes were [2]: 10 kB for a JPG image (JPG row), 19 kB for a PDF file (PDF row), 3.5 MB for an MP3 file or an Ebook (MP3 row), 4 GB for a DVD movie (DVD row), and 23 GB for a Blu-ray movie (BRM row). The times presented in this table were not measured directly but were obtained by linear interpolation of the data presented in the previous tables of this section. The obtained results show clearly that the cost of hashing with the ANN-based algorithms remains feasible even for very large, commonly used files, whereas the classic implementation is much slower, or even unacceptable (with times counted in days).

Table 4. Efficiency of popular data files hashing.
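
Such interpolated entries can be obtained in MATLAB along the following lines; the time vector below is only a placeholder to be filled with the measured values from the previous tables, and the size vector lists the data sizes used for the hashing networks.

  % Measured data sizes in kB (512 b = 0.0625 kB, 128 B = 0.125 kB).
  sizes_kB = [0.0625 0.125 1 10 50 100 500 1000 5000 50000];
  times_s  = zeros(size(sizes_kB));    % placeholder: substitute the measured times
  t_mp3 = interp1(sizes_kB, times_s, 3.5*1024, 'linear', 'extrap');  % estimate for a 3.5 MB file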

6 Conclusions

In this article, the concept of hashing artificial neural networks trained with the usage of the Lorenz Attractor was presented. The research was focused on the assessment of the performance of the proposed hash generators.

All the discussed networks had one hidden layer with sigmoid neurons and an output layer with n linear neurons (where n denoted the hash length). Two values of the parameter n were considered, that is 256 and 512. For both values of n, seven networks were created that differed in the number of neurons in the hidden layer. All networks were trained with the usage of the Scaled Conjugate Gradient method. The training set consisted of random messages (input data) and an appropriately prepared target set (hashes created from the input messages with the usage of the Lorenz Attractor).

The ANNs presented significantly better time efficiency than SHA-256 and SHA-512 implemented in MATLAB. One of the biggest advantages of the proposed solution is the potentially scalable output of the networks. The length of the returned hash can be established before the training process with one-bit precision. The networks are also easy to replace; thus, a compromised network can easily be fixed by training a new one.

Comparing the efficiency of our algorithms with the efficiency of the MATLAB implementations of classic hashing algorithms showed that, for large, commonly used files, hashing times remain feasible for the ANN-based approach (reaching hours), while the runtimes of the classic SHA implementations become unacceptable (reaching thousands of days).

Future work is aimed at including:

  • testing different network structures,

  • testing different chaotic series,

  • testing more values of the n parameter,

  • performing statistical tests as presented in [19]. The results of these tests will also be compared with the results of the same tests performed on the certified standards. The suite of tests presented in [19] may also be extended, for example with the Floyd-Warshall algorithm.

An example use case is presented in [21]. In the scenario described in [21], hashing networks are used to perform additional data integrity checking operations in a cloud environment. Because of the variable output hash length and the virtual machines' idle time, hash generation in such a system has a very small computational overhead.