1 Introduction

Security has become the most important armor responsible for protecting all kinds of resources and data from various types of threats targeting security services such as data confidentiality, integrity, and source authentication. These security services are typically ensured by resorting to cryptographic solutions, which are essential to overcome and limit such threats. Existing attacks can be either active or passive, where passive attacks can seriously impair the Data Confidentiality (DC) and privacy of the system, while active attacks can compromise its authentication (source, user, device), integrity, and availability. Moreover, the nature of passive attacks makes them very difficult to detect compared to active ones. An active attacker may insert, delete or modify data contents. Encrypting communicated or stored data can solve all problems related to passive attacks. However, this requires a distributed scheme and a robust key exchange mechanism. Typically, symmetric-key schemes are used for data encryption, especially since they are more efficient in terms of memory and computational complexity compared to asymmetric-key ones. Furthermore, conventional symmetric key ciphers are either block-based or stream-based. Several standardized cipher algorithms that ensure DC already exist, including the stream cipher RC4 [25], and the block cipher AES [17].

In general, a block cipher [23] uses a round function that can either be based on a Feistel Network (FN) such as DES, or on Substitution-Permutation Networks (SPN) such as AES. SPN lends itself to parallel implementation and requires a lower number of rounds compared to FN and hence, SPN exhibits lower latency and requires fewer resources than FN.

1.1 Related work

AES [6] is a block cipher that processes data in blocks of size 128 bits (16 bytes), and it uses keys of size 128, 192 and 256 bits. The design of AES depends on the SPN principle. It includes a round function, which consists of diffusion and substitution operations, and it is iterated r times, depending on the size of the secret key. The number of rounds, r, is equal to 10, 12, and 14 for a secret key of size 128, 192, and 256 bits, respectively.

Each round, except for the last, includes four operations:

  • RoundKeyAddition: it mixes the plain input block with the specific round key.

  • ByteSubstitution: the operation employs a substitution table, S-Box, to ensure the confusion property.

  • ShiftRows and MixColumn operations are used to ensure the diffusion property. Note that the MixColumn operation is eliminated in the last round.

1.2 Problem formulation

The security level of existing symmetric ciphers, against analytic attacks, depends on the number of rounds r, which leads to a trade-off between the security level and the required latency and resources. Ciphers that are based on a static structure have proven their resistance against analytic cryptanalysis. However, the static structure of the round function represents the main security issue. Moreover, since the cipher primitives are static, the required number of rounds r is high, where different substitution and diffusion operations are performed within each round [18, 20, 21].

Fixed cipher structures lend themselves to future potential attacks [4, 31], which would benefit from the fixed structure (substitution and diffusion primitives) to recover the secret key [27]. Examples of such attacks include implementation attacks such as side-channel attacks and fault attacks [27]. Hence, countermeasures against implementation attacks are required, which would increase the latency and required resources. This, in turn, reduces their performance and makes them not suitable for some of the future systems and applications [2].

1.3 Motivation

To overcome these limitations, our approach uses the dynamic key-dependent structure as in [18, 21] to reduce the required number of rounds and operations. This leads to a good balance between efficiency and security level, as well as offering a simple solution to prevent certain implementation attacks.

To reduce the execution time of the existing cryptographic algorithms, GPU (Graphic Processing Unit) implementations are being adopted. A GPU is useful for cryptographic algorithms, which can benefit from the hundreds and even thousands of cores in a GPU. Researchers use GPUs to generate pseudo-random numbers such as in [1, 13]. Also, standard cryptographic algorithms have been implemented on GPUs such as AES [9, 14, 15], which resulted in an impressive speed-up [5] compared to the CPU implementation. It is worth noting that the efficient implementation of an algorithm on a GPU requires the expertise to optimize the use of the GPU architecture in terms of shared memory, registers, and warp [22].

Recently, an optimized and efficient implementation of AES on GPU was presented in [15]. It achieved an excellent performance and the authors made various optimization compared to the previous related works. Accordingly, this implementation is selected as the reference for comparison against the proposed cipher solution. There is another recent implementation of AES on GPU, PHAST, which was described in [26]. This implementation is more generic and it resulted in about 10% decrease in performance as compared to [15].

1.4 Contributions

The proposed cipher solution follows the recent dynamic key-dependent approach of [7, 20, 21]. In contrast to these related solutions, no integer diffusion operation is used in the proposed dynamic key-dependent stream cipher. This operation is eliminated without weakening the cipher security level since the cryptographic primitives are updated for each new input data. Moreover, the proposed solution does not require the avalanche effect, but it is based on high key sensitivity.

The proposed cipher scheme uses an efficient and simple key-stream generation algorithm that uses dynamic permutation and substitution tables in addition to two different PRNGs with a large number of seeds. To the best of our knowledge, the proposed solution is the first dynamic key-dependent stream cipher algorithm with dynamic seeds and substitution/permutation tables.

Next, we list the technical contribution of this paper as compared to the existing cipher solutions:

  • The proposed cipher is based on a dynamic key-dependent approach, and it is based on a simple key derivation function that uses a variable session key and a Nonce, which change for each new input message, making it highly resistant against attacks.

  • The permutation table is used as a perturbation technique to modify the internal state, which increases the periodicity of the employed PRNGs.

  • The proposed solution uses a dynamic substitution process to increase the nonlinear degree of the generated key-stream and to achieve higher key sensitivity.

  • The proposed cipher exhibits a high level of randomness, which was verified using the “BigCrush” of “TestU01” [12] statistical suite tests on the generated key-stream.

  • The proposed cipher scheme uses lightweight PRNGs and simple operations, which minimizes the latency and required resources, and leads to a simple software implementation.

In summary, The proposed cipher satisfies the desirable cryptographic characteristics such as long periodicity, high level of key sensitivity, and high level of randomness and thus, higher resistance against attacks, with low latency and overhead.

1.5 Organization

The remainder of the paper is organized as follows. Section 2 describes and analyzes existing GPU cipher implementation. In Section 3, the proposed dynamic key derivation is presented. While in Section 4, the employed cipher primitives construction techniques are described. Then, in Section 5, we introduce and describe in details the proposed stream cipher algorithm, along with the functionality of each operation. In Sections 6 and 7, we respectively assess the robustness of the proposed cipher scheme and its performance. Finally, in Section 8, a conclusion and future directions are presented.

2 Existing GPU cipher algorithms and their corresponding implementations

A GPU (Graphic Processing Unit) is a commonly used architecture to accelerate computations. GPUs are used in many computing applications and systems ranging from smart-phones, embedded computing, to supercomputers. The architecture of a GPU is quite different from that of a CPU. In a GPU, the architecture is optimized to maximize the execution throughput of many simultaneous threads. The number of computing cores inside a GPU ranges from hundreds to even thousands. The hardware is designed to execute many threads, even if the bottleneck is the memory access itself. To benefit from the GPUs computing power, users need to use a number of threads that exceeds the number of cores. Hence, while some threads are waiting for their data, other threads are capable of executing. Typically, there are many kinds of memory in a GPU: global memory which is the slowest one, cache memory, texture memory, shared memory, local memory, and a limited set of registers having the fastest access. Consequently, memory management is critical within GPUs.

GPUs are composed of streaming multiprocessors (SMs), and their number varies for different GPU types. Typically, SMs are composed of 32 cores that can execute only a single instruction at a time. So, if two threads are executed on the same SM, one instruction will be executed, while the second one would have to wait. This is called thread divergence. Consequently, “IF” instructions and “WHILE” conditions must be avoided, whenever possible. Threads are scheduled by groups of 32 on an SM, and they are referred to as warps. In practice, threads are organized into blocks; depending on the GPU architecture, the maximum thread number per block is limited to 1024. Hence, it is important to keep in mind that GPUs technical details are constantly changing with every new generation.

3 Dynamic key derivation function

In this section, the proposed dynamic key generation function and the corresponding sub-keys generation schemes are presented and illustrated in Fig. 1. The cipher primitives (seeds and permutation boxes) are dynamic and they change based on this set of sub-keys. The specific secret key, SK, is mixed with a NONCE No (unique for each new input) to produce a dynamic secret, O. Then, the new dynamic key (DK) is obtained by hashing O using a secure cryptographic hash function. To ensure that a different DK is produced for each different input message or session, the SHA-512 hash function is chosen and it is known for its high resistance degree against collisions. The dynamicity introduces robustness against powerful attacks. The dynamic key, DK, is used to generate the required sub-keys as explained next.

  • Master Secret KeyK: It is shared between both legal entities to provide enhanced security. It allows the symmetric secret key to be renewed after each periodic interval, depending on the application itself. For example, Elliptic Curve Diffie Hellman (ECDH) protocols can be selected for this specific task.

  • NonceNo: Each Nonce will be used only once; it is updated for every input image or session. Two possible Nonce generation techniques can be adopted i) generated by the sender and transmitted to the receiver in an encrypted form, by either employing a secret key or by employing the receiver public key; ii) producing the Nonce at the sender and receiver in a synchronized manner, through the use of a deterministic pseudo-random generator.

  • Dynamic KeyDK: The master secret key K is XORed with N0, and the output is hashed using SHA-512. This generates the dynamic key, DK, which represents the MAC value with a size of 512 bits. Then, DK is divided into 4 main sub-keys {KP,KSKg1,Kg2} each with a size of 16 bytes (128 bits).

Fig. 1
figure 1

The proposed Key derivation function and its corresponding construction of cipher primitives

4 Construction of dynamic cipher primitives

The sub-keys are used to generate the required cipher primitives, as described below.

  • Permutation sub-keyKP: it consists of the most significant 16 bytes of DK, and it is used to produce a set of permutation tables (32 Pboxes) that can be employed during the selection process. In this solution, any key-dependent permutation generation algorithm can be employed such as the ones in [19, 20]. The selection of the Modified Key Setup Algorithm (MKSA) of [19] is used to construct the required dynamic key-dependent permutation tables. In fact, MKSA is selected due to its simple hardware and software implementations. To ensure that a P-box has a good cryptographic performance, MKSA should be always iterated with a different key in order to produce different P-boxes. Therefore, KP is used as a seed for the RC4 just to generate a set of permutation sub-keys, and each sub-key is used as a seed for the MKSA to produce a different permutation table. On the other hand, the weaknesses of RC4, as reported in [8, 11, 16] do not affect the proposed solution, which is based on a dynamic key-dependent structure.

    RC4 will be iterated to generate a byte vector of length equals to Np × lp. Then, the output is reshaped to form a matrix with a size of Np × lp, where each row represents one of the dynamic permutation keys with a length of Qp = lp × 8 bits, to be used as a permutation table.

    Note that RC4 is iterated with a dynamic sub-key to avoid any weakness and to achieve a high level of security.

  • Substitution sub-keyKS: it represents the second set of the 16 most significant bytes, and it is used to produce a set of substitution sub-keys, where each sub-key is used to produce a dynamic substitution table (S-box). Any key-dependent algorithm could be used for the generation of the substitution tables. We adopt the simple technique used in [19], which is based on the Key Setup Algorithm (KSA) of RC4. The output of the original KSA, for any input key, is a substitution table that is used as a dynamic S-box. RC4 is iterated to form a byte vector of length equals to Ns × ls. Then, the output is reshaped to form a matrix of size Ns × ls; each row represents one of the dynamic substitution keys, with a size of Qs = ls × 8 bits, and used as a key-dependent substitution table.

  • First PRNG seedKg1: it represents the third most significant 16 bytes of DK and it is used to produce a set of seeds of length lg1, one of which is selected for each thread. Also, in this step, RC4 is selected and it is iterated for \(\frac {lg1 \times Qg1}{8}\) times to generate different N seeds, where N represents the possible number of threads, and Qg1 represents the precision of the first generator, which can be equal to 32, 64 or 128. The output key-stream is reshaped to form a byte matrix of size \(N \times \frac {Qg1}{8}\). Each row of this matrix represents one of the seeds, and it has a length equals to Qg1 bits. Any repeated row (seed) is eliminated from this list and RC4 is re-iterated to produce a new seed.

  • Second PRNG seedKg2: It represents the fourth most significant 16 bytes of DK and it is used to produce a set of seeds of length N, one of which is selected for each thread. Similarly, RC4 is selected and it is iterated for \(\frac {N \times Qg2}{8}\) times to generate different N seeds. Besides, Qg2 represents the precision of the second generator. The output key-stream is reshaped to form a byte matrix of size \(N \times \frac {Qg2}{8}\). Each row has a size of Qg2 bits, and represents one of the seeds. Any repeated row (seed) is also eliminated from this list, and RC4 is re-iterated to produce a new seed.

All notations are shown in Table 1. These steps guarantee a high level of sensitivity, where any tiny change in the dynamic key would result into a completely different cipher primitive in the generation process; such a change was proven in Section 6.2. The parameters’ derivation is illustrated in Fig. 1.

Table 1 Table of notations

5 Proposed stream cipher algorithm

This section describes the proposed stream cipher, “ESSENCE”, which is designed with a single round to outperform AES. The main properties of the proposed solution are: high-security level, reduced computational complexity, and simple and parallel hardware and software implementations.

5.1 Basic Concepts

The proposed scheme is based on 3 main concepts:

  • Parallel Computing: This algorithm is designed to run in parallel. All the threads are independent of each other and they could be all executed in parallel (see Fig. 2), even if it is not possible to schedule all of them at the same time.

    Fig. 2
    figure 2

    Scheme of the proposed lightweight stream cipher algorithm for the ith thread

    Multi-streaming multiprocessors (SM) contain each 32 syn-chrome threads and shared memory and hence, the same operation is applied on these syn-chrome threads but with different inputs. In the proposed scheme, every 32 threads are iterated to perform the same function. For example, the first PRNG with different seeds is iterated to produce 32 outputs, each represented by 32 bits.

  • Flexible Structure: The structure of the proposed stream cipher allows for any pair of outputs of the efficient PRNGs to be used [29]. Therefore, any PRNG that exhibits very good performance and satisfies the randomness properties could be used. For example, in this paper, we use “xoroshiro128plus” and “xorshift”, which were selected due to their simplicity and efficiency. The proposed solution uses both PRNGs since TestU01 can detect the link between all the threads, if only one PRNG is used.

  • Efficient & Lightweight Combination of both PRNGs The selected pairs of PRNGs are combined in an efficient manner (diffusion operation) to produce a key-stream with high periodicity, and a stable randomness degree. The proposed technique benefits from the shared memory of GPU, whereby the output of the first PRNG is stored in the shared memory. Then, the output of the second PRNG is mixed with two different outputs of the first PRNG.

  • Dynamic Selection of Shared Memories. These shared memories (O0, and O1) are selected according to dynamic permutation tables (32 different P-boxes).

In the following, we describe a set of possible pseudo-random generators that can be used in the proposed stream cipher.

5.2 xoroshiro128+ (XOR/rotate/shift/rotate)

It is a successor to Xorshift (implementation at xorshift128+). It uses a carefully handcrafted shift/rotate-based linear transformation, as shown in Algorithm 1. This PRNG ensures a significant reduction in latency and the corresponding resources. Also, this PRNG reaches a high level of randomness. It has a repetition period of (2128-1), which is not long enough for cryptographic algorithms. Therefore, the proposed stream cipher scheme uses a higher number of threads and for each thread, a xoroshiro128+ PRNG is used. Also, a diffusion operation is applied to the output of three different xoroshiro128+ PRNGs (different for each iteration) to increase the periodicity of the proposed key-stream. The xorshift64 algorithm is presented in Algorithm 2.

figure d
figure e

5.3 Xorshift

Xorshift belongs to a class of PRNGs that is based on linear-feedback shift registers (LFSRs), which is described in Algorithm-2. Xorshift allows for an efficient implementation without the need of excessively using sparse polynomials. This makes them extremely fast on any modern computer architecture. Similar to LFSRs, the available parameters must be chosen with extreme caution in order to achieve a long period [24]. However, xorshift generators do not have non-linear steps. This makes them fail some statistical tests [24]. However, Xorshift generators do have numerous advantages including low execution time as well as a simple implementation.

5.4 Proposed encryption algorithm

Below, we describe the various steps of the proposed algorithm, as illustrated in Algorithm 3:

figure f

Note that the input data is stored in the d_input table, and the encrypted data (output) is stored in the d_output table. As the input is not changed, the keyword __restrict__ allows the compiler to optimize the variable’s access, which reduces the memory access time. Other unchanged variables also have this keyword during the execution of the algorithm.

The variable d_xoro is used to store the required internal values for the ”xoroshiro128plus” PRNG [3]. Each thread has a different value. To improve the performance, in many GPU algorithms, one is advised not to compute more than a single value per thread. Consequently, in our algorithm, the variable nb represents the number of elements that each thread is responsible for. The use of the loop is essential to reduce the number of threads used in the code to maximize the GPU’s occupancy. Without the loop, the performance would be diminished.

In the main loop, the xoro variable is used to select 2 permutation tables from the 32 generated ones. Note that we could have chosen bigger permutation tables. However, in this case, we would need to use the __syncthreads() instruction to synchronize threads on different warps. However, such an instruction reduces the performance significantly. These permutation tables are obtained using 32 P-boxes generated with the initial key provided to the proposed ESSENSE PRNG. So, the variable d_pbox contains 32 random permutations tables of size 32. Variable shmem is the shared memory that allows threads to exchange their values. It should be noted that each thread will have values coming from different permutation tables. For example, thread 0 will xor its result with threads 2 and 10, while thread 1 will xor its result with threads 3 and 8, and thread 2 will xor its results with threads 31 and 9, and so on. Moreover, at each iteration of the loop, the values of o0 and o1 change.

Then, the algorithm calls the xoroshiro128plus function, which changes the variable xoro, and puts the result into the variable res. Then, the shared memory is used to save this variable before xoring it with 2 other numbers generated by 2 other threads (according to the two permutation tables, as previously mentioned). The variable res is used as input to the second PRNG (xorshift64), and the result is saved in res2. Then, res2 is xor-ed with two other values coming from two other threads (in the same warp). Next, res and res2 are also xor-ed in order to obtain res3. Finally, a substitution table d_sbox is used to substitute 4 or 8 different bytes of res3 for an output of 32 or 64 bits, respectively. Note that the output is converted to an unsigned char table before applying the substitution operation on each element of the table. At the end of the loop, the internal value of xoro is saved for the next call of the function. Finally, it should be noted that nbele is the total number of threads, which depends on the size of the data to encrypt.

5.5 Proposed decryption algorithm

A legitimate receiver will use the same steps for decryption as the ones for encryption, and the same secret and Nonce to produce the specific dynamic key. This allows for the generation of the required cipher primitives. Then, the decryption algorithm proceeds in a similar manner to the encryption algorithm.

6 Security analysis

An efficient encryption algorithm should be able to resist the most known types of attacks such as statistical, differential, chosen/known plain-text, and brute-force attacks [19, 20]. Extensive experiments are performed in this section to demonstrate the efficiency and security level of the proposed scheme against such attacks. Note that the proposed solution can be used for any kind of data (structured or unstructured), but the following results are provided for multimedia image contents.

6.1 Statistical analysis

To guard against statistical attacks, a cipher must exhibit a high degree of randomness and uniformity [30]. To test the randomness degree, the following statistical security tests were carried out, (a) Probability Density Function (PDF) analysis, (b) Entropy analysis and (c) Correlation between plain and encrypted images.

6.1.1 Uniformity analysis

The most important test is the probability density function(PDF) of the encrypted image, which must be uniform; every symbol has a probability occurrence close to \(\frac {1}{n}\), where n is the number of symbols. The PDFs of two original plain-images and their corresponding cipher images are shown in Fig. 3. It can be seen that the PDFs of the encrypted images are close to a uniform distribution, with a value close to 0.039 that is \(\frac {1}{256}=3.9 \times 10^{-3}\).

Fig. 3
figure 3

a and e show original images; b and f show their corresponding PDF; c and g show their corresponding encrypted images; d and h show the PDF of encrypyed images. In b, f, d and h, the x-axis and y-axis represent the symbol values and their corresponding probability values

6.1.2 Information Entropy Analysis

The information entropy, of a given image M, is a parameter that measures the uncertainty level in a random variable [32], and it is defined by:

$$ H(m)= -\sum\limits_{i=1}^{h^{2}} p(m_{i})\log_{2} \frac{1}{p(m_{i})} $$
(1)

The entropy is expressed in bits, and p(mi) indicates the occurrence probability of symbol mi, and NS the total number of symbols. If the entropy of the encrypted data is either equal to or close to log2(NS), it can be considered as a true random source with a uniform distribution.

The Entropy analysis of the encrypted Lenna image, at the sub-matrix level with a dimension of 16 × 16, and by using a random dynamic key, is shown in Fig. 4. The results indicate that the encrypted images have an entropy similar to the desired value of 8. As such, the proposed cipher is sufficiently secure against any given entropy attack.

Fig. 4
figure 4

Entropy analysis of encrypted Lenna versus 1,000 random secret keys at the sub-matrix level. Encrypted image is divided into a set of sub-matrices of size 16 × 16 and NS = 256 bytes (mean equal to 7.175)

6.1.3 Independence

Removing any correlation between the sequence of elements is highly essential to ensure the robustness of the proposed cipher scheme [19]. Having a correlation coefficient close to zero means that the cipher scheme exhibits a high randomness degree. The correlation test is performed by randomly taking adjacent pixels from an original image and its corresponding encrypted image. This correlation can be done in horizontal, vertical and diagonal directions. The correlation coefficient rxy is calculated using the following equation:

$$ \begin{array}{@{}rcl@{}} && r_{xy}={}{}\frac{cov (x,y)} {\sqrt{D(x)\times{D(y)}}}\\ && \text{where :} \\ && cov (x,y)=\frac{1}{N}\times \sum\limits_{i=1}^{N} (x_{i}-E(x))(y_{i}-E(y)) \\ && E_{x}=\frac{1}{N}\times \sum\limits_{i=1}^{N} x_{i} \\ &&{} D_{x}=\frac{1}{N}\times \sum\limits_{i=1}^{N} (x_{i}-E(x))^{2} \end{array} $$
(2)

The correlation results of the original and encrypted images, and for (2,000 pairs of adjacent pixels), are shown in Figs. 5 and 6, for one random key, and for 1,000 random keys, respectively. The results clearly show that the adjacent pixels of the plain image have a high correlation, close to 1. However, the coefficient correlation of the encrypted images tends is very low, close to 0, confirming the randomness property of the proposed cipher.

Fig. 5
figure 5

Correlation distribution in adjacent pixels (2,000 pairs) in original Lena: a horizontally, b vertically and c diagonally. Correlation in adjacent pixels in ciphered Lena: d horizontally, e vertically and f diagonally

Fig. 6
figure 6

The variation of the correlation coefficient for adjacent pixels in ciphered Lenna image versus 1000 random keys: a horizontally, b vertically and c diagonally

6.1.4 Plain data vs. encrypted data

The encrypted data should be very different from the original one, with a difference of at least 50%, at the bit level. According to the obtained result in Fig. 7a, the proposed cipher scheme satisfies the desirable difference results, with a percentage of at least 50% between the plain and the encrypted Lenna images.

Fig. 7
figure 7

a The different variation between plain and ciphered Lenna image (percentage of the Hamming distance) and b key sensitivity against 1,000 random keys

6.2 Sensitivity tests

Differential attacks are based on studying the relation between two encrypted messages resulting from a slight change, such as a one-bit difference, between two original messages. The sensitivity tests must confirm that a small change in the plain-image or in the key affect the cipher image and generate a different one. The higher the difference, the better is the sensitivity of the encryption algorithm.

6.2.1 Key Sensitivity test

This is one of the most important tests, and it quantifies the sensitivity against a slight change in the secret key. The proposed key derivation function is based on a secret key and a Nonce. To further study the key sensitivity, two dynamic keys are used, DK1 and DK2, which differ by a single random bit. The two plain-images are then encrypted separately, and the Hamming distance of the corresponding encrypted images, C1 and C2, is computed and illustrated in Fig. 7b against 1,000 random dynamic keys. We can see that the majority of values are close to the optimal one (50%). This confirms the high key sensitivity of the proposed cipher algorithm. Additionally, the obtained results of 49.9970 are acceptable when compared to the reported ones of AES.

6.2.2 Plain-text sensitivity

Since a different dynamic key is being used for each input image, the algorithm produces a completely different cipher image for the same plain image. Hence, the proposed cipher successfully satisfies the avalanche criteria.

6.3 Visual degradation

This test is restricted to image and video contents, and it quantifies the visual degradation associated with the output of a cipher scheme. Two popular parameters are assessed to measure the visual quality: the Structural SIMilarity index (SSIM) [28], and the Peak Signal-to-Noise Ratio (PSNR) [10].

The PSNR is derived from the Mean Squared Error (MSE), which represents the cumulative squared error between the encrypted and original images. A low PSNR value indicates a high difference between the cipher and original images. On the other hand, SSIM lies in the [0,1] interval, where 0 means the absence of correlation between original and cipher images, while a value close to 1 indicates a high correlation between the original and cipher images. We measured PSNR and SSIM between the original and encrypted Lenna images for 1,000 random keys. The results are presented in Fig. 8a and b, respectively. It can be seen that the value of the PSNR is 9.23 dB, which is a low value and confirming the high difference between the original and encrypted images. Also, the SSIM values are always close to zero, which confirms that a high and hard visual distortion is achieved by the proposed cipher algorithm.

Fig. 8
figure 8

Variation of PSNR and SSIM between the original and encrypted Lenna image versus 1,000 random keys

6.4 Cryptanalysis: Resistance against well-known types of attacks

In contrast to the majority of existing cipher solutions, our scheme is based on a dynamic key approach, with dynamic substitution, permutation and diffusion layers for each input data. Previous statistical tests (entropy analysis, probability density function, correlation tests) have confirmed the robustness of the proposed cipher scheme and its high resistance against statistical attacks. Moreover, the key sensitivity analysis demonstrated a high sensitivity against key-related attacks. These results are sufficient to infer that no useful information can be inferred from the encrypted data. On the other hand, the resistance against chosen/known plain-text attacks is verified due to the dynamic key approach, which drastically complicates the attacker’s task. As such, the problems of a single message failure and accidental key disclosure are avoided. Furthermore, differential and linear attacks are ineffective since any change in the dynamic key leads to a significant difference in the produced cipher primitive and in the encrypted message as well. Also, the key space of the secret key is of the order of 2128, 2192 or 2256, which is sufficiently large to make brute-force attacks unfeasible. The same is true for the key space of the dynamic key, which is 2512. Note that a large secret key and a large dynamic key are being used since the difficulty of cipher-text-only attack is equivalent to one of the brute force attacks, making it impossible for a cipher-text-only attack to retrieve useful information from the cipher image.

6.5 Statistical tests with TestU01

As previously explained, ESSENCE has been tested with more than 100 seeds via TestU01 [12], and it successfully passed all the tests. In practice, a message of size 512*512 is typically used with all elements set to zero, and the key is initialized only once, at the beginning. All the other variables are also initialized once. Since TestU01 uses many pseudo-random numbers, the same message is used repeatedly over a very large number of iterations, with a single difference between iterations, the different numbers generated by the PRNGs.

7 Performance analysis

In this section, the cipher latency is quantified to assess the performance of the proposed cipher.

7.1 Experiments

To measure the performance of the proposed cipher, ESSENCE is evaluated on a Titan X GPU, which has the following characteristics:

  • Compute capability: 5.2

  • Global memory: 12,207 MB

  • GPU frequency: 1.25 GHz

  • Memory frequency: 3,505 MHz

  • Number of CUDA cores: 3,072

and on a Tesla V100 with the following characteristics:

  • Compute capability: 7.0

  • Global memory: 16,152 MB

  • GPU frequency: 1.53 GHz

  • Memory frequency: 877 MHz

  • Number of CUDA cores: 5,120

To compare the performance against the best AES implementation, we selected the implementation of [15], which uses the ECB operation mode, and we shall refer to it as AES-ECB. The performance tests are based on different 8-bit color images. Note that the throughput of AES-ECB is very close to the result in [26], 570.72 Gbps, which corresponds to 71.3 GBps.

The execution time of the encryption algorithm is the same as the one of the decryption algorithm (stream cipher). Note that our implementation is highly optimized, and the kernel operations of reading and writing an image take approximately the same time. The speed-up of ESSENCE compared to AES-ECB is shown in Tables 2 and 3 and in Fig. 9.

Fig. 9
figure 9

Speed-up (execution time ratio) of ESSENCE compared to AES-ECB on a Titan X GPU and on Tesla V100

Table 2 Throughput of ESSENCE and AES-ECB on a Titan X GPU
Table 3 Throughput of ESSENCE and AES-ECB on a Volta V100 GPU

The obtained results indicate that the proposed cipher scheme is faster compared to AES-ECB, and the ratio varies between 1.4 and 2 depending on the message length on the Titan X, and between 2.4 and 2.9 for the Tesla V100. Therefore, the proposed cipher scheme is more suitable for real-time applications.

8 Conclusion

In this paper, we presented ESSENCE, a new dynamic, key-dependent, one-round stream cipher scheme with an efficient, parallel, and dynamic key-dependent structure, and which was designed targeting a GPU implementation. ESSENCE outperformed the most optimized implementation of AES on GPU, which makes it preferable for real-time applications. Moreover, the proposed cipher scheme offers a high degree of randomness, which was validated by quantifying the produced key-stream, which successfully passed the statistical tests of TestU01. Also, ESSENCE has a high periodicity since it combines the threads’ results of two PRNGs, which are then dynamically xor-ed based on 32 permutation tables, which are also generated and related to the dynamic key. Moreover, the implementation of ESSENCE is very simple compared to other existing cipher schemes. Equally important, the robustness of ESSENCE has been assessed and confirmed via cryptanalysis along with different benchmark tests. Note that other existing cryptanalysis techniques are designed to target static structures, which is not the case of the proposed scheme. In future work, the design of an efficient parallel dynamic key-dependent hash function for GPU will be investigated.