1 Introduction

Cybersecurity is critical today, as cybercrime and cyberattacks carried out by malicious users and hackers are on the rise. To protect data, encryption algorithms are used, such as the Advanced Encryption Standard (AES), standardized in 2001 by the National Institute of Standards and Technology (NIST) to address the security weaknesses of the Data Encryption Standard (DES). AES is a block cipher-based encryption technique. More than two decades later, AES continues to withstand all known practical attacks and is arguably the most widely used encryption algorithm. Its widespread use has led to the development of many optimized implementations for a variety of CPU architectures.

Traditionally, graphics processing units (GPUs) are used by enthusiasts and developers for gaming, game development, video rendering, and other workloads that require a considerably large amount of video memory and dedicated processing. GPU architectures follow the single instruction, multiple data (SIMD) model, which allows them to execute the same instruction on multiple data streams in parallel. This is often described as massive parallelism [1,2,3,4,5].

The main motive behind adopting this research topic is to enhance data security using encryption implementations that offer minimal power consumption, high throughput, and low latency. This includes exploring general-purpose computing on GPUs to extract all of its benefits. Encrypting large files on a CPU tends to take a long time because the CPU performs each calculation sequentially, while offloading this computation to a GPU drastically reduces the time taken because the GPU performs the same calculations in parallel [5,6,7,8,9,10]: many similar calculations run concurrently, yielding the result faster. When GPUs are used to perform general tasks rather than video processing, they are known as general-purpose graphics processing units (GPGPUs). GPGPUs are used for tasks traditionally performed by CPUs, such as mathematical computation and cryptography, as well as cryptocurrency mining. GPGPUs are programmed through parallel computing platforms such as OpenCL or CUDA [10,11,12,13,14,15]. The proposed project makes use of Compute Unified Device Architecture (CUDA), an NVIDIA-exclusive technology available on supported NVIDIA compute devices; compatibility can be checked on NVIDIA's official website.

The proposed study seeks to show the potential speedup and advantage of using a GPU to encrypt files with the AES algorithm. Although the speedup is significant, it does not benefit end users directly. Large corporations can truly harness this power, as they must continuously encrypt large numbers of files under time constraints; end users then benefit indirectly because their requests are served faster. The technique saves not only time but also power when resources are used efficiently, lowering costs both for electricity and for cooling the machines. The applications of GPUs for general-purpose workloads are vast; encryption is just one of many.

2 Related Work

A survey of related work is shown in Table 1.

Table 1 Literature survey

3 Proposed Work

  1. (A)

    AES Algorithm

The AES block cipher works on 128 bits, or 16 bytes, of input data at a time. AES is an iterative algorithm based on the substitution-permutation network principle (Fig. 1). The total number of rounds needed for the encryption or decryption process is determined by the size of the cryptographic key: 10 rounds for a 128-bit key, 12 for a 192-bit key, and 14 for a 256-bit key. AES's key lengths and round counts are shown in Table 2.

Fig. 1 AES algorithm (block diagram: binary data and a key enter AES in VRAM, which encrypts plaintext to ciphertext and performs the reverse process for decryption)

Table 2 AES's key length and number of rounds

The input is represented as a 4 × 4 byte grid, or matrix, in column-major order, in contrast to the row-major order conventional in systems programming. The equation below shows the AES 16-byte state matrix of 4 rows and 4 columns, which is mapped to an array when converting plaintext to ciphertext.

$$ \begin{bmatrix} b_{0} & b_{4} & b_{8} & b_{12} \\ b_{1} & b_{5} & b_{9} & b_{13} \\ b_{2} & b_{6} & b_{10} & b_{14} \\ b_{3} & b_{7} & b_{11} & b_{15} \end{bmatrix} $$
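
For implementers, the column-major mapping can be expressed in a few lines of C. The following is a minimal sketch under our own naming (load_state and store_state are illustrative helpers, not the paper's code):

```c
#include <stdint.h>

/* Load a 16-byte block into the AES state in column-major order:
   state[row][col] = input[4 * col + row], so bytes b0..b3 fill the
   first column, b4..b7 the second, and so on. */
static void load_state(uint8_t state[4][4], const uint8_t input[16])
{
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            state[row][col] = input[4 * col + row];
}

/* Store the state back into a flat 16-byte block, same ordering. */
static void store_state(uint8_t output[16], const uint8_t state[4][4])
{
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            output[4 * col + row] = state[row][col];
}
```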

Each round is composed of several processing steps, including substitution, transposition, and mixing of the input plaintext, to generate the final ciphertext output. Each round is divided into four steps—

  1. (1)

SubBytes—Each of the 16 input bytes is substituted by looking it up in the S-Box.

  2. (2)

ShiftRows—Each of the four rows of the matrix is cyclically shifted to the left; row i is shifted by i bytes, so the first row is left unchanged.

  3. (3)

MixColumns—Each column of four bytes is transformed using a fixed matrix multiplication over the finite field GF(2^8), written out after this list. This operation is not performed in the last round.

  4. (4)

AddRoundKey—The 128 bits of the state matrix are XORed with the 128 bits of the round key. If the current round is the last, the output is the ciphertext.
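
To make the MixColumns transformation concrete, its standard form (a well-known property of AES, not reproduced from the paper's tables or figures) multiplies each state column by a fixed circulant matrix over GF(2^8):

$$ \begin{bmatrix} s'_{0} \\ s'_{1} \\ s'_{2} \\ s'_{3} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0} \\ s_{1} \\ s_{2} \\ s_{3} \end{bmatrix} $$

Here the entries are hexadecimal, multiplication is carried out in GF(2^8) modulo the AES polynomial $x^{8} + x^{4} + x^{3} + x + 1$, and addition is bytewise XOR.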

  1. (B)

    CUDA Implementation

The proposed project consists of a parallel implementation for

  • AES-128-bit encryption

  • AES-192-bit encryption

  • AES-256-bit encryption.

The proposed implementation is developed using Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model created by NVIDIA for general computing on NVIDIA GPUs only. CUDA is not just an API, programming language, or SDK; it is a full stack rooted in the GPU hardware, on top of which the drivers and libraries run.

The AES workload is divided into two parts: one that runs on the CPU and another that runs on the GPU. The AES calculations are performed on the GPU, and the results are stored back in system memory. The CPU handles reading binary data from images and videos and creating new binary streams after the GPU completes encryption or decryption.

Figure 2 presents the NVIDIA CUDA Compiler (NVCC) trajectory: the transformations from source to the executables that run on the compute device. The .cu source is split into host and device parts that execute on different hardware. This is called hybrid code, and it is what makes the parallel implementation possible. We make use of CUDA specifiers such as __global__, which marks code that runs on the device and is called from the host; __device__, which marks code that runs on the device and is called from the device; and __host__, which marks code that runs on the host and is called from the host, just like other library APIs or user-defined functions.

Fig. 2 NVCC trajectory (block flow diagram: the .cu source is split into host and device code; the embedded GPU assembly is combined with the host code by the host compiler, and the executable runs on the CUDA runtime and driver)
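
To illustrate how these specifiers combine into hybrid code, here is a minimal sketch; the kernel and function names (xor_kernel, launch) are our own illustrations, not the paper's code:

```cuda
// __device__: runs on the GPU, callable only from GPU code.
__device__ unsigned char xor_byte(unsigned char a, unsigned char b)
{
    return a ^ b;
}

// __global__: a kernel that runs on the GPU and is launched from the host.
__global__ void xor_kernel(unsigned char *data, unsigned char key, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = xor_byte(data[i], key);
}

// __host__ (the default): runs on the CPU, like any ordinary C function.
__host__ void launch(unsigned char *d_data, unsigned char key, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    xor_kernel<<<blocks, threads>>>(d_data, key, n);
}
```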

The key for encryption and decryption is stored in a text file and can be 128, 192, or 256 bits long. The binary data, the key, and the number of threads are passed to the program as command line arguments. The binary data, given as a relative path, can be a text file, a video, or an image to be encrypted. The number of threads is used when benchmarking GPUs to measure their potential. After execution starts, the data, stored in the form of blocks, and the key are copied from system memory (RAM) to the GPU video RAM (VRAM) using arrays. The operations are performed over the number of rounds determined by the key length. After the encryption and decryption operations complete in VRAM, the results are copied back to RAM and the time required for the computation is displayed.
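
A sketch of this RAM-to-VRAM round trip, assuming a kernel in which one thread encrypts one 16-byte block (the buffer names and kernel signature are our assumptions; the paper's actual code appears in Fig. 3):

```cuda
#include <cuda_runtime.h>

// Declared elsewhere: one thread encrypts one 16-byte block in place.
__global__ void aes_encrypt_kernel(unsigned char *blocks,
                                   const unsigned char *round_keys,
                                   int num_blocks);

void encrypt_on_gpu(unsigned char *h_data, size_t bytes,
                    const unsigned char *h_round_keys, size_t key_bytes,
                    int threads_per_block)
{
    unsigned char *d_data, *d_keys;
    int num_blocks = (int)(bytes / 16);   // AES processes 16-byte blocks

    // Allocate VRAM and copy the data and expanded key over from RAM.
    cudaMalloc((void **)&d_data, bytes);
    cudaMalloc((void **)&d_keys, key_bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_keys, h_round_keys, key_bytes, cudaMemcpyHostToDevice);

    // Launch enough thread blocks to cover every AES block.
    int grid = (num_blocks + threads_per_block - 1) / threads_per_block;
    aes_encrypt_kernel<<<grid, threads_per_block>>>(d_data, d_keys, num_blocks);

    // Copy the ciphertext back to system memory and release VRAM.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_keys);
}
```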

Figure 3 depicts a sample code snippet using CUDA specifiers. It includes the byte substitution process, which replaces the state array with the corresponding S-Box values, and the round key addition, which is a binary XOR operation. In this manner, all the AES encryption and decryption operations are implemented as C functions, extended with CUDA to implement parallelism.

Fig. 3 Sample code (byte substitution: the state array is replaced with the corresponding S-Box values, as used in the encryption and decryption operations)
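
In the spirit of Fig. 3 (we cannot reproduce the paper's exact snippet, so identifiers here are illustrative), the two steps it describes could be written as device functions like this:

```cuda
// Precomputed AES S-Box, stored in fast constant memory (256 entries).
__constant__ unsigned char d_sbox[256];

// SubBytes: replace every state byte with its S-Box substitute.
__device__ void sub_bytes(unsigned char state[16])
{
    for (int i = 0; i < 16; i++)
        state[i] = d_sbox[state[i]];
}

// AddRoundKey: XOR the 16-byte state with the 16-byte round key.
__device__ void add_round_key(unsigned char state[16],
                              const unsigned char *round_key)
{
    for (int i = 0; i < 16; i++)
        state[i] ^= round_key[i];
}
```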

4 Result Analysis

  1. (A)

    Evaluation Environment

For the purpose of evaluating the performance of the proposed algorithm, we used the hardware and software components specified in Table 3.

Table 3 Hardware and software technical specifications

Figure 4 demonstrates the use of our implementation. All of the samples are bitmap images with the .bmp file extension. After the program finishes executing, two bitmap images, EncryptedImage.bmp and DecryptedImage.bmp, are generated in the root directory of the application. The figure is a screenshot combining both files: the left part shows the encrypted file, which the default image application is unable to open, and the right part shows the decrypted image.

Fig. 4 Image encryption and decryption (screenshot: the encrypted file on the left cannot be rendered by the default image viewer and shows only a placeholder icon, while the decrypted image on the right displays the original photo clearly)

Figure 5 shows the CUDA information for the computer system on which the program runs. This uses CUDA runtime APIs, including cudaGetDeviceCount() and cudaGetDeviceProperties(). The summary lists the following parameters (a minimal query sketch follows the list)—

Fig. 5 CUDA device properties (Developer Command Prompt screenshot listing CUDA driver and runtime information plus GPU device general, memory, multiprocessor, and thread information)

  1. (1)

    Total Number of CUDA Supporting GPU Device/Devices on the System

  2. (2)

    CUDA Driver and Runtime Information

    1. a.

      CUDA Driver Version

    2. b.

      CUDA Runtime Version

  3. (3)

    GPU Device General Information

    1. a.

      GPU Device Number

    2. b.

      GPU Device Name

    3. c.

      GPU Device Compute Capability

    4. d.

      GPU Device Clock Rate

    5. e.

      GPU Device Type—Integrated or Discrete

  4. (4)

    GPU Device Memory Information

    1. a.

      GPU Device Total Memory

    2. b.

      GPU Device Constant Memory

    3. c.

      GPU Device Shared Memory per SMProcessor

  5. (5)

    GPU Device Multiprocessor Information

    1. a.

      GPU Device Number of SMProcessors

    2. b.

      GPU Device Number of Registers per SMProcessor

  6. (6)

    GPU Device Thread Information

    1. a.

      GPU Device Maximum Number of Threads Per SMProcessor

    2. b.

      GPU Device Maximum Number of Threads Per Block

    3. c.

      GPU Device Threads in Warp

    4. d.

      GPU Device Maximum Thread Dimensions

    5. e.

      GPU Device Maximum Grid Dimensions

  7. (7)

    GPU Device Driver Information

    1. a.

      Error Correcting Code (ECC) Support—Enabled/Disabled

    2. b.

GPU Device CUDA Driver Mode—Tesla Compute Cluster (TCC)/Windows Display Driver Model (WDDM).
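
The query code behind such a summary is a straightforward use of the CUDA runtime API. A minimal sketch, printing only a few of the fields listed above, could look like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);  // total CUDA-capable GPUs on the system
    printf("CUDA devices: %d\n", count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  Clock rate         : %d kHz\n", prop.clockRate);
        printf("  Total global memory: %zu bytes\n", prop.totalGlobalMem);
        printf("  Multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("  Max threads/block  : %d\n", prop.maxThreadsPerBlock);
        printf("  Warp size          : %d\n", prop.warpSize);
    }
    return 0;
}
```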

  1. (B)

    Evaluation Result

To compare against existing implementations, the proposed system includes two different performance benchmarks: the first compares the performance obtained on different CPUs and GPUs, showing the need for parallel computing, and the second compares the compute capability of different GPUs.

The time taken to perform operations on the binary data is measured using the "helper_timer" library offered with the NVIDIA CUDA samples. This is achieved using the following set of APIs (a usage sketch follows the list)—

  1. a.

    sdkCreateTimer()—To create a timer pointer of type StopWatchInterface

  2. b.

    sdkStartTimer()—To start the timer

  3. c.

    sdkStopTimer()—To stop the timer

  4. d.

    sdkGetTimerValue()—To get the timer value after the timer is stopped

  5. e.

    sdkDeleteTimer()—To free the timer pointer.
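
A minimal usage sketch of these timer APIs (the kernel launch stands in for the AES work; note that sdkGetTimerValue() reports milliseconds, which we convert to the seconds the paper displays):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <helper_timer.h>  // from the NVIDIA CUDA samples' common headers

void time_encryption(void)
{
    StopWatchInterface *timer = NULL;
    sdkCreateTimer(&timer);          // create the timer object
    sdkStartTimer(&timer);           // start timing

    // ... launch the AES kernel here and wait for it to finish ...
    cudaDeviceSynchronize();

    sdkStopTimer(&timer);            // stop timing
    float ms = sdkGetTimerValue(&timer);
    printf("Time to encrypt: %f seconds\n", ms / 1000.0f);

    sdkDeleteTimer(&timer);          // free the timer object
}
```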

Figure 6 depicts the program results obtained using 2048 threads. The time required to encrypt and decrypt the images is calculated and displayed in seconds. The number of threads is modifiable and is passed to the program as a command line argument.

Fig. 6 Program execution using 2048 threads (screenshot: encryption and decryption of an image on an NVIDIA GeForce GTX 1650 take 0.002000 and 0.005600 s, respectively)

Table 4 shows the different time values required to perform the encryption and decryption on various CPUs and GPUs.

Table 4 CPU and GPU performance comparison

  1. a.

    Column 1—Represents the device on which the program is tested.

  2. b.

    Column 2—Specifies the sample size. Samples are the bitmap images used for testing.

  3. c.

    Column 3—Time required to encrypt the data, represented in seconds.

  4. d.

    Column 4—Time required to decrypt the data, represented in seconds.

Figures 7 and 8 portray the time required for encryption and decryption on the CPU and GPU for different sample sizes. According to the results in Table 4, as the size of the input data increases, the GPU takes less time than the CPU to perform the AES operations.

Fig. 7 CPU performance comparison (line graph of encryption and decryption times versus sample size; the decryption curve peaks higher than the encryption curve at approximately 100 MB)

Fig. 8 GPU performance comparison (line graph of encryption and decryption times for sample sizes from 800 KB to 100 MB on the NVIDIA GeForce GTX 1650 and RTX 3060; the decryption curve falls more sharply)

Table 5 shows the time required to perform encryption and decryption on various GPUs with a variable number of threads. The first column gives the name of the GPU, and the second column the number of threads tested on that GPU. The third and fourth columns state the time required for encryption and decryption, respectively, measured in seconds. These values are dynamic and can change across runs, but overall they give an idea of the performance capabilities of different NVIDIA GPUs.

Table 5 GPU benchmarks

Figure 9 visualizes the measured times and the speedup factor. From the results, we can clearly see that the more powerful the GPU, the less time required to complete the task. The number of threads is also a crucial factor when determining the best GPU. In our testing, the NVIDIA RTX 3060 was the best performer.

Fig. 9 GPU benchmarks (grouped bar chart: mean decryption time exceeds encryption time on the NVIDIA GeForce MX 450, GTX 1660 Super, and RTX 3060, but not on the GTX 1650)

From the results, we can say that using CUDA saves considerable time and increases throughput. This can be useful for hash algorithms as well, which could then be applied in blockchain technology to compute block hashes much faster. CUDA could likewise reduce the energy and power required to maintain a blockchain network. In short, it saves time and resources, and it reduces computational cost to a great extent.

5 Conclusion

We proposed a method to parallelize the encryption and decryption processes in order to overcome the high resource consumption of the traditional CPU implementation of AES. We designed and implemented the AES encryption and decryption algorithm for 128-bit, 192-bit, and 256-bit key sizes to run on GPUs using CUDA, thereby reducing power consumption and increasing efficiency. This method provided a significant speedup over the CPU. It may change the way traditional resources are used, as such implementations can encrypt binary data in all forms, including images and videos, and could extend to full-disk encryption like Microsoft BitLocker. Considerable fine-tuning would be required to make such implementations a standard for other security techniques.

6 Future Scope

Currently, the proposed system presents a parallel implementation of the AES algorithm that can only run on NVIDIA GPUs, as the presented research uses CUDA. This limits the portability of testing and deployment on infrastructure using AMD or Intel GPUs, whether integrated or discrete. To overcome this limitation, we would need to develop a codebase using OpenCL, which would allow us to cover every GPU and CPU device. Various parameters remain to be considered to optimize the algorithm so that it makes proper and efficient use of the GPU, saving energy while producing similar results.