1 Introduction

In the post-Moore era, deep learning has achieved major breakthroughs in application fields such as computer vision, NLP, and autonomous driving [1–3]. As the main component of deep learning, deep neural networks (DNNs) can address a wide range of AI challenges and even outperform humans in some specific domains [4]. Their strong feature extraction and learning abilities stem from robust network structures and a high computational cost: the number of floating-point operations can reach the order of ten billion or more [5].

To reduce a DNN's computational complexity and improve its training or inference speed, several algorithm-level optimization techniques have been proposed, including sparse neural networks obtained through dropout compression [6] and layer pruning [7–10]. Although these approaches have achieved great success on some specific networks, they still cannot meet the high computational requirements of newer and more complex network structures, such as the transformer. To address this problem more fundamentally, the DNN accelerator was proposed. By quantizing long floating-point weights to a short fixed-point format and mapping them onto highly parallel PE arrays, a DNN accelerator can improve the throughput of data-parallel processing.

However, DNN accelerators have crucial security vulnerabilities [11–13]. Attackers can capture the model structure or weight parameters by monitoring sensitive data or snooping the accelerator's interrupt patterns [12, 13]. They can also conduct bit-flip attacks on the neural network weight files, which can significantly decrease a DNN model's inference accuracy [14].

Tackling these challenges requires hardware-software co-design, unifying theoretical privacy algorithms with a secure hardware architecture. On the software side, researchers have designed robust network structures or devised more sophisticated training algorithms, such as adversarial training and GAN-aided training [15]. In parallel, hardware designers have built secure DNN accelerators by encrypting sensitive data or using trusted execution environment (TEE) technology [16, 17]. For example, Wang et al. proposed NPUFort, which inserts a security unit into an existing FPGA-based heterogeneous computing accelerator using the AES block cipher [16]. Hashemi et al. presented DarKnight, which uses a customized data encoding strategy based on matrix masking to obfuscate data in a TEE; DarKnight can guarantee data privacy and integrity on conventional GPUs [17].

However, these protection techniques have inherent performance limitations. First, the complex iterative modes of the AES block cipher (ECB, CBC, and CTR) introduce heavy computing latency and carry a risk of data conversion errors [18]. Second, to convert a long plaintext message into fixed-size blocks, AES needs padding data, which incurs extra memory overhead. Third, to make accurate decisions, a DNN requires extensive memory and computing resources, which do not fit well in a TEE enclave with restricted memory space [19].

To address the above challenges, we propose Nacc-Guard, a lightweight security-enhanced DNN accelerator architecture. In Nacc-Guard, an improved stream cipher, Trivium, is adopted in the DNN inference accelerator. With its lower hardware consumption, Trivium is better suited to encrypting resource-limited systems. Furthermore, the interrupt signals from the accelerator to the host CPU are confused: conventional high/low level interrupt signals are recoded as positive-edge latencies using a 1B/4B algorithm, which thwarts attacks based on monitoring interrupt signal patterns. Third, a hash-based message authentication code (HMAC) is used to ensure the integrity of the uploaded DNN weight file and achieve authentication.

The main contributions are summarized as follows:

  • We propose a novel secure DNN accelerator architecture named Nacc-Guard which can defend DNNs against memory Trojan and neural network bit-flip attacks.

  • Unlike conventional AES, an improved linear randomization algorithm, Trivium, is adopted in DNN accelerators for the first time. Experiments show that Trivium is better suited to encrypting hardware resource-limited DNN accelerators.

  • Interrupt signals from the accelerator to the host CPU are recoded from high/low levels to positive-edge latencies, which helps prevent attacks that monitor interrupt signal patterns. Furthermore, a hash-based message authentication code is used to ensure the integrity of the uploaded DNN weight file and achieve authentication.

  • The Nacc-Guard prototype is implemented at the RTL level in the NVDLA and SIMD DNN accelerators, coupled with RISC-V Rocket and ARM Cortex-A9 host processors. Runtime evaluation shows that the architecture successfully ensures secure DNN inference. Experiments on VGG, ResNet50, GoogLeNet, and YOLOv4-tiny show that Nacc-Guard brings a 3\(\times \) hardware overhead reduction and a 3.63\(\times \) performance improvement over the AES baseline with negligible extra power consumption.

The rest of this paper is organized as follows. Section 2 discusses related work and motivation. Section 3 introduces the design. In Sect. 4, the experimental evaluation and result analysis are presented. In Sect. 5, we conclude the paper.

2 Related work and motivation

2.1 DNN accelerator

The DNN accelerator was first invented to mitigate the bottleneck between a neural network's heavy computational requirements and its training and inference speed [20]. DNN accelerators (such as GPUs, FPGAs, and CGRAs) have denser parallel computing micro-architectures and can offer higher data throughput bandwidth without significant accuracy loss. They can deliver orders-of-magnitude improvements in computational density with higher power efficiency. Several typical DNN accelerators are listed in Table 1.

Table 1 Several typical DNN accelerators

In terms of the DNN stage, there are training and inference accelerators. For example, the Graphics Processing Unit (GPU) is a DNN accelerator used for neural network training; it is designed to provide a high-performance computing platform with large data throughput. In parallel, the DNN inference accelerator is a young but quickly developing technology. To keep pace with the fast-changing and flexible nature of deep neural networks, an inference accelerator's micro-architecture should be both scalable and reconfigurable, which makes FPGA the best choice. Furthermore, with the development of heterogeneous computing technology coupling a CPU with an FPGA in one SoC, FPGA-based DNN inference accelerators are becoming increasingly popular. An FPGA-based DNN accelerator is shown in Fig. 1.

Fig. 1 DNN accelerator architecture

A well-trained DNN model can be mapped onto an inference accelerator to realize real-time prediction. A pre-trained network weight file usually contains long floating-point values, whereas FPGAs are better at fixed-point and bit-shift calculations. To deploy a DNN model on an FPGA-based inference accelerator, the weight file should first be converted to a fixed-point type (such as 8-bit or 16-bit) through quantization strategies [28, 29]. Short fixed-point representations of weights and feature maps can significantly reduce the computational cost with negligible accuracy loss [30–32].
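For illustration, the following minimal sketch shows symmetric linear quantization of floating-point weights to 8-bit fixed point using NumPy; the max-based scale and rounding rule are simplifications, not the exact strategies of [28, 29]:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric linear quantization of float weights to int8."""
    # Scale chosen so the largest magnitude maps to 127 (guard against all-zero tensors).
    scale = max(np.max(np.abs(weights)) / 127.0, 1e-8)
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floating-point weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)   # toy weight tensor
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```

The int8 tensor and a single per-tensor scale are all that need to be stored, which is what allows the PE arrays to work on short fixed-point operands.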

2.2 Trivium and message authentication code

Fig. 2 Register framework of the stream cipher Trivium

In general, cryptography comprises symmetric-key and public-key cryptography. Public-key cryptography is often used for key distribution, while symmetric cryptography is used to encrypt sensitive data. Block ciphers and stream ciphers are the main components of symmetric cryptography [18, 33].

A block cipher encrypts a fixed-size block of data (such as 128 bits) at a time; 128-bit and 256-bit blocks are the most widely used [18]. For example, a 128-bit plaintext block is encrypted into a 128-bit ciphertext block. When the plaintext is shorter than the block size, a bit-padding scheme is applied. The Advanced Encryption Standard (AES, Rijndael) is currently the most commonly used block cipher. Specifically, AES treats the 128-bit plaintext block as 16 bytes arranged in a 4×4 matrix for subsequent processing. AES uses 10 transformation rounds for 128-bit keys and 14 rounds for 256-bit keys, and each round involves four transformation steps: SubBytes, ShiftRows, MixColumns, and AddRoundKey.
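As a reference point for the padding overhead discussed above, the following sketch encrypts a short message with AES-128 in ECB mode; it assumes the PyCryptodome package, and any equivalent AES implementation would behave similarly:

```python
from Crypto.Cipher import AES              # assumes the PyCryptodome package
from Crypto.Util.Padding import pad, unpad
import os

key = os.urandom(16)                        # 128-bit key
cipher = AES.new(key, AES.MODE_ECB)

plaintext = b"weight chunk, not 16n bytes"  # not a multiple of the 16-byte block
padded = pad(plaintext, AES.block_size)     # padding inflates the data to a full block
ciphertext = cipher.encrypt(padded)

recovered = unpad(AES.new(key, AES.MODE_ECB).decrypt(ciphertext), AES.block_size)
assert recovered == plaintext
print(len(plaintext), "->", len(ciphertext), "bytes after padding")
```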

In this paper, an improved linear randomization Trivium algorithm is deployed for buffer data encryption and decryption. Trivium consists of three interconnected nonlinear feedback shift registers (NLFSRs) of length 93, 84, and 111 bits [34]. First, Trivium requires a pre-shared 80-bit key and an 80-bit initialization vector (IV), which are loaded into the state registers. Then, 1152 clocking steps are performed before keystream generation begins. Finally, the plaintext is encrypted by a bitwise exclusive-or (XOR) with the generated keystream. The register structure and keystream generation process are illustrated in Fig. 2. Trivium was originally designed to meet high data-throughput requirements in hardware resource-limited systems; it can guarantee data confidentiality without an undue increase in hardware resource overhead. For this reason, Trivium is better suited to DNN inference accelerators than conventional AES.
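For reference, a minimal Python sketch of textbook Trivium keystream generation is given below; it follows the standard specification [34] rather than the improved variant used in Nacc-Guard, and the byte-to-bit ordering is a simplification:

```python
def trivium_keystream(key: bytes, iv: bytes, nbits: int):
    """Textbook Trivium: 288-bit state, 80-bit key/IV, 1152 warm-up clocks."""
    assert len(key) == 10 and len(iv) == 10               # 80 bits each
    to_bits = lambda b: [(byte >> i) & 1 for byte in b for i in range(8)]
    # Load the three shift registers (93 + 84 + 111 = 288 bits).
    s = to_bits(key) + [0] * 13                           # register A: 93 bits
    s += to_bits(iv) + [0] * 4                            # register B: 84 bits
    s += [0] * 108 + [1, 1, 1]                            # register C: 111 bits

    out = []
    for clk in range(1152 + nbits):                       # warm up, then emit bits
        t1 = s[65] ^ s[92]
        t2 = s[161] ^ s[176]
        t3 = s[242] ^ s[287]
        if clk >= 1152:
            out.append(t1 ^ t2 ^ t3)                      # keystream bit
        t1 ^= (s[90] & s[91]) ^ s[170]
        t2 ^= (s[174] & s[175]) ^ s[263]
        t3 ^= (s[285] & s[286]) ^ s[68]
        s = [t3] + s[:92] + [t1] + s[93:176] + [t2] + s[177:287]
    return out

ks = trivium_keystream(b"\x00" * 10, b"\x00" * 10, 64)    # 64 keystream bits
```

Because the per-clock update is only a handful of XOR and AND gates on a shift register, a hardware realization is far smaller than an AES round datapath, which is the property Nacc-Guard exploits.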

A Message Authentication Code (MAC) is a tag attached to the original file to ensure the integrity and authenticity of the transmitted data. To protect the integrity of DNN weight files and authenticate DNN models, a hash-based MAC (HMAC) is used in Nacc-Guard. An HMAC is generated by applying a hash function to the original message in combination with a key. A single-bit change in the data produces a different HMAC, so HMAC can guarantee that the DNN weight files are legitimate and do not contain harmful code.
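The single-bit sensitivity can be seen with the Python standard library; the key and the byte string standing in for a weight file below are hypothetical placeholders:

```python
import hmac, hashlib

key2 = b"pre-shared-secret-key"                    # hypothetical key-2
weights = bytearray(b"\x3f\x80\x00\x00" * 256)     # stands in for a quantized weight file

tag = hmac.new(key2, weights, hashlib.sha256).hexdigest()

weights[0] ^= 0x01                                 # flip a single bit of the weights
tampered = hmac.new(key2, weights, hashlib.sha256).hexdigest()

print(tag == tampered)                             # False: one bit flip changes the tag
```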

2.3 Vulnerability and motivation

DNNs have security vulnerabilities [35, 36]. Attackers can intentionally implant backdoors or malicious programs in DNN files, and can steal, modify, or even destroy the whole neural network system. More specifically, there are algorithm-based and hardware-based attacks. Algorithm-based attacks include adversarial example attacks [37], model inversion attacks [38], model extraction attacks [39], and data poisoning attacks [40]. In parallel, hardware-based attacks include memory Trojan attacks [41], side-channel information leakage attacks [42], and neural network bit-flip attacks [14, 43].

In this paper, we focus on memory Trojan attacks and bit-flip attacks against DNN accelerators. For example: (1) a hacker can flip key bit positions in the weight file that are critical to the DNN model's inference accuracy, and malicious programs can be uploaded to the accelerator together with the files without authentication; (2) an eavesdropper can capture the neural network structure by snooping interrupt signal patterns [12]; (3) Trojan attackers can also make an accelerator malfunction by hijacking sensitive data stored in the massive on-chip buffers [40].

To address this challenge, previous approaches applied AES encryption in DNN accelerators. For example, Wang et al. presented a secure accelerator architecture named NPUFort [16], which adds AES-based data en/decryption and instruction en/decryption modules to DNN accelerators. To some extent, this offers an alternative way to guarantee accelerator security. Unfortunately, the method ensures only data confidentiality rather than integrity, and AES can bring dramatically higher hardware overhead and data processing latency. Another defense is to build a TEE in the DNN accelerator [17]. However, to achieve accurate predictions, a DNN requires extensive memory and computing resources, which do not fit well in TEE enclaves with restricted memory space [19].

3 Nacc-Guard

3.1 Threat model

DNN models reuse accelerator hardware resources layer by layer; after a layer finishes, the accelerator raises an interrupt to inform the host CPU to dispatch the next layer. Each neural network model is therefore accompanied by a characteristic interrupt signal pattern, and by monitoring this pattern an attacker can capture the DNN model structure.

Furthermore, due to their special memory hierarchy and highly parallel micro-architecture, DNN accelerators contain massive on-chip buffers, which makes them more vulnerable to buffer Trojan attacks. By monitoring these plaintext buffers, an attacker can launch hardware Trojan attacks.

3.2 Overview

Fig. 3 Overview of the Nacc-Guard micro-architecture

Nacc-Guard is realized on FPGA-based heterogeneous computing SoC platforms. Figure 3 depicts the anatomy of the whole architecture and its main functional modules. An on-chip host processor is deployed coupled with the DNN accelerator. Specifically, the host processor connects to the accelerator IP core as an I/O device through the on-chip bus (such as AXI, or an AXI-to-APB bridge). The accelerator IP core registers are mapped into the Linux process virtual address space via the mmap function, and the accelerator communicates with DRAM through the DMA controller. The host processor orchestrates the neural network inference process, including uploading remote DNN model files from DRAM or the cloud and configuring control registers. The PE arrays perform feature-map matrix multiplication and addition. By using local on-chip memory access patterns, this tightly coupled micro-architecture can significantly reduce data traffic latency. All neural network layers reuse the accelerator's systolic computing arrays, and the scheduler executes the neural network operators layer by layer. Ping-pong buffers alleviate the bottleneck between the high parallel computing speed and the lower intermediate data access speed.
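As a rough sketch of what this register mapping can look like from user space: the /dev/mem approach, the base address 0x43C00000, and the register offsets below are illustrative assumptions, not Nacc-Guard's actual driver interface:

```python
import mmap, os, struct

ACCEL_BASE = 0x43C00000          # hypothetical physical base address of the IP core
PAGE_SIZE = mmap.PAGESIZE

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
regs = mmap.mmap(fd, PAGE_SIZE, mmap.MAP_SHARED,
                 mmap.PROT_READ | mmap.PROT_WRITE, offset=ACCEL_BASE)

# Write a 32-bit control register and read back a status register.
regs[0x00:0x04] = struct.pack("<I", 0x1)         # hypothetical "start" bit
status, = struct.unpack("<I", regs[0x04:0x08])   # hypothetical status register offset

regs.close()
os.close(fd)
```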

To ensure security, two keys are used: one for accelerator data encryption and one for DNN weight file verification. The accelerator uses key-1 for on-chip data encryption, and the host processor uses key-2 to generate the message authentication code. Specifically, the accelerator IP core first stores the 80-bit secret key-1 in a register for the improved Trivium engine. Then key-2 is distributed to the on-chip processor through public-key cryptography. Meanwhile, the off-chip or remote users who upload DNN model files to Nacc-Guard share the same key-2 with the host processor for generating the message authentication code.

3.3 Encryption engine

A DNN accelerator is a domain-specific architecture (DSA): its PEs have locally coupled memory structures and decentralized on-chip buffers. Hackers can attack a DNN accelerator by using memory Trojans to monitor the plaintext intermediate data or inference results held in these massive on-chip buffers [41]. This characteristic creates crucial security vulnerabilities, so the sensitive data in on-chip buffers should be encrypted. However, accelerators often operate under stringent on-chip hardware resource limitations, and this problem is even more severe in real-time systems and mobile devices. This makes the stream cipher Trivium more suitable for accelerator encryption than the conventional block cipher AES. In Nacc-Guard, to guarantee the confidentiality of data offloaded from the accelerator to the on-chip SRAM or critical buffers, the improved linear randomization encryption algorithm Trivium is applied on the AXI buses. This design supports parallel computing in DNN accelerators and can meet the high data-throughput requirement. We deploy the Trivium engine in the AXI bus controller to selectively encrypt data before it is transmitted on the write data channel and to decrypt data arriving on the read data channel. The Trivium implementation details are shown in Fig. 4.

Fig. 4 Schematic diagram of the linear randomization encryption Trivium algorithm in the AXI bus controller

According to the AXI protocol, before the master port initiates a data read or write request on the data channel, the corresponding address is sent on the address read or write channel. First, the address value is captured and a modulus-8 operation is applied (ADDR mod 8). If the result is 0, the Trivium engine encrypts the write data or decrypts the read data on the data channels. This lightweight encryption scheme does not introduce excessive hardware overhead or performance loss; a behavioral sketch of the gating is given after Eqs. (1) to (4) below.

$$\begin{aligned} plaintext= & {} m=m_{1}m_{2}m_{3}\ldots , \end{aligned}$$
(1)
$$\begin{aligned} keystream= & {} s=s_{1}s_{2}s_{3}\ldots , \end{aligned}$$
(2)
$$\begin{aligned} cipher= & {} c=m \oplus s, \end{aligned}$$
(3)
$$\begin{aligned} plaintext= & {} c \oplus s=m \oplus s \oplus s. \end{aligned}$$
(4)

The data encryption and decryption process is captured by Eqs. (1) to (4). In (1) and (2), m and s are the plaintext message and the keystream, respectively. As shown in (3) and (4), the ciphertext is obtained by XORing the plaintext with the keystream, and XORing the ciphertext with the keystream again recovers the plaintext. As a result, if a malicious memory Trojan snoops data from the SRAM or an on-chip buffer, it obtains only randomized ciphertext and cannot mount an attack based on the expected plaintext. This operation not only guarantees on-chip data confidentiality but also helps prevent DNN model inversion attacks. The Nacc-Guard accelerator stores the pre-distributed key-1 in a secure register, while the IV (the 80-bit initialization vector of the Trivium nonlinear feedback shift registers) can be generated by a pseudorandom number generator.
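A behavioral sketch of this address-gated XOR scheme is shown below. It reuses the trivium_keystream sketch from Sect. 2.2; the mod-8 gating follows the description above, while keystream synchronization between write and read paths is reduced to a shared bit iterator, which is a simplification of the real AXI controller logic:

```python
def xor_with_keystream(data: bytes, ks_bits) -> bytes:
    """XOR a data beat with the next len(data)*8 bits of the keystream."""
    out = bytearray(data)
    for i in range(len(out)):
        ks_byte = 0
        for j in range(8):
            ks_byte |= next(ks_bits) << j
        out[i] ^= ks_byte
    return bytes(out)

def axi_transfer(addr: int, data: bytes, ks_bits) -> bytes:
    """Apply the Trivium XOR only when ADDR mod 8 == 0. The same routine serves
    both paths: it encrypts on the write data channel and decrypts on the read
    data channel (Eqs. (3) and (4))."""
    if addr % 8 == 0:
        return xor_with_keystream(data, ks_bits)
    return data

# Toy usage: encrypt a write beat, then decrypt it with an identical keystream.
key, iv = b"\x01" * 10, b"\x02" * 10
enc_ks = iter(trivium_keystream(key, iv, 64))
dec_ks = iter(trivium_keystream(key, iv, 64))
beat = b"\x11\x22\x33\x44\x55\x66\x77\x88"
assert axi_transfer(0x40000000, axi_transfer(0x40000000, beat, enc_ks), dec_ks) == beat
```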

3.4 Interrupt signal confused coding

Fig. 5 Interrupt signal coding using the mB/nB algorithm

DNN accelerators have a specific working mode: the CPU scheduler executes the neural network operators layer by layer, and DNN models reuse the accelerator's PE resources layer by layer. After computing a layer, the accelerator raises an interrupt to inform the host CPU to dispatch the next layer. By monitoring the interrupt signal pattern through side channels, attackers can infer a DNN model's layer count and structure. To solve this problem, Nacc-Guard recodes the interrupt signals from the accelerator to the host CPU: conventional high/low level interrupt signals are recoded as positive/negative edge latencies using an mB/nB coding algorithm (1B/4B). As shown in Fig. 5, a specific clock-cycle delay represents a real interrupt. This confusion prevents attacks that snoop on the interrupt signal patterns.
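A toy behavioral model of this recoding is sketched below. The 4-bit codewords and the 3-cycle delay before the pulse are assumptions inferred from the description above and from the per-interrupt overhead reported in Sect. 4.3; the actual RTL coding may differ:

```python
# Hypothetical 1B/4B mapping: a real interrupt (1) becomes a pulse that appears
# only after a fixed number of idle cycles, so the raw level on the wire no
# longer exposes the per-layer interrupt pattern directly.
CODEWORDS = {0: [0, 0, 0, 0],     # no interrupt: four idle cycles
             1: [0, 0, 0, 1]}     # interrupt: positive edge after a 3-cycle delay

def encode_interrupts(levels):
    """Accelerator side: expand each 1-bit interrupt level into its 4-cycle codeword."""
    line = []
    for bit in levels:
        line.extend(CODEWORDS[bit])
    return line

def decode_interrupts(line):
    """Host side: recover one interrupt bit per 4-cycle group."""
    return [int(line[i:i + 4] == CODEWORDS[1]) for i in range(0, len(line), 4)]

layer_done = [1, 0, 0, 1, 1]                  # example per-cycle interrupt levels
wire = encode_interrupts(layer_done)
assert decode_interrupts(wire) == layer_done
```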

3.5 Message authentication code

Fig. 6 Architecture of RISC-V Rocket and the DNN accelerator; a RoCC coprocessor is used for accelerating HMAC generation

Fig. 7 Verification procedure of HMAC in Nacc-Guard

To achieve security isolation between the DNN accelerator platform and user DNN model files, we build an authentication mechanism using the SHA-256 algorithm to verify the integrity of uploaded DNN files and perform authentication. HMAC-SHA256 is a hash-based message authentication code built on SHA-256. Compared with a traditional message digest, an HMAC not only guarantees data integrity but also ensures that the DNN model files come from legitimate users.

In Nacc-Guard, a message authentication code (HMAC) is generated over the DNN weight file with the SHA-256 algorithm, as shown in (5) and (6), where H denotes the SHA-256 function, m is the binary data of the DNN weight file, K is the shared key padded to the hash block size, and ipad and opad are the standard inner and outer padding constants:

$$\begin{aligned} h_{inner}= & {} H\big ((K \oplus ipad) \,\Vert \, m\big ), \end{aligned}$$
(5)
$$\begin{aligned} HMAC(K, m)= & {} H\big ((K \oplus opad) \,\Vert \, h_{inner}\big ). \end{aligned}$$
(6)

The HMAC mechanism establishes a secure channel between the accelerator SoC and off-chip users. Specifically, before the SoC host processor fetches DNN weight files from an off-chip user, the user first calculates a message authentication code HMAC-1 from the pre-distributed key-2 and the weight file data. HMAC-1 is sent to the SoC host processor together with the DNN model weight file. After transmission, the host processor calculates HMAC-2 with the shared key-2 and compares it against HMAC-1. If a hacker has inserted poisoned data or malicious programs into the DNN model file, the verification fails before DNN model inference; a warning is issued and handled by the host processor, the malicious or modified DNN file is rejected, and the accelerator denies service. Furthermore, this verification scheme also detects weight data loss or corruption during transmission. The whole process is illustrated in Figs. 6 and 7; a minimal sketch of the flow follows.
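The sketch below models the two parties with the Python standard library; the key and the byte string standing in for the weight file are hypothetical placeholders, and the constant-time comparison is our choice rather than a detail stated in the design:

```python
import hmac, hashlib

def user_sign(weight_file: bytes, key2: bytes) -> bytes:
    """Off-chip user: compute HMAC-1 over the weight file with shared key-2."""
    return hmac.new(key2, weight_file, hashlib.sha256).digest()

def host_verify(weight_file: bytes, hmac1: bytes, key2: bytes) -> bool:
    """Host processor: recompute HMAC-2 and compare it with HMAC-1 in constant time."""
    hmac2 = hmac.new(key2, weight_file, hashlib.sha256).digest()
    return hmac.compare_digest(hmac1, hmac2)

key2 = b"pre-distributed-key-2"              # hypothetical shared key
weights = b"\x3f\x80\x00\x00" * 256          # stands in for the DNN weight file

tag = user_sign(weights, key2)               # HMAC-1, sent along with the file
if not host_verify(weights, tag, key2):
    raise RuntimeError("HMAC mismatch: reject the DNN model and deny service")
```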

4 Evaluation

4.1 Experiment setup

Table 2 The accelerator platform configuration information

Nacc-Guard is developed at the RTL level in the SIMD [44] and NVDLA [45] open-source DNN accelerators. The SIMD accelerator is coupled with an ARM Cortex-A9 host processor, and NVDLA is coupled with an ARM Cortex-A9 and a RISC-V Rocket host processor. The EDA tool Vivado (version 2019.1) is used to synthesize and implement this lightweight accelerator prototype. The functional correctness of Nacc-Guard has been verified at runtime on a Zedboard (SIMD) and a ZCU102 (NVDLA) with the YOLOv2 and ResNet-18 DNN models. Performance evaluation is carried out with ModelSim SE and the DNN accelerator simulator MAESTRO [46]. Unless otherwise stated, the DNN accelerators are evaluated at a 100 MHz clock frequency. Detailed information on the two DNN accelerators is given in Table 2.

4.2 Hardware overhead

The ARM + SIMD and RISC-V Rocket + NVDLA DNN accelerator SoC platforms are synthesized and implemented on the Zedboard and VC709 boards. The detailed on-chip resource consumption of BRAMs (36 Kb block RAMs), FFs (CLB flip-flops), LUTs (look-up tables), and DSPs (18×25 MACC DSP slices) is measured before and after Nacc-Guard is deployed on the two DNN accelerator platforms. The results, shown in Figs. 8 and 9, make clear that the Nacc-Guard architecture introduces negligible on-chip resource overhead in the SIMD and NVDLA accelerators.

Fig. 8 Hardware resource consumption before and after Nacc-Guard is implemented in SIMD on the Zedboard (percentage of total SoC resources)

Fig. 9 Hardware resource consumption before and after Nacc-Guard is implemented in NVDLA on the VC709 (percentage of total chip resources)

Furthermore, we also compared the hardware resource consumption of the AES-based encryption method (AES-128, ECB) on the ARM + SIMD and RISC-V Rocket + NVDLA heterogeneous computing SoCs. The experimental results show that Nacc-Guard's stream-cipher-based encryption achieves a 3\(\times \) hardware overhead reduction compared with AES-based encryption on both DNN accelerators.

4.3 Performance evaluation

Fig. 10 Performance of Nacc-Guard and the AES-based accelerator on different DNN models

A good security mechanism should impose only a small performance overhead on the original DNN accelerator platform. To this end, we evaluate Nacc-Guard's performance overhead with the VGG, ResNet50, and GoogLeNet deep neural network models.

First, runtime experiments on the Zedboard (SIMD) and ZCU102 (NVDLA) show that Nacc-Guard successfully ensures private DNN inference. Then, cycle-accurate simulations are performed in ModelSim SE and the DNN accelerator simulator MAESTRO [46]. These experiments show that the performance overhead of Nacc-Guard mainly comes from buffer data en/decryption and interrupt signal recoding. Interrupt signal coding with the mB/nB algorithm (1B/4B) adds 3 cycles of latency per interrupt signal, i.e., a 3-clock-cycle delay per DNN layer. The performance results for the VGG, ResNet50, and GoogLeNet models are presented in Fig. 10. For comparison, the AES-128 (ECB) en/decryption module is also embedded in the original DNN accelerator. The results show that Nacc-Guard performs significantly better than the conventional AES en/decryption scheme in DNN accelerators, achieving a 3.63\(\times \) performance improvement on average.

4.4 Power consumption

Finally, to evaluate the energy consumption of Nacc-Guard, the SoC power is estimated from the implemented netlists in Vivado 2019.1 after synthesis and implementation (place and route). Because the ARM core in Zynq has a stable, hardened micro-architecture, we chose ARM + NVDLA on the Xilinx ZCU102 as the power evaluation platform. Figure 11 shows how the SoC dynamic power consumption grows with the accelerator working frequency. We find that the Nacc-Guard encryption engine introduces less energy overhead than AES-based secure DNN accelerators. The figure also shows that Nacc-Guard's power scales more gracefully with frequency: its power increases more slowly with working frequency than the AES-based design.

Fig. 11 Dynamic power consumption of the original accelerator, Nacc-Guard, and the AES-based accelerator at different working frequencies

5 Conclusion

In this paper, we proposed a lightweight secure DNN accelerator architecture named Nacc-Guard, which defends against memory Trojan attacks and neural network bit-flip attacks. Experimental results show that Nacc-Guard delivers a 3.63\(\times \) performance improvement with 3\(\times \) less hardware resource cost than AES-based en/decryption, and that it has low and robust power consumption. Cryptanalysis shows that the system is secure as long as the key remains secret.

In future work, we will deploy Nacc-Guard in transformer accelerators and in autonomous vehicle platforms, such as Tesla cars.