
1 Introduction

With the fast advances in information technologies [1,2,3] and big data algorithms [4,5,6], natural language processing (NLP) has become a fast-growing area, powered by the recurrent neural network (RNN) [7], which is well suited to temporal tasks. However, gradients explode or vanish when the network is too deep or the sequence is too long, so an RNN can learn only short-term dependencies. To learn long-term dependencies, the long short-term memory (LSTM) network [8] was developed on the basis of the RNN. The LSTM uses a unique gate structure that forgets part of the information and thereby avoids the problems of the RNN. LSTMs are widely used in NLP, especially in machine translation, where they are much more effective than RNNs. Although the special “gates” of the LSTM solve the RNN's inability to learn long-term dependencies, they also strengthen the data dependencies between modules, imposing strict timing requirements and therefore low parallelism.

Many optimizations for LSTMs have emerged. Some optimized models [9,10,11] focused on model accuracy, and most of them ignored training speed and model size. Other work accelerated model inference with hardware accelerators [12,13,14]. FPGAs became the primary choice for hardware-accelerated LSTMs because of their excellent performance in accelerating CNNs. [15] accelerated the LSTM by designing custom accelerators: the LSTM model was analyzed, each module was implemented in hardware, and a custom hardware structure was designed according to the model's characteristics.

Based on the idea of hardware-software co-optimization, this paper presents a modular analysis and collaborative design of the LSTM. Through pruning, quantization, and an improved data flow, the network is optimized jointly in hardware and software to maximize parallelism while preserving correct timing and avoiding data dependencies. The contributions of this article are:

1) The pruning method in this paper prunes by row, so the final pruning result guarantees the same number of elements in each row and prepares for the parallel design of the hardware.
2) We propose a method for optimizing neural networks based on the idea of hardware-software collaboration and give a concrete design for a multilayer LSTM model.
3) A modular analysis of the LSTM from the software perspective shows that some cells in different layers can be computed in parallel. Based on this, the computational order of the multilayer LSTM is redesigned so that the model is hardware-friendly while its timing is properly preserved.

The rest of the paper is organized as follows: Sect. 2 describes the related work; Sect. 3 introduces the idea of software and hardware co-optimization of the LSTM and its implementation details in software; Sect. 4 describes the design of the corresponding hardware structure based on the software optimization; Sect. 5 presents the experimental results and analysis; Sect. 6 concludes the paper.

2 Related Work

Machine learning [16, 17] has been widely applied in various areas and applications, such as finance [18, 19], transportation [20], and the tele-health industry [21]. The optimization of neural networks [22, 23] is a popular research problem, and it is divided into two main directions: optimizing network models and designing hardware accelerators.

The design of hardware accelerators can be divided into two main categories. One category targets data transmission. Three types of accelerators were designed in [14] for different data communication methods. The first streamed all data from off-chip memory to the coprocessor; it achieved high performance but was limited by the off-chip memory bandwidth. The second used on-chip memory to store all the necessary data internally; this scheme required little off-chip memory bandwidth but was limited by the available on-chip memory. The third balanced the two options to achieve both high performance and scalability.

The other category focuses on the computational process. Paper [24] proposed a hardware architecture for LSTM neural networks that aimed to outperform software implementations by exploiting their inherent parallelism. Networks of different sizes were synthesized for different platforms, and the final synthesized network outperformed a custom software implementation running on an i7-3770k desktop computer by a factor of 251. Paper [25] proposed an FPGA-based LSTM-RNN accelerator that flattened the computation inside the gates of each LSTM cell, with the input vectors and weight matrices flattened together accordingly. Execution was pipelined between LSTM cell blocks to maximize throughput, and a linear approximation of the activation function reduced hardware resource consumption significantly at the cost of a 0.63% error rate.

Both optimizing network models and designing hardware accelerators can improve the network, but each considers only one optimization direction. The former optimizes only the network and ignores the reconfigurable design of the hardware; the latter focuses on the hardware design without considering the characteristics of the network itself. In neither case is there a synergistic design between software and hardware.

3 LSTM Software and Hardware Co-optimization

Our design spans both the software and the hardware dimension: when optimizing the algorithm, we consider how to make it hardware-friendly, and when designing the hardware architecture, we consider how to adapt it to better support the software algorithm. In this paper, we propose the idea of software and hardware co-optimization. It is a spiral optimization method that chooses software optimizations while adjusting both the software optimization and the hardware structure through feedback based on the characteristics of the target hardware. Software optimization is combined with hardware design: the structure of the neural network model and its number of parameters are designed reasonably while, from the hardware perspective, a hardware structure supporting the model is designed.

3.1 Approach

In this paper, we argue that neural network algorithms and hardware architectures interact in the computational process with a considerable degree of collaboration; software optimization therefore needs to be combined with hardware optimization. Often the number of model parameters is too large for the hardware, so before any analysis it is necessary to determine whether direct deployment to the hardware is feasible. If the model parameters put too much pressure on the computational resources of the hardware, the number of parameters can be reduced by pruning. There are various pruning methods, and while reducing the number of parameters it is necessary to choose one that is compatible with the hardware's storage and computation patterns. If the data-transfer pressure on each computing module is too high, the data can be quantized to relieve it.

After compressing the model, we need to analyze it as a whole. From the software perspective, the algorithms and the data flow between modules are analyzed, and the relationships between the modules are sorted out. From the hardware perspective, the data dependencies of each module are analyzed, and we consider which parts can be optimized through parallelization, data-flow design, and pipelining. It is worth noting that these two tasks are carried out at the same time. While considering algorithm optimization and tuning the model structure, one must consider how to make the result hardware-friendly; while considering hardware optimization, one must consider how to tune the algorithm and structure so that they are decoupled, enabling hardware structures with higher parallelism and lower latency. In the end, the model structure and the hardware architecture fit each other well through this hardware-software co-design.

3.2 Model Compression

With the development of neural networks, accuracy is no longer the only metric used to evaluate a model, and researchers increasingly focus on other metrics relevant to the application domain, such as model size and resource consumption [26]. Pruning and quantization are the common methods for model compression, and their algorithms keep being refined as demand grows. [27] proposed deep compression, a three-stage pipeline of pruning, trained quantization, and Huffman coding. By combining multiple compression methods on a neural network, it was demonstrated experimentally that the storage required by the model was reduced by a factor of \(35\times \) to \(49\times \) without loss of accuracy.

Pruning: Network pruning is one of the most commonly used model compression algorithms. During pruning, we remove redundant weights and keep the important ones according to certain criteria so as to preserve as much accuracy as possible. There are many pruning methods, each with its own focus. Since the computation of an LSTM consists largely of matrix-vector multiplications, this paper focuses on pruning the LSTM weight matrices. The pruning method used here is the less common pruning by row. It differs from the usual threshold-based pruning in how the pruning range is selected: each row of the weight matrix is treated as a whole, and a threshold is chosen per row according to a fixed percentage, so the final result guarantees the same number of elements in every row. In the hardware implementation of the matrix operations, we expand the matrix by rows and compute the rows in parallel to obtain high parallelism; the idea of hardware-software collaboration is fully reflected here.

The accuracy of the model degrades after pruning, so the model is retrained afterwards to recover it. It is worth noting that the pruned (zeroed) parameters would easily drift to non-zero values during retraining, which would destroy the pruning effect if nothing were done. Therefore, we apply a mask matrix during retraining to ensure that the zero elements of the weight matrix do not participate in training. The mask matrix has the same shape as the weight matrix: wherever an element of the weight matrix is 0, the corresponding element of the mask matrix is also 0.
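To make the procedure concrete, the following is a minimal PyTorch sketch of row-wise pruning and of the mask that keeps pruned weights at zero during retraining. The function names (`prune_by_row`, `apply_mask_to_grads`) and the keep ratio are ours and purely illustrative; the paper does not specify its exact implementation.

```python
import torch

def prune_by_row(weight: torch.Tensor, keep_ratio: float = 0.25):
    """Keep the largest-magnitude `keep_ratio` fraction of entries in each row,
    so every row ends up with the same number of non-zero elements."""
    rows, cols = weight.shape
    k = max(1, int(cols * keep_ratio))                 # identical k for every row
    kth = weight.abs().topk(k, dim=1).values[:, -1:]   # per-row threshold
    mask = (weight.abs() >= kth).float()
    return weight * mask, mask

def apply_mask_to_grads(weight: torch.Tensor, mask: torch.Tensor):
    """Call after loss.backward() and before optimizer.step() so that
    pruned (zero) weights receive no update and stay zero."""
    if weight.grad is not None:
        weight.grad.mul_(mask)
```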

Fig. 1. Pruning by row and its benefit for parallel processing.

Figure 1(a) shows the weight matrix of the memory gate \(O_i\) in a certain neuron. Figure 1(b) shows the result \(O_i^a\) after unbalanced pruning, while Fig. 1(c) shows the result \(O_i^b\) after pruning by row. After pruning the weight matrix, we assign each row of the matrix to a different PE for parallel computation. Figure 1(d) and Fig. 1(e) show the running times of the PEs under the two pruning schemes. Unbalanced pruning (Fig. 1(d)) leaves a different amount of data in each row, which leads to a large difference in elapsed time between the parallel PEs.

The total elapsed time is determined by the computation time of the most heavily loaded PE, while the other PEs wait for different lengths of time, which reduces running efficiency; in this example the total time is five clock cycles. Pruning by row, in contrast, gives every row the same amount of data, so the parallel PEs never wait and the overall efficiency remains high. Since all rows require the same number of clock cycles, the total time is only three clock cycles.
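The effect can be reproduced with a few lines of arithmetic. The per-row non-zero counts below are hypothetical (the exact counts in Fig. 1 may differ) but are consistent with the five-versus-three-cycle behaviour described above, assuming one PE per row and one multiply-accumulate per clock cycle.

```python
# Hypothetical non-zero counts per row after the two pruning schemes;
# both keep 12 weights in total.
unbalanced = [5, 3, 2, 2]   # unbalanced pruning: load differs per PE
row_pruned = [3, 3, 3, 3]   # pruning by row: identical load per PE

# With one PE per row, the slowest PE sets the total elapsed time.
print(max(unbalanced))      # 5 cycles; the other PEs idle while waiting
print(max(row_pruned))      # 3 cycles; no PE waits
```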

Quantization: Quantization is another widely used approach to model compression. In general, a well-trained model can tolerate some input noise during inference; that is, it can ignore non-essential differences between the inference and training samples. To some extent, low precision can be regarded as a source of noise that introduces only non-essential differences, so in theory a neural network can still give accurate results even when the data precision is low. At the same time, neural networks may occupy a large amount of storage, which means the network needs not only a large amount of memory but also a large number of computational resources during operation. This, too, motivates quantization.

The parameters of the LSTM weight matrices are 32-bit single-precision floating-point numbers. When the network is large, the memory requirement is very large, and floating-point computation consumes many hardware resources. [28] used a linear quantization strategy for both the weights and the activation functions: the authors analyzed the dynamic range of all weight matrices in each LSTM layer and explored the effect of different quantization bit widths on the LSTM model. In this paper, we quantize float32 to 8-bit fixed point. Although some accuracy is lost, multiple values can be read at a time for processing, which substantially improves computational efficiency.
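As an illustration, the sketch below quantizes a float32 tensor to signed 8-bit fixed point using a power-of-two scale chosen from the tensor's dynamic range. This is only one possible scheme, since the paper does not detail its quantization procedure; the function names are ours.

```python
import math
import torch

def quantize_fixed8(x: torch.Tensor):
    """Map float32 values to signed 8-bit fixed point.
    The scale is a power of two (as in fixed-point arithmetic), chosen
    so the tensor's dynamic range fits the 7 magnitude bits of int8."""
    max_abs = float(x.abs().max())
    # Largest number of fractional bits f with max_abs * 2**f <= 127.
    frac_bits = int(math.floor(math.log2(127.0 / max_abs))) if max_abs > 0 else 7
    scale = 2.0 ** frac_bits
    q = torch.clamp((x * scale).round(), -128, 127).to(torch.int8)
    return q, frac_bits

def dequantize_fixed8(q: torch.Tensor, frac_bits: int):
    """Recover approximate float32 values from the fixed-point representation."""
    return q.float() / (2.0 ** frac_bits)
```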

By compressing the model, we greatly reduce the number of parameters and the computational effort. It is worth noting that when choosing the pruning method, we abandoned the scheme with the higher compression rate in favour of row pruning, which is more convenient for hardware computation and paves the way for the hardware design that follows.

4 Hardware Design

There are many hardware designs for LSTMs, but most of them did not take the characteristics of the hardware structure into account during software optimization, so the final hardware structure was often limited by the characteristics of the model and of the data. In this paper, we take the design characteristics of the hardware structure into account during model compression and choose a compression method that is compatible with our hardware design, ensuring hardware friendliness to the greatest extent.

Fig. 2. Overall architecture diagram.

4.1 Overall Structure

The overall architecture of the hardware is shown in Fig. 2. It consists of the controller, the input and output buffers, BRAM, and the LSTM computation block. The ARM core controls the overall flow of the model by transmitting instructions to the hardware controller. The parameters and input data required for the computation are pre-processed by model compression and stored in BRAM. The controller accepts instructions from the ARM core and then coordinates and invokes the other blocks accordingly. The input and output buffers temporarily hold the data from BRAM to reduce the data-transfer latency of the LSTM computation block. The LSTM computation block implements the computational logic of the multi-layer LSTM: each neuron in this block implements the computational flow of a single neuron of the LSTM algorithm, and the neurons are densely connected through input and output ports to form the data paths between them.

4.2 Compute Optimization

The main computations of an LSTM cell are four matrix multiplications, activation functions, dot products, and additions. Our optimization focuses on the matrix multiplications, which account for a large part of the overall computation. We block the matrix by rows and feed the rows in parallel: the inner loop of the matrix computation is unrolled, and the elements of multiple rows are assigned simultaneously to different PEs. Since pruning by row gives every row of the weight matrix the same number of elements, the data assigned to the different PEs take the same time to compute; when a PE finishes, it can proceed directly to the next cycle without waiting for the others. Given the limited resources on the hardware, the number of parallel PEs can be tuned to the application scenario. If PE resources are sufficient, the time complexity is reduced to O(N) in the ideal case.
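A sketch of this row-blocked computation is given below: each pruned row is packed into index/value arrays of identical length (guaranteed by pruning by row), and each row plays the role of one PE. Names such as `rows_to_fixed_nnz` are ours; on the FPGA the per-row computations run in parallel rather than as Python loops.

```python
import torch

def rows_to_fixed_nnz(weight: torch.Tensor, mask: torch.Tensor):
    """Pack each pruned row into (column indices, values) of identical length,
    mirroring how the rows are laid out for the parallel PEs."""
    idx = [m.nonzero(as_tuple=True)[0] for m in mask]
    col_idx = torch.stack(idx)                 # (rows, nnz_per_row), equal lengths
    values = torch.gather(weight, 1, col_idx)  # matching weight values
    return col_idx, values

def sparse_matvec(col_idx: torch.Tensor, values: torch.Tensor, x: torch.Tensor):
    """Each 'PE' (one row) performs nnz_per_row multiply-accumulates;
    the rows are independent, so in hardware they execute in parallel."""
    return (values * x[col_idx]).sum(dim=1)
```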

The computation process of an LSTM cell is shown in Fig. 3. On the left is the processed weight matrix, which we divide into rows and assign to different PEs; on the top are the input vector \(x_t\) and the previous hidden state \(h_{t-1}\). After the input parameters (\(h_{t-1}\), \(x_t\), \(c_{t-1}\)) enter the LSTM cell, \(h_{t-1}\) and \(x_t\) first undergo matrix multiplication and addition with the four weight matrices (\(W_f\), \(W_i\), \(W_c\), \(W_o\)), and the intermediate results (\(i_t\), \(f_t\), \(g_t\), \(o_t\)) are obtained after the \(\sigma \) and \(\tanh \) activation functions. The remaining computation produces the cell state \(c_t\) and the hidden state \(h_t\): \(c_t\) is the dot product of \(i_t\) and \(g_t\) plus the dot product of \(f_t\) and the third input \(c_{t-1}\), while \(h_t\) is obtained by passing \(c_t\) through \(\tanh \) and taking its dot product with \(o_t\). The multiplication of the input parameters with the weight matrices is the most time-consuming part of the whole procedure. Since the four matrix computations are independent, this paper replicates the input parameters so that the four matrix operations are no longer limited by the transmission delay of the inputs and can be computed simultaneously, which reduces the overall computation time.
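For reference, the per-cell computation just described can be written as the following sketch. The weight matrices are assumed to act on the concatenation of \(h_{t-1}\) and \(x_t\), and bias terms are included even though the text above omits them; both are conventional assumptions rather than details stated in the paper.

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM cell step as in Fig. 3: four independent matrix products
    on [h_{t-1}, x_t], gate activations, then the c_t / h_t updates."""
    z = torch.cat([h_prev, x_t])            # shared input to all four gates
    f_t = torch.sigmoid(W_f @ z + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ z + b_i)      # input gate
    g_t = torch.tanh(W_c @ z + b_c)         # candidate cell state
    o_t = torch.sigmoid(W_o @ z + b_o)      # output gate
    c_t = f_t * c_prev + i_t * g_t          # new cell state
    h_t = o_t * torch.tanh(c_t)             # new hidden state
    return h_t, c_t
```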

Fig. 3. Computational process of a single LSTM neuron.

4.3 Data Flow Optimization

We analyze the data flow between neurons along the layer dimension and the time dimension of the LSTM. On a CPU, the cells of a multi-layer LSTM are executed sequentially in both dimensions, which leads to low parallelism. In a hardware implementation, the strict timing of the LSTM means that parallelization can only be carried out along two separate dimensions: between different layers at the same sequence position, such as Layer1_1, Layer2_1, and Layer3_1; and between different time steps of the same layer, such as Layer1_1, Layer1_2, and Layer1_3.

Fig. 4. Multi-layer LSTM data stream (take 3 layers as an example).

The distinction between hardware and software implementations inspires us. Hardware LSTMs often implement each cell's computation separately and spread it out to exploit the rich hardware resources, while software calls the same cell module repeatedly. This difference in implementation brings a new perspective on the data flow. As shown in Fig. 4, when the first time step of the first LSTM layer is computed at moment \(t_0\), its output \(h_0\) is passed to the second time step of the first layer and to the first time step of the second layer. Then, at moment \(t_1\), the first time step of the second layer and the second time step of the first layer can be computed simultaneously, and so on. Thus some LSTM cells can be computed in parallel to achieve acceleration, which is similar to the acceleration idea of the systolic array [29]. After the computational order is optimized, the overall model is more parallel, and in the hardware implementation we can schedule at the same moment all cells that satisfy the computational prerequisites. Figure 5(a) shows the timing diagram before the adjustment, where the whole computation takes nine clock cycles, while in Fig. 5(b) the computation takes only five clock cycles after the adjustment. Under this staircase computation order, Layer1_3, Layer2_2, and Layer3_1, which lie in different layers, start running at the same time because they satisfy the computational prerequisites at the same time, thereby exploiting both parallel strategies simultaneously.
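The reordering amounts to a wavefront (anti-diagonal) schedule: cell (layer \(l\), step \(t\)) needs only the outputs of cells \((l, t-1)\) and \((l-1, t)\), so all cells with the same \(l + t\) are ready at the same time. The sketch below (function name ours) enumerates these waves; for 3 layers and 3 time steps it yields 5 waves instead of 9 sequential cell evaluations, matching the nine-to-five clock-cycle reduction of Fig. 5.

```python
def wavefront_schedule(num_layers: int, num_steps: int):
    """Group LSTM cells (layer, time) into waves that can run in parallel:
    every cell on the anti-diagonal layer + time == s is ready at wave s."""
    waves = []
    for s in range(num_layers + num_steps - 1):
        waves.append([(l, s - l) for l in range(num_layers)
                      if 0 <= s - l < num_steps])
    return waves

# 3 layers x 3 time steps -> 5 waves; e.g. wave 2 holds
# Layer1_3, Layer2_2 and Layer3_1 (0-indexed: (0,2), (1,1), (2,0)).
for s, wave in enumerate(wavefront_schedule(3, 3)):
    print(s, wave)
```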

Fig. 5. Pipeline structure of multi-layer LSTM.

5 Experimental Results and Analysis

We implemented the LSTM neural network on the MNIST dataset [30] using the PyTorch framework. The dataset consists of a training set, a test set, a validation set, and the corresponding label information. The training set contains 55,000 images, each of size 784 (28*28), with 55,000 labels, each a one-dimensional array of length 10; the test set contains 10,000 images and the validation set 5,000 images.
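For completeness, a hedged sketch of this setup is shown below: a multi-layer LSTM classifier in PyTorch that reads each 28×28 image as a 28-step sequence of 28-dimensional row vectors. The input formatting, layer count, and hidden size are assumptions for illustration; the paper does not state them.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

class LSTMClassifier(nn.Module):
    """Multi-layer LSTM over the 28 rows of an MNIST image (assumed formatting)."""
    def __init__(self, hidden_size: int = 128, num_layers: int = 3, num_classes: int = 10):
        super().__init__()
        self.lstm = nn.LSTM(28, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, images):              # images: (B, 1, 28, 28)
        seq = images.squeeze(1)             # (B, 28, 28): 28 time steps of 28 features
        out, _ = self.lstm(seq)
        return self.fc(out[:, -1, :])       # classify from the last time step

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
model = LSTMClassifier()
```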

The experimental results show that the accuracy before pruning reached 95.75%. After pruning the weight matrix, the compression rate reached \(21.99\times \), but because we pruned uniformly by row, some valid parameters were inevitably lost and the accuracy dropped to 68%, which is unacceptable. We therefore retrained the model after pruning, and the final accuracy was 96%. In this way, we completed an effective compression of the model.

We implemented the LSTM computation process with Vivado High-Level Synthesis V2020.2, and the results are shown in Table 1. The device is the XC7Z020. To make the best use of hardware resources, we explored the relationship between the size of the weight matrix and the consumption of each hardware resource, as well as its percentage of the total available amount. Table 1 shows that when the size of the weight matrix is \(32\times 18\), the usage of BRAM_18K, DSP, and FF is small, but the LUT usage reaches 27,132, or 51% of the available LUT resources.

As the size of the weight matrix increases, the consumption of every hardware resource increases except FF, whose percentage changes little. BRAM_18K increases the most, from the initial 34% to 103%; when the scale reaches \(64\times 256\), the BRAM_18K required exceeds the total available on the device. When the model size is \(50\times 200\), the proportions of the resources are the most reasonable.

Table 1. Hardware resource usage for different model sizes

We compared the hardware resource consumption before and after optimization and computed the percentage of the total on-chip resources occupied by LUT, FF, DSP, and BRAM_18K. Before optimization, LUT and BRAM_18K account for a large share of the consumption, at 52% and 34% of their respective totals. Because of the various hardware optimization schemes, the latency is reduced while the consumption of each resource increases significantly. Since LSTM computation contains many matrix multiplications, DSP consumption shows the largest increase after optimization, rising by 40% to reach 70% of the total DSP resources.

6 Conclusion

In this paper, we proposed a method for optimizing neural networks based on the idea of hardware-software collaboration and applied it to a multilayer LSTM. The choice of pruning scheme and the corresponding hardware design fully reflect this collaborative idea. After analyzing the data flow and data dependencies of the LSTM, we redesigned the computational order of the multilayer LSTM so that the formerly serial computation of some neurons is parallelized. The final experiments show that our solution achieves \(10\times \) the throughput of the CPU and improves on other hardware accelerators in terms of throughput, latency, and resource utilization.