1 Introduction

Matrix computing is widely used in computer science and engineering applications, such as signal processing [1], image processing [2], and convolutional neural networks [3]. In recent years, although matrix computing has achieved good performance on many acceleration platforms such as CPUs, GPUs, TPUs, and FPGAs, it often remains the performance bottleneck of the entire system.

Compared with other platforms, field-programmable gate arrays (FPGAs) offer the flexibility to tailor computing structures and storage schemes to specific applications. Many studies have shown that FPGAs outperform other platforms in terms of performance and power consumption [4], making them suitable as low-cost hardware accelerators for matrix computing.

There have been many FPGA-based solutions for matrix computing. Some scholars have proposed algorithms and structures for fixed-point matrix multiplication [5] and matrix calculation [6]. Systolic array structures have achieved high data throughput and computing speed [7, 8], but they demand a large number of processing elements (PEs) and high I/O bandwidth. Linear array structures are also widely used in matrix multiplication [9,10,11]. Several studies have applied different optimizations to this structure [12], such as storage optimization [13] and I/O and storage scheduling optimization [14]. The structure has also achieved good results in accelerating convolutional neural networks [15, 16] and multi-operation, continuous matrix computing [17].

However, FPGA-based matrix computing still faces the following concerns. Firstly, previous work was designed for matrix computing at a fixed data precision: once the data type is determined, the calculation structure cannot handle data of another precision. For example, signal processing in controllers demands matrix computing at several precisions, which calls for a dedicated hardware accelerator to provide real-time computing capability. Secondly, most existing research focused on matrix multiplication, yet in many engineering applications, such as filtering algorithms, matrix addition and matrix subtraction are used together with matrix multiplication; all three are therefore important calculation models. Thirdly, since the matrix calculation accuracy required by deep learning models is modest, it is worthwhile to design a half-precision floating-point matrix computing accelerator that satisfies both performance and calculation efficiency.

This study aims to develop an efficient multi-precision matrix computing acceleration (MCA) unit core based on a unified architecture. The major contributions of this work are as follows:

  • Based on the parallel block matrix computing algorithm and a linear array, we present a two-parallel single-precision floating-point matrix computing unit. Through data splicing and a mapping strategy from the algorithm to the calculation structure, each PE completes two single-precision floating-point operations in parallel, which effectively improves computing speed.

  • The proposed multi-precision MCA unit core with a unified structure can handle three precisions (double-precision, single-precision, and half-precision) and three commonly used matrix operations (matrix multiplication, matrix addition, and matrix subtraction), which enhances its adaptability to different data types.

  • We develop the MCA system based on a DSP + FPGA platform. Compared with some state-of-the-art structures, the proposed system meets engineering needs and achieves better performance.

The remainder of this paper is organized as follows. Section 2 introduces the parallel block matrix computing algorithm and the linear array. Section 3 details single-precision and half-precision floating-point matrix computing, the multi-precision functional component, and the MCA core and system. Section 4 presents the implementation results and discussion. Finally, Sect. 5 concludes the work.

2 The Parallel Block Matrix Computing Algorithm and Linear Array

The matrix multiplication involved in this article is shown in Eq. (1), where matrix A, matrix B, and matrix C are dense matrices of arbitrary size.

$$ C_{M \times N} = A_{M \times K} \times B_{K \times N} + C_{M \times N} . $$
(1)

We introduce a parallel block matrix multiplication algorithm [17]. As shown in Algorithm 1, it includes three outer loops and two inner loops. The outer loops with loop variables \(T_{p}\), \(T_{t1}\), and \(t2\) handle the data transmission of sub-matrices A and B and the initialization of sub-matrix C. The inner loops with loop variables \(p\) and \(t1\) perform the calculation in each PE. After blocking, the sizes of the sub-matrix blocks A, B, and C are \(S_{p} \times K\), \(K \times S_{t1}\), and \(S_{p} \times S_{t1}\), respectively. The parameters \(S_{p}\) and \(S_{t1}\) also represent the number of PEs and the depth of the on-chip RAM in each PE, respectively.

Algorithm 1. The parallel block matrix multiplication algorithm.
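For concreteness, the following C sketch models the loop nest of Algorithm 1 in software (a minimal model under our own naming; the function block_mm, the row-major layout, and the boundary handling are our assumptions, not part of the original listing):

```c
#include <stddef.h>

/* Software model of Algorithm 1: C[M x N] += A[M x K] * B[K x N],
 * processed in Sp x St1 blocks. The p loop plays the role of the PE
 * index, and the innermost statement is the per-PE multiply-accumulate
 * with write-back to the on-chip RAM (modeled here by C itself). */
void block_mm(size_t M, size_t N, size_t K, size_t Sp, size_t St1,
              const double *A, const double *B, double *C)
{
    for (size_t Tp = 0; Tp < M; Tp += Sp)            /* block rows of A and C */
        for (size_t Tt1 = 0; Tt1 < N; Tt1 += St1)    /* block cols of B and C */
            for (size_t t2 = 0; t2 < K; t2++)        /* stream B row by row   */
                for (size_t p = Tp; p < Tp + Sp && p < M; p++)
                    for (size_t t1 = Tt1; t1 < Tt1 + St1 && t1 < N; t1++)
                        C[p * N + t1] += A[p * K + t2] * B[t2 * N + t1];
}
```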

As shown in Fig. 1, the linear array structure corresponding to Algorithm 1 consists of a set of PEs, a data transfer controller, and two-stage FIFOs. Each PE includes two registers for storing elements of sub-matrices A and B, a FIFO for buffering data flow a, an on-chip RAM block for storing a row of elements of sub-matrix C, and a multiply-accumulate functional component for matrix multiplication.

Fig. 1. Linear array architecture and computing process.

Before the calculation starts, all elements in the on-chip RAM block are initialized to zero, and a column of elements from sub-matrix A is preloaded and distributed to register a in each PE. When the elements of sub-matrix B flow into the linear array by rows, the linear array starts parallel calculations. Each PE then completes the \(c^{k} = a \times b + c^{k-1}\) operation through the write-back mechanism and saves the intermediate result in the on-chip RAM block. After K iterations, the result of sub-matrix C is moved to the off-chip memory.

The above algorithm and structure implement the function \(A \times B \pm C\). When B is set to the identity matrix I, matrix addition (subtraction) of sub-matrix A and sub-matrix C is realized, as shown in Eq. (2).

$$ C_{M \times K} = A_{M \times K} \times I_{K \times K} \pm C_{M \times K} . $$
(2)

Matrix addition (subtraction) uses the same structure and a similar data flow as matrix multiplication. The difference is that sub-matrix C, which is otherwise initialized to a zero matrix, is loaded into the on-chip RAM block through the RAM channel in advance.
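As a quick illustration of Eq. (2), matrix addition can be expressed with the same multiply-accumulate datapath by passing an identity matrix for B; the sketch below builds on the hypothetical block_mm model above (matrix_add is our name, and error handling is omitted):

```c
#include <stdlib.h>

/* Realize C = A + C via C = A * I + C, as in Eq. (2). Here C plays the
 * role of the sub-matrix preloaded into the on-chip RAM block. */
void matrix_add(size_t M, size_t K, const double *A, double *C)
{
    double *I = calloc(K * K, sizeof *I);   /* identity matrix I[K x K] */
    for (size_t i = 0; i < K; i++)
        I[i * K + i] = 1.0;
    block_mm(M, K, K, M, K, A, I, C);       /* single block for brevity */
    free(I);
}
```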

3 Multi-precision Floating-Point Matrix Computing Based on a Unified Structure

To support multiple data types in matrix computing, we first discuss how to design the algorithms, data flows, and PE structures for single-precision and half-precision floating-point matrix computing based on the aforementioned double-precision matrix computing scheme and linear array. Then, we introduce the corresponding floating-point multiply-accumulate (subtract) functional component in each PE. Finally, we describe the multi-precision floating-point MCA unit core and use it to build a multi-precision MCA system.

3.1 Two Parallel Single-Precision Floating-Point Matrix Computing

In the structure proposed in this section, each PE completes two single-precision floating-point operations in the same clock cycle.

As shown in Algorithm 2, two-parallel single-precision floating-point matrix multiplication uses the same blocking method, loop structure, and data flow as Algorithm 1. The difference is that the sizes of sub-matrices A and C become \(2S_{p} \times K\) and \(2S_{p} \times S_{t1}\), respectively, and in the innermost loop the original multiply-accumulate operation becomes two parallel multiply-accumulate operations.

Algorithm 2. Two-parallel single-precision floating-point matrix multiplication.

Figure 2 shows the PE structure of several data types. The data sources in Fig. 2(a) and Fig. 2(b) are double-precision and single-precision floating-point data, respectively.

Generally, each PE has three source operands. In the two-parallel single-precision structure, we use a data splicing mechanism when preparing the source operands: two 32-bit single-precision floating-point values are spliced into one 64-bit source operand, with one value placed in the upper half of the register and the other in the lower half.
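The splicing step can be modeled in a few lines of C (a sketch of the packing only; pack2f is our hypothetical helper, and in hardware the splice is simple wire concatenation):

```c
#include <stdint.h>
#include <string.h>

/* Splice two 32-bit single-precision values into one 64-bit operand:
 * x fills the lower half of the register and y fills the upper half. */
static uint64_t pack2f(float x, float y)
{
    uint32_t lo, hi;
    memcpy(&lo, &x, sizeof lo);   /* reinterpret the bits safely */
    memcpy(&hi, &y, sizeof hi);
    return ((uint64_t)hi << 32) | lo;
}
```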

Fig. 2. The PE structure of matrix multiplication. (a) Double-precision floating-point. (b) Two parallel single-precision floating-point. (c) Four parallel half-precision floating-point.

Figure 3 describes the mapping of single-precision floating-point matrix multiplication from the algorithm to the hardware structure. Firstly, sub-matrix C is initialized to the zero matrix. Then, when loading a column of \(2S_{p}\) elements from sub-matrix A, we stitch two adjacent single-precision values into a 64-bit operand and distribute it to the corresponding PE. When loading data row by row from sub-matrix B, we stitch two copies of the same single-precision value into a 64-bit operand and distribute it to each PE. At the same time, the calculation of the linear array begins.

After the write-back mechanism and K iterations, the results are 64-bit operands. Each PE is responsible for calculating two rows of sub-matrix C: the lower 32 bits hold the result of the former row, and the upper 32 bits hold the result of the latter row.
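Conversely, the 64-bit results are split back into the two single-precision rows, as the following sketch shows (unpack2f is our hypothetical counterpart to the packing helper above and reuses its headers):

```c
/* Split a 64-bit result into its two single-precision values: the lower
 * 32 bits belong to the former row of C, the upper 32 bits to the latter. */
static void unpack2f(uint64_t w, float *former_row, float *latter_row)
{
    uint32_t lo = (uint32_t)w;
    uint32_t hi = (uint32_t)(w >> 32);
    memcpy(former_row, &lo, sizeof lo);
    memcpy(latter_row, &hi, sizeof hi);
}
```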

Fig. 3. The mapping process of two parallel single-precision floating-point matrix multiplication.

Single-precision floating-point matrix addition (subtraction) uses a similar algorithm and calculation structure to single-precision matrix multiplication. The difference is that sub-matrix B becomes an identity matrix, and the sub-matrix C participating in the multiply-accumulate (subtract) operation must be imported into the on-chip RAM in advance.

3.2 Four Parallel Half-Precision Floating-Point Matrix Computing

In half-precision floating-point matrix multiplication, one PE completes four floating-point operations in the same clock cycle. As shown in Algorithm 3, the size of sub-matrix A is \(4S_{p} \times K\), the size of sub-matrix B is unchanged, and the size of sub-matrix C is \(4S_{p} \times S_{t1}\). In the innermost loop, the original multiply-accumulate operation becomes four parallel multiply-accumulate operations.

Algorithm 3. Four-parallel half-precision floating-point matrix multiplication.

As shown in Fig. 2(c), when preparing the source operands, four 16-bit half-precision floating-point values are spliced into one 64-bit source operand. For example, the half-precision source operands \(\{a'_{41}, a'_{31}, a'_{21}, a'_{11}\}\) and \(\{b'_{11}, b'_{11}, b'_{11}, b'_{11}\}\) yield the four half-precision results \(\{c'_{41}, c'_{31}, c'_{21}, c'_{11}\}\).

Similar to the process shown in Fig. 3, when loading a column of \(4S_{p}\) elements from sub-matrix A, we stitch four adjacent half-precision values into a 64-bit source operand and distribute it to the corresponding PE. When loading data row by row from sub-matrix B, we stitch four copies of the same half-precision value into a 64-bit source operand and distribute it to each PE. Here, each PE is responsible for calculating four rows of sub-matrix C.
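The half-precision packing follows the same idea with four 16-bit lanes (a sketch assuming the values are already IEEE 754 binary16 bit patterns; pack4h and splat4h are our hypothetical names):

```c
/* Splice four half-precision bit patterns into one 64-bit operand:
 * h[0] occupies bits 15:0 and h[3] occupies bits 63:48. */
static uint64_t pack4h(const uint16_t h[4])
{
    return  (uint64_t)h[0]
          | (uint64_t)h[1] << 16
          | (uint64_t)h[2] << 32
          | (uint64_t)h[3] << 48;
}

/* For sub-matrix B, one element is replicated into all four lanes. */
static uint64_t splat4h(uint16_t b)
{
    const uint16_t h[4] = { b, b, b, b };
    return pack4h(h);
}
```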

Likewise, half-precision floating-point matrix addition (subtraction) follows the same pattern as half-precision matrix multiplication.

3.3 Multi-precision Floating-Point Multiply-Accumulate (Subtract) Functional Component

We adapt and optimize a self-designed, high-performance floating-point multiply-accumulate (subtract) functional component [18] to meet the needs of matrix calculation. Figure 4 shows the three floating-point data formats, which comply with the IEEE 754 standard. Following the rules of each data format, the multi-precision functional component performs single-precision and half-precision floating-point operations by reusing as much of the double-precision logic as possible.

Fig. 4. Different precision floating-point formats used in the unit. (a) Double-precision format. (b) Single-precision format. (c) Half-precision format.

The multi-precision functional unit adopts a six-stage pipeline design comprising operand preparation, mantissa multiplication, multiply-add, normalization, exception judgment, and result selection modules.

After reading the operands, the functional unit separates the exponent and mantissa according to the format and processes them separately. Figure 5(a) shows that in a half-precision calculation, the input 64-bit source operand is split into four half-precision floating-point values.
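Per lane, the operand preparation amounts to field extraction from the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits); a minimal sketch with our hypothetical helper split_h:

```c
/* Separate the sign, exponent, and mantissa fields of one
 * half-precision lane (IEEE 754 binary16: 1 + 5 + 10 bits). */
static void split_h(uint16_t h, unsigned *sign, unsigned *expo, unsigned *mant)
{
    *sign = h >> 15;           /* bit  15    */
    *expo = (h >> 10) & 0x1F;  /* bits 14:10 */
    *mant = h & 0x3FF;         /* bits  9:0  */
}
```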

Fig. 5. (a) The process of half-precision data preparation. (b) Multiplexing multipliers in the mantissa multiplication module.

Besides, we use four single-precision multipliers to perform each group of mantissa multiplications in parallel; multiplexing these multipliers saves hardware overhead. As shown in Fig. 5(b), when performing half-precision floating-point operations, the four separated significands are sent to the upper 11 bits of the four 32 × 32-bit multipliers, respectively.
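The idea of multiplier reuse can be modeled as follows (a rough C sketch under our assumptions: normal, non-subnormal inputs, the implicit leading 1 restored, and rounding and normalization omitted; mul_half_lane is our hypothetical name):

```c
/* Model of multiplexing one 32x32-bit multiplier for a half-precision
 * lane: the 11-bit significand (10 mantissa bits plus the implicit
 * leading 1) is left-aligned into the upper 11 bits of a 32-bit input. */
static uint64_t mul_half_lane(unsigned mant_a, unsigned mant_b)
{
    uint32_t sa = ((1u << 10) | mant_a) << 21;  /* significand in bits 31:21 */
    uint32_t sb = ((1u << 10) | mant_b) << 21;
    return (uint64_t)sa * sb;  /* 22-bit significand product in bits 63:42 */
}
```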

3.4 Implementation for Multi-precision MCA Unit Core and MCA System

The MCA unit core supports matrix computing of arbitrary size. As shown in Fig. 6, the core receives an externally provided matrix operation mode signal and decides whether to load or initialize sub-matrix C. It also selects the appropriate precision mode according to actual needs; the corresponding operands are then supplied and processed according to that mode.

In implementing the multi-precision MCA unit core, we add logic for data-type selection, data splicing, and the parts that cannot be reused, but the resulting logic still meets the delay constraints of the design.

Fig. 6. The architecture of the multi-precision MCA unit core and MCA system.

The overall structure of the system is shown in Fig. 6. The MCA system uses a DSP + FPGA architecture. Firstly, the DSP sends instructions and control signals to the coprocessor and transfers data from the host-side memory to the DDR on the coprocessor side by starting the SRIO module. The coprocessor then calculates autonomously. After the overall calculation is completed, the system starts the SRIO module again and returns the calculation result. Therefore, while the coprocessor is working, the DSP can execute other instructions, achieving independent acceleration.

The coprocessor includes the MCA unit core and data transmission logic, chiefly a DMA module for selecting the operation mode, controlling the data flow, and transmitting data within the coprocessor. The modules and the core are connected via the AXI protocol to facilitate communication.

4 Experimental Results and Discussions

The MCA unit core is programmed in Verilog HDL and synthesized and simulated with Xilinx Vivado 2016. We then deployed the MCA system on the DSP + FPGA platform.

4.1 Synthesis Results

We use the 585T development board as the target device and synthesize the MCA unit with 128 PEs under typical conditions. Each pipeline stage meets the target delay of 550 ps, and the total power consumption is 5.48 W.

The number of PEs directly affects the calculation efficiency and hardware overhead of the proposed structure. Table 1 shows the logical resource consumption of the MCA unit for different numbers of PEs. As the number of PEs increases, the LUT and slice register usage increases. In addition to the number of PEs, RAM block consumption is also related to the size of the sub-matrix block, because the number of PEs determines the number of RAM blocks used, while the size of the sub-matrix block determines their depth.

The depth of the RAM block in each PE determines the number of times the data of sub-matrix A can be reused. To maximize data reuse, we choose the depth of the RAM block according to the size of sub-matrix B; in principle, the larger the better, as far as on-chip storage allows. We set the depth of the RAM block to 512 in this project.

Table 1. Resource consumption of different numbers of PEs.

For \(M \times N\) rectangular matrices, thanks to the block matrix approach, the on-chip storage requirement is only \(2S_{p}S_{t1}\), where the parameters \(S_{p}\) and \(S_{t1}\) represent the numbers of rows and columns of sub-matrix C, respectively. The structure therefore has low storage requirements for large-scale matrix computing.

4.2 Performance Analysis

Table 2 shows how the maximum operating frequency and peak performance of the multi-precision MCA unit vary with the number of PEs. With 128 PEs, the maximum operating frequency reaches 180 MHz. The multi-precision MCA unit can then complete 128 double-precision multiplications and 128 additions per cycle, so the peak performance of double-precision floating-point matrix computing can be estimated as \(180\,\text{MHz} \times 128 \times 2\,\text{FLOP} = 46.1\,\text{GFLOPS}\). Since the unit processes two single-precision or four half-precision values in the same time, single-precision and half-precision performance reach 92.1 GFLOPS and 184.3 GFLOPS, respectively.
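The peak numbers can be reproduced from the formula in the text (a small self-check program; the lane counts 1/2/4 for double/single/half precision follow Sects. 3.1 and 3.2):

```c
#include <stdio.h>

/* Peak performance = frequency x PEs x 2 FLOP/cycle x lanes per PE. */
int main(void)
{
    const double freq_hz = 180e6;
    const int    pes     = 128;
    const int    lanes[] = { 1, 2, 4 };    /* double, single, half */
    const char  *name[]  = { "double", "single", "half" };
    for (int i = 0; i < 3; i++)
        printf("%-6s precision: %.2f GFLOPS\n", name[i],
               freq_hz * pes * 2 * lanes[i] / 1e9);
    return 0;
}
```

The program prints 46.08, 92.16, and 184.32 GFLOPS, consistent with the rounded figures quoted above.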

Table 2. Number of PEs, clock speed, and peak performance for double-precision/ single-precision/half-precision floating-point data.

In large-scale matrix computing, we must account for the time to initialize the system, the time to transfer data from external memory, and any stalls in which the calculation unit waits for data during the calculation process.

Firstly, system initialization time is dominated by data preloading. In multi-precision matrix multiplication and matrix addition (subtraction), the amounts of preloaded data are \(S_{p}\) and \(S_{p} + S_{p}S_{t1}\), respectively. As the matrix size grows, the ratio of preload time to total time becomes small and the calculation becomes more efficient.

Secondly, based on the characteristics of the algorithm and structure, we analyze the data transfer time during matrix computing. As shown in Eq. (3), we use \(\vec{B}_{i}^{\,r}\) and \(\vec{C}_{i}^{\,r}\) to denote the ith rows of matrix B and matrix C, respectively.

$$ \vec{C}_{1}^{\,r} = a_{1,1}\vec{B}_{1}^{\,r} + a_{1,2}\vec{B}_{2}^{\,r} + \ldots + a_{1,k}\vec{B}_{k}^{\,r} . $$
(3)

Let one PE calculate a row of elements from left to right. After k iterations, the first PE completes the calculation of the first row of matrix C; likewise, the ith PE is responsible for the ith row of matrix C. More generally, the extension to all rows of matrix C is given by Eq. (4).

$$ \vec{C}_{i}^{\,r} = a_{i,1}\vec{B}_{1}^{\,r} + a_{i,2}\vec{B}_{2}^{\,r} + \ldots + a_{i,k}\vec{B}_{k}^{\,r} . $$
(4)

This formulation shows that all PEs work in parallel without affecting each other. By reasonably setting the number of elements in a row of matrix B, the data \(a_{i,k}\) can be reused; at the same time, the latency of the floating-point functional component and the data transfer time can be hidden within the computing time, which effectively avoids waiting for data and keeps the pipeline of the structure full.
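In code, the per-PE independence expressed by Eq. (4) looks as follows (a sketch; pe_row is our hypothetical name, and each call models one PE accumulating its own row of C from the streamed rows of B):

```c
/* Eq. (4) in code: PE i accumulates row i of C from the streamed rows
 * of B, with no data dependence on any other PE. */
static void pe_row(size_t i, size_t K, size_t N,
                   const double *A, const double *B, double *C_row_i)
{
    for (size_t k = 0; k < K; k++)        /* one streamed row of B per step */
        for (size_t t = 0; t < N; t++)
            C_row_i[t] += A[i * K + k] * B[k * N + t];
}
```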

4.3 Discussion

As shown in Table 3, we compare the proposed multi-precision MCA unit with related work.

Compared with the double-precision floating-point matrix multiplication structure [13], our calculation structure integrates more precisions and more matrix operation modes. Compared with [17], although our multi-precision MCA unit uses fewer PEs, performance is not compromised: Table 3 shows that on single-precision floating-point data its performance is close to that of [17], and on half-precision floating-point data it exceeds [17].

Although the structure based on circulant matrices [6] reaches high performance, it handles only fixed-point data, which cannot meet the needs of most engineering calculations, such as the nonlinear Kalman filter and extended Kalman filter algorithms that require floating-point matrix multiplication and matrix addition. Moreover, its matrix size is tied to the number of PEs and it handles only square matrices, so its processing capacity is limited. Besides, the computational performance of our structure on half-precision floating-point data (16-bit) is better than that of [6] on fixed-point data (18-bit), and our preparation time for preloading data is significantly shorter than their time for preloading and generating the circulant matrix.

Another matrix computing system [19] with multiple accelerators sets up separate computing arrays for different matrix operations. Our structure is superior to it in both performance and energy efficiency.

Table 3. Performance and hardware overhead compared to related work.

5 Conclusions

This paper extends matrix calculation to a multi-precision, multi-operation setting based on the block matrix computing algorithm and a linear array. By adjusting the data flow and reusing logic, the proposed multi-precision floating-point MCA unit can process half-precision, single-precision, and double-precision floating-point data within one structure. We then built the MCA system around the MCA unit core. Compared with existing matrix calculation structures, our unit handles three precisions and three operation modes in a unified structure and features low storage requirements and high efficiency. In the future, we plan to improve the arithmetic component to support more numerical formats, including fixed-point and extended double-precision.