
1 Introduction

Iterative stencil computations appear in scientific and engineering applications such as the numerical integration of partial differential equations [11, 18, 26], image processing [3, 22], Particle-in-Cell simulation [22], and graph processing [15], among others. The acceleration of stencil codes on parallel architectures has been widely studied [2, 6, 13, 17, 18, 25,26,27,28]. One of the most important limitations of stencil computation is its small operational intensity [23, 26], which makes it difficult to exploit supercomputers with large numbers of processing units (microprocessors and GPUs) [23].

In recent years, FPGA-based accelerators have been proposed to improve stencil computation performance with low power consumption [5,6,7, 10, 16, 21, 23, 26]. FPGAs have a large number of registers, which facilitates transferring data between iterations of a computation without accessing external memory; this increases both operational intensity and processing speed [26]. In a previous work, we presented an evaluation of architectures for stencil computation which showed that, using registers and on-chip memory, it is possible to reach execution times similar to those of a reference CPU [1]. However, these architectures were tested experimentally only for small mesh sizes due to the resource limitations of the FPGA used.

FPGA accelerators are usually implemented by means of a hardware description language (HDL) [9, 10, 19, 26]. However, HDL designs require extensive hardware knowledge [26]. To raise the level of abstraction and ease implementation, high-level synthesis (HLS) tools have been used, as in [8, 14, 19, 20, 24, 28]. HLS tools allow the designer to ignore some hardware details, but often deliver solutions that are less efficient than those obtained with HDL [19]. In such cases it is necessary to rewrite the code manually to optimize, for example, memory access [8].

There have been attempts to improve the performance of HLS solutions. For example, in [6], a set of design options is explored to accommodate a large set of constraints. Most works in the literature achieve high performance by avoiding spatial blocking and restricting the input size. In contrast, in [28], spatial and temporal blocking are combined to avoid input size restrictions. It is well known that data access is one of the bottlenecks of HLS solutions [4, 8], so memory management must be optimized. In [8], graph theory is used to optimize memory banking. In [4], a non-uniform memory partition is proposed that minimizes the number of memory banks. Loop pipelining is another key optimization method in HLS [13]. However, the performance of the resulting solutions may be suboptimal when complex memory dependencies appear. In [12,13,14], loop pipelining capabilities are improved to handle uncertain memory dependencies.

This document presents a strategy that attacks the HLS optimization problem on two fronts: memory management and loop pipelining. To this end, a method is proposed that splits the mesh so that total latency is reduced by means of on-chip memory partitioning and pipeline directives. The two-dimensional Laplace equation is used as a case study, implemented on two different development systems: the ZedBoard using the Vivado Design Suite and the Ultra96 board using Vivado SDx. Performance is evaluated and compared as a function of the number of inner-loop divisions and memory partitions, in terms of latency, power consumption, FPGA resource usage, and speed-up. The rest of the document is organized as follows. Section 2 presents the two-dimensional Laplace equation and the approach to its numerical solution by the finite difference method. Section 3 presents the details of the implemented stencil computing system. Results are presented in Sect. 4. Finally, conclusions are given in Sect. 5.

2 Case Study: Two-Dimensional Laplace Equation

Let \(\varOmega \) be a domain of \(R^2\) with boundary \(\partial \varOmega \). The partial differential equation shown in (1) is elliptic at all points \((x,y) \in \varOmega \).

$$\begin{aligned} \frac{\partial ^2 u}{\partial x^2} + \frac{\partial ^2 u}{\partial y^2} = 0 \end{aligned}$$
(1)

This expression is known as the two-dimensional Laplace equation, and it describes, for example, the stationary temperature distribution in a two-dimensional region given some boundary conditions. An approximate numerical solution of this equation is obtained with the finite difference method: the \(\varOmega \) region is discretized in the two dimensions x and y by defining a number of points I and J, respectively. Assuming a uniform distribution of points in the solution domain, the approximation at iteration \(n+1\) is given by (2).

$$\begin{aligned} u^{n+1}_{ij} = \frac{1}{4} (u^n_{i+1,j}+u^n_{i-1,j}+u^n_{i,j+1}+u^n_{i,j-1}) \end{aligned}$$
(2)

The implementation of this approach is known as the Jacobi algorithm. For Dirichlet boundary conditions and a number of iterations N, it is described in Algorithm 1.

Algorithm 1. Jacobi algorithm with Dirichlet boundary conditions (figure).
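For reference, Algorithm 1 can be sketched in plain C as below; the array sizes and function name are illustrative, not taken from the original listing.

```c
#define NI 8  /* mesh points along x (illustrative size) */
#define NJ 8  /* mesh points along y (illustrative size) */

/* Sketch of Algorithm 1: N Jacobi iterations with fixed (Dirichlet)
   boundaries. u holds iteration n, v receives iteration n+1, and the
   interior of v is transferred back into u after every sweep. */
void jacobi(float u[NI][NJ], float v[NI][NJ], int N)
{
    for (int n = 0; n < N; n++) {
        for (int i = 1; i < NI - 1; i++)
            for (int j = 1; j < NJ - 1; j++)
                v[i][j] = 0.25f * (u[i + 1][j] + u[i - 1][j]
                                 + u[i][j + 1] + u[i][j - 1]);
        for (int i = 1; i < NI - 1; i++)      /* transfer v -> u */
            for (int j = 1; j < NJ - 1; j++)
                u[i][j] = v[i][j];
    }
}
```

The explicit transfer loop is the part that the optimization in Sect. 3.2 later removes.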

3 System Implementation

The implementation was performed for two different development systems: the ZedBoard using the Vivado Design Suite and the Ultra96 board using Vivado SDx. In both cases, the Zynq processing system (PS) interacts through an AXI interface and DMA with a custom IP created in Vivado HLS using the C language for the programmable logic (PL) section. The ARM core is the host processor, where the main application runs in a PetaLinux terminal, and the custom IP core executes the stencil-based algorithm. The application generates the initial values and the boundary conditions, which are stored in BRAM. Then, the number of iterations is defined and the stencil computation function is called. When the stencil algorithm finishes, the results become available in the DDR3 RAM, where they can be read and saved in a text file with 15 significant digits in decimal format. The block diagram of the system is shown in Fig. 1.

Fig. 1. Block diagram of the system implemented using Vivado Design Suite.

3.1 Baseline Architecture of the Custom IP Core

The sequential implementation of the code, defined as architecture \(A_1\), was used as the reference against which the performance of the parallel implementations is compared. Since the data arrive at the main function as a vector, the stencil operation line of Algorithm 1 is implemented as shown in Algorithm 2.

Algorithm 2. Stencil operation over the flattened (vector) mesh (figure).
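The flattening described above can be sketched as follows; the function and parameter names are assumptions, but the index mapping is the standard row-major one: element (i, j) of the 2-D mesh maps to index i*nj + j of the vector.

```c
/* Sketch of Algorithm 2: the mesh arrives as a 1-D vector, so the
   stencil reads the four neighbors through flattened indices. */
void stencil_flat(const float *u, float *v, int ni, int nj)
{
    for (int i = 1; i < ni - 1; i++)
        for (int j = 1; j < nj - 1; j++)
            v[i * nj + j] = 0.25f * (u[(i + 1) * nj + j]
                                   + u[(i - 1) * nj + j]
                                   + u[i * nj + j + 1]
                                   + u[i * nj + j - 1]);
}
```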

The maximum size that can be used for a square mesh is determined by the number of BRAM blocks available in the FPGA device (\(BRAM\_Blocks\)). Considering a BRAM block size of 18 Kb, the use of single-precision floating-point format, and that the algorithm requires two arrays to store the values of the last two iterations, the mesh size is calculated as shown in (3).

$$\begin{aligned} mesh\_size_{max} = \sqrt{\frac{BRAM\_Blocks \times 18000}{32\times 2}} \end{aligned}$$
(3)

3.2 Parallelization

The acceleration of the algorithm execution, defined as architecture \(A_2\), was achieved by applying a pipeline directive to Loop 1 of the stencil code. In addition, some modifications were made to the stencil code implementation to reduce latency.

A first optimization takes advantage of Loop 1.1 of the Laplace function by folding in the transfer operations used when vector u is updated with vector v before computing a new iteration. As a result, the upper limit of the outer loop is halved, as shown in Algorithm 3.

Algorithm 3. Stencil code with halved outer loop (figure).
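The idea behind Algorithm 3 can be sketched as below, assuming an even N and illustrative sizes: each pass of the outer loop performs two sweeps, u into v and then v back into u, so the explicit copy of Algorithm 1 disappears and the outer loop runs N/2 times.

```c
#define MI 8  /* illustrative mesh size */
#define MJ 8

/* Sketch of Algorithm 3: the v -> u transfer of Algorithm 1 is replaced
   by a second sweep that reads v and writes u, halving the outer loop. */
void jacobi_half(float u[MI][MJ], float v[MI][MJ], int N)
{
    for (int n = 0; n < N / 2; n++) {
        for (int i = 1; i < MI - 1; i++)
            for (int j = 1; j < MJ - 1; j++)
                v[i][j] = 0.25f * (u[i + 1][j] + u[i - 1][j]
                                 + u[i][j + 1] + u[i][j - 1]);
        for (int i = 1; i < MI - 1; i++)
            for (int j = 1; j < MJ - 1; j++)
                u[i][j] = 0.25f * (v[i + 1][j] + v[i - 1][j]
                                 + v[i][j + 1] + v[i][j - 1]);
    }
}
```

Note that both arrays must carry the boundary values, since the second sweep reads the boundaries of v.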

To improve performance further, a method is proposed that splits the mesh into three blocks along the y axis, as shown in Fig. 2. The distribution is made so that the number of divisions in block B2 is a power of 2, and so that the number of rows in blocks B1 and B3 is odd because they contain the boundary-condition rows. This distribution allows the parallelization directives to be applied in such a way that synthesis time, resource usage, and latency are all reduced. The memory partition directive lets the different blocks access their corresponding data concurrently, distributing the data over a number of smaller arrays defined by the partition factor. This approach, defined as architecture \(A_3\), is described in Algorithm 4.

Fig. 2. Distribution of blocks for processing.

Algorithm 4. Architecture \(A_3\) with block splitting and memory partitioning (figure).
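A much simplified sketch of the \(A_3\) kernel style in Vivado HLS C is shown below. The pragmas follow Vivado HLS syntax (ARRAY_PARTITION, PIPELINE); the single loop nest, the partition factor of 4, and the array sizes are illustrative simplifications, not the original Algorithm 4.

```c
#define ROWS 64
#define COLS 64

/* A3-style sketch: the arrays are cyclically partitioned along the row
   dimension so that the processing blocks of Fig. 2 can access their
   rows concurrently, and the inner loop is pipelined. The pragmas are
   ignored by a plain C compiler, so the function is still testable. */
void stencil_a3(float u[ROWS][COLS], float v[ROWS][COLS])
{
#pragma HLS ARRAY_PARTITION variable=u cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=v cyclic factor=4 dim=1
    for (int i = 1; i < ROWS - 1; i++) {
        for (int j = 1; j < COLS - 1; j++) {
#pragma HLS PIPELINE II=1
            v[i][j] = 0.25f * (u[i + 1][j] + u[i - 1][j]
                             + u[i][j + 1] + u[i][j - 1]);
        }
    }
}
```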

4 Results

The performance of the implemented system was evaluated in terms of numerical results, execution times, and FPGA resource usage. Numerical results were obtained for different mesh sizes, with boundary conditions and initial values defined as shown in (4).

$$\begin{aligned} \left\{ \begin{matrix} u_{xx} + u_{yy} = 0 &{}\\ u = 0, &{}\forall (x,y) \in \varOmega \\ u = 1, &{}\forall (x,y) \in \partial \varOmega \end{matrix}\right. \end{aligned}$$
(4)
Fig. 3. Latency for 4 processing blocks of the middle division according to the number of iterations and the partition factor of the on-chip memory.

The performance of the implemented architectures is obtained by measuring execution time. Architecture \(A_3\) has several configurations; therefore, a design space exploration is performed over two parameters: the number of subdivisions of the middle block and the memory partition factor. For this purpose, latencies are obtained in clock cycles for different combinations of both parameters and the number of iterations. Latency measurements are performed for middle-block sizes of 4, 8, 16, 32, and 64, and for memory partition factors of 2, 4, 8, 16, 32, and 64. For each combination of these parameters, the simulation is carried out for \(10^1\), \(10^2\), \(10^3\), \(10^4\), \(10^5\), and \(10^6\) iterations. Figure 3 shows the latencies for 4 subdivisions of B2 as a function of the number of iterations and the memory partition factor. Performance is observed to improve as this last parameter increases. The latencies obtained are used to compute execution times assuming a 100 MHz clock frequency.
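The conversion from latency in cycles to execution time used above is simply cycles divided by frequency; a small helper (name assumed) makes it explicit.

```c
/* Converts a latency in clock cycles to execution time in microseconds
   for a clock frequency given in MHz (100 MHz in this work). */
double exec_time_us(unsigned long long cycles, double f_mhz)
{
    return (double)cycles / f_mhz;
}
```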

Table 1 shows the speed-up achieved by \(A_3\) with 4 processing blocks relative to the base architecture \(A_1\) and to sequential execution on the CPU. A number of iterations is observed beyond which the acceleration tends to a constant value.

Table 1. Speed-up with regards to the base architecture A1 based on number of iterations and memory partition factor.

The best performance was determined by plotting the latency as a function of the number of subdivisions of block B2 and the memory partition factor for \(10^6\) iterations, as shown in Fig. 4. The lowest latency was observed with a combination of 4 subdivisions and a memory partition factor of 64.

Fig. 4. Latency for \(10^6\) iterations based on the number of processing blocks and the partition factor of the on-chip memory.

Execution times were measured experimentally for the architectures implemented on the ZedBoard and the Ultra96 board. Table 2 shows the execution times of the implemented architectures according to the number of iterations.

Table 2. Execution times in microseconds for different number of iterations with the different architectures implemented.

The speed-up achieved by the implemented architectures, relative to the baseline and to the sequential implementation on the CPU, is shown in Table 3.

Table 3. Speed-up achieved for the implemented architectures in relation to the sequential implementation on CPU.

The consumption of hardware resources for each architecture is shown in Table 4.

Table 4. Hardware resources required on ZedBoard and Ultra96. The entire system includes processing modules.

The power consumption for the implemented architectures is shown in Table 5.

Table 5. Power consumption for the implemented architectures. The entire system includes processing modules.

5 Conclusions

This paper presents a strategy for implementing stencil-based algorithms on SoC-FPGAs using Vivado HLS, addressing the optimization problem in terms of memory management and loop pipelining. The general scheme of the implemented architectures uses an ARM Cortex-A9 microprocessor acting as master, on which the main application is executed. The processor interacts through an AXI interface with an IP created in Vivado HLS, which executes the stencil-based algorithm. The architectures are implemented on a ZedBoard Zynq Evaluation and Development Kit under the Vivado Design Suite environment and on an Ultra96 board using Vivado SDx. The main application is written in C and executed under PetaLinux on the PS through a terminal console. Communication uses an AXI interface and direct memory access (DMA).

To improve performance in terms of execution time, a method is proposed to split the mesh into three parts along the y axis. The distribution is made so that the number of rows in block B2 is a power of 2, considering that blocks B1 and B3 include the boundary-condition rows. An unrolling of the inner loop is proposed so that the latency of the middle loop is reduced according to the number of subdivisions of B2. Additionally, the on-chip memory is partitioned so that each subdivision can access its corresponding data concurrently.

A design space exploration of the generalized architecture is performed over the number of B2 processing subdivisions and the memory partition factor. For this, latencies are obtained in clock cycles for different combinations of both parameters and numbers of iterations. Performance is observed to improve as the memory partition factor increases. The configuration that provides the best performance while still fitting in the ZedBoard uses 4 divisions of B2 and 32 memory partitions. For this configuration, a speed-up of approximately 209.83\(\times \) relative to the base architecture and 6.76\(\times \) relative to the reference CPU is obtained, with a power consumption of approximately 3.6 W. For the Ultra96, the \(A_3\) architecture is implemented with 4 divisions of B2 and 64 memory partitions. In this case, a speed-up of 9.2\(\times \) at 100 MHz and 18.21\(\times \) at 200 MHz is achieved relative to sequential execution on the CPU.