1 Introduction

A tsunami is a secondary natural disaster that follows a submarine earthquake. Fast tsunami prediction is strongly desired for disaster prevention. When an earthquake occurs, we can forecast tsunami propagation using numerical simulations with an initial condition and the laws of physics governing the phenomenon. However, such simulations must cover a vast region and process a large amount of data, so sequential computation often cannot complete the simulation faster than real time. Accordingly, large-scale and real-time simulations require massively parallel computing technologies, with various parallel computing architectures, programming models, and languages.

Several previous works address high-performance tsunami simulation on modern computing systems based on the heterogeneous computing paradigm.

Imamura et al. [1] developed the TUNAMI-N1 tsunami simulation package based on the staggered leap-frog scheme. Gidra et al. [2] evaluated a CUDA parallelization of the TUNAMI-N1 code on an NVIDIA Quadro FX 1700 GPU. They reported results on ocean bathymetry data sets of various sizes for 7200 time steps. For a \(1040\times 668\) grid, they obtained a 5.86x speedup compared with sequential computation on a single processor.

Acuna and Aoki [3] used a Tesla M2050 GPU to solve the shallow water equations for tsunami simulation, with a numerical solution based on the CIP-CSL2 semi-Lagrangian scheme and the method of characteristics. They simulated tsunami propagation over a large grid covering the entire Pacific Ocean on the TSUBAME 2.0 system with multiple GPUs. Adaptive mesh refinement (AMR) reduced their memory usage by 20–40%, and they achieved 313 GFlops on a single GPU. Fujita [4] reported an accelerated tsunami simulation on FPGA. He manually extracted large data flow graphs from the program and compiled them into FPGA circuits. The computation grid was \(1040\times 668\), and the simulation ran for 7200 steps with a time step of 1 s. The FPGA tsunami simulation was shown to be 46 times faster than an Intel Core i7 processor at 2.93 GHz.

In this research, we investigate parallel computing algorithms and architectures suitable for high-performance tsunami simulation based on the method of splitting tsunami (MOST) [5, 6]. In the future, we will integrate our parallelized code into the tsunami visualization tool [7], which is currently under development for real-life applications such as determining where tetrapods or breakwaters are most effective for reducing tsunami damage. This research will therefore help make such tsunami modeling experiments with various initial conditions much faster.

MOST, our target algorithm for acceleration, is one of the solvers for the shallow water equations used in tsunami numerical simulation. The MOST algorithm can be considered a combination of the finite difference method and the Euler method for time integration. Our motivation for accelerating the MOST algorithm is to simulate tsunami propagation in real time, before the tsunami actually arrives at the coastal area. From the shallow water equations described in Sect. 2, the phase velocity of the wave motion is \(c = \sqrt{gH}\), where g is the gravitational acceleration and H is the sea depth. For instance, the average sea depth in the Pacific Ocean is known to be 4000 m, which gives a tsunami velocity of about \(c=712\,\hbox {km/h}\). When the distance between the coastal area and the epicenter is 100 km, the tsunami arrives at the coastal area after about 8.5 min. In this case, a prediction based on numerical simulation must be completed within this time limit.
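As a quick check of this arithmetic, the following minimal C program reproduces the numbers above; the constants are taken from the text:

```c
#include <math.h>
#include <stdio.h>

/* Phase velocity c = sqrt(gH) and arrival time D/c for the example above. */
int main(void) {
    const double g = 9.8;       /* gravitational acceleration, m/s^2 */
    const double H = 4000.0;    /* average sea depth, m */
    const double D = 100e3;     /* distance from epicenter to coast, m */
    double c = sqrt(g * H);     /* phase velocity, m/s */
    printf("c = %.0f km/h, arrival in %.1f min\n", c * 3.6, D / c / 60.0);
    return 0;                   /* prints roughly 713 km/h and 8.4 min */
}
```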

To speed up the simulations, we parallelized the MOST algorithm using OpenMP, OpenACC, and OpenCL (Open Computing Language) [8] and evaluated the performance on multi-core CPUs and GPUs. In that benchmarking, the best performance was 185 GFlops, obtained with OpenCL on an AMD Radeon 280X GPU [9].

On the other hand, Nagasu et al. [10] designed a stream computing architecture and hardware for practical tsunami simulation. They introduced multiple stream processing element (SPE) arrays with parallel internal pipelines to exploit further available hardware resources. Their implementation on an Arria 10 FPGA achieved a performance of 383 GFlops and a performance per power of 8.41 GFlops/W with six cascaded SPEs. This dedicated Arria 10 implementation therefore outperforms our best GPU implementation, and its performance per power is also better than that of the GPU implementation [10].

Meanwhile, there are several works that design FPGA accelerators using OpenCL. OpenCL is a well-known framework for parallel programming in heterogeneous environments. It is versatile enough to target various devices, including CPUs, GPUs, and reconfigurable systems such as FPGAs. With specific compilers, it is possible to generate a hardware design for an FPGA automatically from an OpenCL kernel, without explicitly designing the hardware architecture. Several studies have worked on generating FPGA designs from OpenCL kernels.

Takei et al. [11] implemented an FPGA accelerator, written in OpenCL, for the finite-difference time-domain (FDTD) method, which is widely used in electromagnetic simulation. They reported that the FPGA design generated from the OpenCL kernel computed about 10 times faster than their GPU implementation.

Tatsumi et al. [12] implemented an FPGA accelerator for stereo correspondence matching, exploiting pipeline stages for the Fourier transform efficiently on the FPGA. Waidyasooriya et al. [13] used an FPGA accelerator generated from an OpenCL kernel to simulate molecular dynamics. Their hardware implements loop pipelining and achieved more than a 4.6x speedup over a CPU while using only 36% of the Stratix V FPGA resources.

In more recent studies, Yinger et al. [14] presented an FPGA implementation of deep neural networks, formulated as matrix multiplication, written as an OpenCL kernel. Wang et al. [15] designed an FPGA accelerator for convolutional neural networks using OpenCL. Roozmeh and Lavagno [16] focused on the high energy consumption and power dissipation of modern datacenters and presented an FPGA accelerator that speeds up database join operations.

Houtgast et al. [17] implemented a highly efficient FPGA accelerator for the Smith–Waterman algorithm, which finds optimal pairwise alignments in bioinformatics. They implemented the accelerator with only 90 lines of OpenCL kernel code, about 20% of the length of their VHDL code.

As these works show, designing FPGA accelerators for various scientific applications using OpenCL is now feasible. Nevertheless, since FPGA design generation from OpenCL kernels is a recent technology, example applications are still scarce compared with GPU implementations. In this paper, we focus on accelerating the MOST algorithm using OpenCL. We have already developed an OpenCL implementation of the MOST algorithm that applies well-known spatial blocking. Here, we ported that OpenCL code and applied several optimizations to our previous kernel for benchmarking on an Arria 10 FPGA.

This paper presents the evaluation and comparison of the MOST algorithm written in OpenCL among four implementations:

  1. The originally developed kernel using spatial blocking on GPU, as the baseline;

  2. The same kernel as the GPU baseline compiled as an FPGA design (without any optimization for FPGA);

  3. An optimized kernel using shift registers for the FPGA design;

  4. A further optimized kernel that improves parallelism by expanding the width of the data path.

The rest of this paper is organized as follows. In Sect. 2, the outline of the MOST algorithm is given. Section 3 describes the original MOST algorithm and its parallelization using spatial blocking. Section 4 presents the OpenCL implementation and its performance on several GPUs, which serves as the baseline for the following evaluations on FPGA. Section 5 presents the evaluation of the automatically generated OpenCL FPGA design and its further optimizations. Section 6 discusses and compares the OpenCL implementations on GPU and FPGA. Finally, Sect. 7 concludes this paper with a mention of future work.

2 MOST: method of splitting tsunami

First, we show the original MOST algorithm for the solution of the shallow water equations. The shallow water equations, a nonlinear approximation of the shallow water system, are represented by the following partial differential equations (PDEs) [5, 6]:

$$\begin{aligned} u_t + uu_x + vu_y + gH_x &= gD_x ,\\ v_t + uv_x + vv_y + gH_y &= gD_y ,\\ H_t + (uH)_x + (vH)_y &= 0 . \end{aligned}$$
(1)

Here, \(H = H(x,y,t) = \eta (x,y,t) + D(x,y)\), where \(\eta \) and D are the wave height and the depth profile (bathymetry), respectively, u and v are the wave velocities along each spatial coordinate, and g is the gravitational acceleration. Figure 1 schematically shows these quantities in a 1-D plot.

Fig. 1

1-D representation for wave propagation characteristics

An alternative form of Eq. (1) is represented as follows:

$$\begin{aligned} \frac{\partial \mathbf {z}}{\partial t} + \mathbf {A} \frac{\partial \mathbf {z}}{\partial x} + \mathbf {B} \frac{\partial \mathbf {z}}{\partial y} = \mathbf {F} , \end{aligned}$$
(2)

where

$$\begin{aligned} \mathbf {z} = \begin{pmatrix} u \\ v \\ H \end{pmatrix}, \quad \mathbf {A} = \begin{pmatrix} u & 0 & g \\ 0 & u & 0 \\ H & 0 & u \end{pmatrix}, \quad \mathbf {B} = \begin{pmatrix} v & 0 & 0 \\ 0 & v & g \\ 0 & H & v \end{pmatrix}, \quad \mathbf {F} = \begin{pmatrix} gD_x \\ gD_y \\ 0 \end{pmatrix}. \end{aligned}$$

The numerical treatment of MOST is based on two auxiliary systems. Applying spatial splitting to Eq. (2) along each coordinate, we get two auxiliary systems, \(\mathbf {\Phi } = (u, 0, H)^T\) and \(\mathbf {\Psi } = (0, v, H)^T\), each of which depends on only one spatial variable:

$$\begin{aligned} \frac{\partial \mathbf {\Phi }}{\partial t} + \mathbf {A} \frac{\partial \mathbf {\Phi }}{\partial x} = \mathbf {F}_1 , \end{aligned}$$
(3a)

$$\begin{aligned} \frac{\partial \mathbf {\Psi }}{\partial t} + \mathbf {B} \frac{\partial \mathbf {\Psi }}{\partial y} = \mathbf {F}_2 , \end{aligned}$$
(3b)

where

$$\begin{aligned} \mathbf {F}_1 = \begin{pmatrix} gD_x \\ 0 \\ 0 \end{pmatrix}, \quad \mathbf {F}_2 = \begin{pmatrix} 0 \\ gD_y \\ 0 \end{pmatrix}. \end{aligned}$$

The MOST algorithm uses the method of characteristics for the numerical solution. For the solution along the x-coordinate, Eq. (3a) is transformed into the following form:

$$\begin{aligned} \frac{\partial \mathbf {W}}{\partial t} + \mathbf {A'} \frac{\partial \mathbf {W}}{\partial x} = \mathbf {F}_1' , \end{aligned}$$
(4)

where

$$\begin{aligned} \mathbf {W} = \begin{pmatrix} v \\ u+2\sqrt{gH} \\ u-2\sqrt{gH} \end{pmatrix}. \end{aligned}$$
(5)

Here, the elements of \(\mathbf {W}\) are the Riemann invariants, which are constant along the characteristic curves of the equation, and the diagonal matrix \(\mathbf {A'}\) and the vector \(\mathbf {F}_1'\) are expressed as follows:

$$\begin{aligned} \mathbf {A'} = \begin{pmatrix} \lambda _1 & 0 & 0 \\ 0 & \lambda _2 & 0 \\ 0 & 0 & \lambda _3 \end{pmatrix}, \quad \mathbf {F}_1' = \begin{pmatrix} 0 \\ gD_x \\ gD_y \end{pmatrix}, \end{aligned}$$
(6)

where \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are the eigenvalues of \(\mathbf {A}\):

$$\begin{aligned} \lambda _1 = u, \lambda _2 = u + \sqrt{gH}, \lambda _3 = u - \sqrt{gH} . \end{aligned}$$

For the numerical solution of Eq. (4), the following finite difference method (FDM) with explicit Euler time integration is applied:

$$\begin{aligned}&\frac{\mathbf {W}^{n+1}_{i,j} - \mathbf {W}^{n}_{i,j}}{\Delta t} + \mathbf {A'} \frac{\mathbf {W}^{n}_{i+1,j} - \mathbf {W}^{n}_{i-1,j}}{2 \Delta x} \\&\qquad - \mathbf {A'} \Delta t \frac{\mathbf {A'}(\mathbf {W}^{n}_{i+1,j} - \mathbf {W}^{n}_{i,j}) - \mathbf {A'}(\mathbf {W}^{n}_{i,j} - \mathbf {W}^{n}_{i-1,j})}{2 \Delta x^2} \\&\quad = \frac{\mathbf {F'}_{i+1,j} - \mathbf {F'}_{i-1,j}}{2 \Delta x} - \mathbf {A'} \Delta t \frac{\mathbf {F'}_{i+1,j} - 2\mathbf {F'}_{i,j} + \mathbf {F'}_{i-1,j}}{2 \Delta x^2}. \end{aligned}$$
(7)

Here, n denotes the nth computational step, and i and j correspond to the x- and y-coordinates, respectively. \(\Delta t\) and \(\Delta x\) denote the time step and the grid resolution, respectively. The stability criterion for the MOST algorithm can be written as the following relationship between the time step and the grid resolution [18]:

$$\begin{aligned} \Delta t \le \frac{\Delta x}{\sqrt{gH}} . \end{aligned}$$
(8)

The actual calculation procedure for one time step is summarized as follows:

  1. u, v, and H are transformed into the Riemann invariants by Eq. (5).

  2. The solution along the x-coordinate is calculated by Eq. (7).

  3. The variables are transformed back to the original u, v, and H.

  4. v, u, and H are transformed by the equations corresponding to Eq. (5) for the y-coordinate.

  5. The solution along the y-coordinate is calculated.

  6. The variables are transformed back to the original v, u, and H.

In total, this procedure requires 200 floating-point operations for updating one cell.
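As an illustration of steps 1 and 3, the forward and inverse transformations of Eq. (5) can be written per cell as follows; this is a minimal sketch with our own function names, not the kernel code shown in later figures:

```c
#include <math.h>

/* Transformation to the Riemann invariants of Eq. (5) and its inverse. */
static inline void to_riemann(float u, float v, float H,
                              float *w1, float *w2, float *w3) {
    const float g = 9.8f;
    float s = 2.0f * sqrtf(g * H);
    *w1 = v;                      /* invariant along lambda_1 = u            */
    *w2 = u + s;                  /* invariant along lambda_2 = u + sqrt(gH) */
    *w3 = u - s;                  /* invariant along lambda_3 = u - sqrt(gH) */
}

static inline void from_riemann(float w1, float w2, float w3,
                                float *u, float *v, float *H) {
    const float g = 9.8f;
    *v = w1;
    *u = 0.5f * (w2 + w3);
    float s = 0.25f * (w2 - w3); /* equals sqrt(gH) */
    *H = s * s / g;
}
```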

The accuracy of simulations based on the MOST algorithm generally depends on the following three factors:

  • The algorithm and program for calculating tsunami wave propagation,

  • The accuracy of the bathymetry data,

  • The accuracy of the generated initial wave displacement.

For the first factor, the original MOST is second-order accurate in space and first-order accurate in time, and it is standard, well-verified software used in sequential computation. In this paper, we applied various optimizations for parallelization to the original algorithm and found no significant difference in the results due to these optimizations. On the other hand, we use single-precision (SP) floating-point operations in our evaluation. In the majority of the Pacific Ocean, the sea depth is roughly 4000 m on average, so we consider SP arithmetic sufficiently accurate. However, in some areas, such as ocean trenches, the sea depth exceeds 10,000 m. When the difference in sea depth between two adjacent cells is very large, we observed that SP arithmetic causes large numerical errors. For such cases, we can easily switch to double-precision floating-point operations in our OpenCL-based parallelization of the MOST algorithm.
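A minimal sketch of how such a precision switch can be implemented in an OpenCL kernel; the real typedef and the USE_DOUBLE macro are our own illustration, not the exact code of our kernels:

```c
/* Compile the kernel with -DUSE_DOUBLE to switch to double precision. */
#ifdef USE_DOUBLE
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double real;
#else
typedef float real;
#endif

__kernel void update(__global real *h, __global real *u, __global real *v)
{
    /* ... all arithmetic in the kernel uses the 'real' type ... */
}
```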

To simulate a tsunami generated by an earthquake in practice, a deformation model of the sea floor [19] can be used to compute the initial H, u, and v. The initial condition is modeled by parameters such as the epicenter (the point on the Earth's surface vertically above the earthquake source), the earthquake magnitude, and the distance between the earthquake source and the epicenter. In this paper, we use a flat bathymetry and a simple initial wave displacement, as described in Sect. 4. We believe this choice does not significantly affect our performance evaluation.

3 Algorithms for parallelization

3.1 Original computing algorithm

Before presenting the details of the optimizations for the MOST algorithm, we show the original computing algorithm of MOST presented in Sect. 2. Assume that quantities such as D, H, u, and v are stored in 2-D arrays. Given D, H, u, and v at time step \(n=0\), we update H, u, and v on every time step. In the original MOST program, the data are stored in structure-of-arrays (SOA) format: each quantity is kept in a separate 2-D array.

Each 2-D data array is updated through a 1-D temporary array, based on the scheme shown in Sect. 2. Figure 2 shows the procedure for updating the data along the longitude in one time step.

Fig. 2

Procedure for data updating along x-direction characteristics

In this case, updating is conducted row by row in the following steps. First, the data of the selected row are copied from the 2-D array to a 1-D temporary array. Second, H, u, and v are transformed into the Riemann invariants. Third, the FDM and the Euler method are applied to each cell of the 1-D temporary array. Fourth, the Riemann invariants are transformed back into H, u, and v. Finally, the updated data in the 1-D temporary array are copied back to the 2-D array. The update along the longitude is completed by applying this procedure to all rows.
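The following is a hedged sketch of this row-wise update in C; the helper functions correspond to the steps above, but their names and signatures are ours:

```c
#include <stdlib.h>
#include <string.h>

/* Steps of the row update; declared here, defined elsewhere (our naming). */
void to_riemann_row(float *H, float *u, float *v, int n);
void fdm_euler_row(float *H, float *u, float *v, int n);
void from_riemann_row(float *H, float *u, float *v, int n);

void update_along_x(float *H, float *u, float *v, int rows, int cols) {
    float *tH = malloc(cols * sizeof *tH);   /* 1-D temporary arrays */
    float *tu = malloc(cols * sizeof *tu);
    float *tv = malloc(cols * sizeof *tv);
    for (int j = 0; j < rows; j++) {
        memcpy(tH, &H[j * cols], cols * sizeof *tH);  /* copy row out */
        memcpy(tu, &u[j * cols], cols * sizeof *tu);
        memcpy(tv, &v[j * cols], cols * sizeof *tv);
        to_riemann_row(tH, tu, tv, cols);             /* Eq. (5) */
        fdm_euler_row(tH, tu, tv, cols);              /* Eq. (7) */
        from_riemann_row(tH, tu, tv, cols);           /* invert Eq. (5) */
        memcpy(&H[j * cols], tH, cols * sizeof *tH);  /* copy row back */
        memcpy(&u[j * cols], tu, cols * sizeof *tu);
        memcpy(&v[j * cols], tv, cols * sizeof *tv);
    }
    free(tH); free(tu); free(tv);
}
```

The update along the latitude is analogous, except that each column is gathered with stride cols, which is the source of the cache misses noted below.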

Fig. 3

Procedure for updating along y-direction characteristics

Afterward, the 2-D data are processed along the latitude. As shown in Fig. 3, this procedure is very similar to the update along the longitude; in this case, the computations are conducted on every column. Thus, H, u, and v in the 2-D arrays are updated in one time step. Importantly, this algorithm has a high probability of cache misses for every column copied into the 1-D array, because C/C++ stores planar data in memory row-wise (i.e., longitude-wise).

3.2 Algorithm with spatial blocking

In our GPU implementation, spatial blocking is applied to the original MOST algorithm in order to obtain a high level of parallelism on the GPU. The data in the 2-D arrays are first divided into spatial blocks, and each spatial block is then updated independently.

Let \(N_\mathrm{bsize}\) be the block size of each spatial block. As shown in Fig. 4, a spatial block is extracted to update the central \(N_\mathrm{bsize}\times N_\mathrm{bsize}\) cells of the 2-D arrays. To update a block of \(N_\mathrm{bsize} \times N_\mathrm{bsize}\) data cells, \((N_\mathrm{bsize}+2) \times (N_\mathrm{bsize}+2)\) cells are actually used, since a halo is required to update the boundary cells of the block.

Fig. 4

Extracting the cells to be updated in a spatial block from the entire 2-D array (in the case where the central \(N_\mathrm{bsize} \times N_\mathrm{bsize}\) cells are updated)

Table 1 Number of data loads and stores required for updating \(N_\mathrm{bsize} \times N_\mathrm{bsize}\) cells in the stencil computation

In this figure, the red-colored cells are updated by the stencil computation, and the other cells represent the halo. Table 1 summarizes the number of data loads and stores actually required for the stencil computation. In the case of \(N_\mathrm{bsize} = m\), the amount of computation for updating one block is \(C = fm^2\), where f is the number of floating-point operations per cell, and the total number of memory references is \(M = 2m^2+4m\). Therefore, the computational intensity C / M is given as

$$\begin{aligned} \dfrac{C}{M} = \dfrac{fm^2}{2m^2+4m} = \dfrac{f}{2+\dfrac{4}{m}}. \end{aligned}$$

Clearly, \(m=1\) gives the highest level of parallelism, since the number of blocks is maximized. In contrast, a larger m is desirable for higher computational intensity; for example, \(m=2\) yields f / 4, approaching the asymptotic value f / 2 as m grows. Both high parallelism and high computational intensity are required for high-performance computation on GPUs because of their massive hardware parallelism. The optimal \(N_\mathrm{bsize}\), at which parallelism and computational intensity are balanced, differs for each GPU. We examine the optimal \(N_\mathrm{bsize}\) for each hardware in the following benchmarking.

4 Implementation and evaluation on GPUs

We parallelized the MOST algorithm based on the spatial blocking described in Sect. 3. In this section, we present the performance evaluation of our OpenCL implementation on GPUs.

Throughout the present work, we measured the execution time of our code for 300 time steps. This particular number of time steps was chosen just for the evaluation in this paper; practical simulations sometimes require many more steps. However, we have confirmed that our implementation scales to any number of computation steps, and that the number of steps does not affect the performance evaluation.

The size of the 2-D array is \(2581\times 2879\), which equals the size of the existing bathymetry of the entire Pacific Ocean used by the original MOST program. For simplicity, in this benchmarking we used a flat bathymetry with the constant depth \(D = 2500\) m over the whole computation grid. We generate the initial wave at the center of the computational grid as a cosine-shaped displacement with a peak height of 10 m.
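A minimal sketch of such an initial displacement is shown below; the radius R of the cosine hump (in cells) and the exact profile are our assumptions for illustration:

```c
#include <math.h>

/* Cosine-shaped initial displacement with a 10 m peak at the grid center. */
void init_wave(float *eta, int rows, int cols) {
    const float PI = 3.14159265f;
    const float peak = 10.0f;          /* peak height, m */
    const float R = 50.0f;             /* hump radius in cells (assumed) */
    const float xc = cols / 2.0f, yc = rows / 2.0f;
    for (int j = 0; j < rows; j++)
        for (int i = 0; i < cols; i++) {
            float r = hypotf(i - xc, j - yc);
            /* smooth hump: 'peak' at r = 0, zero for r >= R */
            eta[j * cols + i] =
                (r < R) ? 0.5f * peak * (1.0f + cosf(PI * r / R)) : 0.0f;
        }
}
```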

For the treatment of boundary conditions in the MOST algorithm, a reflecting boundary condition is applied at the boundary between sea and land, and an open boundary condition, through which waves pass to the outside of the computation domain, is applied at the edges of the domain. In our evaluation, there is no land inside the computation domain, and we apply only the open boundary condition at all edges.

In this section, we show the specification of performance benchmarking, implementation, and performance evaluation on GPUs, respectively.

4.1 GPUs for performance benchmarking

Our MOST code was written in C++, so we used the g++ (ver. 4.8) compiler for the benchmarking on GPUs. The following AMD GPUs and NVIDIA GPU were used in this performance evaluation: Radeon R9 280X, FirePro W8100 and W9100 (see Tables 2, 3), and Tesla K20c (see Table 4).

Table 2 Hardware specification of AMD GPU, Radeon
Table 3 Hardware specification of AMD GPU, FirePro
Table 4 Hardware specification of NVIDIA GPU, Tesla

The last row of these tables shows the theoretical peak performance of single-precision (SP) arithmetic for each architecture.

4.2 Performance evaluation of GPU implementation

As described in Sect. 3, our OpenCL kernel for the MOST algorithm is based on spatial blocking. Before starting the computation, memory is allocated on the GPU for the variables used in the computation. After that, the quantities such as D, H, u, and v stored in the 2-D arrays, together with constants such as the gravitational acceleration g and the spatial block size m, are transferred to global memory on the GPU. The number of threads (work items) is determined by the number of spatial blocks computed in parallel. For efficient parallel computation on the GPU, the total number of threads is set to a multiple of 128.
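A hedged host-side sketch of this launch configuration; the helper and variable names are ours:

```c
#include <CL/cl.h>

/* Round the global work size up to a multiple of 128. */
static size_t round_up_128(size_t n) { return (n + 127) / 128 * 128; }

/* queue and kernel are previously created OpenCL objects; one work item
   processes one spatial block. */
cl_int launch_most(cl_command_queue queue, cl_kernel kernel, size_t num_blocks)
{
    size_t global = round_up_128(num_blocks);
    size_t local  = 128;   /* work-group size */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}
```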

Figure 5 shows an overview of our baseline OpenCL kernel. This kernel is executed by every thread (work item) running on the GPU. In OpenCL kernels, built-in functions such as get_global_id() are provided to identify threads; in our case, we use them to assign each thread to a specific spatial block.

Fig. 5

Overview of baseline OpenCL (Code GPU) implementation

Lines 19 to 32 in Fig. 5 show the portion of the kernel that copies the data such as D, H, u, and v required for the stencil computation from the 2-D arrays in global memory. In fact, the data in global memory are stored as 1-D arrays. The macro function GET() is defined to convert the 1-D data in global memory into the 2-D private-memory arrays, whose names end with the _g suffix and which represent the spatial block. As we can see, the data for the stencil computation are stored in SOA format.
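A hedged sketch of this indexing and copy; the actual macro definition in our kernel may differ, and COLS, NBSIZE, and the variable names here are our own:

```c
#define COLS 2581   /* row length of the global 2-D data (assumed) */
#define NBSIZE 2    /* spatial block size */
#define GET(a, i, j) ((a)[(j) * COLS + (i)])

__kernel void copy_block_example(__global const float *h_global, int bx, int by)
{
    /* private-memory block including the one-cell halo */
    float h_g[NBSIZE + 2][NBSIZE + 2];
    for (int j = 0; j < NBSIZE + 2; j++)
        for (int i = 0; i < NBSIZE + 2; i++)
            h_g[j][i] = GET(h_global, bx + i - 1, by + j - 1);
    /* ... stencil computation on h_g follows ... */
}
```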

Here, private memory is the GPU memory assigned to each thread individually, basically allocated in registers. In general, variables declared in an OpenCL kernel without any address-space qualifier are placed in private memory. Nevertheless, the private memory is not expected to be large enough for all the variables used in our MOST implementation shown in Fig. 5, so some variables spilled from registers are stored in global memory.

After the data copy is finished, the computation in the spatial block follows, including the transformation to the Riemann invariants and the update by the Euler method.

4.3 Performance evaluation on GPU

Tables 5 and 6 show the computation time and performance of the baseline OpenCL code on each GPU. The computation time is converted into GFlops, considering that 200 floating-point arithmetic operations are required to update one cell of the stencil.

Table 5 Computation time of the original OpenCL kernel on AMD FirePro W8100, W9100, Radeon R9 280X, and NVIDIA Tesla K20c (unit: s)
Table 6 Performance of the original OpenCL kernel on AMD FirePro W8100, W9100, Radeon 280X, and NVIDIA Tesla K20c (unit: GFlops)

The optimal value of \(N_\mathrm{bsize}\) differs for each architecture. The performance on the NVIDIA Tesla GPU peaks at \(N_\mathrm{bsize}=1\) or 2. \(N_\mathrm{bsize}=4\) is optimal on the AMD FirePro GPUs, where the performance also depends on the GPU model. The AMD Radeon GPU achieved the best performance among all the architectures we evaluated: its optimal \(N_\mathrm{bsize}\) is 2, the computation finishes within 2.5 s, and the performance is 185 GFlops, given by multiplying the number of grid points (\(2581\times 2879\)), the number of floating-point operations per cell (200), and the number of time steps (300), and dividing by the computation time.
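Explicitly, the conversion used throughout this paper is

$$\begin{aligned} \mathrm{Flops} = \frac{2581 \times 2879 \times 200 \times 300}{T_\mathrm{comp}} \approx \frac{4.46\times 10^{11}}{T_\mathrm{comp}}, \end{aligned}$$

which yields 185 GFlops for \(T_\mathrm{comp} = 2.41\) s.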

From specifications such as the GPU clock frequency and the single-precision floating-point throughput, the AMD FirePro W9100 GPU (see Table 3) was expected to achieve the best performance. Although both Radeon and FirePro GPUs are produced by AMD and its partners, the two series are designed for different purposes [20]. The difference between the Radeon and FirePro series was observed using CodeXL, a performance profiling tool.

For the original kernel on the Radeon GPU, the cache hit rate reaches 60–70%, and 138 vector registers are used. Moreover, vector ALU instructions are processed during more than 85% of the GPU computation time, which is nearly optimal. Conversely, on the FirePro GPUs, the cache hit rate is less than 10%, and only 97 vector registers are used; vector ALU instructions are processed during only about 15% of the computation time. Furthermore, we detected memory stalls and write stalls during about 25% of the computation time on the FirePro GPUs.

The difference in performance between the Radeon and FirePro GPUs originates in the difference in the instructions generated for each GPU architecture. Since the Radeon and FirePro GPUs target the consumer market and the professional graphics market, respectively, the device drivers responsible for emitting machine instructions differ. We examined the machine instructions for both GPUs and found that data are loaded from global memory in different ways.

For Radeon, with the device driver OpenCL 1.2 AMD-APP version 1729.3 targeting consumer graphics, we found that the generated code explicitly uses the texture cache to load data from global memory. The texture cache is highly effective for loading read-only data from global memory, which explains the high cache hit rate we observed. For FirePro, with the device driver OpenCL 2.0 AMD-APP version 1642.5 targeting professional compute and graphics, we found that the texture cache is not used; accordingly, the cache hit rate is as low as 10%. At the moment, we cannot explicitly use the texture cache on FirePro. An alternative way to mitigate this problem in OpenCL is to explicitly use local memory, which is shared by the work items in the same work group, for caching data from global memory. This should also improve the performance on the Tesla K20c. In our recent work [21], we evaluated the performance of optimized kernels using local memory on GPUs.
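A minimal sketch of this local-memory alternative (not the kernel evaluated in this paper; the tile size and names are our assumptions):

```c
#define TILE 16   /* square work-group size (assumed) */

__kernel void stage_tile_example(__global const float *h_global, int cols)
{
    /* tile plus a one-cell halo, shared by the work group */
    __local float tile[TILE + 2][TILE + 2];
    const int lx = get_local_id(0), ly = get_local_id(1);
    const int gx = get_global_id(0), gy = get_global_id(1);

    tile[ly + 1][lx + 1] = h_global[gy * cols + gx];
    /* halo loads by boundary work items omitted for brevity */
    barrier(CLK_LOCAL_MEM_FENCE);   /* wait until the tile is complete */
    /* ... the stencil then reads from 'tile' instead of global memory ... */
}
```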

5 Implementation and evaluation on FPGA

In this section, we present the implementation and evaluation of the tsunami simulation as FPGA designs. We modify and further optimize the OpenCL kernel implemented for GPUs to accommodate the FPGA architecture.

5.1 FPGA for performance benchmarking

For the benchmarking of the OpenCL code on FPGA, we use a specific compiler that designs the hardware automatically from the OpenCL kernel. We used the aoc compiler (Intel FPGA SDK for OpenCL, 64-bit Offline Compiler, Quartus 16.0.2). In this paper, we show the results of the performance benchmarking on a DE5a-Net Arria 10 FPGA board, which has two independent DDR3 memory banks.

We show the benchmarking of four OpenCL kernels: the original baseline code described above (Code GPU), an optimization for FPGA shown later (Code SR), and two further optimizations that expand the width of the data path (Code MC1 and MC2). Code MC1 and MC2 present techniques to improve the parallelism within one pipeline. The benchmarking on FPGA is conducted under the same initial conditions as on GPU.

5.2 Optimization by using shift register and its performance on FPGA

A cache system is an on-chip memory element for loading and storing data efficiently. The spatial blocking we applied to the MOST algorithm assumes that cache memory (or local memory) is available for efficient memory access. Therefore, we cannot obtain high performance on FPGA with an algorithm that depends on cache memory.

As an optimization for FPGA, the OpenCL kernel can be written so that shift registers are used for loading and storing data on the FPGA [22]. In every clock cycle, a new datum is shifted into the array shown in Fig. 6. Assuming COLS is the number of columns of the entire computation domain, we use a shift register of size \(2\times \) COLS+3 for the \(3\times 3\) stencil. After a sufficient number of data have been inserted into the shift register to update the central element of the stencil, the computation starts. In this implementation, the parallelism between loop iterations is extracted, and loop pipelining is generated by the compiler.

Fig. 6

The data held by the shift register that are required for the stencil computation when the quantities at element (i, j) are updated

Figure 7 shows an overview of this implementation. The kernel is written to be executed by a single thread, which is known as task-parallel programming.

The 1-D arrays named urows, qrows, and others with the rows suffix represent the shift registers, which are stored in private memory in SOA format. Lines 7 to 17 show the implementation of the shift registers. New data are shifted into the buffer every cycle; unrolling this loop allows the compiler to infer a shift register. In addition, by unrolling every loop in the kernel, the compiler attempts to pipeline the loops and enables multiple iterations of every loop to execute concurrently.
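The following is a minimal sketch of this shift-register idiom; the array and macro names are ours, and it is simplified to a single quantity, unlike the full kernel of Fig. 7:

```c
#define COLS 2581                /* columns of the computation domain (assumed) */
#define SR_LEN (2 * COLS + 3)    /* shift-register length for a 3x3 stencil */

__kernel void most_sr_example(__global const float *h_in,
                              __global float *h_out, int n)
{
    float hrows[SR_LEN];         /* inferred as a shift register by aoc */
    for (int idx = 0; idx < n; idx++) {
        #pragma unroll           /* fully unrolled shift */
        for (int i = SR_LEN - 1; i > 0; i--)
            hrows[i] = hrows[i - 1];
        hrows[0] = h_in[idx];    /* one new datum enters per cycle */
        if (idx >= SR_LEN - 1) {
            /* the 3x3 stencil sits at offsets 0..2, COLS..COLS+2, and
               2*COLS..2*COLS+2; its center is hrows[COLS + 1] */
            float center = hrows[COLS + 1];
            h_out[idx - (COLS + 1)] = center;  /* the real update (Eq. (7)) goes here */
        }
    }
}
```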

Fig. 7

Overview of OpenCL implementation by using shift registers (Code SR)

Table 7 summarizes the resource utilization of the hardware designs automatically generated from our two OpenCL kernels on the DE5a-Net Arria 10 FPGA. The second column lists the available resources of the FPGA. The third column, Code GPU, shows the resource utilization of the original kernel developed for GPUs (Fig. 5), and the last column, Code SR, shows the resource utilization of the design generated from the optimized kernel using the shift registers (Fig. 7). Here, the original kernel is compiled into the FPGA design assuming \(N_\mathrm{bsize}=1\).

Table 7 Resource Usage on DE5a-Net Arria 10 FPGA generated automatically from OpenCL kernels

We conducted performance benchmarking of these two OpenCL kernels. First, the computation of Code GPU, whose clock frequency is 242.24 MHz, takes 2.5 h for 300 steps. As mentioned, this kernel was implemented for GPUs, and spatial blocking is a technique to utilize cache memory effectively. The original OpenCL kernel can thus run as an FPGA design, but it is far from achieving sufficient performance.

In contrast, in Code SR, the optimized kernel, the shift registers were well pipelined, the compiler exploited the loop parallelism, and successive iterations launched every cycle on the FPGA. This resulted in a large performance improvement: the computation time of Code SR is 10.53 s, which is 827 times faster than Code GPU.

Table 8 Number of floating-point operators on FPGA generated automatically from Code SR

Table 8 shows the number of floating-point operators on the FPGA generated from Code SR; 201 operators are used in total. We used the compile option -mad-enable to extract multiply–add operations from the OpenCL kernel; the fifth column shows the number of multiply–add operators actually generated by the compiler. The operators in the sixth column comprise 6 fpext and 26 fptrunc operations, which seem to be generated as auxiliary operators for the dividers or Sqrt. For the performance evaluation of the FPGA design, we assume that 200 operators are used, the same number as the floating-point operations on GPU.

The clock frequency of the generated FPGA design is 248.63 MHz. We can thus estimate the peak performance as \(0.248\,\mathrm{GHz}\times 200=50\) GFlops. The actual performance obtained from the computation time, calculated in the same way as on GPU, is 42.3 GFlops, which is 85% of the hardware peak performance.

We profiled the Code SR kernel using the Altera Dynamic Profiler for OpenCL. The profiling confirms that the kernel occupancy and the bandwidth efficiency of the data transfers stay at almost 100% during the computation. However, storing data to global memory, shown at Line 43 in Fig. 7, causes memory stalls of up to 8%, which leads to a performance drop.

5.3 Multiple computations techniques on one pipeline stage for increasing parallelism and its performance on FPGA

Estimating from Table 7, we expect that our device can implement at most four computation pipelines. This corresponds to SIMD-like operation obtained by widening the data path within the same pipeline stage.

In that case, the performance of the OpenCL kernel on FPGA is expected to approach that on GPU. In our implementations, the number of pipelines is equal to the number of data inserted into the shift register and updated on it in one computation step.

Fig. 8

Shift register holding the data required for the stencil computation when updating two or three cells (green-colored cells) in one computation step

Figure 8 shows examples of shift-register designs for updating two or three cells in one computation step. The size of the 1-D array representing the shift register varies with the number of cells updated in one computation step.

Let \(N_\mathrm{buf}\) be the number of cells updated in the pipeline stage in one computation step; the length of the shift register is then \(2\times \mathrm{COLS} + (2+N_\mathrm{buf})\). This can be implemented by changing the loop conditions in Fig. 7, which is very similar to changing \(N_\mathrm{bsize}\) for the stencil computation on GPU. By compiling the modified code with aoc, the loops are unrolled, and multiple SIMD-like operations are generated.
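A hedged sketch of this widening, under the same naming assumptions as the previous sketch; NBUF, out_base, and stencil_update are our own placeholders:

```c
#define NBUF 4                              /* cells updated per cycle */
#define SR_LEN (2 * COLS + 2 + NBUF)        /* widened shift register */

/* inside the main loop of the single work-item kernel sketched earlier: */
#pragma unroll
for (int i = SR_LEN - 1; i >= NBUF; i--)    /* shift by NBUF positions */
    hrows[i] = hrows[i - NBUF];
#pragma unroll
for (int k = 0; k < NBUF; k++)              /* NBUF new data enter per cycle */
    hrows[NBUF - 1 - k] = h_in[idx * NBUF + k];
#pragma unroll
for (int k = 0; k < NBUF; k++)              /* NBUF stencil updates per cycle */
    h_out[out_base + k] = stencil_update(hrows, k);  /* hypothetical helper */
```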

We call this kernel Code MC1. We conducted the benchmarking under the same conditions; the generated designs for \(N_\mathrm{buf}=2\) and 4 are summarized in Table 9. Code MC1 especially consumes the memory-bit resources of the FPGA as the number of computations per pipeline stage increases. Table 10 shows the computation time of Code MC1. Although Code MC1 gives correct results, the performance degraded significantly even as the number of pipelines increased.

Table 9 Resource Usage on DE5a-Net Arria 10 FPGA generated from Code MC1
Table 10 Computation time and hardware clock frequency of Code MC1 (generated for \(N_\mathrm{buf}=2\) to 4)

In this benchmarking, our implementations for both GPU and FPGA are based on the spatial blocking algorithm. As mentioned before, spatial blocking presumes cache memory for efficient computation. The purpose of using kernels based on spatial blocking was to compare the performance of similar OpenCL codes on different architectures. Since there is no longer room to optimize the spatial blocking kernel for further performance on FPGA, we redesigned the OpenCL kernel for the FPGA.

Figure 9 shows an overview of the new OpenCL kernel, named Code MC2, without spatial blocking. The major modification in Code MC2 is the reduction of memory accesses in the kernel code.

Fig. 9

Overview of OpenCL kernel further optimized for FPGA design (Code MC2)

Fig. 10

Illustration of the difference in memory references between Code MC2 and the other implementations

Figure 10 illustrates the difference in memory accesses between Code MC2 and the previous implementations. In Code MC1, the data in the shift register are first copied to a spatial block represented by a 2-D array, and the stencil computation is processed in every computation step. In contrast, Code MC2 is designed to update the data in the shift registers directly, without copying them to other memory spaces. In the MOST algorithm, the transformation to the Riemann invariants is conducted for the stencil computation; it appears twice per computation step, before and after the finite difference update of the 2-D array. Code MC2 conducts the transformation only once per datum, when the datum is inserted into the shift register. Namely, in this implementation, the shift register always holds data already transformed for the MOST algorithm, and the finite difference computation can be applied directly to the stencil region of the shift register. After the update, only the data to be stored back to global memory are transformed back.

In addition, the data structure is also replaced. The data such as D, q, u, and v for the stencil computation are loaded from global memory in SOA format; in Code MC2, they are stored in private memory in array-of-structures (AOS) format. In our case, the structure members used for the stencil computation of each cell are aligned close to each other in memory, which is expected to improve the efficiency of memory access. We also tried storing the data in SOA format, the same layout as the previous FPGA and GPU implementations; however, the OpenCL kernel with the SOA layout for Code MC2 failed to generate an FPGA design correctly, because of enormous memory usage during compilation in our compilation environment.
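A minimal sketch of this layout change; the struct name and member set are our own illustration:

```c
/* SOA (previous kernels): one shift register per quantity */
float urows[SR_LEN], vrows[SR_LEN], qrows[SR_LEN], drows[SR_LEN];

/* AOS (Code MC2): one shift register of per-cell structures */
typedef struct {
    float u, v, q, d;      /* all members needed to update one cell */
} cell_t;
cell_t rows[SR_LEN];       /* the members of one cell are adjacent in memory */
```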

Besides, there are several divisions by constants in the original MOST program. To reduce the number of floating-point divider operators in the FPGA design, we replaced these divisions with multiplications by the reciprocal.
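For example (with our own variable names), a division by \(2\Delta x\) as in Eq. (7) becomes:

```c
const float inv_2dx = 1.0f / (2.0f * dx);    /* reciprocal computed once */
float flux = (w_p - w_m) * inv_2dx;          /* instead of (w_p - w_m) / (2.0f * dx) */
```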

Table 11 Computation time and hardware clock frequency of Code MC2 (generated for \(N_\mathrm{buf}=1\) to 5)

For the performance benchmarking, we measured the computation time of each FPGA design with \(N_\mathrm{buf}\) pipelines. Table 11 shows the computation time of Code MC2 for \(N_\mathrm{buf}=1\) to 5.

In this implementation, we obtained performance improvements by changing \(N_\mathrm{buf}\). Although the performance is expected to increase in proportion to \(N_\mathrm{buf}\), \(N_\mathrm{buf} = 4\) gives the peak, with a computation time of about 6.5 s.

In this implementation, aoc successfully exploits the loop parallelism, and successive iterations launch every cycle for any \(N_\mathrm{buf}\); the pipelining problem seen in Code MC1 is solved. This modification also affected the compilation time: whereas every previous kernel took from 12 h to one day to generate hardware with aoc, the new kernel takes less than half of that. This is critical for experimenting with various \(N_\mathrm{buf}\).

Nevertheless, the memory stalls for stores to global memory still remain. Besides, the clock frequency of the FPGA design generated from this OpenCL kernel reaches only about 200 MHz for any \(N_\mathrm{buf}\). These factors degrade the performance of the new FPGA design; for \(N_\mathrm{buf}=1\), the performance actually drops compared with Code SR.

Table 12 Resource Usage on DE5a-Net Arria 10 FPGA generated from Code MC2
Table 13 Number of floating-point operators on FPGA generated from Code MC2

Here, we show the specifications of the FPGA designs generated from Code MC2 for \(N_\mathrm{buf}= 1\) to 5. Table 12 shows the resource usage on the FPGA, with one column per value of \(N_\mathrm{buf}\). The usage of DSP blocks increases with \(N_\mathrm{buf}\), while the other resource usages stay constant or grow slowly relative to the DSP block usage. Table 13 shows the number of floating-point operators for \(N_\mathrm{buf}=4\), which gives the best performance among our FPGA implementations; it corresponds to roughly four times the number of operators required to update one cell.

Although the -mad-enable option was also used for this compilation, the number of multiply–add operators decreased compared with Table 8, which shows the operators of Code SR. Replacing divisions with multiplications by the reciprocal also did not directly decrease the number of divider operators.

Here, we also estimate the performance of this FPGA design for \(N_\mathrm{buf}=4\). The clock frequency of this hardware is 198.33 MHz. The peak performance is therefore \(0.198\,\mathrm{GHz} \times 4 \times 200 \approx 160\) GFlops, the product of the clock frequency, the number of pipelines, and the number of floating-point operations per pipeline. The actual performance of this design, evaluated as before, is 68.7 GFlops; with multiple computations per pipeline stage, it is thus at most 43% of the hardware peak performance. In the current implementation, memory stalls still occupy 20% of the computation in the kernel, which is the critical bottleneck for further performance improvement.

Nagasu et al. [23] designed custom hardware for a MOST accelerator. Their implementation differs from our automatically generated hardware in that it exploits not only spatial but also temporal parallelism, cascading the computations of multiple time steps. Comparing our Code SR using shift registers with their implementation on a Stratix V 5SGXA7 FPGA, the performance of our implementation is approximately the same as their 160 MHz MOST accelerator with one SPE.

In their latest implementation, Nagasu et al. [10] evaluated the performance and power consumption of their dedicated FPGA implementation of the MOST algorithm on the same Arria 10 FPGA. In addition, they presented a performance model covering both spatial and temporal parallelism. Specifically, the performance model was constructed for the case (n, m), where n and m are the spatial and temporal parallelism, respectively. Our best implementation, Code MC2 with \(N_\mathrm{buf} = 4\) and no temporal parallelism, corresponds to \((n, m) = (4,1)\) in their model. According to their model, the required memory bandwidth (BW) is proportional to n. With \(n = 4\), the required BW is 28.8 GB/s, while the theoretical BW of our hardware is 17 GB/s with two 4 GB DDR3-1066 SODIMMs. In fact, they found a large gap between the sustained and the theoretical performance for \(n > 1\) given \(m = 1\) (see Figure 19 in [10]). Since their best design, with \((n,m) = (1,6)\), achieved 383 GFlops, they concluded that temporal parallelism is the most effective optimization for accelerating the MOST algorithm.

Waidyasooriya et al. [24] presented an optimization methodology for general stencil computations and also suggested exploiting temporal parallelism to improve the performance of stencil computation. Our current implementation exploits only spatial parallelism. Although the MOST algorithm is basically classified as a five-point stencil computation, its stencil is much more complicated than the stencil computations presented in [24]. For this reason, we have not yet appropriately implemented temporal parallelization using the Intel FPGA SDK. We will improve our OpenCL kernels to exploit both spatial and temporal parallelism with reference to their implementation.

6 Discussion

In this section, we first summarize the results of the performance benchmarking presented in this paper and compare them with related work. Additionally, we discuss the applicability of our GPU and FPGA implementations to real-time tsunami simulation.

6.1 Summary of performance benchmarking and comparison with other works

We summarize the hardware specifications and the benchmarking results of each implementation for GPU and FPGA in Tables 14 and 15. For comparison across platforms, the computation times of the differently optimized kernels from Sect. 5 are converted to performance in floating-point operations per second (Flops).

Table 14 Summary of hardware specification presented in this paper, GPU and FPGA generated from OpenCL kernels
Table 15 Summary of the performance achieved in this paper by the OpenCL implementations for GPU and FPGA (unit: GFlops)

Consequently, among our OpenCL implementations, the original kernel running on the Radeon 280X achieved the best performance in our study so far. GPUs have high peak performance for floating-point operations. In this paper, we presented the baseline GPU implementation primarily for comparison with the FPGA implementations; the GPU performance is expected to become even higher by explicitly using the efficient memory systems of the GPU, namely local memory and texture memory. However, high power consumption is often raised as a disadvantage of GPUs: in our computation, the Radeon GPU consumes 12.5 W when idle and 184.9 W while computing [10].

The OpenCL kernel written for GPUs can also be executed as an FPGA design. By adopting the shift registers and loop unrolling, the OpenCL kernel on FPGA achieved approximately the same performance as the original kernel on the FirePro GPU. Furthermore, increasing the number of computations per pipeline stage contributed a further performance improvement.

FPGAs are known to consume much less power than GPUs for the same computation. In Nagasu et al. [10], their FPGA board consumes 25–30 W when idle and 29.1–45.5 W while computing. Accordingly, the performance per power of the FPGA accelerator is approximately eight times higher than that of the GPU implementation. With little modification of the kernel code, OpenCL kernels developed for GPU can be executed as FPGA designs and obtain performance comparable to or higher than the GPU computation. Designing hardware architecture and logic circuits manually is difficult and time-consuming: the Verilog HDL code generated from our OpenCL kernel consists of more than 600,000 lines in total, whereas writing several hundred lines of OpenCL kernel code with the specific compiler suffices. In our case, for the MOST algorithm, the performance was improved by modifying the kernel to increase the number of computations per pipeline, which was achieved by unrolling a for loop and storing the data as arrays of structures in the OpenCL kernel. Therefore, OpenCL programmers gain an additional execution environment for their applications, although implementing an OpenCL kernel that generates an optimal FPGA design requires several trials.

6.2 Estimation of the applicability for real-time simulation

Here, we estimate the applicability of our GPU and FPGA implementations to real-time tsunami simulation using the results obtained so far. As a practical evaluation, we evaluate our implementations based on the phase velocity derived from the shallow water equations. The phase velocity of a tsunami is \(c=\sqrt{gH}\), where g is the gravitational acceleration and H is the sea depth. Given the distance D between the coastal area and the epicenter, the numerical simulation must be finished within D / c.

The estimation refers to a past disaster, the 2011 Tohoku earthquake and tsunami in Japan. In this case, the epicenter was located at \(D=180\,\hbox {km}\) from the coastal region.

Assuming the average sea depth H is 1500 m, we obtain the phase velocity \(c=436\,\hbox {km/h}\). Computing the arrival time of the tsunami under these conditions, we find that it is approximately 27 min. In this earthquake, the tsunami actually arrived at the coastal area of Fukushima within about 30 min, so this estimate is fairly accurate.

We then estimate the computation time for this scenario using our OpenCL kernels on the Radeon GPU (Code GPU) and the FPGA (Code MC2). The computation time is calculated by assuming a computation domain with \(N\times N\) total grid points and determining \(\Delta t\). Here, we consider a computation domain covering an \(L\times L\,\hbox {km}^2\) area with \(L=200\). To simplify the calculation, we assume the computation domain is covered by a square grid (\(2581\times 2581\)). For an accurate simulation, we must choose an appropriate time step \(\Delta t\) that satisfies Eq. (8) in Sect. 2. Here, we modify Eq. (8) by multiplying the right-hand side by a reliability constant \(\alpha \) (\(0 < \alpha \le 1\)) and use the following in the evaluation:

$$\begin{aligned} \Delta t \le \alpha \frac{\Delta x}{\sqrt{gH}} . \end{aligned}$$
(9)
Table 16 Estimation of the applicability of our OpenCL computation on the Radeon GPU and the FPGA for the tsunami in Tohoku, Japan, 2011

Table 16 presents the conditions for an accurate simulation and the estimated time for the tsunami simulation under those conditions. In this case, \(\Delta x\) is given as \(200\,\hbox {km}/2581 = 77.48\) m.

When we set \(\alpha =1.0\), the optimistic estimate giving the strict limit for \(\Delta t\) of Eq. (8), \(\Delta t\) must be smaller than 0.64 s. If we choose \(\Delta t = 0.5\) s, 3240 computation steps in total are required to simulate the tsunami for 1620 s of real time. Using the elapsed time for updating one cell obtained from our benchmarking results on GPU and FPGA in the previous sections (fifth row of Table 16), we can estimate the computation time for this scenario. As shown in the ninth row, the computation time on each hardware is estimated to be at most 70 s, much shorter than the 1620 s of real time.

To ensure higher reliability of our simulation for practical application, we make another estimate with \(\alpha = 0.5\). In this case, the upper limit for \(\Delta t\) is about 0.32 s. When we use \(\Delta t = 0.25\) s, 6480 steps are required to simulate the tsunami for 1620 s of real time. The computation time in each environment under this condition is still shorter than 1620 s of real time. Therefore, computation with either of the OpenCL kernels, for the Radeon GPU or the FPGA, is applicable to the forecast in this situation. Note that the performance of our implementation is independent of these constants.

Finally, we remark how small a \(\Delta x\) we can use to improve the accuracy of the simulation; in other words, we estimate how large an N we can use in our computation environments for tsunami forecasting. Let the computation domain be \(L\times L\), divided into \(\Delta x \times \Delta x\) grid cells, and let the total simulated time be T with the time step \(\Delta t\). The total number of cell updates in our stencil computation is then

$$\begin{aligned} \dfrac{T}{\Delta t}\times N^2 = \dfrac{T}{\Delta t}\times \left( \dfrac{L}{\Delta x}\right) ^2. \end{aligned}$$
(10)

When the computation time required for updating one cell in one time step is f, the total computation time \(T_\mathrm{comp}\) is given as follows:

$$\begin{aligned} T_\mathrm{comp} = \dfrac{T}{\Delta t}\times \left( \dfrac{L}{\Delta x}\right) ^2 \times f \end{aligned}$$
(11)

Setting \(T_\mathrm{comp} = T\) and using Eq. (9), we find the lower limit for \(\Delta x\):

$$\begin{aligned} \Delta x = \root 3 \of {\dfrac{\sqrt{gH} \times L^2 f}{\alpha }} \end{aligned}$$
(12)

Thus, we can calculate \(\Delta x\) and N using this formula and estimate the computation time of a particular simulation.

For the computation on the Radeon GPU (Code GPU), we obtain the limit \(\Delta x = 21.88\,\hbox {m}\) by substituting \(L=200\,\hbox {km}\), \(H=1500\,\hbox {m}\), \(f=1.08\times 10^{-9}\) s, and \(\alpha = 0.5\). To cover the computation domain with this \(\Delta x\), \(N = L/\Delta x\) is at most 9139. On the other hand, for the computation on the FPGA (Code MC2), we obtain the limit \(\Delta x = 30.45\,\hbox {m}\) using \(f=2.91\times 10^{-9}\) s; in this case, \(N = L/\Delta x\) is at most 6568.
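The following minimal C program reproduces these limits from Eq. (12); all constants are taken from the text:

```c
#include <math.h>
#include <stdio.h>

/* Lower limit of dx such that the simulation runs in real time, Eq. (12). */
int main(void) {
    const double g = 9.8, H = 1500.0, L = 200e3, alpha = 0.5;
    const double f[2] = { 1.08e-9, 2.91e-9 };    /* s per cell update */
    const char *name[2] = { "Radeon GPU (Code GPU)", "FPGA (Code MC2)" };
    for (int k = 0; k < 2; k++) {
        double dx = cbrt(sqrt(g * H) * L * L * f[k] / alpha);
        printf("%s: dx = %.2f m, N = %d\n", name[k], dx, (int)(L / dx));
    }
    return 0;  /* prints dx = 21.88 m (N = 9139) and dx = 30.45 m (N = 6568) */
}
```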

7 Conclusion

We developed tsunami simulation codes based on the MOST algorithm with spatial blocking applied, and parallelized them using OpenCL. OpenCL kernels can be executed not only on GPUs but also on FPGAs and other architectures. The best result of the performance benchmarking on GPU with a \(2581\times 2879\) computation grid is currently that the OpenCL code with \(2\times 2\) spatial blocking takes approximately 2.41 s (185.0 GFlops) on the AMD Radeon R9 280X GPU for 300 time steps.

In this paper, we aimed to achieve high performance on FPGA using OpenCL kernels different from the GPU implementation. We used the compiler provided with the Intel FPGA SDK to generate the FPGA designs automatically, which enables us to write OpenCL kernels the same way as for GPUs. Although the same kernel as developed for GPU can be executed on FPGA, it did not achieve the expected performance. The performance of an optimized kernel with shift registers reaches that of the original kernel running on the FirePro GPU. In addition, we restructured the GPU kernel code for the FPGA implementation; our latest OpenCL kernel, named Code MC2, supports SIMD-like operations to increase the parallelism in the FPGA design. With four computations per pipeline, our optimized kernel on the FPGA design achieved 6.49 s (68.7 GFlops) under the same conditions as the evaluations on GPU.

However, we have to write the OpenCL kernel in specific ways to obtain high-performance hardware as a MOST accelerator. In our current study, memory stalls, especially during stores to global memory, interrupt the computation in the OpenCL kernel and keep the performance of Code MC2 away from the hardware peak. We will investigate OpenCL coding techniques that translate into hardware that computes efficiently. In particular, there is room to improve the performance by exploiting the temporal parallelism presented in other papers.

On the other hand, the current implementation achieves sufficient performance in terms of applicability to real-time simulation: in the case of the 2011 Tohoku earthquake and tsunami in Japan, our computation is more than 20 times faster than real time. In the future, we will extend our MOST program to compute on nested grids, with high-resolution grids computed precisely for the areas of interest, in cooperation with other studies [18]. In that case, several variants of the computation kernel are required for the different computation domains. Customizing an OpenCL kernel is much easier and takes far less time than manually designing an FPGA.

Finally, we note that our OpenCL implementation is also applicable to distributed-memory clusters. One possible application is simulating different models concurrently on the individual nodes of such a cluster. Additionally, our OpenCL implementation can easily be used to model one large computation in parallel on various GPU clusters. In fact, we are currently working on optimizing our parallel implementation for GPU clusters using the Message Passing Interface (MPI). Furthermore, although we currently have no distributed cluster with FPGAs available, our OpenCL kernel should easily work on such clusters when they become available in the near future. For instance, Amazon Elastic Compute Cloud offers a compute instance with FPGAs (the F1 instance). We will evaluate the performance of our MPI+OpenCL implementation on the F1 instance in future publications.