1 Introduction

A tsunami is a secondary natural disaster that follows a submarine earthquake. Fast tsunami prediction is strongly desired for disaster prevention. When an earthquake occurs, we can forecast tsunami propagation using numerical simulations with an initial condition and the laws of physics governing the phenomenon. However, such simulations must cover a vast region and process a large amount of data, so sequential computation often cannot complete the simulation faster than real time. Accordingly, large-scale and real-time simulations require massively parallel computing technologies, with various parallel computing architectures, programming models, and languages.

Several previous works address high-performance tsunami simulation on modern computing systems based on the heterogeneous computing paradigm.

Imamura et al. [1] developed the TUNAMI-N1 tsunami simulation package based on the staggered leap-frog scheme. Gidra et al. [2] evaluated a CUDA parallelization of the TUNAMI-N1 code on an NVIDIA Quadro FX 1700 GPU. They reported results on ocean bathymetry data sets of various sizes for 7200 time steps. For a \(1040\times 668\) grid, they obtained a 5.86x speedup compared with sequential computation on a single processor.

Acuna and Aoki [3] used a Tesla M2050 GPU to solve the shallow water equations for tsunami simulation, with a numerical solution based on the CIP-CSL2 semi-Lagrangian scheme and the method of characteristics. They simulated tsunami propagation over a large grid covering the entire Pacific Ocean on the TSUBAME 2.0 system with multiple GPUs. Adaptive mesh refinement (AMR) reduced their memory usage by 20–40%, and they achieved 313 GFlops on a single GPU. Fujita [4] reported an accelerated tsunami simulation on FPGA. He manually extracted large data flow graphs from the program and compiled them into FPGA circuits. The computation grid was \(1040\times 668\), and the simulation ran for 7200 steps with a time step of 1 s. The FPGA tsunami simulation was shown to be 46 times faster than an Intel Core i7 processor at 2.93 GHz.

In this research, we investigate parallel computing algorithms and architectures suitable for high-performance tsunami simulation based on the method of splitting tsunami (MOST) [5, 6]. In the future, we will integrate our parallelized code into the tsunami visualization tool [7], which is currently under development for real-life applications such as determining where tetrapods or breakwaters are most effective for reducing tsunami damage. This research will therefore help make such tsunami modeling experiments with various initial conditions much faster.

MOST, our target algorithm for acceleration, is one of the solvers for the shallow water equations used in tsunami numerical simulation. The MOST algorithm can be considered a combination of the finite difference method and the Euler method for time integration. Our motivation for accelerating the MOST algorithm is to simulate tsunami propagation in real time, before the tsunami actually arrives at the coastal area. From the shallow water equations described in Sect. 2, the phase velocity of the wave motion is \(c = \sqrt{gH}\), where g is the gravitational acceleration and H is the sea depth. For instance, the average sea depth in the Pacific Ocean is known to be 4000 m, which gives a tsunami velocity of about \(c=712\,\hbox {km/h}\). When the distance between the coastal area and the epicenter is 100 km, the tsunami arrives at the coastal area after about 8.5 min. In this case, a prediction based on numerical simulation must be completed within this time limit.
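As a quick check of this arithmetic, the following minimal C program reproduces the numbers above; the constants are taken from the text:

```c
#include <math.h>
#include <stdio.h>

/* Phase velocity c = sqrt(gH) and arrival time D/c for the example above. */
int main(void) {
    const double g = 9.8;       /* gravitational acceleration, m/s^2 */
    const double H = 4000.0;    /* average sea depth, m */
    const double D = 100e3;     /* distance from epicenter to coast, m */
    double c = sqrt(g * H);     /* phase velocity, m/s */
    printf("c = %.0f km/h, arrival in %.1f min\n", c * 3.6, D / c / 60.0);
    return 0;                   /* prints roughly 713 km/h and 8.4 min */
}
```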

To speed up the simulations, we parallelized the MOST algorithm using OpenMP, OpenACC, and OpenCL (Open Computing Language) [8] and evaluated the performance on multi-core CPUs and GPUs. In that benchmarking, the best performance was 185 GFlops, obtained with OpenCL on an AMD Radeon 280X GPU [9].

On the other hand, Nagasu et al. [10] designed a stream computing architecture and hardware for practical tsunami simulation. They introduced multiple stream processing element (SPE) arrays with parallel internal pipelines to exploit further available hardware resources. Their implementation on an Arria 10 FPGA achieved a performance of 383 GFlops and a performance per power of 8.41 GFlops/W with six cascaded SPEs. This dedicated Arria 10 implementation therefore outperforms our best GPU implementation, and its performance per power is also better than that of the GPU implementation [10].

Meanwhile, there are several works that design FPGA accelerators using OpenCL. OpenCL is a well-known framework for parallel programming in heterogeneous environments. It is versatile enough to target various devices, including CPUs, GPUs, and reconfigurable systems such as FPGAs. With specific compilers, it is possible to generate a hardware design for an FPGA automatically from an OpenCL kernel, without explicitly designing the hardware architecture. Several studies have worked on generating FPGA designs from OpenCL kernels.

Takei et al. [11] implemented an FPGA accelerator, written in OpenCL, for the finite-difference time-domain (FDTD) method, which is widely used in electromagnetic simulation. They reported that the FPGA design generated from the OpenCL kernel computed about 10 times faster than their GPU implementation.

Tatsumi et al. [12] implemented an FPGA accelerator for stereo correspondence matching, exploiting pipeline stages for the Fourier transform efficiently on the FPGA. Waidyasooriya et al. [13] used an FPGA accelerator generated from an OpenCL kernel to simulate molecular dynamics. Their hardware implements loop pipelining and achieved more than a 4.6x speedup over a CPU while using only 36% of the Stratix V FPGA resources.

In more recent studies, Yinger et al. [14] presented an FPGA implementation of deep neural networks, formulated as matrix multiplication, written as an OpenCL kernel. Wang et al. [15] designed an FPGA accelerator for convolutional neural networks using OpenCL. Roozmeh and Lavagno [16] focused on the high energy consumption and power dissipation of modern datacenters and presented an FPGA accelerator that speeds up database join operations.

Houtgast et al. [17] implemented a highly efficient FPGA accelerator for the Smith–Waterman algorithm, which finds optimal pairwise alignments in bioinformatics. They implemented the accelerator with only 90 lines of OpenCL kernel code, about 20% of the length of their VHDL code.

As these works show, designing FPGA accelerators for various scientific applications using OpenCL is now feasible. Nevertheless, since FPGA design generation from OpenCL kernels is a recent technology, example applications are still scarce compared with GPU implementations. In this paper, we focus on accelerating the MOST algorithm using OpenCL. We have already developed an OpenCL implementation of the MOST algorithm that applies well-known spatial blocking. Here, we ported that OpenCL code and applied several optimizations to our previous kernel for benchmarking on an Arria 10 FPGA.

This paper presents the evaluation and comparison of the MOST algorithm written in OpenCL among four implementations:

  1. The originally developed kernel using spatial blocking on GPU, as the baseline;

  2. The same kernel as the GPU baseline compiled as an FPGA design (without any optimization for FPGA);

  3. An optimized kernel using shift registers for the FPGA design;

  4. A further optimized kernel that improves parallelism by expanding the width of the data path.

The rest of this paper is organized as follows. In Sect. 2, the outline of the MOST algorithm is given. Section 3 describes the original MOST algorithm and its parallelization using spatial blocking. Section 4 presents the OpenCL implementation and its performance on several GPUs, which serves as the baseline for the following evaluations on FPGA. Section 5 presents the evaluation of the automatically generated OpenCL FPGA design and its further optimizations. Section 6 discusses and compares the OpenCL implementations on GPU and FPGA. Finally, Sect. 7 concludes this paper with a mention of future work.

2 MOST: method of splitting tsunami

First, we show the original MOST algorithm for the solution of the shallow water equations. The shallow water equations, a nonlinear approximation of the shallow water system, are represented by the following partial differential equations (PDEs) [5, 6]:

$$\begin{aligned} u_t + uu_x + vu_y + gH_x &= gD_x ,\\ v_t + uv_x + vv_y + gH_y &= gD_y ,\\ H_t + (uH)_x + (vH)_y &= 0 . \end{aligned}$$
(1)

Here, \(H = H(x,y,t) = \eta (x,y,t) + D(x,y)\), where \(\eta \) and D are the wave height and the depth profile (bathymetry), respectively, u and v are the wave velocities along each spatial coordinate, and g is the gravitational acceleration. Figure 1 schematically shows these quantities in a 1-D plot.

Fig. 1

1-D representation for wave propagation characteristics

An alternative form of Eq. (1) is represented as follows:

$$\begin{aligned} \frac{\partial \mathbf {z}}{\partial t} + \mathbf {A} \frac{\partial \mathbf {z}}{\partial x} + \mathbf {B} \frac{\partial \mathbf {z}}{\partial y} = \mathbf {F} , \end{aligned}$$
(2)

where

$$\begin{aligned} \mathbf {z} = \begin{pmatrix} u \\ v \\ H \end{pmatrix}, \quad \mathbf {A} = \begin{pmatrix} u & 0 & g \\ 0 & u & 0 \\ H & 0 & u \end{pmatrix}, \quad \mathbf {B} = \begin{pmatrix} v & 0 & 0 \\ 0 & v & g \\ 0 & H & v \end{pmatrix}, \quad \mathbf {F} = \begin{pmatrix} gD_x \\ gD_y \\ 0 \end{pmatrix}. \end{aligned}$$

The numerical treatment of MOST is based on two auxiliary systems. Applying spatial splitting to Eq. (2) along each coordinate, we get two auxiliary systems, \(\mathbf {\Phi } = (u, 0, H)^T\) and \(\mathbf {\Psi } = (0, v, H)^T\), each of which depends on only one spatial variable:

$$\begin{aligned} \frac{\partial \mathbf {\Phi }}{\partial t} + \mathbf {A} \frac{\partial \mathbf {\Phi }}{\partial x} = \mathbf {F}_1 , \end{aligned}$$
(3a)

$$\begin{aligned} \frac{\partial \mathbf {\Psi }}{\partial t} + \mathbf {B} \frac{\partial \mathbf {\Psi }}{\partial y} = \mathbf {F}_2 , \end{aligned}$$
(3b)

where

$$\begin{aligned} \mathbf {F}_1 = \begin{pmatrix} gD_x \\ 0 \\ 0 \end{pmatrix}, \quad \mathbf {F}_2 = \begin{pmatrix} 0 \\ gD_y \\ 0 \end{pmatrix}. \end{aligned}$$

The MOST algorithm uses the method of characteristics for the numerical solution. For the solution along the x-coordinate, Eq. (3a) is transformed into the following form:

$$\begin{aligned} \frac{\partial \mathbf {W}}{\partial t} + \mathbf {A'} \frac{\partial \mathbf {W}}{\partial x} = \mathbf {F}_1' , \end{aligned}$$
(4)

where

$$\begin{aligned} \mathbf {W} = \begin{pmatrix} v \\ u+2\sqrt{gH} \\ u-2\sqrt{gH} \end{pmatrix}. \end{aligned}$$
(5)

Here, the elements of \(\mathbf {W}\) are the Riemann invariants, which are constant along the characteristic curves of the equation, and the diagonal matrix \(\mathbf {A'}\) and the vector \(\mathbf {F}_1'\) are expressed as follows:

$$\begin{aligned} \mathbf {A'} = \begin{pmatrix} \lambda _1 & 0 & 0 \\ 0 & \lambda _2 & 0 \\ 0 & 0 & \lambda _3 \end{pmatrix}, \quad \mathbf {F}_1' = \begin{pmatrix} 0 \\ gD_x \\ gD_y \end{pmatrix}, \end{aligned}$$
(6)

where \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are the eigenvalues of \(\mathbf {A}\):

$$\begin{aligned} \lambda _1 = u, \lambda _2 = u + \sqrt{gH}, \lambda _3 = u - \sqrt{gH} . \end{aligned}$$

For the numerical solution of Eq. (4), the following finite difference method (FDM) with explicit Euler time integration is applied:

$$\begin{aligned}&\frac{\mathbf {W}^{n+1}_{i,j} - \mathbf {W}^{n}_{i,j}}{\Delta t} + \mathbf {A'} \frac{\mathbf {W}^{n}_{i+1,j} - \mathbf {W}^{n}_{i-1,j}}{2 \Delta x} \\&\qquad - \mathbf {A'} \Delta t \frac{\mathbf {A'}(\mathbf {W}^{n}_{i+1,j} - \mathbf {W}^{n}_{i,j}) - \mathbf {A'}(\mathbf {W}^{n}_{i,j} - \mathbf {W}^{n}_{i-1,j})}{2 \Delta x^2} \\&\quad = \frac{\mathbf {F'}_{i+1,j} - \mathbf {F'}_{i-1,j}}{2 \Delta x} - \mathbf {A'} \Delta t \frac{\mathbf {F'}_{i+1,j} - 2\mathbf {F'}_{i,j} + \mathbf {F'}_{i-1,j}}{2 \Delta x^2}. \end{aligned}$$
(7)

Here, n denotes the nth computational step, and i and j correspond to the x- and y-coordinates, respectively. \(\Delta t\) and \(\Delta x\) denote the time step and the grid resolution, respectively. The stability criterion for the MOST algorithm can be written as the following relationship between the time step and the grid resolution [18]:

$$\begin{aligned} \Delta t \le \frac{\Delta x}{\sqrt{gH}} . \end{aligned}$$
(8)

The actual calculation procedure for one time step is summarized as follows:

  1. u, v, and H are transformed into the Riemann invariants by Eq. (5).

  2. The solution along the x-coordinate is calculated by Eq. (7).

  3. The variables are transformed back to the original u, v, and H.

  4. v, u, and H are transformed by the equations corresponding to Eq. (5) for the y-coordinate.

  5. The solution along the y-coordinate is calculated.

  6. The variables are transformed back to the original v, u, and H.

In total, this procedure requires 200 floating-point operations for updating one cell.
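As an illustration of steps 1 and 3, the forward and inverse transformations of Eq. (5) can be written per cell as follows; this is a minimal sketch with our own function names, not the kernel code shown in later figures:

```c
#include <math.h>

/* Transformation to the Riemann invariants of Eq. (5) and its inverse. */
static inline void to_riemann(float u, float v, float H,
                              float *w1, float *w2, float *w3) {
    const float g = 9.8f;
    float s = 2.0f * sqrtf(g * H);
    *w1 = v;                      /* invariant along lambda_1 = u            */
    *w2 = u + s;                  /* invariant along lambda_2 = u + sqrt(gH) */
    *w3 = u - s;                  /* invariant along lambda_3 = u - sqrt(gH) */
}

static inline void from_riemann(float w1, float w2, float w3,
                                float *u, float *v, float *H) {
    const float g = 9.8f;
    *v = w1;
    *u = 0.5f * (w2 + w3);
    float s = 0.25f * (w2 - w3); /* equals sqrt(gH) */
    *H = s * s / g;
}
```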

The accuracy of simulations based on the MOST algorithm generally depends on the following three factors:

  • The algorithm and program for calculating tsunami wave propagation,

  • The accuracy of the bathymetry data,

  • The accuracy of the generated initial wave displacement.

For the first factor, the original MOST is second-order accurate in space and first-order accurate in time, and it is standard, well-verified software used in sequential computation. In this paper, we applied various optimizations for parallelization to the original algorithm and found no significant difference in the results due to these optimizations. On the other hand, we use single-precision (SP) floating-point operations in our evaluation. In the majority of the Pacific Ocean, the sea depth is roughly 4000 m on average, so we consider SP arithmetic sufficiently accurate. However, in some areas, such as ocean trenches, the sea depth exceeds 10,000 m. When the difference in sea depth between two adjacent cells is very large, we observed that SP arithmetic causes large numerical errors. For such cases, we can easily switch to double-precision floating-point operations in our OpenCL-based parallelization of the MOST algorithm.
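A minimal sketch of how such a precision switch can be implemented in an OpenCL kernel; the real typedef and the USE_DOUBLE macro are our own illustration, not the exact code of our kernels:

```c
/* Compile the kernel with -DUSE_DOUBLE to switch to double precision. */
#ifdef USE_DOUBLE
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double real;
#else
typedef float real;
#endif

__kernel void update(__global real *h, __global real *u, __global real *v)
{
    /* ... all arithmetic in the kernel uses the 'real' type ... */
}
```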

To simulate a tsunami generated by an earthquake in practice, a deformation model of the sea floor [19] can be used to compute the initial H, u, and v. The initial condition is modeled by parameters such as the epicenter (the point on the Earth's surface vertically above the earthquake source), the earthquake magnitude, and the distance between the earthquake source and the epicenter. In this paper, we use a flat bathymetry and a simple initial wave displacement, as described in Sect. 4. We believe this choice does not significantly affect our performance evaluation.

3 Algorithms for parallelization

3.1 Original computing algorithm

Before presenting the details of the optimizations for the MOST algorithm, we show the original computing algorithm of MOST presented in Sect. 2. Assume that quantities such as D, H, u, and v are stored in 2-D arrays. Given D, H, u, and v at time step \(n=0\), we update H, u, and v on every time step. In the original MOST program, the data are stored in structure-of-arrays (SOA) format: each quantity is kept in a separate 2-D array.

Each 2-D data array is updated through a 1-D temporary array, based on the scheme shown in Sect. 2. Figure 2 shows the procedure for updating the data along the longitude in one time step.

Fig. 2

Procedure for data updating along x-direction characteristics

In this case, updating is conducted row by row in the following steps. First, the data of the selected row are copied from the 2-D array to a 1-D temporary array. Second, H, u, and v are transformed into the Riemann invariants. Third, the FDM and the Euler method are applied to each cell of the 1-D temporary array. Fourth, the Riemann invariants are transformed back into H, u, and v. Finally, the updated data in the 1-D temporary array are copied back to the 2-D array. The update along the longitude is completed by applying this procedure to all rows.
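The following is a hedged sketch of this row-wise update in C; the helper functions correspond to the steps above, but their names and signatures are ours:

```c
#include <stdlib.h>
#include <string.h>

/* Steps of the row update; declared here, defined elsewhere (our naming). */
void to_riemann_row(float *H, float *u, float *v, int n);
void fdm_euler_row(float *H, float *u, float *v, int n);
void from_riemann_row(float *H, float *u, float *v, int n);

void update_along_x(float *H, float *u, float *v, int rows, int cols) {
    float *tH = malloc(cols * sizeof *tH);   /* 1-D temporary arrays */
    float *tu = malloc(cols * sizeof *tu);
    float *tv = malloc(cols * sizeof *tv);
    for (int j = 0; j < rows; j++) {
        memcpy(tH, &H[j * cols], cols * sizeof *tH);  /* copy row out */
        memcpy(tu, &u[j * cols], cols * sizeof *tu);
        memcpy(tv, &v[j * cols], cols * sizeof *tv);
        to_riemann_row(tH, tu, tv, cols);             /* Eq. (5) */
        fdm_euler_row(tH, tu, tv, cols);              /* Eq. (7) */
        from_riemann_row(tH, tu, tv, cols);           /* invert Eq. (5) */
        memcpy(&H[j * cols], tH, cols * sizeof *tH);  /* copy row back */
        memcpy(&u[j * cols], tu, cols * sizeof *tu);
        memcpy(&v[j * cols], tv, cols * sizeof *tv);
    }
    free(tH); free(tu); free(tv);
}
```

The update along the latitude is analogous, except that each column is gathered with stride cols, which is the source of the cache misses noted below.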

Fig. 3

Procedure for updating along y-direction characteristics

Afterward, the 2-D data are processed along the latitude. As shown in Fig. 3, this procedure is very similar to the update along the longitude; in this case, the computations are conducted on every column. Thus, H, u, and v in the 2-D arrays are updated in one time step. Importantly, this algorithm has a high probability of cache misses for every column copied into the 1-D array, because C/C++ stores planar data in memory row-wise (i.e., longitude-wise).

3.2 Algorithm with spatial blocking

In our GPU implementation, spatial blocking is applied to the original MOST algorithm in order to obtain a high level of parallelism on the GPU. The data in the 2-D arrays are first divided into spatial blocks, and each spatial block is then updated independently.

Let \(N_\mathrm{bsize}\) be the block size of each spatial block. As shown in Fig. 4, a spatial block is extracted to update the central \(N_\mathrm{bsize}\times N_\mathrm{bsize}\) cells of the 2-D arrays. To update a block of \(N_\mathrm{bsize} \times N_\mathrm{bsize}\) data cells, \((N_\mathrm{bsize}+2) \times (N_\mathrm{bsize}+2)\) cells are actually used, since a halo is required to update the boundary cells of the block.

Fig. 4

Extracting the cells to be updated in a spatial block from the entire 2-D array (in the case where the central \(N_\mathrm{bsize} \times N_\mathrm{bsize}\) cells are updated)

Table 1 Number of data loads and stores required for updating \(N_\mathrm{bsize} \times N_\mathrm{bsize}\) cells in the stencil computation

In this figure, the red-colored cells are updated by the stencil computation, and the other cells represent the halo. Table 1 summarizes the number of data loads and stores actually required for the stencil computation. In the case of \(N_\mathrm{bsize} = m\), the amount of computation for updating one block is \(C = fm^2\), where f is the number of floating-point operations per cell, and the total number of memory references is \(M = 2m^2+4m\). Therefore, the computational intensity C / M is given as

$$\begin{aligned} \dfrac{C}{M} = \dfrac{fm^2}{2m^2+4m} = \dfrac{f}{2+\dfrac{4}{m}}. \end{aligned}$$

Clearly, \(m=1\) gives the highest level of parallelism, since the number of blocks is maximized. In contrast, a larger m is desirable for higher computational intensity; for example, \(m=2\) yields f / 4, approaching the asymptotic value f / 2 as m grows. Both high parallelism and high computational intensity are required for high-performance computation on GPUs because of their massive hardware parallelism. The optimal \(N_\mathrm{bsize}\), at which parallelism and computational intensity are balanced, differs for each GPU. We examine the optimal \(N_\mathrm{bsize}\) for each hardware in the following benchmarking.

4 Implementation and evaluation on GPUs

We parallelized the MOST algorithm based on the spatial blocking described in Sect. 3. In this section, we present the performance evaluation of our OpenCL implementation on GPUs.

Throughout the present work, we measured the execution time of our code for 300 time steps. This particular number of time steps was chosen just for the evaluation in this paper; practical simulations sometimes require many more steps. However, we have confirmed that our implementation scales to any number of computation steps, and that the number of steps does not affect the performance evaluation.

The size of the 2-D array is \(2581\times 2879\), which equals the size of the existing bathymetry of the entire Pacific Ocean used by the original MOST program. For simplicity, in this benchmarking we used a flat bathymetry with the constant depth \(D = 2500\) m over the whole computation grid. We generate the initial wave at the center of the computational grid as a cosine-shaped displacement with a peak height of 10 m.
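A minimal sketch of such an initial displacement is shown below; the radius R of the cosine hump (in cells) and the exact profile are our assumptions for illustration:

```c
#include <math.h>

/* Cosine-shaped initial displacement with a 10 m peak at the grid center. */
void init_wave(float *eta, int rows, int cols) {
    const float PI = 3.14159265f;
    const float peak = 10.0f;          /* peak height, m */
    const float R = 50.0f;             /* hump radius in cells (assumed) */
    const float xc = cols / 2.0f, yc = rows / 2.0f;
    for (int j = 0; j < rows; j++)
        for (int i = 0; i < cols; i++) {
            float r = hypotf(i - xc, j - yc);
            /* smooth hump: 'peak' at r = 0, zero for r >= R */
            eta[j * cols + i] =
                (r < R) ? 0.5f * peak * (1.0f + cosf(PI * r / R)) : 0.0f;
        }
}
```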

For the treatment of boundary conditions in the MOST algorithm, a reflecting boundary condition is applied at the boundary between sea and land, and an open boundary condition, through which waves pass to the outside of the computation domain, is applied at the edges of the domain. In our evaluation, there is no land inside the computation domain, and we apply only the open boundary condition at all edges.

In this section, we show the specification of performance benchmarking, implementation, and performance evaluation on GPUs, respectively.

4.1 GPUs for performance benchmarking

Our MOST code was written in C++, so we used the g++ (ver. 4.8) compiler for the benchmarking on GPUs. The following AMD GPUs and NVIDIA GPU were used in this performance evaluation: Radeon R9 280X, FirePro W8100 and W9100 (see Tables 2, 3), and Tesla K20c (see Table 4).

Table 2 Hardware specification of AMD GPU, Radeon
Table 3 Hardware specification of AMD GPU, FirePro
Table 4 Hardware specification of NVIDIA GPU, Tesla

The last row of these tables shows the theoretical peak performance of single-precision (SP) arithmetic for each architecture.

4.2 Performance evaluation of GPU implementation

As described in Sect. 3, our OpenCL kernel for the MOST algorithm is based on spatial blocking. Before starting the computation, memory is allocated on the GPU for the variables used in the computation. After that, the quantities such as D, H, u, and v stored in the 2-D arrays, together with constants such as the gravitational acceleration g and the spatial block size m, are transferred to global memory on the GPU. The number of threads (work items) is determined by the number of spatial blocks computed in parallel. For efficient parallel computation on the GPU, the total number of threads is set to a multiple of 128.
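A hedged host-side sketch of this launch configuration; the helper and variable names are ours:

```c
#include <CL/cl.h>

/* Round the global work size up to a multiple of 128. */
static size_t round_up_128(size_t n) { return (n + 127) / 128 * 128; }

/* queue and kernel are previously created OpenCL objects; one work item
   processes one spatial block. */
cl_int launch_most(cl_command_queue queue, cl_kernel kernel, size_t num_blocks)
{
    size_t global = round_up_128(num_blocks);
    size_t local  = 128;   /* work-group size */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}
```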

Figure 5 shows an overview of our baseline OpenCL kernel. This kernel is executed by every thread (work item) running on the GPU. In OpenCL kernels, built-in functions such as get_global_id() are provided to identify threads; in our case, we use them to assign each thread to a specific spatial block.

Fig. 5

Overview of baseline OpenCL (Code GPU) implementation

Lines 19 to 32 in Fig. 5 show the portion of the kernel that copies the data such as D, H, u, and v required for the stencil computation from the 2-D arrays in global memory. In fact, the data in global memory are stored as 1-D arrays. The macro function GET() is defined to convert the 1-D data in global memory into the 2-D private-memory arrays, whose names end with the _g suffix and which represent the spatial block. As we can see, the data for the stencil computation are stored in SOA format.
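A hedged sketch of this indexing and copy; the actual macro definition in our kernel may differ, and COLS, NBSIZE, and the variable names here are our own:

```c
#define COLS 2581   /* row length of the global 2-D data (assumed) */
#define NBSIZE 2    /* spatial block size */
#define GET(a, i, j) ((a)[(j) * COLS + (i)])

__kernel void copy_block_example(__global const float *h_global, int bx, int by)
{
    /* private-memory block including the one-cell halo */
    float h_g[NBSIZE + 2][NBSIZE + 2];
    for (int j = 0; j < NBSIZE + 2; j++)
        for (int i = 0; i < NBSIZE + 2; i++)
            h_g[j][i] = GET(h_global, bx + i - 1, by + j - 1);
    /* ... stencil computation on h_g follows ... */
}
```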

Here, private memory is the GPU memory assigned to each thread individually, basically allocated in registers. In general, variables declared in an OpenCL kernel without any address-space qualifier are placed in private memory. Nevertheless, the private memory is not expected to be large enough for all the variables used in our MOST implementation shown in Fig. 5, so some variables spilled from registers are stored in global memory.

After the data copy is finished, the computation in the spatial block follows, including the transformation to the Riemann invariants and the update by the Euler method.

4.3 Performance evaluation on GPU

Tables 5 and 6 show the computation time and performance of the baseline OpenCL code on each GPU. The computation time is converted into GFlops, considering that 200 floating-point arithmetic operations are required to update one cell of the stencil.

Table 5 Computation time of the original OpenCL kernel on AMD FirePro W8100, W9100, Radeon R9 280X, and NVIDIA Tesla K20c (unit: s)
Table 6 Performance of the original OpenCL kernel on AMD FirePro W8100, W9100, Radeon 280X, and NVIDIA Tesla K20c (unit: GFlops)

The optimal value of \(N_\mathrm{bsize}\) differs for each architecture. The performance on the NVIDIA Tesla GPU peaks at \(N_\mathrm{bsize}=1\) or 2. \(N_\mathrm{bsize}=4\) is optimal on the AMD FirePro GPUs, where the performance also depends on the GPU model. The AMD Radeon GPU achieved the best performance among all the architectures we evaluated: its optimal \(N_\mathrm{bsize}\) is 2, the computation finishes within 2.5 s, and the performance is 185 GFlops, given by multiplying the number of grid points (\(2581\times 2879\)), the number of floating-point operations per cell (200), and the number of time steps (300), and dividing by the computation time.
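Explicitly, the conversion used throughout this paper is

$$\begin{aligned} \mathrm{Flops} = \frac{2581 \times 2879 \times 200 \times 300}{T_\mathrm{comp}} \approx \frac{4.46\times 10^{11}}{T_\mathrm{comp}}, \end{aligned}$$

which yields 185 GFlops for \(T_\mathrm{comp} = 2.41\) s.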

From specifications such as the GPU clock frequency and the single-precision floating-point throughput, the AMD FirePro W9100 GPU (see Table 3) was expected to achieve the best performance. Although both Radeon and FirePro GPUs are produced by AMD and its partners, the two series are designed for different purposes [20]. The difference between the Radeon and FirePro series was observed using CodeXL, a performance profiling tool.

For the original kernel on the Radeon GPU, the cache hit rate reaches 60–70%, and 138 vector registers are used. Moreover, vector ALU instructions are processed during more than 85% of the GPU computation time, which is nearly optimal. Conversely, on the FirePro GPUs, the cache hit rate is less than 10%, and only 97 vector registers are used; vector ALU instructions are processed during only about 15% of the computation time. Furthermore, we detected memory stalls and write stalls during about 25% of the computation time on the FirePro GPUs.

The difference in performance between the Radeon and FirePro GPUs originates in the difference in the instructions generated for each GPU architecture. Since the Radeon and FirePro GPUs target the consumer market and the professional graphics market, respectively, the device drivers responsible for emitting machine instructions differ. We examined the machine instructions for both GPUs and found that data are loaded from global memory in different ways.

For Radeon, with the device driver OpenCL 1.2 AMD-APP version 1729.3 targeting consumer graphics, we found that the generated code explicitly uses the texture cache to load data from global memory. The texture cache is highly effective for loading read-only data from global memory, which explains the high cache hit rate we observed. For FirePro, with the device driver OpenCL 2.0 AMD-APP version 1642.5 targeting professional compute and graphics, we found that the texture cache is not used; accordingly, the cache hit rate is as low as 10%. At the moment, we cannot explicitly use the texture cache on FirePro. An alternative way to mitigate this problem in OpenCL is to explicitly use local memory, which is shared by the work items in the same work group, for caching data from global memory. This should also improve the performance on the Tesla K20c. In our recent work [21], we evaluated the performance of optimized kernels using local memory on GPUs.
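A minimal sketch of this local-memory alternative (not the kernel evaluated in this paper; the tile size and names are our assumptions):

```c
#define TILE 16   /* square work-group size (assumed) */

__kernel void stage_tile_example(__global const float *h_global, int cols)
{
    /* tile plus a one-cell halo, shared by the work group */
    __local float tile[TILE + 2][TILE + 2];
    const int lx = get_local_id(0), ly = get_local_id(1);
    const int gx = get_global_id(0), gy = get_global_id(1);

    tile[ly + 1][lx + 1] = h_global[gy * cols + gx];
    /* halo loads by boundary work items omitted for brevity */
    barrier(CLK_LOCAL_MEM_FENCE);   /* wait until the tile is complete */
    /* ... the stencil then reads from 'tile' instead of global memory ... */
}
```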

5 Implementation and evaluation on FPGA

In this section, we present the implementation and evaluation of the tsunami simulation as FPGA designs. We modify and further optimize the OpenCL kernel implemented for GPUs to accommodate the FPGA architecture.

5.1 FPGA for performance benchmarking

For the benchmarking of the OpenCL code on FPGA, we use a specific compiler that designs the hardware automatically from the OpenCL kernel. We used the aoc compiler (Intel FPGA SDK for OpenCL, 64-bit Offline Compiler, Quartus 16.0.2). In this paper, we show the results of the performance benchmarking on a DE5a-Net Arria 10 FPGA board, which has two independent DDR3 memory banks.

We show the benchmarking of four OpenCL kernels: the original baseline code described above (Code GPU), an optimization for FPGA shown later (Code SR), and two further optimizations that expand the width of the data path (Code MC1 and MC2). Code MC1 and MC2 present techniques to improve the parallelism within one pipeline. The benchmarking on FPGA is conducted under the same initial conditions as on GPU.

5.2 Optimization by using shift register and its performance on FPGA

A cache system is an on-chip memory element for loading and storing data efficiently. The spatial blocking we applied to the MOST algorithm assumes that cache memory (or local memory) is available for efficient memory access. Therefore, we cannot obtain high performance on FPGA with an algorithm that depends on cache memory.

As an optimization for FPGA, the OpenCL kernel can be written so that shift registers are used for loading and storing data on the FPGA [22]. In every clock cycle, a new datum is shifted into the array shown in Fig. 6. Assuming COLS is the number of columns of the entire computation domain, we use a shift register of size \(2\times \) COLS+3 for the \(3\times 3\) stencil. After a sufficient number of data have been inserted into the shift register to update the central element of the stencil, the computation starts. In this implementation, the parallelism between loop iterations is extracted, and loop pipelining is generated by the compiler.

Fig. 6

The data held by the shift register that are required for the stencil computation when the quantities at element (i, j) are updated

Figure 7 shows an overview of this implementation. The kernel is written to be executed by a single thread, which is known as task-parallel programming.

The 1-D arrays named urows, qrows, and others with the rows suffix represent the shift registers, which are stored in private memory in SOA format. Lines 7 to 17 show the implementation of the shift registers. New data are shifted into the buffer every cycle; unrolling this loop allows the compiler to infer a shift register. In addition, by unrolling every loop in the kernel, the compiler attempts to pipeline the loops and enables multiple iterations of every loop to execute concurrently.
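The following is a minimal sketch of this shift-register idiom; the array and macro names are ours, and it is simplified to a single quantity, unlike the full kernel of Fig. 7:

```c
#define COLS 2581                /* columns of the computation domain (assumed) */
#define SR_LEN (2 * COLS + 3)    /* shift-register length for a 3x3 stencil */

__kernel void most_sr_example(__global const float *h_in,
                              __global float *h_out, int n)
{
    float hrows[SR_LEN];         /* inferred as a shift register by aoc */
    for (int idx = 0; idx < n; idx++) {
        #pragma unroll           /* fully unrolled shift */
        for (int i = SR_LEN - 1; i > 0; i--)
            hrows[i] = hrows[i - 1];
        hrows[0] = h_in[idx];    /* one new datum enters per cycle */
        if (idx >= SR_LEN - 1) {
            /* the 3x3 stencil sits at offsets 0..2, COLS..COLS+2, and
               2*COLS..2*COLS+2; its center is hrows[COLS + 1] */
            float center = hrows[COLS + 1];
            h_out[idx - (COLS + 1)] = center;  /* the real update (Eq. (7)) goes here */
        }
    }
}
```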

Fig. 7

Overview of OpenCL implementation by using shift registers (Code SR)

Table 7 summarizes the resource utilization of the hardware designs automatically generated from our two OpenCL kernels on the DE5a-Net Arria 10 FPGA. The second column lists the available resources of the FPGA. The third column, Code GPU, shows the resource utilization of the original kernel developed for GPUs (Fig. 5), and the last column, Code SR, shows the resource utilization of the design generated from the optimized kernel using the shift registers (Fig. 7). Here, the original kernel is compiled into the FPGA design assuming \(N_\mathrm{bsize}=1\).

Table 7 Resource Usage on DE5a-Net Arria 10 FPGA generated automatically from OpenCL kernels

We conducted performance benchmarking of these two OpenCL kernels. First, the computation of Code GPU, whose clock frequency is 242.24 MHz, takes 2.5 h for 300 steps. As mentioned, this kernel was implemented for GPUs, and spatial blocking is a technique to utilize cache memory effectively. The original OpenCL kernel can thus run as an FPGA design, but it is far from achieving sufficient performance.

In contrast, in Code SR, the optimized kernel, the shift registers were well pipelined, the compiler exploited the loop parallelism, and successive iterations launched every cycle on the FPGA. This resulted in a large performance improvement: the computation time of Code SR is 10.53 s, which is 827 times faster than Code GPU.

Table 8 Number of floating-point operators on FPGA generated automatically from Code SR

Table 8 shows the number of floating-point operators on the FPGA generated from Code SR; 201 operators are used in total. We used the compile option -mad-enable to extract multiply–add operations from the OpenCL kernel; the fifth column shows the number of multiply–add operators actually generated by the compiler. The operators in the sixth column comprise 6 fpext and 26 fptrunc operations, which seem to be generated as auxiliary operators for the dividers or Sqrt. For the performance evaluation of the FPGA design, we assume that 200 operators are used, the same number as the floating-point operations on GPU.

The clock frequency of the generated FPGA design is 248.63 MHz. We can thus estimate the peak performance as \(0.248\,\mathrm{GHz}\times 200=50\) GFlops. The actual performance obtained from the computation time, calculated in the same way as on GPU, is 42.3 GFlops, which is 85% of the hardware peak performance.

We profiled the Code SR kernel using the Altera Dynamic Profiler for OpenCL. The profiling confirms that the kernel occupancy and the bandwidth efficiency of the data transfers stay at almost 100% during the computation. However, storing data to global memory, shown at Line 43 in Fig. 7, causes memory stalls of up to 8%, which leads to a performance drop.

5.3 Multiple computations techniques on one pipeline stage for increasing parallelism and its performance on FPGA

Estimating from Table 7, we expect that our device can implement at most four computation pipelines. This corresponds to SIMD-like operation obtained by widening the data path within the same pipeline stage.

In that case, the performance of the OpenCL kernel on FPGA is expected to approach that on GPU. In our implementations, the number of pipelines is equal to the number of data inserted into the shift register and updated on it in one computation step.

Fig. 8

Shift register holding the data required for the stencil computation when updating two or three cells (green-colored cells) in one computation step

Figure 8 shows examples of shift-register designs for updating two or three cells in one computation step. The size of the 1-D array representing the shift register varies with the number of cells updated in one computation step.

Let \(N_\mathrm{buf}\) be the number of cells updated in the pipeline stage in one computation step; the length of the shift register is then \(2\times \mathrm{COLS} + (2+N_\mathrm{buf})\). This can be implemented by changing the loop conditions in Fig. 7, which is very similar to changing \(N_\mathrm{bsize}\) for the stencil computation on GPU. By compiling the modified code with aoc, the loops are unrolled, and multiple SIMD-like operations are generated.
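A hedged sketch of this widening, under the same naming assumptions as the previous sketch; NBUF, out_base, and stencil_update are our own placeholders:

```c
#define NBUF 4                              /* cells updated per cycle */
#define SR_LEN (2 * COLS + 2 + NBUF)        /* widened shift register */

/* inside the main loop of the single work-item kernel sketched earlier: */
#pragma unroll
for (int i = SR_LEN - 1; i >= NBUF; i--)    /* shift by NBUF positions */
    hrows[i] = hrows[i - NBUF];
#pragma unroll
for (int k = 0; k < NBUF; k++)              /* NBUF new data enter per cycle */
    hrows[NBUF - 1 - k] = h_in[idx * NBUF + k];
#pragma unroll
for (int k = 0; k < NBUF; k++)              /* NBUF stencil updates per cycle */
    h_out[out_base + k] = stencil_update(hrows, k);  /* hypothetical helper */
```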

We call this kernel Code MC1. We conducted the benchmarking under the same conditions; the generated designs for \(N_\mathrm{buf}=2\) and 4 are summarized in Table 9. Code MC1 especially consumes the memory-bit resources of the FPGA as the number of computations per pipeline stage increases. Table 10 shows the computation time of Code MC1. Although Code MC1 gives correct results, the performance degraded significantly even as the number of pipelines increased.

Table 9 Resource Usage on DE5a-Net Arria 10 FPGA generated from Code MC1
Table 10 Computation time and hardware clock frequency of Code MC1 (generated for \(N_\mathrm{buf}=2\) to 4)

In this benchmarking, our implementations for both GPU and FPGA are based on the spatial blocking algorithm. As mentioned before, spatial blocking presumes cache memory for efficient computation. The purpose of using kernels based on spatial blocking was to compare the performance of similar OpenCL codes on different architectures. Since there is no longer room to optimize the spatial blocking kernel for further performance on FPGA, we redesigned the OpenCL kernel for the FPGA.

Figure 9 shows an overview of the new OpenCL kernel, named Code MC2, without spatial blocking. The major modification in Code MC2 is the reduction of memory accesses in the kernel code.

Fig. 9

Overview of OpenCL kernel further optimized for FPGA design (Code MC2)

Fig. 10

Illustration of the difference in memory references between Code MC2 and the other implementations

Figure 10 illustrates the difference in memory accesses between Code MC2 and the previous implementations. In Code MC1, the data in the shift register are first copied to a spatial block represented by a 2-D array, and the stencil computation is processed in every computation step. In contrast, Code MC2 is designed to update the data in the shift registers directly, without copying them to other memory spaces. In the MOST algorithm, the transformation to the Riemann invariants is conducted for the stencil computation; it appears twice per computation step, before and after the finite difference update of the 2-D array. Code MC2 conducts the transformation only once per datum, when the datum is inserted into the shift register. Namely, in this implementation, the shift register always holds data already transformed for the MOST algorithm, and the finite difference computation can be applied directly to the stencil region of the shift register. After the update, only the data to be stored back to global memory are transformed back.

In addition, the data structure is also replaced. The data such as D, q, u, and v for the stencil computation are loaded from global memory in SOA format; in Code MC2, they are stored in private memory in array-of-structures (AOS) format. In our case, the structure members used for the stencil computation of each cell are aligned close to each other in memory, which is expected to improve the efficiency of memory access. We also tried storing the data in SOA format, the same layout as the previous FPGA and GPU implementations; however, the OpenCL kernel with the SOA layout for Code MC2 failed to generate an FPGA design correctly, because of enormous memory usage during compilation in our compilation environment.
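A minimal sketch of this layout change; the struct name and member set are our own illustration:

```c
/* SOA (previous kernels): one shift register per quantity */
float urows[SR_LEN], vrows[SR_LEN], qrows[SR_LEN], drows[SR_LEN];

/* AOS (Code MC2): one shift register of per-cell structures */
typedef struct {
    float u, v, q, d;      /* all members needed to update one cell */
} cell_t;
cell_t rows[SR_LEN];       /* the members of one cell are adjacent in memory */
```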

Besides, there are several divisions by constants in the original MOST program. To reduce the number of floating-point divider operators in the FPGA design, we replaced these divisions with multiplications by the reciprocal.
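For example (with our own variable names), a division by \(2\Delta x\) as in Eq. (7) becomes:

```c
const float inv_2dx = 1.0f / (2.0f * dx);    /* reciprocal computed once */
float flux = (w_p - w_m) * inv_2dx;          /* instead of (w_p - w_m) / (2.0f * dx) */
```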

Table 11 Computation time and hardware clock frequency of Code MC2 (generated for \(N_\mathrm{buf}=1\) to 5)

For the performance benchmarking, we measured the computation time of each FPGA design with \(N_\mathrm{buf}\) pipelines. Table 11 shows the computation time of Code MC2 for \(N_\mathrm{buf}=1\) to 5.

In this implementation, we obtained performance improvements by changing \(N_\mathrm{buf}\). Although the performance is expected to increase in proportion to \(N_\mathrm{buf}\), \(N_\mathrm{buf} = 4\) gives the peak, with a computation time of about 6.5 s.

In this implementation, aoc successfully exploits the loop parallelism, and successive iterations launch every cycle for any \(N_\mathrm{buf}\); the pipelining problem seen in Code MC1 is solved. This modification also affected the compilation time: whereas every previous kernel took from 12 h to one day to generate hardware with aoc, the new kernel takes less than half of that. This is critical for experimenting with various \(N_\mathrm{buf}\).

Nevertheless, the memory stalls for stores to global memory still remain. Besides, the clock frequency of the FPGA design generated from this OpenCL kernel reaches only about 200 MHz for any \(N_\mathrm{buf}\). These factors degrade the performance of the new FPGA design; for \(N_\mathrm{buf}=1\), the performance actually drops compared with Code SR.

Table 12 Resource Usage on DE5a-Net Arria 10 FPGA generated from Code MC2
Table 13 Number of floating-point operators on FPGA generated from Code MC2

Here, we show the specifications of the FPGA designs generated from Code MC2 for \(N_\mathrm{buf}= 1\) to 5. Table 12 shows the resource usage on the FPGA, with one column per value of \(N_\mathrm{buf}\). The usage of DSP blocks increases with \(N_\mathrm{buf}\), while the other resource usages stay constant or grow slowly relative to the DSP block usage. Table 13 shows the number of floating-point operators for \(N_\mathrm{buf}=4\), which gives the best performance among our FPGA implementations; it corresponds to roughly four times the number of operators required to update one cell.

Although the -mad-enable option was also used for this compilation, the number of multiply–add operators decreased compared with Table 8, which shows the operators of Code SR. Replacing divisions with multiplications by the reciprocal also did not directly decrease the number of divider operators.

Here, we also estimate the performance of this FPGA design for \(N_\mathrm{buf}=4\). The clock frequency of this hardware is 198.33 MHz. The peak performance is therefore \(0.198\,\mathrm{GHz} \times 4 \times 200 \approx 160\) GFlops, the product of the clock frequency, the number of pipelines, and the number of floating-point operations per pipeline. The actual performance of this design, evaluated as before, is 68.7 GFlops; with multiple computations per pipeline stage, it is thus at most 43% of the hardware peak performance. In the current implementation, memory stalls still occupy 20% of the computation in the kernel, which is the critical bottleneck for further performance improvement.

Nagasu et al. [23] designed custom hardware for a MOST accelerator. Their implementation differs from our automatically generated hardware in that it exploits not only spatial but also temporal parallelism, cascading the computations of multiple time steps. Comparing our Code SR using shift registers with their implementation on a Stratix V 5SGXA7 FPGA, the performance of our implementation is approximately the same as their 160 MHz MOST accelerator with one SPE.

In their latest implementation, Nagasu et al. [10] evaluated the performance and power consumption of their dedicated FPGA implementation of the MOST algorithm on the same Arria 10 FPGA. In addition, they presented a performance model covering both spatial and temporal parallelism. Specifically, the performance model was constructed for the case (n, m), where n and m are the spatial and temporal parallelism, respectively. Our best implementation, Code MC2 with \(N_\mathrm{buf} = 4\) and no temporal parallelism, corresponds to \((n, m) = (4,1)\) in their model. According to their model, the required memory bandwidth (BW) is proportional to n. With \(n = 4\), the required BW is 28.8 GB/s, while the theoretical BW of our hardware is 17 GB/s with two 4 GB DDR3-1066 SODIMMs. In fact, they found a large gap between the sustained and the theoretical performance for \(n > 1\) given \(m = 1\) (see Figure 19 in [10]). Since their best design, with \((n,m) = (1,6)\), achieved 383 GFlops, they concluded that temporal parallelism is the most effective optimization for accelerating the MOST algorithm.

Waidyasooriya et al. [24] presented an optimization methodology for general stencil computations and also suggested exploiting temporal parallelism to improve the performance of stencil computation. Our current implementation exploits only spatial parallelism. Although the MOST algorithm is basically classified as a five-point stencil computation, its stencil is much more complicated than the stencil computations presented in [24]. For this reason, we have not yet appropriately implemented temporal parallelization using the Intel FPGA SDK. We will improve our OpenCL kernels to exploit both spatial and temporal parallelism with reference to their implementation.

6 Discussion

In this section, we first summarize the results of the performance benchmarking presented in this paper and compare them with related work. Additionally, we discuss the applicability of our GPU and FPGA implementations to real-time tsunami simulation.

6.1 Summary of performance benchmarking and comparison with other works

We summarize the hardware specifications and the benchmarking results of each implementation for GPU and FPGA in Tables 14 and 15. For comparison across platforms, the computation times of the differently optimized kernels from Sect. 5 are converted to performance in floating-point operations per second (Flops).

Table 14 Summary of hardware specification presented in this paper, GPU and FPGA generated from OpenCL kernels
Table 15 Summary of the performance achieved in this paper by the OpenCL implementations for GPU and FPGA (unit: GFlops)

Consequently, among our OpenCL implementations, the original kernel running on the Radeon 280X achieved the best performance in our study so far. GPUs have high peak performance for floating-point operations. In this paper, we presented the baseline GPU implementation primarily for comparison with the FPGA implementations; the GPU performance is expected to become even higher by explicitly using the efficient memory systems of the GPU, namely local memory and texture memory. However, high power consumption is often raised as a disadvantage of GPUs: in our computation, the Radeon GPU consumes 12.5 W when idle and 184.9 W while computing [10].

The OpenCL kernel written for GPUs can also be executed as an FPGA design. By adopting the shift registers and loop unrolling, the OpenCL kernel on FPGA achieved approximately the same performance as the original kernel on the FirePro GPU. Furthermore, increasing the number of computations per pipeline stage contributed a further performance improvement.

FPGAs are known to consume much less power than GPUs for the same computation. In Nagasu et al. [10], their FPGA board consumes 25–30 W when idle and 29.1–45.5 W while computing. Accordingly, the performance per power of the FPGA accelerator is approximately eight times higher than that of the GPU implementation. With little modification of the kernel code, OpenCL kernels developed for GPU can be executed as FPGA designs and obtain performance comparable to or higher than the GPU computation. Designing hardware architecture and logic circuits manually is difficult and time-consuming: the Verilog HDL code generated from our OpenCL kernel consists of more than 600,000 lines in total, whereas writing several hundred lines of OpenCL kernel code with the specific compiler suffices. In our case, for the MOST algorithm, the performance was improved by modifying the kernel to increase the number of computations per pipeline, which was achieved by unrolling a for loop and storing the data as arrays of structures in the OpenCL kernel. Therefore, OpenCL programmers gain an additional execution environment for their applications, although implementing an OpenCL kernel that generates an optimal FPGA design requires several trials.

6.2 Estimation of the applicability for real-time simulation

Here, we estimate the applicability of our GPU and FPGA implementations to real-time tsunami simulation using the results obtained so far. As a practical evaluation, we evaluate our implementations based on the phase velocity derived from the shallow water equations. The phase velocity of a tsunami is \(c=\sqrt{gH}\), where g is the gravitational acceleration and H is the sea depth. Given the distance D between the coastal area and the epicenter, the numerical simulation must be finished within D / c.

The estimation refers to a past disaster, the 2011 Tohoku earthquake and tsunami in Japan. In this case, the epicenter was located at \(D=180\,\hbox {km}\) from the coastal region.

Assuming the average sea depth H is 1500 m, we obtain the phase velocity \(c=436\,\hbox {km/h}\). Computing the arrival time of the tsunami under these conditions, we find that it is approximately 27 min. In this earthquake, the tsunami actually arrived at the coastal area of Fukushima within about 30 min, so this estimate is fairly accurate.

We then estimate the computation time for this scenario using our OpenCL kernels on the Radeon GPU (Code GPU) and the FPGA (Code MC2). The computation time is calculated by assuming a computation domain with \(N\times N\) total grid points and determining \(\Delta t\). Here, we consider a computation domain covering an \(L\times L\,\hbox {km}^2\) area with \(L=200\). To simplify the calculation, we assume the computation domain is covered by a square grid (\(2581\times 2581\)). For an accurate simulation, we must choose an appropriate time step \(\Delta t\) that satisfies Eq. (8) in Sect. 2. Here, we modify Eq. (8) by multiplying the right-hand side by a reliability constant \(\alpha \) (\(0 < \alpha \le 1\)) and use the following in the evaluation:

$$\begin{aligned} \Delta t \le \alpha \frac{\Delta x}{\sqrt{gH}} . \end{aligned}$$
(9)
Table 16 Estimation of the applicability of our OpenCL computation on the Radeon GPU and the FPGA for the tsunami in Tohoku, Japan, 2011

Table 16 presents the conditions for an accurate simulation and the estimated time for the tsunami simulation under those conditions. In this case, \(\Delta x\) is given as \(200\,\hbox {km}/2581 = 77.48\) m.

When we set \(\alpha =1.0\), the optimistic estimate giving the strict limit for \(\Delta t\) of Eq. (8), \(\Delta t\) must be smaller than 0.64 s. If we choose \(\Delta t = 0.5\) s, 3240 computation steps in total are required to simulate the tsunami for 1620 s of real time. Using the elapsed time for updating one cell obtained from our benchmarking results on GPU and FPGA in the previous sections (fifth row of Table 16), we can estimate the computation time for this scenario. As shown in the ninth row, the computation time on each hardware is estimated to be at most 70 s, much shorter than the 1620 s of real time.

To ensure higher reliability of our simulation for practical application, we make another estimate with \(\alpha = 0.5\). In this case, the upper limit for \(\Delta t\) is about 0.32 s. When we use \(\Delta t = 0.25\) s, 6480 steps are required to simulate the tsunami for 1620 s of real time. The computation time in each environment under this condition is still shorter than 1620 s of real time. Therefore, computation with either of the OpenCL kernels, for the Radeon GPU or the FPGA, is applicable to the forecast in this situation. Note that the performance of our implementation is independent of these constants.

Finally, we remark how small a \(\Delta x\) we can use to improve the accuracy of the simulation; in other words, we estimate how large an N we can use in our computation environments for tsunami forecasting. Let the computation domain be \(L\times L\), divided into \(\Delta x \times \Delta x\) grid cells, and let the total simulated time be T with the time step \(\Delta t\). The total number of cell updates in our stencil computation is then

$$\begin{aligned} \dfrac{T}{\Delta t}\times N^2 = \dfrac{T}{\Delta t}\times \left( \dfrac{L}{\Delta x}\right) ^2. \end{aligned}$$
(10)

When the computation time required for updating one cell in one time step is f, the total computation time \(T_\mathrm{comp}\) is given as follows:

$$\begin{aligned} T_\mathrm{comp} = \dfrac{T}{\Delta t}\times \left( \dfrac{L}{\Delta x}\right) ^2 \times f \end{aligned}$$
(11)

Setting \(T_\mathrm{comp} = T\) and using Eq. (9), we find the lower limit for \(\Delta x\):

$$\begin{aligned} \Delta x = \root 3 \of {\dfrac{\sqrt{gH} \times L^2 f}{\alpha }} \end{aligned}$$
(12)

Thus, we can calculate \(\Delta x\) and N using this formula and estimate the computation time of a particular simulation.

For the computation on the Radeon GPU (Code GPU), we obtain the limit \(\Delta x = 21.88\,\hbox {m}\) by substituting \(L=200\,\hbox {km}\), \(H=1500\,\hbox {m}\), \(f=1.08\times 10^{-9}\) s, and \(\alpha = 0.5\). To cover the computation domain with this \(\Delta x\), \(N = L/\Delta x\) is at most 9139. On the other hand, for the computation on the FPGA (Code MC2), we obtain the limit \(\Delta x = 30.45\,\hbox {m}\) using \(f=2.91\times 10^{-9}\) s; in this case, \(N = L/\Delta x\) is at most 6568.
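The following minimal C program reproduces these limits from Eq. (12); all constants are taken from the text:

```c
#include <math.h>
#include <stdio.h>

/* Lower limit of dx such that the simulation runs in real time, Eq. (12). */
int main(void) {
    const double g = 9.8, H = 1500.0, L = 200e3, alpha = 0.5;
    const double f[2] = { 1.08e-9, 2.91e-9 };    /* s per cell update */
    const char *name[2] = { "Radeon GPU (Code GPU)", "FPGA (Code MC2)" };
    for (int k = 0; k < 2; k++) {
        double dx = cbrt(sqrt(g * H) * L * L * f[k] / alpha);
        printf("%s: dx = %.2f m, N = %d\n", name[k], dx, (int)(L / dx));
    }
    return 0;  /* prints dx = 21.88 m (N = 9139) and dx = 30.45 m (N = 6568) */
}
```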

7 Conclusion

We developed tsunami simulation codes based on the MOST algorithm with spatial blocking applied, and parallelized them using OpenCL. OpenCL kernels can be executed not only on GPUs but also on FPGAs and other architectures. The best result of the performance benchmarking on GPU with a \(2581\times 2879\) computation grid is currently that the OpenCL code with \(2\times 2\) spatial blocking takes approximately 2.41 s (185.0 GFlops) on the AMD Radeon R9 280X GPU for 300 time steps.

In this paper, we aimed to achieve high performance on FPGA using OpenCL kernels different from the GPU implementation. We used the compiler provided with the Intel FPGA SDK to generate the FPGA designs automatically, which enables us to write OpenCL kernels the same way as for GPUs. Although the same kernel as developed for GPU can be executed on FPGA, it did not achieve the expected performance. The performance of an optimized kernel with shift registers reaches that of the original kernel running on the FirePro GPU. In addition, we restructured the GPU kernel code for the FPGA implementation; our latest OpenCL kernel, named Code MC2, supports SIMD-like operations to increase the parallelism in the FPGA design. With four computations per pipeline, our optimized kernel on the FPGA design achieved 6.49 s (68.7 GFlops) under the same conditions as the evaluations on GPU.

However, we have to write the OpenCL kernel in specific ways to obtain high-performance hardware as a MOST accelerator. In our current study, memory stalls, especially during stores to global memory, interrupt the computation in the OpenCL kernel and keep the performance of Code MC2 away from the hardware peak. We will investigate OpenCL coding techniques that translate into hardware that computes efficiently. In particular, there is room to improve the performance by exploiting the temporal parallelism presented in other papers.

On the other hand, the current implementation achieves sufficient performance in terms of applicability to real-time simulation: in the case of the 2011 Tohoku earthquake and tsunami in Japan, our computation is more than 20 times faster than real time. In the future, we will extend our MOST program to compute on nested grids, with high-resolution grids computed precisely for the areas of interest, in cooperation with other studies [18]. In that case, several variants of the computation kernel are required for the different computation domains. Customizing an OpenCL kernel is much easier and takes far less time than manually designing an FPGA.

Finally, we note that our OpenCL implementation is also applicable to distributed-memory clusters. One possible application is simulating different models concurrently on the individual nodes of such a cluster. Additionally, our OpenCL implementation can easily be used to model one large computation in parallel on various GPU clusters. In fact, we are currently working on optimizing our parallel implementation for GPU clusters using the Message Passing Interface (MPI). Furthermore, although we currently have no distributed cluster with FPGAs available, our OpenCL kernel should easily work on such clusters when they become available in the near future. For instance, Amazon Elastic Compute Cloud offers a compute instance with FPGAs (the F1 instance). We will evaluate the performance of our MPI+OpenCL implementation on the F1 instance in future publications.