
1 Introduction

The last decade has witnessed radical changes in number-crunching hardware [1, 2]. The graphics processing unit (GPU) now delivers roughly an order of magnitude more peak double-precision floating point operations per second (FLOPS) than the central processing unit (CPU). Today, the GPU can deliver more than 3000 GFLOPS, while the CPU delivers just above 500 GFLOPS. Additionally, the peak memory bandwidth, the rate at which data is fetched from memory, is around 500 GB/s for the GPU compared to under 80 GB/s for the CPU. The trends become steeper with every new processor generation [3]: at present, GPUs double their performance approximately every year, while CPUs do so roughly every two years.

Supercomputers rely increasingly on GPUs each year [2], mostly to increase the FLOPS delivered for a fixed monetary budget. Besides their higher peak performance, GPUs bring savings because they deliver the best cost-per-operation and energy-per-operation ratios. Surprisingly, the transition to GPUs has been slow on the programming side, and supercomputers are still dominated by distributed, multi-core applications. One of the main reasons for the slow migration of algorithms to the GPU is that the methodologies need to be adapted to the GPU architecture, following an ideal GPU-programming philosophy [3]. Perhaps the applied mathematics community has not paid sufficient attention to the changes in the hardware, and has produced a series of methodologies that may be suited for some GPU acceleration [4, 5], but not for exclusive, multiple-GPU, scalable performance.

In the field of continuum dynamics, theoretical research has focused mainly on weak formulations of partial differential equations and their discretization using finite elements [6–8]. The research groups in applied mathematics have produced a great variety of theorems and applications for this particular formulation [9–13]. Finite elements are highly attractive because complex geometries and mixed systems of equations, including constraints, can be discretized and solved using implicit formulations, reducing the stability constraints on the time steps. Once the numerical engine for the discretization and the inversion of the matrix has been implemented, the extension of the method to new applications is relatively simple, and new research can be directed toward the numerical convergence of the algorithms and the best choice of preconditioner for the matrix. The finite element method produces a large sparse matrix with mixed entries that in most cases is solved using a Generalized Minimal Residual (GMRES) algorithm [14].

The applied mathematician often treats the GMRES algorithm as a black box and focuses mainly on establishing the weak formulation of the partial differential equations. This is where we find the greatest obstacle to scalable simulations. The community argues that sooner or later a general sparse-matrix inverter, GMRES or otherwise, will be available on multi-processors, and that they will then simply be able to swap the matrix-inversion engine in their codes. The community has had some success using domain decomposition for the inversion of the sparse matrix on systems of multiple processors but, to this date, this remains one of the most important open problems in the field [15, 16].

Unfortunately for the domain decomposition of matrices, many-core architectures like GPUs are quite different from multi-cores. Multi-cores can assign a large region of elements of the sparse matrix to each core and use iterative matching to achieve convergence. The GPU requires one code applied independently to each element, using massive memory fetches for the discretization and writing its result to memory, fully in parallel.

Methods like GMRES make sense on GPUs for systems of linear or non-linear equations solved independently for every element, that is, a local inversion of a small matrix per thread. Large matrix inversions, however, are not easy to implement in parallel. The main problem is that GMRES was designed and optimized as a sequential algorithm: it requires matrix-vector multiplications, rotations, and the construction of a basis using the Gram-Schmidt algorithm. The matrix-vector multiplications can be implemented in parallel, but the rotations and the construction of the basis vectors are inherently sequential [17]. Additionally, the domain decomposition of matrices using iterative algorithms faces a fundamental problem: the slow transmission of the long-distance components of the solution. For example, when solving a Poisson equation, the potential always has local components due to sources located very far away; numerically, the computation must be able to communicate those sources rapidly when solving for the corrections at the interfaces of the domains. Given the algorithm in its current form, only partial accelerations can be achieved for a large number of cores, computing matrix multiplications on the GPUs and reductions on the CPUs [17].

Intensive research is being done in the field, and the reader must not lose sight of new developments. Finite elements can adapt to the GPU programming philosophy with some modifications of the general matrix strategy. Some alternatives are to produce explicit finite element methods and to reduce the complexity of the equations to allow the use of multigrid. Successful implementations can be found in this direction [18–23].

Other methods have evolved in relative isolation during the fertile decade of finite elements; these include finite differences, finite volumes [24] and Lagrangian advection [25, 26]. These algorithms are popular among engineers but are ignored, or regarded as low order, by most mathematicians. In this direction, our team has been successful in producing new high-order numerical schemes that combine the best features of these methods, exploiting explicit integration, full multi-dimensionality, fast memory fetches and fine-grain parallel processing, while avoiding memory latencies with a fixed memory model.

In this paper we show that the development of high-order, moment-preserving, semi-Lagrangian schemes [27], combined with the explicit solution of diffusion equations and multigrid or multiscale algorithms [28], provides a high-order convergent framework for incompressible, compressible and constrained continuum dynamics that is ideally scalable on an array of many-core architectures.

First, we present a general programming philosophy and a heat-equation template for multiple GPUs, showing its scalability for a large number of units. The programming philosophy is then applied to a series of model equations that cover all the types of systems of equations found in continuum dynamics: pure hyperbolic systems like the shallow-water equations, elliptic-hyperbolic systems like the vorticity equations, and parabolic-hyperbolic systems like the porous-media equations.

2 Programming Model

The programming model must exploit the hardware characteristics of the GPUs to achieve peak performance on each card, and must communicate between the GPUs efficiently to obtain scalability. Acceleration from using several GPUs is possible, but the main goal must be to run very large problems with small time penalties: a problem N times larger, run on N GPUs, should take approximately the same computing time as the unit problem run on one. Therefore, we need to observe how the computational time varies with the total number of mesh points for a fixed problem size per GPU, known as weak scalability.

Each GPU has a very large memory bandwidth and a very large number of low-power processors. The programming model for each GPU must encourage the massive parallelization of the work space into the maximum number of threads, each using the maximum number of memory fetches, intense computations in local registers, and a few memory writes. If global iterations are needed, they are better implemented on top of the GPU routines, after full work-space operations are finished.

Fig. 1. Programming model for a single GPU (left) and message-passing model for many GPUs (right). For each node of the numerical domain, massive L2 memory fetches are combined with intensive operations in registers and limited writes to global memory. The data missing at each face is transferred by the CPUs with MPI.

Figure 1 (left) shows the programming model for a single GPU. The run is carried out entirely on the GPU, without large data transfers to the CPU. Inside the GPU, each thread represents a node of the numerical grid. For every thread there are many memory reads from a read-only array, which allows the compiler to route these loads through the fast L2/read-only cache without the need to declare texture arrays. This enables fast and massive memory transfers to local registers. The registers are then used for the numerical operations, and the results are written back to a write-only array in global memory. Every algorithm must be adapted, in every possible way, to this philosophy in order to exploit the capabilities of the GPUs efficiently.
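As a minimal illustration of this pattern, the following CUDA sketch (schematic, not the production code) assigns one thread to each interior node, fetches the neighbors through a read-only, `__restrict__`-qualified array, accumulates in registers, and performs a single write to a separate output array. For brevity the stencil shown is a simple 7-point diffusion update, and all names (`stencil_step`, `idx3`, `coef`) are illustrative.

```cuda
__device__ __forceinline__ int idx3(int i, int j, int k, int nx, int ny)
{
    return i + nx * (j + ny * k);
}

__global__ void stencil_step(const double* __restrict__ u_read,   // read-only array
                             double* __restrict__ u_write,        // write-only array
                             int nx, int ny, int nz, double coef)  // coef = kappa*dt/dx^2
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;

    // Massive memory fetches into registers (a 7-point subset of the stencil,
    // kept short here for illustration).
    double c = u_read[idx3(i, j, k, nx, ny)];
    double s = u_read[idx3(i - 1, j, k, nx, ny)] + u_read[idx3(i + 1, j, k, nx, ny)]
             + u_read[idx3(i, j - 1, k, nx, ny)] + u_read[idx3(i, j + 1, k, nx, ny)]
             + u_read[idx3(i, j, k - 1, nx, ny)] + u_read[idx3(i, j, k + 1, nx, ny)];

    // Intensive arithmetic in registers, then a single write to global memory.
    u_write[idx3(i, j, k, nx, ny)] = c + coef * (s - 6.0 * c);
}
```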

The multiple-GPU model is based on domain decomposition of the computational domain and communication using the message passing interface (MPI). Domain decomposition of continuum dynamics problems always involves communication between domains to update the border data. Depending on the approximation order of the algorithm, more than one layer of border data may need to be communicated. The border data is stored in a special array that is transferred to the local CPU and communicated to the corresponding neighbor CPU using MPI. Once the data has been transferred between CPUs, each CPU loads the border data onto its respective slave GPU.

Figure 1 (right) shows a three-dimensional domain decomposition and a sketch of the communication model using MPI. The domain decomposition can be done in one, two or three dimensions, as desired, but it should be kept in mind that minimizing the surface area to be communicated can be important: a full three-dimensional domain decomposition provides the minimal communication area.

Bounded irregular domains can be handled by including the irregular domain in a bounding box and eliminating the unnecessary nodes. Infinite domains must be handled using adequate boundary conditions.

3 Template Code

We establish a template code for the chosen programming philosophy: a programmable building block that is the basis of all the algorithms presented here and is used for diagnostics of performance and scalability.

The model problem chosen for the template is parabolic: the heat diffusion equation

$$\begin{aligned} \frac{\partial u}{\partial t} - \kappa \nabla ^2 u = f, \end{aligned}$$
(1)

where \(u = u(x,t)\), \(x \in [0,L_1]\times [0,L_2]\times [0,L_3]\), \(t \in [0, \infty )\), \(u(x,0) = u_0(x)\), and \(f=f(x,t)\) is a given source.

We use first-order forward differences in time and second-order, 27-point-stencil finite differences in space. The result is an explicit algorithm that reads the values of 27 nodes, applies the finite differences, and writes the solution for the next time step.

This algorithm is stable for time steps \(\varDelta t < \frac{(\varDelta x) ^2}{2 \kappa }\), with \(\varDelta x = \max _i(\varDelta x_i)\), and therefore has a strong time-step restriction. We use two arrays to distinguish the present and future time levels, used only for reading and writing, respectively, at every time step. Given the characteristics of the algorithm, threaded for every node of the domain, a single GPU achieves peak performance and the time-step penalty is reduced in the overall computational time. The algorithm is also perfectly scalable across GPUs with different numbers of processors.
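A schematic host-side sketch of this time loop follows, assuming a stencil kernel like the one sketched in Sect. 2 and omitting the source term; the time step is fixed safely below the stability bound and the two arrays simply swap roles after every step. Names and launch parameters are illustrative, not taken from the production code.

```cuda
#include <cuda_runtime.h>
#include <utility>

// Forward declaration of a stencil kernel like the one sketched in Sect. 2.
__global__ void stencil_step(const double* __restrict__ u_read,
                             double* __restrict__ u_write,
                             int nx, int ny, int nz, double coef);

void run_heat_template(double*& d_u_read, double*& d_u_write,
                       int nx, int ny, int nz,
                       double dx, double kappa, int n_steps)
{
    // Time step chosen safely below the stability bound dt < dx^2 / (2*kappa).
    const double dt   = dx * dx / (8.0 * kappa);
    const double coef = kappa * dt / (dx * dx);
    const dim3 block(8, 8, 8);
    const dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);

    for (int step = 0; step < n_steps; ++step) {
        // Read the "present" array, write the "future" array.
        stencil_step<<<grid, block>>>(d_u_read, d_u_write, nx, ny, nz, coef);
        std::swap(d_u_read, d_u_write);   // the two arrays alternate roles every step
    }
    cudaDeviceSynchronize();              // the latest solution now sits in d_u_read
}
```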

Fig. 2. Computational time against number of mesh points for one, two and four Tesla C2075 GPUs. The dotted line shows the almost perfect scalability slope: runs with 8 million mesh points on four GPUs take almost the same computational time as 2 million mesh points on one GPU. The small slope of the dotted line indicates the penalty of the message passing between GPUs.

For multiple GPUs, the computational domain is partitioned into \(N_1 \times N_2 \times N_3\) cubes or parallelepipeds, assigning a subdomain to every GPU. After a full-mesh time iteration, the faces of the subdomains are loaded onto the local CPU and communicated with MPI using send and receive commands. The corresponding neighboring CPU receives the data and loads it onto its slave GPU. The data transfer is done in an ordered way, one direction at a time.
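The exchange can be sketched as follows for the x-direction only (the y and z directions are handled analogously, one at a time). The face data are assumed to be already packed into contiguous device buffers, and all buffer names and neighbor ranks are illustrative rather than taken from the production code.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_x_faces(const double* d_send_lo, const double* d_send_hi,
                      double* d_recv_lo, double* d_recv_hi,
                      double* h_send_lo, double* h_send_hi,
                      double* h_recv_lo, double* h_recv_hi,
                      int face_size, int left, int right, MPI_Comm comm)
{
    const size_t bytes = (size_t)face_size * sizeof(double);

    // 1. Load the two x-faces from the slave GPU to the local CPU.
    cudaMemcpy(h_send_lo, d_send_lo, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_send_hi, d_send_hi, bytes, cudaMemcpyDeviceToHost);

    // 2. Exchange with the neighboring CPUs; at physical boundaries the neighbor
    //    rank can be MPI_PROC_NULL, so the same call works everywhere.
    MPI_Sendrecv(h_send_lo, face_size, MPI_DOUBLE, left,  0,
                 h_recv_hi, face_size, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(h_send_hi, face_size, MPI_DOUBLE, right, 1,
                 h_recv_lo, face_size, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);

    // 3. Load the received border data back onto the slave GPU.
    cudaMemcpy(d_recv_lo, h_recv_lo, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_recv_hi, h_recv_hi, bytes, cudaMemcpyHostToDevice);
}
```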

We fix the spatial discretization and increase the number of nodes. The code is first run on one Tesla C2075 card; the results are shown in Fig. 2 as a light grey line. It shows a linear relation between the number of mesh points and the computational time, reaching approximately 20 s for 6400 time steps and two million mesh points. We then run the code on two (grey line) and four (black lines) GPUs. We observe that, for the same two million mesh points, two GPUs need almost half the computational time of one, accelerating the computation; four GPUs, however, are not able to do it in a fourth of the time. Accelerating computations using several GPUs is possible, but it is not our main goal.

The use of several GPUs is needed to run problems with a large number of mesh points. In Fig. 2, the dotted line shows practically perfect scalability; its small slope shows the penalty of the message passing needed to bind the subdomains. It means that we can run more than eight million mesh points on four GPUs in the same computational time as two million points on one GPU. The slope of the dotted line can be used to estimate the computational time when using many GPUs. The black line is doubled because the runs on four GPUs were performed with one-dimensional (upper) and two-dimensional subdomain partitions, which show no considerable differences.

When using clusters with a large number of GPUs, we have observed perfect scalability as long as the GPUs share a motherboard. A large penalty can result from the message passing between distributed motherboards, depending on the type of network. In such cases, perfect scalability is sacrificed in exchange for the possibility of a very large computation. Abacus I, an SGI cluster at Cinvestav, Mexico, achieves perfect scalability for two Tesla K40 cards that share a motherboard. The computational time for the 20 s benchmark increases ten times for 27 cards on 14 nodes, computing 56 million points, and fifty times for 64 cards on 32 nodes, computing 134 million points.

4 Model Equations and Algorithms

We now present a set of model equations that covers the whole range of systems found in continuum dynamics simulations. The numerical algorithms are original and have been developed following the programming model presented above for several GPUs; therefore, the algorithms have the same scalability properties as the heat-equation template. The heat-equation template is our model for parabolic equations. Here, we extend the programming philosophy to elliptic equations, using a multiscale method close to a multigrid [28]; to hyperbolic equations, using a moment-preserving, high-order, semi-Lagrangian scheme [27]; and to mixed systems of equations with constraints.

4.1 Elliptic Equations

The chosen model for elliptic systems of PDEs is the three-dimensional Poisson equation

$$\begin{aligned} \nabla ^2 u = f, \end{aligned}$$
(2)

for \(u = u(x,y,z)\) with \((x,y,z) \in [0,L_x] \times [0,L_y] \times [0,L_z]\), given \(f = f(x,y,z)\).

This equation is solved using the multiscale algorithm described in [28], which is closely related to a multigrid. The multiscale algorithm consists of solving heat equations iteratively on coarse nested discretizations, helped by interpolations to the rest of the mesh, in a descending cycle, until the heat equation is solved on the whole mesh. The process has been observed to converge to machine precision when full cycles are repeated a few more times on the residual function.
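The structure of the cycle can be outlined with the following schematic host code. It is only an outline of the algorithm of [28]: `relax_heat`, `interpolate_to_finer` and `residual_norm` are hypothetical placeholders for the GPU routines of the template, and the repeated cycles over the residual are abstracted into a residual-based stopping test.

```cuda
// Hypothetical helpers standing in for the GPU routines of the template:
// a few explicit heat-equation steps on the nodes of one nested level,
// interpolation of that level's result to the next finer level, and a residual norm.
void relax_heat(double* u, const double* f, int level, int steps);
void interpolate_to_finer(double* u, int level);
double residual_norm(const double* u, const double* f);

void multiscale_poisson(double* u, const double* f,
                        int n_levels, int relax_steps, double tol, int max_cycles)
{
    for (int cycle = 0; cycle < max_cycles; ++cycle) {
        // Descending cycle: coarsest nested level first, down to the full mesh.
        for (int level = n_levels - 1; level >= 0; --level) {
            relax_heat(u, f, level, relax_steps);
            if (level > 0) interpolate_to_finer(u, level);
        }
        // Repeating full cycles drives the residual toward machine precision.
        if (residual_norm(u, f) < tol) break;
    }
}
```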

In Fig. 3, we show an illustration of the elliptic-equation solver constructed from the template. It shows the solution of the Poisson equation for a singular source with the shape of Mexico. Neumann boundary conditions are used because the solution is used to deform a mesh: the mesh is deformed using equations of motion for each mesh point, with a force computed from the potential obtained as the solution of the Poisson equation.

Fig. 3. Potential function, solution of the Poisson equation with Neumann boundary conditions and a singular one-dimensional source with the shape of Mexico, illustrating the multigrid elliptic-equation solver constructed from the template. The solution is used as the potential of a force that moves the mesh points of a regular grid to obtain adaptivity.

4.2 Hyperbolic Equations

The chosen model for hyperbolic equations is the two-dimensional shallow-water system: the conservation of water volume

$$\begin{aligned} \frac{D h}{Dt} = -h \nabla \cdot v, \end{aligned}$$
(3)

and the acceleration of the water column

$$\begin{aligned} \frac{D v}{Dt} = - \nabla h, \end{aligned}$$
(4)

where h is the height of the water column and v its velocity. The total time derivative, known as the material derivative, \(D/Dt = \partial / \partial t + v \cdot \nabla \), requires the solution of the trajectories \(dx / dt = v\).

The equations are solved using the semi-Lagrangian, moment-preserving numerical scheme described in [27]. The scheme makes use of fluid elements that move and deform: starting from a Cartesian set, the corners of the elements are treated as movable nodes that travel in space following the given equations with total time derivatives. As the points travel, the elements they describe deform, and after a few time steps an exact non-linear map is applied to view each new element in a space with orthogonal coordinates, where high-order, moment-preserving interpolations restart the mesh points on a reference mesh.

The algorithm requires 216 neighbors. It uses arrays for the positions of the nodes and computes trajectories and interpolations explicitly. It has a very relaxed CFL time-step restriction, as long as the trajectories are well integrated.
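A heavily simplified sketch of the advection step is shown below: each thread moves one corner node along \(dx/dt = v\) with a midpoint rule. The moment-preserving remap of [27] is not reproduced, and `sample_velocity` is a hypothetical device helper that interpolates the velocity at an arbitrary position; all names are illustrative.

```cuda
#include <cuda_runtime.h>

// Hypothetical device helper: interpolates the velocity field (vx, vy), given on an
// nx-by-ny reference mesh of spacing dx, at an arbitrary position (x, y).
__device__ double2 sample_velocity(const double* __restrict__ vx,
                                   const double* __restrict__ vy,
                                   double x, double y, int nx, int ny, double dx);

__global__ void advect_corners(double* __restrict__ px, double* __restrict__ py,
                               const double* __restrict__ vx,
                               const double* __restrict__ vy,
                               int n_nodes, int nx, int ny, double dx, double dt)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= n_nodes) return;

    double x = px[n], y = py[n];

    // Midpoint (second-order Runge-Kutta) step for the trajectory dx/dt = v.
    double2 v1 = sample_velocity(vx, vy, x, y, nx, ny, dx);
    double2 v2 = sample_velocity(vx, vy, x + 0.5 * dt * v1.x, y + 0.5 * dt * v1.y,
                                 nx, ny, dx);
    px[n] = x + dt * v2.x;
    py[n] = y + dt * v2.y;
    // After a few such steps the deformed elements are remapped onto the reference
    // mesh with the high-order, moment-preserving interpolation of [27] (not shown).
}
```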

Fig. 4. Height of the water column for shallow-water waves in a tank, illustrating a solver for hyperbolic systems of equations constructed from the template. The left wall is moved by an oscillating piston, and the opposite side has a dissipation zone to kill the waves. The floor has a transverse bump with a centered aperture. The waves show diffraction.

In Fig. 4, we present the dynamic solution for shallow-water waves in a tank with a variable floor. The floor has an obstacle, a transverse bump with a centered aperture, and the solution shows the formation of a diffraction pattern. The waves originate at the left face, where the wall is moved by an oscillating piston; the end of the channel has a dissipation zone to kill the waves.

Fig. 5. Solution to the porous-media equations for one fluid. (a) Contours of pressure for an extraction point at the top. (b) Streamlines colored by the velocity magnitude.

4.3 Model for a Parabolic-Hyperbolic System of Equations

The chosen model for a parabolic-hyperbolic system of equations is the three-dimensional flow in porous media: the conservation of mass

$$\begin{aligned} \frac{D n}{D t} + n \nabla \cdot {v} = q, \end{aligned}$$
(5)

for the effective density \(n = \phi \rho \), where \(\phi \) is the porosity, \(\rho \) the density of the fluid, v the intrinsic velocity, and q a mass source; and the pressure equation

$$\begin{aligned} \left[ (1-\phi )C_r+\phi C_f \right] \frac{\partial P}{\partial t} = \phi \left( - \nabla \cdot {v} + \frac{q}{n} \right) , \end{aligned}$$
(6)

where \(C_r\) and \(C_f\) are the compressibilities of the solid and the fluid, respectively. The system is closed using Darcy's law for the intrinsic velocity

$$\begin{aligned} {v} = {\varvec{K}} \cdot ( -\nabla P + {\varvec{f}} ), \end{aligned}$$
(7)

where K is the permeability over the porosity and f is an external force.

We solve the equations using the heat-equation solver for the parabolic pressure equation, combined with the semi-Lagrangian scheme for the hyperbolic part. In Fig. 5, we show the pressure contours and the streamlines for a flow in porous media with a sink at the center of the top face.
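As an illustration of how Darcy's law (7) enters the explicit update, the following schematic kernel evaluates the intrinsic velocity from the pressure with centered differences, assuming for simplicity a scalar permeability-over-porosity field and no external force; names and memory layout are illustrative.

```cuda
// Schematic kernel for Darcy's law (7): v = K * (-grad P), with a scalar K
// (permeability over porosity) and no external force, using centered differences.
// One thread per interior node.
__global__ void darcy_velocity(const double* __restrict__ P,
                               const double* __restrict__ K,
                               double* __restrict__ vx,
                               double* __restrict__ vy,
                               double* __restrict__ vz,
                               int nx, int ny, int nz, double dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;

    int id = i + nx * (j + ny * k);
    double inv2dx = 1.0 / (2.0 * dx);

    vx[id] = -K[id] * (P[id + 1]       - P[id - 1])       * inv2dx;
    vy[id] = -K[id] * (P[id + nx]      - P[id - nx])      * inv2dx;
    vz[id] = -K[id] * (P[id + nx * ny] - P[id - nx * ny]) * inv2dx;
}
```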

4.4 Model for an Elliptic-Hyperbolic System of Equations

The chosen model for an elliptic-hyperbolic system of equations is the two-dimensional vorticity equation

$$\begin{aligned} \frac{D \omega }{Dt} = 0, \end{aligned}$$
(8)

coupled to the Poisson equation

$$\begin{aligned} \nabla ^2 \phi = - \omega , \end{aligned}$$
(9)

with the velocity vector given by \(v = \nabla \times (\phi \hat{k})\).

These equations represent the motion of an incompressible fluid. The vorticity is advected using the semi-Lagrangian scheme, and the potential is found by solving the Poisson equation with the multiscale (multigrid-like) solver.
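The coupling of the two solvers can be illustrated with the following schematic kernel, which recovers the velocity from the potential as \(v = (\partial \phi / \partial y, -\partial \phi / \partial x)\) with centered differences before each advection step; names and memory layout are illustrative.

```cuda
// Schematic kernel recovering the velocity from the potential phi on a 2D mesh:
// v = curl(phi k) = (d(phi)/dy, -d(phi)/dx), with centered differences.
// One thread per interior node.
__global__ void velocity_from_potential(const double* __restrict__ phi,
                                        double* __restrict__ vx,
                                        double* __restrict__ vy,
                                        int nx, int ny, double dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;

    int id = i + nx * j;
    double inv2dx = 1.0 / (2.0 * dx);

    vx[id] =  (phi[id + nx] - phi[id - nx]) * inv2dx;   //  d(phi)/dy
    vy[id] = -(phi[id + 1]  - phi[id - 1])  * inv2dx;   // -d(phi)/dx
}
```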

Fig. 6. Snapshot of the vorticity, illustrating the numerical solution of elliptic-hyperbolic systems of equations like those found in incompressible-flow simulations, using a code based on the template. We solve the two-dimensional vorticity equations for an initial random distribution of Gaussian-bell vortices of both signs. The box has slip-wall boundary conditions.

In Fig. 6, we show a snapshot of the vorticity, represented by the height of the mesh, starting from a random distribution of Gaussian-bell vortices of positive and negative sign. The box has Dirichlet boundary conditions for the potential, and therefore the walls have zero normal velocity.

5 Conclusion

We presented a programming model using trivial domain decomposition with message passing for computations on multiple GPUs. The programming model has proven to be perfectly scalable for GPUs that share a motherboard: in that case, the computational times remain in the same range for runs with an increasing total number of nodes, using a large and fixed number of nodes per GPU. For clusters of many GPUs on distributed motherboards, we have found a variable penalty associated with the message passing between nodes, dependent on the network; problems of hundreds of millions of nodes can be solved by sacrificing perfect scalability. The novelty of this work is that the programming philosophy is used to implement a diffusion-equation solver for parabolic equations, a multiscale solver for elliptic equations, and a semi-Lagrangian advection scheme for hyperbolic equations. The three models are combined into schemes that solve systems of equations modeling all the types of systems found in continuum dynamics, for incompressible, compressible and constrained flows.