1 Introduction

The simulation of fluids interacting with elastic structures has a broad range of applications in engineering, medicine and architecture. Aerodynamic design, thermodynamic cycles in motors, containment of fluids and blood flow, to name a few, can be described using the compressible Navier-Stokes equations for the fluid dynamics, combined with linear elasticity theory for the mechanics of solid boundaries. The computational tools designed for the numerical integration of the governing equations have therefore been the focus of a large research effort. The most general problem requires handling complex dynamic geometries, two-way coupling of the forces between the fluid and the solid and, often, the resolution or approximation of turbulent flows.

In aerodynamic design, early works focused on finite difference or finite volume formulations using curvilinear grids [1,2,3,4]. This description can be useful for static boundaries but is impractical for moving or elastic ones. Mathematically, the grid orientation becomes an important factor in reducing numerical errors, because the dimensional splitting technique lacks true multidimensionality and produces mesh-dependent solutions. Another partially successful technique is the use of vortex methods [5,6,7]. These were successful in describing the flow around boundaries for simple grids in Cartesian, cylindrical and spherical coordinates, for incompressible flows reaching Reynolds numbers of several thousand, and they provided insight into turbulent flow control with actuators [8]. Nevertheless, the description of more general boundaries was hampered by the need for accurate interpolations near the complex interface.

A promising technique to deal with complex geometries is the finite element method [9,10,11,12]. This technique has been the focus of many works, especially in blood flow simulations [13]. A clear advantage of this technique is the use of triangular or tetrahedral elements able to handle any geometric complexity of the boundary. The mesh can be adaptive and able to follow moving boundaries. The elasticity of solid boundaries can be coupled to the fluid pressure for compressible or incompressible flows. Many libraries are available for the simulation of fluid-structure interactions using finite elements [16, 17]. One must keep in mind, however, that grid generation is not trivial and can be quite time consuming. Also, the equations are expressed in weak form and solved implicitly in time, which generates a large non-symmetric system of linear equations to be solved using GMRES [14], a solver that is difficult to implement with fine-grained parallelism [15].

Another technique is Smoothed Particle Hydrodynamics (SPH) [18,19,20,21,22]. In this technique, the equations are discretized following the fluid elements' trajectories, and derivatives are approximated using the superposition of interpolation kernels for every particle. Among its advantages are that there is no need to generate grids and that it naturally follows moving boundaries. Given the purely Lagrangian formulation, the fluid moves with the boundary velocity right at the interface. If the domain is compressed, or the volume even vanishes in some regions, as in the case of flow in pistons, the particles are able to leave that region entirely and fill newly opened spaces. The streamlines can be sketched using the particles' trajectories, and simple equations of state can be used to simulate quasi-incompressible flows.

Hybrid techniques have been developed, combining finite elements with particle trajectories by advecting the finite element grid at the boundary [23]. In fact, there is no alternative to this description at the boundary, because the mesh must necessarily follow it. In this case, the generation of the mesh is a problem to be considered. A fully Lagrangian description is out of the question for the most general case, given that in the presence of vorticity the grid becomes highly deformed, producing a badly conditioned mass matrix.

During the last decade, the graphics processing unit (GPU) has surpassed the central processing unit (CPU) in floating point operations per second and local memory transfer speed [24, 25], and the trend is expected to continue: GPUs double their performance approximately every year, while CPUs do so every two years [26]. The hardware has therefore become an important factor in the design of fluid flow solvers. They must be designed or adapted to run entirely inside the GPU, with minimal communication with the CPU, because memory transfers from the CPU to the GPU are several orders of magnitude slower than the internal memory transfers of the GPU, which can work entirely out of L2-cache shared memory.

Considering all these factors, we have collected new ideas for the simulation of fluid-structure interactions using several GPUs. We are motivated to do so because our group has access to Abacus I, a supercomputer with one hundred Tesla K40 GPUs, providing 1200 GB of RAM and 288 000 cores running at 745 MHz. In this chapter we describe the programming philosophy and the algorithms necessary for the implementation of a novel particle method running entirely inside a set of GPUs, each associated with a CPU thread, with the threads communicating through non-blocking send and receive functions of the Message Passing Interface (MPI) library. The particle method is entirely Lagrangian and differs from SPH because it uses neither a smoothing kernel nor radial weighting functions. It has some similarities to the Moving Particle Semi-implicit method (MPS) [27], but in general the approximations and the search for neighbors are completely new. Its main characteristic is that it computes derivatives using averaged radial finite differences, based on approximations of area integrals over adaptive spheres around each particle, complemented with the divergence theorem when necessary. It is explicit in time, not requiring the inversion of a linear system of equations. It considers compressible fluids, closing the system with an equation of state for the pressure. The elastic boundary is simulated using another set of particles held together by forces approximated with springs and dampers. The two sets of particles are coupled through the fluid pressure forces and the velocity of the elastic surface. The results show the potential of the method to handle complex geometries.

2 Fluid Particles

Consider a Newtonian, compressible fluid with density \(\rho \), pressure P and viscosity \(\mu \), kept at a constant temperature T for simplicity but without loss of generality, whose motion is described by the velocity vector \({\varvec{v}}\). The material derivative \(D/Dt = \partial / \partial t + {\varvec{v}} \cdot \nabla \) is the chain rule for the total derivative in time, describing the variations along the trajectories of the fluid elements given by \(d{\varvec{x}} / dt = {\varvec{v}}\).

The system of equations describing its dynamics is given by the conservation of mass

$$\begin{aligned} \frac{D \rho }{Dt} = - \rho \nabla \cdot {\varvec{v}}, \end{aligned}$$
(1)

and Newton’s second law

$$\begin{aligned} \rho \frac{D {\varvec{v}}}{Dt} = - \nabla P + \mu \nabla ^2 {\varvec{v}}, \end{aligned}$$
(2)

closed by an equation of state relating the density and the pressure \(P = P(\rho ,T)\). We consider moving solid boundaries where the fluid velocity is set to the solid velocity \({\varvec{v}} = {\varvec{v}}_s\). The equation of state

$$\begin{aligned} P = \frac{\rho }{3} \end{aligned}$$
(3)

is used for simplicity throughout this work.

3 Solid Particles

The solid is represented by particles which conserve mass and are subject to linearly elastic, spring-like forces and damping. The conservation of mass in the solid is given by

$$\begin{aligned} \frac{D \rho _s}{Dt} = - \rho _s \nabla \cdot {\varvec{v}_s}. \end{aligned}$$
(4)

Newton’s second law for the forces in the solid is given by

$$\begin{aligned} \rho _s \frac{D {\varvec{v}}_s}{Dt} = {\varvec{f}}_s, \end{aligned}$$
(5)

where \({\varvec{f}}_s\) is the sum of the elastic, friction and pressure forces, such that \({\varvec{f}}_s = {\varvec{f}}_e + {\varvec{f}}_f +{\varvec{f}}_p \).

4 Discretization

The discretization of the system could be done by seeding particles inside the fluid domain as desired, keeping a minimum distance, denoted by \(h_{min}\), between them. We have chosen to seed the boundary with particles, either from a given boundary mesh, necessary for complex geometries, or using a level set function. If the boundary is provided as a mesh, the location of the particles must be complemented with the surface's normal vector pointing towards the fluid domain for every boundary particle. These normal vectors are used to determine the interior and exterior of the computational domain by simple weighted dot products with the relative position vector from the surface. If the boundary is given by a level set function, the boundary is seeded with particles and the level set is used to determine whether any position in space is part of the computational domain. We choose to seed the particles inside the fluid domain using a virtual Cartesian mesh covering the volume to be meshed, eliminating the points that lie outside. Each location of the virtual Cartesian mesh is tested for being inside the computational domain and, if so, stored. Figure 1 shows a cylinder seeded with particles using a Cartesian array with spacing h and keeping particles at a minimum distance \(0.5\,h\) from the boundary particles.

Fig. 1.
figure 1

Initial particle grid for a cylindrical domain. The particles are seeded using a Cartesian array and adjusted near the boundary to keep a minimum distance between particles.
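
To make the seeding procedure concrete, the following is a minimal sketch, assuming the domain is a cylinder aligned with the z-axis described by a level set; the names (levelSetCylinder, seedParticles) are illustrative, and the adjustment that keeps particles at a minimum distance \(0.5\,h\) from the boundary particles is omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// Signed level set: negative inside a cylinder of radius R aligned with the z-axis.
double levelSetCylinder(const Vec3& p, double R, double zMin, double zMax) {
    double radial = std::sqrt(p.x * p.x + p.y * p.y) - R;
    double axial  = std::max(zMin - p.z, p.z - zMax);
    return std::max(radial, axial);
}

// Seed particles on a virtual Cartesian mesh with spacing h, keeping only the
// locations that lie inside the domain described by the level set.
std::vector<Vec3> seedParticles(double h, double R, double zMin, double zMax) {
    std::vector<Vec3> particles;
    for (double z = zMin; z <= zMax; z += h)
        for (double y = -R; y <= R; y += h)
            for (double x = -R; x <= R; x += h) {
                Vec3 p{x, y, z};
                if (levelSetCylinder(p, R, zMin, zMax) < 0.0)
                    particles.push_back(p);   // store points inside the domain
            }
    return particles;
}
```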

Once the initial particle positions are given, the fields are initialized for every particle, giving the densities \(\rho \) and \(\rho _s\), and the velocities \({\varvec{v}}\) and \({\varvec{v}}_s\), at time \(t=0\).

The particle trajectories are solved with second order accuracy in time by setting

$$\begin{aligned} {\varvec{x}}_{i}^{n+1} = {\varvec{x}}_{i}^{n} + \frac{1}{2} ({\varvec{v}}^{n}_{i} + {\varvec{v}}^{n+1}_{i}) \varDelta t, \end{aligned}$$
(6)

where the time is discretized such that the superscript n denotes the time step \(t^{n} = n \varDelta t\) for \(n = 0, 1, ..., N\). This formula provides a second order, explicit integration of the trajectories and is sketched in Fig. 2.

Fig. 2.
figure 2

Second order scheme for the integration of trajectories. The velocity is advanced using the momentum equation and the time-averaged velocity is used to advance the position of the particles.
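
A minimal sketch of this update follows, with illustrative names and data layout; the acceleration is assumed to have been obtained from the momentum equation before the position update.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Particle { std::array<double,3> x, v; };

// Advance the particle trajectories: forward Euler for the velocity using the
// acceleration a_i = (-grad P + mu lap v)/rho, then Eq. (6) for the position
// using the average of the old and new velocities.
void advanceTrajectories(std::vector<Particle>& particles,
                         const std::vector<std::array<double,3>>& acceleration,
                         double dt) {
    for (std::size_t i = 0; i < particles.size(); ++i)
        for (int d = 0; d < 3; ++d) {
            double vOld = particles[i].v[d];
            double vNew = vOld + dt * acceleration[i][d];   // momentum equation, explicit
            particles[i].x[d] += 0.5 * (vOld + vNew) * dt;  // Eq. (6)
            particles[i].v[d] = vNew;
        }
}
```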

We choose a purely Lagrangian formulation where particles are never re-meshed and interpolations are not necessary for advection. This guarantees that the transport is exact along the trajectories of the particles, up to small errors in the trajectory integration. Additionally, we abandon the mollifier kernel concept used in SPH and focus only on computing the derivatives on the right-hand side of the equations for the conservation of mass and Newton's second law. All the derivatives are computed using averages of radial finite differences, assuming a regular distribution of particles. The approximations lose accuracy when the particle field deforms, and corrections to these approximations are the focus of future research. Only the first line-of-sight nearest neighbors are considered, in each of 26 directions in three dimensions and 8 in two dimensions. The gradient is computed using a vector average of radial finite differences

$$\begin{aligned} \nabla P_{i} \approx \frac{D}{N} \sum _{j \ne i} \frac{P_j - P_i}{||{\varvec{x}}_j - {\varvec{x}}_i ||} \widehat{({\varvec{x}}_{j} - {\varvec{x}}_i)}, \end{aligned}$$
(7)

where D is the number of dimensions and N is the number of nearest neighbor particles. The divergence is computed using Gauss’ theorem by

$$\begin{aligned} \nabla \cdot {\varvec{v}}_{i} \approx \frac{A}{V} \sum _{j \ne i} \frac{ ({\varvec{v}}_{j} + {\varvec{v}}_{i})}{2} \cdot \frac{({\varvec{x}}_{j} - {\varvec{x}}_i)}{||{\varvec{x}}_j - {\varvec{x}}_i ||}, \end{aligned}$$
(8)

where \(V = (4/3) \pi R^3\) and \(A = 4 \pi R^2/N\) are the volume of a reference sphere and the portion of its area assigned to each neighbor, used for the calculation of the divergence. The radius R is taken as the average half distance to the nearest neighbors. Finally, the Laplacian is the combination of both concepts,

$$\begin{aligned} \nabla ^2 v_{i} \approx \frac{A}{V} \sum _{j \ne i} \frac{ (v_{j} - v_{i})}{||{\varvec{x}}_j - {\varvec{x}}_i ||}. \end{aligned}$$
(9)
Fig. 3.
figure 3

Scheme for the radial finite differences approximation of the gradient (7), the divergence (8) and the Laplacian (9). The radial finite difference approximations are based on sums of individual difference vectors to the nearest neighbor particles combined with the divergence theorem using a sphere with radius of the average half distance to the neighbors.
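
The following sketch shows how Eqs. (7)-(9) could be evaluated for a single particle; the data layout and function names are assumptions for illustration, not the authors' implementation.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Neighbor data for one particle i (hypothetical layout).
struct Neighbors {
    std::vector<std::array<double,3>> dx;  // x_j - x_i
    std::vector<double>               P;   // P_j
    std::vector<std::array<double,3>> v;   // v_j
};

void radialOperators(const Neighbors& nb, double Pi, const std::array<double,3>& vi,
                     std::array<double,3>& gradP, double& divV, std::array<double,3>& lapV) {
    const double pi = 3.14159265358979323846;
    const int D = 3;
    const int N = static_cast<int>(nb.dx.size());
    gradP = {0.0, 0.0, 0.0};  divV = 0.0;  lapV = {0.0, 0.0, 0.0};

    // R: average half distance to the nearest neighbors.
    double R = 0.0;
    for (int j = 0; j < N; ++j) {
        const auto& d = nb.dx[j];
        R += 0.5 * std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]) / N;
    }
    const double V = (4.0 / 3.0) * pi * R * R * R;   // reference sphere volume
    const double A = 4.0 * pi * R * R / N;           // area assigned to each neighbor

    for (int j = 0; j < N; ++j) {
        const auto& d = nb.dx[j];
        const double r = std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
        for (int c = 0; c < D; ++c) {
            const double rhat = d[c] / r;                                // unit vector component
            gradP[c] += (double(D) / N) * (nb.P[j] - Pi) / r * rhat;     // Eq. (7)
            divV     += (A / V) * 0.5 * (nb.v[j][c] + vi[c]) * rhat;     // Eq. (8)
            lapV[c]  += (A / V) * (nb.v[j][c] - vi[c]) / r;              // Eq. (9)
        }
    }
}
```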

The model for the friction force consists of a damping factor \(\mathcal {D}\) times the square of the velocity magnitude,

$$\begin{aligned} {\varvec{f}}_f = - \mathcal {D} v_s^2 \hat{\varvec{v}}_s. \end{aligned}$$
(10)

The elastic forces are modelled using springs connecting each particle to its nearest neighbors,

$$\begin{aligned} {\varvec{f}}_{e,i} = - \sum _{j \ne i} k ( || {\varvec{x}}_j - {\varvec{x}}_i || - L_{i,j} ) \widehat{({\varvec{x}}_j - {\varvec{x}}_i)}, \end{aligned}$$
(11)

where \(L_{i,j}\) is the equilibrium length for the spring bonding particles i and j.
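
A hedged sketch of both solid forces, with illustrative names and following the sign conventions of Eqs. (10) and (11) as written:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Friction (Eq. 10) and elastic spring forces (Eq. 11) acting on solid particle i.
// k is the spring constant, damping the factor D, and L[j] the equilibrium length L_{i,j}.
std::array<double,3> solidForces(const std::array<double,3>& xi,
                                 const std::array<double,3>& vi,
                                 const std::vector<std::array<double,3>>& xj,  // neighbor positions
                                 const std::vector<double>& L,
                                 double k, double damping) {
    std::array<double,3> f = {0.0, 0.0, 0.0};

    // Friction: -D |v_s|^2 v_hat, which equals -D |v_s| v_s.
    double speed = std::sqrt(vi[0]*vi[0] + vi[1]*vi[1] + vi[2]*vi[2]);
    for (int d = 0; d < 3; ++d)
        f[d] -= damping * speed * vi[d];

    // Elastic springs to the nearest neighbors.
    for (std::size_t j = 0; j < xj.size(); ++j) {
        std::array<double,3> dx = {xj[j][0]-xi[0], xj[j][1]-xi[1], xj[j][2]-xi[2]};
        double r = std::sqrt(dx[0]*dx[0] + dx[1]*dx[1] + dx[2]*dx[2]);
        double stretch = r - L[j];
        for (int d = 0; d < 3; ++d)
            f[d] -= k * stretch * dx[d] / r;
    }
    return f;
}
```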

The conservation of mass and Newton's second law are advanced as ordinary differential equations along the trajectory of each fluid particle, explicitly in time with a forward Euler method, after evaluating the spatial operators on the right-hand side.
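
A minimal sketch of one such explicit step for a single particle, assuming the differential operators have already been evaluated (names are illustrative):

```cpp
#include <array>

// One explicit Euler step along a particle trajectory: density from Eq. (1),
// velocity from Eq. (2) and pressure from the equation of state (3). divV,
// gradP and lapV are the operators of Eqs. (7)-(9) evaluated for this particle.
void advanceFields(double& rho, std::array<double,3>& v, double& P,
                   double divV, const std::array<double,3>& gradP,
                   const std::array<double,3>& lapV, double mu, double dt) {
    const double rhoOld = rho;
    rho += dt * (-rhoOld * divV);                          // mass conservation
    for (int d = 0; d < 3; ++d)
        v[d] += dt * (-gradP[d] + mu * lapV[d]) / rhoOld;  // Newton's second law
    P = rho / 3.0;                                         // equation of state
}
```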

Particle methods do not have conservation issues due to advection. Only the inaccurate computation of the sources of compressibility and force can partially affect conservation of the advected quantities. The accurate calculation of gradients and divergences can be degraded by a highly deformed particle field, although viscous flows have fluid elements that deform smoothly. In any case, the computation of individual area weights for each neighbor particle is necessary for accurate approximations of the differential operators; their detailed analysis and implementation are subjects of future research.

5 Data Structures

The data structure is a list of particles for each CPU process, where the solid particles, which also describe the fluid particles at the boundary, are placed at the beginning of the list, followed by the particles in the bulk of the fluid.

A list of nearest neighbors is constructed in order to compute the differential operators on the right-hand side of the transport equations for the solid and the fluid. Analogously to Lattice-Boltzmann algorithms, we consider 8 neighboring particles in two dimensions and 26 in three. An initial list is constructed or given, and it is updated after a fixed number of time steps during the numerical integration. The new list is produced under the hypothesis that, for every particle, every new neighbor was in the neighbor list of its former neighbors. Algorithm 1 is the pseudo-code for updating the list of nearest neighbors neighborlist(1 : n, 1 : 26), where n is the total number of particles in the computational domain. The distance between particles is given by the function distance(i, j) (lines 3 and 11). If any particle lies outside a neighboring radius h, its index is tagged to be replaced. The list is double-checked to avoid the repetition of particles (line 14).

figure a
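
Algorithm 1 is not reproduced here; the following is a hedged reconstruction of the update described above. The line numbers cited in the text refer to the published pseudo-code, not to this sketch, and all names are illustrative.

```cpp
#include <array>
#include <cmath>
#include <vector>

constexpr int NB = 26;   // neighbors per particle in three dimensions

double distance(const std::vector<std::array<double,3>>& x, int i, int j) {
    double s = 0.0;
    for (int d = 0; d < 3; ++d) s += (x[i][d] - x[j][d]) * (x[i][d] - x[j][d]);
    return std::sqrt(s);
}

// Replace neighbors that drifted beyond the radius h by candidates taken from
// the neighbor lists of the former neighbors, avoiding repeated indices.
void updateNeighborList(const std::vector<std::array<double,3>>& x,
                        std::vector<std::array<int,NB>>& neighborlist, double h) {
    const std::vector<std::array<int,NB>> oldList = neighborlist;
    for (int i = 0; i < static_cast<int>(x.size()); ++i)
        for (int s = 0; s < NB; ++s) {
            int j = oldList[i][s];
            if (distance(x, i, j) <= h) continue;         // still a valid neighbor
            for (int t = 0; t < NB; ++t) {                // search neighbors of the former neighbor
                int cand = oldList[j][t];
                if (cand == i || distance(x, i, cand) > h) continue;
                bool repeated = false;                    // double check against repetitions
                for (int u = 0; u < NB; ++u)
                    if (neighborlist[i][u] == cand) { repeated = true; break; }
                if (!repeated) { neighborlist[i][s] = cand; break; }
            }
        }
}
```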

6 The Programming Model

We use the template code presented in [28], adapted for a list of particles. The computational domain is geometrically decomposed into a one-dimensional array of M sub-domains. The Message Passing Interface (MPI) library is used to start M threads on the same number of CPU cores. Every CPU thread corresponds to a process to be run on a different GPU. Typically, each node of the cluster has one or two GPUs; it is therefore necessary to distribute the M threads across different nodes, such that every process is able to pick at least one exclusive GPU.

Communication between GPUs is achieved by loading the necessary GPU data into local CPU memory, exchanging it between CPU threads using non-blocking but synchronized MPI sends and receives, and loading it back to the GPU. In this implementation, only boundary data is communicated and particles are not transferred between processes. Future versions may include the transfer of particles between GPUs.
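
A minimal sketch of such an exchange for a one-dimensional decomposition, using non-blocking MPI calls followed by a synchronizing wait; buffer names and contents are illustrative, and in the actual code the data would first be copied from the GPU to these host buffers and back afterwards.

```cpp
#include <mpi.h>
#include <vector>

// Exchange boundary data with the left and right sub-domains.
void exchangeBoundaries(std::vector<double>& sendLeft, std::vector<double>& sendRight,
                        std::vector<double>& recvLeft, std::vector<double>& recvRight,
                        int rank, int size, MPI_Comm comm) {
    MPI_Request req[4];
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Irecv(recvLeft.data(),  (int)recvLeft.size(),  MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recvRight.data(), (int)recvRight.size(), MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(sendLeft.data(),  (int)sendLeft.size(),  MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(sendRight.data(), (int)sendRight.size(), MPI_DOUBLE, right, 0, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   // synchronize before using the received data
}
```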

Fig. 4.
figure 4

Scheme of the computational decomposition of the list of particles and its arrangement as GPU threads in a cube. The GPUs are mastered by CPU threads using MPI.

Inside the GPU, operations are threaded over the list of particles, as described in Fig. 4. The list is distributed in a three-dimensional array with power-of-two dimensions, further subdivided into blocks assigned to the GPU cores. The list of threads will in general be larger than the list of particles; those threads that do not correspond to a particle perform no work.

The programming model for a single GPU is focused on performing parallel L2-cached memory reads and only a few global memory writes. The nvcc compiler is able to schedule cached reads automatically if we provide read-only arrays. All numerical operations are performed in register memory and the results are written back to write-only arrays in global memory.
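
An illustrative CUDA kernel following this model (not the authors' code): read-only inputs are declared const __restrict__ so that nvcc can serve them through the cached read-only path, arithmetic stays in registers, and a single result is written to global memory.

```cpp
// Illustrative kernel: one thread per entry of the particle list.
__global__ void updateDensity(const double* __restrict__ rho,
                              const double* __restrict__ divV,
                              double* __restrict__ rhoNew,
                              double dt, int nParticles) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;        // threads beyond the particle list do no work
    double r = rho[i];                  // operate in registers
    rhoNew[i] = r - dt * r * divV[i];   // single write to global memory
}
```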

7 Results

The algorithm is a newly proposed numerical method, and the tests are focused on demonstrating its correctness without going into a detailed analysis of the order of convergence. The algorithm has been designed to be second-order accurate for a regular distribution of neighboring particles.

We use an initial Gaussian perturbation in the density

$$\begin{aligned} \rho (r) = 1.0 + 0.01 \, e^{-r^2/5}, \end{aligned}$$
(12)

where r is the distance from the center of the domain with dimensions [15, 15, 15], in a quiescent Newtonian fluid with viscosity \(\mu = 1\).
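
For reference, a small sketch of this initial condition, assuming the center of the [15, 15, 15] box at (7.5, 7.5, 7.5):

```cpp
#include <array>
#include <cmath>

// Initial condition of Eq. (12): a small Gaussian density bump.
double initialDensity(const std::array<double,3>& x) {
    const std::array<double,3> c = {7.5, 7.5, 7.5};   // assumed domain center
    double r2 = 0.0;
    for (int d = 0; d < 3; ++d) r2 += (x[d] - c[d]) * (x[d] - c[d]);
    return 1.0 + 0.01 * std::exp(-r2 / 5.0);
}
```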

First we verify that the new radial difference formulas are correct by computing the norm of the pressure gradient and the divergence of the velocity field after a single time step \(\varDelta t = 0.02\). Figure 5 compares the differential operators for a cylindrical domain filled with 250000 particles against a Cartesian \(64^3\) finite difference mesh. The results agree, with even a slight improvement in the case of the divergence.

Fig. 5.
figure 5

Comparison of the square of the norm of the pressure gradient (top) and the divergence of the velocity (bottom), for the radial differences scheme using particles in a cylindrical domain (left) and finite differences in a cube (right). Both domains are shown cut in half by a plane normal to the x-axis.

We simulate the acoustic wave resulting from the Gaussian initial condition in a cylinder filled with 250000 particles, and compare it with a second order semi-Lagrangian scheme [29] in a \(64^3\) cube. Figure 6 shows agreement between the schemes, even though the reflection of the waves differs between the cubic domain and the cylinder; therefore, only the early stages of the wave are compared.

Fig. 6.
figure 6

Comparison of the acoustic wave for the newly proposed particle method with 250000 particles and a semi-Lagrangian scheme [29] for a \(64^3\) domain. From left to right we can see the density along the y-axis for times t = 0.2, t = 0.6 and t = 1.2.

We have run the code on one, two and four Tesla C2070 GPUs. The results for a small run consisting of a thousand time steps are presented in Fig. 7. The runs show strong scalability in the vertical direction and weak scalability in the horizontal direction. The weak scalability shows a small penalty due to the communication between GPUs. The objective of running on many GPUs is not the acceleration of the code, although significant acceleration is observed for two GPUs compared to one. For four GPUs the additional acceleration is much smaller, and extrapolating, many more GPUs would not achieve significant speed-up. The point is that multi-GPU runs should be aimed at very large problems or high resolutions. GPUs are very fast processing units that should be exploited to the maximum with the fewest communications possible. The use of several GPUs must therefore be evaluated using the weak scalability concept, where larger problems are run in approximately the same computational time, shown along the horizontal direction of Fig. 7.

Fig. 7.
figure 7

Weak (horizontal) and strong (vertical) scalability of the domain decomposition scheme for the list of particles using MPI communication between GPUs. The two black curves correspond to one- and two-dimensional domain decomposition.

8 Conclusion

We presented a programming model using domain decomposition with message passing for computations on multiple GPUs, adapted to a list of particles. This particle template code has been filled with a novel numerical method: a particle method for the integration of transport equations along the material trajectories of fluid elements, which are defined by the locations of the particles. The equations are solved in time using a second order mid-point integration scheme for the positions of the particles and a first order explicit Euler integration for the velocity and density. The pressure is obtained explicitly using an equation of state. Averages of radial derivatives combined with the divergence theorem are used to approximate the spatial differential operators on the right-hand side of the transport equations. These differences have been used in different forms in other works, but never in the form presented here. We take the nearest neighbors in the 26 principal Cartesian directions, as in a Lattice-Boltzmann scheme. The approximations have been tested to be second-order accurate for a regular, Cartesian distribution of particles, and are expected to lose accuracy as the particle field distorts. Moment conservation for the area covered by the neighbors around each particle will be explored to keep the accuracy uniform for any configuration of the neighbors. The list of particles is complemented with a list of nearest neighbors, updated after a fixed number of time steps with a local search scheme. The results for acoustic waves inside a cylinder meshed with a Cartesian array of particles show the potential of the code to solve problems in complex geometries without the need for complex mesh generators. For complex geometries, only the boundary node positions and normal vectors pointing towards the fluid must be provided.