
1 Introduction

The time and results of CFD simulations strongly depend on the type and size (number of nodes, elements, faces, etc.) of the computational mesh. Hybrid and mixed meshes [1] are the most suitable for modeling flows around bodies of complex shape. Adaptive mesh refinement methods are often used to improve solution accuracy and decrease the mesh size [2]. The adaptive mesh is statically or dynamically refined according to the flow features. Static adaptation is used for steady flow simulations: runs of the parallel CFD program alternate with solution analysis and mesh adaptation [3]. This approach can be used for multiple simulations with different flow parameters that start from the same coarse mesh. Dynamic adaptation techniques are applied in simulations of unsteady flows [4, 5]. The mesh is refined while the solution is being calculated; for example, the zones of fine cells are moved according to the current positions of the turbulent regions after each time integration step. In this case, the mesh adaptation routines are included directly in the main CFD code.

This paper describes a parallel algorithm for compressible flow simulations on hanging node mixed meshes. The heterogeneous parallelization approach is based on a hybrid MPI + OpenMP + CUDA programming model. The CUDA implementation includes communication and computation overlap. A distinctive feature of the proposed algorithm in comparison with existing CFD codes is that it can be used for simulations both on hanging node meshes and on arbitrary polyhedral meshes.

The paper is organized as follows. The mathematical model and numerical method are briefly outlined in Sect. 2. The parallel algorithm and the details of the overlap mode and heterogeneous execution are described in Sect. 3. The results of the study of parallel efficiency and performance are presented in Sect. 4. Section 5 contains conclusions.

2 Mathematical Model and Numerical Method

A compressible viscous fluid is modeled by the Navier–Stokes equations:

$$ \frac{{\partial {\varvec{Q}}}}{\partial t} + \nabla \cdot {\varvec{F}} = 0, $$
(1)

where \({\varvec{Q}} = \left( {\rho ,\rho u,\rho v,\rho w,E} \right)^{T}\) is the vector of conservative variables, and \({\varvec{F}} = {\varvec{F}}^{C} + {\varvec{F}}^{D}\) is the flux vector that consists of the convective and diffusive fluxes, respectively.

Equation (1) is discretized on hanging node mixed meshes using an explicit cell-centered finite-volume method with polynomial reconstruction. The starting conformal mesh combines tetrahedra, prisms, pyramids, and hexahedra. The mesh adaptation procedure consists in the recursive refinement and/or coarsening of cells using predefined templates (Fig. 1). A subset of the mesh cells is marked for division or agglomeration based on the adaptation function, which is created from numerical solution data. After adaptation, neighboring cells cannot differ by more than one level of refinement.

Fig. 1. Cell refinement templates.
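As an illustration of the one-level constraint, the sketch below propagates refinement marks between face neighbors until the 2:1 balance is satisfied. It is a minimal hypothetical example; the container names and the marking data structure are not taken from the described code.

```cuda
// Hypothetical sketch: enforce the one-level (2:1) refinement balance between
// face neighbors before the refinement templates are applied.
#include <vector>

struct AdaptMesh {
    int n_cells;
    std::vector<int> level;                       // current refinement level of each cell
    std::vector<std::vector<int>> face_neighbors; // cell -> adjacent cells (shared face)
};

// mark[i] == true means cell i is requested for division by the adaptation function
void balance_refinement(const AdaptMesh& mesh, std::vector<bool>& mark) {
    bool changed = true;
    while (changed) {                             // repeat until no new cells are marked
        changed = false;
        for (int i = 0; i < mesh.n_cells; ++i) {
            if (!mark[i]) continue;
            int target = mesh.level[i] + 1;       // level cell i will have after division
            for (int j : mesh.face_neighbors[i]) {
                int nb_target = mesh.level[j] + (mark[j] ? 1 : 0);
                if (target - nb_target > 1) {     // would violate the one-level rule
                    mark[j] = true;               // force the neighbor to be divided too
                    changed = true;
                }
            }
        }
    }
}
```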

During CFD simulations, the non-conformal hanging node mesh is transformed into a conformal polyhedral mesh. If a cell is adjacent to four cells along a face, then this face is divided into four parts (Fig. 2).

Fig. 2. Example of transforming a prism to a polyhedron.

Each polyhedral cell \(C_{i}\) has a volume \(\left| {C_{i} } \right|\) and a surface \(\partial C_{i}\). In accordance with the finite volume method, a balance relation is written for each cell as:

$$ \frac{\partial }{\partial t}\int_{{C_{i} }} {{\varvec{Q}}d{\Omega }} + \oint_{{\partial C_{i} }} {{\varvec{F}}{\text{d}}{\mathbf{S}} = 0} . $$
(2)

The discrete values of the mesh function \(Q_{i}\) are defined at the cell centroid \({\varvec{r}}_{i}\) as the integral average

$$ Q_{i} = \frac{1}{{\left| {C_{i} } \right|}}\int_{{C_{i} }} {Qd{\Omega }} {.} $$
(3)

The surface \(\partial C_{i}\) of an internal cell consists of flat faces \(\partial C_{ij}\). The geometry of the face \(\partial C_{ij}\) between cells \(C_{i}\) and \(C_{j}\) is set by the unit normal \({\varvec{n}}_{ij}\), the face midpoint \({\varvec{r}}_{ij}\), and its area \(S_{ij}\).

The discrete variant of (2) has the form

$$ \frac{{\partial {\varvec{Q}}_{i} }}{\partial t} + \frac{1}{{\left| {C_{i} } \right|}}\sum\nolimits_{{j \in I_{i} }} {\left( {{\varvec{F}}_{ij}^{C} + {\varvec{F}}_{ij}^{D} } \right)S_{ij} } = 0, $$
(4)

where \(I_{i}\) is the set of \(C_{i}\) neighbors.

The convective flux \({\varvec{F}}_{ij}^{C}\) is computed using a Riemann solver

$$ {\varvec{F}}_{ij}^{C} = {\Phi }\left( {{\varvec{Q}}_{ij}^{L} ,{\varvec{Q}}_{ij}^{R} ,{\varvec{n}}_{ij} } \right), $$
(5)

where \({\varvec{Q}}_{ij}^{L}\) and \({\varvec{Q}}_{ij}^{R}\) are the values from the left and right states evaluated at the interface \(\partial C_{ij}\). The implemented set of Riemann solvers [6, 7] \({\Phi }\) includes CUSP, Roe, Roe–Pike for subsonic flows; HLL, HLLE, HLLC, Rusanov and AUSMDV for supersonic flows.
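As an example of one of the listed solvers, the sketch below evaluates the Rusanov flux for a perfect gas from the left and right conservative states and the face normal. The variable layout \((\rho, \rho u, \rho v, \rho w, E)\) and \(\gamma = 1.4\) are assumptions made for this illustration; the actual interface of the implemented solvers may differ.

```cuda
// A device-side sketch of the Rusanov flux for a perfect gas.
__device__ static void physical_flux(const double q[5], const double n[3],
                                     double f[5], double* smax)
{
    const double gamma = 1.4;
    double rho = q[0];
    double u = q[1] / rho, v = q[2] / rho, w = q[3] / rho;
    double p  = (gamma - 1.0) * (q[4] - 0.5 * rho * (u * u + v * v + w * w));
    double vn = u * n[0] + v * n[1] + w * n[2];        // velocity normal to the face
    f[0] = rho * vn;
    f[1] = q[1] * vn + p * n[0];
    f[2] = q[2] * vn + p * n[1];
    f[3] = q[3] * vn + p * n[2];
    f[4] = (q[4] + p) * vn;
    *smax = fabs(vn) + sqrt(gamma * p / rho);          // signal speed estimate
}

__device__ void rusanov_flux(const double qL[5], const double qR[5],
                             const double n[3], double flux[5])
{
    double fL[5], fR[5], sL, sR;
    physical_flux(qL, n, fL, &sL);
    physical_flux(qR, n, fR, &sR);
    double s = fmax(sL, sR);
    for (int k = 0; k < 5; ++k)                        // F = 0.5(FL + FR) - 0.5 s (QR - QL)
        flux[k] = 0.5 * (fL[k] + fR[k]) - 0.5 * s * (qR[k] - qL[k]);
}
```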

It is assumed that the solution is piecewise linearly distributed over the control volume. In this case, the left and right states can be found from the relations

$$ Q_{ij}^{L/R} = Q_{i/j} + \Psi_{i/j} \left( \nabla Q_{i/j} \cdot \left( {\varvec{r}}_{ij} - {\varvec{r}}_{i/j} \right) \right), $$
(6)

where \(\nabla Q_{{\text{i}}} = \left( {\partial Q/\partial x,\partial Q/\partial y,\partial Q/\partial z} \right)\) is the gradient of \(Q\) at the \(i\)-th cell center and \({\Psi }\) denotes a limiter function [8, 9].

The algorithm for gradient calculation is based on the Green-Gauss approach. For the original mesh cells, the gradients are calculated as

$$ \nabla Q_{i} = \frac{1}{{\left| {C_{i} } \right|}}\sum\nolimits_{{j \in I_{i} }} {\left( {g_{ij}^{i} Q_{i} + g_{ij}^{j} Q_{j} } \right){\varvec{n}}_{ij} S_{ij} } . $$
(7)

The geometrical coefficients \(g_{ij}^{i}\) and \(g_{ij}^{j}\) are inversely proportional to the distances from the cell centroids to the face plane and satisfy \(g_{ij}^{i} + g_{ij}^{j} = 1\).
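The sketch below illustrates formula (7) as a face-based loop with accumulation into the two adjacent cells, assuming that the face normal \({\varvec{n}}_{ij}\) points from cell \(i\) to cell \(j\); the array layout is illustrative and not taken from the described code.

```cuda
// Green-Gauss gradient (7) for one scalar field, written as a host-side loop
// over faces; the normal n_ij is assumed to point from cell_i to cell_j.
void green_gauss_gradients(int n_faces, const int* cell_i, const int* cell_j,
                           const double* g_i, const double* g_j,   // g_ij^i, g_ij^j
                           const double* nx, const double* ny, const double* nz,
                           const double* area, const double* q,
                           const double* volume, int n_cells, double (*grad)[3])
{
    for (int c = 0; c < n_cells; ++c)
        grad[c][0] = grad[c][1] = grad[c][2] = 0.0;

    for (int f = 0; f < n_faces; ++f) {
        int i = cell_i[f], j = cell_j[f];
        double qf = g_i[f] * q[i] + g_j[f] * q[j];   // face value, g_ij^i + g_ij^j = 1
        double s = area[f];
        grad[i][0] += qf * nx[f] * s;  grad[j][0] -= qf * nx[f] * s;
        grad[i][1] += qf * ny[f] * s;  grad[j][1] -= qf * ny[f] * s;
        grad[i][2] += qf * nz[f] * s;  grad[j][2] -= qf * nz[f] * s;
    }

    for (int c = 0; c < n_cells; ++c)
        for (int d = 0; d < 3; ++d)
            grad[c][d] /= volume[c];                 // divide by |C_i|
}
```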

For reconstruction over the transformed polyhedra, dummy control volumes are used. The vertices of the dummy cell \(C_{i}^{\prime}\) are placed at the points of intersection between the \(\partial C_{ij}\) planes and the lines connecting the polyhedron centroid \({\varvec{r}}_{i}\) with the adjacent cell centroids \({\varvec{r}}_{j}\). The volume of \(C_{i}^{\prime}\) is bounded by the triangulated convex hull of its vertices. The values of the variables at the vertices and face midpoints are computed by linear interpolation on segments and triangles. Figure 3 shows the reconstruction control volume for the transformed prism of Fig. 2. The dummy control volume approach can also be used to calculate gradients in the conformal mesh cells.

Fig. 3. Dummy control volume for the transformed prism.

In order to evaluate the diffusive fluxes \({\varvec{F}}_{ij}^{D}\), the first derivatives of the velocity components and temperature at the face midpoints are computed as a combination of the averaged gradients \(\overline{\nabla Q}_{ij}\) and the directional derivatives:

$$ \overline{\nabla Q}_{ij} = \frac{{\nabla Q_{i} + \nabla Q_{j} }}{2}, $$
(8)
$$ \nabla Q_{ij} = \overline{\nabla Q}_{ij} - \left[ {\overline{\nabla Q}_{ij} \cdot {\mathbf{t}}_{ij} - \frac{{Q_{j} - Q_{i} }}{{l_{ij} }}} \right]{\mathbf{t}}_{ij} . $$
(9)

Here \(l_{ij} = \left| {{\varvec{r}}_{j} - {\varvec{r}}_{i} } \right|\) represents the distance between the cell centroids, and \({\mathbf{t}}_{ij} = \left( {{\varvec{r}}_{j} - {\varvec{r}}_{i} } \right)/l_{ij}\) is the unit vector along the line connecting them.
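A minimal sketch of relations (8)–(9) for one scalar variable is given below; argument names are illustrative.

```cuda
// Face gradient (8)-(9): the averaged gradient is corrected along t_ij so that
// its directional derivative matches (Q_j - Q_i)/l_ij.
__device__ void face_gradient(const double grad_i[3], const double grad_j[3],
                              double q_i, double q_j,
                              const double r_i[3], const double r_j[3],
                              double grad_f[3])
{
    double t[3], l = 0.0;
    for (int d = 0; d < 3; ++d) { t[d] = r_j[d] - r_i[d]; l += t[d] * t[d]; }
    l = sqrt(l);                                    // l_ij
    for (int d = 0; d < 3; ++d) t[d] /= l;          // unit vector t_ij

    double avg_dot_t = 0.0;
    for (int d = 0; d < 3; ++d) {
        grad_f[d] = 0.5 * (grad_i[d] + grad_j[d]);  // averaged gradient (8)
        avg_dot_t += grad_f[d] * t[d];
    }
    double corr = avg_dot_t - (q_j - q_i) / l;      // bracketed term in (9)
    for (int d = 0; d < 3; ++d) grad_f[d] -= corr * t[d];
}
```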

The boundary conditions are specified by setting the appropriate flow variables \({\varvec{Q}}_{{i{\Gamma }}}^{R}\) or fluxes \({\varvec{F}}_{{i{\Gamma }}}\) at the midpoints of the faces \(\partial C_{{i{\Gamma }}}\) at the computational domain boundary \({\Gamma }\).

Finally, explicit multi-stage Runge–Kutta (RK) time integration schemes are used.

3 Parallel Algorithm and Implementation

3.1 Sequential Time Integration Step Algorithm

Each explicit RK step includes multiple executions of the following operations:

  • Reconstruction kernel – to get gradients at the cell centers, a loop over the cells;

  • Fluxes kernel – calculation of convective and diffusive fluxes through inner and boundary faces, a loop over the faces;

  • Update kernel – summation of fluxes along adjacent faces and calculation of the flow variables at a new RK scheme stage or time step, a loop over the cells.

The statement blocks of the loops are data-independent and are treated as the parallelism units in the CPU (OpenMP) and GPU (CUDA) implementations of the CFD kernels.
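For example, the Update kernel can be expressed as a CUDA kernel with one thread per cell, as in the hypothetical sketch below; the CSR-like face adjacency arrays and the single-variable layout are assumptions made for illustration.

```cuda
// Update kernel sketch: each thread sums the fluxes of the faces adjacent to
// its cell and advances the cell value by one explicit RK stage.
__global__ void update_kernel(int n_cells, const int* cell_face_start,
                              const int* cell_face_idx, const double* face_sign,
                              const double* face_flux, const double* face_area,
                              const double* volume, const double* q_old,
                              double* q_new, double dt, double stage_coef)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_cells) return;
    double residual = 0.0;
    for (int k = cell_face_start[i]; k < cell_face_start[i + 1]; ++k) {
        int f = cell_face_idx[k];
        residual += face_sign[k] * face_flux[f] * face_area[f];  // sum of F_ij S_ij
    }
    q_new[i] = q_old[i] - stage_coef * dt / volume[i] * residual;
}
```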

3.2 Distributed-Memory MPI Parallelization

The problem of static load balancing between MPI processes is solved based on the domain decomposition method. Each MPI process updates the flow variables for the cells of the polyhedral mesh subdomain (Fig. 4).

Fig. 4. Two-level decomposition.

The graph partitioning subroutines [10, 11] work with an unstructured node-weighted dual graph. It is a graph in which nodes are associated with mesh polyhedra, and edges represent adjacency relations through shared faces. The weight of a polyhedron is set to the integer equal to its number of faces. As a result of the decomposition, each mesh cell gets the rank of the MPI process whose subdomain it belongs to, and each process gets a list of its own cells.
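A minimal sketch of assembling such a node-weighted dual graph in the CSR form expected by common graph partitioners (METIS-style xadj/adjncy/vwgt arrays) is shown below; the input containers are illustrative.

```cuda
// Build the node-weighted dual graph in CSR form: one graph node per mesh
// polyhedron, edges across shared faces, node weight = number of faces.
#include <vector>

struct DualGraph {
    std::vector<int> xadj;    // CSR row offsets, size n_cells + 1
    std::vector<int> adjncy;  // concatenated face-neighbor lists
    std::vector<int> vwgt;    // node weights
};

DualGraph build_dual_graph(int n_cells,
                           const std::vector<std::vector<int>>& cell_neighbors,
                           const std::vector<int>& cell_face_count)
{
    DualGraph g;
    g.xadj.resize(n_cells + 1, 0);
    g.vwgt.resize(n_cells);
    for (int i = 0; i < n_cells; ++i) {
        g.xadj[i + 1] = g.xadj[i] + (int)cell_neighbors[i].size();
        g.vwgt[i] = cell_face_count[i];              // integer weight = face count
        for (int j : cell_neighbors[i]) g.adjncy.push_back(j);
    }
    return g;
}
```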

The set of own cells is divided into subsets of inner and interface cells. The spatial scheme stencil of an interface cell includes cells from neighbouring subdomains, called halo cells. Thus, the local computational domain of an MPI process combines its own cells and halo cells. The interface cells of one subdomain are halo cells for one or more neighboring subdomains.

The local indices of the computational domain cells are ordered according to their type: inner, interface, halo. In addition, Cuthill-McKee reordering [12] is used to set the indices within the inner and interface cell groups. This approach optimizes the data layout in RAM and improves the performance of the CFD kernels. The halo cell indices are ordered by MPI process rank. Figure 5 shows the distribution of cells by type for the computational domain built around an 8 × 8 cell subdomain of a structured grid.

Fig. 5. Computational domain structure in 2D.

Parallel execution of the RK stage starts with asynchronous MPI exchanges. The data transfer includes only the flow variables; the computations of halo cell gradients and interface/halo face fluxes are duplicated. Persistent requests for messages are created before the simulation starts by the MPI_Send_init and MPI_Recv_init routines. At the synchronization point, the MPI process packs the interface data into the send buffers, starts the requests (MPI_Startall routine) and then waits for all of them to complete (MPI_Waitall routine). Halo data is received directly into the calculation data arrays without intermediate buffering.
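The sketch below outlines this persistent exchange pattern; the buffer layout, neighbor bookkeeping and the packing step are illustrative simplifications.

```cuda
// Persistent halo exchange: requests are created once with MPI_Send_init /
// MPI_Recv_init; each RK stage packs the interface data, starts all requests
// and waits for completion. Halo values are received directly into the tail
// of the cell data array.
#include <mpi.h>
#include <vector>

struct HaloExchange {
    std::vector<MPI_Request> requests;           // persistent send + receive requests
    std::vector<std::vector<double>> send_buf;   // one send buffer per neighbor

    void init(const std::vector<int>& neighbors,
              const std::vector<int>& send_count,
              const std::vector<int>& recv_count,
              const std::vector<int>& recv_offset,
              double* cell_data)                  // halo section starts at recv_offset[k]
    {
        send_buf.resize(neighbors.size());
        for (size_t k = 0; k < neighbors.size(); ++k) {
            send_buf[k].resize(send_count[k]);
            MPI_Request rq;
            MPI_Send_init(send_buf[k].data(), send_count[k], MPI_DOUBLE,
                          neighbors[k], 0, MPI_COMM_WORLD, &rq);
            requests.push_back(rq);
            MPI_Recv_init(cell_data + recv_offset[k], recv_count[k], MPI_DOUBLE,
                          neighbors[k], 0, MPI_COMM_WORLD, &rq);
            requests.push_back(rq);
        }
    }

    void update()
    {
        // ... pack interface cell values into send_buf here ...
        MPI_Startall((int)requests.size(), requests.data());
        MPI_Waitall((int)requests.size(), requests.data(), MPI_STATUSES_IGNORE);
    }
};
```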

3.3 Heterogeneous Execution and Overlap Mode

Each MPI process is bound either to a fixed number of CPU cores or to one core (host) and one GPU (device). The sizes of the local computational domains are proportional to the device performance. The relative performance ratio between the CPU and the GPU is determined experimentally from the CFD kernels execution time.

If the CFD kernels are running on a GPU, parallel execution includes additional device-to-host and host-to-device data transfers before and after the MPI exchange. Interface flow variables are packed into intermediate buffers on the device and then copied to the host MPI send buffers. Halo data is received into intermediate host buffers, which are then copied to the device data arrays.

The execution of the kernels on the GPU makes it possible to overlap communication and computation. The inner cells are computed simultaneously with the halo update operation; then the interface cells are computed using the updated halo. In this case, the host intermediate buffers are allocated in page-locked (pinned) memory. Two asynchronous CUDA streams are created on the device. The first stream is used as a queue for the halo update procedures; the CFD kernels are placed in the second stream. The algorithm for executing the RK stage on the GPU in overlap mode is shown in Fig. 6.

Fig. 6. RK stage execution algorithm.
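A condensed sketch of this overlap scheme for one RK stage is given below. The kernels pack_interface and compute_cells stand in for the real packing and CFD kernels, and exchange_halo for the MPI exchange of Sect. 3.2; all names and the single-array data layout are illustrative.

```cuda
#include <cuda_runtime.h>

// Placeholders for the real packing kernel, CFD kernels and MPI exchange.
__global__ void pack_interface(const double* q, double* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = q[i];
}
__global__ void compute_cells(double* q, int first, int count) {
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < first + count) q[i] += 0.0;
}
void exchange_halo(double* h_send, double* h_recv) { /* MPI_Startall / MPI_Waitall */ }

void rk_stage_overlap(cudaStream_t s_comm, cudaStream_t s_comp,
                      double* d_q, double* d_send, double* d_halo,
                      double* h_send, double* h_recv,   // allocated with cudaMallocHost
                      int n_send, int n_recv, int n_inner, int n_interface)
{
    const int B = 256;
    // 1. pack interface variables on the device and copy them to the pinned host buffer
    pack_interface<<<(n_send + B - 1) / B, B, 0, s_comm>>>(d_q, d_send, n_send);
    cudaMemcpyAsync(h_send, d_send, n_send * sizeof(double),
                    cudaMemcpyDeviceToHost, s_comm);
    // 2. start the inner-cell computations concurrently with the halo update
    compute_cells<<<(n_inner + B - 1) / B, B, 0, s_comp>>>(d_q, 0, n_inner);
    // 3. run the MPI exchange once the device-to-host copy has completed
    cudaStreamSynchronize(s_comm);
    exchange_halo(h_send, h_recv);
    cudaMemcpyAsync(d_halo, h_recv, n_recv * sizeof(double),
                    cudaMemcpyHostToDevice, s_comm);
    // 4. compute the interface cells after the halo has been updated on the device
    cudaStreamSynchronize(s_comm);
    compute_cells<<<(n_interface + B - 1) / B, B, 0, s_comp>>>(d_q, n_inner, n_interface);
    cudaStreamSynchronize(s_comp);
}
```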

4 Parallel Efficiency and Scalability

The study of the parallelization efficiency, the overlap mode and the heterogeneous execution performance was carried out during numerical simulation of the supersonic flow around a sphere at Mach number 2 and Reynolds number 300. The mesh adaptation algorithm and the numerical simulation results are discussed in [13]. The computational mesh is refined at the detached shock wave position, near the isosurface of local Mach number \(M = 1\) and near the recirculation region boundary (Fig. 7a). Figure 7b illustrates the flow structure.

Fig. 7. Flow around a sphere.

Coarse and fine adaptive hanging node meshes were used for testing (Table 1). Both meshes contain cells of the first and second refinement levels.

Table 1. Computational mesh parameters.

The parallel program was run on the heterogeneous supercomputer K-100 with asymmetric nodes. Two Intel Xeon X5670 CPUs and three NVIDIA 2050 GPUs are available on each node to run the program.

4.1 MPI Parallelization and Overlap Mode

Figure 8 shows the parallel efficiency graphs for the coarse mesh in GPU mode (3 MPI processes per node). The blue line shows the results of pure computations without MPI exchanges. The minimum efficiency (for 24 GPUs and about 25,000 mesh cells per device) is 87%. Here, the CFD kernels execution time increases due to domain decomposition imbalance and the duplication of calculations at halo cells. Explicit MPI exchanges (red line) visibly increase the program execution time. For a group of 24 devices, communications become comparable to the computation time: parallel efficiency drops to 64% and the speedup turns out to be 1.6 times lower than the expected value. In overlap mode, MPI exchanges are successfully hidden behind computations, and the parallel efficiency (green line) differs by only 2–5% from the results for the CFD kernels alone.

Fig. 8. Parallel efficiency. (Color figure online)

4.2 Heterogeneous Execution

To measure the efficiency of the heterogeneous mode, the values of absolute and relative performance are used. The absolute performance \(P_{N}^{mode}\) on \(N\) nodes for a specific mode (CPU only, GPU only, heterogeneous CPU + GPU) is equal to the number of RK time integration steps performed per second. The relative performance \(R_{N}^{mode}\) is calculated as

$$ R_{N}^{mode} = \frac{{P_{N}^{mode} }}{{P_{N}^{CPU} + P_{N}^{GPU} }}. $$
(10)

In CPU-only mode, each MPI process occupies 3 cores to run OpenMP threads (4 MPI processes per supercomputer node). The configuration for running 6 processes per node in the heterogeneous mode is shown in Fig. 9. Here, MPI processes and OpenMP threads are given strong affinity to CPU cores.

Fig. 9. Program launch configuration.

Figure 10 shows a graph of the relative performance for the coarse mesh. The ratio of values remains practically unchanged for different numbers of nodes. The relative heterogeneous performance is about 83%. The difference between \(R_{N}^{HET}\) and \(R_{N}^{CPU} + R_{N}^{GPU} = 1\) is partly due to the fact that 3 cores are used to control the execution of the CFD kernels on the devices. In this case, the ideal value is \(R_{N}^{HET} = 0.75R_{N}^{CPU} + R_{N}^{GPU} \approx 0.91\). Thus, the real \(R_{N}^{HET}\) differs from the theoretical maximum by only about 8%.

Fig. 10. Relative performance.
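As a worked check, assuming the fixed 1:1.6 CPU-to-GPU performance ratio reported in Sect. 4.3, the ideal heterogeneous value follows as

$$ R_{N}^{CPU} \approx \frac{1}{2.6} \approx 0.38,\quad R_{N}^{GPU} \approx \frac{1.6}{2.6} \approx 0.62,\quad R_{N}^{HET} \approx 0.75 \cdot 0.38 + 0.62 \approx 0.91. $$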

4.3 Parallel Program Scaling

The study of the program scaling was carried out for the fine 5 million cell mesh. Figure 11 shows the speedup graph for heterogeneous parallelization (green line). For comparison, the ideal linear speedup is also shown (red dashed line). The performance ratio of the CPU and the GPU has a fixed value of 1:1.6. The relatively low efficiency of the current GPU implementation of the CFD algorithm is explained by the use of the same parallelism units as in the CPU implementation.

Fig. 11. Speedup for fine mesh. (Color figure online)

The speedup on 35 nodes (70 CPUs, 105 GPUs, an MPI group of 210 processes) is 29.3, which corresponds to a parallel efficiency of 84%. The absolute performance is \(P_{N}^{HET} \approx 56\) steps per second, that is, \(0.0179\) s per time integration step or \(0.0045\) s per RK stage (including MPI exchanges).
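These timings are consistent with each other if a four-stage RK scheme is assumed:

$$ \frac{1}{{P_{N}^{HET} }} \approx \frac{1}{56} \approx 0.0179\,{\text{s per step}},\qquad \frac{0.0179}{4} \approx 0.0045\,{\text{s per RK stage}}. $$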

5 Conclusions

This paper describes a heterogeneous CFD algorithm for compressible flow simulations on adaptive mixed meshes. The algorithm can be successfully used in computations with static mesh adaptation methods, when the solution analysis and mesh refinement routines are not included in the main CFD code. The software implementation of the algorithm demonstrates high scalability and parallel efficiency. In the future, the CUDA CFD kernels will be optimized as described in [14] and dynamic adaptation methods will be added.