
1 Introduction

Computational Fluid Dynamics (CFD) simulations have revolutionized the design process in various scientific, engineering, industrial, and medical fields. Current Reynolds-averaged Navier-Stokes (RANS) methods can solve steady viscous transonic and supersonic flows, but cannot reliably predict turbulent separated flows [28]. The lattice Boltzmann method (LBM) is a young and evolving approach to solving these problems in the CFD community [2]. It originates from a mesoscale description of the fluid (based on the Boltzmann equation) and directly incorporates physical terms to represent complex phenomena such as multi-phase, reactive, and suspension flows. In addition, many collision models have been developed for LBM to improve its stability while retaining second-order numerical accuracy when simulating high-Reynolds-number flows [2].

However, it is challenging to achieve high performance with LBM algorithms, since LBM has large data storage costs and is highly memory-bound on current architectures [24]. Building on our prior work [5] that merges multiple collision-streaming cycles (or time steps) in 2D, this study extends the memory-awareness idea to parallel 3D LBM in order to optimize data reuse. Although moving from 2D to 3D may seem straightforward, designing an efficient 3D memory-aware LBM algorithm is significantly more difficult. In this paper, we target three main challenges. (1) As geometries change from 2D to 3D, the required data storage grows from \(O(N^2)\) to \(O(N^3)\), and the data dependencies of the lattice model become much more complicated. Single-copy distribution methods exist that halve the data storage cost, but they require a particular traversal order. Can we combine the best single-copy distribution method with our idea of merging multiple collision-streaming cycles to design a faster 3D memory-aware LBM? (2) If such a combination is possible, how can we additionally exploit spatial locality, given that normal 3D tiling [21] does not apply to this case? (3) When designing the parallel 3D memory-aware LBM, a non-trivial interaction occurs at the boundaries between threads; how can we guarantee thread safety and avoid race conditions? Although some existing works use wavefront parallelism to exploit temporal locality, they insert frequent layer-wise synchronizations among threads at every time step [11, 27]. In this paper, we aim to reduce the synchronization cost among parallel threads.

To the best of our knowledge, this paper makes the following contributions. First, we design both sequential and parallel 3D memory-aware LBM algorithms that combine five features: single-copy distribution, loop fusion (single sweep), the swap algorithm, prism traversal, and merging two collision-streaming cycles. Second, we present a parallelization method that preserves thread safety on the intersection layers between threads and reduces the synchronization cost. Finally, two groups of experiments are conducted on three different manycore architectures, followed by performance analysis. The first group of sequential experiments (i.e., using a single CPU core) shows that our memory-aware LBM outperforms the state-of-the-art Palabos (Fuse Swap Prism LBM solver) [17] by up to 19% on a Haswell CPU and 15% on a Skylake CPU. The second group evaluates the performance of the parallel algorithms. The experimental results show that our parallel 3D memory-aware LBM outperforms Palabos by up to 89% on a Haswell node with 28 cores, 85% on a Skylake node with 48 cores, and 39% on a Knights Landing node with 68 cores.

2 Related Work

Existing research on designing efficient LBM algorithms mainly focuses on optimizing memory accesses within one time step of LBM, due to its iterative nature. For instance, several LBM algorithms (e.g., swap [13, 25], AA [1], shift [19], and esoteric twist [6]) retain a single copy of the particle distribution data (i.e., “single-copy distribution”) and optimize the memory access pattern in the LBM streaming kernel, but each of them must follow a set of constraints (e.g., swap requires a predefined order of discrete cell velocities [9], AA requires distinguishing between even and odd time steps, shift requires extra storage [9], and esoteric twist requires only one version of the LB kernel [29]). [26] uses a moment-based representation with an extra distribution pseudo domain to further reduce the storage cost. Some works hide the inter-process communication cost on multicore accelerators [3], and achieve large-scale parallelization on HPC systems [20] and GPUs [1]. [31] introduces a cache-oblivious blocking 3D LBM algorithm, but it has an irregular parallelism scheme due to its recursive algorithm design. In summary, the above methods focus on optimizations within one time step. In contrast, our 3D memory-aware LBM adopts the efficient single-copy distribution scheme and designs new methodologies to merge two collision-streaming cycles, exploiting both temporal and spatial data locality at the same time to achieve higher performance.

Another category of works accelerates LBM by wavefront parallelism, which generally groups many threads to successively compute on the same spatial domain. [11] presents a shared-memory wavefront 2D LBM together with loop fusion, loop bump, loop skewing, loop tiling, and semaphore operations. However, due to the high synchronization cost incurred by the many implicit barriers in wavefront parallelism, their parallel version achieves only 10% speedup on average. [7] presents a shared-memory wavefront 3D LBM with two-copy distributions, and does not use spatial locality techniques such as loop fusion and loop blocking. [27] presents a shared-memory wavefront 3D Jacobi approach together with spatial blocking; it uses two-copy distributions and has simpler 6-neighbor dependencies (rather than the 19 or 27 neighbors in 3D LBM). [12] combines wavefront parallelism with diamond tiling. By contrast, our 3D memory-aware LBM does not use wavefront parallelism, but judiciously uses three light-weight synchronization barriers every two collision-streaming cycles. In addition, we partition the simulation domain and assign a local sub-domain to every thread, rather than having all threads work on the same sub-domain as in wavefront parallelism. In each sub-domain, each thread in our algorithm computes multiple time steps at once, rather than one thread computing one time step at a time as in wavefront parallelism. Moreover, each thread also uses the prism technique to optimize spatial locality. This strategy particularly favors new manycore architectures, which tend to have increasingly larger cache sizes.

Modern parallel software packages that support LBM can be classified into two categories based on their underlying data structures. One category adopts matrix-based memory alignment at the cell level (e.g., Palabos [10], OpenLB [8], HemeLB [14], HemoCell [30]). Since neighbors can easily be found through simple index arithmetic in this case, these packages are more suitable for simulations with dense geometries. The other category adopts adjacency-list data structures (e.g., Musubi [15], waLBerla [4], HARVEY [20]). They are often used for simulating domains with sparse and irregular geometries, but their cells require additional memory for pointers, which doubles the memory consumption in the worst case. In this study, we choose the widely used and efficient matrix-based data structure in the LBM community, and select the state-of-the-art Palabos library as the baseline, since Palabos provides a broad modeling framework, supports applications with complex physics, and shows high computational performance.

[18] designs a locally recursive non-locally asynchronous (LRnLA) conefold LBM algorithm, which uses recursive Z-curve arrays for data storage and recursively subdivides the space-time dependency graph into polytopes to update lattice nodes. In contrast, our work uses a more directly accessible matrix-based data storage layout and has a regular memory access pattern. Besides, our prism traversal can operate on the lattice nodes either independently or integrated with merging two time steps, while [18] operates on the dependency graph.

3 Baseline 3D LBM Algorithm

The baseline 3D LBM algorithm in this paper, called Fuse Swap LBM and shown in Algorithm 1, involves three features: single-copy distribution, the swap algorithm, and loop fusion. We choose the swap algorithm [13] because it is relatively simpler than the other single-copy distribution methods and can use simple index arithmetic to access neighbors efficiently in the matrix-based memory organization. The swap algorithm replaces the copy operations between a cell and its neighbors in the streaming kernel by a value swap, so it works in place and does not require a second copy. However, when combining it with loop fusion, we must guarantee that the populations of the neighbors involved in the swap are already in a post-collision state to keep thread safety [9].

The work-around is to adjust the traversal order of the simulation domain together with a predefined order of discrete cell velocities [9]. Each cell can then stream its post-collision data by swapping values with the half of its neighbors pointed to by the “red” arrows (directions 1–9 for D3Q19 in Fig. 1a), provided those neighbors have already collided and “reverted” their distributions. We define this operation as \(swap\_stream\). The “revert” operation in Fig. 1b lets a cell locally swap its post-collision distributions to the opposite directions. To make the Fuse Swap LBM more efficient, Palabos pre-processes and post-processes the boundary cells on the bounding box at lines 2 and 7, respectively, so that the boundary-checking operation can be removed from the inner bulk domain. Algorithm 1 is thus divided into three stages in every time step, as follows.

Fig. 1. Two operations and the three-stage computation used in the sequential 3D Fuse Swap LBM. (Color figure online)

Algorithm 1. Fuse Swap LBM.
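To make the two in-place operations concrete, the sketch below shows one possible realization of the revert and \(swap\_stream\) operations for a D3Q19 cell. The population ordering (index 0 for the rest population, indices 1–9 paired with their opposites 10–18) and the function signatures are assumptions of this illustration, not the actual Palabos kernels.

```cpp
#include <algorithm>  // std::swap

constexpr int Q    = 19;           // D3Q19: 1 rest population + 9 opposite pairs
constexpr int HALF = (Q - 1) / 2;  // 9 "lower-half" directions

// revert: locally exchange each post-collision population with its opposite
// direction, so that a later swap_stream issued by a not-yet-updated neighbor
// reads the value it expects.
void revert(double* f /* Q populations of one cell */) {
    for (int k = 1; k <= HALF; ++k)
        std::swap(f[k], f[k + HALF]);
}

// swap_stream: exchange populations with the 9 neighbors pointed to by the
// "red" arrows in Fig. 1a. nbr[k-1] is the neighbor cell in direction k; it
// must already be in a post-collision, reverted state. Streaming thus happens
// in place, with only a single copy of the distribution data.
void swap_stream(double* f, double* nbr[HALF]) {
    for (int k = 1; k <= HALF; ++k)
        std::swap(f[k], nbr[k - 1][k + HALF]);
}
```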

4 The 3D Memory-Aware LBM Algorithm

4.1 Sequential 3D Memory-Aware LBM

We design and develop the sequential 3D memory-aware LBM (Algorithm 2) based on the efficient Fuse Swap LBM, by adding two more features: merging two collision-streaming cycles to exploit temporal locality, and introducing the prism traversal to exploit spatial locality. Figure 2 shows how to merge two collision-streaming cycles on a \(4\times 4\times 4\) cube:

  1. Figure 2a shows the initial state of all cells at the current time step t. Green cells are on the boundaries, and blue cells are located in the inner bulk domain.

  2. In Fig. 2b, we compute the first collide, revert, and \(boundary\_swap\_stream\) row by row on the bottom layer iX = 1. After a cell completes the first computation, we change it to orange.

  3. In Fig. 2c, we compute the first collide and \(boundary\_swap\_stream\) row by row up to cell (2,2,1) on the second layer iX = 2.

  4. In Fig. 2d, cell (2,2,2) completes its first collide and \(swap\_stream\), so we change it to red since it is an inner cell. We then observe that cell (1,1,1) is ready for the second collide, so we change it to yellow.

  5. In Fig. 2e, we execute the second collide and \(boundary\_swap\_stream\) on cell (1,1,1), and change it to purple.

Fig. 2. 3D sequential two-step memory-aware LBM on a \(4\times 4\times 4\) cube lattice.

Fig. 3. Sequential 3D prism traversal on a \(4 \times 16 \times 16\) cuboid box.

To further increase data reuse, we optimize the algorithm’s spatial locality by designing a “prism traversal” method, so named because the traversal shape forms a 3D prism with either a pyramid-like or a parallelepiped cross-section. We use an example to explain its access pattern in a \(4 \times 16 \times 16\) cuboid with stride \(tile=4\). Figures 3a–d are the four separate \(16 \times 16\) layers of the cuboid from bottom to top. The cells with the same number on the four layers construct a prism (e.g., the cells numbered 1 in Fig. 3a–d construct the pyramid-shaped “Prism 1”). Within each prism, we still traverse first along the Z-axis, then along the Y-axis, and finally upward along the X-axis. We then traverse prism-wise from Prism 1 to Prism 30. If a cuboid is much larger than this example, the majority of prisms have “parallelepiped” shapes like Prisms 9 and 10 in Fig. 3e. The planar slices of a prism are triangles or parallelograms because of the \(swap\_stream\) operation. Cutting Fig. 1a (\(swap\_stream\)) along the Y-Z plane gives the planar slice shown in Fig. 3f. We observe that a cell (star) swaps with its lower-right neighbor (orange) in direction 9; in other words, when the orange cell swaps with the row above, its neighbor “shifts” one cell leftward. Similarly, cutting Fig. 1a along the X-Y plane shows that when a cell swaps data with the layer above, its neighbor “shifts” one cell forward. Thus, when we traverse tile cells along the Z-axis in row iY, they swap with tile cells shifted one cell leftward in row \(iY+1\), which yields the parallelograms in Fig. 3a–d. When the shift encounters domain boundaries, we truncate the parallelograms and obtain isosceles right triangles or partial parallelograms. Finally, we can safely combine the prism traversal with merging two collision-streaming cycles, since under this traversal order the cell at the left-forward-down corner is already in a post-collision state and ready for the second computation.
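The loop nest below sketches the skewed bounds that produce these parallelogram slices. The variable names, the shift directions, and the clipping at the boundaries reflect our illustrative reading of the traversal, not the paper’s exact code.

```cpp
#include <algorithm>  // std::max, std::min

// Placeholder for the fused collide + swap-based stream on one cell.
void collide_and_swap_stream(int iX, int iY, int iZ) { /* per-cell kernel */ }

// Prism traversal over an nx * ny * nz box with stride `tile`. Inside a prism
// the order is still Z, then Y, then X; the Y window shifts by one cell per
// layer and the Z window by one cell per row, so the planar slices are
// parallelograms that get truncated to triangles at the domain boundaries.
void prism_traversal(int nx, int ny, int nz, int tile) {
  for (int yBase = 1; yBase <= ny + nx; yBase += tile)         // prism anchors (Y)
    for (int zBase = 1; zBase <= nz + ny + nx; zBase += tile)  // prism anchors (Z)
      for (int iX = 1; iX <= nx; ++iX) {
        int ySkew = yBase - (iX - 1);                          // shift per layer
        for (int iY = std::max(1, ySkew);
                 iY <= std::min(ny, ySkew + tile - 1); ++iY) {
          int zSkew = zBase - (iY - ySkew);                    // shift per row
          for (int iZ = std::max(1, zSkew);
                   iZ <= std::min(nz, zSkew + tile - 1); ++iZ)
            collide_and_swap_stream(iX, iY, iZ);
        }
      }
}
```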

Algorithm 2. Sequential 3D memory-aware LBM.

Algorithm 2 presents the sequential 3D memory-aware LBM. Lines 6–10 traverse the domain prism-wise with stride tile. Lines 11–14 merge the computation of two time steps. The first stream starting from the bottom layer iX = 1 in Line 10 is necessary because of the data dependency of the second computation. In particular, the if-statement in Line 13 ensures that the cell to be computed at time step \(t+1\) is in a post-collision state, whether using D3Q15, D3Q19, D3Q27, or extended lattice models. For simplicity, Lines 16–29 define three helper functions.
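A stripped-down view of the merged sweep, without the prism bounds and boundary handling of Algorithm 2, looks as follows; the kernel names are placeholders assumed for this sketch.

```cpp
// Placeholders for the fused per-cell kernels of time steps t and t+1.
void first_collide_swap_stream(int iX, int iY, int iZ)  { /* collide + swap_stream at t   */ }
void second_collide_swap_stream(int iX, int iY, int iZ) { /* collide + swap_stream at t+1 */ }

// Merged two-step sweep over an nx * ny * nz box (prism bounds omitted).
void two_step_sweep(int nx, int ny, int nz) {
  for (int iX = 1; iX <= nx; ++iX)
    for (int iY = 1; iY <= ny; ++iY)
      for (int iZ = 1; iZ <= nz; ++iZ) {
        first_collide_swap_stream(iX, iY, iZ);
        // Once (iX, iY, iZ) has finished time step t, the diagonally lower
        // cell has received all post-collision data it depends on, so time
        // step t+1 can run on it while both cells are still in cache.
        if (iX > 1 && iY > 1 && iZ > 1)
          second_collide_swap_stream(iX - 1, iY - 1, iZ - 1);
      }
  // A wrap-up pass (omitted) finishes the second step for the cells on the
  // last layer, last rows, and last columns that the condition above skips.
}
```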

4.2 Parallel 3D Memory-Aware LBM

To support manycore systems, we use OpenMP [16] to implement the parallel 3D memory-aware LBM algorithm. Figure 4 illustrates the idea on an \(8\times 4\times 4\) cuboid, which is evenly partitioned between two threads along the X-axis (height). Each thread then traverses a \(4\times 4\times 4\) sub-domain with prism stride \(tile=4\). Line 4 in Algorithm 3 defines the start and end layer indices of each thread’s sub-domain; the end layers myEndX are “intersections” (e.g., layers 4 and 8). Figure 4a shows the initial state at time step t. The parallel 3D memory-aware Algorithm 3 consists of three stages: Preprocessing, Sub-domain computation, and Post-processing.
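A minimal sketch of the X-axis partition is shown below, assuming the height nx is divisible by the number of threads; the variable names follow the text, while the helper function itself is illustrative.

```cpp
#include <omp.h>

void partition_along_x(int nx) {
  #pragma omp parallel
  {
    int nThreads = omp_get_num_threads();
    int tid      = omp_get_thread_num();
    int layers   = nx / nThreads;            // assume nx % nThreads == 0
    int myStartX = tid * layers + 1;         // first layer of this thread's slab
    int myEndX   = myStartX + layers - 1;    // "intersection" layer shared with thread tid+1
    // Stages I-III of Algorithm 3 then operate on layers [myStartX, myEndX].
    (void)myStartX; (void)myEndX;
  }
}
```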

Fig. 4. Parallel 3D two-step memory-aware LBM on an \(8\times 4\times 4\) cuboid.

Algorithm 3. Parallel 3D memory-aware LBM.
  1. Stage I (Preprocessing), line 5 in Algorithm 3: In Fig. 4b, threads 0 and 1 compute the first collide and revert on the “intersection” layers 4 and 8, respectively, and then change them to pink.

  2. Stage II (Sub-domain computation) handles five cases, described in steps 2 to 7. In case 0 (lines 15–17 in Algorithm 3), when threads 0 and 1 access the cells on the first row and column of each layer except the “intersection” layers, we execute the first \(boundary\_cell\_comp\) on them and change them to orange.

  3. Figure 4c shows case 1 (lines 18–19 in Algorithm 3). When threads 0 and 1 access the cells on layer myStartX (iX = 1 & 5), respectively, we execute \(adaptive\_collide\_stream\) on them at time step t, and then change the boundary cells to orange and the inner cells to red.

  4. Figure 4d shows case 2 (lines 20–23 in Algorithm 3). When threads 0 and 1 are on layer \(myStartX+1\) (iX = 2 & 6), respectively, we execute the first \(adaptive\_collide\_stream\) at time step t and change boundary cells to orange and inner cells to red. Meanwhile, cells (5,1,1) and (1,1,1) have collected the data dependencies to collide at time step \(t+1\), so we execute the second collide and revert (but without stream) on them, and change them to light purple.

  5. Figure 4e shows that, when continuing the traversal in Prism 1, threads 0 and 1 are on layers iX = 3 & 6. Since the cells traversed in this figure are in the first row and column, case 0 applies here; otherwise, case 4 applies.

  6. Figure 4f shows case 3 (lines 24–27 in Algorithm 3). When threads 0 and 1 are on the intersection layers (iX = 4 & 8), we execute the remaining first stream at time step t, owing to the preprocessing in Stage I. Then, if the cells one layer below (iX = 3 & 7) have collected their data dependencies at time step \(t+1\), we execute the second \(adaptive\_collide\_stream\) on them.

  7. Figure 4g shows case 4 (lines 28–31 in Algorithm 3). When threads 0 and 1 are on the other layers of their sub-domains, we conduct the first \(adaptive\_collide\_stream\) on (innerX, innerY, innerZ) at time step t, and then the second \(adaptive\_collide\_stream\) on (innerX-1, innerY-1, innerZ-1) at time step \(t+1\). We then call \(boundary\_neighbor\_handler\) to compute the neighbors of (innerX, innerY, innerZ) at certain locations at time step \(t+1\).

  8. Stage III (Post-processing), lines 33–35 in Algorithm 3: First, since Stage I and case 3 have completed the first computation on the intersection layers, we wrap up the second collide and stream on the intersections. Second, since case 2 has executed the second collide and revert on the first layer myStartX of each sub-domain, the second stream remains to be executed there.

Fig. 5. Handle thread safety on intersection layers.

How to Handle Thread Safety near Intersection Layers: We aim to keep thread safety while minimizing the synchronization cost during parallel execution. To this end, we carefully design the initial state of each thread so that the majority of the computation stays in each thread’s local sub-domain. The left part of Fig. 5 shows the view of Fig. 4 in the X-Z plane; layer 4 is the intersection layer that separates the two threads’ sub-domains. The right part shows the data dependencies near the intersection layer over two time steps. In the figure, the red block represents Stage I of Algorithm 3, yellow blocks Stage II, and green blocks Stage III. An arrow indicates that data are transferred from layer A to layer B by a procedure (i.e., B depends on A). Three non-trivial dependencies require care to keep thread safety near intersection layers. (1) Since the swap algorithm only streams data to half of the neighbors, on its own layer and the layer below, the \(swap\_stream\) on layer 5 (the first layer of thread 1’s sub-domain) must be delayed until after the revert on layer 4 in thread 0’s sub-domain. Thus, in Stage I, we pre-process collide and revert at time step t, but without stream, on layer 4, since the stream on layer 4 depends on the post-collision state of layer 3, which has not been computed yet. (2) In Stage II, the second \(swap\_stream\) on layer 6, called by the case 4 procedure, must be delayed until after the second revert (without \(swap\_stream\)) on layer 5, because thread 1 cannot guarantee that thread 0 has completed the second \(swap\_stream\) on layer 4. To keep thread safety, the second \(swap\_stream\) on layer 5 is therefore delayed to Stage III. (3) Consequently, in Stage III, the second \(swap\_stream\) on layer 5 is executed after the second \(swap\_stream\) on layer 4. Overall, since the major computation happens in Stage II within each thread’s sub-domain, we avoid the frequent “layer-wise” thread synchronizations that occur in wavefront parallelism. Moreover, we only synchronize at the intersection layers every two time steps, so the overhead of the three barriers in Algorithm 3 is much lower.
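The resulting synchronization pattern can be summarized as in the sketch below; the three stage functions are placeholders for the corresponding parts of Algorithm 3, and the partition variables are computed as in the earlier partition sketch.

```cpp
#include <omp.h>

// Placeholders for the three stages of Algorithm 3 (per two merged time steps).
void stage1_preprocess(int myEndX)                { /* collide + revert on layer myEndX    */ }
void stage2_subdomain(int myStartX, int myEndX)   { /* cases 0-4 over [myStartX, myEndX]   */ }
void stage3_postprocess(int myStartX, int myEndX) { /* remaining streams near boundaries   */ }

// Three light-weight barriers per two time steps, instead of the layer-wise
// synchronization of wavefront schemes.
void run(int nx, int nSteps) {
  #pragma omp parallel
  {
    int layers   = nx / omp_get_num_threads();
    int myStartX = omp_get_thread_num() * layers + 1;
    int myEndX   = myStartX + layers - 1;
    for (int t = 0; t < nSteps; t += 2) {
      stage1_preprocess(myEndX);
      #pragma omp barrier   // intersection layers ready for neighbors' swap_stream
      stage2_subdomain(myStartX, myEndX);
      #pragma omp barrier   // all sub-domains finished before cross-boundary wrap-up
      stage3_postprocess(myStartX, myEndX);
      #pragma omp barrier   // both time steps complete before the next pair starts
    }
  }
}
```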

5 Experimental Evaluation

In this section, we first present the experimental setup and the validation of our 3D memory-aware LBM. Then we evaluate its sequential and parallel performance.

5.1 Experiment Setup and Verification

The details of our experimental hardware platforms are provided in Table 1. To evaluate the performance of our new algorithms, we use the 3D lid-driven cavity flow simulation as an example. The 3D cavity has a dimension of \(lz \times ly \times lx\), and its top lid moves with a constant velocity v. Our 3D memory-aware LBM algorithms are implemented as C++ template functions, which are then added to the Palabos framework. For verification, we construct a cavity with the same procedure and then separately execute four algorithms on it, i.e., the Palabos solvers fuse() and \(fuse\_prism()\) for N time steps, and our memory-aware algorithms \(two\_step\_prism()\) and \(two\_step\_prism\_omp()\) for N/2 time steps. We then compute the velocity norm of each cell and write it to four separate logs. Finally, we verify that our algorithms produce the same results as Palabos to guarantee software correctness.
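For instance, the per-cell velocity-norm logs can be compared as sketched below; the file format and tolerance are assumptions of this illustration, and in practice we check that the results agree.

```cpp
#include <cmath>
#include <fstream>
#include <string>

// Compare two whitespace-separated logs of per-cell velocity norms.
bool logs_match(const std::string& fileA, const std::string& fileB,
                double tol = 1e-12) {
  std::ifstream fa(fileA), fb(fileB);
  double va = 0.0, vb = 0.0;
  for (;;) {
    bool ra = static_cast<bool>(fa >> va);
    bool rb = static_cast<bool>(fb >> vb);
    if (!ra || !rb) return ra == rb;           // both logs must end together
    if (std::fabs(va - vb) > tol) return false;
  }
}
```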

Table 1. Details of our experimental platforms.

5.2 Performance of Sequential 3D Memory-Aware LBM

The first set of experiments with 3D cavity flows compares the sequential performance of four LBM algorithms: the Fuse Swap LBM (with/without prism traversal) and the two-step memory-aware LBM (with/without prism traversal). For brevity, we abbreviate them as fuse LBM, fuse prism LBM, 2-step LBM, and 2-step prism LBM, respectively. The problem inputs are 3D cubes with edge sizes \(L = 64\) to 896. Each algorithm with a given prism stride configuration is executed five times, and the average MFLUPS (million fluid lattice-node updates per second) is calculated. For the “prism” algorithms, different prism strides (8, 16, 32, \(\ldots\), 448) are tested, and we report the best performance achieved.
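For reference, the metric is computed as sketched below (our reading of MFLUPS; the variable names are illustrative).

```cpp
// MFLUPS = (fluid lattice-node updates) / (elapsed seconds * 1e6).
double mflups(long long fluidNodes, long long timeSteps, double elapsedSeconds) {
  return static_cast<double>(fluidNodes) * static_cast<double>(timeSteps)
         / (elapsedSeconds * 1.0e6);
}
```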

Fig. 6. Sequential performance of four LBM algorithms on three types of CPUs.

Figure 6 shows the sequential performance on three types of CPUs. For small edge sizes (e.g., \(L = 64, 128\)), 2-step LBM is the fastest. But when \(L\ge 256\), 2-step prism LBM performs best and is up to 18.8% and 15.5% faster than the second-fastest Palabos (Fuse Prism LBM solver) on Haswell and Skylake, respectively. Since KNL does not have an L3 cache, 2-step prism LBM is only 1.15% faster than Palabos (Fuse Prism LBM solver) there.

We observe that the performance of the algorithms without prism traversal starts to drop when \(L\ge 384\). Since the swap algorithm streams to half of its neighbors on its own layer and the layer below, the working set is \(23.9\,\mathrm{MB/layer} \times 2\,\mathrm{layers} = 47.8\,\mathrm{MB}\) when \(L=384\), which exceeds the L3 cache size (35 MB per socket on Haswell). Thus we need to exploit spatial locality by adding the prism traversal feature. Consequently, on Haswell and Skylake, fuse LBM is improved by up to 71.7% and 58.2%, respectively, and 2-step LBM is improved by up to 28.6% and 50.4%, respectively. When only adding the feature of merging two steps, 2-step LBM is faster than Palabos (Fuse) by up to 53.3% on Haswell and 20.5% on Skylake. Hence, we conclude that both prism traversal and merging two steps significantly increase cache reuse on large domains.

Fig. 7. Memory usage on two sockets of a Haswell node.

In Fig. 6, we also observe that the performance of all algorithms starts to drop when \(L \ge 768\) on Haswell and \(L = 896\) on Skylake. To find the reason, we use Remora [22] to monitor the memory usage on each socket of the Haswell node. As L increases from 640 to 896, the memory usage on socket 1 (red area) in Fig. 7a–c grows from 2.4 GB to 63.9 GB. When the memory usage exceeds the 64 GB DRAM capacity per socket on the Haswell node, remote NUMA memory accesses are involved, and the sequential performance drops. Similar results are observed on the Skylake node. Because the KNL node has only one socket, its performance does not drop.

5.3 Performance of Parallel 3D Memory-Aware LBM

Given N cores, the Palabos LBM solvers partition the simulation domain evenly along three axes over \(N_{z} \times N_{y} \times N_{x} = N\) MPI processes, following the underlying memory layout of cells along the Z-axis first, then Y, and X last. Our 3D memory-aware LBM, however, partitions a domain only along the X-axis over N OpenMP threads. Hence, the Palabos LBM solvers have a smaller Y-Z layer size per core than our algorithm and enjoy closer memory page alignment, especially for large domains. To exclude the effect of the different partitioning methods, while the Palabos LBM solvers still take cubes as input, our 3D memory-aware LBM takes two different inputs. First, it takes an “equivalent dimension” of those cubes, such that a thread in our algorithm and a process in Palabos compute sub-domains with the same dimensions after their respective partitioning. Second, it simply takes the identical cube input.
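Our reading of the “equivalent dimension” is sketched below: the cube is shrunk along Z and Y by the Palabos process counts and stretched along X by the total core count, so that one OpenMP thread’s slab matches one MPI process’s sub-domain. The struct and function are purely illustrative, not the code used to build Table 2.

```cpp
struct Dim { int lz, ly, lx; };

// Palabos splits an L^3 cube over Nz x Ny x Nx = N processes; our algorithm
// splits only along X over N threads. The equivalent input keeps the total
// cell count and gives each thread a (L/Nz) x (L/Ny) x (L/Nx) slab.
Dim equivalent_input(int L, int Nz, int Ny, int Nx) {
  return { L / Nz, L / Ny, (L / Nx) * (Nz * Ny * Nx) };
}
```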

Table 2. Equivalent input used by 2-step prism LBM when the input of Palabos LBM solvers is a cube with \(L = 840\) on a Haswell node.
Fig. 8. Strong scalability performance on three types of compute nodes. “2-step prism eqv” = parallel 3D memory-aware LBM taking the equivalent input of the cubes.

Figure 8 shows the strong scalability of three LBM algorithms on three types of compute nodes. The inputs of the Palabos LBM solvers are cubes with edge size L from small to large. Table 2 gives an example of the equivalent input used by the 3D memory-aware LBM when the Palabos LBM solvers use a cube with \(L = 840\) on a Haswell node. We observe that the 2-step prism LBM scales efficiently and always achieves the best performance. (1) When using the equivalent input of the cubes, for small-scale cubes (\(L=112, 192, 272\)) in Fig. 8a, d and g, the 3D memory-aware LBM (green legend) is faster than the second-fastest Palabos (Fuse Prism) (orange legend) by up to 89.2%, 84.6%, and 38.8% on the Haswell, Skylake, and KNL nodes, respectively. The missing L3 cache on KNL prevents a speedup similar to the other two CPUs. In Fig. 8b, e and h, for the middle-scale cubes (\(L=448, 576, 476\)), it is still faster than Palabos (Fuse Prism) by up to 37.9%, 64.2%, and 28.8% on the three nodes, respectively. Due to the unbalanced numbers of processes assigned along the three axes, the performance of Palabos Fuse and Fuse Prism drops at some core counts. In Fig. 8c, f and i, for the large-scale cubes (\(L=840, 960, 680\)), it is still faster than Palabos (Fuse Prism) by up to 34.2%, 34.2%, and 31.8%, respectively. (2) When using the identical cube input, although our 3D memory-aware LBM has larger Y-Z layer sizes, it is still faster than Palabos (Fuse Prism), albeit with smaller speedups, i.e., by up to 21.1%, 54.7%, and 30.1% on the three nodes, respectively. The reduced speedup suggests, as future work, partitioning the 3D domain along all three axes to benefit from the closer memory page alignment of smaller Y-Z layers.

6 Conclusion

To address the memory-bound limitation of LBM, we design a new parallel 3D memory-aware LBM algorithm that systematically combines single-copy distribution, single sweep (loop fusion), the swap algorithm, prism traversal, and merging two collision-streaming cycles. We also keep thread safety and reduce the synchronization cost in parallel. The parallel 3D memory-aware LBM outperforms state-of-the-art LBM software by up to 89.2% on a Haswell node, 84.6% on a Skylake node, and 38.8% on a Knights Landing node. Our future work is to merge more time steps and to extend the algorithm to distributed-memory systems and GPUs.