1 Introduction

In electro-chemistry, density functional theory (DFT) plays an important role as a class of models to calculate the electrical potential imposed by the charges of an ensemble of atom nuclei and electrons [7, 11]. One essential step in DFT is the solution of the potential equation, which reduces to Poisson’s equation in the case of a homogeneous dielectricity coefficient [12]. However, the effects of an ionic solvent with varying dielectricity often cannot be neglected. The governing equation in this case is given as

$$\begin{aligned} - \nabla \cdot k(x,y,z) \nabla u(x,y,z) = f(x,y,z), \end{aligned}$$
(1)

where \(k\) denotes the dielectricity coefficient, \(u\) the potential field, and \(f\) the right-hand side. For the sake of simplicity, we assume Dirichlet conditions at the boundaries of the simulation domain \(\varOmega \). The paper is structured as follows:

The remaining part of this section introduces the software package HHG (Hierarchical Hybrid Grids) and three peta-scale-class HPC systems. The second section describes a novel hybrid parallelization strategy implemented within HHG to allow extreme-scale simulations on the clusters JUQUEEN and SuperMUC. Scalability experiments and a performance analysis on the different clusters are then presented.

1.1 Parallel Multigrid with Hierarchical Hybrid Grids

For solving partial differential equations (PDEs), finite element (FE) methods are a popular discretization scheme, since they allow flexible, unstructured meshes. The framework HHG [2, 8] is designed to combine the flexibility of the FE method and the superb performance of geometric multigrid [3, 10] by using a compromise between structured and unstructured grids. A coarse input FE mesh is organized into the grid primitives vertices, edges, faces, and volumes. The primitives are then refined in a structured way (see Fig. 1), resulting in semi-structured meshes. The regularity of the resulting grids can be exploited in such a way that it is no longer necessary to assemble the global discretization matrix explicitly. In particular, given an appropriate input grid, the discretization matrix can be defined implicitly using stencils for each structured patch, where a stencil represents a row of the global stiffness matrix. Within HHG, we have implemented an MPI-parallel geometric multigrid method that operates on the resulting block-structured grid hierarchy. The multigrid components and parameters used in this paper are three Gauss-Seidel iterations for the pre- and post-smoothing steps, linear interpolation between six multigrid levels, a parallel Conjugate Gradient algorithm to solve the coarsest grid problem, and direct coarse grid approximation with coefficient averaging.
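The paper itself does not show pseudocode; as a minimal, self-contained illustration of the cycle structure (and explicitly not the HHG implementation, which works on 3D tetrahedral macro elements), the following sketch applies a V(3,3)-cycle over six levels to a 1D Poisson problem. The 50 Gauss-Seidel sweeps on the coarsest grid merely stand in for the parallel Conjugate Gradient solver mentioned above; all names are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Minimal 1D Poisson V-cycle (-u'' = f, homogeneous Dirichlet BCs), intended
// only to illustrate the cycle structure described in the text: 3 Gauss-Seidel
// pre-/post-smoothing sweeps per level and a coarse-grid solve at the bottom.
using Vec = std::vector<double>;

static void gauss_seidel(Vec& u, const Vec& f, double h, int sweeps) {
    for (int s = 0; s < sweeps; ++s)
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
}

static Vec residual(const Vec& u, const Vec& f, double h) {
    Vec r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

static Vec restrict_full_weighting(const Vec& r) {
    Vec rc((r.size() + 1) / 2, 0.0);      // fine size 2^l+1 -> coarse size 2^{l-1}+1
    for (std::size_t i = 1; i + 1 < rc.size(); ++i)
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1];
    return rc;
}

static void prolongate_add(const Vec& ec, Vec& u) {
    for (std::size_t i = 0; i + 1 < ec.size(); ++i) {
        u[2 * i]     += ec[i];
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]);   // linear interpolation
    }
}

static void vcycle(Vec& u, const Vec& f, double h, int level) {
    if (level == 0) {                      // coarsest grid
        gauss_seidel(u, f, h, 50);         // stands in for the parallel CG solver
        return;
    }
    gauss_seidel(u, f, h, 3);              // pre-smoothing
    Vec rc = restrict_full_weighting(residual(u, f, h));
    Vec ec(rc.size(), 0.0);
    vcycle(ec, rc, 2.0 * h, level - 1);    // coarse-grid correction
    prolongate_add(ec, u);
    gauss_seidel(u, f, h, 3);              // post-smoothing
}

int main() {
    const int levels = 6;                  // six grid levels, as in the paper's setting
    const int n = (1 << levels) + 1;       // 65 fine-grid points
    Vec u(n, 0.0), f(n, 1.0);
    const double h = 1.0 / (n - 1);
    for (int k = 0; k < 10; ++k) {
        vcycle(u, f, h, levels - 1);       // level indices levels-1 .. 0
        double rmax = 0.0;
        for (double ri : residual(u, f, h)) rmax = std::fmax(rmax, std::fabs(ri));
        std::printf("V-cycle %d: max residual %.3e\n", k + 1, rmax);
    }
}
```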

Fig. 1.

Splitting of two triangle input elements into HHG grid primitives after two steps of refinement. Additionally, the memory representation of a refined triangle with a 7-point stencil for the lower left inner point is sketched [9].

The stencils can be stored in registers when the dielectricity is piecewise constant, or they can be assembled on the fly for a variable dielectricity. In both cases this results in a so-called matrix-free implementation, which can have significant performance benefits since it reduces the memory traffic, possibly at the expense of redundant computations.
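A minimal sketch of what such a matrix-free stencil application could look like on a single structured patch is given below (a square block for simplicity; HHG patches are refined triangles and tetrahedra with varying row lengths). The constant-coefficient variant keeps the stencil in registers — for constant \(k\) on a uniform right-triangle patch the 7-point stencil of Fig. 1 reduces to the familiar 5-point stencil, since the two diagonal couplings vanish. The variable-coefficient variant assembles the row entries on the fly, here with simple coefficient averaging as a stand-in for the actual finite element integration. Function names are illustrative and not part of the HHG API.

```cpp
#include <cstddef>
#include <vector>

// Matrix-free application of -div(k grad u) on one structured patch.

void apply_constant(const std::vector<double>& u, std::vector<double>& v,
                    std::size_t nx, std::size_t ny, double k) {
    // Constant k: the stencil entries are compile-time/loop-invariant values
    // that the compiler can keep in registers.
    for (std::size_t j = 1; j + 1 < ny; ++j)
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const std::size_t c = j * nx + i;
            v[c] = k * (4.0 * u[c] - u[c - 1] - u[c + 1] - u[c - nx] - u[c + nx]);
        }
}

void apply_variable(const std::vector<double>& u, const std::vector<double>& k,
                    std::vector<double>& v, std::size_t nx, std::size_t ny) {
    // Variable k: the stencil (one matrix row) is assembled on the fly; the
    // coefficient averaging below is only a stand-in for the FE integration.
    for (std::size_t j = 1; j + 1 < ny; ++j)
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const std::size_t c = j * nx + i;
            const double kw = 0.5 * (k[c] + k[c - 1]),  ke = 0.5 * (k[c] + k[c + 1]);
            const double ks = 0.5 * (k[c] + k[c - nx]), kn = 0.5 * (k[c] + k[c + nx]);
            v[c] = (kw + ke + ks + kn) * u[c]
                 - kw * u[c - 1] - ke * u[c + 1] - ks * u[c - nx] - kn * u[c + nx];
        }
}
```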

1.2 Architectures

Within this article we compare the performance of HHG on three European supercomputers: JUGENE and JUQUEEN, which are both located at FZ Jülich, and SuperMUC, which is located at the LRZ supercomputing center in Garching. Table 1 presents a system overview of these clusters.

Table 1. System overview of the IBM clusters JUGENE, JUQUEEN, and SuperMUC.

JUGENE was the largest BlueGene/P installation, with \(294\,912\) compute cores and 1 petaflop/s peak performance. Each node was equipped with a PowerPC 450 quad-core processor running at a low clock frequency. The architecture provided a very high main memory performance. A three-dimensional torus network in combination with a tree-based collective network was available for communication. With the HHG package, we were able to solve FE systems with more than \(10^{12}\) degrees of freedom on JUGENE. We use these three-year-old performance results obtained on JUGENE as a reference for our new results.

The BlueGene/Q system JUQUEEN is the successor of JUGENE, with a peak performance of \(5.9\) petaflop/s. Although the clock frequency still remains relatively low, it has nearly doubled. Each of the 16 cores available for user applications has four hardware (HW) threads. The memory bandwidth has not scaled up accordingly, but in order to compensate for this disadvantage in part, the prefetching and speculative execution facilities, among others, have been improved. The torus network has been extended to five dimensions for shorter paths, and the collective network has been fused into the torus network. The ratio of peak network bandwidth to peak floating-point performance per node is only 50 % of that of BlueGene/P. On the other hand, the number of cores within each node, and consequently the intra-node communication performance, has increased drastically.

SuperMUC is a \(3.2\) petaflop/s IBM System x iDataPlex cluster. This machine consists of 18 thin islands, carrying \(97.5\,\%\) of the total performance, and one fat island for moderately parallel, memory-intensive applications. Each thin island is equipped with 512 compute nodes. Two sockets with Sandy Bridge-EP Intel Xeon E5-2680 8C processors provide 16 physical cores per node. The Xeon processors deliver a significantly higher core and node performance than the PowerPCs in the IBM BlueGene architectures, at the price of a higher power consumption. The nodes within an island are linked by a non-blocking Infiniband tree, whereas a pruned 4:1 tree connects the islands.

2 Porting Hierarchical Hybrid Grids to BlueGene/Q

For the substantial changes that were necessary to use HHG on more than \(30\,000\) parallel threads, we refer to [9]. This includes the design of data structures for generating tetrahedral input grids efficiently in parallel. However, for the current and upcoming systems this alone proved insufficient, and thus this paper presents a new hybrid parallelization strategy.

The new system architectures with more powerful and complex compute nodes make a hybrid parallelization approach especially attractive and potentially profitable, since they provide better opportunities for a shared memory parallelization via OpenMP. Thus a hybrid parallelization strategy, combining message passing for coarse-grain parallelism with shared memory parallelism within a node for finer-scale parallel execution, has been found essential for exploiting the full potential of architectures like JUQUEEN or SuperMUC.

In a pure MPI parallel setting, the available main memory on JUQUEEN is only 256 MB per process. This is too small for the three largest runs described in the next sections. In contrast, a hybrid parallelization increases the available main memory for each process. On SuperMUC, the scaling breaks down when too many MPI processes are used. A hybrid parallelization limits the total number of MPI processes and thus helps to maintain scalability for extreme-scale simulations.
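As a minimal sketch of such a hybrid setup (an illustration, not the HHG start-up code): a few MPI processes per node handle the coarse-grain, inter-node parallelism, while OpenMP threads inside each process provide the node-level parallelism. The thread level requested here (FUNNELED, i.e. only the master thread calls MPI) matches a configuration in which the MPI communication itself is not OpenMP-parallel.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    // request thread support where only the master thread performs MPI calls
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        // fine-grain work (smoothing, residuals, ghost-layer packing) would be
        // distributed over the OpenMP threads here
        #pragma omp master
        std::printf("rank %d of %d runs %d OpenMP threads\n",
                    rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```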

Table 2. Efficiency of the hybrid parallelization compared to a pure MPI parallel approach on JUQUEEN for moderate problem sizes.

The current OpenMP implementation in HHG supports parallelism inside the kernel executions and the copying of ghost layers for several primitives. The MPI communication is executed asynchronously, but it is not itself OpenMP-parallel. Further, OpenMP introduces an additional overhead for spawning threads, which is especially critical on the coarsest grids, where the workload per thread is small. The quality of the MPI/OpenMP parallel execution is reflected in Table 2. All runs except the last two are executed with four threads per compute core. From the timings we conclude that the serial fraction of the code is still between \(1\) and \(2\,\%\). In the following scaling experiment we will use a hybrid parallelization with up to eight OpenMP threads for the largest parallel run on JUQUEEN, since the performance loss is still acceptable.
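The following sketch illustrates this kind of kernel-level OpenMP parallelism: the ghost-layer exchange is posted asynchronously (non-blocking MPI) by a single thread, while the interior update is shared among the OpenMP threads. The `if` clause is one common way to avoid the thread-spawning overhead on very small (coarse) grids; the names, the Jacobi-type sweep, and the threshold are illustrative and not taken from the HHG code.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

void jacobi_interior(const std::vector<double>& u, std::vector<double>& u_new,
                     const std::vector<double>& f, std::size_t nx, std::size_t ny) {
    // relax only the rows that do not depend on the ghost layers
    // (the grid spacing is assumed to be folded into f for brevity)
    #pragma omp parallel for if (nx * ny > 4096) schedule(static)
    for (std::size_t j = 2; j < ny - 2; ++j)
        for (std::size_t i = 1; i < nx - 1; ++i) {
            const std::size_t c = j * nx + i;
            u_new[c] = 0.25 * (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx] + f[c]);
        }
}

void exchange_and_relax(std::vector<double>& u, std::vector<double>& u_new,
                        const std::vector<double>& f, std::size_t nx, std::size_t ny,
                        int north, int south, MPI_Comm comm) {
    // only one of the two exchange directions is shown for brevity
    MPI_Request req[2];
    MPI_Irecv(u.data(), static_cast<int>(nx), MPI_DOUBLE, south, 0, comm, &req[0]);
    MPI_Isend(&u[(ny - 2) * nx], static_cast<int>(nx), MPI_DOUBLE, north, 0, comm, &req[1]);

    jacobi_interior(u, u_new, f, nx, ny);   // threaded work overlaps communication

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    // the rows adjacent to the ghost layers would be relaxed here
}
```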

2.1 Weak Scaling on JUQUEEN

This section shows the scalability of the HHG approach on a current cluster. The program is compiled with the IBM XL compiler suite on both BlueGene clusters. As a test case we use a piecewise constant dielectricity, and thus can use constant stencils within each HHG block and each geometric primitive. Consequently, the numerical efficiency is extremely high and, in a relative sense, the communication is very intensive. This is therefore quite a challenging setup for maintaining parallel scalability, as we will show in the performance study in the next section. Table 3 shows the run-time results of a scaling experiment. The smallest test run already solves a system with slightly more than \(10^8\) unknowns, and one V-cycle takes approximately \(2.3\,\mathrm{{s}}\). Note that this is performed on a single compute node of JUQUEEN, demonstrating the high efficiency of the HHG approach. In each further row of the table, the problem size is doubled as well as the number of nodes; this is a classical weak scalability test. The full machine could eventually solve a linear system with \(3.3\cdot 10^{12}\) unknowns, corresponding to more than \(10^{13}\) tetrahedral finite elements. In total, this computation uses \(300\,\mathrm{TB}\) out of the almost \(400\,\mathrm{TB}\) of main memory during the solution process.
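A rough plausibility check, using only the numbers quoted above and assuming decimal terabytes:

$$\begin{aligned} \frac{300\,\mathrm{TB}}{3.3\cdot 10^{12}\ \text{unknowns}} \approx 91\ \text{bytes per unknown} \approx 11\ \text{double-precision values per unknown,} \end{aligned}$$

i.e. roughly a dozen double-precision values per unknown across the whole multigrid hierarchy and all auxiliary data.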

Table 3. Weak scaling experiment on JUQUEEN solving a problem on the full machine.

Four hardware threads are necessary to saturate the performance of one processor core, leading to a parallel execution with more than one million threads. Although the computational time increases only moderately, we note that the coarse grid solver is only a straightforward Conjugate Gradient (CG) iteration. Therefore, in the large runs, more than half a second of the V-cycle execution time is spent in the increasing number of CG iterations on the coarsest grid, caused by the coarse grid problem becoming larger and larger. This clearly shows that for perfect asymptotic scalability a better coarse grid solver would be necessary. Nevertheless, we believe that our results with the CG solver indicate clearly that the coarse grid solver performance is not as critical for scalability as has been discussed in the older literature on parallel multigrid methods.

2.2 Comparison of Scalability Results on Other Peta-scale Clusters

In this section, we compare the parallel efficiency of our code on the different HPC clusters. In contrast to the BlueGene systems, the program is compiled for SuperMUC with the Intel compiler suite (using the compiler flags -O3 -xavx) and IBM MPI. As a reference, one V-cycle on a single compute node takes 4.25 s on JUGENE and 1.18 s on SuperMUC. Figure 2 shows strong efficiency drops when advancing from one node to several nodes. This is especially prominent on both BlueGene systems. From there onwards to larger parallel runs, however, the parallel efficiency stays nearly constant. Only the transitions from a single midplane on BlueGene/P, or from one node card on BlueGene/Q, to larger portions of the architecture again induce more significant performance drops. On SuperMUC, the efficiencies up to a quarter island (\(2.6\cdot 10^{10}\) unknowns) differ between the multigrid cycles. We believe that this is caused by perturbations due to other applications running simultaneously on the same island. From a quarter island to half an island (\(5.2\cdot 10^{10}\) unknowns) the performance even improves. However, when leaving a single island of the architecture, the parallel efficiency drops significantly. This is likely caused by the reduced communication performance beyond a single island in the pruned 4:1 tree. For more than two islands we also disable hyperthreading to obtain substantially more reproducible run-times. In contrast to this observation, the run-times on both BlueGene machines remain more stable for all problem sizes. SuperMUC presents the best parallel efficiency. However, we could not map our mesh onto the torus networks, since the coarsest mesh is basically unstructured.
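Throughout this comparison, parallel efficiency is understood in the usual weak-scaling sense (the text does not spell out the definition, so this is the standard convention): with a fixed problem size per node, \(T(1)\) the time per V-cycle on one node, and \(T(N)\) the time on \(N\) nodes,

$$\begin{aligned} E(N) = \frac{T(1)}{T(N)}, \end{aligned}$$

so that ideal weak scaling corresponds to \(E(N) = 1\).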

Fig. 2.

Parallel efficiencies in a weak scaling experiment on three different supercomputers. The largest runs required a hybrid parallelization strategy. Some hardware structures (node card, midplane, island) of the clusters can be identified by the gradient of the parallel efficiency.

First scaling experiments on SuperMUC showed a breakdown at 65 536 MPI processes, resulting in roughly four times longer run-times, as well as fluctuations of up to \(15\,\mathrm{s}\) in the timings of individual V-cycles, compared to runs with 32 768 MPI processes. Figure 2 shows that problems already occur with 32 768 MPI processes (corresponding to \(4.1 \cdot 10^{11}\) unknowns or 4 islands).

The results for larger machine sizes use the hybrid parallelization, which allows us to execute the two largest runs with only 16 384 MPI processes, leading to a significantly improved parallel efficiency. The largest run was carried out on 16 islands of the cluster. In contrast to the behavior on SuperMUC, the hybrid parallelization on JUQUEEN, as used for the largest runs, clearly decreases the parallel efficiency. However, a hybrid parallelization is still necessary, as explained above, in order to have enough main memory available.

Table 4 shows the single-node performance, the parallel efficiencies, and the energy consumption relative to JUGENE. The runs were carried out on node allocations providing \(\approx {0.8}\) Pflop/s nominal peak performance. Even though a major design goal of BlueGene/P was a low energy consumption, the next generation improves the energy consumption by a factor of six to seven for our application. SuperMUC turns out not to be as energy efficient as JUQUEEN; however, it does not require such a high degree of parallelism from the application.

Table 4. Single-node efficiency and parallel (scaling) efficiency, as well as the power consumption of the used parts of the clusters while running HHG.

2.3 Single Node Performance Analysis

This section analyzes the single-node performance given in Table 4 for the case of a constant dielectricity.

On JUGENE, one MPI process is assigned to each compute core. Since the processors provide a high memory bandwidth, codes tend to be limited more by instruction throughput than by memory bandwidth. However, the kernel that applies the stencil is affected on JUGENE by a serialization within the PowerPC multiply-add instructions. Additionally, a correct memory alignment for vectorized loads (for the SIMD units) is not assured, due to the varying loop sizes caused by the tetrahedral macro elements. The limited issue width and the in-order architecture of the processor lead to further performance limitations and eventually result in a node performance of only \(6.1\,\%\) of the peak performance for a complete multigrid cycle (see Table 4).

For a comparison, we refer to the results of Datta et al., who present auto-tuning results for an averaging 29-point stencil on this architecture [5, 6]. Their baseline implementation achieves about \(0.035\) GStencil updates per second, which corresponds to \(7.7\,\%\) of the peak performance for a reference (in-cache) implementation. Essentially, two of the eleven optimization techniques (which include, e.g., padding, core blocking, and software prefetching) achieve a significant speedup: common subexpression elimination and register blocking. While the former inherently cannot be applied to our stencil, since we have no redundant calculations, register blocking results in a speedup of roughly two in our case. In principle we could use this code optimization, but it leads to very small sub-blocks that suffer from the non-constant loop sizes. Moreover, the serialization issue remains a bottleneck, which is not the case for the averaging stencil.
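To illustrate the idea of register blocking (this mirrors the optimization discussed by Datta et al.; it is not HHG code, and on tetrahedral macro elements the short, varying row lengths limit how far such unrolling pays off): several output points are computed per iteration so that loaded neighbor values can be reused from registers instead of being reloaded.

```cpp
#include <cstddef>
#include <vector>

// Constant 5-point stencil, unrolled by two in the x direction: u[c+1] loaded
// for the first output point is reused for the second, reducing loads per update.
void apply_blocked(const std::vector<double>& u, std::vector<double>& v,
                   std::size_t nx, std::size_t ny, double k) {
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        std::size_t i = 1;
        for (; i + 2 < nx; i += 2) {                 // unrolled main loop
            const std::size_t c = j * nx + i;
            const double uc  = u[c],      ue  = u[c + 1], uee = u[c + 2];
            const double uw  = u[c - 1];
            const double us0 = u[c - nx], us1 = u[c - nx + 1];
            const double un0 = u[c + nx], un1 = u[c + nx + 1];
            v[c]     = k * (4.0 * uc - uw - ue  - us0 - un0);
            v[c + 1] = k * (4.0 * ue - uc - uee - us1 - un1);   // reuses uc, ue
        }
        for (; i + 1 < nx; ++i) {                    // remainder for odd row lengths
            const std::size_t c = j * nx + i;
            v[c] = k * (4.0 * u[c] - u[c - 1] - u[c + 1] - u[c - nx] - u[c + nx]);
        }
    }
}
```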

On JUQUEEN, we assign one MPI process to each HW thread. The STREAM benchmark shows that it is possible to reach a high fraction of \(\approx {85}\,\%\) of the effective maximum memory bandwidth of \(27.8~\mathrm{{GB/s}}\) with one process per node. Two or four threads per node saturate the effective bandwidth completely. Going from one to two threads per core, HHG gains a factor of two in performance; in these cases the code is still instruction bound, as on BlueGene/P. Going from two to four threads per core, the additional speedup is only a factor of \(1.3\). Overall, a multigrid cycle in this case utilizes on average about \(18.1~\mathrm{{GB/s}}\) of the main memory bandwidth. Only by reducing the main memory footprint, and possibly by improving the core performance itself, do we see a chance for further reductions of the execution time.
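Relating the quoted figures to each other (a back-of-the-envelope estimate, not a measurement reported here):

$$\begin{aligned} 0.85 \cdot 27.8\,\mathrm{GB/s} \approx 23.6\,\mathrm{GB/s}, \qquad \frac{18.1\,\mathrm{GB/s}}{27.8\,\mathrm{GB/s}} \approx 65\,\%, \end{aligned}$$

i.e. a single process already sustains about \(23.6\,\mathrm{GB/s}\) in STREAM, and the multigrid cycle utilizes roughly two thirds of the node's effective memory bandwidth when averaged over a complete cycle.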

Similarly to the situation on JUQUEEN, the code is mainly memory bandwidth limited on the SuperMUC node architecture. However, the nodes can saturate the bandwidth better, and the machine balance suits the characteristics of our code better. Thus we achieve a higher flop/s performance than on JUQUEEN. Hyperthreading, however, improves the performance only insignificantly, by at most a few percent.

3 Conclusion and Future Work

We presented a weak-scaling comparison of HHG on three different petaflop HPC clusters. To reach the full potential of the recent architectures with their growing node-level parallelism, a hybrid parallelization approach turned out to be necessary to compensate for memory limitations and to maintain scalability. Recently, we designed a Stokes solver for Earth mantle convection simulations [1, 4] within HHG that builds on the multigrid performance presented here.