1 Introduction

The multidimensional positive definite advection transport algorithm (MPDATA) [7] is one of the two major parts of the dynamic core of the EULAG geophysical model. EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model for simulating thermo-fluid flows across a wide range of scales and physical scenarios, including numerical weather prediction (NWP).

Recent research on the parallelization of EULAG has been carried out using the IBM Blue Gene/Q and CRAY XE6 systems [4]. A three-dimensional MPI parallelization has been used for running EULAG on these systems with tens of thousands of cores, or even with more than 100K cores. When parallelizing the EULAG computations on supercomputers and CPU clusters, the efficiency declines below 10 %. In this study, we propose to rewrite the EULAG dynamical core and replace standard HPC systems by smaller heterogeneous clusters with accelerators such as GPUs [5] and Intel Many Integrated Cores (MIC) [3].

Preliminary studies of porting anelastic numerical models to modern architectures, including hybrid CPU-GPU architectures, were carried out in [5, 10, 11]. The results achieved for porting selected parts of EULAG to CPU-GPU architectures revealed the potential of running scientific applications on novel hardware architectures.

In this work, we outline an approach to adapting the 3D MPDATA algorithm to the Intel MIC architecture. This approach is based on a combination of temporal and space blocking techniques, and allows us to ease the memory and communication bounds, and to better exploit the theoretical floating point efficiency of target computing platforms. We show some of the optimization methods that we found effective, and demonstrate their impact on the performance of both the Intel CPU and MIC architectures. The main focus is on using MPDATA for modelling geophysical flows in NWP. The size of the computational grid in such problems typically does not exceed 270 million grid points (\(2048 \times 1024 \times 128\)). Here, the starting point is an unoptimized parallel implementation of the MPDATA algorithm. In our work, we use the OpenMP standard for multi- and many-core programming.

The content of the paper is organized as follows. In Sect. 2, an overview of the architecture is given. An introduction to the 3D MPDATA algorithm, including a characterization of its computations and communications, is presented in Sect. 3. Section 4 introduces the proposed approach to adapting MPDATA to the Intel MIC architecture, including the block decomposition of the 3D MPDATA algorithm, improving the efficiency of this decomposition, and parallelization. Preliminary performance results are presented in Sect. 5, while Sect. 6 gives conclusions and future work.

2 Architecture Overview

2.1 Architecture of Intel Many Integrated Cores

The Intel MIC architecture combines many Intel CPU cores onto a single chip [2, 3]. The Intel Xeon Phi coprocessor is the first product based on this architecture. The main advantage of these accelerators is that they are built to provide a general-purpose programming environment similar to that provided for Intel CPUs. The coprocessor is capable of running applications written in industry-standard programming languages such as Fortran, C, and C++.

The Intel Xeon Phi coprocessor includes processing cores, caches, memory controllers, PCIe client logic, and a very high bandwidth, bidirectional ring interconnect [3]. Each coprocessor contains more than 50 cores clocked at 1 GHz or more. These cores support four-way hyper-threading, which gives more than 200 logical cores. The exact number of cores depends on the generation and model of a specific coprocessor. Each core features an in-order, dual-issue x86 pipeline, 32 KB of L1 data cache, and 512 KB of L2 cache that is kept fully coherent by a globally distributed tag directory. As a result, the aggregate size of the L2 caches can exceed 25 MB. The memory controllers and the PCIe client logic provide a direct interface to the GDDR5 memory on the coprocessor and the PCIe bus, respectively. The coprocessor has over 6 GB of on-board memory (up to 16 GB). The high-speed bidirectional ring connects together all the cores, caches, memory controllers, and PCIe client logic of the coprocessor.

An important component of each Intel Xeon Phi processing core is its vector processing unit (VPU) [2], which significantly increases the computing power. Each VPU supports a new 512-bit SIMD instruction set called Intel Initial Many-Core Instructions. The ability to work with 512-bit vectors enables operating on 16 float or 8 double elements per instruction, instead of a single element.
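To give a rough idea of the resulting computing power (a back-of-the-envelope estimate based on the publicly listed parameters of the Intel Xeon Phi 7120P): with 61 cores clocked at 1.238 GHz, 8 double precision lanes per VPU, and one fused multiply-add (counted as two floating point operations) per lane per cycle, the peak performance is \(61 \times 1.238 \times 8 \times 2 \approx 1208\) GFlop/s in double precision.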

The Intel Xeon Phi coprocessor is delivered in the form factor of a PCI Express device, and cannot be used as a stand-alone processor. Since the Intel Xeon Phi coprocessor runs a Linux operating system, any user can access the coprocessor as a network node, and directly run individual applications in the native mode. These coprocessors also support heterogeneous applications wherein a part of the application is executed on the host (CPU), while another part is executed on the coprocessor (offload mode).

2.2 Target Platforms

A summary of the key features of the tested platforms is shown in Table 1. In this study, we use two platforms, each containing a single Intel Xeon Phi coprocessor. The first platform is equipped with two of the newest CPUs, based on the Ivy Bridge architecture, and the Intel Xeon Phi 3120A card. The second one includes two Sandy Bridge-EP CPUs, and the top-of-the-line Intel Xeon Phi 7120P coprocessor.

It should be noted that the values of peak performance shown in Table 1 are given for double precision arithmetic, taking into account the usage of SIMD vectorization.

Table 1. Specification of tested platforms [1]

3 Introduction to MPDATA Algorithm

The multidimensional positive definite advection transport algorithm (MPDATA) belongs to the group of nonoscillatory forward-in-time algorithms, and performs a sequence of stencil computations. A full description of the MPDATA algorithm can be found in [6, 7].

The whole MPDATA computation in each time step is decomposed into a set of 17 stencil sweeps, hereafter called stages. Each stage is responsible for calculating the elements of a certain matrix, based on the corresponding stencil. The stages depend on each other: outcomes of prior stages are usually input data for the subsequent computations. A part of the MPDATA implementation is shown in Fig. 1. It corresponds to the linear version of MPDATA [7].

A single MPDATA time step requires 5 input matrices and 1 output matrix. The other 16 matrices are temporary, and do not play a role in the further computational steps. In the basic, unoptimized implementation of the MPDATA algorithm, every stage reads a required set of matrices from the main memory, and writes its results back to the main memory after computation. This scheme is repeated for all the stages. In consequence, heavy traffic to the main memory is generated. Moreover, the compute units (cores/threads, and VPUs) have to wait for data transfers from the main memory to the cache hierarchy. In order to take full advantage of the novel architecture, the adaptation of MPDATA to the Intel MIC architecture is considered. The new implementation takes into account the memory-bound character of the algorithm.
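This scheme can be sketched as follows, where each stage is a separate sweep over the whole grid; the stencil body below is an illustrative donor-cell flux, not one of the actual MPDATA formulas:

```cpp
#include <cmath>

// A minimal sketch of one stage in the unoptimized scheme: the stage streams
// its inputs from the main memory and writes the result back, so every
// intermediate matrix travels through DRAM between stages.
void stage_sketch(const double *x, const double *u, double *f,
                  int n, int m, int l) {
    for (int i = 1; i < n; ++i)                     // stencil reaches (i-1)
        for (int j = 0; j < m; ++j)
            for (int k = 0; k < l; ++k) {
                long c = ((long)i * m + j) * l + k; // row-major index
                long w = c - (long)m * l;           // (i-1, j, k) neighbor
                f[c] = 0.5 * (u[c] * (x[c] + x[w])
                     - std::fabs(u[c]) * (x[c] - x[w]));
            }
}
```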

4 Adaptation of MPDATA to Intel MIC Architecture

4.1 Block Decomposition of 3D MPDATA Algorithm

Since the MPDATA algorithm includes so many intermediate computations, one of the primary methods for reducing the intensity of the memory traffic is to avoid the data transfers associated with these computations. For this aim, all the intermediate results must be kept in the cache memory. Such treatment increases cache reuse, and the memory traffic is generated only to transfer the required input and output data. Such an approach is commonly called temporal blocking [8, 9].

In order to implement this approach efficiently, the loop tiling technique is applied. The grid is partitioned into blocks. Every block provides computations for all the 17 stages on the assigned part of the grid. Within a single block, each stage computes an adequate chunk of the corresponding matrix. Executing a sequence of blocks determines the final outcomes for a single MPDATA time step.
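A skeleton of this scheme is sketched below, assuming for simplicity that the grid sizes are multiples of the block sizes; stage_kernel() is a hypothetical helper that computes the chunk of stage s for one block:

```cpp
// Hypothetical per-stage kernel: computes the chunk (plus halo) of stage s.
void stage_kernel(int s, int ib, int jb, int kb, int nB, int mB, int lB);

// Skeleton of combined loop tiling and temporal blocking: all 17 stages are
// executed on one block before the next block is processed, so all the
// intermediate chunks stay in the cache hierarchy.
void mpdata_time_step(int n, int m, int l, int nB, int mB, int lB) {
    for (int ib = 0; ib < n; ib += nB)           // blocks along i
        for (int jb = 0; jb < m; jb += mB)       // blocks along j
            for (int kb = 0; kb < l; kb += lB)   // blocks along k
                for (int s = 0; s < 17; ++s)     // s-dimension: stage order
                    stage_kernel(s, ib, jb, kb, nB, mB, lB);
}
```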

Fig. 1. Part of MPDATA implementation

The main requirement for this approach is to keep in the cache hierarchy all the data required for the MPDATA computations within each block. Therefore, the size \(nB \times mB \times lB\) of each block has to be selected in an appropriate way. The idea of the block decomposition of the MPDATA algorithm is shown in Fig. 2. This decomposition determines four dimensions of distributing the MPDATA calculations across computing resources: the i-, j-, and k-dimensions relate to the grid partitioning, while the s-dimension is associated with the order of executing the MPDATA stages.

Fig. 2. Idea of block decomposition of MPDATA computation

Due to data dependencies between subsequent stages, additional computations and communications within each data block are required. These overheads arise on the borders between adjacent blocks. In the proposed method, the computations associated with all the 17 stages, and executed within each block, are extended by adequate halo areas. Adding halo areas allows us to avoid data dependencies between blocks within a single MPDATA time step.

The sizes of the halo areas are determined in all the four dimensions (\(i\), \(j\), \(k\), and \(s\)), according to the data dependencies between the MPDATA stages. Thus, each of the 5 input, 1 output, and 16 temporary matrices is partitioned into chunks of size \(nB \times mB \times lB\), which are further expanded by adequate halo areas of sizes \(iL\), \(iR\), and \(jL\), \(jR\), as well as \(kL\), \(kR\). Table 2 presents the sizes of the halo areas in the i-, j-, and k-dimensions for the chunks of all the matrices.

Table 2. Sizes of halo areas for MPDATA algorithm

This approach allows us to avoid data transfers for the intermediate computations at the cost of extra computations associated with the halo areas in the chunks of temporary matrices, as well as extra communication between the main and cache memories, corresponding to the halo areas in the chunks of the input matrices. Another advantage of this approach is the reduced main memory consumption, because all the intermediate results are stored in the cache memory only. In the case of coprocessors, this plays an important role because the size of the main memory is fixed, and significantly smaller than in traditional CPU solutions.

The requirement of expanding the halo areas is one of the major difficulties in applying the proposed approach, taking into account the data dependencies between the MPDATA stages. It requires developing a dedicated task scheduling for the MPDATA block decomposition. The sketch below illustrates how the loop bounds of a stage are widened by its halo.
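This is a minimal sketch with assumed names; the per-chunk halo sizes correspond to Table 2, and clamping at the physical grid boundaries is omitted for brevity:

```cpp
struct Halo { int iL, iR, jL, jR, kL, kR; };  // per-chunk halo sizes (Table 2)

// Hypothetical stencil update of a single grid point of stage s.
void compute_point(int s, int i, int j, int k);

// Compute one stage on a block whose chunk is expanded by its halo, so that
// the block can be processed independently of its neighbors within a single
// time step.
void stage_on_block(int s, int ib, int jb, int kb,
                    int nB, int mB, int lB, const Halo &h) {
    for (int i = ib - h.iL; i < ib + nB + h.iR; ++i)
        for (int j = jb - h.jL; j < jb + mB + h.jR; ++j)
            for (int k = kb - h.kL; k < kb + lB + h.kR; ++k)
                compute_point(s, i, j, k);
}
```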

4.2 Improving Efficiency of Block Decomposition

Although the block decomposition of MPDATA allows for reducing the memory traffic, it still does not guarantee a satisfactory utilization of the target platforms. The main difficulty here results from the extra computations and communications, which degrade the performance. In particular, there are three groups of extra computations and communications, corresponding to the i-, j-, and k-dimensions. Some of them can be reduced or even avoided by applying the following rules:

  1. The additional computations and communications in the k-dimension can be avoided if \(lB=l\), and the size \(nB \times mB \times lB\) of a block is small enough to keep all the required data in cache. This rule is especially useful when the value of \(l\) is relatively small, as in the case of NWP, where \(l\) is in the range [64, 128].

  2. The overheads associated with the j-dimension are avoided by leaving partial results in the cache memory. This becomes possible when extra computations would otherwise be repeated by adjacent blocks. In this case, some results of intermediate computations have to reside in cache for executing the next block. This rule requires developing a flexible management of computations for all the stages, as well as an adequate mapping of partial results onto the cache space. In consequence, all the chunks are still expanded by their halo areas (Table 2), but only some portions of these chunks are computed within the current block. It means that this approach does not increase the cache consumption. The idea of improving the efficiency of the block decomposition is shown in Fig. 3.

  3. In order to reduce the additional calculations in the i-dimension, the size \(nB\) should be as large as possible, while still allowing all the data required to compute a single block to fit in the cache hierarchy. A sketch of selecting the block size according to rules 1 and 3 is shown after this list.
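The following sketch assumes that all 22 chunks (5 input, 1 output, and 16 temporary matrices) must fit in a cache of cache_bytes; the halo overheads are ignored for simplicity, and the j-extent is fixed to an assumed small value:

```cpp
#include <cstddef>

// Choose block sizes according to rules 1 and 3: no blocking along k, and
// nB maximized under the (simplified) cache-capacity constraint.
void choose_block_size(int n, int m, int l, std::size_t cache_bytes,
                       int &nB, int &mB, int &lB) {
    const int chunks = 22;                  // 5 in + 1 out + 16 temporary
    lB = l;                                 // rule 1: lB = l
    mB = (m < 4) ? m : 4;                   // assumed small extent along j
    std::size_t per_i = (std::size_t)chunks * mB * lB * sizeof(double);
    nB = (int)(cache_bytes / per_i);        // rule 3: maximize nB
    if (nB > n) nB = n;
    if (nB < 1) nB = 1;
}
```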

Fig. 3. Idea of leaving partial results in cache memory

4.3 Parallelization

In order to utilize computing resources available in the Intel Xeon Phi coprocessor, the proposed approach employs two main levels of parallelism:

  • task parallelism, which allows for the utilization of more than 200 logical cores;

  • data parallelism, which makes it possible to efficiently use the 512-bit vector processing units.

Different MPDATA blocks are processed sequentially, following the order proposed for the improved block decomposition in the previous subsection (Fig. 3). For a fixed MPDATA block, a sequence of stages is executed, taking into account the adequate sizes of the halo areas. All the computations executed within every stage are distributed across the available threads. The chunk assigned to each stage is partitioned into sub-chunks of size \(nB^{*} \times mB^{*} \times lB\), where the partitioning takes place along the \(i\)- and \(j\)-dimensions. As a result, each thread is assigned a task that corresponds to a part of the MPDATA block. Due to the data dependencies of MPDATA, appropriate synchronizations between the MPDATA stages are necessary.
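A minimal OpenMP sketch of this decomposition is given below, assuming a \(tI \times tJ\) thread grid, block sizes divisible by tI and tJ, and a hypothetical compute_sub_chunk() helper:

```cpp
#include <omp.h>

// Hypothetical helper: computes one thread's sub-chunk of stage s.
void compute_sub_chunk(int s, int i0, int j0, int nBs, int mBs, int lB);

// Task parallelism within one MPDATA block: each thread computes a
// sub-chunk of size nB* x mB* x lB of every stage; barriers enforce the
// data dependencies between consecutive stages.
void run_block(int nB, int mB, int lB, int tI, int tJ) {
    #pragma omp parallel num_threads(tI * tJ)
    {
        int t   = omp_get_thread_num();
        int ti  = t / tJ, tj = t % tJ;     // position in the thread grid
        int nBs = nB / tI, mBs = mB / tJ;  // sub-chunk sizes nB*, mB*
        for (int s = 0; s < 17; ++s) {     // all stages of the block
            compute_sub_chunk(s, ti * nBs, tj * mBs, nBs, mBs, lB);
            #pragma omp barrier            // synchronize dependent stages
        }
    }
}
```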

Another level of parallelization is vectorization, applied within each thread, where the resulting SIMDification is performed along the k-dimension. In consequence, the value of the size \(lB\) has to be adjusted to the vector size.
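For example (an illustrative loop body, not an actual MPDATA stage), the innermost loop runs over k and is mapped onto the 512-bit vector unit with #pragma omp simd (or the Intel-specific #pragma simd); lB is assumed to be a multiple of the 8-element double precision vector width:

```cpp
// SIMDification along the k-dimension: the innermost k-loop is vectorized.
void update_line(double *out, const double *a, const double *b, int lB) {
    #pragma omp simd
    for (int k = 0; k < lB; ++k)
        out[k] = a[k] + b[k];
}
```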

Because of the intra-cache communication between tasks, the overall system performance depends strongly on the chosen placement of tasks onto the available threads. Therefore, the physical core affinity plays a significant role in optimizing the system performance. In this work, the affinity is adjusted manually, to force communication between tasks placed onto the closest adjacent cores. This increases the sustained intra-cache bandwidth, reduces cache misses, and lowers the latency of access to the cache memory.
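With the Intel OpenMP runtime, such a placement can also be requested through the KMP_AFFINITY environment variable; the sketch below shows an explicit alternative for Linux, assuming a simple linear thread-to-CPU mapping (on the Xeon Phi, logical CPU 0 is conventionally left to the operating system):

```cpp
#include <omp.h>
#include <sched.h>   // Linux CPU affinity API (sched_setaffinity, CPU_SET)

// Pin each OpenMP thread to one logical CPU so that threads sharing halo
// data sit on adjacent hardware threads/cores; the mapping is illustrative.
void pin_threads_compact() {
    #pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(omp_get_thread_num() + 1, &mask);   // skip logical CPU 0
        sched_setaffinity(0, sizeof(mask), &mask);  // 0 = calling thread
    }
}
```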

5 Preliminary Performance Results

In this section, we present preliminary performance results obtained for the double precision 3D MPDATA algorithm on the platforms introduced in Sect. 2. In all the tests, we use the ICC compiler as a part of Intel Parallel Studio 2013, with the same optimization flags. The best configuration for our approach is chosen in an empirical way, individually for each platform. Moreover, we use the Intel Xeon Phi in the native mode.

Currently, only the first four stages are implemented and tested. These four stages correspond to the linear version of the MPDATA algorithm. Since all the input matrices are required to ensure the correctness of the calculations, the overall performance for this part of MPDATA is strongly limited by the memory traffic between the main memory and the cache memory.

Figure 4 presents the normalized execution time of the 3D MPDATA algorithm, for 500 time steps and a grid of size \(1022 \times 512 \times 63\). The achieved performance results correspond to the following setups: (a) comparison of the block and improved block versions; (b) advantages of using vectorization; (c) performance for different numbers of threads per core; (d) comparison of the Intel Xeon CPUs and the Intel Xeon Phi coprocessors (best configurations with SIMD).

Fig. 4. Preliminary performance results: (a) comparison of block and improved block versions; (b) advantages of using vectorization; (c) performance for different numbers of threads per core; (d) comparison of Intel Xeon CPU and Intel Xeon Phi (best configurations with SIMD)

Figure 4a presents the performance gain of the improved version of the block decomposition. The proposed method of reducing the extra computations allows us to speed up the MPDATA block version from 2 to 4 times, depending on the platform used and the size of the grid.

The advantages of using vectorization are observed for all the platforms. In particular, for the Intel Xeon Phi 7120P, it allows us to accelerate the computations more than 3 times, using all the available threads/cores (Fig. 4b).

Figure 4c shows the performance obtained for different numbers of threads per core, using the Intel Xeon Phi 7120P. The best efficiency of computations is achieved when running 4 threads per core.

The performance comparison of all the platforms is shown in Fig. 4d. For each platform, we use all the available cores with vectorization enabled. As expected, the best performance result is obtained using the Intel Xeon Phi 7120P. This coprocessor executes the MPDATA algorithm 2 times faster than two Intel Xeon E5-2697 v2 CPUs, containing 24 cores in total. Both models of the Intel Xeon Phi coprocessor give similar performance results.

6 Conclusions and Future Work

Using the Intel Xeon Phi coprocessor to accelerate computations in the 3D MPDATA algorithm is a promising direction for developing the parallel implementation of this algorithm. Rewriting the EULAG code, and replacing conventional HPC systems with heterogeneous clusters using accelerators such as Intel MIC, is a viable way to improve the efficiency of using this model in practical simulations.

The main challenge of the proposed parallelization is to take advantage of multi- and many-core processing, vectorization, and cache reuse. For this aim, we propose the block version of the 3D MPDATA algorithm, based on a combination of temporal and space blocking techniques. Such an approach gives us the possibility to ease the memory bounds by increasing cache reuse, and reducing the memory traffic associated with intermediate computations. Furthermore, the proposed method of reducing the extra computations allows us to accelerate the MPDATA block version up to 4 times, depending on the platform used and the size of the grid.

In all the performed tests, the Intel Xeon Phi 7120P coprocessor gives the best performance results. This coprocessor executes the MPDATA algorithm 2 times faster than two Intel Xeon E5-2697 v2 CPUs, containing 24 cores in total, and 2.86 times faster than two Intel Xeon E5-2643 CPUs. Both the many-core and vectorization features of the Intel MIC architecture play the leading role in exploiting its performance. The other important factors are the number of threads per core, as well as an adequate placement of threads onto physical cores. All these features have a significant impact on the sustained performance.

At this point of our research, only the first four stages of the MPDATA algorithm are implemented and tested. They correspond to the linear part of MPDATA. The performance achieved for this part of MPDATA is still limited by the memory traffic, mostly because all the input data of the whole MPDATA algorithm are required to ensure the correctness of the computations for the linear part. As a result, the tested part of MPDATA does not extract the full potential of applying this coprocessor to the MPDATA computations. Moreover, since the remaining part is free from this memory-cache communication, it gives an opportunity for increasing the efficiency of the computations. Implementing and optimizing this part of MPDATA will be studied in future works.

The achieved performance results provide the basis for further research on optimizing the distribution of the MPDATA computations across all the computing resources of the Intel MIC architecture, taking into consideration the features of its on-board memory, cache hierarchy, computing cores, and vector units. Additionally, the proposed approach requires developing a flexible data and task scheduler, supported by adequate performance models. Another direction of future work is the adaptation to heterogeneous clusters with Intel MICs, with a further development and optimization of the code.