1 Introduction

Simulation is increasingly identified as the third pillar of science, together with theory and experimentation. Parallel computers provide the computing power required by the most demanding of these simulations. The complexity and heterogeneity of these architectures do, however, force scientists to write complex code (using vectorization, parallelization, accelerator-specific languages, etc.). These optimizations heavily depend on the target machine, and the code has to be adapted whenever it is ported to a new architecture.

As a result, scientists have to become experts in the art of computer optimization in addition to their own domain of expertise. It is very difficult in practice to maintain a code targeting multiple distinct architectures. One fundamental cause of this situation is the tight interleaving of two distinct concerns imposed by most programming models. On the one hand, the algorithm comes from the expertise of the domain scientists and does not depend on the target architecture. On the other hand, optimizations form another domain of expertise and have to be adapted to each architecture. Both algorithm and optimization concerns are therefore expressed within a single code. This mix impedes the maintainability and readability of simulation codes while hindering developers' productivity.

Many approaches have been proposed to improve this situation in the form of libraries or languages [4, 17, 19, 20]. Approaches based on automated optimization processes typically isolate the algorithmic aspects well, but restrict their domain of applicability and/or the range of supported optimizations. Approaches based on optimization tools and libraries enable optimization specialists to express common optimizations efficiently but leave others mixed with the algorithm.

In this paper, we propose the independent kernel scheduling (\({\textsc {InKS}}\)) programming model to separate the algorithm from optimization choices in HPC simulation codes. We define the \({\textsc {InKS}}_{\textsf {pia} }\) language to express the algorithm of an application independently of its optimization. This separation aims to improve the readability and maintainability of codes while easing portability and the expression of new optimizations. Such a program can then be optimized using \({\textsc {InKS}}_{\textsf {o/XMP} }\) and \({\textsc {InKS}}_{\textsf {o/loop} }\), two domain-specific languages (DSLs) which ask for optimization information only. While these DSLs target some common optimizations, \({\textsc {InKS}}_{\textsf {o/C++} }\) can be used for less common ones.

This paper makes the following contributions: (1) it defines the \({\textsc {InKS}}\) programming model and its platform-independent algorithmic language \({\textsc {InKS}}_{\textsf {pia} }\); (2) it proposes an implementation of \({\textsc {InKS}}\) and tests the \({\textsc {InKS}}_{\textsf {pso} }\) approach with three optimization DSLs, \({\textsc {InKS}}_{\textsf {o/C++} }\), \({\textsc {InKS}}_{\textsf {o/loop} }\) and \({\textsc {InKS}}_{\textsf {o/XMP} }\); and (3) it evaluates the approach on the synthetic NAS parallel benchmarks [3] and on the 6D Vlasov–Poisson equation solved with a semi-Lagrangian method.

The remainder of the paper is organized as follows. Section 2 analyzes related work. Section 3 describes \({\textsc {InKS}}\) and its implementation. Section 4 shows the use of \({\textsc {InKS}}\) on a 6D Vlasov–Poisson solver. Section 5 evaluates our approach. Section 6 concludes the paper.

2 Related work

We now present the approaches currently available to scientific programmers to help them implement simulation applications. A first, widely used approach is based on imperative languages such as Fortran or C. Libraries like MPI extend these languages to distributed memory with message passing. Abstractions very close to the execution machine make fine-tuning possible to achieve good performance on any specific architecture. This does, however, require encoding complex optimizations directly in the code. As there is no language support to separate the algorithm from architecture-specific optimizations, tedious efforts are required [13] to achieve performance portability; algorithm and optimizations are instead often tightly bound together in codes.

A second approach is offered by tools (libraries, frameworks or language extensions) that encode classical optimizations. OpenMP [5], REPARA [8] and Kokkos [4] support common shared-memory parallelization patterns. For example, Kokkos offers multidimensional arrays and iterators for which efficient memory mappings and iteration orders are selected independently. UPC [9] and XMP [17] support the partitioned global address space paradigm. For example, in XMP, directives describe array distribution and communications between nodes. These tools offer productivity gains when the optimization patterns they provide fit the requirements. The separation of optimizations from the main code base also eases porting between architectures. Even if expressed more compactly, optimizations do, however, remain mixed with the algorithm. For instance, in OpenMP or REPARA, parallel concerns are specified on top of an existing code which already carries optimization choices, such as the loop order.

A third approach pushes this further with tools that automate the optimization process. For example, PaRSEC [12] and StarPU [1] support the multitask paradigm. In StarPU, users express their code as a directed acyclic graph (DAG) of tasks with data dependencies, which is automatically scheduled at runtime depending on the available resources. Other examples are SkeTo [21] and Lift [19], which offer algorithmic skeletons. Lift offers a limited set of parallel patterns whose combinations are automatically transformed by an optimizing compiler. Automating optimization improves productivity and cleanly separates the optimizations, which improves portability. These tools do not, however, cover the whole range of potential optimizations, such as the choice of work granularity inside StarPU tasks. Even with this approach, the algorithm remains largely interleaved with optimization choices.

A last approach is based on DSLs that restrict the developer to the expression of the algorithm only, while optimizations are handled independently; examples include Pochoir [20] and PATUS [6], two DSLs for stencil problems. In Pochoir, the user specifies a stencil (computation kernel and access pattern), boundary conditions and a space–time domain, while all optimizations are handled by a compiler. These approaches ensure a very good separation of concerns. The narrower the target domain, the more efficient the domain- and architecture-specific optimizations that become possible; however, a narrow domain makes it less likely for the tool to cover the needs of a whole application. Conversely, the wider the target domain, the less efficient the possible optimizations.

To summarize, one can consider a continuum of approaches, from very general ones where the optimization process is manual to more and more domain-specific ones where the optimization process can be automated. The more general approaches support a large range of optimizations and application domains but yield high implementation costs and low separation of concerns and portability. The more automated approaches reduce implementation costs and offer good separation of concerns and portability but restrict the range of supported domains and optimizations. Ideally, one would like to combine all these advantages: (1) the domain generality of imperative languages, (2) the ease of optimization offered by dedicated tools and (3) the separation of concerns and performance portability offered by DSLs, together with (4) the possibility of fine, manual optimizations offered by both imperative languages and dedicated tools. The following section describes the \({\textsc {InKS}}\) programming model, which aims to combine these approaches into such a solution.

3 The \({\textsc {InKS}}\) programming model

This section first introduces the design of our \({\textsc {InKS}}\) programming model, based on the use of distinct languages to express the algorithm and the optimization choices separately. It then presents a prototype implementation of the model composed of \({\textsc {InKS}}_{\textsf {pia} }\), the algorithm language, and two \({\textsc {InKS}}_{\textsf {pso} }\) optimization languages. The simulation algorithm consists of the set of values computed, the formulae used to produce them, and the simulation inputs and outputs. Optimization choices include everything that is not the algorithm: e.g., the computing unit selected for each computation, the ordering of computations, the memory location of each value, etc. Distinct optimization choices can differ in performance, but the simulation results depend on the algorithm only. The \({\textsc {InKS}}\) approach is summarized in Fig. 1. The \({\textsc {InKS}}_{\textsf {pia} }\) language is used to express the algorithm with no concern for optimization choices. A compiler can automatically generate non-optimized choices from an \({\textsc {InKS}}_{\textsf {pia} }\) specification, mostly for testing purposes. The \({\textsc {InKS}}_{\textsf {pso} }\) language is used to define optimizations only, while all other information is gathered from the \({\textsc {InKS}}_{\textsf {pia} }\) code. Multiple versions of the optimization choices can be devised for a single algorithm, for example to optimize for multiple targets.

Fig. 1: \({\textsc {InKS}}\) model

We now describe the \({\textsc {InKS}}_{\textsf {pia} }\) language and propose three preliminary DSLs to validate strategies for the design of the \({\textsc {InKS}}_{\textsf {pso} }\) language: \({\textsc {InKS}}_{\textsf {o/XMP} }\) handles domain decomposition, \({\textsc {InKS}}_{\textsf {o/loop} }\) focuses on efficient loop nests, and \({\textsc {InKS}}_{\textsf {o/C++} }\) enables the use of plain C++ to write arbitrarily complex optimizations.

3.1 The \({\textsc {InKS}}_{\textsf {pia} }\) language

In \({\textsc {InKS}}_{\textsf {pia} }\) [2], values are stored in infinite multidimensional arrays based on dynamic single assignment (DSA: each coordinate can only be written once). The memory placement of each coordinate is left unspecified. Computations are specified by kernel procedures that (1) take data arrays and integer coordinates as parameters; (2) specify the coordinates they might read and will write in each array; and (3) define either a C++ or an \({\textsc {InKS}}\) implementation. An \({\textsc {InKS}}\) implementation defines kernel validity domains: the coordinates where C++ kernels can generate values in arrays. The kernel execution order is left unspecified. The simulation entry point is a kernel marked public. Listing 1 presents a simple \({\textsc {InKS}}_{\textsf {pia} }\) code: a 1D 3-point stencil computation. The simulation consists of one 2D logical array, Array (line 12), two kernels, stencil3 (line 1) and boundary (line 6), and is parameterized by two integers, X and T (line 11). Line 12 specifies that the simulation starts with a subset of Array (all values in the space dimension at the first time step) and expects, as output, another subset: all values in the space dimension at the last time step. To do so, it can call the stencil3 and boundary kernels with a specific set of values for their integer parameters x and t (their validity domain) and with Array as their logical array parameter, as expressed on lines 15 and 16.

An \({\textsc {InKS}}_{\textsf {pia} }\) code specifies a parameterized task graph (PTG) [7]. This graph is encoded using the polyhedron model [10], which provides a compact representation of static control programs. This representation covers a large range of problems but imposes a few limitations. Most notably, all problem parameters must be known at launch time; adaptive meshes or adaptive time steps, for example, are not supported. It remains possible to express these concerns outside \({\textsc {InKS}}_{\textsf {pia} }\) and to call the \({\textsc {InKS}}\) implementation multiple times with different parameters, as sketched below.
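For instance, a plain C++ driver can implement an adaptive time-stepping loop around the static \({\textsc {InKS}}\) kernel. The following is a minimal sketch under the assumption that the \({\textsc {InKS}}\) compiler emits a C-linkage entry point; the name inks_simulation and its signature are hypothetical:

#include <cstdio>

// Hypothetical C-linkage entry point emitted by the InKS compiler for the
// public kernel; the actual name and signature depend on the InKS_pia source.
extern "C" void inks_simulation(double* array, int X, int T);

int main() {
    const int X = 1024;
    double field[1024] = {};
    // Adaptive time stepping is not expressible in the polyhedron model,
    // so the static InKS implementation is re-invoked chunk by chunk with
    // parameters chosen between two invocations.
    for (int chunk = 0; chunk < 10; ++chunk) {
        const int T = 100;  // e.g., recomputed from the previous chunk's state
        inks_simulation(field, X, T);
    }
    std::printf("done\n");
    return 0;
}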

Once the algorithm is specified in \({\textsc {InKS}}_{\textsf {pia} }\), the goal is to write the optimization choices, that is, to choose a memory layout and a scheduling of the kernels. Two approaches exist to produce these choices: the automatic compiler or \({\textsc {InKS}}_{\textsf {pso} }\). Both rely on the kernel file, generated by the \({\textsc {InKS}}_{\textsf {pia} }\) compiler, which translates the \({\textsc {InKS}}_{\textsf {pia} }\) kernels into C++ functions. The \({\textsc {InKS}}_{\textsf {pia} }\) compiler can produce generic optimization choices: a valid but non-optimized scheduling of the computations and the memory allocations to execute them. Scheduling and memory layouts are computed using the Integer Set Library [22] and recent work on modular mappings in the polyhedron model [14]. Arbitrarily complex versions of the optimization choices can also be written manually in plain C++. These functions can be called from any existing code whose language supports the C calling convention. However, that approach requires repeating information already present in \({\textsc {InKS}}_{\textsf {pia} }\). The \({\textsc {InKS}}_{\textsf {pso} }\) DSL thus interfaces the optimization process with \({\textsc {InKS}}_{\textsf {pia} }\), enabling optimization specialists to specify optimizations only.

Listing 1: A 1D 3-point stencil computation expressed in \({\textsc {InKS}}_{\textsf {pia} }\)
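As a point of comparison, the same algorithm written as a naive C++ loop nest commits to a memory layout and a schedule that \({\textsc {InKS}}_{\textsf {pia} }\) deliberately leaves open. The sketch below follows the names of the text above (X, T, stencil3, boundary); the stencil coefficients and the boundary rule are illustrative:

#include <vector>

// Naive C++ analogue of Listing 1: a (T x X) array in DSA style
// (every coordinate (t, x) written exactly once).
void stencil_1d(int X, int T) {
    std::vector<double> a(static_cast<std::size_t>(T) * X, 0.0);
    // a[0 * X + x] is assumed to hold the simulation input for all x.
    for (int t = 1; t < T; ++t) {
        a[t * X + 0]     = a[(t - 1) * X + 0];      // boundary kernel
        a[t * X + X - 1] = a[(t - 1) * X + X - 1];  // boundary kernel
        for (int x = 1; x < X - 1; ++x)             // stencil3 validity domain
            a[t * X + x] = (a[(t - 1) * X + x - 1]
                          + a[(t - 1) * X + x]
                          + a[(t - 1) * X + x + 1]) / 3.0;
    }
    // Output: a[(T - 1) * X + x] for all x.
}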

3.2 \({\textsc {InKS}}_{\textsf {pso} }\) DSL implementation

Implementing a complete optimization DSL for the \({\textsc {InKS}}\) model is a long-term objective. In this study, we present two DSLs: \({\textsc {InKS}}_{\textsf {o/XMP} }\) and \({\textsc {InKS}}_{\textsf {o/loop} }\). Both solely describe optimizations and get the missing information from the \({\textsc {InKS}}_{\textsf {pia} }\) code.

Listing 2: 6D domain decomposition expressed in \({\textsc {InKS}}_{\textsf {o/XMP} }\) and the corresponding XMP code

\({\textsc {InKS}}_{\textsf {o/XMP} }\) (illustrated in Listing 2) handles distributed-memory domain decomposition by combining C with directives based on XMP and adapted for \({\textsc {InKS}}\). The compiler replaces these directives with C and XMP code. The inks decompose directive supports static or dynamic allocation of the logical arrays described in the algorithm. The domain size is extracted from the \({\textsc {InKS}}_{\textsf {pia} }\) source, and the user only has to specify its mapping onto memory. As in XMP, \({\textsc {InKS}}_{\textsf {o/XMP} }\) supports domain decomposition mapped onto an XMP topology. In \({\textsc {InKS}}_{\textsf {pia} }\) code, there is no concern for memory optimizations such as dimension ordering or memory reuse. Therefore, \({\textsc {InKS}}_{\textsf {o/XMP} }\) supports dimension reordering and folding, which consists in reusing the same memory addresses for subsequent indices in a given dimension. The exchange directive supports halo exchanges. The user specifies which dimension should be exchanged and which computational kernel will be executed after the exchange. From this information and the \({\textsc {InKS}}_{\textsf {pia} }\) kernel specification, the \({\textsc {InKS}}_{\textsf {o/XMP} }\) compiler then computes the halo size. While XMP requires halo values to be stored contiguously with the domain, \({\textsc {InKS}}_{\textsf {o/XMP} }\) supports a dynamic halo extension where halo values are stored in dedicated, dynamically allocated buffers to reduce the memory footprint.
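To make the generated code concrete, the following is a minimal sketch of the kind of plain C/XMP the \({\textsc {InKS}}_{\textsf {o/XMP} }\) compiler targets: a 1D block decomposition with a halo ("shadow") of width 1 and its exchange before a stencil kernel. The example, array names and sizes are ours; the directives are standard XMP, not \({\textsc {InKS}}_{\textsf {o/XMP} }\) syntax:

#define N 1024
double a[N], b[N];
#pragma xmp nodes p(*)                   /* one XMP node per process      */
#pragma xmp template t(0:N-1)            /* logical index domain          */
#pragma xmp distribute t(block) onto p   /* block domain decomposition    */
#pragma xmp align a[i] with t(i)         /* map both arrays onto it       */
#pragma xmp align b[i] with t(i)
#pragma xmp shadow a[1:1]                /* halo of width 1 on each side  */

void step(void) {
#pragma xmp reflect (a)                  /* halo exchange                 */
#pragma xmp loop on t(i)
    for (int i = 1; i < N - 1; ++i)      /* each node computes its block  */
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
}

An \({\textsc {InKS}}_{\textsf {o/XMP} }\) version keeps only the decomposition parameters, the domain size and halo width being recovered from the \({\textsc {InKS}}_{\textsf {pia} }\) code.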

Listing 3: Loop nest optimization expressed in \({\textsc {InKS}}_{\textsf {o/loop} }\)

\({\textsc {InKS}}_{\textsf {o/loop} }\) (illustrated in Listing 3) lets the user manually specify loop nests, for which the compiler generates plain C++ loops. Plain C++ can be used in combination with \({\textsc {InKS}}_{\textsf {o/loop} }\). The loop keyword introduces a nest optimization with a name, the list of parameters from the algorithm on which the loop bounds depend and a reference to the optimized kernel. Loop bounds can be automatically extracted from \({\textsc {InKS}}_{\textsf {pia} }\), but the Set keyword makes it possible to restrict these bounds. The Order keyword specifies the iteration order on the dimensions named according to the \({\textsc {InKS}}_{\textsf {pia} }\) code. The Block keyword enables the user to implement blocking. It takes as parameters the block sizes for the loops, starting from the innermost one; if there are fewer block sizes than loops, the remaining loops are not blocked. The Buffer keyword supports copying data into a local buffer before the computation and back after it, to ensure data contiguity and improve vectorization. The compiler uses the data dependencies from the \({\textsc {InKS}}_{\textsf {pia} }\) code to check the validity of the loop order and generates vectorization directives whenever possible.
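As an illustration, a nest declared with Order(t, x), Block(64) and a Buffer on the innermost dimension could be compiled into C++ along the following lines. This is a hedged sketch of plausible generated code for the stencil of Listing 1, not the compiler's actual output:

#include <algorithm>
#include <vector>

void stencil_blocked(std::vector<double>& a, int X, int T) {
    constexpr int B = 64;                        // Block(64)
    std::vector<double> buf(B + 2);              // Buffer: block plus halo
    for (int t = 1; t < T; ++t)                  // Order: t outermost
        for (int xb = 1; xb < X - 1; xb += B) {  // blocked x loop
            const int xe = std::min(xb + B, X - 1);
            // Copy in: previous time step with one halo cell on each side.
            std::copy(&a[(t - 1) * X + xb - 1], &a[(t - 1) * X + xe + 1],
                      buf.begin());
            #pragma ivdep                        // emitted when dependencies allow
            for (int x = xb; x < xe; ++x)        // contiguous, vectorizable
                a[t * X + x] = (buf[x - xb] + buf[x - xb + 1]
                              + buf[x - xb + 2]) / 3.0;
        }
}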

4 The 6D Vlasov–Poisson problem

The 6D Vlasov–Poisson equation, presented in (1), describes the motion of particles in a plasma and the resulting electric field. We study its resolution for a single species on a 6D Cartesian mesh with periodic boundary conditions. We solve the Poisson part using a fast Fourier transform (FFT) and rely on a Strang splitting (order 2 in time) for the Vlasov part. This leads to six 1D advections: three in the space dimensions (\(x_1, x_2, x_3\)) and three in the velocity dimensions (\(v_1, v_2, v_3\)). Each 1D advection relies on a Lagrange interpolation of degree 4. In the space dimensions, we use a semi-Lagrangian approach where the stencil is not applied around the destination point but at the foot of the characteristics, only known at runtime, as described in more detail in [18].

$$\begin{aligned} \left\{ \begin{aligned}&\frac{\partial f(t, x, v)}{\partial t}+v\cdot \nabla _xf(t,x,v)-E(t,x)\cdot \nabla _vf(t,x,v)=0 \\&-\varDelta \phi (t, x) = 1 - \rho (t, x) \\&E(t, x) = -\nabla \phi (t, x) \\&\rho (t, x) = \displaystyle {\int } f(t, x, v) \,dv \end{aligned} \right. \end{aligned}$$
(1)
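For concreteness, a 1D advection with displacement \(v\Delta t\) along a dimension of step \(\Delta x\) updates the distribution as (a standard semi-Lagrangian form, stated in our notation)

$$\begin{aligned} f^{n+1}(x_i) = \sum _{k=0}^{4} \ell _k(\alpha )\, f^{n}(x_{j-2+k}), \qquad x_i - v\Delta t = x_j + \alpha \Delta x,\quad 0 \le \alpha < 1, \end{aligned}$$

where the \(\ell _k\) are the degree-4 Lagrange basis polynomials evaluated at \(\alpha \) and \(x_j\) is the grid point immediately to the left of the foot of the characteristic. In the velocity dimensions, the displacement involves E instead of v.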

The main unknown is f (f6D in the code), the distribution function of the particles in 6D phase space. Due to the Strang splitting, a first half time step of advections is required after the initialization of f6D but before the main time loop. These advections need the electric field E as input. E is obtained through the FFT-based Poisson solver, which in turn needs the charge density \(\rho \) as input; \(\rho \) is computed by a reduction of f6D. The main time loop is composed of four steps: advections in the space dimensions, computation of the charge density (reduction) and of the electric field (Poisson solver), and advections in the velocity dimensions. The algorithm of the simulation is presented in Fig. 2.
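Schematically (in our notation, added here to connect the splitting to the loop structure), one time step of this scheme reads

$$\begin{aligned} f^{n+1} \approx \mathcal {A}_{v}^{\Delta t/2} \circ \mathcal {A}_{x}^{\Delta t} \circ \mathcal {A}_{v}^{\Delta t/2}\, f^{n}, \end{aligned}$$

where \(\mathcal {A}_{x}\) and \(\mathcal {A}_{v}\) denote the advections in the space and velocity dimensions and E is recomputed (reduction, then Poisson solver) between the space and velocity steps. Fusing the adjacent velocity half steps of two successive iterations yields the loop of full steps described above, leaving only the initial half step before the loop.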

Fig. 2: 6D Vlasov–Poisson algorithm

The remainder of this section presents two optimizations implemented in the Selalib [16] version of the 6D Vlasov–Poisson problem.

The 6D nature of f6D requires a lot of memory, but the regularity of the problem means it can be distributed in blocks with good load balancing. Halos are required to hold the values of neighbors for the advections. Connected halo zones would increase the number of points in all dimensions and consume too much memory. Since the advections are split, however, halos are required in a single dimension at a time. We therefore use dynamic halos composed of two buffers, one for each boundary of the advected dimension (denoted "right" and "left"). Figure 3 shows this optimization on a 2D domain. Listing 4 shows the \({\textsc {InKS}}_{\textsf {o/XMP} }\) implementation of this strategy on 6D Vlasov–Poisson.
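A minimal sketch of the data structure behind this strategy (types and names are ours, not Selalib's or \({\textsc {InKS}}_{\textsf {o/XMP} }\)'s):

#include <vector>

// Dynamic halo: instead of permanent halo storage in every dimension,
// two buffers are (re)allocated for the single dimension being advected.
struct DynamicHalo {
    std::vector<double> left, right;   // one buffer per boundary

    // plane_size: points in a boundary plane of the local block;
    // width: halo width required by the stencil of the current advection.
    void allocate(std::size_t plane_size, int width) {
        left.assign(plane_size * width, 0.0);
        right.assign(plane_size * width, 0.0);
    }
    void release() {
        left.clear();  left.shrink_to_fit();
        right.clear(); right.shrink_to_fit();
    }
};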

Fig. 3: Dynamic halo exchange representation on a 2D domain

Listing 4: \({\textsc {InKS}}_{\textsf {o/XMP} }\) implementation of the dynamic halo strategy for 6D Vlasov–Poisson

Advections account for the main computational cost of the problem, up to 95% of the sequential execution time. Six loops surround the stencil computation of each advection and, in a naive version, the use of a modulo to handle periodicity and the application of the stencil along non-contiguous dimensions slow down the computation. To enable vectorization and improve cache use, we copy f6D elements into contiguous buffers together with the left and right halos. The advections are applied on these buffers before copying them back into f6D. Blocking further improves performance by copying multiple elements at a time. Listing 5 shows the \({\textsc {InKS}}_{\textsf {o/loop} }\) implementation of these optimizations in a sequential version, together with the generated code. The Poisson solver relies on a remapping scheme, where the domain decomposition is modified between each FFT execution.

Listing 5: \({\textsc {InKS}}_{\textsf {o/loop} }\) implementation of the advection optimizations and the generated code
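In plain C++, the buffer strategy for one line of a 1D advection along a strided (non-contiguous) dimension looks roughly as follows; this is a sketch in our notation, with strides and names illustrative and the degree-4 Lagrange weights assumed precomputed:

#include <cstdlib>
#include <vector>

// Gather one line of f6D into a contiguous buffer (paying the periodic
// modulo once), apply the 5-point Lagrange stencil, scatter back.
void advect_line(double* f, long stride, int n, int shift, const double w[5]) {
    const int halo = 2 + std::abs(shift);   // stencil footprint + displacement
    std::vector<double> buf(n + 2 * halo);
    for (int i = 0; i < n + 2 * halo; ++i)  // periodic gather
        buf[i] = f[static_cast<long>(((i - halo) % n + n) % n) * stride];
    for (int i = 0; i < n; ++i) {           // contiguous, vectorizable loop
        double v = 0.0;
        for (int k = 0; k < 5; ++k)
            v += w[k] * buf[i + halo + shift + k - 2];
        f[static_cast<long>(i) * stride] = v;
    }
}

Blocking then amounts to gathering several such lines at once into a 2D buffer before applying the stencil.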

5 Evaluation

This section evaluates the \({\textsc {InKS}}\) model on the NAS benchmarks, a simple stencil code and the 6D Vlasov–Poisson problem. We have implemented the algorithms of the following programs using \({\textsc {InKS}}_{\textsf {pia} }\):

  • 4 sequential NAS kernels (IS, FT, EP and MG), with the C++ version [11] as reference;

  • a finite difference 3D heat resolution (7-point stencil), with [15] as reference;

  • 6D Vlasov–Poisson, with the Fortran/MPI Selalib [16] as reference.

For the 3D heat equation solver, two strategies were implemented: one uses double buffering (Heat/Buf) and the other implements a cache-oblivious strategy (Heat/Obl). To evaluate the \({\textsc {InKS}}\) programming model on 6D Vlasov–Poisson, we have conducted three experiments. The first one is based on optimization choices written in plain C++ and covers the whole simulation. We then evaluate \({\textsc {InKS}}_{\textsf {o/XMP} }\) and \({\textsc {InKS}}_{\textsf {o/loop} }\) separately, as they target different optimizations and are not currently usable together. The second experiment thus focuses on the sequential aspects with the intra-node optimization of the \(v_1\) advection using either \({\textsc {InKS}}_{\textsf {o/loop} }\) or \({\textsc {InKS}}_{\textsf {o/C++} }\). The third experiment focuses on the parallel aspects with the charge density computation, the Poisson solver and a halo exchange optimized either with C/XMP or with \({\textsc {InKS}}_{\textsf {o/XMP} }\). Table 1 summarizes the optimization choices we have implemented. All codes are compiled with the Intel 18 compiler (-O3 -xHost) and Intel MPI 2018 and executed on the Irene cluster (TGCC, France), whose nodes are equipped with 192 GB of RAM, two Skylake 8168 CPUs and an EDR InfiniBand interconnect.

Table 1 Summary of the \({\textsc {InKS}}_{\textsf {o/C++} }\), \({\textsc {InKS}}_{\textsf {o/loop} }\) and \({\textsc {InKS}}_{\textsf {o/XMP} }\) optimization choices implemented based on six \({\textsc {InKS}}_{\textsf {pia} }\) algorithms: the 3D heat solver, 4 NAS kernels (EP, FT, MG, IS) and the 6D Vlasov–Poisson solver
Table 2 Execution times of the \({\textsc {InKS}}_{\textsf {o/C++} }\), \({\textsc {InKS}}_{\textsf {o/loop} }\) and \({\textsc {InKS}}_{\textsf {o/XMP} }\) implementations of the sequential NAS benchmarks (class B); time per iteration of the 3D heat equation solver (7-point stencil, size \(1024^3\)); and time per iteration of the \({\textsc {InKS}}_{\textsf {o/C++} }\) and Fortran implementations of the sequential 6D Vlasov–Poisson solver (size \(32^6\))

The NAS CG kernel relies on indirections not expressible in the polyhedron model of \({\textsc {InKS}}_{\textsf {pia} }\). Its implementation would thus have to rely on a large C++ kernel whose optimization would be mixed with the algorithm. For the same reason, the NAS FT kernel was only partially implemented. \({\textsc {InKS}}_{\textsf {pia} }\) can, however, be used to express all the other NAS kernels, the 3D heat equation solver as well as the 6D Vlasov–Poisson algorithm. Even if not as expressive as C or Fortran, \({\textsc {InKS}}_{\textsf {pia} }\), through the polyhedron model, handles static control programs. This covers the needs of a wide range of simulation domains and offers abstractions close to the execution machine rather than tied to a specific simulation domain. More specifically, it supports programs expressible as parameterized task graphs [7]: a directed acyclic graph of tasks (here, kernels) in a compact representation whose dependencies and scheduling are parameterized by integers fixed when the \({\textsc {InKS}}\) entry point is executed. Among others, it can express computations such as FFTs or stencils with input coordinates unknown at compile time, as in 6D Vlasov–Poisson.

\({\textsc {InKS}}\) separates the specification of the algorithm and of the optimizations in distinct files. Multiple optimization strategies can be implemented for a single algorithm, as shown for the 3D heat equation, where each strategy relies on a specific memory layout and scheduling. Similarly, based on the \({\textsc {InKS}}_{\textsf {pia} }\) algorithm implemented for the IS, EP and MG NAS kernels, we have developed multiple optimization choices based on \({\textsc {InKS}}_{\textsf {o/C++} }\) and either \({\textsc {InKS}}_{\textsf {o/loop} }\) or \({\textsc {InKS}}_{\textsf {o/XMP} }\). For more complex cases such as the 6D Vlasov–Poisson problem, we were able to develop four optimization choices based on a single \({\textsc {InKS}}_{\textsf {pia} }\) algorithm: (1) \({\textsc {InKS}}_{\textsf {o/XMP} }\), (2) C with XMP, (3) \({\textsc {InKS}}_{\textsf {o/loop} }\) and (4) \({\textsc {InKS}}_{\textsf {o/C++} }\). This demonstrates, to some extent, that the separation of concerns is respected.

Finding the right metric to evaluate the ease of writing a code is a difficult question. As illustrated in Listings 1 and 5, however, the expression of an algorithm in \({\textsc {InKS}}_{\textsf {pia} }\) is close to the most naive C implementation, where loops are replaced by \({\textsc {InKS}}\) validity domains with no concern for optimization. The polyhedron model used to represent \({\textsc {InKS}}_{\textsf {pia} }\) code is compatible with a subset of the C language; we could therefore have chosen C as the algorithm language. The optimization process would, however, have suffered from such a choice. While \({\textsc {InKS}}_{\textsf {pia} }\) is designed to exclude optimizations through the use of DSA and fine-grained kernels, C lets its users reuse memory and specify a total scheduling. Although it is possible to analyze a C code and retrieve its DSA form and partial order using the polyhedron model, it would be complicated for optimization specialists to find which loops can be broken or which arrays can be expanded from a C code.
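The contrast can be made concrete on the stencil example (an illustration in our notation):

#include <utility>

constexpr int X = 1024, T = 100;

// An optimized C/C++ code commits to memory reuse and a total order:
void heat_reuse() {
    static double bufs[2][X] = {};
    double* cur = bufs[0];
    double* nxt = bufs[1];
    for (int t = 1; t < T; ++t) {          // total schedule fixed: t, then x
        for (int x = 1; x < X - 1; ++x)
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0;
        std::swap(cur, nxt);               // memory reuse: step t-1 is overwritten
    }
}
// In the DSA view of InKS_pia, Array(x, t) exists for every (x, t) and is
// written once; mapping coordinate (x, t) onto buffer t % 2 is exactly the
// optimization choice that this code has already hard-wired.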

The specification of optimization choices, for its part, is close to their expression in C++. Table 2 compares the GNU complexity score of the \({\textsc {InKS}}\) optimizations to the reference codes. The \({\textsc {InKS}}\) scores are slightly better, which indicates that our language is not more complex than C++. The difference comes from the extraction of the computational kernels, placed in the algorithm, which hides part of the complexity. In addition, the use of \({\textsc {InKS}}_{\textsf {o/C++} }\) to write optimizations lets optimization specialists reuse their preexisting knowledge of this language. Similarly, writing the \({\textsc {InKS}}_{\textsf {o/C++} }\) version of the optimization choices for the \({\textsc {InKS}}_{\textsf {pia} }\) version of 6D Vlasov–Poisson is close to expressing the same optimizations in Fortran, as in the reference version. These considerations should not hide the fact that, with this approach, some information has to be specified in both the \({\textsc {InKS}}_{\textsf {pia} }\) and \({\textsc {InKS}}_{\textsf {o/C++} }\) files, leading to more code overall.

On the contrary, \({\textsc {InKS}}_{\textsf {o/XMP} }\) and \({\textsc {InKS}}_{\textsf {o/loop} }\) enable the developer to specify optimization choices only, while the algorithmic information is extracted from the \({\textsc {InKS}}_{\textsf {pia} }\) code. This is illustrated in Listing 2, which presents the \({\textsc {InKS}}_{\textsf {o/XMP} }\) 6D domain decomposition and the resulting XMP code. Both are equivalent, but \({\textsc {InKS}}_{\textsf {o/XMP} }\) expects only the optimization parameters. Hence, one can test another memory layout, such as a different dimension ordering, by changing only a few parameters, while multiple directives must be modified in XMP. Moreover, using the domain decomposition directive offered by \({\textsc {InKS}}_{\textsf {o/XMP} }\) and XMP code, one can derive a simple parallel code from an \({\textsc {InKS}}_{\textsf {pia} }\) algorithm, as we did with the 3D heat solver. Note that since \({\textsc {InKS}}_{\textsf {o/XMP} }\) is a wrapper for XMP that uses \({\textsc {InKS}}_{\textsf {pia} }\) to retrieve some information, it is usable in any code that can be expressed using \({\textsc {InKS}}_{\textsf {pia} }\) and optimized with XMP. Similarly, with \({\textsc {InKS}}_{\textsf {o/loop} }\) (Listing 5), developers can easily test different optimization choices that would be tedious to write in plain C++. This is what we did with the 3D heat solver based on the double buffering technique: as shown in Table 1, using \({\textsc {InKS}}_{\textsf {o/loop} }\), we have implemented three versions of the loops (the same as the reference, one based on 2D cache blocking and a last one using 3D cache blocking) while modifying almost nothing in the \({\textsc {InKS}}_{\textsf {o/loop} }\) code. Since \({\textsc {InKS}}_{\textsf {o/XMP} }\) and \({\textsc {InKS}}_{\textsf {o/loop} }\) are usable with C and C++, respectively, \({\textsc {InKS}}\) does not restrict the expressible optimization choices: one can still implement optimizations not handled by our DSLs in C/C++. Moreover, operations such as halo size computation or the detection of vectorization opportunities are automated using the \({\textsc {InKS}}_{\textsf {pia} }\) code. In summary, the approach enables optimization specialists to focus on their specialty, which makes development easier.

Regarding performance, the \({\textsc {InKS}}\) approach makes it possible to express optimizations without changing the algorithm. The optimizations of the four NAS parallel benchmarks and of the 3D heat equation solver in \({\textsc {InKS}}\) were trivial to implement, and their performance matches or improves upon the reference, as presented in Table 2. Investigation has shown that Intel ICC 18 does not properly vectorize the reference versions of MG and FT; the use of the Intel ivdep directive, as done in the \({\textsc {InKS}}\) versions, leads to slightly better performance.

For the full 6D Vlasov–Poisson problem, \({\textsc {InKS}}_{\textsf {pia} }\) and C++ enabled us to implement the same complex optimization strategies as in the reference version. Our implementation therefore matches the reference in terms of performance, as shown in Table 2. For the \(v_1\) advection, both the \({\textsc {InKS}}_{\textsf {o/C++} }\) and \({\textsc {InKS}}_{\textsf {o/loop} }\) optimizations of the \({\textsc {InKS}}\) code achieve performance similar to the reference, as shown in Table 2. For the parallel aspects, the \({\textsc {InKS}}_{\textsf {o/XMP} }\) optimization offers performance similar to XMP, as shown in Fig. 4. MPI is faster in all cases compared to both XMP and \({\textsc {InKS}}_{\textsf {o/XMP} }\). At the moment, it seems that XMP does not optimize local copies, which slows down the Poisson solver. Besides, some XMP directives are based on MPI RMA, which makes the comparison with MPI send/receive complex. Still, MPI is much harder to program: more than 350 lines of MPI and Fortran are required to handle the domain decomposition, the remapping for the FFT and the halo exchange in Selalib, compared to 50 lines in XMP and 15 in \({\textsc {InKS}}_{\textsf {o/XMP} }\).

However, \({\textsc {InKS}}_{\textsf {pia} }\), \({\textsc {InKS}}_{\textsf {o/loop} }\) and \({\textsc {InKS}}_{\textsf {o/XMP} }\) have limitations. As mentioned earlier, \({\textsc {InKS}}_{\textsf {pia} }\) cannot express non-static control programs, such as the CG or the full FT NAS kernels. It also complicates the optimization of arrays passed as parameters (as input or output). \({\textsc {InKS}}_{\textsf {o/loop} }\) offers very limited options, making it unusable with complex loops such as those of the 3D heat solver optimized with the cache-oblivious strategy. Similarly, loop nests calling multiple \({\textsc {InKS}}_{\textsf {pia} }\) computational kernels cannot be expressed using \({\textsc {InKS}}_{\textsf {o/loop} }\). Moreover, it cannot respect the constraints imposed by a specific memory mapping and offers only a few optimizations; adding new optimization strategies, such as unrolling or loop fusion, would require adding new keywords. \({\textsc {InKS}}_{\textsf {o/XMP} }\) offers memory allocation but no control on the accesses to this memory, leaving optimization specialists in charge of its correct use. Besides, it is only usable with XMP.

Although we want to address some issues in the \({\textsc {InKS}}_{\textsf {pia} }\) language, \({\textsc {InKS}}_{\textsf {o/loop} }\) and \({\textsc {InKS}}_{\textsf {o/XMP} }\) were preliminary tests before a real \({\textsc {InKS}}_{\textsf {pso} }\) implementation. The goal was to propose a way to express optimizations while retrieving the algorithmic information from the \({\textsc {InKS}}_{\textsf {pia} }\) code. These two DSLs enabled us to identify a set of guidelines for the full definition of the \({\textsc {InKS}}_{\textsf {pso} }\) language. First, this definition must be based on the concepts that \({\textsc {InKS}}_{\textsf {pso} }\) must express, i.e., the optimizations related to memory (allocations and layouts) and to the scheduling of computations. With \({\textsc {InKS}}_{\textsf {o/XMP} }\), we experimented with the memory aspects, providing a directive to allocate a logical array and reorder its dimensions; \({\textsc {InKS}}_{\textsf {o/loop} }\) focuses on the scheduling of computations by adding strong constraints on their order. Secondly, in order to express these optimization concepts only, the language must be bound to its algorithmic counterpart, expressed in \({\textsc {InKS}}_{\textsf {pia} }\). In \({\textsc {InKS}}_{\textsf {o/XMP} }\), the domain decomposition directive refers directly to the logical arrays described in \({\textsc {InKS}}_{\textsf {pia} }\), making it possible to compute the size of each dimension and the halo size depending on the data being accessed. In \({\textsc {InKS}}_{\textsf {o/loop} }\), the Order keyword refers to the structuring variables of a computational kernel defined in \({\textsc {InKS}}_{\textsf {pia} }\). Finally, we must work on the means to express these optimization concepts with the most generality. For this part, we think that building on existing tools is probably mandatory for the most complex strategies, as with XMP for the domain decomposition; using another existing tool for computation scheduling would make \({\textsc {InKS}}_{\textsf {o/loop} }\) more general. Similarly, a declarative language such as \({\textsc {InKS}}_{\textsf {o/loop} }\) may not be the best strategy: to handle the interaction between memory and computation, an imperative language could be easier.

Fig. 4: Weak and strong scaling for 3 parts of the Vlasov–Poisson solver up to 64 nodes (1 process/node) on a \(32^6\) grid divided among processes (strong scaling) or a \(16^6\) grid per process (weak scaling). Median of 10 executions

6 Conclusion and future work

In this paper, we have presented the \({\textsc {InKS}}\) programming model, which separates the algorithm from the optimization choices, and its implementation supporting two DSLs: \({\textsc {InKS}}_{\textsf {o/loop} }\) for loop optimizations and \({\textsc {InKS}}_{\textsf {o/XMP} }\) for domain decomposition. We have evaluated \({\textsc {InKS}}\) on synthetic benchmarks and on the solving of the 6D Vlasov–Poisson equation. We have demonstrated its generality and its advantages in terms of separation of concerns, improving maintainability and portability while offering performance on a par with existing approaches.

While this paper demonstrates the interest of the \({\textsc {InKS}}\) model, further work is required to develop it. We will improve the optimization DSLs, base \({\textsc {InKS}}_{\textsf {o/loop} }\) on existing tools and ensure its interaction with \({\textsc {InKS}}_{\textsf {o/XMP} }\). This will be done within the scope of the development of a complete \({\textsc {InKS}}_{\textsf {pso} }\) DSL enabling its users to manage memory placement and computation scheduling. This planned DSL will truly separate the algorithm from the optimization choices in static programs, while its compiler could offer static analysis of the program to ensure correctness. We also want to target different architectures to demonstrate the portability gains of the \({\textsc {InKS}}\) model.