1 Introduction

Recent advances in Graphics Processing Units (GPUs) have opened a new challenge in harnessing their computing power as a general-purpose computing paradigm. GPUs have gained prominence by implementing efficient massive multithreading as the main strategy for latency hiding. GPUs use multiple streaming multiprocessors (SMs) with potentially hundreds of cores, fast context switching, and high memory bandwidth to tolerate ever-increasing latencies to main memory. The strategy is to overlap long-latency loads of stalled threads with useful computation in other threads [33]. The Compute Unified Device Architecture (CUDA) is a C-like interface for programming NVIDIA GPUs. However, porting applications to CUDA remains a challenge. CUDA places on the programmer the burden of packaging GPU code in separate functions, explicitly managing data transfers between the host and GPU memories, and manually optimizing GPU memory utilization [20]. Hence, the precondition for efficiently utilizing GPU resources is a comprehensive understanding of the underlying architecture and the application of complex kernel optimizations. These difficulties have motivated many researchers to develop high-level compilers that restructure the code into an optimized CUDA program using a set of loop transformations or optimizations to meet the GPU architectural constraints. Such high-level compilers may greatly simplify the programming of GPUs, thereby spreading their use in supercomputing and scientific computing applications where the tremendous GPU computing power is most needed.

A survey of a wide set of high-level frameworks [45] known as algorithmic skeletons is available online. In some cases, modular programming is used by representing the computation as a graph of modules. The ultimate aim is to develop an explicit programming paradigm with implicit communication among processes. In other cases, a pattern-oriented programming model is developed. Other options use parametric skeleton objects for skeleton programming. The skeletal parallel programming framework is useful for data-parallel applications targeting heterogeneous multi-core platforms. Overall, most of the presented frameworks, extensions, and libraries target only shared-memory (SMP) and/or distributed-memory (DMS) parallel computing systems, except the FastFlow framework, which extends its application to CUDA programming.

1.1 Compilers and code restructuring tools

There is growing research on reducing the complexity of generating optimized CUDA kernels [11, 29, 35]. Even though the CUDA programming model offers a more programmer-friendly interface, programming GPUs is still considered error-prone and complex even for expert programmers, in comparison to programming CPUs using parallel programming models such as OpenMP [13, 34]. Most recently, quite a few directive-based GPU programming models have been proposed by both the research community (hiCUDA [20], OpenMPC [27], etc.) and industry (PGI Accelerator [36], HMPP [40], R-Stream [28], OpenACC, OpenMP for Accelerators [8], etc.). In directive-based programming, the user provides hints to the compiler to trigger selective code optimizations, given a reasonable background on the GPU architecture and the application. On the surface, these models appear to offer different levels of abstraction and reduction in programming effort for code restructuring and optimization.

CUDA-lite [42] proposed a set of annotations to maximize the efficiency of the transformation, such as inserting shared memory variables, loop tiling, and coalesced loads/stores obtained by replacing global memory accesses with corresponding shared memory accesses. CUDA-lite allows users to input a naïve CUDA code that treats the memory as a single entity, thereby hiding the complexity of handling the hierarchical memory. hiCUDA [20] is a directive-based language for programming GPUs. The aim is not to automate program optimizations but rather to make programming in CUDA easier. It still depends on explicit optimizations by the programmer, such as utilizing shared memory or constant memory.

OpenMPC [27] proposed a set of directives to be used along with the OpenMP directives [8]. It optimizes data movement between CPU and GPU by applying inter-procedural dataflow analysis. It also performs parallel loop swap and loop collapsing to enhance inter-thread locality. Furthermore, it uses auto-tuning to obtain the final optimized CUDA code.

OpenACC is the first standardization effort towards a directive-based, general accelerator programming model that is portable across device types and compiler vendors. The PGI Accelerator programming model [36] is a directive-based model targeting general hardware accelerators. It currently supports only CUDA GPUs, based on the OpenACC standard. With only light exposure to the GPU architecture, the user inserts directives into the host program to guide the compiler towards a particular set of kernel optimizations and code transformations.

HMPP [9] is another directive-based model with a very high-level abstraction of GPU programming, similar to the PGI Accelerator. The HMPP model is based on the concept of codelets that can be executed remotely on hardware accelerators such as GPUs. Codelets are offloaded to GPUs based on some manual modification of the code structure.

R-Stream [28] is a high-level, architecture-independent programming model that is based on the polyhedral model [10]. It targets various architectures, such as the STI Cell, SMPs with OpenMP directives, Tilera, and CUDA GPUs. R-Stream performs affine scheduling to extract fine-grained and coarse-grained parallelism. It also performs a set of loop transformations such as global memory coalescing, loop interchange, strip-mining, loop fusion, shared memory promotion, and tiling.

RT-CUDA [25] provides the same level of abstraction as R-Stream but additionally offers user-defined configurations to control various optimizations and features of the underlying GPU architecture, in order to explore the effects of different kernel optimizations. RT-CUDA converts a C-like program into an optimized CUDA program by aligning the code with the major GPU constraints. A configuration file is used to store hints for selective code transformations. APIs have been added to allow the invocation of external library calls. Finally, a generic program parametrization is used to apply auto-tuning, which helps find a suitable setting for resource occupancy.

CUDA-CHiLL [26] is a command-based transformation compiler that executes a recipe containing a set of commands for each required code transformation. Optimization heuristics are applied manually, such as dependence analysis and parallelization, global memory coalescing, shared memory usage and bank-conflict avoidance, and maximizing reuse in registers.

The FastFlow framework [2] addresses computations that can be represented by a set of iterative data-parallel kernels. Specifically, a loop-of-stencil-reduce is developed to simplify the programming of data-parallel programs on heterogeneous multi-core platforms. FastFlow has been used for implementing Helmholtz PDE solvers and for streaming Sobel edge detection; for the latter, a 2D stencil combining the 3-by-3 neighborhood of each cell is used to update the solution matrix. The work extends the earlier SkePU programming framework [15]. FastFlow presents an important parallel program optimization for implementing the reduce operations (map, reduce, map-reduce, and stencil-reduce) on heterogeneous GPU systems.

Section 2 presents a summary of current compilers and restructuring tools applying various GPU optimizations. It also discusses the GPU optimizations that are required to enhance resource utilization, a pre-requisite for GPU performance.

Most of the proposed computational paradigms aim at determining the stencils involved in the solver equations based on knowledge of the PDEs in the physical model. In this view, our proposed computational paradigm and that of FastFlow both help fast prototyping together with an optimized implementation of the most time-consuming part, namely the stencil involved in the solver system.

1.2 GPUs and structured grid computing (SGC)

The tremendous computing power of GPUs is extremely useful for accelerating many science applications that are based on discrete numerical simulation. Extracting parallelism depends on the nature of the application. For example, in signal processing the multi-dimensional digital filter matrices are determined based on dependence graph analysis (DAGs), divide-and-conquer strategies, and pipelining [18].

Structured grid computation (SGC) is one important class of science applications [19, 22, 30, 32, 41, 43]. Hence, there is great interest in developing systematic optimization approaches for accelerating SGCs using GPUs. Generally, SGCs consist of repeatedly solving a sparse system of linear equations using an iterative linear algebra solver (ILAS) algorithm. Most of the simulation time is spent within the solver, which justifies the need for an efficient solver implementation. The most time-consuming part of a solver is the sparse matrix–vector multiplication (SpMV) operation [22, 30, 44, 48]. Sparse matrix and vector calculus has therefore attracted great attention. For example, optimizing the storage of sparse matrices and the implied matrix–vector calculus is one critical issue in the efficient implementation of SGCs on GPUs. In most cases, users rely on standard storage schemes that are adequate for general sparse matrices with a quasi-random data layout. In many cases, the above issues are addressed manually due to the lack of effective tools, especially when there is a need to efficiently exploit the regularity and the pattern in the distribution of non-zero elements to increase the problem size and simulation accuracy.

1.3 Paper contributions and organization

Many computational science applications can be discretized and simulated using SGC approaches. The objective of this paper is to explore techniques that accelerate the implementation of efficient science simulations for the class of SGC applications. It identifies SGC simulations that can be accelerated by “basic machine-dependent optimizations” and “domain-specific optimizations & tools”. Figure 1 summarizes the basic idea of this paper.

Fig. 1 An overview of required optimizations to support SGC. The optimizations that are briefly addressed by existing tools are marked in green; the optimizations that are mostly not addressed by the existing tools are marked in red (color figure online)

An efficient implementation of SGC on GPUs must account for two fundamental classes of optimizations: (1) GPU machine-dependent architectural optimizations and (2) SGC domain-specific optimizations; see Sects. 2.1 and 2.2 for details. Due to their complexity and multi-objective nature, the former optimizations have only been partially addressed in proposed compilers and restructuring tools. The latter optimizations represent the functionalities needed for implementing iterative linear algebra solvers using intelligent tools for developing optimized storages, sparse matrix data structures, and the implied matrix and vector operators. Section 2 presents more details about the compilers and restructuring tools, which will enlighten the reader about how the above optimizations are implemented.

Numerical libraries (NLs) have helped the scientific community by providing optimized parallel/GPU code for: (1) a rich variety of operator-based vector/matrix functionalities; (2) a few standard sparse matrix storage schemes (COO, CSR, HYB, DIA, etc.); (3) a few linear algebra solvers (Ax=b) with some preconditioning, such as Jacobi, SOR, BiCG, and GMRES; and (4) a few numerical methods such as matrix factorization, Gaussian elimination with pivoting, LU, QR, and LSE. However, the following much-needed enhancements for SGC are missing: (1) user-guided synthesis of complex data structures; (2) optimized code using the synthesized storage; (3) data locality across library operators; and (4) automatic auto-tuning.

Section 4 details a set of required enhancements and intelligent tools to help scientists describe their complex data structures and automatically generate optimized code for quick prototyping of SGCs.

For relevant background and basic terminology related to the GPU architecture, heterogeneous programming details and the working of linear algebra solvers, the readers can refer to references [1, 3, 24, 25].

The rest of the paper is organized as follows. Section 2 explores the basic GPU optimizations, how these optimizations have been addressed in research and industry compilers, and current practices for programming linear algebra solvers. Section 3 introduces SGC with an example of oil reservoir simulation; it discusses the SGC implementation strategy and reviews the libraries, tools, and optimizations used in these implementations. Section 4 explores how computational scientists define SGC data structures and optimize the solver algorithm, which helps us identify the key optimization functionalities for an integrated SGC library that will ease the process of designing complex SGC simulations. Section 5 presents our incremental contributions towards the integrated SGC library.

2 GPU optimizations for linear algebra solvers

This section presents the basic GPU architectural optimizations (BA), the domain-specific compiler optimizations (DS), and how research and industry GPU compilers have implemented transformations to enhance code efficiency. It also identifies the optimizations needed for the class of iterative linear algebra solvers (ILAS) and explores how scientists program these solvers.

2.1 Basic GPU architectural optimizations

The GPU architecture and its execution model provide detailed information on how the GPU optimizations must be utilized to achieve the best possible application performance. Following is a list of Basic Architectural Optimizations (BAs) and their functional specifications that must be applied by the software tool or the compiler to generate efficient CUDA programs:

  1. The Parallel Memory Bandwidth (PMB): This aims at mapping threads within a warp (a group of threads running in lock-step) to access data from distinct storages in the device memory. The compilers/tools must explore different correct mappings for the addresses generated by neighboring threads and select a mapping that guarantees coalesced access to global memory. For shared memory accesses, threads within a warp must map their accesses to distinct memory banks to avoid serialization. Data access requests to global memory can be reordered in parallel by multiple channels and banks. However, the memory bandwidth is efficiently utilized only when the accesses to the memory channels are balanced, without congested channels.

  2. The Locality Optimization (LO): A GPU has several streaming multiprocessors (SMs), each with a small, fast shared memory (ShM). Sharing among SMs is done through a large, slow global memory (GM). The locality optimization aims at enhancing the use of the deep explicit GPU memory hierarchy through four main actions. The first is to copy data once into ShM to maximize data reuse while maintaining a data footprint that meets the memory constraints. The second converts the original loop using blocking or tiling with a fixed maximum size so that each tile fits in the ShM capacity. The third is to make efficient use of the available large register file for temporary data. Finally, the read-only special portions of GM, the constant and texture memories, are used by preloading data into them before entering kernels. (A short kernel sketch illustrating coalesced access and shared-memory tiling follows this list.)

  3. The Input/Output GPU Memory Allocation (I/O): The use of inter-procedural data-flow analysis to optimize data movement between the host CPU and the GPU device, an operation that is explicit for many compilers. This includes allocating memory for GPU input and output and managing the explicit transfer of data between the host CPU and the GPU device.

  4. Computation Partitioning and Decomposition (CP): This consists of three fundamental actions: (1) manage block-level and thread-level parallelism, (2) map the block/kernel organization and dimensions to the data structure of the computation, and (3) use address transformations to map threads to results and adjust the thread granularity to amortize the transfer/processing ratio.
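
The following short CUDA kernel, a sketch of ours rather than code from any of the cited tools, illustrates two of the BA optimizations above on a 1D 3-point stencil: coalesced global-memory access (PMB) and shared-memory tiling for data reuse (LO). The tile size TILE and the averaging weights are illustrative choices, and the kernel assumes it is launched with TILE threads per block.

#define TILE 256

__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float s[TILE + 2];                 // tile plus one halo cell on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    // consecutive threads read consecutive addresses: one coalesced transaction (PMB)
    s[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        s[0] = (gid > 0) ? in[gid - 1] : 0.0f;                      // left halo
    if (threadIdx.x == blockDim.x - 1)
        s[blockDim.x + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;     // right halo
    __syncthreads();

    // each input element is now reused from fast shared memory (LO) instead of
    // being re-read from global memory by three different threads
    if (gid < n)
        out[gid] = 0.25f * s[lid - 1] + 0.5f * s[lid] + 0.25f * s[lid + 1];
}

A typical launch would be stencil3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n).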

2.2 Domain specific compiler optimizations

Although the BA optimizations are essential, they are far from being sufficient to optimize even simple domain-specific applications. An important target application area for restructuring tools is that of iterative linear algebra solvers (ILAS).

ILAS can benefit from the existence of highly optimized math libraries for basic dense and sparse linear algebra operations. These libraries are developed by the academic and industry communities to help provide code for multi-core and many-core computers for a variety of applications, including ILAS. Libraries provide optimized code for many algebra operators that can be invoked from many high-level languages. Library operator calls offer substantial advantages, such as sparing the user from direct exposure to the GPU details, in addition to rapid prototyping and code portability. Further, libraries are constantly enhanced and new features are added.

To efficiently implement ILAS algorithms, a restructuring tool must embody the BA optimizations in addition to the ability to efficiently implement some domain-specific (DS) optimizations. For this, the following DS features should be integrated in addition to the aforementioned BA optimizations:

  1. Inter-Block Synchronization (SYN): This is needed because of the iterative nature of ILAS algorithms. Threads cannot start the next iteration before making sure that all threads have completed the current iteration. Since GPUs offer no global synchronization, a customized inter-block synchronization mechanism is needed when the exact algorithm behavior must be preserved to ensure correctness. Some of the proposed synchronization techniques are: (1) kernel entry/exit, (2) lock-based, (3) lock-free, (4) relaxed synchronization, or (5) synchronization adapted to the algorithm depending on the expected degree of thread load unbalancing.

  2. Invocation of Optimized External Libraries (IEB): Some external libraries have been optimized at a lower programming level and may deliver substantial performance advantages over manually optimized regular code. Efficient invocation of an external library requires a full understanding of its parameters and the related implementation logic in order to select proper parameter values.

  3. Optimization of Architectural Parameters (AP): Due to the many GPU occupancy constraints, there is a need to carry out resource management analysis and find the most suitable machine occupancy parameters. Empirically searching a space of possible configurations using code parametrization and auto-tuning techniques allows finding the kernel parameter values that give the best performance.

The following subsections explore the research issues for each DS optimization and discuss some potential solutions.

2.2.1 Global synchronization

Lock-based synchronization uses atomic operations on global variables defined in GM. When all threads of a block finish their work, the first thread of each block atomically decrements a global variable (mutex) and keeps checking it as long as it is greater than zero. The drawback is the hot-spot created by the terminating blocks polling GM. In the lock-free scheme [37, 47], each terminating block b sets its entry in a global array Ain(b) to post its termination. Next, the thread checks the completion of the other blocks using the other entries of Ain. Note that access to Ain(b) need not be atomic because Ain(b) can be set only by block b. The barrier is passed when a block finds that all entries of Ain are set.
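
A minimal device-side sketch of the lock-free scheme just described follows; it is our illustration rather than the exact code of [37, 47]. The caller passes a device array Ain with one entry per thread block (zero-initialized) and a goal value that is incremented on every barrier so the flags never need resetting. The usual caveat applies: all blocks must be co-resident on the GPU, otherwise spinning blocks starve blocks that have not yet started.

__device__ void lockfree_barrier(volatile int *Ain, int goal)
{
    __syncthreads();                          // make sure this block's work is finished
    __threadfence();                          // make its global-memory results visible
    if (threadIdx.x == 0) {
        Ain[blockIdx.x] = goal;               // post completion; no atomics are needed
                                              // because Ain[b] is written only by block b
        for (int b = 0; b < gridDim.x; ++b)   // poll the completion flags of all blocks
            while (Ain[b] < goal) { /* spin */ }
    }
    __syncthreads();                          // release the remaining threads of the block
}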

Relaxed synchronization (RS) [25] allows two iterations to overlap in time. A completing thread block stores its range of results in GM and starts the next iteration using the partial results from the other threads. A global array is used to collect the results from completing thread blocks. A thread block terminates the current iteration when it has processed all partial results of the current operation. This approach is profitable when there is enough load unbalancing among the threads to offset the overhead of processing partial results.

Another interesting scheme is Re-Ordered Synchronization (ROS), which was proposed to hide the global synchronization overhead by allowing a completing thread block to execute some independent work that must be done anyway. ROS re-orders the operations so that a completing thread block stores its partial results in GM and starts an independent operation, instead of polling GM for the completion of other blocks. Two global arrays Ain and Aout are used to post block completion and to check completion of a global reduction, respectively. The thread block that is last in posting its completion carries out the reduction of the partial results into a global variable and sets all entries of Aout. The ROS advantages are: (1) the synchronization time due to load unbalancing is hidden by some computation, and (2) it eliminates the need for atomic access to global flags as well as for polling for the completion of other threads.

2.2.2 Optimization of architectural parameters

In a GPU, all data movements within the memory hierarchy are highly dependent on the CUDA kernel structure, as there is no cache coherency implemented within the GPU architecture. In addition, it is difficult to determine the optimal parameters that define the GPU machine occupancy. These parameters are the grid size, the thread block size, and the tile granularity. Usually these parameters are found through manual empirical testing or using a tool like OpenTuner [5], which is a very time-consuming process due to the many possible combinations. There is therefore a need for a compiler auto-tuning approach that evaluates the performance of a newly generated parametric kernel with the various possible combinations of the above occupancy parameters. Pruning of the list of possible parameters is used at three levels to reduce the repeated compilation and execution of the kernel [24]. The three levels of pruning consist of skipping those tile sizes which do not equally distribute the resultant elements (1) among all threads (array block level) or (2) among all kernel blocks (kernel block level), or (3) parameter combinations which require more than the available registers (active block level). For each combination, the number of registers per thread and the shared memory (ShM) per block are determined. Next, the number of Active Blocks by Warp (ABW), Active Blocks by Shared Memory (ABShM), and Active Blocks by Registers (ABR) are calculated. Parameter pruning is carried out at the active block level and generates a list of candidate parameters. Finally, the kernel is run for each combination of parameters in the candidate list, and the combination that gives the minimum execution time is retained. A host-side sketch of this search loop is given below.
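
The following host-side sketch illustrates the pruning-plus-empirical-search idea described above. It is a simplified illustration under stated assumptions: solver_kernel stands in for the generated parametric kernel, regsPerThread for the compiler's register estimate, and the pruning checks shown cover only the uneven-work and register-pressure levels.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void solver_kernel(float *x, int n, int tile)
{
    // placeholder body: each thread updates 'tile' consecutive elements
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * tile;
    for (int t = 0; t < tile; ++t)
        if (base + t < n) x[base + t] *= 0.5f;
}

static float time_config(float *d_x, int n, int block, int tile)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    solver_kernel<<<(n + block * tile - 1) / (block * tile), block>>>(d_x, n, tile);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms; cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}

void autotune(float *d_x, int n)
{
    cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0);
    const int regsPerThread = 32;                  // assumed per-thread register estimate
    float best = 1e30f; int bestBlock = 0, bestTile = 0;

    for (int block = 64; block <= 1024; block *= 2)
        for (int tile = 1; tile <= 8; tile *= 2) {
            if (n % (block * tile) != 0) continue;                    // prune: uneven work split
            if (block * regsPerThread > prop.regsPerBlock) continue;  // prune: register pressure
            float ms = time_config(d_x, n, block, tile);              // empirical evaluation
            if (ms < best) { best = ms; bestBlock = block; bestTile = tile; }
        }
    printf("best configuration: block=%d tile=%d (%.3f ms)\n", bestBlock, bestTile, best);
}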

2.2.3 Challenges for numerical libraries

Numerical libraries have made significant progress with respect to code versatility, operator diversity, and, in many cases, ready-made solvers. However, simply implementing an ILAS using numerical library calls from a high-level language may not work well, for the following two reasons. (i) The provided sparse matrix storage schemes are appropriate for a class of sparse matrices without much regularity. There is a further need to adapt the storage scheme to take advantage of the non-zero pattern regularity, block structure, and other specific features. (ii) The optimized operator library calls assume their operands reside in global memory (GM) to gain generality. This offsets the benefit of data locality when chaining the solver operators. Thus, the operator data locality is lost unless a more general operator semantic is developed to ensure that all possible operator chaining is done at the level of the shared memory.
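
The kernel below, our own illustration rather than any library's code, makes the chaining problem concrete: a generic library would compute y = a*x + y and then dot(y, y) as two separate kernel calls, writing y to global memory in between and re-reading it. The fused kernel keeps each element of y in a register and feeds it straight into the block-level reduction. It assumes a launch with 256 threads per block; the per-block partial sums are reduced afterwards on the host or by a small follow-up kernel.

__global__ void fused_axpy_dot(float a, const float *x, float *y,
                               float *partial, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float yi = 0.0f;
    if (i < n) {
        yi = a * x[i] + y[i];     // axpy result stays in a register ...
        y[i] = yi;                // ... is written once for later use ...
    }
    s[threadIdx.x] = yi * yi;     // ... and reused immediately by the dot product
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = s[0];   // one partial dot product per block
}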

The application of stencil-based relationships on structured grids results in some problem-specific features such as the solver matrix layout, the sparsity pattern, and the number of state variables to be updated at each grid point. Numerical libraries have a variety of optimized sparse matrix storage schemes, which aim at minimizing sparse matrix storage and enhancing the computation of basic algebra operators. The use of NLs for SGC implementation is useful for generic sparse data structures and the matrix–vector math operators. However, NLs do not have tools that take as input the stencil-based relationships in a given SGC and determine the sparse matrix data layout and its optimized storage scheme. Hence, scientists are left either with tedious manual work or with standard storages that cause a significant drop in performance. Another important issue is that NLs sometimes do not produce optimized code because of the lack of tools that exploit the data locality across a chain of operator invocations, i.e. locality is lost from one operator invocation to the next. Hence, the performance of library functions is generally far from that of manually optimized code.

Hence, there is a mismatch between the computational power of GPUs and the degree of SGC optimization when solely using library implementations. To compile efficient library code, there is a need for an analysis of the strategies used in current compilers. Proper analysis allows library users to identify the missing constructs for efficiently implementing iterative solvers and to achieve a high degree of kernel optimization. Another problem is how to let the user describe the features of the domain-specific sparse matrix to enable the efficient implementation of the SpMV, the sparse matrix storage, and the solver [30]. Finally, the evolving GPU architecture requires some reflection on an integrated library framework that provides portable, flexible, and viable solutions.

Table 1 Comparison of software compilers and frameworks in terms of optimization specifications

2.2.4 Summary about optimizations in research and industry compilers

The discussion in the previous sections shows how research compilers have addressed the optimizations needed to take advantage of the massive parallelism in GPUs. Table 1 shows how the aforementioned BA and DS optimizations (see Sects. 2.1 and 2.2) have been addressed in the available software frameworks and compilers. The table is built based on our understanding of the published descriptions of the optimizations used in these compilers and frameworks. The main feature of these compilers is that they present a simpler GPU programming model. However, the optimizations needed to generate tailored kernels for scientific simulations are missing. The complexity of finding systematic and automatic GPU optimizations makes these compilers less efficient than manually optimized programs for general-purpose computing. The following are the major limitations for the efficient implementation of ILAS algorithms using parallel compilers and libraries:

  1. There is a semantic gap between current GPU compilers and the optimizations needed for ILAS, which are far from meeting the expectations of raw numerical algorithms. For example, the lack of global synchronization and the absence of automatic auto-tuning tools degrade resource utilization.

  2. The code is not optimized to take advantage of the operator data locality in a sequence of library calls, as each call always refers to data stored in the lower levels of the memory hierarchy.

  3. Numerical libraries have standard sparse matrix storage schemes, whereas most of the sparse matrices found in science simulations have domain-specific data structures. Libraries and programming tools fail to capture the regularity in sparse data structures and to synthesize customized storages.

  4. ILAS computing suffered in the past from the lack of powerful tools to enable scalable implementation on cluster computers. This gap is becoming wider with many-core technology due to the complexity of adapting ILAS to the massive arithmetic parallelism, the explicit memory system, and the multi-threading strategy.

As a result, it is rare to find an application in the area of linear algebra solvers that has been significantly accelerated using the above compilers. The next section explores how SGC researchers have been using parallel computers, software tools, and numerical libraries to develop structured grid simulations. It focuses on the difficulties in the above process and identifies some of the tools needed to speed up the development of SG applications on GPUs.

3 From science simulation (SS) to structured grid computing (SGC)

Science simulation (SS) is especially useful for emulating the design process, which yields significant cost savings in a variety of research and engineering areas. A physical process is modeled using a set of partial differential equations (PDEs). Discretization of the PDEs leads to representing the problem using a large 3D grid of cells, with each grid cell having a set of states. Most of the models are inherently non-linear. In any given simulation, each grid cell has a number of independent state variables. A grid cell interacts with its neighbors through physical exchanges of the state variables. The stencil defines how the states of neighboring cells interact with each other according to the PDE laws. Generally the physical process is non-linear, and linearization is done using the Newton-Raphson method.

To represent the interaction among all the grid cells, a 3D grid having N cells, each with k state variables, is unfolded into a 1D representation having kN variables by using a simple address mapping function. This is useful for building the solver matrix. Each cell is represented by its k state variables in the 1D representation. A 2D Jacobian (solver) matrix is built by mapping the 1D unfolded grid onto the rows and the columns, e.g. element (i,j) represents the interaction coefficient between grid cells i and j. Hence, the interaction among the grid cells is represented by a Jacobian matrix A of size (Nk)x(Nk). Generally, matrix A is sparse because a cell interacts only with its immediate neighbors. The simulation consists of repeatedly solving a system of linear equations of the form Ax=b to approximate the non-linear solution, where A is a square sparse matrix, b is a known vector, and x is the unknown vector. Solving this system means finding the solution x that satisfies Ax=b. The solution x is used to update the model, which in turn updates the values of matrix A and vector b. The process of solving for x and updating the model continues until some convergence condition is reached.
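
A minimal illustration of such an address mapping function follows; the variable names are ours. A cell at coordinates (i, j, k) of an Nx-by-Ny-by-Nz grid with kc state variables per cell (component index c) maps to one row/column of the (Nk)x(Nk) Jacobian, where N = Nx*Ny*Nz.

__host__ __device__ inline int unfold(int i, int j, int k, int c,
                                      int Nx, int Ny, int kc)
{
    int cell = (k * Ny + j) * Nx + i;   // 1D cell index of the unfolded grid
    return cell * kc + c;               // row/column index in the solver matrix
}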

There are many linear algebra solvers which solve Ax=b using direct and iterative approaches. LU factorization is one example of a direct method, while Jacobi, Conjugate Gradient (CG), and GMRES are examples of iterative approaches. A science simulation spends most of its running time in solving the above system of linear equations. Thus, most of the optimizations focus on the solver, which includes the main algebra operators such as the sparse matrix–vector multiply (SpMV), matrix–matrix multiply (MM), the inner product of two vectors, addition/subtraction of matrices and vectors, etc. Efficient synchronization among all the working threads is another important optimization needed to produce correct solutions.

The research and industry communities have developed compilers and numerical libraries to help alleviate the complexity of programming directly in CUDA. However, the complexity of finding systematic and automatic GPU optimizations and of efficiently implementing linear algebra operators makes these approaches much less efficient than manual optimization. This is widening the gap between the performance of science simulations and the large computational capabilities of GPUs. Table 2 summarizes SGC research based on code optimization (CO), numerical libraries (NL), and/or data structures (DS) for sparse matrices.

Table 2 Summary of SGC Research work that is based on code optimization (CO), use of optimized operators and tools in Numerical Libraries (NL), or use/develop storage schemes and data structures for sparse matrices (DS)

4 Integrated SGC library (ISL)

GPUs have great potential for scientific simulations. Utilizing a significant fraction of the GPU peak performance is needed to enhance the accuracy of SGC simulations. This task is quite challenging because of the GPU architectural complexity and the lack of efficient tools to customize the SGC data structures and adapt the code to the algorithm properties. There have been various research efforts to help users in GPU programming, but a comprehensive solution that efficiently addresses the SGC data structures and algorithms is still in a research phase. Thus, there is a need for an Integrated SGC Library (ISL) for scientific simulations that efficiently supports scientists in GPU programming and relieves them from the tedious tasks of programming their solver matrices, customizing optimized storages, redesigning operators to preserve data locality, and embedding architecture-specific GPU optimization details in the code. ISL can be seen as an intelligent interface for describing PDEs and structured grids at a very high level using a notation that resembles mathematical formulas. An attempt in this direction is found in the UFL (Unified Form Language) [4], which allows users to describe finite element equations that are translated into kernels by the FEniCS Form Compiler (FFC). ISL needs to be implemented as a viable software system focusing on ease of extendability and maintainability, as the working data structures and algorithms will evolve and new annotations and optimizations will be required to adapt to changes in the simulated problems and the GPUs. The following are the detailed features of the desired ISL:

  1. Intelligent Data Structure Interface (IDSI): There is a need to automate the process of building, refining, and generating the sparse matrix data structures, which is the precondition for generating optimized programs for a scalable solution of PDEs. The process should be based on a mechanism to capture the grid features, such as the regularity in the sparse matrix and its pattern. This can be implemented using a set of high-level annotations to help users synthesize the main solver matrix. Users provide guidance for the principal data structures by taking advantage of the problem structure to implement customized solutions. Given an SGC and a stencil relationship, a set of linear equations (SLEs) corresponding to the application of the stencil to all the grid points can easily be derived. The obtained SLEs are represented by a sparse matrix, which is the solver matrix. Therefore, it is useful to develop an intelligent tool that allows the user to describe the SG and the relevant simulation constraints such as the stencil operator, boundary conditions, initial conditions, and cell components. (A hypothetical sketch of such a description is shown after this list.)

    The IDSI tool may use graph properties to build the basic grid cell structure with its computing links to other cells and to determine the rules at the grid boundary and their trigger conditions. Using inference rules on the synthesized data structure will enable the automatic generation of the solver matrix for an arbitrary grid size. The generated solver matrix, with all constraints incorporated, identifies the matrix non-zero blocks, which may emerge from the stencil definition, the initial conditions, or the boundary conditions. In addition, the tool will prepare the link between the matrix structure and the physical model parameters so that the solver data can be updated on every iteration of the simulation, a task that is essential when implementing the solver algorithm. This flexible approach will prove its usefulness for assessing the simulation scalability in the process of enhancing the simulation accuracy. Thus, the main benefit of IDSI is that it reduces the user effort to the task of defining the grid and its constraints, letting the system find generic rules that scale the problem to arbitrary size and handle the complexity of the structure and its assessment.

  2. Automatic assessment and selection of a library storage scheme: There is a need to assess the performance of the standard library storage schemes given the structure of the synthesized solver matrix, its structural regularity, and the distribution of its non-zero element (nze) blocks. A tool is needed for the automatic assessment of the efficiency of library sparse matrix storage schemes. One approach is to select a subset of the available storage schemes by comparing their recognized profitability features with those found in the current matrix structure. The automatic assessment must account for the total storage required, the number of operations needed to compute the index of non-zero elements, the communication-to-computation ratio of transfers from GPU global memory to shared memory, bank conflicts at the shared memory, and the available bandwidth. A storage scheme can be retained when its implied performance meets the user-specified optimization level.

  3. Synthesizing a custom storage scheme: The user needs to guide IDSI in building a custom storage scheme if no library storage scheme provides an acceptable performance level. This is an interactive process in which the user is provided with high-level annotations to guide the compression of the non-zero elements in the solver matrix. The objective is to progressively build a storage scheme that balances the matrix storage requirement with the overhead of address calculation and memory access in the basic SpMV operation. Experience with manual storage optimization indicates that many possible storage schemes can be interactively designed, such as converting a diagonal blocked data structure into columns or rows, collapsing rows or columns of non-overlapping data, caching block address keys to avoid multiple conditional statements in the code, and clustering irregular blocked structures with a two-level hybrid compression coding. These techniques have been tried in different fields and may be used as key user annotations in a bottom-up approach to construct the storage scheme from the smallest near-compact pattern (NCP) to the whole matrix, by taking advantage of the regularity, graph properties, and repeatability found among the non-zero blocks of the solver matrix. The tool may display a snapshot of the NCPs to help the user visualize the pattern regularity at different grid sizes. A grid plane is bounded by many blocks of zeros, which are due to shrinking the stencil at the plane boundary. The result is a regular distribution of NCPs, where the NCP size depends on the plane size. Although the NCP may change its pattern depending on the grid dimensions, the block connectivity within the NCP is largely preserved. The tool needs to validate the synthesized storage and assess its expected performance by using an evaluation technique such as the SpMV computation.

  4. Optimizing the solver algorithm: The optimized math operators available in the numerical libraries, currently accessible through generic library calls, need to be redesigned to preserve inter-operator data locality in shared memory and register files. Currently, library math operators pick up their operands from the global memory. Different argument scenarios must be available for each operator to ensure the use of the data operand wherever it was created by its predecessor. Similarly, produced data should be cached wherever appropriate to minimize data motion across the data-dependent operators. Currently, most of these data motion optimizations are performed manually. A chained list of operators should be automatically translated, following a global data dependence analysis, into a chain of selected operator scenarios that minimizes data motion. Optimization techniques that account for the GPU architectural features, such as the small shared memory size, coalesced global memory access, and conflict avoidance in shared memory, should be systematically used in the algorithm implementation. This ensures an acceptable communication-to-computation ratio for a given GPU. The code needs to be easily modified for solving (i) larger problems, (ii) fixed-size problems with faster execution, or (iii) fixed-size problems with increased accuracy. The compiler should have sophisticated optimizations to trade off memory-bound and time-bound implementations.

  5. Integrated auto-tuning (IAT): Currently, SGC kernel auto-tuning is done manually, or the application is migrated to another environment for a complex, user-supervised auto-tuning. There is a need for integrated auto-tuning (IAT) as part of the library development process to hide the complexity of fine-tuning the code and to avoid exposing the user to GPU intricacies. Auto-tuning needs to be redesigned to spare the user from the process of searching for the most optimized combination of architectural parameters, which is complex and highly machine-dependent. IAT should handle many complex and interrelated architectural features, such as the automatic generation of a parametric solver kernel (PSK) to enable the use of sophisticated auto-tuning techniques. The PSK captures the salient GPU occupancy parameters such as the kernel structure, thread block size, thread granule size, compilation flags, etc. Generating the parametric code can also be a user-guided process, which enables selecting the salient GPU parameters and leaving it to the tool to prune unlikely parameter combinations and focus on a small set of values that are assessed using empirical evaluation. Auto-tuning results may change depending on the grid problem size, which favors an integrated approach that benefits from the cumulative knowledge of all previous attempts when scaling up the code from one level to the next. IAT also increases the portability of the solution, as auto-tuning can be done for different architectures without the need for code recompilation.

  6. Dynamic Load Balancing (DLB): ISL should support DLB to tolerate load unbalancing in iterative algorithms. The DLB feature should be scalable and hence should avoid the use of explicit lock-based synchronization. Experience shows that hiding synchronization overheads by running some computation that must be done anyway is a profitable alternative in designing linear algebra solvers. DLB should offer different user-selectable options for enhanced load balancing, such as allowing the overlap of different iterations with proper management, work queues, and work-stealing techniques.

  7. Backward compatibility and extendibility: ISL should culminate in a viable software system focusing on backward compatibility and ease of maintenance. Specifically, it should be extendable to allow the addition of new data structures, optimizations, and solvers that have proved to be efficient in optimizing SGCs. For code simplicity, generality, and reusability, ISL may be based on an object-oriented (OO) design.

  8. Profiling and debugging: ISL should provide software tools to let users visualize the key bottlenecks in the code. It should also facilitate debugging when the code crashes; for this purpose, it should store the necessary core files, including all stacks at the time of the crash. Debugging can be simulated interactively on a small number of cores.
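
As announced in feature 1, the following is a purely hypothetical sketch of what an IDSI-style grid/stencil description could look like for the 7-point stencil used later in Sect. 5.1. None of these types exist in a current library; they only illustrate the level of abstraction argued for above, from which the tool would derive the sparsity pattern and a shared offset vector for the SpMV kernel.

struct StencilEntry { int di, dj, dk; };      // offset to a neighbouring cell

struct GridSpec {
    int nx, ny, nz;                           // structured-grid dimensions
    int componentsPerCell;                    // number of state variables per cell
    StencilEntry stencil[7];                  // 7-point (hepta-diagonal) stencil
    // boundary handling: neighbours falling outside the grid are dropped,
    // which produces the zero blocks at plane boundaries discussed above
};

// example instance: a 64 x 64 x 32 grid with 2 state variables per cell
GridSpec reservoir = {
    64, 64, 32, 2,
    { {0,0,0}, {-1,0,0}, {1,0,0}, {0,-1,0}, {0,1,0}, {0,0,-1}, {0,0,1} }
};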

5 Proposed tools for integrated SGC library (ISL)

ISL is a complex library and cannot be designed all at once; instead, its individual components are designed gradually. This section presents our incremental contributions towards the development of ISL.

5.1 SpMV and BiCG-stab optimization for a class of Hepta-diagonal sparse matrices on GPU

A Structured Grid Development Tool (SGDT) [1] is proposed to customize the design of the solver algorithm for a forward reservoir simulation (FRS) for arbitrary grid size, stencil relationship, number of components, and boundary conditions. SGDT can be summarized as follows:

  1. Generalized Sparse Solver Matrix: Deriving the generalized hepta-diagonal matrix GH based on knowledge of the structured grid (J,H,I) together with the set of cell components, stencil relationships, initial and boundary conditions, and the FRS model for updating the data blocks after each solver iteration. To build the solver matrix, the SG is unfolded into a 1D form taking into account the grid dimensions and the number of cell components. The offsets from a grid cell to its stencil cells are computed and stored for use in SpMV when indexing the dot product. Taking advantage of the SG regularity, the solver matrix compactly stores the NZs as well as a common offset vector shared by all the rows. This enables the automatic generation of a family of sparse solver matrices as a function of the grid dimensions, the number of cell components, the stencil, and the initial and boundary conditions.

  2. Optimizing Sparse Matrix–Vector Multiply: Each SpMV result element is assigned to a separate thread, so neighboring blocks of results are implicitly assigned to threads within each warp. The SpMV optimization is based on optimizing the sparse matrix storage, coalescing memory accesses, and minimizing address calculation and indexing operations, which are the prerequisites for the design of an optimized SpMV CUDA kernel. Address computation is optimized by sharing an explicit offset vector among all threads, each of which is mapped to an identical number of SpMV results. Threads use coalesced accesses due to the row-wise access of the matrix NZs and retrieve the multiplicands using the shared offsets. SpMV is coded as a parametric module, which enables the use of code auto-tuning. Auto-tuning searches for a combination of GPU architectural parameters (grid, thread block, thread granule size, and other compiler flags) for the final optimization of the SpMV CUDA code. (A kernel sketch in this spirit is given after this list.)

  3. Optimizing the Solver: Most iterative solvers can be expressed using vector, vector–matrix, and vector scaling operations. The solver dependence graph helps in orchestrating the following parallelization steps: (1) identify and group the global synchronization points, (2) implement optimized operators such as SpMV over the synthesized sparse matrix, the inner product, vector addition and scaling, and the convergence condition, and (3) enhance vector data locality. As for SpMV, auto-tuning is carried out over the parametric solver code as a final optimization step.
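
The following kernel is a sketch in the spirit of items 1-2 above; the exact SGDT storage format of [1] may differ. Each row of the generalized hepta-diagonal matrix keeps exactly ND candidate non-zeros, stored diagonal-major in vals (zero-padded at the boundaries) so that consecutive threads read consecutive addresses, and a single offset vector shared by all rows gives the column of each candidate non-zero.

#define ND 7   // 7-point stencil, single state variable per cell

__global__ void spmv_ghd(const float *vals,      // ND x nRows, diagonal-major, zero-padded
                         const int   *offsets,   // ND column offsets, identical for all rows
                         const float *x, float *y, int nRows)
{
    __shared__ int off[ND];
    if (threadIdx.x < ND)
        off[threadIdx.x] = offsets[threadIdx.x]; // offset vector loaded once per block
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one result element per thread
    if (row >= nRows) return;

    float sum = 0.0f;
    for (int d = 0; d < ND; ++d) {
        int col = row + off[d];                    // index arithmetic replaces the per-element
        if (col >= 0 && col < nRows)               // column arrays of CSR-like formats
            sum += vals[d * nRows + row] * x[col]; // coalesced read of the d-th diagonal
    }
    y[row] = sum;
}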

For the forward petroleum oil and gas reservoir simulation, the application of a stencil relationship to the structured grid leads to a family of generalized hepta-diagonal solver matrices with some regularity and structural uniqueness [1]. A customized storage scheme that takes advantage of the generalized hepta-diagonal (GHD) sparsity pattern and stencil regularity is proposed. The invocation of numerical library operators relies on multiple kernel invocations, which causes a loss of data locality due to kernel exit and re-entry. An in-kernel execution model (IKEM) is therefore proposed, based on a lock-free inter-block synchronization. Thread blocks are assigned some independent computations to avoid repeatedly polling the global memory. Other optimizations enable combining reductions and collective write operations to memory. IKEM preserves vector data locality and avoids saving vector data back to memory and re-loading it on each kernel exit and re-entry. IKEM is suitable for many iterative solvers such as BiCG-stab and QMR. The experiments were run on a Tesla K20Xm hosted by an Intel Core i7 CPU. The SpMV performance (Flops) using GHD with IKEM is 3\(\times \) that of cuSPARSE with CSR or BSR storage and 1.25\(\times \) that with HYB or DIA storage, respectively. The number of structured grid cells is varied between \(8\times 10^3\) and \(2.6\times 10^5\), and each cell has 2, 4, or 8 state variables. Similarly, the performance (Flops) of the BiCG-stab solver using GHD with IKEM is 2.6\(\times \) that of BiCG-stab with CSR or BSR storage and 1.27\(\times \) that with HYB or DIA storage using cuSPARSE and cuBLAS library calls. The results show significant performance improvements in SpMV and the BiCG-stab solver in response to the proposed optimizations, compared to other implementations found in the literature that use standard sparse storages and numerical library calls involving multiple kernel invocations.

This work contributes towards three features of the ISL described in Sect. 4: the Intelligent Data Structure Interface (feature 1), synthesizing a custom storage scheme (feature 3), and optimizing the solver algorithm (feature 4).

5.2 Invocation of GPU device routines from OpenACC

Many numerical libraries (NLs) have been developed to help scientists port common scientific methods to GPU devices. Automated parallelization is needed to make NLs more accessible from high-level languages (HLLs). The interoperability between automated parallelization technologies is still underrated by researchers in terms of ease of use and performance. For example, calling the cuBLAS GPU interface from a GPU CUDA code region requires the use of handles to enable concurrency, similar to calling it from a CPU code region. To reduce the overhead, there is a need to eliminate the handle use from inside the HLL and to introduce an intermediate function that does not require a cuBLAS handle argument. Hence, an intermediate implementation has been proposed that performs the necessary calls to create a new cuBLAS handle, calls the actual library function using that handle instance, and then destroys it after the call. This way, the client code only needs to call the intermediate function, without any setup and destroy code for cuBLAS handles. Handle usage tends to follow a consistent pattern within each library; in the case of cuBLAS, it is consistent enough among functions to make it possible to generalize and automate a solution. Our approach is applicable to libraries that depend on pointers to opaque data structure handle arguments without corresponding support from the compiler, since these types of arguments stand as obstacles to such use. A solution [3] for generating a wrapper library that avoids the need for such arguments is proposed. In the evaluation, the proposed solution is showcased with a library paired with an OpenACC compiler that does not support it. Our tests show speedups that reach 2.5\(\times \) in some cases over the plain use of the cuBLAS host-based interface, while the speedup reached about 34\(\times \) with respect to the purely OpenACC-exclusive solution in some cases. Moreover, a decrease in code size of about 50% with respect to the OpenACC-exclusive approach was noted.
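
A minimal sketch of such an intermediate wrapper is shown below for cublasSgemm; it is our illustration of the idea, not the generated wrapper of [3], and a production version would likely cache the handle rather than recreate it on every call. The client (for example, an OpenACC region compiled without cuBLAS-handle support) calls a plain function, and handle creation and destruction are hidden inside it.

#include <cublas_v2.h>

extern "C" void sgemm_nohandle(int m, int n, int k,
                               float alpha, const float *dA, int lda,
                               const float *dB, int ldb,
                               float beta, float *dC, int ldc)
{
    cublasHandle_t handle;
    cublasCreate(&handle);                       // setup normally left to the caller
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    cublasDestroy(handle);                       // teardown also hidden from the caller
}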

This work facilitates the achievement of ISL feature 2, "Automatic assessment and selection of a library storage scheme". It also maps indirectly to ISL feature 7, "Backward compatibility and extendibility".

5.3 Restructuring tool with auto-tuning

The restructuring tool RT-CUDA [25] is based on a restructuring algorithm capable of converting a standard C program (input) into an optimized CUDA program (output). The proposed restructuring algorithm applies the best possible kernel optimizations and energy-aware rules. RT-CUDA hides the architectural details of the underlying GPU device, which helps traditional C programmers develop parallel programs in a fast and efficient manner. RT-CUDA supports the efficient development of sparse linear solvers, such as conjugate gradient, to be used in reservoir simulation software. It also includes API functions to allocate and initialize sparse matrices with random sparsity, as well as to read a matrix from a Matrix Market file to be used as input for the solver. The implementation of such a solver can be optimized using a combination of user-defined functions and invocations of highly optimized library functions, including cuBLAS and cuSPARSE functions. RT-CUDA integrates both the BA and DS optimizations for code transformations. The user-defined functions generated by RT-CUDA are restructured as parametric CUDA kernels that are then passed through an efficient auto-tuning mechanism to enhance the GPU resource utilization of the functions.

A subset of the restructuring tools was evaluated with various applications, including matrix addition (Madd), matrix multiplication (MM), matrix–vector multiplication (MV), vector–vector multiplication (VV), and also the recursive block matrix multiplications Strassen (S-MM) and Winograd (W-MM) [24]. The evaluation results show a significant improvement in the execution time of the parallel codes using the proposed integrated approach. The comparison of the different applications and tools is shown for an appropriate space size (N), with execution times normalized to keep the results within a comparable range.

Fig. 2 Tools comparison using LAPACK operators

Figure 2 shows the normalized execution times in milliseconds of different tools using a set of operators from the LAPACK benchmark suite for basic linear algebra operations, including Madd, MM, MV, and VV. The results show that our integrated approach obtained better performance for Madd and VV in comparison to the cuBLAS library functions. However, for more complex applications such as MM and MV, cuBLAS still has a significant performance advantage over our integrated approach. This is because the cublasSgemm and cublasSgemv functions have been developed with complex kernel optimizations hand-coded at a very low level, while at this stage the paper focuses on high-level CUDA kernel optimizations. Nevertheless, with the proposed high-level kernel optimizations, our integrated approach outperforms cuBLAS with a 45% improvement for Madd and a 2% improvement for VV. In addition, our integrated approach outperforms the GPGPU compiler with a 30% improvement for MM, a 99% improvement for MV, and a 50% improvement for VV. Also, the MV implementation in the GPGPU compiler produces value errors for large space sizes, while our integrated approach generates correct values for any space size. Moreover, our integrated approach outperforms the OpenACC implementation of the PGI compiler with a 42% improvement for Madd, a 99% improvement for MM, and approximately similar performance for MV and VV on large arrays.

Fig. 3 Integrated approach for recursive block matrix multiplication

Using the integrated approach, recursive block matrix multiplication algorithms based on Strassen (S-MM) and its Winograd (W-MM) variant were also implemented. These algorithms reduce the complexity of the canonical MM algorithm from \(O(N^3)\) to \(O(N^{2.8})\). The implementation uses a depth-first (DFS) traversal of the recursion tree where all cores work in parallel on each of the sub-matrices, which are computed in sequence. The DFS approach reduces the storage at the cost of large data motion to gather and aggregate the results. Our implementation uses a small set of basic algebra functions, invokes cuBLAS, and uses auto-tuning of the parametric kernel to improve resource occupancy [24]. The evaluation (Fig. 3) shows that our implementations of W-MM and S-MM with one recursion level outperform the cuBLAS 5.5 library implementation, being up to twice as fast for arrays satisfying \(N \ge 2048\) and \(N \ge 3072\), respectively. Based on the above, it is clear that the integrated S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic algebra libraries. This work contributes towards two ISL features: optimizing the solver algorithm (feature 4) and integrated auto-tuning (feature 5).

5.4 Conclusion

Billions of running threads are expected in the coming Exascale computing era. However, there is a mismatch between the rapidly growing computational power and the efficiency of the programs produced by current compilers and optimization tools. The scalable computing power of GPUs is highly essential for scientific simulations, especially for the class of structured grid computing (SGC). In order to achieve higher simulation accuracy, large-scale simulations with highly efficient code are required; these requirements are not met by current technologies. This paper identifies the GPU architectural optimizations and the techniques used in research and industry compilers to produce optimized code. It also identifies the missing optimizations for the efficient implementation of SGC algorithms such as iterative linear algebra solvers. Optimizing SGCs is found to be complex, error-prone, and to involve a variety of heterogeneous tools and techniques, which so far can be envisioned only from a research perspective. However, spreading the use of SGC on GPUs requires a deliberate effort to identify the automatic techniques required to alleviate this complexity and to integrate them within a well-engineered framework. This paper details these techniques and describes an integrated library with the essential functionalities required to ease the process of developing efficient storage and optimized code through a high-level interactive interface and intelligent domain-specific annotations.