1 Introduction

Recently, GPU-accelerated computer architectures have become very attractive for achieving high-performance execution of scientific applications at low cost [1, 2], especially for linear algebra computations [3, 4]. Unfortunately, the process of adapting existing software to such new architectures can be difficult. The Compute Unified Device Architecture (CUDA) programming interface can be used only for NVIDIA cards, while the use of OpenCL (Open Computing Language [5]) leads to a substantial increase in software complexity.

SPARSKIT is a well-known toolkit for manipulating and working with sparse matrices [6]. It is a very good example of a widely used and valuable software package written in Fortran. Unfortunately, it does not exploit modern computer architectures, especially GPU-accelerated multicore machines. A new implementation of the most important SPARSKIT routines for NVIDIA GPUs has been presented in [7].

The sparse matrix-vector product (SpMV) is a central part of many numerical algorithms [6, 8]. Many papers present rather sophisticated techniques for developing SpMV routines that exploit the underlying hardware of GPU-accelerated computers [9–13]. Unfortunately, these methods are rather complicated and usually machine-dependent. However, the results presented in [14] show that the simple SPARSKIT SpMV routines using the CSR (Compressed Sparse Row) format [6] can be easily and efficiently adapted to modern multicore CPU-based architectures. Loops in the source code can be easily parallelized using OpenMP directives [15, 16], while the rest of the work can be done by a compiler. Such parallelized SpMV routines achieve performance comparable to that of the SpMV routines available in libraries optimized by hardware vendors (e.g. Intel MKL).

OpenACC is a new standard for accelerated computing [17]. It offers compiler directives for offloading C/C++ and Fortran programs from a host to attached accelerator devices. Such simple directives allow developers to mark regions of source code for automatic acceleration in a vendor-independent manner [18]. However, it is sometimes necessary to apply high-level transformations of the source code to achieve reasonable performance [19–21]. Paper [22] shows attempts to apply OpenACC to accelerate SpMV. However, the authors consider only some modifications of the CSR format and apply other GPU-specific optimizations (such as communication hiding).

In this paper we show that the well-known SPARSKIT SpMV routines for the Ellpack-Itpack (ELL) and Jagged Diagonal (JAD) formats [6] can be easily and successfully adapted to a hybrid GPU-accelerated computing environment using OpenACC. We also show how to improve the performance of such programs by tuning data structures to exploit the hardware properties of GPUs and by applying some high-level transformations of the source code. The paper is structured as follows. Section 2 describes ELL and JAD – two formats which are suitable for GPU-accelerated computations. We show how to apply some basic source code transformations to obtain accelerated versions of the SpMV routines. In Sect. 3 we present pJAD – a new format whose SpMV routine outperforms the one for JAD. Section 4 discusses the results of experiments performed for a set of test matrices. We also compare the performance of our OpenACC-accelerated routines with the performance of SpMV for the HYB (ELL/COO) format [23]. Finally, in Sect. 5 we formulate general guidelines for the simple steps that should be taken to transform irregular source codes into OpenACC programs.

2 SPARSKIT and SpMV Routines

The ELL format for sparse matrices assumes fixed-length rows [24]. A sparse matrix with n rows and at most ncol nonzero elements per row is stored column-wise in two dense arrays of dimension \(n\times ncol\) (Fig. 1). The first array contains the values of the nonzero elements, while the second one contains the corresponding column indices.

Fig. 1. ELL format for sparse matrices
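For illustration (the small matrix below is our own example, not reproduced from the figure), consider a \(4\times 4\) matrix with at most \(ncol=3\) nonzero elements per row. In ELL it is stored as two dense \(4\times 3\) arrays – the values \(\mathbf{a}\) (with zeros used as padding) and the corresponding column indices \(\mathbf{ja}\) (padding positions conventionally repeat a valid index of the row):

$$\begin{aligned} A=\begin{pmatrix} 1 & 0 & 2 & 0\\ 0 & 3 & 0 & 4\\ 5 & 6 & 7 & 0\\ 0 & 0 & 0 & 8 \end{pmatrix}, \quad \mathbf{a}=\begin{pmatrix} 1 & 2 & 0\\ 3 & 4 & 0\\ 5 & 6 & 7\\ 8 & 0 & 0 \end{pmatrix}, \quad \mathbf{ja}=\begin{pmatrix} 1 & 3 & 3\\ 2 & 4 & 4\\ 1 & 2 & 3\\ 4 & 4 & 4 \end{pmatrix}. \end{aligned}$$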

Fig. 2. JAD format for sparse matrices

The JAD format removes the assumption of fixed-length rows [7]. The rows of a matrix are sorted in non-increasing order of the number of nonzero elements per row (Fig. 2). The matrix is stored in three arrays. The first array a contains the nonzero elements of the matrix (i.e. the jagged diagonals), while the second one (i.e. ja) contains the column indices of all nonzeros. Finally, the array ia contains the beginning position of each jagged diagonal. The number of jagged diagonals is stored in jdiag. Optionally, we can use an additional array rlen containing the lengths of all rows [11]. The elements of this array can be easily calculated (even in parallel) using the following formula

$$\begin{aligned} \mathbf {rlen}(i)=|\{ j: 1\le j \le \mathbf {jdiag} \wedge \mathbf {ia} (j+1)-\mathbf {ia}(j) \ge i \}|, \ \ i=1,\ldots ,n. \end{aligned}$$
(1)
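The following minimal sketch (our own, with declarations of n, jdiag, i and j omitted) shows how rlen can be computed from ia in parallel according to (1); since the rows are sorted, row i has as many nonzeros as there are jagged diagonals of length at least i.

```fortran
! Compute the row lengths rlen(1:n) from the jagged-diagonal
! pointers ia(1:jdiag+1), following Eq. (1).
!$acc parallel loop present(ia(1:jdiag+1), rlen(1:n)) private(j)
do i = 1, n
   rlen(i) = 0
   do j = 1, jdiag
      if (ia(j+1) - ia(j) >= i) rlen(i) = rlen(i) + 1
   end do
end do
```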

Figure 3 shows the Fortran subroutines which implement SpMV for the ELL and JAD formats. Note that the SPARSKIT subroutines amuxe and amuxj were originally written in Fortran 77; here we present their equivalents written in Fortran 90.

Fig. 3. SPARSKIT SpMV for ELL (left) and JAD (right)
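The listings themselves appear only in the figure; the following Fortran 90 sketch approximates the two subroutines based on the SPARSKIT reference implementations (argument names follow SPARSKIT conventions, but details such as declarations may differ from the actual figure).

```fortran
! Sketch of SpMV for ELL: y = A*x, values stored column-wise in
! a(na,ncol) with column indices in ja(na,ncol).
subroutine amuxe(n, x, y, na, ncol, a, ja)
  implicit none
  integer :: n, na, ncol, i, j, ja(na, ncol)
  real(8) :: x(n), y(n), a(na, ncol)
  y(1:n) = 0.0d0
  do j = 1, ncol
     do i = 1, n
        y(i) = y(i) + a(i, j)*x(ja(i, j))
     end do
  end do
end subroutine amuxe

! Sketch of SpMV for JAD: a(*) holds the jagged diagonals, ja(*) the
! column indices, ia(j) the starting position of the j-th diagonal.
subroutine amuxj(n, x, y, jdiag, a, ja, ia)
  implicit none
  integer :: n, jdiag, i, j, k, len, ja(*), ia(jdiag+1)
  real(8) :: x(n), y(n), a(*)
  y(1:n) = 0.0d0
  do j = 1, jdiag
     k = ia(j) - 1
     len = ia(j+1) - ia(j)
     do i = 1, len
        y(i) = y(i) + a(k+i)*x(ja(k+i))
     end do
  end do
end subroutine amuxj
```

Note that with JAD the result vector is produced in the sorted-row ordering and has to be permuted back to the original row ordering afterwards.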

OpenACC provides the parallel construct, which launches gangs executing in parallel. Gangs may support multiple workers that execute in vector or SIMD mode [17]. The standard also provides several constructs that can be used to specify the scope of data in accelerated parallel regions. It should be noted that proper data placement and carefully planned data transfers can be crucial for achieving reasonable performance of accelerated programs [19].

In our OpenACC program, the GPU is responsible for performing SpMV, while the host program has to read the data and initialize the computations. The accelerated subroutines accamuxe and accamuxj are presented in Fig. 4. From the developer's point of view, the OpenACC parallel construct together with the vector_length clause should be used to vectorize loops. In the case of amuxe, the simplest way to accelerate SpMV is to vectorize the loops in lines 6–8 and 10–12. Then, the loop in lines 9–13 would launch the generated kernel ncol times. However, it is better to apply the loop exchange. In accamuxe, the outermost loop (lines 9–16) is vectorized. Similarly we obtain accamuxj. In the case of this routine we can observe that the loop in lines 11–19 works on rows, thus we have to provide the length of each row in rlen. Note that to avoid unnecessary transfers, we use the clause present to specify that the data already exist in the device memory.

Fig. 4. Accelerated versions of SpMV for ELL (left) and JAD (right)
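Again, the exact code is given only in the figure; the sketch below approximates the accelerated routines under the assumptions stated above (loop exchange so that the loop over rows is vectorized, data declared present on the device, and a vector length chosen by the developer, here 128 as an example). The array extent nnz in accamuxj is passed explicitly in this sketch, which is our assumption, so that the data clauses have known bounds.

```fortran
! Sketch of accelerated SpMV for ELL (cf. Fig. 4, left). After the
! loop exchange the outer loop runs over rows and is vectorized; the
! inner loop over the ncol stored entries of a row is sequential.
subroutine accamuxe(n, x, y, na, ncol, a, ja)
  implicit none
  integer :: n, na, ncol, i, j, ja(na, ncol)
  real(8) :: x(n), y(n), a(na, ncol), t
  !$acc parallel present(a, ja, x, y) vector_length(128)
  !$acc loop gang vector private(t, j)
  do i = 1, n
     t = 0.0d0
     !$acc loop seq
     do j = 1, ncol
        t = t + a(i, j)*x(ja(i, j))
     end do
     y(i) = t
  end do
  !$acc end parallel
end subroutine accamuxe

! Sketch of accelerated SpMV for JAD (cf. Fig. 4, right). Each thread
! handles one (sorted) row; rlen(i) is the number of nonzeros of row i
! and the element of row i in jagged diagonal j is stored at ia(j)+i-1.
subroutine accamuxj(n, x, y, jdiag, nnz, a, ja, ia, rlen)
  implicit none
  integer :: n, jdiag, nnz, i, j, k, ja(nnz), ia(jdiag+1), rlen(n)
  real(8) :: x(n), y(n), a(nnz), t
  !$acc parallel present(a, ja, ia, rlen, x, y) vector_length(128)
  !$acc loop gang vector private(t, j, k)
  do i = 1, n
     t = 0.0d0
     !$acc loop seq
     do j = 1, rlen(i)
        k = ia(j) + i - 1
        t = t + a(k)*x(ja(k))
     end do
     y(i) = t
  end do
  !$acc end parallel
end subroutine accamuxj
```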

3 Optimizing SpMV Using pJAD Format

Our version of SpMV for JAD can be further optimized. We can improve memory access by aligning (padding) the columns of the arrays a and ja. Thus, we add several zero elements to each column so that each column's length is a multiple of a given bsize. Then, each block of threads works on rows of the same length. The number of elements in a and ja increases to a size bounded by \(n_{nz}+jdiag\cdot (bsize-1)\), where \(n_{nz}\) is the number of nonzero elements (Fig. 5). We call this modified format pJAD (i.e. padded JAD). Similar modifications have been introduced in [25]. However, Kreutzer et al. consider bsize equal to the length of a half-warp, which is specific to NVIDIA GPUs. They also assume that threads within a block can be responsible for processing various amounts of data. Figure 6 shows the source code of accpamuxj. Note that the array brlen contains the length of each block of rows of a given size bsize.

Fig. 5. pJAD format and its data structures
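The pJAD data structures shown in Fig. 5 can be built from the JAD arrays. The helper below is not part of the paper; it is our own hypothetical sketch (the names jad2pjad and pia are ours), and it only derives the padded diagonal pointers pia and the per-block row lengths brlen; copying values and column indices into the padded arrays (zeros and, e.g., repeated valid indices in the padding positions) is omitted.

```fortran
! Hypothetical sketch: derive pJAD pointers from JAD pointers.
! pia(j) is the start of the j-th padded diagonal (its length rounded
! up to a multiple of bsize); brlen(b) is the number of diagonals
! reaching block b, i.e. the common padded row length in that block.
subroutine jad2pjad(n, jdiag, bsize, ia, pia, brlen)
  implicit none
  integer :: n, jdiag, bsize, j, b, nblocks
  integer :: ia(jdiag+1), pia(jdiag+1), brlen(*)
  nblocks = (n + bsize - 1)/bsize
  pia(1) = 1
  do j = 1, jdiag
     pia(j+1) = pia(j) + ((ia(j+1) - ia(j) + bsize - 1)/bsize)*bsize
  end do
  do b = 1, nblocks
     brlen(b) = 0
     do j = 1, jdiag
        if (ia(j+1) - ia(j) >= (b-1)*bsize + 1) brlen(b) = brlen(b) + 1
     end do
  end do
end subroutine jad2pjad
```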

Fig. 6. Accelerated SpMV routine for pJAD format
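A sketch approximating accpamuxj under the same assumptions as above follows (pia and nnzp, the padded diagonal pointers and the padded number of stored elements, are our names, see the helper after Fig. 5; the actual listing in the figure may differ). In practice bsize should be chosen to match the vector length (e.g. a multiple of the warp size), so that all threads of a gang execute the same inner loop trip count.

```fortran
! Sketch of accelerated SpMV for pJAD (cf. Fig. 6). Threads of a gang
! work on rows of equal padded length, and consecutive threads read
! consecutive elements of a padded diagonal (coalesced access).
subroutine accpamuxj(n, x, y, bsize, jdiag, nnzp, a, ja, pia, brlen)
  implicit none
  integer :: n, bsize, jdiag, nnzp, i, j, k, b
  integer :: ja(nnzp), pia(jdiag+1), brlen((n+bsize-1)/bsize)
  real(8) :: x(n), y(n), a(nnzp), t
  !$acc parallel present(a, ja, pia, brlen, x, y) vector_length(128)
  !$acc loop gang vector private(t, j, k, b)
  do i = 1, n
     b = (i - 1)/bsize + 1        ! block of rows containing row i
     t = 0.0d0
     !$acc loop seq
     do j = 1, brlen(b)           ! padded row length shared by the block
        k = pia(j) + i - 1
        t = t + a(k)*x(ja(k))
     end do
     y(i) = t
  end do
  !$acc end parallel
end subroutine accpamuxj
```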

4 Results of Experiments

Our OpenACC implementation of the SpMV routines has been tested on a computer with two Intel Xeon X5650 processors (6 cores each with hyper-threading, 2.67 GHz, 48 GB RAM) and two NVIDIA Tesla M2050 cards (448 cores, 3 GB GDDR5 RAM with ECC off), running under Linux with NVIDIA CUDA Toolkit version 6.5 and PGI Accelerator server version 15.4, which supports OpenACC [26]. Table 1 summarizes the results obtained for a set of test matrices chosen from the Matrix Market [27] and the University of Florida Sparse Matrix Collection [28].

The set contains various matrices with different numbers of rows and nonzero elements. The largest, cage15, has over \(5\cdot 10^6\) rows and almost \(10^8\) nonzero elements. For each matrix we provide the number of rows (columns), the number of nonzero entries, the average number of nonzero entries per row, and the maximum number of nonzero elements within a row. We also show the performance (in GFLOPS) of the accelerated versions of SpMV for ELL, JAD and pJAD. The last column shows the performance of the CUSPARSE SpMV routine using HYB (i.e. the hybrid format [23]). In Table 1, the best performance for each matrix is underlined.

The HYB sparse storage format is composed of a regular part stored in ELL and an irregular part stored in COO. The CUSPARSE conversion operation from CSR to HYB partitions a given sparse matrix into the regular and irregular parts automatically or according to developer-specified criteria [23]. For our tests, we have chosen the first option.

Table 1. Results of experiments for a set of test matrices

We can observe that for almost all matrices the pJAD format achieves better performance than ELL and JAD. ELL outperforms pJAD only for the matrices cry1000, af23560, majorbasis, ecology2 and atmosmodl, where all rows have almost the same number of nonzero elements (i.e. \(n_{nz}/n \approx \max_{nz}\)), or where the number of nonzero elements is rather large in comparison with the number of rows (i.e. \(n\ll n_{nz}\), as for nd24k). It should be noted that for some matrices (pre2, torso1, inline_1) the ELL representation exceeds the memory capacity of the Tesla M2050. The performance of pJAD is slightly worse than the performance of HYB, because the pJAD format requires a re-permutation of the result's entries. For some matrices with \(n_{nz}/n \ll \max_{nz}\), pJAD outperforms HYB (i.e. af23560, bcsstk36, bbmat, cfd1, torso1, ldoor). Note that for cage15, the CUSPARSE routine for conversion from CSR to HYB failed because the memory capacity was exceeded.

5 Conclusions and Future Work

We have shown that the well-known SPARSKIT SpMV routines for the ELL and JAD formats can be easily and successfully adapted to a hybrid GPU-accelerated computing environment using OpenACC. Such routines achieve reasonable performance. Further improvements can be obtained by introducing new data formats for sparse matrices that exploit specific GPU hardware properties. Numerical experiments have confirmed that the performance of our optimized SpMV routines is comparable with the performance of the routine provided by the vendor. We have also discussed when the use of the considered formats is profitable. We believe that the use of OpenACC and accelerated Fortran routines can be attractive for people who prefer to develop applications using high-level directive-based programming techniques instead of the complicated CUSPARSE API.

The general guidelines for semiautomatic acceleration of irregular codes using OpenACC can be summarized as follows:

1. Define regions where data should exist on accelerators. Try to reduce transfers between host and accelerators.

2. Try to vectorize outermost loops within your code. Vectorized loops should have sufficient computational intensity, namely the ratio of the number of computational operations to the number of memory operations should be greater than one.

3. If necessary, apply loop exchange and inform the compiler that loops are safe to parallelize using the independent clause in OpenACC loop constructs.

4. Try to keep threads within gangs (or blocks in terms of CUDA) working on the same amount of data.

5. The best performance occurs when coalesced memory access takes place [29, 30]. Threads within gangs should operate on contiguous data blocks.

6. Tune your data structures by aligning data in arrays. It can be done by data structure padding.

In the future, we plan to implement some other important routines from SPARSKIT, especially the well-known solvers for sparse systems of linear equations. We also plan to implement multi-GPU support using OpenACC and OpenMP [31]. The full software package will soon be made available to the community.