1 Introduction and Background

Solving linear systems of equations and eigenvalue problems is fundamental to scientific computing. The popular LAPACK library [3], and in particular its vendor optimized implementations like Intel’s MKL [8] or AMD’s ACML [2], have been the libraries of choice to provide these solvers for dense matrices on shared memory systems. This paper considers a redesign of the LAPACK algorithms and their implementation to add efficient support for heterogeneous systems of multicore processors with Intel Xeon Phi coprocessors. This is not the first time that DLA libraries have needed a redesign to be efficient on new architectures – notable examples being the move from LINPACK [7] to LAPACK [3] in the 80’s to make algorithms cache friendly, ScaLAPACK [6] in the 90’s to support distributed memory systems, and now the PLASMA and MAGMA libraries [1] targeting efficiency on multicore and heterogeneous architectures, respectively.

The development of new high-performance numerical libraries is complex, accounting for the extreme level of parallelism, heterogeneity, and wide variety of accelerators and coprocessors available in current architectures. Challenges vary from new algorithmic designs to choices of programming models, languages, and frameworks that ease development, future maintenance, and portability. This paper addresses these issues while presenting our approach and algorithmic designs in the development of the MAGMA MIC [9] library.

To provide uniform portability across a variety of coprocessors/accelerators, we developed an API that abstracts the application developer from the low-level specifics of the architecture. In particular, we use low-level vendor libraries, like SCIF for the Intel Xeon Phi (see Sect. 3), to define an API for memory management and for off-loading computations to coprocessors and/or accelerators.

To deal with the extreme level of parallelism and heterogeneity in current architectures, MAGMA MIC uses a hybridization methodology, described in Sect. 4, where we split the algorithms of interest into computational tasks of various granularities, and properly schedule those tasks’ execution over the heterogeneous hardware. Thus, we use a Directed Acyclic Graph (DAG) approach to parallelism and scheduling that has been developed and successfully used for dense linear algebra libraries such as PLASMA and MAGMA [1], as well as in general task-based approaches to parallelism, such as runtime systems like StarPU [4] and SMPSs [5].

Besides the use of high-performance low-level libraries, addressed in Sect. 3, obtaining high performance depends on a combination of algorithmic and hardware-specific optimizations, discussed in Sect. 4.3. The implication for the software, in order to maintain its performance portability across hardware, is the need to build into it algorithmic variations that are tunable, e.g., at installation time. This is the basis of autotuning, an example of these advanced optimization techniques.

A performance study is presented in Sect. 5. Besides verifying our approaches and confirming the appeal of the Intel Xeon Phi coprocessors for high-performance DLA, the results open up a number of future work opportunities discussed in our conclusions.

2 Compiler Support for Offload

The primary mode of operation for the Xeon Phi coprocessor is the off-load mode. The device receives work from the host processor and reports back upon completion of the assignment without the host being involved in between these two events. This is very similar to the operation of network off-load engines, specifically, the TCP Off-load Engines (TOEs) that feature an optimized implementation of the TCP stack that handles the majority of the network traffic to lessen the burden of the main processor, which handles other operating system and user application tasks.

This way of using the Xeon Phi device has direct support from the compiler in that it is possible to issue requests to the device and ascertain the completion of tasks directly from the user’s C/C++ code. The support for this mode of operation is offered by the Intel compiler through Phi-specific pragma directives: offload, offload_attribute, offload_transfer, and offload_wait.
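
For illustration, here is a minimal sketch of this pragma-based offload, assuming the Intel C compiler with the MPSS stack installed; the scaling loop is our own toy example, not code from MAGMA:

#include <stdio.h>

int main(void)
{
    const int n = 1024;
    double x[1024];
    for (int i = 0; i < n; i++) x[i] = 1.0;

    /* inout tells the compiler to copy x to the device before the
       block runs and back to the host afterwards; the block itself
       executes on the coprocessor */
    #pragma offload target(mic) inout(x : length(n))
    {
        for (int i = 0; i < n; i++) x[i] *= 2.0;
    }

    printf("x[0] = %f\n", x[0]);  /* prints 2.000000 */
    return 0;
}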

3 Programming Model: Host-Device with a Server Based on LLAPI

For many scientific applications, the offload model offered by the Intel compiler, described in Sect. 2, is sufficient. This is not the case for a fully equivalent port of MAGMA to the Xeon Phi because of the very rich functionality that MAGMA inherits from both its CUDA and OpenCL ports. We had to use the LLAPI (Low-Level API) based on the Symmetric Communication Interface (SCIF), which offers, as the name suggests, a very low-level interface to the host and device hardware. The use of this API is discouraged for most workloads as it tends to be error-prone and offers very little abstraction on top of the hardware interfaces. What motivated us to use it for the port of our library was: (1) the asynchronous events capability that allows low-latency messaging between the host and the device to notify about the completion of kernels on the Xeon Phi; and (2) the possibility of hiding the cost of data transfers between the host and the device by overlapping the transfer of submatrices with computation. The direct access to the DMA (Direct Memory Access) engine allowed us to maximize the bandwidth of data transfers over the PCI Express bus. The only requirement was that the memory regions for transfer be page-aligned and pinned to guarantee their fixed location in physical memory. Figure 1 shows the interaction between the host and the server running on the Xeon Phi, which responds to requests that are remote invocations of numerical kernels on data that have already been transferred to the device.
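
A minimal sketch of the host side of this pattern is shown below: a SCIF client connects to a server on the device (node 1) and exchanges small messages to request a kernel and learn of its completion. SERVER_PORT and the request structure are hypothetical placeholders, not MAGMA's actual protocol, and error handling is elided:

#include <scif.h>
#include <stdint.h>

#define SERVER_PORT 2050   /* assumed port the device server listens on */

struct request { int kernel_id; uint64_t arg_offset; };

int main(void)
{
    scif_epd_t epd = scif_open();
    struct scif_portID dst = { .node = 1, .port = SERVER_PORT };
    int ack;

    scif_connect(epd, &dst);   /* connect to the LLAPI server on the device */

    struct request req = { .kernel_id = 1, .arg_offset = 0 };
    scif_send(epd, &req, sizeof(req), SCIF_SEND_BLOCK);  /* low-latency request */
    scif_recv(epd, &ack, sizeof(ack), SCIF_RECV_BLOCK);  /* completion notice   */

    scif_close(epd);
    return 0;
}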

Fig. 1. (a) MAGMA MIC programming model with a LLAPI server mediating requests between the host CPU and the Xeon Phi device. (b) DLA algorithm as a collection of BLAS-based tasks and their dependencies. The algorithm’s critical path is, in general, scheduled on the CPUs, and large data-parallel tasks on the Xeon Phi.

4 Hybridization Methodology and Optimization Strategies

The hybridization methodology used in MAGMA [10] is an extension of the task-based approach for parallelism and developing DLA on homogeneous multicore systems [1]. In particular,

  • The computation is split into BLAS-based tasks of various granularities, with their data dependencies, as shown in Fig. 1b.

  • Small, non-parallelizable tasks with significant control-flow are scheduled on the CPUs.

  • Large, parallelizable tasks are scheduled on Xeon Phi.

The difference from multicore algorithms lies in the task splitting, where tasks of various granularities are created so that each is suited to a particular architecture, and in the scheduling itself. Specific algorithms using this methodology, and covering the main classes of DLA, are described in the subsections below.

4.1 Design and Functionality

The MAGMA interface is similar to LAPACK’s. For example, compare LAPACK’s LU factorization interface with MAGMA’s:

lapackf77_dgetrf(&M,&N, hA,   &lda, ipiv,   &info)

magma_dgetrf_mic(M, N, dA,0, ldda, ipiv,   &info, queue)

Here hA is the typical CPU pointer (double *) to the matrix of interest in the CPU memory and dA is a pointer to the matrix in the Xeon Phi memory (magmaDouble_ptr). The last argument in every MAGMA call is a Xeon Phi queue (magma_queue_t), through which the computation will be streamed on the Xeon Phi device.

To shield the user from low-level directives, the main functions, such as BLAS, CPU–Phi data transfers, and memory allocations and deallocations, are redefined in terms of MAGMA data types and functions. This design allows us to more easily port the MAGMA library to other devices, such as GPU accelerators using either CUDA or OpenCL, and eventually to merge the ports while maintaining a single source. Also, the MAGMA wrappers provide a complete set of functions for programming hybrid high-performance numerical libraries. Thus, not only users but application developers as well can opt to use the MAGMA wrappers. MAGMA provides the standard four floating-point arithmetic precisions – single real, double real, single complex, and double complex. There are routines for the so-called one-sided factorizations (LU, QR, and Cholesky); we are currently developing the two-sided factorizations (Hessenberg, bi-, and tridiagonal reductions), linear system and least squares solvers, matrix inversions, symmetric and nonsymmetric standard eigenvalue problems, SVD, and orthogonal transformation routines.
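
As a usage sketch, a hybrid LU factorization through the wrappers might look as follows. The names magma_queue_create, magma_malloc, magma_dsetmatrix, magma_dgetmatrix, and magma_free are assumed by analogy with MAGMA’s CUDA and OpenCL ports; only the magma_dgetrf_mic signature is taken from the call shown above, and error checking is elided:

#include <stdlib.h>
#include "magma.h"

void hybrid_lu(int M, int N, double *hA, int lda)
{
    magma_queue_t queue;
    magmaDouble_ptr dA;
    int ldda = ((M + 31) / 32) * 32;       /* padded leading dimension */
    int minmn = (M < N) ? M : N;
    int *ipiv = malloc(minmn * sizeof(int));
    int info;

    magma_queue_create(&queue);            /* assumed wrapper name */

    /* allocate device memory and upload the matrix to the Xeon Phi */
    magma_malloc((void **)&dA, (size_t)ldda * N * sizeof(double));
    magma_dsetmatrix(M, N, hA, lda, dA, 0, ldda, queue);

    /* hybrid factorization: panels on the CPU, updates on the Phi */
    magma_dgetrf_mic(M, N, dA, 0, ldda, ipiv, &info, queue);

    /* download the LU factors back into hA */
    magma_dgetmatrix(M, N, dA, 0, ldda, hA, lda, queue);

    magma_free(dA);
    magma_queue_destroy(queue);
    free(ipiv);
}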

4.2 LU, QR, and Cholesky Factorizations

The one-sided factorization routines implemented and currently available through MAGMA are:

magma_zgetrf_mic :

computes an LU factorization of a general M-by-N matrix \(A\) using partial pivoting with row interchanges;

magma_zgeqrf_mic :

computes a QR factorization of a general M-by-N matrix \(A\);

magma_zpotrf_mic :

computes the Cholesky factorization of a complex Hermitian positive definite matrix \(A\).

Routines in all four standard floating-point precisions are available, following LAPACK’s naming convention. Namely, the first letter of the routine name (after the prefix magma_) indicates the precision – z, c, d, or s for double complex, single complex, double real, or single real, respectively. The suffix _mic indicates that the input matrix and the output are in the Xeon Phi memory.

4.3 Hybrid Implementation and Optimization Techniques

In order to explain our hybrid methodology and the optimizations that we developed, let us give a detailed analysis of the QR decomposition algorithm. While the description below only addresses the QR factorization, it is straightforward to derive the analysis for both the Cholesky and LU factorizations with the same ideas. We start by briefly recalling the description of the QR algorithm.

The QR factorization is a transformation that factorizes an \(m \times n\) matrix \(A\) into its factors \(Q\) and \(R\), where \(Q\) is a unitary matrix of size \(m \times m\) and \(R\) is an upper triangular matrix of size \(m \times n\). The QR algorithm can be described as a sequence of steps where, at each step, a QR factorization of a panel is performed by accumulating a number of Householder transformations in what is called the “panel factorization”; these are then applied all at once by means of high-performance Level 3 BLAS operations in what is called the “trailing matrix update”. Although this approach can exploit the parallelism of the Level 3 BLAS during the trailing matrix update, it has a number of limitations when implemented on a massively multithreaded system such as the Intel Xeon Phi coprocessor, due to the nature of its operations. On the one hand, the panel factorization relies on Level 2 BLAS operations that cannot be efficiently parallelized on either the Xeon Phi or any accelerator such as GPU-based architectures, and thus it can be considered close to a sequential operation that limits the scalability of the algorithm. On the other hand, this algorithm is referred to as a fork-join approach, since the execution flow shows a sequence of sequential operations (panel factorizations) interleaved with parallel ones (trailing matrix updates).

In order to take advantage of the high execution rate of a massively multithreaded system, in particular the Phi coprocessor, we redesigned the standard algorithm to perform the Level 3 BLAS operations (trailing matrix update) on the Xeon Phi while performing the Level 2 BLAS operations (panel factorization) on the CPU. We also proposed an algorithmic change to remove the fork-join bottleneck and to minimize the overhead of the panel factorization by hiding its cost behind the parallel trailing matrix update. This approach can be described as a “scalable lookahead technique”. Our idea is to split the trailing matrix update into two phases: the update of the lookahead panel (the panel of step \(i+1\), i.e., the dark blue portion of Fig. 2) and the update of the remaining trailing submatrix (the clear blue portion of Fig. 2). Thus, during the submatrix update the CPU can asynchronously receive the panel of step \(i+1\) and perform its factorization. As a result, our MAGMA implementation of the QR factorization can be described by the sequence of three phases below. Consider a matrix \(A\) that can be represented as:

$$\begin{aligned} A = \left( \begin{array}{ccc} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{array} \right) , \end{aligned}$$
(1)
  • Phase 1, the panel factorization: at step \(i\), this phase consists of a QR transformation of the panel \(A_{i:n,i}\), as in Eq. 2. This operation consists of calling two routines: DGEQR2, which factorizes the panel and produces \(nb\) Householder reflectors (\(V_{*i}\)) and an upper triangular matrix \(R_{ii}\) of size \(nb \times nb\) that is a portion of the final \(R\) factor; and DLARFT, which generates the triangular matrix \(T_{ii}\) of size \(nb \times nb\) used for the trailing matrix update. This phase is performed on the CPU.

    $$\begin{aligned} \left( \begin{array}{c} {A}_{11} \\ {A}_{21} \\ {A}_{31} \\ \end{array} \right) \Longrightarrow \left( \begin{array}{c} {V}_{11} \\ {V}_{21} \\ {V}_{31} \\ \end{array} \right) , \left( R_{1,1} \right) , \left( T_{1,1} \right) . \end{aligned}$$
    (2)
  • Phase 2, the lookahead panel update: the transformation that was computed in the panel factorization needs to be applied to the rest of the matrix (the trailing matrix, i.e., the blue portion of Fig. 2). This phase consists of updating only the next panel (the dark blue portion of Fig. 2), in order to let the CPU start its factorization as soon as possible, while the update of the remaining portion of the matrix is performed in Phase 3. The idea is to hide the cost of the panel factorization. This operation, presented in Eq. 3, is performed on the Phi coprocessor and involves the DLARFB routine, which has been redesigned as a sequence of DGEMMs to better take advantage of the Level 3 BLAS operations.

    $$\begin{aligned} \left( \begin{array}{c} {R}_{12} \\ {\tilde{A}_{22}} \\ {\tilde{A}_{32}} \\ \end{array} \right) = \left( I - V_{*i} T_{ii}^{T} V_{*i}^{T} \right) \left( \begin{array}{c} {A}_{12} \\ {A}_{22} \\ {A}_{32} \\ \end{array} \right) . \end{aligned}$$
    (3)
  • Phase 3, the trailing matrix update: similarly to Phase 2, this phase consists of applying the Householder reflectors generated during the panel factorization of step \(i\), according to Eq. 3, but to the remaining portion of the matrix (the trailing submatrix, i.e., the clear blue portion of Fig. 2). This operation is also performed on the Phi coprocessor while, in parallel with it, the CPU performs the factorization of the panel of step \(i+1\) that was updated in Phase 2. A schematic of the resulting loop is sketched after this list.
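
To make the three phases concrete, here is a C-style schematic of the hybrid lookahead loop. It is pseudocode: helper names such as upload, download, panel, lookahead_panel, and trailing_submatrix are illustrative placeholders, not MAGMA’s actual internal API.

for (int i = 0; i < N; i += nb) {
    /* Phase 1 (CPU): factorize the current panel, which was updated
       and downloaded during the previous iteration's Phase 2 */
    dgeqr2(panel(i));                     /* Householder reflectors V, R_ii */
    dlarft(panel(i), T_ii);               /* triangular factor T_ii */
    upload(V, T_ii, queue);               /* asynchronous host-to-Phi transfer */

    /* Phase 2 (Phi): apply the reflectors only to the lookahead panel,
       then send it back so the CPU can start factorizing it */
    dlarfb(lookahead_panel(i));           /* cast as a sequence of DGEMMs */
    download(lookahead_panel(i), queue);  /* asynchronous Phi-to-host transfer */

    /* Phase 3 (Phi): update the remaining trailing submatrix; this runs
       concurrently with the CPU factorizing the panel of step i+1 */
    dlarfb(trailing_submatrix(i));
}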

This hybrid technique of distributing tasks between the CPU and the Phi allows us to hide the memory-bound operations that occur during the panel factorization (Phase 1) by performing them on the CPU in parallel with the trailing submatrix update (Phase 3) on the Phi coprocessor. However, one of the key parameters for performance tuning is the blocking size, as the performance and the overlap between the CPU and the Phi are solely guided by it. Figure 2b illustrates the effect of the blocking factor on the performance. A small \(nb\) reduces the cost of the panel factorization of Phase 1, but it decreases the efficiency of the Level 3 BLAS kernels of Phases 2 and 3, resulting in poor performance. Conversely, a large \(nb\) dramatically slows down the panel factorization of Phase 1, so that the CPU and Phi computations can no longer be overlapped, again degrading the performance, as shown in Fig. 2b.

As a consequence, the challenge is the following: on the one hand, the blocking size \(nb\) needs to be large enough to extract high performance from the Level 3 BLAS of Phase 3; on the other hand, it has to be small enough to extract efficiency (thanks to the cache speedup) from the Level 2 BLAS of Phase 1 and to overlap the CPU and Phi computations. Figure 2b shows the performance obtained for different blocking sizes, and we can see a trade-off between small and large \(nb\)’s. Either \(nb=480\) or \(nb=960\) can be considered a good choice, because the MKL Phi BLAS is optimized for multiples of \(240\). Moreover, to extract the maximum performance and allow the maximum overlap between the CPU and the Xeon Phi coprocessor, we developed a new variant that can use a variable \(nb\) during the steps of the algorithm. The flexibility of our implementation allows an efficient task execution overlap between the CPU host and the Phi coprocessor, which enables the algorithm to scale almost perfectly on the Phi coprocessor and provides very good performance, close to the practical peak obtained on such a system. Our tuned variable implementation is represented by the red curve of Fig. 2b, where we can easily observe its advantage over the other variants.
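
As an illustration of the variable-\(nb\) idea, a hypothetical selection heuristic might look as follows; the thresholds below are illustrative assumptions only, since the actual values are chosen by tuning, e.g., at installation time:

/* Hypothetical variable blocking-factor schedule; the thresholds
   are illustrative, the real ones come from autotuning. */
int choose_nb(int remaining_cols)
{
    if (remaining_cols > 10000) return 960;  /* large trailing matrix:
                                                favor the Level 3 BLAS rate */
    if (remaining_cols > 4000)  return 480;  /* medium: balance panel cost
                                                against update rate */
    return 240;                              /* small: keep the CPU panel
                                                factorization cheap (MKL Phi
                                                BLAS favors multiples of 240) */
}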

Fig. 2. Effect of the blocking factor (Color figure online).

5 Performance Results

This section presents the performance results obtained by our hybrid CPU–Xeon Phi implementation in the context of developing state-of-the-art numerical linear algebra libraries.

5.1 Experimental Environment

Our experiments were performed on a system equipped with an Intel Xeon Phi coprocessor. It is representative of a vast class of servers and workstations commonly used for computationally intensive workloads.

The host is an Intel multicore system with two 8-core Intel Xeon E5-2670 (Sandy Bridge) sockets, each core running at 2.6 GHz. Each socket has a 24 MB shared L3 cache, and each core has a private 256 KB L2 and 64 KB L1 cache. The system is equipped with 52 GB of memory. The theoretical peak for this architecture in double precision is 20.8 Gflop/s per core, giving 332.8 Gflop/s in total. The system is also equipped with an Intel Xeon Phi card with 7.7 GB of memory, running at 1.09 GHz and giving a double precision theoretical peak of 1046 Gflop/s.
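
These peaks follow directly from the clock rates and the per-core SIMD width. As a sanity check, assuming 8 double-precision flops per cycle per core on the Sandy Bridge host (AVX: 4 adds + 4 multiplies) and 16 on the Phi (512-bit vectors with fused multiply-add), with 16 and 60 active cores respectively:

$$\begin{aligned} 16 \times 2.6\,\mathrm{GHz} \times 8 = 332.8\ \mathrm{Gflop/s} \;\text{(host)}, \qquad 60 \times 1.09\,\mathrm{GHz} \times 16 = 1046.4\ \mathrm{Gflop/s} \;\text{(Phi)}. \end{aligned}$$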

On the software side, the following packages were used. On the CPU we used MKL (Math Kernel Library) [8], a commercial package from Intel that is a highly optimized numerical library. On the Intel Xeon Phi side, we used MPSS 2.1.5889-16 as the software stack, icc 13.1.1 20130313 (which comes with the composer_xe_2013.3.163 suite) as the compiler, and the BLAS-3 routine GEMM from MKL 11.00.03.

5.2 Performance Results

Figure 3 reports the performance of the three amigos of dense linear algebra – the Cholesky, QR, and LU factorizations – with our hybrid implementation, and compares it to the performance of the CPU implementation in the MKL libraries. For our implementation, the blocking factor has been chosen to be flexible in order to achieve the best performance; as a reference, it is in the range of 480–960, as described in Sect. 4.3. The graphs show the performance measured using all the cores available on the system (i.e., 60 for the Intel Phi and 16 for the CPU) with respect to the problem size. In order to reflect the completion time, for each algorithm the operation count is assumed to be the same as that of the LAPACK algorithm (i.e., \(\frac{1}{3}N^3\), \(\frac{2}{3}N^3\), and \(\frac{4}{3}N^3\) for the Cholesky factorization, the LU factorization, and the QR decomposition, respectively).
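
Concretely, the plotted rate is this nominal operation count divided by the measured time; for example, for an LU factorization of a matrix of size \(N\) completed in \(t\) seconds:

$$\begin{aligned} \mathrm{Gflop/s} = \frac{\tfrac{2}{3}N^{3}}{t \cdot 10^{9}}. \end{aligned}$$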

Fig. 3. Comparison of the performance versus the optimized CPU version of the MKL libraries for the three amigos.

Figure 3a, b, and c provide roughly the same information: our MAGMA algorithms with hybrid techniques deliver higher execution rates than their CPU-optimized counterparts. Such a comparison is not entirely fair; our goal is not to compare, but rather to show the boost that the hybrid CPU + Phi coprocessor implementation provides over a CPU-only implementation. The figures show that the MAGMA hybrid algorithms are capable of completing any of the three amigos twice as fast as the CPU-optimized version for matrices of size larger than 10000, and more than three times faster when the matrix size is large enough (larger than 20000). The curves of Fig. 3 illustrate the efficiency of our hybrid techniques: the performance obtained by our implementation comes very close to the practical peak of the Intel Xeon Phi coprocessor, computed by running the GEMM routine (which is around 850 Gflop/s). This gain is mostly obtained through two improvements. First, the operations involved on the Phi side are mostly Level 3 BLAS operations, redesigned and redeveloped as combinations of DGEMMs. In more detail, the routines executed on the Xeon Phi coprocessor are:

  • The DSYRK operations for the Cholesky factorization, where DSYRK has been redesigned as a combination of DGEMM routines;

  • The DGEMM operations for the LU factorization;

  • The DLARFB operations for the QR decomposition, where DLARFB has also been redesigned as a combination of DGEMMs.

Second, all of the Level 2 BLAS routines that are memory-bound and that represent a performance limit (i.e., DPOTF2, DGETF2, and DGEQR2 for the Cholesky, LU, and QR factorizations, respectively) are executed on the CPU side while being overlapped with the Phi coprocessor execution, as described in Sect. 4.3.

An important remark has to be made here for the Cholesky factorization: the left-looking algorithm, as implemented in LAPACK, is considered well optimized for memory reuse, but at the price of less parallelism, and is thus not suitable for massively multicore machines. This variant delivers poor performance when compared to the right-looking variant, which allows more parallelism and thus runs at higher speed.

6 Conclusions and Future Work

In this article, we have shown how to extend our hybridization methodology from existing systems to a new hardware platform. The challenge of the porting effort stemmed from the fact that the new coprocessor from Intel, the Xeon Phi, features programming models and relative execution overheads that are markedly different from those we have been targeting on GPU-based accelerators. Nevertheless, we believe that the techniques used in this paper adequately adapt our hybrid algorithms to best take advantage of the new heterogeneous hardware. We have derived an implementation schema for the dense linear algebra kernels that can also be applied either to the two-sided factorizations used for solving the eigenproblem and the SVD, or to sparse linear algebra algorithms. We plan to further study the implementation of multi-Xeon Phi algorithms in a distributed computing environment. We believe that the techniques presented will become more popular and will be integrated into dynamic runtime system technologies. The ultimate goal is that this integration will help to tremendously decrease development time while retaining high performance.