1 Batch LU Decomposition

While comparatively expensive, direct solvers based on matrix decomposition are preferred over iterative solvers in various applications, for reasons of numerical stability. The implementation presented shortly was originally devised for the simultaneous solution of many decoupled systems [4], in what amounts to a domain decomposition approach [6]. The LU decomposition also provides a viable method for calculating the matrix determinant; after execution of an in-place implementation, the determinant is available as the product of the diagonal elements. This is particularly useful in condensed matter physics, specifically in studies of the fractional quantum Hall effect based on construction of the Pfaffian wave function, which require O(N!) determinant evaluations [9, 10].
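As a brief illustration of the determinant use case, the following host-side sketch (not taken from the chapter's listings; the function and parameter names are hypothetical) forms the determinant from the diagonal of an in-place LU factorization together with the recorded sign of the row permutation.

```cuda
#include <cuComplex.h>

/* Hypothetical helper: given a matrix overwritten in place by its LU factors
 * (Crout, unit diagonal in L, row-major storage) and the parity of the row
 * permutation, the determinant is the signed product of the diagonal of U. */
cuFloatComplex lu_determinant(const cuFloatComplex *lu, int n, float permSign)
{
    cuFloatComplex det = make_cuFloatComplex(permSign, 0.0f);
    for (int i = 0; i < n; ++i)
        det = cuCmulf(det, lu[i * n + i]);
    return det;
}
```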

1.1 Theory

The decomposition of a matrix A into a lower triangular matrix L (elements \(\alpha_{ij}\)) and an upper triangular matrix U (elements \(\beta_{ij}\)),

$$\displaystyle{ \mathbf{LU} = \mathbf{A}, }$$
(4.1)

has the advantage of permitting the solution of linear systems in two steps, consisting of forward and backward substitution procedures, for multiple right hand sides in \(\mathbf{A}\mathbf{x} = \mathbf{y}\). Crout's approach to LU decomposition solves the set of equations implicit in Eq. (4.1); these are:

$$\displaystyle{ \beta _{ij} = a_{ij} -\sum _{k=1}^{i-1}\alpha _{ ik}\beta _{kj}, }$$
(4.2)

and

$$\displaystyle{ \alpha _{ij} = \frac{1} {\beta _{jj}}\left (a_{ij} -\sum _{k=1}^{j-1}\alpha _{ ik}\beta _{kj}\right ). }$$
(4.3)

Numerical stability relies on a suitable choice of pivot, the dividing element in the solution for \(\alpha_{ij}\). Pivoting may be partial (a row interchange) or full (both row and column interchanges); the former is implemented in this chapter. Following the approach detailed in Numerical Recipes [5], the best pivot is chosen only after both Eqs. (4.2) and (4.3) have been solved for a given column, after which the row swap and a scaling are performed. The row permutations must be recorded in a separate vector for use in the solution of linear equations, so that the right hand side vector can subsequently be rearranged to match. Equations (4.2) and (4.3) constitute \(N^2\) equations in \(N^2 + N\) unknowns; this underdetermined system permits N elements to be set arbitrarily. A popular choice, followed in this chapter, is to set the diagonal elements of \(\alpha\) to unity. Crout's approach to LU decomposition is summarized in Algorithm 1.

Algorithm 1 LU decomposition with partial pivoting
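Since the pseudocode of Algorithm 1 is not reproduced here, the following serial sketch indicates its structure: scaling information is gathered first, then each column is processed according to Eqs. (4.2) and (4.3) with a partial pivot chosen from the scaled candidates, in the manner of Numerical Recipes [5]. Real double precision data is used for brevity (the chapter's routines operate on complex single precision), and all names are illustrative.

```cuda
#include <math.h>
#include <stdlib.h>

/* Single-matrix CPU sketch of Crout's method with partial (row) pivoting,
 * following the structure of Algorithm 1. a is n x n, row-major, overwritten
 * in place by L (unit diagonal, not stored) and U; indx records the pivot
 * rows and *sign the permutation parity. Returns -1 for a singular matrix. */
int crout_lu(double *a, int n, int *indx, double *sign)
{
    double *scale = (double *)malloc(n * sizeof(double));
    *sign = 1.0;

    for (int i = 0; i < n; ++i) {                 /* scaling information      */
        double big = 0.0;
        for (int j = 0; j < n; ++j)
            if (fabs(a[i*n + j]) > big) big = fabs(a[i*n + j]);
        if (big == 0.0) { free(scale); return -1; }
        scale[i] = 1.0 / big;
    }

    for (int j = 0; j < n; ++j) {                 /* loop over columns        */
        for (int i = 0; i < j; ++i) {             /* Eq. (4.2), i < j         */
            double sum = a[i*n + j];
            for (int k = 0; k < i; ++k) sum -= a[i*n + k] * a[k*n + j];
            a[i*n + j] = sum;
        }
        double big = 0.0; int imax = j;
        for (int i = j; i < n; ++i) {             /* Eq. (4.3) numerator      */
            double sum = a[i*n + j];
            for (int k = 0; k < j; ++k) sum -= a[i*n + k] * a[k*n + j];
            a[i*n + j] = sum;
            if (scale[i] * fabs(sum) > big) { big = scale[i] * fabs(sum); imax = i; }
        }
        if (imax != j) {                          /* partial pivot: row swap  */
            for (int k = 0; k < n; ++k) {
                double tmp = a[imax*n + k];
                a[imax*n + k] = a[j*n + k];
                a[j*n + k] = tmp;
            }
            *sign = -(*sign);
            scale[imax] = scale[j];
        }
        indx[j] = imax;
        if (j < n - 1) {                          /* divide by pivot, Eq. (4.3);
                                                     a zero pivot would need the
                                                     TINY substitution of [5]  */
            double inv = 1.0 / a[j*n + j];
            for (int i = j + 1; i < n; ++i) a[i*n + j] *= inv;
        }
    }
    free(scale);
    return 0;
}
```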

1.2 GPU Implementation

With the foreknowledge that the decomposition will be applied in batch, mapping one computational thread to each matrix seems a reasonable strategy for a GPU implementation. On the device, however, this virtually eliminates the possibility of coalesced loads from global memory and of thread cooperation via shared memory, both key requirements for good performance. At the other extreme, mapping one thread to each matrix element would introduce significant synchronization overhead, owing to dependencies between the loops described in Algorithm 1. As a compromise between the two extremes, O(N) threads were assigned to the operations for each matrix, and each CUDA thread block was assigned one or more matrices to process. Referring to Algorithm 1, there are at least two key points at which threads must cooperate. The first is the determination of scaling information, lines 1–5, which may be considered a separate scope from line 6 onward. This task is readily solved using parallel reduction, a well-known primitive. Turning to the main steps of the algorithm, lines 7–13 perform updates to matrix elements above the diagonal, specifically in column j. By assigning the index of the loop at line 7 to the thread index, progressively more threads in this scope work as the outer loop advances; a brief summary of this scope as executed in CUDA is given in Table 4.1. Within a warp, one may rely on SIMD execution, and thus updated column elements are available to threads of higher index when needed. As one might expect, matrices of side greater than a single warp require serialization of warp execution, owing to the unpredictable way in which instructions are scheduled and dispatched within the Streaming Multiprocessor (SM), as illustrated in Fig. 4.1. Some parallelism is regained by mapping matrix to warp, for this scope alone.
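A minimal host-side sketch of this thread-to-matrix mapping is given below; the kernel name, block size and the assumption N ≤ 128 are illustrative rather than taken from the chapter.

```cuda
__global__ void batch_lu_kernel(float2 *matrices, int n, int batch); /* e.g. the Appendix 1 kernel */

/* Hypothetical launch configuration: O(N) threads per matrix, one or more
 * matrices per thread block, assuming N <= threadsPerBlock. */
void launch_batch_lu(float2 *d_matrices, int N, int batch)
{
    const int threadsPerBlock = 128;                  /* tunable              */
    int matricesPerBlock = threadsPerBlock / N;       /* at least one         */
    int blocks = (batch + matricesPerBlock - 1) / matricesPerBlock;

    batch_lu_kernel<<<blocks, threadsPerBlock>>>(d_matrices, N, batch);
}
```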

Table 4.1 Global memory reads [·], shared memory reads (·), shared memory writes {·}, critical operations (†), and arithmetic operations for several iterations and CUDA threads t_id of algorithm lines 7–14
Fig. 4.1 An example of instruction scheduling and execution in a streaming multiprocessor

No such limitations affect lines 14–20, where the loop index is again mapped to the thread index and column data is read from above the diagonal. Threads in this scope update elements from the diagonal downwards; however, barrier synchronization is necessary before and after the scope. The particular column updated in a single iteration of the outer loop is cached in shared memory before line 7, and written back to global memory after line 20. Shared memory buffers used for communication are declared with the volatile keyword, to ensure that write operations are not optimized away during compilation. Once the column update is complete, and the working threads have written the elements q to another shared memory buffer before line 20, parallel reduction is employed to find the index of the pivot. Should the condition at line 21 be satisfied, a row swap is completed by the threads, with temporary elements held in registers. Thereafter, row elements are scaled by diagonal elements; once again the loop index k is mapped to the thread index. Barrier synchronization is employed before the end of the outer loop at line 29. An abbreviated listing of the main CUDA kernel, based around the float2 type for processing complex data, is given in Appendix 1.
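The pivot search by parallel reduction can be sketched as follows; this is an illustrative argmax reduction over a shared memory buffer of candidate magnitudes, assuming for clarity that the whole block cooperates on a single matrix and that n is a power of two. It is not the Appendix 1 listing.

```cuda
/* Illustrative shared-memory argmax reduction for the pivot search: each of
 * the n working threads has deposited the (scaled) magnitude of its candidate
 * element in mag[t] and its row index in idx[t]; a tree reduction leaves the
 * index of the largest magnitude in idx[0]. */
__device__ int argmax_pivot(volatile float *mag, volatile int *idx, int n)
{
    int t = threadIdx.x;
    __syncthreads();                           /* candidates fully written */
    for (int stride = n / 2; stride > 0; stride >>= 1) {
        if (t < stride && mag[t + stride] > mag[t]) {
            mag[t] = mag[t + stride];
            idx[t] = idx[t + stride];
        }
        __syncthreads();
    }
    return idx[0];                             /* row index of the pivot */
}
```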

1.3 LU Results

An implementation of Algorithm 1 was written in C for execution on the CPU, operating on row-major matrices of complex (single precision) floating point data. This routine was compiled with a recent revision of the Intel compiler, using the flags -O3 -xHost to ensure the highest degree of optimization and to take advantage of the AVX hardware and instructions of the Sandy Bridge CPU. OpenMP was used to distribute matrices to separate threads for processing. The main GPU kernel as described, and supporting routines including the parallel reduction, were compiled using nvcc, CUDA revision 5.5, for compute architecture 3.5 and with the optimization flag -O3. Table 4.2 summarizes the results, comparing execution times. Profiling with nvvp revealed a total global memory bandwidth of approximately 62 GB/s (54.5 GB/s read + 7.5 GB/s write). Both CPU and GPU routines were devoted to calculating the in-place LU decomposition alone. No permutations were stored; however, the sign of the permutation was recorded in memory, as is necessary for any subsequent calculation of matrix determinants. Crout's algorithm executed on the K40c device achieved a 1.0–5.0x performance improvement over a single Sandy Bridge CPU socket running 16 threads. The super-linear scaling of the CPU results was investigated further using tools from the Valgrind suite [8]. As expected, the effect had little correlation with cache performance; miss rates for both instructions and data were negligible for all matrix and batch sizes considered. However, profiling with callgrind did reveal that the fraction of instructions devoted directly to the LU calculation itself steadily increased with matrix size. This fraction was as little as 60 % for matrices of side 32, increasing to almost 100 % for matrices of side 256. Similarly, the percentage of instructions derived from other sources, particularly the Intel KMP interface for thread management and CPU affinity, decreased to a negligible contribution for matrices of side 256.

Table 4.2 LU algorithm executed on K40c GPU device versus 16 Intel E5-2670 (Sandy Bridge) CPU threads

2 QR Decomposition

While also applicable to the solution of systems of linear equations, the QR decomposition,

$$\displaystyle{ \mathbf{QR} = \mathbf{A}, }$$
(4.4)

is perhaps most prominent as the basis of the QR algorithm, a popular approach to eigendecomposition. In numerical implementations of the QR decomposition, the upper triangular matrix R is constructed by the action of successive operations on A. R can be produced by one of several means, the most popular being Householder reflections or Givens rotations [3]. This chapter focuses on the latter, whereby successive rotations \(G_i\) are applied, selectively eliminating elements below the diagonal of A and producing the upper triangular matrix R. One such step, for the first column of a 3×3 complex matrix, is illustrated in Eq. (4.5), where * denotes the complex conjugate.

$$\displaystyle{ \left [\begin{array}{ccc} 1& 0 &0\\ 0 & c & s \\ 0& - s{\ast}&c\\ \end{array} \right ]\left [\begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33}\\ \end{array} \right ] = \left [\begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21}^{\prime} & a_{22}^{\prime} & a_{23}^{\prime} \\ 0 &a_{32}^{\prime} & a_{33}^{\prime}\\ \end{array} \right ] }$$
(4.5)

2.1 Theory

2.1.1 Serial QR Decomposition

The kernel of the rotation matrix \(G_i\) is a 2×2 matrix that operates on pairs of values \(a = a_{i,j}\) and \(b = a_{i+1,j}\) in A, where the elements c and s are chosen to eliminate the lower element in the operation:

$$\displaystyle{ \left [\begin{array}{cc} c &s\\ - s{\ast} & c\\ \end{array} \right ]\left [\begin{array}{c} a\\ b\\ \end{array} \right ] = \left [\begin{array}{c} r\\ 0 \end{array} \right ]. }$$
(4.6)

Bindel et al. [1] give expressions for suitable c and s in a variety of contexts; the following are used in the remainder of this chapter for complex values, analogous to those for real values:

$$\displaystyle{ c = \pm \frac{\vert a\vert } {\sqrt{\vert a\vert ^{2 } + \vert b\vert ^{2}}}, }$$
(4.7)
$$\displaystyle{ s = \pm \mbox{ sgn}(a) \frac{b^{\ast }} {\sqrt{\vert a\vert ^{2 } + \vert b\vert ^{2}}}, }$$
(4.8)

where

$$\displaystyle{ \mbox{ sgn}(a) = \left \{\begin{array}{cc} a/\vert a\vert & \mbox{ if }\,a\neq 0\\ 1 &\mbox{ if } \,a =0\\ \end{array} \right.. }$$
(4.9)

The concatenation of all orthogonal operations \(G_i\) comprises the transpose of the orthogonal matrix Q, i.e., using 0-based indexing,

$$\displaystyle{ Q^{T}A = \left [\prod _{ j=0}^{j=N-2}\left \{\prod _{ i=j}^{i=N-2}G_{ i}^{j}\right \}\right ]\;A = R }$$
(4.10)

where the superscript on G refers to the matrix column operated on during a particular iteration.
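The following host-side sketch brings Eqs. (4.6) through (4.10) together for a single complex matrix; it is an illustration written for this text (function names, the guard for the degenerate case, and the bottom-up elimination order within each column are choices made here), not the chapter's CPU reference.

```cuda
#include <math.h>
#include <cuComplex.h>

/* Generate [c, s] from Eqs. (4.7)-(4.9); the conjugate on b matches the
 * convention of Eq. (4.6), so that the lower element is exactly eliminated. */
static void calc_givens(cuFloatComplex a, cuFloatComplex b,
                        float *c, cuFloatComplex *s)
{
    float absA = cuCabsf(a), absB = cuCabsf(b);
    float r = sqrtf(absA * absA + absB * absB);
    if (r == 0.0f) { *c = 1.0f; *s = make_cuFloatComplex(0.0f, 0.0f); return; }

    cuFloatComplex sgnA = (absA != 0.0f)                       /* Eq. (4.9) */
        ? make_cuFloatComplex(cuCrealf(a) / absA, cuCimagf(a) / absA)
        : make_cuFloatComplex(1.0f, 0.0f);
    *c = absA / r;                                             /* Eq. (4.7) */
    *s = cuCmulf(sgnA,                                         /* Eq. (4.8) */
                 make_cuFloatComplex(cuCrealf(b) / r, -cuCimagf(b) / r));
}

/* Serial Givens QR: overwrite the row-major n x n matrix A with R by
 * eliminating each column below the diagonal, bottom up, as in Eq. (4.10). */
static void serial_qr(cuFloatComplex *A, int n)
{
    for (int j = 0; j < n - 1; ++j)
        for (int i = n - 2; i >= j; --i) {        /* rotation on rows i, i+1 */
            float c; cuFloatComplex s;
            calc_givens(A[i * n + j], A[(i + 1) * n + j], &c, &s);
            for (int k = j; k < n; ++k) {         /* apply Eq. (4.6) to the row pair */
                cuFloatComplex u = A[i * n + k];
                cuFloatComplex l = A[(i + 1) * n + k];
                A[i * n + k] = cuCaddf(
                    make_cuFloatComplex(c * cuCrealf(u), c * cuCimagf(u)),
                    cuCmulf(s, l));                               /*  c u + s l  */
                A[(i + 1) * n + k] = cuCsubf(
                    make_cuFloatComplex(c * cuCrealf(l), c * cuCimagf(l)),
                    cuCmulf(cuConjf(s), u));                      /* -s* u + c l */
            }
            A[(i + 1) * n + j] = make_cuFloatComplex(0.0f, 0.0f); /* exact zero */
        }
}
```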

2.1.2 Parallel QR Decomposition

Sameh and Kuck [7] developed a parallel scheme, dedicated to matrices of even side, in which the elimination process pictured in Eq. (4.5) is carried out in parallel across multiple rows and columns. Multiple independent Givens rotations \(\tilde{Q}_{m,n}\) can be executed at the same time, where m and n refer to the row and column indices of the eliminated element. The product of these matrices constructs the matrix \(\hat{Q}_{i}\), which is applied at the i-th step of the algorithm:

$$\displaystyle{ \hat{Q}_{i} =\prod \tilde{ Q}_{m,n}. }$$
(4.11)

For a given step i, the matrices \(\tilde{Q}_{m,n}\) can be multiplied in any order to obtain \(\hat{Q}_{i}\), as they form a direct sum of plane rotations [7]. As a result, \(\hat{Q}_{i}\) is a block-diagonal matrix with Givens rotation matrices \(G_i\) on the diagonal, as pictured in Eq. (4.12).

$$\displaystyle{ \hat{Q}_{i} = \left [\begin{array}{ccccccccc} 1&& & & && & & \\ &\ddots & & & & & & &\\ & &1 & & & & & & \\ && & c_{k,l} &s_{k,l}&& & & \\ && & - s_{k,l}{\ast}&c_{k,l} && & & \\ & & & & &\ddots & & & \\ && & & && c_{m,n} &s_{m,n}& \\ && & & && - s_{m,n}{\ast}&c_{m,n} & \\ && & & && & &1\\ \end{array} \right ] }$$
(4.12)

The scheme of Sameh and Kuck completes in 2N − 3 steps, where N is the order of the matrix. The i-th transform is obtained by eliminating an entry of A at row m and column n, where m and n are given by

$$\displaystyle{ m = \left \{\begin{array}{lr} \left \{N - i,N - i + 2,\ldots,N - 1 -\delta \left (i\right )\right \} & 1 \leq i \leq N - 1\\ \left \{i - N + 2,i - N + 4,\ldots,N - 1 -\delta \left (i\right )\right \} &N \leq i \leq 2N - 3 \end{array} \right., }$$
(4.13)

and

$$\displaystyle{ n = \left \{\begin{array}{lr} \left \{1,2,\ldots,\lceil \frac{i} {2}\rceil \right \} & 1 \leq i \leq N - 1 \\ \left \{i - N + 2,i - N + 3,\ldots,\lceil \frac{i} {2}\rceil \right \}&N \leq i \leq 2N - 3 \end{array} \right., }$$
(4.14)

with \(\delta \left (i\right )\) defined as

$$\displaystyle{ \delta \left (i\right ) = \left \{\begin{array}{lr} 0& i\,\mathrm{odd}\\ 1 &i\,\mathrm{even} \end{array} \right.. }$$
(4.15)

Though other elimination patterns are possible, this approach has been proven to be one of the most efficient, both from a practical and mathematical point of view, as it is easy to implement and asymptotically optimal [2].

At each step of the process, the total number of rotations performed simultaneously, \(N_{\mathrm{rot}}\), is obtained by counting the total number of columns n and rows m affected:

$$\displaystyle{ N_{\mathrm{rot}} = \left \{\begin{array}{lr} \lceil i/2\rceil & 1 \leq i \leq N - 1\\ \lceil i/2\rceil - i + N - 1 &N \leq i \leq 2N - 3 \end{array} \right.. }$$
(4.16)

An example of the entries successively eliminated by this algorithm is shown in Fig. 4.2, for an 8×8 matrix. Numbers in the matrix correspond to the order in which the associated matrix element is eliminated in the algorithm.

Fig. 4.2 Illustration of the successive elimination scheme in the parallel QR decomposition algorithm, for an 8×8 matrix
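The schedule defined by Eqs. (4.13) through (4.16) can be transcribed directly; the small host-side sketch below lists, for each outer step i, the (m, n) pairs eliminated in parallel, with the r-th row paired with the r-th column in the monotone ordering of the two sets. The code follows the index conventions of the equations as printed and is illustrative only.

```cuda
#include <stdio.h>

/* Transcription of the elimination schedule of Eqs. (4.13)-(4.16): for a
 * given outer step i (1 <= i <= 2N-3) the (m, n) pairs eliminated in parallel
 * are listed and their count N_rot returned. Rows advance in steps of two,
 * columns in steps of one; the last row generated is N - 1 - delta(i), with
 * delta as in Eq. (4.15). */
static int schedule_step(int N, int i, int *m_out, int *n_out)
{
    int nrot = (i <= N - 1) ? (i + 1) / 2                     /* ceil(i/2)   */
                            : (i + 1) / 2 - i + N - 1;        /* Eq. (4.16)  */
    int m0 = (i <= N - 1) ? N - i : i - N + 2;                /* Eq. (4.13)  */
    int n0 = (i <= N - 1) ? 1     : i - N + 2;                /* Eq. (4.14)  */
    for (int r = 0; r < nrot; ++r) {
        m_out[r] = m0 + 2 * r;
        n_out[r] = n0 + r;
    }
    return nrot;
}

int main(void)
{
    const int N = 8;                    /* the example of Fig. 4.2 */
    int m[8], n[8];
    for (int i = 1; i <= 2 * N - 3; ++i) {
        int nrot = schedule_step(N, i, m, n);
        printf("step %2d:", i);
        for (int r = 0; r < nrot; ++r) printf(" (m=%d, n=%d)", m[r], n[r]);
        printf("\n");
    }
    return 0;
}
```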

Algorithm 2 Outer loop of the parallel QR decomposition

Algorithm 3 QR_Kernel Core GPU kernel for the parallel QR decomposition

2.2 GPU Implementation

The previous observations made in Sect. 4.1.2 concerning global and shared memory accesses are also valid for the QR decomposition; therefore, each CUDA thread block is assigned one or more matrices to process, and N threads operate on a single matrix. The parallel QR algorithm is driven by an outer loop executed on the CPU, as detailed in Algorithm 2. This routine calculates the number of CUDA blocks to run in the x-dimension of the CUDA grid, initializes the orthogonal matrix Q as the identity matrix, and calculates the total number of Givens rotations that can be executed in parallel, based on Eq. (4.16). This number sets the z-dimension of the CUDA grid, to ensure that a total of \(N_{\mathrm{rot}}\) Givens rotations are applied in parallel to the same matrix at each iteration of the outer loop. Finally, each iteration launches the CUDA kernel executed on the GPU, shown in Algorithm 3. Each CUDA block in the x-dimension performs operations on multiple matrices A, and accumulates the results in the corresponding matrix Q. All threads first calculate the indices m, n of the entry to eliminate in their corresponding matrix. Threads then load rows m − 1 and m, on lines 10 and 11, and subsequently calculate their corresponding Givens rotation on line 14. Algorithm 4 details this operation: multiple threads load the elements a and b defined in Eq. (4.6) through a shared memory broadcast on lines 1 and 2. The components of the Givens rotation kernel, c and s, are then evaluated on line 3, based on Eqs. (4.7) through (4.9). Turning attention back to Algorithm 3, the threads perform the Givens rotation on their corresponding matrix with the ApplyGivens routine, outlined in Algorithm 5. In ApplyGivens, each thread within a CUDA block operates on a single element of the two rows loaded in upperRow and lowerRow. The calculation presented in Eq. (4.6) is performed on lines 5 and 6. The threads then store the data back in place, in global memory, on lines 7 and 8. Care is taken to introduce an exact zero for columns 1 through n − 1 with the boolean condition myIndex > n on line 8, in order to avoid floating point approximations. The remainder of Algorithm 3, lines 16 through 18, accumulates the rotations in the matrix Q. Note that the boolean condition on line 8 of Algorithm 5 does not apply to the matrix Q, as can be discerned from the last line of QR_Kernel in Appendix 2.

Memory optimizations are included in the QR kernel implementation. A few constants, for example the current iteration number and the total batch size, are stored in constant memory to provide fast data access. The bandwidth cost of copying this data from the CPU to the GPU through calls to cudaMemcpyToSymbol() does not impact the overall performance of the algorithm. Care is taken to avoid non-coalesced global memory accesses by providing contiguous indices for global memory loads and stores.
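A hypothetical driver in the spirit of Algorithm 2 is sketched below; the kernel signature, variable names and the use of a single constant for the step counter are assumptions, and the kernel itself (Algorithm 3, Appendix 2) is only declared, not reproduced.

```cuda
/* Hypothetical host-side driver: one launch per outer step, with the grid
 * z-dimension set to the number of rotations N_rot allowed by Eq. (4.16). */
__constant__ int d_step;                        /* current outer iteration   */
__constant__ int d_batch;                       /* total number of matrices  */

__global__ void QR_Kernel(float2 *A, float2 *Q, int N);   /* see Appendix 2 */

void qr_batch_driver(float2 *d_A, float2 *d_Q, int N, int batch,
                     int matricesPerBlock)
{
    int blocksX = (batch + matricesPerBlock - 1) / matricesPerBlock;
    cudaMemcpyToSymbol(d_batch, &batch, sizeof(int));

    /* d_Q is assumed to hold a batch of identity matrices at this point. */
    for (int i = 1; i <= 2 * N - 3; ++i) {
        int nrot = (i <= N - 1) ? (i + 1) / 2
                                : (i + 1) / 2 - i + N - 1;    /* Eq. (4.16) */
        cudaMemcpyToSymbol(d_step, &i, sizeof(int));
        dim3 grid(blocksX, 1, nrot);            /* N_rot rotations in parallel */
        dim3 block(N * matricesPerBlock);       /* N threads per matrix        */
        QR_Kernel<<<grid, block>>>(d_A, d_Q, N);
    }
    cudaDeviceSynchronize();
}
```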

Algorithm 4 CalcGivens Calculate the [c,s] values of a Givens rotation

Algorithm 5 ApplyGivens Apply the [c,s] Givens rotation to an array of matrices
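An illustrative device routine in the spirit of Algorithm 5 is given below; parameter names and data layout are assumptions (this is not the Appendix 2 listing), and only the rotation of Eq. (4.6) together with the exact-zero condition described above is reproduced.

```cuda
/* Each thread applies the [c, s] rotation of Eq. (4.6) to one column element
 * of the cached row pair and writes the result back to global memory, forcing
 * an exact zero in the lower row up to and including the eliminated column n. */
static __device__ float2 cmul(float2 a, float2 b)
{
    return make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}

__device__ void apply_givens(float2 *gUpper, float2 *gLower,   /* rows m-1, m in global memory */
                             const float2 *upperRow,           /* cached row m-1 */
                             const float2 *lowerRow,           /* cached row m   */
                             float c, float2 s, int n, int myIndex)
{
    float2 u = upperRow[myIndex];
    float2 l = lowerRow[myIndex];
    float2 sConj = make_float2(s.x, -s.y);

    float2 t1 = cmul(s, l);
    float2 newU = make_float2(c * u.x + t1.x, c * u.y + t1.y);    /*  c u + s l  */
    float2 t2 = cmul(sConj, u);
    float2 newL = make_float2(c * l.x - t2.x, c * l.y - t2.y);    /* -s* u + c l */

    gUpper[myIndex] = newU;
    gLower[myIndex] = (myIndex > n) ? newL : make_float2(0.0f, 0.0f);  /* exact zero */
}
```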

2.3 QR Results

A serial implementation of the QR decomposition algorithm as described in the first paragraph of Sect. 4.2.1 was written in C for execution on the CPU. The source code was compiled with the latest AVX optimizations available for Intel processors, with flags -O3 -xHost. The core GPU kernel QR_Kernel was compiled with the CUDA 5.5 revision of nvcc for compute architecture 3.5, and with -O3 optimizations. The GPU method was tested on a Kepler K40c, while the CPU implementation was executed on a single Sandy Bridge CPU socket running 16 OpenMP threads. Benchmarking results are presented in Table 4.3. The GPU implementation of the QR algorithm as outlined here demonstrates a 0.6–14.3x performance improvement over a comparable CPU routine. The Nvidia profiler nvvp revealed a global memory bandwidth of 195 GB/s (97.5 GB/s read + 97.5 GB/s write).

Table 4.3 Parallel QR decomposition algorithm executed on a K40c GPU device versus 16 Intel E5-2670 (Sandy Bridge) CPU threads; execution times in ms

Table 4.3 shows that the GPU results scale linearly with batch size at constant matrix size. However, the scaling is not linear with matrix size at constant batch size; this effect can be attributed to a decreasing number of matrices processed per block as the matrix size increases. Consequently, more blocks are scheduled and executed on the GPU, resulting in larger overhead. The QR GPU kernel as described was revealed to be memory-bound by the Nvidia profiler. Thus, additional optimizations to help the code scale with matrix size may include increasing the total work performed by individual CUDA threads, in order to keep the number of matrices processed per block constant. The super-linear behavior observed in the CPU scaling results was deduced to share the same origins as that of the CPU LU implementation.

3 Conclusion

This chapter has detailed new CUDA implementations of LU and QR decomposition, for large batches of matrices of side less than 1,024 elements. The kernels take advantage of several key GPU architectural features and display highly favorable performance and scaling compared with comparable CPU implementations. The QR decomposition proved relatively more performant than the LU decomposition, largely owing to the warp serialization and heavier synchronization required by the latter. Performance of the initial kernels was improved significantly through several techniques guided by profiling, including configuring cache and shared memory in software and optimizing the thread block size and shared memory buffer size. Further optimizations and alternative kernels for these important methods are the subjects of ongoing work.