Power/Performance Trade-Offs of Small Batched LU Based Solvers on GPUs

Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin; Tumeo, Antonino

doi:10.1007/978-3-642-40047-6_81

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8097)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphics Processing Units (GPUs), evaluating the alternatives according to the size of the systems to solve. We discuss three solutions that operate at different levels of parallelism and use different GPU features. The first, which exploits the CUBLAS library, handles matrices of size up to 32x32 and employs warp-level parallelism (one matrix, one warp) and shared memory. The second works at thread-block level (one matrix, one thread block), still exploiting shared memory but handling matrices up to 76x76. The third is thread-level parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and relies only on the high memory bandwidth of the GPU. The first and second solutions support only partial pivoting; the third easily supports both partial and full pivoting, making it attractive for problems that require greater numerical stability. We analyze the trade-offs in performance and power consumption as a function of the size of the linear systems that are solved simultaneously. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).
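To make the "one matrix, one thread" idea concrete, the following is a minimal sketch in plain Python, not the authors' GPU code. It assumes each system is a small dense matrix stored as a list of rows, and the function names `lu_solve_one` and `batched_solve` are hypothetical. The body of `lu_solve_one` stands in for the work a single GPU thread would do on its own matrix: LU elimination with partial pivoting followed by back substitution. The loop in `batched_solve` stands in for the batch of independent threads.

```python
def lu_solve_one(a, b):
    """Solve a x = b for one small dense system using LU elimination
    with partial pivoting. In a thread-level scheme, this whole function
    would be the work of a single thread (assumption for illustration)."""
    n = len(a)
    a = [row[:] for row in a]  # private working copy of the matrix
    x = b[:]                   # private working copy of the right-hand side
    for k in range(n):
        # Partial pivoting: choose the row with the largest magnitude
        # in column k, at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            x[k], x[p] = x[p], x[k]
        # Eliminate column k below the diagonal.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            a[i][k] = m  # keep the multiplier (entry of L) in place
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]  # forward substitution fused into elimination
    # Back substitution with the upper-triangular factor U.
    for i in range(n - 1, -1, -1):
        s = x[i]
        for j in range(i + 1, n):
            s -= a[i][j] * x[j]
        x[i] = s / a[i][i]
    return x


def batched_solve(mats, rhss):
    """Solve every system in the batch independently.
    Each loop iteration plays the role of one GPU thread."""
    return [lu_solve_one(a, b) for a, b in zip(mats, rhss)]


if __name__ == "__main__":
    mats = [[[2.0, 1.0], [1.0, 3.0]], [[0.0, 1.0], [1.0, 0.0]]]
    rhss = [[3.0, 5.0], [2.0, 3.0]]
    print(batched_solve(mats, rhss))
```

Because no data is shared between systems, this formulation needs no synchronization or shared memory, which is consistent with the abstract's remark that the thread-level solver relies only on memory bandwidth. Swapping the pivot search for a search over the whole trailing submatrix (with a matching column permutation) would give full pivoting; that variant is not shown here.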

Author information

Authors and Affiliations

Pacific Northwest National Laboratory, Richland, WA, USA
Oreste Villa, Nitin Gawande & Antonino Tumeo
NVIDIA, Santa Clara, CA, USA
Massimiliano Fatica

Editor information

Editors and Affiliations

German Research School for Simulation Sciences, RWTH Aachen, Schinkelstr. 2a, 52062, Aachen, Germany
Felix Wolf
Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Station 22, 52425, Jülich, Germany
Bernd Mohr
Center for Computing and Communication, RWTH Aachen, Seffenter Weg 23, 52074, Aachen, Germany
Dieter an Mey

About this paper

Cite this paper

Villa, O., Fatica, M., Gawande, N., Tumeo, A. (2013). Power/Performance Trade-Offs of Small Batched LU Based Solvers on GPUs. In: Wolf, F., Mohr, B., an Mey, D. (eds) Euro-Par 2013 Parallel Processing. Euro-Par 2013. Lecture Notes in Computer Science, vol 8097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40047-6_81

DOI: https://doi.org/10.1007/978-3-642-40047-6_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40046-9
Online ISBN: 978-3-642-40047-6
eBook Packages: Computer Science, Computer Science (R0)
