#### **REGULAR PAPER**



# **Sgap: towards efficient sparse tensor algebra compilation for GPU**

Genghan Zhang<sup>1</sup> · Yuetong Zhao<sup>1</sup> · Yanting Tao<sup>1</sup> · Zhongming Yu<sup>2</sup> · Guohao Dai<sup>3</sup> · Sitao Huang<sup>4</sup> · Yuan Wen<sup>5</sup> · **Pavlos Petoumenos<sup>6</sup> · Yu Wang<sup>1</sup>**

Received: 3 September 2022 / Accepted: 14 March 2023 / Published online: 8 May 2023 © China Computer Federation (CCF) 2023

## **Abstract**

Sparse compiler is a promising solution for sparse tensor algebra optimization. In compiler implementation, **reduction** in sparse-dense hybrid algebra plays a key role in performance. Though GPU provides various reduction semantics that can better utilize the parallel computing and memory bandwidth capacity, the central question is: *how to elevate the fexible reduction semantics to sparse compilation theory that assumes serial execution*. Specifcally, we have to tackle two main challenges: (1) there are wasted parallelism by adopting static synchronization granularity (2) static reduction strategy limits optimization space exploration. We propose Sgap: *s egment g roup* and *a tomic p arallelism* to solve these problems. Atomic parallelism captures the fexible reduction semantics to systematically analyze the optimization space of sparse-dense hybrid algebra on GPU. It is a new optimization technique beyond current compiler-based and open-source runtime libraries. Segment group elevates the fexible reduction semantics to suitable levels of abstraction in the sparse compilation theory. It adopts changeable group size and user-defned reduction strategy to solve challenge (1) and (2), respectively. Finally, we use GPU sparse matrix-matrix multiplication (SpMM) on the TACO compiler as a use case to demonstrate the efectiveness of segment group in reduction semantics elevation. We achieve up to  $1.2 \times$  speedup over the original TACO's SpMM kernels. We also apply new optimization techniques found by atomic parallelism to an open-source state-of-the-art SpMM library dgSPARSE. We achieve  $1.6 \times \sim 2.3 \times$  speedup on the algorithm tuned with atomic parallelism.

**Keywords** Sparse compiler · Sparse tensor algebra · SpMM · GPU

 $\boxtimes$  Guohao Dai daiguohao@sjtu.edu.cn

 $\boxtimes$  Yu Wang yu-wang@tsinghua.edu.cn

- <sup>1</sup> Department of Electronic Engineering, Tsinghua University, Rhom 4101, Beijing 100084, China
- <sup>2</sup> Department of Computer Science and Enigeering, University of California San Diego, Gilman Drive, La Jolla, CA 92093, USA
- <sup>3</sup> Qingyuan Research Institute, Shanghai Jiao Tong University, Room 318A, Building A No. 930 Jianchuan Road, Shanghai 200240, China
- Department of Electrical Engineering and Computer Science, University of California Irvine, 3215 Engineering Hall, Irvine, CA 92697, USA
- <sup>5</sup> Department of Computer Science, University of Aberdeen King's College, Meston Building, Aberdeen AB24 3UE, UK
- Department of Computer Science, University of Manchester, Kilburn Building, Manchester M13 9PL, UK

# **1 Introduction**

Sparse tensor algebra has been widely used in many felds, including machine learning (Hamilton et al. [2017;](#page-14-0) Kipf and Welling [2016;](#page-15-0) Liu et al. [2015](#page-15-1)), data analysis (Kolda and Bader [2009\)](#page-15-2), scientifc computing (Shantharam et al. [2011](#page-15-3); Bell et al. [2012\)](#page-14-1), graph processing (Yuster and Zwick [2004](#page-15-4)). However, it is challenging to optimize sparse tensor applications because of diversity in computation patterns and irregularity in memory access behavior. Sparse compilers have shown great potential to solve this problem. Sparse compilers can use **one** monolithic theory to express diverse data formats and operations, and provide fexible user interface, enabling users to explore the optimization space given data and hardware. Therefore, more and more researchers are turning to sparse compilers for general solutions (Bik and Wijshoff [1993;](#page-14-2) Venkat et al. [2015](#page-15-5); Strout et al. [2018](#page-15-6); Kjolstad et al. [2017](#page-15-7); Kjolstad [2020](#page-15-8); Popoola et al. [2021;](#page-15-9) Bik et al. [2022](#page-14-3); Ye et al. [2023](#page-15-10)).



<span id="page-1-0"></span>Fig. 1 Sparse compilers suffer from static synchronization granularity and static reduction strategy. **a** Example reduction with legends in latter subfgures. **b** Parallelism waste caused by improper synchroni-

zation granularity. **c** One type of segment reduction and one type of parallel reduction. Segment reduction has two writeback threads and parallel reduction has one

However, it is challenging to design a sparse compiler that can both compile various algebras and generate highly optimized code. In particular, *sparse-dense hybrid algebra* on GPU brings unique challenges to sparse compilers. After analysing sparse-dense hybrid algebra's mathematical expression, we fnd out that **reduction** is its key operation (Nisa et al. [2019](#page-15-11); Huang et al. [2020](#page-15-12); Kurt and Raje [2022](#page-15-13)). There are several possible ways to do reduction on GPUs. Diferent reduction methods are preferred for different workloads. Choosing the correct reduction method can accelerate kernels (Dai et al. [2022;](#page-14-4) Bell and Garland [2009\)](#page-14-5). For example, controlled experiments in Dai et al. [\(2022\)](#page-14-4) show that parallel reduction can outperform conditional reduction and vice versa by  $2x \sim 4x$ . However, current sparse compilers lack the abstraction for such fexible reduction semantics. That is because they assume the code executes serially. GPU reduction is diferent from the serial reduction in that it changes the reduction code's structure (e.g., control-fow and loop basic block). Therefore, it cannot be naively generated by directly adding or replacing some instructions like the *unroll* in CPU. Solving this problem requires elevating reduction semantics to the sparse compilation theory in a systematic way.

However, elevating the fexible reduction semantics to sparse compilation theory faces two main challenges: (1) **Static synchronization granularity wastes parallelism**: GPU synchronizes a group of threads whose group size is power of 2, which we term as synchronization granularity. Threads can pass local register values to another thread in the same group. However, static synchronization granularity may waste parallelism when inputs are dynamic. For example, if not all threads' register values are gathered, threads that do not infuence the reduction result still have to wait to be synchronized. In other words, the synchronization granularity is too large for such input data, as is shown in Fig. [1](#page-1-0)b. However, current sparse compilers only assume synchronization granularity to be 32, which wastes the parallelism. This is the limitation of current sparse compilers. (2) **Static reduction strategy limits optimization space exploration**: GPU has provided very fexible methods to do reduction. Multiple threads in a thread group will write back to the fnal results. We name such thread *writeback thread*. There could be more than one writeback thread in a thread group. The thread indices of writeback threads can also be decided at runtime and are controlled by the reduction strategy. Diferent algorithms favor diferent reduction strategies. For example, as is shown in Fig. [1c](#page-1-0), if we assign a given number of non-zeros to each thread group, it has to use segment reduction. That is because threads need to write back according to the coordinate and thus writeback thread is decided at runtime. However, in another algorithm where all threads in a group are guaranteed to write back to the same place, it can use parallel reduction (Bell and Garland [2009](#page-14-5)). However, current sparse compilers assume that only the frst thread in a thread group is the writeback thread and use parallel reduction.

To tackle these challenges and build a more efficient sparse compiler, we propose *atomic parallelism* and *segment group* in this paper and implement our techniques in a real sparse compiler TACO (Kjolstad et al. [2019](#page-15-14); Chou et al. [2018;](#page-14-6) Kjolstad et al. [2017](#page-15-7); Senanayake et al. [2020](#page-15-15)). Atomic parallelism models the optimization space of sparsedense hybrid algebra from the reduction view. It uses the minimal data and reduction parallelism to distinguish diferent algorithms of a given algebra. Minimal data are used to defne reduction strategy and reduction parallelism for synchronization granularity. We use this model to propose new optimization techniques. Segment group is a new abstraction for sparse compilation theory. It captures the dynamic synchronization granularity and dynamic reduction strategy.



<span id="page-2-0"></span>**Fig. 2** Venn diagram for the relation between atomic parallelism and original sparse compilation theory. The element is the point in the algorithm design space of a sparse-dense hybrid algebra. Original sparse compilation theory can only express parallel reduction with group size 32. However, it can also express some optimization points, for example, loop reorder, beyond atomic parallelism. The union of segment group and original theory creates a new sparse compilation theory

To be specifc, we use fexible group size to solve challenge (1) and design full-stack support for user-defned reduction strategy, which solves challenge (2). As is shown in Fig. [2,](#page-2-0) segment group extends the expression ability of original sparse compilation theory.

Finally, we use sparse matrix-matrix multiplication (SpMM) as an example to demonstrate atomic parallelism and segment group. SpMM is one of the most widely used sparse-dense hybrid algebra. It is the core operator of many emerging applications (Han et al. [2016;](#page-14-7) Wang et al. [2019](#page-15-16); Lin et al. [2021;](#page-15-17) Asgari et al. [2021\)](#page-14-8). It is also the simplest form of sparse-dense hybrid algebra.

Therefore, this work manages to push the frontier a step forward on these two challenges by a combined method involving segment group and atomic parallelism which we called *Sgap* in this paper. Our contributions are as follows:

- 1. We propose a framework *atomic parallelism* to analyse sparse-dense hybrid algebra and propose new SpMM designs beyond previous works (Yang et al. [2018;](#page-15-18) Hong et al. [2019](#page-15-19); Huang et al. [2020;](#page-15-12) Mehrabi et al. [2021](#page-15-20); Dai et al. [2022](#page-14-4)).
- 2. Based on the atomic parallelism, we point out that current sparse compilers miss important optimization opportunities. We propose a new abstraction *segment group* for sparse compilers. Segment group can reduce parallelism waste and improve workload balance.
- 3. We implement segment group in TACO and get up to 1.2× speedup on average over the original TACO's SpMM kernels. Next, we generalize our fndings from TACO to dgSPARSE (Dai et al. [2022](#page-14-4)), an open-source state-of-the-art SpMM library. We achieve  $1.6 \times \sim 2.3 \times$ speedup over dgSPARSE on the algorithm we tune.

The rest of this paper is organized as follows. Background information is provided in Sect. [2](#page-2-1). Section [3](#page-6-0) introduces atomic parallelism and Sect. [4](#page-7-0) is for segment group. Then the implementation of segment group in TACO is detailed in Sect. [5.](#page-9-0) After that, we illustrate the combination of atomic parallelism and segment group in TACO. Our evaluation of new SpMM algorithms in TACO and generalization to dgSPARSE is presented in Sect. [7.](#page-12-0) The paper is concluded in Sect. [8.](#page-14-9)

# <span id="page-2-1"></span>**2 Background**

#### **2.1 Sparse‑dense hybrid algebra**

Sparse-dense hybrid algebra can be defned in two equivalent forms: the tensor formulation (TF) in Eq. [1](#page-2-2) and the database formulation (DF) in Eq. [3](#page-2-3). From TF sparse-dense hybrid algebra because the operands of it are sparse and dense, for example, MTTKRP (Matricized Tensor Times Khatri Rao Product) (Nisa et al. [2019](#page-15-11)), SDDMM (Sampled Dense-Dense Matrix Multiplication) (Yu et al. [2021](#page-15-21)), SpMM (sparse Matrix-Matrix Multiplication) (Huang et al. [2020](#page-15-12)), TTM (Tensor Times Matrix Product) (Kurt and Raje [2022](#page-15-13)). We use Einstein's summation to defne sparse-dense hybrid algebra in AF as Eq. [1.](#page-2-2)

<span id="page-2-2"></span>
$$
\mathbb{Y}_{y_1, y_2, \cdots, y_M} = \mathbb{A}_{a_1, a_2, \cdots, a_N} \prod_{i=1}^D \mathbb{X}^j_{x_1^j, x_2^j, \cdots, x_M^j}
$$
(1)

 $\mathbb {Y}$  is the output tensor,  $\mathbb {X}^j$  are dense input tensors, and  $\mathbb {A}$  is the sparse input tensor. At least one level  $a_N$  in  $\Lambda$  does not store in dense format.  $y_1, y_2, \dots, y_M, a_1, a_2, \dots, a_N, x_1^j, x_2^j, \dots, x_N^j$ *Mj* are in the same index variable set. *M* is the mode of output tensor, and *N* is the mode of sparse input tensor. *D* is the number of dense input tensors, and  $M^j$  is the mode of dense input tensor  $\mathbb{X}^j$ . Specifically, MTTKRP, TTM, SDDMM, and SpMM are expressed as:

$$
\mathbb{Y}_{i,j} = \mathbb{A}_{i,k,l} \mathbb{X}_{k,j}^1 \mathbb{X}_{l,j}^2
$$
\n(2a)

$$
\mathbb{Y}_{i,j,l} = \mathbb{A}_{i,j,k} \mathbb{X}_{k,l}^1 \tag{2b}
$$

$$
\mathbb{Y}_{i,k} = \mathbb{A}_{i,k} \mathbb{X}_{i,j}^1 \mathbb{X}_{j,k}^2
$$
 (2c)

$$
\mathbb{Y}_{i,k} = \mathbb{A}_{i,j} \mathbb{X}_{j,k}^1 \tag{2d}
$$

We use message-passing to defne sparse-dense hybrid algebra in DF as Eq. [3.](#page-2-3)

<span id="page-2-3"></span>
$$
Q(dst) = \bigoplus_{src \in Q_0(dst)} \{ src, \otimes (Q_1(src, dst), Q_2(dst)) \}
$$
(3)

 $Q, Q_0, Q_1, Q_2$  are queries for the relevant database. We follow the idea of logical-physical storage seperation (Codd [1970](#page-14-10)). The value of  $Q(k)$  is defined as  $Q(dst) = D(f(dst))$ . *D* is the relevant database of *Q*, storing (*id*, *value*) in ascending order of *id*, where  $id \in \mathbb{Z}$  and  $value \in \mathbb{R}^n$ . *dst* is any hashable key and f is a function  $K \to \mathbb{Z}$ .  $\oplus$  can be any commutative operation and *⊗* can be any function that takes two objects



<span id="page-3-0"></span>**Fig. 3** Examples of sparse-dense hybrid algebra. The consecutive grey parallelograms or squares represent the reduction modes

as input and output one object that can be operated by *⊕*. The result of *⊕* is written to *f*(*dst*) in *Q*. Sparse-dense hybrid algebra is sparse because  $Q_0(dst)$  for all *dst* are diverse. In other words,  $Q_0(i) \bigcap Q_0(i + 1) \sim \emptyset$ . Such algebra is dense because values in  $D, D_1, D_2$  are scalar, dense vectors, or dense matrices.

The core operation of sparse-dense hybrid algebra is *reduction* and reduction in diferent kernels behaves similarly. This key observation motivates atomic parallelism because we only need to optimize the common reduction operations and use the compiler to optimize diferent sparsedense hybrid algebra kernels automatically. For example, in TF kernels do reduction on *l*, *k* dimensions in MTTKRP, *k* in TTM, *j* in SDDMM and SpMM. The reduction can be along one sparse and one dense dimension, as in MTTKRP, TTM, and SpMM. It can also be along two dense dimensions, as in SDDMM. Figure [3](#page-3-0) illustrates these examples and highlights the reduction dimensions. We also give concrete code examples in Fig. [4](#page-4-0). It shows that some of these kernels share common reduction codes. For example, MTTKRP contains two reductions, each behaving the same as the reduction in SpMM.

Such property can also be illustrated in DF. As shown in Fig. [5](#page-5-0), for the first reduction, the value of  $D_1$  both are scalar; the value of  $D_2$  both are vectors. For the second reduction of MTTKRP, though the value of  $D_1$  is a vector, which is diferent from SpMM's frst reduction, *⊕* behaves the same because *⊗* here is element-wise vector product.

#### **2.2 SpMM optimization**

As explained above, the reduction is the core operation of sparse-dense tensor algebra and some kernels share the same type of reduction. Without loss of generality, we take SpMM as an example to optimize the reduction in this paper. The optimization techniques can be easily generalized to expedite other sparse-dense hybrid algebra kernels. Yang et al. ([2018\)](#page-15-18) selects between two algorithms to achieve

respectively even distribution of nnz among parallel processors and row-splitting among threads. Adaptive Sparse Tiling (ASpT) (Hong et al. [2019\)](#page-15-19) aims at improving data locality and thus reduces the total number of accesses to global memory. Ge-SpMM (Huang et al. [2020\)](#page-15-12) proposes Coalesced Row Caching (CRC) method to enable coalesced memory access to both sparse and dense matrices and Coarse-grained Warp Merging (CWM) method for SpMM merging workloads from diferent warps to reuses loaded sparse matrix. Mehrabi et al. ([2021\)](#page-15-20) proposes several row permutation strategies for CSR format to enhance load balance and data locality. DA-SpMM (Dai et al. [2022](#page-14-4)) is a data-aware kernel selector among 8 algorithms according to 3 dimensions in the space dealing with dynamic input data.

#### **2.3 Sparse compilers**

The complexity of optimizing sparse tensor algebra comes from four directions: data, data format, algebra, and hardware. Researchers often develop a technique for one data format, one algebra, and one hardware. Such a library method heavily relies on experts and engineering work (Guennebaud and Jacob [2010;](#page-14-11) Naumov et al. [2010](#page-15-22); Wang et al. [2014](#page-15-23)). However, sparse compilers can extremely reduce such engineering burden and boost innovation in this area. Unlike the library method, sparse compilers aim to use **one** monolithic theory to express all data formats, all algebras, and provide flexible user interface, which enables users to explore the optimization space given data and hardware. Research on sparse compilers can be divided into two categories: (1) *Passoriented*. Given the imperative code, design compilation passes to optimize the code (Bik and Wijshoff [1993](#page-14-2); Venkat et al. [2015](#page-15-5); Strout et al. [2018\)](#page-15-6). (2) *Language-oriented*. View sparse compiler as a programming language and design lowering and scheduling process [15],(Bik et al. [2022](#page-14-3); Kjolstad [2020\)](#page-15-8). Especially, TACO is a fundamental breakthrough on this problem. To the best of our



(C) TTM

<span id="page-4-0"></span>**Fig. 4** Code examples of reduction in sparse-dense hybrid algebra in TF. The colored lines are reduction codes. MTTKRP has two levels of reduction, colored green and yellow, respectively. The overlapped

region means that the frst-level reduction's output serves as the second-level reduction's input. We follow the naming rules in Kjolstad ([2020\)](#page-15-8) for the storage of *A*

knowledge, it is the first to propose a practical sparse compilation theory. MLIR sparse dialect (Bik et al. [2022\)](#page-14-3) implements TACO's sparse compilation theory as MLIR dialect. SparseTIR [15] follows the design philosophy of

TensorIR (Feng et al. [2022](#page-14-12)), but it still uses some of the TACO's concepts such as position and coordinate space. TACO also motivates innovations on accelerators for sparse tensor algebra (Qin et al. [2022\)](#page-15-24).



<span id="page-5-0"></span>**Fig. 5** Illustration of common reduction in MTTKRP and SpMM. The equivalent expressions of the same kernel in TF and DF are below each sub-figure

<span id="page-5-1"></span>

# **2.4 TACO**

TACO (The Tensor Algebra Compiler) is a fast and versatile compiler-based library for sparse linear and tensor algebra (Kjolstad et al. [2017,](#page-15-7) [2019;](#page-15-14) Kjolstad [2020](#page-15-8); Senanayake et al. [2020\)](#page-15-15). TACO has three types of inputs: a tensor algebra expression (in an Einstein summation notation or reduction notation); level formats of input and output tensor; schedule commands. We will introduce TACO in the front-end, middle-end, and back-end order. The workfow of TACO is illustrated in Fig. [6.](#page-5-1)

#### <span id="page-5-2"></span>**2.4.1 Front‑end**

At the front-end, the tensor algebra expression is concretized to concrete index notation (Kjolstad et al. [2019\)](#page-15-14). The concrete index notation (CIN) is a language that describes the execution of a tensor algebra. Unlike bare tensor algebra expression, CIN describes the loop, index variables relations, workspace, hardware platform, etc. Schedule commands transform the CIN. For example, a *precompute* schedule will add a *where* statement to the CIN. Though TACO provides a clean and powerful scheduling API to transform CIN, the user can still change the CIN directly. TACO

provides a match function that can take lambda expression as input. The function can modify CIN when it meets a specifc type of CIN node or a pattern of CIN nodes. Moreover, users can defne a child class of IndexNotationRewriter that can directly rewrite the CIN. Such technique is used to implement segment group.

#### **2.4.2 Middle‑end**

At the middle-end, CIN will be transformed to imperative IR (or low level IR (LLIR)). LLIR describes the basic blocks, for example, for-loop, while-loop, and if-statement. LLIR is almost the executable code. The output of the middle-end is a chain of LLIR. The sparse iteration theory (Kjolstad [2020](#page-15-8)) guides the CIN to LLIR process. It ensures that diferent tensors only coiterate over elements that can generate non-zero output. Specifcally, TACO designs lower functions for every statement of CIN and lattices in the sparse iteration space. However, current lower functions only assume serial reduction is done on the compressed level of sparse tensors. We will break the serial code assumption to implement segment group. Moreover, we suggest that more fexible or even userdefned lowerers should be designed in the future.

#### **2.4.3 Back‑end**

At the back-end, LLIR will be transformed to code for diferent backends. In this paper, we target the CUDA code generation. TACO CUDA code generator has some assumptions that previous papers did not thoroughly explore. TACO deals with CUDA code generation in a nested loop favor (Sena-nayake et al. [2020](#page-15-15)). Moreover, it only generates one dimension of block and thread. That is, it only has blockIdx.x and threadId.x. When the index variable of a for-loop LLIR is bound on the GPUBlock, it will use blockIdx.x to index this index variable. In the CPU case, it will emit a real for-loop. Such variable is assumed to increment by 1. Index variables bound on GPUWarp and GPUThread are assumed to be the outer and inner variables of threadIdx.x. The tile size depends on the index variable on GPUThread. The mixture of tiling and synchronization semantics of GPUWarp loses some optimization opportunities. We will discuss this later and improve it in our implementation.

# <span id="page-6-0"></span>**3 Atomic parallelism**

## **3.1 Computation unit model**

We observe that the core operation of sparse-dense hybrid algebra is the reduction. Therefore, the core of our model is *how many data are reduced and are synchronized in which way*. We model the atomic computation unit as **thread**. A thread executes a serial program. All threads execute the same program independently with each own's input data and are distinguished by threadId. Threads can do synchronization in groups with *reduction parallelism* of 2, 4, 8, 16, or 32. We model GPU computation as *unlimited* parallel threads and defne the number of threads as *resource parallelism* that GPU can provide. We do not consider the shared memory, grid level, and the mapping of the thread block or the streaming processor. Instead, we view them as reasonable implementation details after the basic parallel pattern is decided. In other words, there can be many kinds of implementation for each algorithm in atomic parallelism. In this sense, atomic parallelism can encourage more GPU optimization innovation.

#### **3.2 Overview of atomic parallelism**

To defne the parallel pattern concretely, we propose *atomic parallelism*. A program with *atomic parallelism* cannot be paralleled anymore. In other words, a thread at least executes the amount of data denoted by atomic parallelism. Formally, atomic parallelism is defned as the Cartesian product of *minimal data*. Minimal data is the minor data of

one category a thread can execute. Atomic parallelism can be used to construct the optimization space of any sparsedense hybrid algebra under the GPU model, but we focus on SpMM in this paper.

Indeed, tiling, manipulating shared memory, and thread mapping (Hidayetoğlu et al. [2020;](#page-14-13) Mehrabi et al. [2021](#page-15-20); Xin et al. [2021](#page-15-25); Huang et al. [2020\)](#page-15-12) are also important for SpMM on GPU. They are crucial for SpMM, especially with many dense columns(usually more than 128 columns), because the computation will be more *workload* intensive and bounded by the memory access for dense columns. However, we focus on SpMM with fewer dense columns(usually less than 8 columns), which are more *balance* intensive and bounded by the maximum warp execution cycles.

SpMM has two orthogonal atomic parallelisms: minimal data can be (1)  $\{\frac{1}{g}, 1, g\}$  non-zeros of the sparse matrix and  $\{\frac{1}{c}, 1, c\}$  columns of the dense matrix; (2)  $\{\frac{1}{g}, 1, g\}$  rows of the sparse matrix and  $\{\frac{1}{c}, 1, c\}$  columns of the dense matrix.  $c \in \mathbb{Z}^+$  and  $g \in \mathbb{Z}^+$  are tunable parameters. Though they can be 1, they have diferent meanings from 1, because they are *tunable*. Therefore, the atomic parallelism space of SpMM is described in  $\langle x, \text{nnz}, y \text{ col} \rangle$  or  $\langle x, \text{row}, y \text{ col} \rangle$ . Resource parallelism only multiplies one element of the atomic parallelism. For example, given resource parallelism *r*, the amount of executed data equals  $\langle r \times x \rangle$  *rnz*,  $\langle \text{vol} \rangle$  or *< x nnz*,*r* × *y col >*. Besides, a fractional amount of data means multiple threads may execute on the same datum. For example,  $\langle \frac{1}{g} row, 1 \text{ col} \rangle$  means that *g* threads execute the same row collaboratively.

## **3.3 SpMM optimization space formalization**

We use atomic parallelism and reduction parallelism { < ... >, *r* } to define an SpMM kernel. < ... > ∈ { $\frac{1}{g}$ , 1, *g* } *nnz*  $\times {\frac{1}{c}, 1, c}$ *col* or  ${\frac{1}{s}, 1, g}$ *row*  $\times {\frac{1}{c}, 1, c}$ *col*. They describe the minimal data. And the *reduction parallelism*  $r \in \{2, 4, 8, 16, 32\}$  assigns how many threads are synchronized each time. Figure [7](#page-7-1) illustrates the SpMM optimization space.

However, not all points in the atomic parallelism space are legal in optimization space. Figure [8](#page-7-2) illustrates the details of space pruning. There are three rules for legal points:

- (1)  $\left\{ \langle \frac{1}{g} nnz, xcol \rangle, r \right\}, \left\{ \langle \frac{1}{g} nnz, \frac{1}{g} col \rangle, r \right\}$  are illegal because one non-zero must by multiplied by at least one element in the dense matrix.
- (2)  $\left\{ \langle \frac{1}{g} \text{ row}, x \text{ col} >, r \right\} \left( \frac{r}{g} < 1 \right) \text{ is illegal because parallel}$ reduction only has one writeback thread.



<span id="page-7-1"></span>**Fig. 7** SpMM optimization space. The grey area is illegal. The dashed line part of the axis represents hardware dependent end of the axis



<span id="page-7-2"></span>**Fig. 8** Projections of SpMM optimization space. Grey areas are illegal and hollow circles are legal points. Sub-fgures (**a**–**c**) correspond to Rule 1, 2, and 3 respectively

(3)  $\left\{ \left\langle \frac{1}{g} \text{ row}, \frac{1}{g} \text{ col} \right\rangle, r \right\}$  is illegal because it conflicts with the rule that resource parallelism only multiplies one element of the atomic parallelism.

The state-of-the-art algorithm space, DA-SpMM (Dai et al. [2022](#page-14-4)) is in the atomic parallelism design space. It proposes a three-dimensional SpMM algorithm design space. We claim that the design space of DA-SpMM is included in the atomic parallelism space. To be specific, EB+PR is {*<* 1 *nnz*, *c col >*, 32}, RB + PR is  ${< \frac{1}{32} row, c col >, 32},$  EB+SR is  ${< 32 nnz, c col >, 1},$ and  $\overline{R}B + SR$  is  $\{< 1 \text{ row}, c \text{ col} > 0, 1\}$ . *c* means coarsen factor, *g* means group size. Though real CUDA code with 1 *row* or 1 *nnz* may have minimal data greater than one

because of limited resource parallelism, we still label the algorithm as 1 *row* or 1 *nnz*. The RM/CM is the implementation detail and is included in atomic parallelism in theory.

## <span id="page-7-0"></span>**4 Segment group**

## **4.1 Current warp‑level abstraction**

Current sparse tensor compilers with CUDA backend take warp as the rank of a thread (*tiling*), a particular parallel unit (*synchronization*) or just a hardware instruction. For example, TACO assumes warp and thread to be the outer and inner loop, and the *warpSize* depends on the

<span id="page-8-0"></span>![](_page_8_Figure_1.jpeg)

<span id="page-8-2"></span>split factor. It should be noted that no synchronization behavior is assumed in this case. TACO also takes the 32-thread warp reduction as atomic addition at the GPU-Warp parallel unit and assumes users will split the last level loop with *warpSize* = 32. In this case, CUDA warp is taken as a for-loop with extent *warpSize* and incremental step 1. Then they will emit CUDA warp primitives such as *\_\_shfl\_down\_sync* to do the reduction. Figure [9](#page-8-0) illustrates TACO's current GPU Warp semantics. On the contrary, TVM(Chen et al. [2018\)](#page-14-14) only binds on thread and block level and does not assign any synchronization on the warp level. Instead, it takes 32 as a hardware feature and uses such intrinsic to fll in schedule parameters in autoscheduler. Besides, it also uses warp as a memory load unit in TIR(Chen et al. [2018](#page-14-14)).

## **4.2 Overview of segment group**

However, at least two existing assumptions should be improved for sparse compilers. First, the *tiling* and *synchronization* semantics of warp should be explicitly separated. As shown in atomic parallelism, the atomic and reduction parallelism can be diferent, and reduction parallelism is not necessarily 32. Second, synchronization semantics should be able to express various reduction strategies and fexible reduction granularity, instead of just parallel reduction for 32 threads. As shown in atomic parallelism,  $\{<1$  nnz,  $\ c \ col >, n\}$ requires synchronization of *n* threads with row number of their own. Therefore, the warp reduction should be able to reduce to diferent outputs instead of only one. Such change not only calls for changing the hand-coded warp level reduction functions but also for elevating the reduction pattern to higher-level compiler passes. Such semantics lifting calls for a new organization of basic blocks, new control flow, and new user-level APIs.

# **4.3 Relationship between segment group and atomic parallelism**

Atomic parallelism models the optimization space of sparsedense hybrid algebra from the reduction view. We use this model to propose new optimization techniques. As shown in Sect. [2](#page-2-1), reduction is the key operation of sparse-dense hybrid algebra, which contains many diferent tensor algebras such as SpMM, SDDMM, MTTKRP, and TTM. Based on this observation, we defne and explain segment group in Sect. [3,](#page-6-0) using SpMM as an example. We show that [3](#page-6-0) opens new optimization space for SpMM. Such beneft can be generalized to other sparse-dense hybrid algebra. However, it requires repetitive engineering efforts to optimize case by case. In response to this issue, we propose segment group, a new abstraction for sparse compilers to ship performance benefts brought by atomic parallelism to users with only several lines of code changed on the user side.

In summary, we propose that sparse compilers for GPU should have abstraction segment group, that is, a *warp* that takes the *tiling* semantics, and a *group* that does diferent types of reduction *synchronization*. We will use TACO<sup>[1](#page-8-1)</sup> to illustrate how to implement segment group, but other sparse compilers can also integrate segment group. Figure [10](#page-8-2) illustrates the workfow.

<span id="page-8-1"></span><sup>&</sup>lt;sup>1</sup> We build on commit d0654a8 [https://github.com/zhang677/taco/](https://github.com/zhang677/taco/tree/d0654a84137169883973c40a951dfdb89883fd9c) [tree/d0654a84137169883973c40a951dfdb89883fd9c](https://github.com/zhang677/taco/tree/d0654a84137169883973c40a951dfdb89883fd9c).

## <span id="page-9-0"></span>**5 Segment group for TACO**

The original *parallelize* transformation is defined as *parallelize(IndexVar i, ParallelUnit pu, OutputRaceStrategy rs)* (Senanayake et al. [2020](#page-15-15)). The transformation does parallel execution on IndexVar *i*, using ParallelUnit *pu*. And OutputRaceStrategy *rs* describes the data races during reductions. For GPU, *pu* can be GPUThread, GPUWarp, and GPUBlock. *rs* can be NoRaces, IgnoreRaces, and Atomics. We propose two new designs to TACO:

- 1. We add a new PrallelUnit, *GPUGroup*, to the *parallelize* transformation, and change the semantics of ParallelUnit *GPUWarp*.
- 2. We break the assumption that other transformations other than parallelize assumes serial code and design a new lower process to enable segment reduction.

#### **5.1 New parallelize transformation**

We assign the *tiling* semantics to GPUWarp and its *Atomic OutputRaceStrategy* will only serve to direct the lowering function instead of synchronization semantics. Because GPUWarp now only serves as the outer loop of tiling on threadIdx, it does not have *Atomic* semantics. Meanwhile, we add *GPUGroup* which has *ReductionStrategy* and *GroupSize* attributes instead of *OutputRaceStrategy*. *ReductionStrategy* describes the group's reduction type, and *GroupSize* assigns the reduction parallelism.

## **5.2 Reduction semantics elevation**

TACO assumes that a sparse algebra compiler should do it best to ensure that only elements that can generate non-zero output will be calculated (Kjolstad [2020\)](#page-15-8). However, we point out that this assumption is not necessarily valid. The previous assumption is the best option for performance because the sparse iteration space theory is built on the assumption that the code runs serially. For CUDA code, however, such assumption is broken, which we term as *zero extension*. Zero extension means that some "out-of-bound" reduction can be allowed in the sparse iteration theory because it can later be executed by some warp primitives faster than for-loop.

#### **5.3 Segment reduction lowering**

```
// Original CUDA code
for ( k=0; k< B2_dimension ; k++)\{pA2_begin=i_blockStarts [block];
   pA2<sub>end</sub>=i_blockStarts [block +1];
  fposA = block * 256;i p o s=tac o binarySearchBefore (
   A2-pos, pA2-begin, pA2-end, fposA);
   i = i - p \circ sfposA=block * 256 + fpos1:
   i f ( fposA>=A2 pos [ A1 dimension ] )
     break ;
   f = A2 \text{ord} [ fpos A ] :
  kB=f*B2 dimension+k:
  while (\text{fposA} == A2\text{-}pos[i\text{-}pos +1]) {
     i-p o s=i-p o s +1;
     i = i - p \circ s;
   }<br>kC=i∗C2_dimension+k;
  float val = 0.0;
   val=A -vals [fposA] * B -vals [kB];
   atomicAdd(\& C-values [kC], val);
}
Listing 1 Original CUDA code
```

```
//Modified CUDA code
for (k=0; k< B2-dimension; k++)\{pA2_begin=i_blockStarts [block];
  pA2 end=i_blockStarts [block +1];
  fposA=block * 256 + fpos1;i p o s=taco binarySearchBefor e (
  A2 pos , pA2 begin , pA2 end , fposA ) ;
   i\!=\!i -pos ;
  float \mathbf{v} \cdot \mathbf{a} = 0.0:
   if (fposA>=A2_pos [A1_dimension])
     v a l = 0;else {
     f=\overset{\cdot}{A}2\text{-crd} [fposA];
     kB = f * B2-dimension+k;
     while (\text{fpos}A=A2\text{-pos}[i\text{-pos}+1])i-p o s=i-p o s +1;
        i=i pos;
     }<br>
val=A_vals [fposA ] * B_vals [kB] ;
   }<br>kC=i*C2_dimension+k;
  segReduceWarp<float, 32 > (C_vals,
  kC, val);
}
Listing 2 Modified CUDA code
```
Listing 1 and Listing 2 show the diference between codes generated by the original TACO and the modifed TACO. They use the same schedule, except that code on the right uses segment reduction of GPUGroup with size 32.

**scalar workspace**. TACO assumes that the statement and the assignment of *scalar workspace* (Kjolstad et al. [2019\)](#page-15-14) are in the same basic block. However, this assumption is so strong that it restricts the expressive power of TACO. For example, in {*<* 1 *nnz*, *c col >*, 32} the scalar workspace should be assigned in a basic block belonging to an *else* but stated in the same context with reduction of scalar workspace, outside the assignment basic block.

**Macro instruction**. It is important to emit code in a modular way. Therefore, we design two new *macro instructions* *atomicAddGroup<T,G>(T\* array, int idx, T value)* and *segReducWarp<T,G>(T\* array, int idx, T value)*. They are template device functions that takes in the output array, the index of the output and the value reduced to the output.<sup>[2](#page-10-0)</sup> They will do some kind of reduction on *G* threads, and *G* equals *GroupSize*. They will be stated in the header fle and used as macro instructions in the fnal CUDA code. In fact, we borrow the *group* concept from the *cooperative group* in CUDA. Since CUDA 11.0, it has supported an easy-to-use API called cooperative group<sup>[3](#page-10-1)</sup> that makes it only one-linecode effort to change reduction granularity to less than 32 threads.

# **6 TACO's support for four SpMM algorithms**

This section will illustrate the atomic parallelism design space and our implementation of segment group. We frst reexamine two SpMM algorithms proposed by TACO (Senanayake et al. [2020](#page-15-15)). They use TACO to generate {*< g nnz*, *c col >*, 1} and {*< x row*, *c col >*, 1}. We then use

another two examples,  $\{<\frac{1}{g}row, c\,col>,r\}$  and  $\{<1\,nnz, c\,col>,r\}$  to illustrate how the CIN is changed. The tensor algebra expression is  $C(i, k) = A(i, j) * B(j, k)$ . *A*'s first level is dense and the second level is compressed. *B* and *C* are both dense matrices. *A*, *B*, and *C* all are row-major. We assume  $N = 4$  and that thread per block (resource parallelism *p*) equals 256. We explicitly fll *p*, *g*, *N*, *c* into the CIN to show their arithmetic relations with CIN parameters. The actual CIN will not have undetermined variables.

## **6.1 TACO SpMM reexamination**

Currently, TACO supports two algorithms in atomic parallelism. They don't need synchronization semantics and only tune on the tiling semantics. The implementation by TACO is shown in Listing 3 and 4. They force the synchronization granularity to be 1 which presents limited capability in reduction.

Concrete Index Notation for {*< g nnz*, *c col >*, 1} is:

```
suchthat (forall (block, forall (warp, forall (thread,
for all (dense-val , where (C(i, k) \text{+}= \text{tnnzC} , for all (\text{nnz} ,
tnnzC+=A(i,j)*B(j,k))), GPUThread, Atomics),
GPUWarp, NoRaces ) , GPUBlock , NoRaces ) ,
fuse(i,j,f) and pos(f,fpos,A(i,j)) and
split ( fpos , block , fpos1 , ( p*g/(N/c) )) and
split (fpos1, warp, nnz, g) and split (k, ko, thread, c)
and bound(ko, dense_val ,N/c, MaxExact)
```
**Listing 3** CIN for  $\{, 1\}$ 

<span id="page-10-0"></span><sup>2</sup> We do not actually integrate these macro instructions into TACO, because it is fairly straightforward and purely engineering. When testing the kernels, we just replace the atomicAdd with the new macro instructions. We open-source the modifed TACO [https://github.com/](https://github.com/zhang677/taco/tree/parallelreduction) [zhang677/taco/tree/parallelreduction.](https://github.com/zhang677/taco/tree/parallelreduction)

<span id="page-10-1"></span><sup>3</sup> [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups) [html#cooperative-groups](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups).

Actually, TACO's *precompute* schedule fails to generate this CIN, so we use the IndexNotationRewriter technique mentioned in Sect. [2.4.1](#page-5-2) to get the CIN above. In the evaluation section of Senanayake et al. ([2020\)](#page-15-15) it assumes  $N = 128$ ,  $g = 16$ ,  $c = 4$ ,  $p = 512$ , which is a point in the {*< g nnz*, *c col >*, 1}.

Concrete Index Notation for {*< g row*, *c col >*, 1} is:

{*< g nnz*, *c col >*, 1}. These two algorithms only use the *tiling* semantics of GPUWarp.

#### **6.2 Two new algorithms**

We introduce two algorithms to overcome the restricted scheme forced by TACO to improve workload balance. The

```
such that (for all (block, for all (warp, for all (row,
for all (thread, for all (col, where (\hat{C}(i), k)+=tjC
for all (j, tjC+=A(i, j)*B(j, k))), GPUThread, NoRaces),
GPUWarp, NoRaces ) , GPUBlock , NoRaces ) , spli t ( i , block , io ,
p*g/(N/c)) and split (io, warp, row, g) and split (k, ko, col, c)and bound (ko, thread , N/c, MaxExact))
```

```
Listing 4 CIN for \{<q\,row, c\,col>, 1\}
```
The generated code can be directly executed. In the evaluation section of Senanayake et al. ([2020\)](#page-15-15) it assumes  $N = 128, g = 1, c = 4, p = 512$ , which is also a point in the

algorithms provide functionality to change group size and reduction strategy through tuning nnz and rows. Listing 5 and 6 show the implementation.

Concrete Index Notation for  $\{<\frac{1}{g}row, c\,col>, r\}$  is:

![](_page_11_Figure_11.jpeg)

<span id="page-11-0"></span>**Fig. 11** Newly generated SpMM kernels performance compared with original TACO's best SpMM kernel for diferent number of dense matrix columns *N*. Density is defned as the number of non-zeros divided by the multiplication of the number of rows and cols for sparse matrix

![](_page_12_Figure_1.jpeg)

**Listing 5** CIN for  $\{<\frac{1}{g}row, c \, col>, r\}$ 

We find that TACO can support  $g = 32$ ,  $r = 32$ , but it is not explored in the autoscheduling paper<sup>[4](#page-12-1)</sup>. GPUGroup is bound on the indexVar that does the reduction. Generated macro-instruction, *atomicAddWarp<Type>*, is changed to *atomicAddGroup<Type, G>* to enable more fne-grained thread synchronization.

Concrete Index Notation for  $\{<1$  *nnz*,  $\int$  *c col* >,  $r\}$  is:

We use NVCC 11.6 and CUDA 11.6 with the same compilation fags as Senanayake et al. ([2020](#page-15-15)) when testing TACO and the same compilation fag as Dai et al. [\(2022\)](#page-14-4) when testing the generalized tuning. We carry 25 tests for each kernel to get the average execution time when evaluating TACO's generated CUDA kernels. We use nsight-compute<sup>[5](#page-12-2)</sup> to get

```
suchthat (forall (block,forall (warp,forall (ki,forall (fpos1,where)
C(i, k)+=tmp, tmp=A((i, j)*B((j, k)), GPUThread, Atomics)), GPUWarp, NoRaces)
GPUBlock, IgnoreRaces), fuse(i, j, f) and pos(f, fpos, A(i, j)) and
split (fpos, block, fpos1, p/(N/c)) and split (k, ko, ki, c) and bound (ko,
warp, N/c, MaxExact) and parallelize (jpos1, GPUGroup, r, Segment))
```
This algorithm has no counterpart in the original TACO. We change the originally emitted *atomicAdd* to *segReduceGroup<Type,G>*, and the grouped segment reduction is done in the macro instruction. The lowerer of scalar workspace is changed to emit the code ready for segmented reduction.

## <span id="page-12-0"></span>**7 Evaluation**

**Experiment settings.** We evaluate the implementation and the generalization on three architectures:

- NVIDIA RTX 3090. Compute Capability 8.6 (68 Ampere SMs at 1.395 GHz, 24 GB GDDR6x, 936 GB/s bandwidth).
- NVIDIA RTX 2080. Compute Capability 7.5 (46 Turing SMs at 1.515 GHz, 8 GB GDDR6, 448 GB/s bandwidth).
- NVIDIA Tesla V100. Compute Capability 7.0 (80 Volta SMs at 1.370 GHz, 16 GB HBM2, 900 GB/s bandwidth).

the execution time of tuned dgSPARSE kernels. We use the same sparse matrices as Dai et al. ([2022](#page-14-4)). We evaluate on three diferent architectures to show that our techniques are not limited to specifc traits on certain generations of GPU, but are valid on common SIMT architectures.

## **7.1 Performance of two new algorithms for TACO**

This experiment aims to prove that segment group can improve the sparse compiler's expression ability and boost the performance of SpMM kernels generated by TACO. The dense input matrices have  $N = 4.6$  $N = 4.6$ 

**Against the static group size 32.** We use  $\{<\frac{1}{g}row, c\,col>,r\}$  to show the improvement brought by flexible group size *r*. Current TACO only supports  $g = 32, r = 32$ , so we keep the same *g* with TACO and change *r*. In Table [1](#page-13-0) we show that  $r = 8$  and  $r = 4$  can bring over 2.0x speedup on average. We also measure the *normalized speedup*. Normalized speedup of *A* over *B* means that

**Listing 6** CIN for  $\{<1$  *nnz*,  $ccol$  >,  $r\}$ 

<span id="page-12-1"></span> $4$  Senanayake et al. [\(2020](#page-15-15))'s authors shared their [code](https://drive.google.com/file/d/1qZbP7tY5N35N54JlmYkBHxY97HbgFSHE/view?usp=sharing) with us. We also use a similar code base to test our kernels in Sect. [7.](#page-12-0)

<span id="page-12-2"></span><sup>5</sup> [https://docs.nvidia.com/nsight-compute/NsightCompute/index.](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html) [html.](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html)

<span id="page-12-3"></span><sup>&</sup>lt;sup>6</sup> We open source the testing code at [https://github.com/zhang677/](https://github.com/zhang677/segTACO) [segTACO.](https://github.com/zhang677/segTACO)

<span id="page-13-0"></span>**Table 1** Flexible group size speedup

<span id="page-13-1"></span>

| Hardware                                               | $r = 8$ | $r = 8$ norm |                             |       | $r = 4$     |                                     | $r = 4$ norm |
|--------------------------------------------------------|---------|--------------|-----------------------------|-------|-------------|-------------------------------------|--------------|
|                                                        |         |              |                             |       |             |                                     |              |
| <b>RTX 2080</b>                                        | 2.451   | 2.478        |                             |       | 2.456       | 2.483                               |              |
| <b>RTX 3090</b>                                        | 2.236   | 2.284        |                             |       | 2.259       | 2.307                               |              |
| Tesla V100                                             | 2.086   | 2.143        | 2.094                       |       | 2.150       |                                     |              |
| <b>Table 2</b> Segment reduction<br>normalized speedup |         |              |                             |       |             | c $r = 4$ $r = 8$ $r = 16$ $r = 32$ |              |
|                                                        |         | 1            | 1.008                       |       | 1.025 1.085 | 1.272                               |              |
|                                                        |         |              | $\mathcal{D}_{\mathcal{L}}$ | 1.019 | 1.045       | 1.102                               | 1.291        |
|                                                        |         |              | 4                           | 1.063 | 1.095       | 1.205                               | 1.381        |
|                                                        |         |              |                             |       |             |                                     |              |

<span id="page-13-2"></span>**Table 3** Normalized performance of new algorithms

![](_page_13_Picture_748.jpeg)

if *A* performs better than *B*, we count the speedup; otherwise, we assume the user can choose the better algorithm, and the speedup is counted as 1.

**Against the original reduction.** We use {*<* 1 *nnz*, *c col >*,*r*} to illustrate the speedup brought by fexible reduction. Because they have diferent data types (nnz vs. row), we control *c* and *r*, and compare the execution of {*<* 1 *nnz*, *c col >*,*r*} with the best *g* configuration of  $\{<\frac{1}{g}row, c\,col>,r\}$  each dataset. We only do this experiment on RTX 3090 and record the normalized speedup here. In Table [2](#page-13-1) we show that segment reduction can bring up to  $1.3 \times$  speedup over atomicWarp reduction. Limited by the

<span id="page-13-4"></span>**Table 4** Speedup over original implementation

| Hardware        | geomean <sup>1a</sup> | max   | N   |
|-----------------|-----------------------|-------|-----|
| <b>RTX 3090</b> | 2.295                 | 4.316 | 128 |
|                 | 2.181                 | 4.432 | 64  |
|                 | 1.997                 | 4.271 | 16  |
|                 | 2.046                 | 7.819 | 4   |
| <b>RTX 2080</b> | 1.938                 | 4.379 | 128 |
|                 | 1.927                 | 4.430 | 64  |
|                 | 1.995                 | 5.019 | 16  |
|                 | 2.307                 | 8.582 | 4   |
| Tesla V100      | 1.874                 | 3.724 | 128 |
|                 | 1.824                 | 3.846 | 64  |
|                 | 1.693                 | 3.388 | 16  |
|                 | 1.852                 | 6.114 | 4   |

<sup>a</sup>We use geometric mean to reduce outlier bias

<span id="page-13-5"></span>**Table 5** Speedup over static implementation

| Hardware        | geomean | N   | Best static          |
|-----------------|---------|-----|----------------------|
| <b>RTX 3090</b> | 1.124   | 128 | < 8, 256, 8, 1/2 >   |
|                 | 1.114   | 64  | $<$ 4, 256, 8, 1/2 > |
|                 | 1.310   | 16  | < 8, 256, 8, 1/2 >   |
|                 | 1.406   | 4   | < 8,256,8,1>         |
| <b>RTX 2080</b> | 1.095   | 128 | $<$ 4, 256, 8, 1/2 > |
|                 | 1.114   | 64  | $<$ 4, 256, 8, 1/2 > |
|                 | 1.276   | 16  | $<$ 4, 256, 8, 1/2 > |
|                 | 1.310   | 4   | $<$ 4, 256, 8, 1/2 > |
| Tesla V100      | 1.137   | 128 | < 8, 256, 8, 1/2 >   |
|                 | 1.177   | 64  | < 8, 256, 8, 1/2 >   |
|                 | 1.367   | 16  | < 8, 256, 8, 1 >     |
|                 | 1.326   | 4   | < 8, 256, 8, 1 >     |

number of threads per warp in GPU, *r* can only be 1, 2, 4, 8, 16, 32. Therefore, users can try these values to tune *r* in practice.

**Against the original TACO SpMM algorithms.** In this experiment, we compare the performance between TACO's original SpMM algorithms {*< g nnz*, *c col >*, 1} and {*< x row*, *c col >*, 1} (Senanayake et al. [2020\)](#page-15-15) and two algorithms proposed by us,  $\{<\frac{1}{g}row, c\,col>, r\}$  and  $\{ \langle \text{Im } z, \text{ } c \text{ } col \rangle, r \}.$  We assign reasonable values to *g*, *c*, *x*, and *r*, and tune these parameters. We record the best performance of each algorithm on each dataset. From Table [3](#page-13-2) we conclude that segment group brings 1.1x∼1.2x normalized speedup. Figure [11](#page-11-0) shows the detailed data.

#### **7.2 Generalization of atomic parallelism**

In this experiment, we implement our atomic parallelism to dgSPARSE library,<sup>[7](#page-13-3)</sup> an open-source state-of-the-art SpMM and SDDMM library. We achieve up to  $2.7 \times$  speedup on a certain SpMM algorithm. We keep the same sparse input matrix format (CSR) with dgSPARSE. After profling, we fnd that row-major algorithms consistently outperform the col-major algorithms. Therefore, we target row-major. We are left with 4 algorithms:  $EB + SR + RM$ ,  $EB + PR +$  $RM$ ,  $RB + SR + RM$ ,  $RB + PR + RM$ . We will introduce the details of tuning  $RB + PR + RM$  and show the speedup.

To tune an actual GPU SpMM kernel, we require more fne-grained parameters than those in atomic parallelism. Parallelism is now two-fold: block-level and thread-level, instead of homogeneous threads. Besides, the memory hierarchy, such as the shared memory should be considered.

<span id="page-13-3"></span><sup>7</sup> <https://github.com/dgSPARSE>.

Moreover, parallelism is limited in the physical world. For example, the largest thread-level parallelism is 1024 because a block has at most 1024 threads. The largest block-level parallelism is also finite(less than  $2^{32} - 1$ ). GridSize can be arbitrary because the extra blocks will be taken care of by GPU scheduler.

Tuning parameters for  $RB + PR + RM$  can be divided into two categories. The frst is how many workers are assigned to process one chunk of data. The second is how many chunks of data are assigned to one worker.  $RB + PR + RM$ has 7 tunable parameters. A block process *tileSz* real columns. *workerSz* threads process one vectorized column and *threadRw* sparse rows. *groupSz* threads are synchronized. *blockSz* denotes the number of threads per threadblock. *workerDimR* denotes the block parallelism of sparse rows. A vectorized column has *coarsenSz* consecutive real columns. If the overall sparse row parallelism is less than the number of rows in the sparse matrix, one thread may process more than one row. The tiling is"Dense major"; dense columns are fully parallelized. Specifcally, *blockDim.x = min(N, tileSz) / coarsenSz \* workerSz*. Full source parallelism of one block is *max(blockSz, blockDim.x \* 2)*. In the dgSPARSE implementation,  $\text{tile}S_z = \text{worker}S_z = \text{group}S_z = 32$ , workerDimR equals the number of rows of the sparse matrix, *threadRw* = 1, *blockSz* = 256, and *coarsenSz*= $(N\%4=0)$ ? *4:(N%2==0)?2:1*.

Based on the insights of this paper, we should separate tiling and synchronization, add finer-grained parallelism, and more flexible workload of each thread. Therefore, we propose to tune four parameters: *< groupSz*, *blockSz*, *tileSz*,*workerDimR >*. Actually, workerDimR can be arbitrary. However, we set it to be power of 2 or reciprocal power of 2 times of the original value in order to explore the local area in the design space. As in atomic parallelism we set groupSz as 2, 4, 8, 16, or 32. tileSz is power of 2 larger than groupSz, and depends on *N*. blockSz is set 128,256, or 512 which are common values for the number of threads per threadblock. We tune the RB + PR + RM kernel for *N* = 4, 16, 64, 128. From Table [4](#page-13-4) we conclude that tuning can bring  $1.6 \times \sim 2.3 \times$  speedup over the original implementation. $\frac{8}{3}$  $\frac{8}{3}$  $\frac{8}{3}$ 

Because DA-SpMM introduces a decision tree model to choose the best confguration for a given sparse matrix, we further explore the maximum speedup that dynamic choices can bring. This experiment examines the necessity of designing a new model to choose the best parameters. From Table [5](#page-13-5) we conclude that the most significant speedup of dynamic choices is 1.1x∼1.4x.

## <span id="page-14-9"></span>**8 Conclusion**

We propose atomic parallelism to analyze sparse-dense hybrid algebra and propose new SpMM designs. Based on atomic parallelism propose a new abstraction segment group to sparse compilers and remedy the missing optimization opportunities. First, we implement the new abstraction in TACO and achieve up to  $1.2 \times$  speedup over TACO's original SpMM kernels. Then, we use atomic parallelism to tune an SpMM algorithm in dgSPARSE and get  $1.6 \times \sim 2.3 \times$ speedup on the tuned algorithm. In the future, atomic parallelism can be exposed as an auto-tuning API for users to explore diferent synchronization granularity and reduction strategy for sparse-dense hybrid algebra.

## **References**

- <span id="page-14-8"></span>Asgari, B., Hadidi, R., Cao, J., Lim, S.-K., Kim, H., et al.: Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 908–920 (2021). IEEE
- <span id="page-14-5"></span>Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1–11 (2009)
- <span id="page-14-1"></span>Bell, N., Dalton, S., Olson, L.N.: Exposing fne-grained parallelism in algebraic multigrid methods. SIAM J. Sci. Comput. **34**(4), 123–152 (2012)
- <span id="page-14-3"></span>Bik, A.J., Koanantakool, P., Shpeisman, T., Vasilache, N., Zheng, B., Kjolstad, F.: Compiler support for sparse tensor computations in mlir. [arXiv:2202.04305](http://arxiv.org/abs/2202.04305) (2022)
- <span id="page-14-2"></span>Bik, A.J., Wijshoff, H.A.: Compilation techniques for sparse matrix computations. In: Proceedings of the 7th International Conference on Supercomputing, pp. 416–424 (1993)
- <span id="page-14-14"></span>Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., *et al.*: Tvm: An automated endto-end optimizing compiler for deep learning. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594 (2018)
- <span id="page-14-6"></span>Chou, S., Kjolstad, F., Amarasinghe, S.: Format abstraction for sparse tensor algebra compilers. Proc. ACM Program. Lang. 2(OOP-SLA), 123–112330 (2018). <https://doi.org/10.1145/3276493>
- <span id="page-14-10"></span>Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM **13**(6), 377–387 (1970)
- <span id="page-14-4"></span>Dai, G., Huang, G., Yang, S., Yu, Z., Zhang, H., Ding, Y., Xie, Y., Yang, H., Wang, Y.: Heuristic adaptability to input dynamics for spmm on gpus. [arXiv:2202.08556](http://arxiv.org/abs/2202.08556) (2022)
- <span id="page-14-12"></span>Feng, S., Hou, B., Jin, H., Lin, W., Shao, J., Lai, R., Ye, Z., Zheng, L., Yu, C.H., Yu, Y., et al.: Tensorir: An abstraction for automatic tensorized program optimization. [arXiv:2207.04296](http://arxiv.org/abs/2207.04296) (2022)
- <span id="page-14-11"></span>Guennebaud, G., Jacob, B., et al.: Eigen. **3** http://eigen. tuxfamily.org (2010)
- <span id="page-14-0"></span>Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. **30** (2017)
- <span id="page-14-7"></span>Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: Eie: efficient inference engine on compressed deep neural network. ACM SIGARCH Comput. Archit. News **44**(3), 243–254 (2016)
- <span id="page-14-13"></span>Hidayetoğlu, M., Pearson, C., Mailthody, V.S., Ebrahimi, E., Xiong, J., Nagi, R., Hwu, W.-m.: At-scale sparse deep neural network

<span id="page-14-15"></span><sup>8</sup> We open source our implementation at [https://github.com/dgSPA](https://github.com/dgSPARSE/dgSPARSE-Library/commit/9e3e4c18f40e76b97a805b8a9733258f7e9edeb6) [RSE/dgSPARSE-Library/commit/9e3e4c18f40e76b97a805b8a9](https://github.com/dgSPARSE/dgSPARSE-Library/commit/9e3e4c18f40e76b97a805b8a9733258f7e9edeb6) [733258f7e9edeb6.](https://github.com/dgSPARSE/dgSPARSE-Library/commit/9e3e4c18f40e76b97a805b8a9733258f7e9edeb6)

inference with efficient gpu implementation. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2020). IEEE

- <span id="page-15-19"></span>Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp. 300–314 (2019)
- <span id="page-15-12"></span>Huang, G., Dai, G., Wang, Y., Yang, H.: Ge-spmm: general-purpose sparse matrix-matrix multiplication on gpus for graph neural networks. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2020). IEEE
- <span id="page-15-0"></span>Kipf, T.N., Welling, M.: Semi-supervised classifcation with graph convolutional networks. [arXiv:1609.02907](http://arxiv.org/abs/1609.02907) (2016)
- <span id="page-15-14"></span>Kjolstad, F., Ahrens, P., Kamil, S., Amarasinghe, S.: Tensor algebra compilation with workspaces, 180–192 (2019)
- <span id="page-15-7"></span>Kjolstad, F., Kamil, S., Chou, S., Lugato, D., Amarasinghe, S.: The tensor algebra compiler. Proc. ACM Program. Lang. 1(OOPSLA), 77–17729 (2017). <https://doi.org/10.1145/3133901>
- <span id="page-15-8"></span>Kjolstad, F.: Sparse tensor algebra compilation. Ph.d. thesis, Massachusetts Institute of Technology, Cambridge, MA (2020). [http://](http://tensor-compiler.org/files/kjolstad-phd-thesis-taco-compiler.pdf) [tensor-compiler.org/fles/kjolstad-phd-thesis-taco-compiler.pdf](http://tensor-compiler.org/files/kjolstad-phd-thesis-taco-compiler.pdf)
- <span id="page-15-2"></span>Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. **51**(3), 455–500 (2009)
- <span id="page-15-13"></span>Kurt, S.E., Raje, S., Sukumaran-Rajam, A., Sadayappan, P.: Sparsityaware tensor decomposition. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 952–962 (2022). IEEE
- <span id="page-15-17"></span>Lin, C.-Y., Luo, L., Ceze, L.: Accelerating spmm kernel with cachefrst edge sampling for graph neural networks. [arXiv:2104.10716](http://arxiv.org/abs/2104.10716) (2021)
- <span id="page-15-1"></span>Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814 (2015)
- <span id="page-15-20"></span>Mehrabi, A., Lee, D., Chatterjee, N., Sorin, D.J., Lee, B.C., O'Connor, M.: Learning sparse matrix row permutations for efficient spmm on gpu architectures. In: 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 48–58 (2021). IEEE
- <span id="page-15-22"></span>Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)
- <span id="page-15-11"></span>Nisa, I., Li, J., Sukumaran-Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse mttkrp on gpus. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 123–133 (2019). IEEE
- <span id="page-15-9"></span>Popoola, T., Shankar, R., Rift, A., Singh, S., Davis, E.C., Strout, M.M., Olschanowsky, C.: An object-oriented interface to the sparse polyhedral library. In: 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1825–1831 (2021). IEEE
- <span id="page-15-24"></span>Qin, E., Garg, R., Bambhaniya, A., Pellauer, M., Parashar, A., Rajamanickam, S., Hao, C., Krishna, T.: Enabling fexibility for sparse tensor acceleration via heterogeneity. [arXiv:2201.08916](http://arxiv.org/abs/2201.08916) (2022)
- <span id="page-15-15"></span>Senanayake, R., Hong, C., Wang, Z., Wilson, A., Chou, S., Kamil, S., Amarasinghe, S., Kjolstad, F.: A sparse iteration space transformation framework for sparse tensor algebra. Proc. ACM Program. Lang. **4**(OOPSLA) (2020).<https://doi.org/10.1145/3428226>
- <span id="page-15-3"></span>Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Characterizing the impact of soft errors on iterative methods in scientifc computing. In: Proceedings of the International Conference on Supercomputing, pp. 152–161 (2011)
- <span id="page-15-6"></span>Strout, M.M., Hall, M., Olschanowsky, C.: The sparse polyhedral framework: composing compiler-generated inspector-executor code. Proc. IEEE **106**(11), 1921–1934 (2018)
- <span id="page-15-5"></span>Venkat, A., Hall, M., Strout, M.: Loop and data transformations for sparse matrix code. ACM SIGPLAN Not. **50**(6), 521–532 (2015)
- <span id="page-15-16"></span>Wang, Z., Wohlwend, J., Lei, T.: Structured pruning of large language models. [arXiv:1910.04732](http://arxiv.org/abs/1910.04732) (2019)
- <span id="page-15-23"></span>Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel math kernel library. In: High-Performance Computing on the Intel® Xeon PhiTM, pp. 167–188. Springer, Cham (2014)
- <span id="page-15-25"></span>Xin, J., Ye, X., Zheng, L., Wang, Q., Huang, Y., Yao, P., Yu, L., Liao, X., Jin, H.: Fast sparse deep neural network inference with fexible spmm optimization space exploration. In: 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2021). IEEE
- <span id="page-15-18"></span>Yang, C., Buluç, A., Owens, J.D.: Design principles for sparse matrix multiplication on the gpu. In: European Conference on Parallel Processing, pp. 672–687 (2018). Springer
- <span id="page-15-10"></span>Ye, Z., Lai, R., Shao, J., Chen, T., Ceze, L.: Sparsetir: composable abstractions for sparse compilation in deep learning
- <span id="page-15-21"></span>Yu, Z., Dai, G., Huang, G., Wang, Y., Yang, H.: Exploiting online locality and reduction parallelism for sampled dense matrix multiplication on gpus. In: 2021 IEEE 39th International Conference on Computer Design (ICCD), pp. 567–574 (2021). IEEE
- <span id="page-15-4"></span>Yuster, R., Zwick, U.: Detecting short directed cycles using rectangular matrix multiplication and dynamic programming. In: SODA, vol. 4, pp. 254–260 (2004). Citeseer

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

![](_page_15_Picture_30.jpeg)

**Genghan Zhang** is a fourth-year undergraduate student in Tsinghua University Department of Electronic Engineering. His research interests lie in domainspecifc language and computer architecture.

![](_page_15_Picture_32.jpeg)

**Yuetong Zhao** born in 2002. She is an undergraduate of Tsinghua University. Her major is electronic engineering.

![](_page_16_Picture_2.jpeg)

**Yanting Tao** born in 2002. She is an undergraduate of Tsinghua University. Her major is electronic engineering.

**Zhongming Yu** is a Ph.D. student advised by Prof. Jishen Zhao in the Computer Science and Engineering Department of UC San Diego, starting from 2022. His interests lie in the intersection between efficient systems and machine learning, with a focus on a memory-centric perspective. Before coming to UCSD, he received his B.E. degree from the Department of Electronic Engineering at Tsinghua

![](_page_16_Picture_4.jpeg)

**Sitao Huang** is an assistant professor in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. He received his Ph.D. degree and M.S. degree in Electrical and Computer Engineering from University of Illinois at Urbana-Champaign in 2021 and 2017 respectively. He received his B.S. degree in Electronics Engineering from Tsinghua University in 2014. His research interests include hardware accelerators, compilers for accelerators, and heterogeneous

systems. He is a 2022 DARPA Forward Riser. His research won the Best Paper Award at IDEAL 2021, Best Paper Nomination at ASP-DAC 2021, and the Student Innovation Award at the 2018 IEEE HPEC Graph Challenge.

![](_page_16_Picture_7.jpeg)

**Yuan Wen** is an Assistant Professor at the University of Aberdeen. He received his PhD from the Informatics School of the University of Edinburgh. Prior to his current job, He has been a Research Fellow at the University of Edinburgh and Trinity College Dublin. His research interests include AI/ML workloads optimization, efficient code generation, and softwarehardware co-design. The research targets creating highperformance code and hardware logic for heterogeneous systems

of diferent scales, from data centres to battery-powered wearable devices.

![](_page_16_Picture_10.jpeg)

**Pavlos Petoumenos** is an Assistant Professor in the University of Manchester, UK, and a Research Fellow of the Royal Academy of Engineering. He received his Diploma and his PhD from the University of Patras, Greece, in 2005 and 2011 respectively. Later, he joined the University of Edinburgh as a post-doctoral researcher. His work covers a range of topics in Architecture and Compilers, from CPU cache replacements policies to deep learned heuristics for compiler optimizations.

His papers have received awards in multiple IEEE and ACM conferences (IISWC, CGO, PACT, ISSTA).

![](_page_16_Picture_13.jpeg)

![](_page_16_Picture_14.jpeg)

**Guohao Dai** received the B.S. and Ph.D. (with honor) degrees from Tsinghua University, Beijing, in 2014 and 2019. He is joining Shanghai Jiao Tong University, Shanghai, China, as an Associate Professor. Guohao's research mainly focuses on large-scale sparse graph computing, heterogeneous hardware computing, emerging hardware architecture, etc. He has received Best Paper Award in ASP-DAC 2019, and Best Paper Nomination in DAC 2022 and DATE 2018. He is the winner of the NeurIPS Billion-

University.

Scale Approximate Nearest Neighbor Search Challenge in 2021, and the recipient of the Outstanding Ph.D. Dissertation Award of Tsinghua University in 2019. Currently, he serves as PI/Co-PI for several projects with a personal share of over RMB 6 million.

![](_page_17_Picture_1.jpeg)

**Yu Wang** received the B.S. and Ph.D. (with honor) degrees from Tsinghua University, Beijing, in 2002 and 2007. He is currently a tenured professor with the Department of Electronic Engineering, Tsinghua University. His research interests include brain inspired computing, parallel circuit analysis, application specifc acceleration, power/reliability aware circuit, and system design methodology. He has authored and co-authored more than 350 papers in refereed journals and conferences. He has received Best Paper Award in ASP-DAC 2019, FPGA 2017, NVMSA 2017, ISVLSI 2012, Best Poster Award in HEART 2012, and 11 Best Paper Nominations (DAC22, ICT18, DATE18, DAC17, ASPDAC16, ASPDAC14, ASPDAC12, 2 in ASPDAC10, ISLPED09, CODES09). He is a recipient of the Alexander von Humboldt Fellowship (2019), the DAC Under-40 Innovators Award (2018), and the IBM X10 Faculty Award (2010). He served as TPC chair for ICFPT 2019 and 2011, ISVLSI 2018, fnance chair for ISLPED 2012-2016, track chair for DATE 2017-2019 and GLSVLSI 2018, and served as a program committee member for leading conferences in these areas, including top EDA conferences such as DAC, DATE, ICCAD, ASP-DAC, and top FPGA conferences such as FPGA and FPT. Currently, he serves as associate editor of the IEEE TCAD, the ACM TODAES, and the IEEE TCSVT.