
1 Introduction

There is growing interest in the scientific computing community in big data analytics. Recent approaches aim to bring to big data analytics the methods and techniques that are well established in the mature field of optimization for high-performance computing (HPC) [20]. These efforts rely on the observation that graphs often constitute the spine of the data structures used in analyzing big data (as the data is almost always sparse), and that the adjacency-list representation of a graph corresponds to a sparse matrix. Hence, analysis operations on big data can be expressed in terms of basic sparse matrix kernels. For example, the popular graph mining library PEGASUS (A Peta-Scale Graph Mining System) [17] uses an optimized sparse matrix–vector multiplication kernel, called GIM-V, as the basic operation in several graph mining algorithms such as PageRank, spectral clustering, and finding connected components. This work focuses on efficient parallelization of two other important sparse kernels on distributed systems: sparse matrix–matrix multiplication (SpGEMM) of the form \(C=AB\) and sparse matrix–dense matrix multiplication (SpMM) of the form \(Y=AX\).

The SpGEMM kernel finds application in a wide range of domains such as finding all-pairs shortest paths (APSP) [11], finite element simulations based on domain decomposition (e.g., finite element tearing and interconnect (FETI) [13]), molecular dynamics (e.g., CP2K [10]), computational fluid dynamics [19], climate simulation [25], and interior point methods [6]. These applications necessitate efficient large-scale parallelization in order to obtain shorter running times for processing today’s rapidly growing “Big Data”. Several software packages provide SpGEMM for distributed-memory architectures, such as Trilinos [15] and Combinatorial BLAS [7]. Trilinos uses one-dimensional (1D) partitioning of the input matrices: matrices A and C are stationary, whereas matrix B is communicated in K stages on a parallel system with K processors, which corresponds to replicating matrix B among the K processors over K stages. Combinatorial BLAS uses the parallel matrix multiplication algorithm SUMMA [30], originally designed for dense matrices. The motivation of Combinatorial BLAS is large-scale graph analytics for “Big Data”; it also contains scalable implementations of kernel operations such as sparse matrix–vector multiplication (SpMV) and subgraph extraction. Recently, a matrix-partitioning method based on two-constraint hypergraph partitioning was proposed in [3] for reducing total message volume during outer-product–parallel SpGEMM. To our knowledge, [3] is the first work that proposes preprocessing the sparsity patterns of the matrices in order to reduce parallelization overheads; it also proposes partitioning the input and output matrices simultaneously.

SpMM is also an important kernel, and many graph analysis techniques, such as centrality measures, use it as a building block. Apart from its popularity in block methods in linear algebra [14, 22, 23], it is a very basic kernel in graph analysis, as several works have pointed out its relation to graph algorithms [2, 8, 17, 24, 28]. In these methods, the dimensions of the dense matrices X and Y are usually very small compared to the dimensions of the sparse matrix A. The importance of this kernel is also acknowledged by the vendor libraries Intel MKL [1] and Nvidia cuSPARSE [21], which provide implementations for multi-core/many-core and GPU architectures, respectively.

Our contributions in this work are centered around partitioning models that efficiently parallelize these two kernels on distributed systems by reducing communication overheads. The proposed model for the SpGEMM kernel aims to reduce total message volume, while the proposed model for the SpMM kernel consists of two phases and strives to reduce both total and maximum message volume. Experiments on up to 1024 processes show that scalability can be drastically improved using the proposed models.

The rest of the paper is organized as follows: Section 17.2 describes the SpGEMM kernel and the proposed model for parallelization. Section 17.3 describes the SpMM kernel and the two-phase methodology to achieve parallelization. Section 17.4 presents our experiments separately for these two kernels and Sect. 17.5 concludes.

2 Parallelization of the SpGEMM Kernel

We investigate efficient parallelization of the SpGEMM operation on distributed-memory architectures. Communication overhead and imbalance in the computational loads of processors become significant bottlenecks in large-scale parallelization. Thus, we propose an intelligent matrix-partitioning method that reduces total message volume while maintaining balance on the computational loads of processors in outer-product-based parallelization of the SpGEMM operation.

In Sect. 17.2.1, the outer-product-based parallelization of the SpGEMM operation is presented. We present the hypergraph model for this outer-product-based parallelization in Sect. 17.2.2. Section 17.2.3 describes how to decode a hypergraph partition as a matrix partition.

2.1 Outer-Product–Parallel SpGEMM Algorithm

We consider the SpGEMM computation of the form \(C\!=AB\). Here, A and B denote the input matrices, and C denotes the output matrix, where all of these matrices are sparse.

The outer-product–parallel SpGEMM operation uses 1D columnwise and 1D rowwise partitioning of the input matrices A and B, respectively, as follows:

$$\begin{aligned} \hat{A}=AP=[A_1 A_2 \cdots A_K] \ \ \mathrm{and }\ \ \hat{B}=PB=\left[ \begin{array}{c}B_1\\ B_2\\ \vdots \\ B_K\end{array}\right] . \end{aligned}$$
(17.1)

In Eq. (17.1), K denotes the number of parts, which is equal to the number of processors of the parallel system, and P denotes the permutation matrix obtained from partitioning. The same permutation is used for reordering the columns of matrix A and the rows of matrix B, so that each outer product can be performed locally, without any communication.

According to the input matrix partitioning given in Eq. (17.1), the SpGEMM computation is performed in two steps. The first step consists of local outer-product computations performed as follows by each processor \(P_k\):

$$\begin{aligned} C^k=A_kB_k \ \ \mathrm{where} \ \ k=1,2,\ldots ,K. \end{aligned}$$

The second step consists of summing the low-rank \(C^k\) matrices, which incurs communication. All processors collectively perform the following operation:

$$\begin{aligned} C=C^1+C^2+\cdots +C^K. \end{aligned}$$
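To make the two steps concrete, the following Python sketch simulates the outer-product–parallel SpGEMM using SciPy sparse matrices. The partition array `part` and the function name are illustrative assumptions; in a real distributed run the summation in the second step is realized by communicating the nonzeros of the \(C^k\) matrices rather than by a serial loop.

```python
import numpy as np
import scipy.sparse as sp

def outer_product_spgemm(A, B, part, K):
    """Simulate the two-step outer-product-parallel SpGEMM C = A * B.

    part[x] = k means the outer product of column x of A with row x of B
    (a hypothetical partition array) is assigned to processor P_k.
    """
    A = sp.csc_matrix(A)
    B = sp.csr_matrix(B)

    # Step 1: each processor P_k computes its local product C^k = A_k * B_k.
    C_partial = []
    for k in range(K):
        cols = np.flatnonzero(part == k)   # columns of A / rows of B owned by P_k
        C_partial.append(A[:, cols] @ B[cols, :])

    # Step 2: sum the partial results; in a real run this step communicates
    # the nonzeros of the C^k matrices, here it is a serial reduction.
    C = C_partial[0]
    for C_k in C_partial[1:]:
        C = C + C_k
    return C.tocsr()

# Small usage example on a random sparse instance.
A = sp.random(8, 8, density=0.3, format="csc", random_state=1)
B = sp.random(8, 8, density=0.3, format="csr", random_state=2)
K = 2
part = np.random.default_rng(0).integers(0, K, size=8)
C = outer_product_spgemm(A, B, part, K)
assert np.allclose(C.toarray(), (A @ B).toarray())
```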

2.2 Hypergraph Model

We propose a hypergraph partitioning (HP) based method to reduce the total message volume incurred in the second step of the outer-product–parallel SpGEMM algorithm (Sect. 17.2.1) while maintaining balance on the computational loads of the outer products performed by processors in the first step. The objective of this partitioning is to cluster the columns of matrix A and the rows of matrix B that contribute to the same nonzeros of matrix C into the same parts as much as possible. In other words, the outer-product computations that contribute to the same C-matrix nonzeros are likely to be performed by the same processor without any communication.

We model an SpGEMM instance \(C\!=AB\) as a hypergraph \({\mathscr {H}}(C,A,B)=({\mathscr {V}},{\mathscr {N}})\). \({\mathscr {V}}\) contains a vertex \(v_x\) for each outer product of the xth column of A with the xth row of B. \(v_x\) represents the task of computing this outer product in the first step of the parallel SpGEMM algorithm given in Sect. 17.2.1; this step requires no communication. \({\mathscr {N}}\) contains a net (hyperedge) \(n_{i,j}\) for each nonzero \(c_{i,j}\) of C. Net \(n_{i,j}\) connects the vertices representing the outer products that produce scalar partial results for \(c_{i,j}\). That is,

$$ Pins(n_{i,j})= \{v_x: x \in cols(a_{i,*}) \wedge x \in rows(b_{*,j}) \}. $$

Here, \(cols(a_{i,*})\) denotes the set of column indices of the nonzeros in row i of A (\(a_{i,*}\)), whereas \(rows(b_{*,j})\) denotes the set of row indices of the nonzeros in column j of B (\(b_{*,j}\)). Hence, \(n_{i,j}\) represents the summation of scalar partial results to obtain the final value of \(c_{i,j}\) in the second step of the SpGEMM algorithm. Each vertex \(v_x\in {\mathscr {V}}\) is associated with a weight \(w(v_x)\) as follows:

$$ w(v_{x})=|Nets(v_{x})|, $$

where \(Nets(v_x)\) denotes the set of nets that connect vertex \(v_x\). This vertex weight equals the number of scalar multiply-and-add operations of the outer product, so it encodes the amount of computation represented by \(v_x\). Each net \(n_{i,j}\in {\mathscr {N}}\) is associated with a unit weight, i.e.,

$$\begin{aligned} w(n_{i,j})=1. \end{aligned}$$

This unit net weight reflects that each scalar partial result communicated for \(c_{i,j}\) amounts to a single word, while the net itself captures the multi-way relation among the outer products that contribute to \(c_{i,j}\).
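As an illustration, the following sketch builds the pin lists and vertex weights of \({\mathscr {H}}(C,A,B)\) from the sparsity patterns of A and B, assuming SciPy CSC/CSR inputs; the dictionary-of-sets layout is a simple illustrative data structure, not the input format of any particular partitioner.

```python
import scipy.sparse as sp

def build_spgemm_hypergraph(A, B):
    """Build H(C, A, B): one vertex per outer product (column x of A with
    row x of B) and one net per nonzero c_{i,j} of C = A * B.

    Returns (nets, vertex_weight), where nets maps (i, j) to the set of
    outer-product vertices contributing to c_{i,j}, and
    vertex_weight[x] = |Nets(v_x)| is the multiplication count of v_x.
    """
    A = sp.csc_matrix(A)
    B = sp.csr_matrix(B)
    n = A.shape[1]

    nets = {}                   # (i, j) -> pins of net n_{i,j}
    vertex_weight = [0] * n
    for x in range(n):
        a_col = A.indices[A.indptr[x]:A.indptr[x + 1]]   # rows i with a_{i,x} != 0
        b_row = B.indices[B.indptr[x]:B.indptr[x + 1]]   # cols j with b_{x,j} != 0
        for i in a_col:
            for j in b_row:
                nets.setdefault((int(i), int(j)), set()).add(x)  # v_x contributes to c_{i,j}
                vertex_weight[x] += 1                            # one multiply-and-add
    return nets, vertex_weight
```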

2.3 Decoding Hypergraph Partitioning as Matrix Partitioning

A vertex partition \(\varPi ({\mathscr {V}})=\{{\mathscr {V}}_1, {\mathscr {V}}_2, \ldots , {\mathscr {V}}_K\}\) can be decoded as a conformal columnwise partition of A and rowwise partition of B. That is, \(v_x\in {\mathscr {V}}_k\) is decoded as assigning the outer product of the xth column of A with the xth row of B to processor \(P_k\). If all pins of \(n_{i,j}\) reside in the same part \({\mathscr {V}}_k\) (i.e., the net is uncut), the summation operation for \(c_{i,j}\) is performed locally by \(P_k\). Otherwise, the net is cut and this summation operation is assigned to one of the processors that produce a scalar partial result for \(c_{i,j}\). Hence, \(\varPi \) induces a 1D partition of A and B, and a 2D nonzero-based partition of C.

In the proposed HP-based method, the partitioning constraint of maintaining balance on part weights corresponds to maintaining balance on the computational loads of processors. Each cut net incurs communication of scalar partial results. The communication volume due to a nonzero \(c_{i,j}\) is equal to one less than the number of parts in which \(n_{i,j}\) has pins, which equals the number of processors that send their scalar partial results to the processor responsible for \(c_{i,j}\). Thus, the partitioning objective of minimizing the cutsize according to the connectivity-1 metric [9] corresponds to minimizing the total message volume.
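Given a part assignment of the outer-product vertices, the total message volume implied by the connectivity-1 metric can be computed directly from the net pin lists, as in the following sketch, which reuses the hypothetical `nets` structure from the previous sketch.

```python
def total_message_volume(nets, part):
    """Total message volume (in words) implied by a vertex partition.

    nets maps each nonzero c_{i,j} to the set of outer-product vertices
    that contribute to it; part[x] is the part/processor of vertex v_x.
    Each cut net n_{i,j} contributes lambda(n_{i,j}) - 1 words.
    """
    volume = 0
    for pins in nets.values():
        parts = {part[x] for x in pins}   # connectivity set of the net
        volume += len(parts) - 1          # connectivity-1 metric, unit net weight
    return volume
```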

3 Parallelization of the SpMM Kernel

We consider the SpMM operation of the form \(Y=AX\), where A is an \(n\times n\) sparse matrix and X and Y are \(n \times s\) dense matrices. Whatever the context, SpMM is often an expensive operation, and hence its parallelization must be handled with care in order to obtain the best performance. In this regard, communication metrics centered around volume play a crucial role. Assuming \(Y=AX\) is performed in a repeated/iterative manner, where the elements of X change in each iteration while the elements of A remain the same, the partitions on the X and Y matrices should be conformable in order to avoid unnecessary communication during the parallel operations.

In a system with K processors, we consider the problem of obtaining a rowwise partition of A, where processor \(P_k\) stores the submatrix blocks \(A_{k\ell }\), for \(1 \le \ell \le K\), where the size of \(A_{k\ell }\) is \(n_k \times n_{\ell }\). These submatrix blocks form the row stripe \(R_k\), and \(P_k\) is held responsible for computing \(Y_k = R_k X\). Since \(P_k\) only stores \(X_k\), it needs to receive the corresponding elements of X from other processors to compute \(Y_k\). This necessitates point-to-point communication between processors. This scheme is called the one-dimensional row-parallel algorithm and it consists of the following steps for each processor \(P_k\), \(1 \le k \le K\):

  1. For each off-diagonal block \(A_{\ell k}\), for \(1 \le \ell \le K\), with at least one nonzero in it, \(P_k\) sends the respective elements of \(X_k\) to processor \(P_{\ell }\). If \(a_{ij}\) is a nonzero in this off-diagonal block, then the jth row of X needs to be communicated.

  2. Perform computations on the local submatrix \(A_{kk}\) using \(X_k\). Local computations do not necessitate communication. \(Y_k\) is first set to the result of this computation.

  3. For each off-diagonal block \(A_{k\ell }\), for \(1 \le \ell \le K\), with at least one nonzero in it, \(P_k\) receives the respective elements of X from \(P_{\ell }\) in order to perform computations on the respective nonlocal submatrix block. \(Y_k\) is updated with the results of these nonlocal computations, yielding its final value.

Then, possibly with some dense matrix operations involving Y, the new X is computed and used in the next iteration. The local submatrix block is simply the diagonal block owned by the respective processor (\(A_{kk}\)), whereas the nonlocal submatrix blocks are the off-diagonal ones (\(A_{k\ell }\) with \(k \ne \ell \)). As hinted above, computations involving the local block can be carried out without communication, whereas computations involving nonempty nonlocal blocks necessitate communication. In the following sections, we describe how to distribute these three matrices to processors via a hypergraph partitioning model that minimizes total communication volume, and a second model, applied on top of the former, that reduces maximum volume; together they address two important volume-based communication cost metrics. Before that, the row-parallel scheme itself is illustrated below.
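The following mpi4py sketch shows the three steps executed by a single processor \(P_k\). It assumes that the row stripe has already been split into a local block (acting on the owned rows \(X_k\)) and an external block whose columns are renumbered according to the order in which external rows of X arrive; `send_rows`, `recv_order`, and this splitting are hypothetical inputs that would be precomputed from the partition, not part of any library API.

```python
from mpi4py import MPI
import numpy as np

def row_parallel_spmm(A_local, A_external, X_k, send_rows, recv_order, comm):
    """One Y_k = R_k * X step on processor P_k (illustrative sketch).

    A_local    : A_kk in CSR form, acting on the locally owned rows X_k
    A_external : the off-diagonal part of the row stripe, with columns
                 renumbered to the order in which external X rows arrive
    send_rows  : dict  l -> local indices of X_k rows needed by P_l
    recv_order : list of (l, count) pairs, giving how many external rows
                 arrive from each P_l, in the external column order
    """
    # Step 1: post nonblocking sends of the required rows of X_k.
    send_bufs = {l: np.ascontiguousarray(X_k[rows]) for l, rows in send_rows.items()}
    reqs = [comm.Isend(buf, dest=l) for l, buf in send_bufs.items()]

    # Step 2: local computation, which overlaps with the communication.
    Y_k = A_local @ X_k

    # Step 3: receive external rows of X and apply the off-diagonal blocks.
    s = X_k.shape[1]
    chunks = []
    for l, count in recv_order:
        buf = np.empty((count, s), dtype=X_k.dtype)
        comm.Recv(buf, source=l)
        chunks.append(buf)
    if chunks:
        Y_k += A_external @ np.vstack(chunks)

    MPI.Request.Waitall(reqs)
    return Y_k
```

Posting the sends before the local multiplication lets the communication of step 1 overlap with the computation of step 2.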

In order to parallelize the SpMM kernel, we utilize the concept of an atomic task, which signifies the smallest computational granularity that cannot be divided further; hence, an atomic task is executed by a single processor. In SpMM, the atomic task is defined to be the multiplication of row \(a_{i,*}\) with the whole X. The result of this multiplication is the row \(y_{i,*}\). In the hypergraph model, the atomic tasks are represented by vertices.

3.1 Hypergraph Model

In the hypergraph model \({\mathscr {H}}=({\mathscr {V}},{\mathscr {N}})\), the vertices represent the atomic tasks of multiplying the rows of A with X, i.e., \(v_i\in {\mathscr {V}}\) represents \(a_{i,*}X\). Note that \(v_i\) also represents the row \(a_{i,*}\) as well as the computations associated with this row. The computational load of this task is the number of multiply-and-add operations it performs. The weight of \(v_i\) is set to this computational load, i.e.,

$$ w(v_i) = nnz(a_{i,*})\cdot s. $$

The dependencies among the computational tasks are captured by the nets. For each row of X, there exists a net \(n_j\) in the hypergraph that captures the dependency of the computational tasks on row j of X. In other words, \(n_j\) connects the vertices corresponding to the tasks that need row j of X in their computations. Hence, the vertices connected by \(n_j\) are given by

$$ Pins(n_j)=\{v_i: a_{ij} \ne 0\}. $$

\(n_j\) effectively represents column j of A as well. Note that \(|Pins(n_j)| = nnz(a_{*,j})\), where \(nnz(\cdot )\) denotes the number of nonzeros in a row or column of a matrix. Since the number of elements in a row of X is s, \(n_j\) is associated with a cost \(c(n_j)\) of s. In other words,

$$ c(n_j)=s. $$

This quantity signifies the number of X elements required by the computational tasks corresponding to the vertices in \(Pins(n_j)\). As a result, there are n vertices, n nets, and nnz(A) pins in \({\mathscr {H}}\). This model is a simple extension of the column-net model used for sparse matrix–vector multiplication [9].
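The sketch below constructs the vertex weights, pin lists, and net costs of this column-net hypergraph from a SciPy representation of A; the list-of-lists layout is again only an illustrative assumption, not the input format of a particular partitioner.

```python
import numpy as np
import scipy.sparse as sp

def build_spmm_hypergraph(A, s):
    """Column-net hypergraph for Y = A X (A: n x n sparse, X: n x s dense).

    Vertex v_i is the task a_{i,*} X with weight nnz(a_{i,*}) * s;
    net n_j connects the rows that need row j of X and has cost s.
    """
    A_csr = sp.csr_matrix(A)
    A_csc = sp.csc_matrix(A)
    n = A_csr.shape[0]

    vertex_weight = (np.diff(A_csr.indptr) * s).tolist()          # w(v_i)
    pins = [A_csc.indices[A_csc.indptr[j]:A_csc.indptr[j + 1]].tolist()  # rows i with a_{ij} != 0
            for j in range(n)]
    net_cost = [s] * n                                             # c(n_j) = s
    return vertex_weight, pins, net_cost
```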

3.2 Partitioning and Decoding

The formed hypergraph \({\mathscr {H}}=({\mathscr {V}},{\mathscr {N}})\) is then partitioned into K vertex parts to obtain \(\varPi = \{{\mathscr {V}}_1, \ldots , {\mathscr {V}}_K\}\). Without loss of generality, the set of rows corresponding to the vertices in \({\mathscr {V}}_k\) and the respective computations involving these rows are assigned to processor \(P_k\). A net \(n_j\) in the cut necessitates communication of the elements of row j of X, and this communication operation involves the processors corresponding to the parts in the connectivity set of this net. Specifically, if the processor \(P_k\) corresponding to \({\mathscr {V}}_k \in \varLambda (n_j)\) owns row j of X, then it needs to send this row to the remaining processors, i.e., to each processor \(P_{\ell }\) such that \({\mathscr {V}}_{\ell } \in \varLambda (n_j) - \{{\mathscr {V}}_k\}\), amounting to a volume of \(s \cdot (|\varLambda (n_j)|-1)\), where s is the number of elements in a row of X. An internal net does not necessitate communication, as all the rows corresponding to the vertices connected by this net belong to the same processor. The objective of minimizing the cutsize according to the connectivity metric [9] hence encodes the total volume of communication. The constraint of maintaining balance on the vertex part weights corresponds to maintaining balance on the computational loads of the processors.

The aforementioned formulation strives to reduce total communication volume. However, the high volume overhead of the SpMM kernel makes another volume-related metric, the maximum communication volume of a processor, important as well; we address it next.

3.3 A Volume-Balancing Extension for SpMM

The formulation used to balance the volume loads of processors is based on a hypergraph model that was successfully used to address multiple communication cost metrics for one- and two-dimensional sparse matrix–vector multiplication [26, 27, 29]. Although the main objective of this model is to reduce the latency overhead by minimizing the total message count, another important aspect of it is maintaining balance on communication volume. In our case, i.e., for SpMM, the latter proves to be more crucial. Note that maintaining balance on the volume loads of processors corresponds to providing an upper bound on the maximum volume. We extend this model for SpMM. The model assumes that a partition is already given, such as the one obtained in the previous section, which is indeed the partition we utilize.

Using the partition \(\varPi = \{{\mathscr {V}}_1, \ldots , {\mathscr {V}}_K\}\) obtained on \({\mathscr {H}}= ({\mathscr {V}}, {\mathscr {N}})\), we form another hypergraph \({\mathscr {H}}^C = ({\mathscr {V}}^C, {\mathscr {N}}^C)\) that summarizes the communication operations due to this partition. Recall that a net \(n_j\) necessitates communication if it is cut in \(\varPi \). Each such communication operation is represented by a vertex in \({\mathscr {H}}^C\). In other words, there exists a vertex \(v_j^C \in {\mathscr {V}}^C\) for each cut net \(n_j \in {\mathscr {N}}\):

$$ {\mathscr {V}}^C = \{ v_j^C: |\varLambda (n_j)| > 1 \}. $$

There exists a net \(n_k^C \in {\mathscr {N}}^C\) for the processor corresponding to each part of \(\varPi \). Hence, there are K nets in \({\mathscr {H}}^C\). \(v_j^C\) is connected to the nets corresponding to the processors that participate in the communication operation represented by \(v_j^C\). Hence,

$$ Nets(v_j^C) = \{ n_k^C: {\mathscr {V}}_k \in \varLambda (n_j) \}. $$

The vertices connected by \(n_k^C\) correspond to the communication operations that \(P_k\) participates in:

$$ Pins(n_k^C) = \{v_j^C: Pins(n_j) \cap {\mathscr {V}}_k \ne \emptyset \} \cup \{v_f^C\}. $$

Net \(n_k^C\) also connects a fixed vertex \(v_f^C\) (one such vertex per net), which is fixed to \({\mathscr {V}}_k^C\) in the partitioning and later used to decode the assignment of communication operations to processors. Here, the important point is the assignment of vertex weights in \({\mathscr {H}}^C\), as they signify the volume incurred by the communication operations. The volume incurred in communicating row j of X was described in the previous section, and the vertex weight is set accordingly:

$$ w(v_j^C) = s \cdot (|\varLambda (n_j)|-1). $$

The nets are assigned unit costs

$$ c(n_j^C) = 1. $$

Now consider a partition \(\varPi ^C = \{{\mathscr {V}}_1^C, \ldots , {\mathscr {V}}_K^C\}\) of \({\mathscr {H}}^C\). This vertex partition induces a distribution of the communication operations, where, without loss of generality, the responsibility for the communication operations represented by the vertices in \({\mathscr {V}}_k^C\) is assigned to processor \(P_k\). A net \(n_k^C\) in the cut necessitates messages between the respective processors. In other words, since the fixed vertex \(v_f^C\) connected by this net resides in part \({\mathscr {V}}_k^C\), \(P_k\) receives a message from each of the processors corresponding to the vertex parts in \(\varLambda (n_k^C) - \{{\mathscr {V}}_k^C\}\). Hence, the number of messages incurred by this net is \(|\varLambda (n_k^C)|-1\), which can be at most \(K-1\) as there are K vertex parts. In the partitioning, the objective of minimizing the cutsize according to the connectivity metric [9] hence encodes the total message count. As a part weight indicates the amount of volume the respective processor is responsible for, the constraint of maintaining balance on part weights corresponds to maintaining balance on the volume loads of the processors, and thus bounds the maximum volume.
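A compact sketch of how \({\mathscr {H}}^C\) could be assembled from the first-phase partition is given below; it records, for each cut net of \({\mathscr {H}}\), the corresponding communication vertex with weight \(s \cdot (|\varLambda (n_j)|-1)\) and its membership in the processor nets \(n_k^C\). The fixed vertices are left to the partitioner and only noted in a comment; the data layout is illustrative.

```python
def build_communication_hypergraph(pins, part, K, s):
    """Construct H^C = (V^C, N^C) from a K-way partition of H (a sketch).

    pins[j] : vertices connected by net n_j of H (rows that need row j of X)
    part[i] : part/processor of vertex v_i of H
    Returns (vertex_weight, nets), where vertex_weight[j] = s * (lambda(n_j)-1)
    for each cut net n_j, and nets[k] lists the communication-operation
    vertices that processor P_k participates in.  The fixed vertex of each
    net n_k^C is kept implicit here; it would be supplied to the partitioner
    as a vertex fixed to part k.
    """
    vertex_weight = {}
    nets = {k: [] for k in range(K)}      # n_k^C, one net per processor
    for j, p in enumerate(pins):
        lam = {part[i] for i in p}        # connectivity set Lambda(n_j)
        if len(lam) > 1:                  # only cut nets yield vertices of H^C
            vertex_weight[j] = s * (len(lam) - 1)
            for k in lam:
                nets[k].append(j)         # v_j^C is a pin of n_k^C
    return vertex_weight, nets
```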

By partitioning \({\mathscr {H}}\), we address total volume, and by partitioning \({\mathscr {H}}^C\), we address total message count and maximum volume. By making use of the respective models, we are able to address the communication cost metrics that are important for SpMM.

4 Experiments

4.1 SpGEMM

4.1.1 Dataset

We include sparse matrices from the Stanford Network Analysis Project (SNAP) [18] and the Laboratory for Web Algorithmics (LAW) [4, 5]. These matrices were downloaded from the University of Florida Sparse Matrix Collection [12], and their properties are given in Table 17.1. These matrices commonly represent the adjacency matrices of road networks or web-related graphs, such as relations between products. On such graphs, APSP algorithms can be used to find distances among the entities according to a predetermined metric. In this work, we only consider squaring the original adjacency matrix once, as a representative step of the APSP algorithm [11].

Table 17.1 Properties of input and output matrices

4.1.2 Experimental Setup

In order to verify the validity of the proposed parallelization method, an SpGEMM code was implemented and run on a BlueGene/Q system named Juqueen. For partitioning the hypergraphs, PaToH [9] is used with default parameters, except that the allowed imbalance ratio is set to 10 %.

As a baseline algorithm, we implemented a binpacking-based method which considers only computational load balancing. This method adapts the best-fit-decreasing heuristic used for the K-feasible binpacking problem [16]. The outer-product tasks are assigned to one of K bins in decreasing order of the number of scalar multiplications incurred by each outer product. The best-fit criterion assigns a task to the minimally loaded bin; the capacity constraint is not used in this method.
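A sketch of this baseline is given below, assuming `task_loads` maps each outer-product task to its number of scalar multiplications (a hypothetical input); a heap keeps track of the least-loaded bin.

```python
import heapq

def binpacking_assign(task_loads, K):
    """Baseline assignment of outer-product tasks to K processors.

    Tasks are considered in decreasing order of their number of scalar
    multiplications; each task goes to the currently least-loaded bin
    (best-fit-decreasing without a capacity constraint).
    """
    heap = [(0, k) for k in range(K)]     # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for task, load in sorted(task_loads.items(), key=lambda t: -t[1]):
        bin_load, k = heapq.heappop(heap)
        assignment[task] = k
        heapq.heappush(heap, (bin_load + load, k))
    return assignment
```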

4.1.3 Results

The performances of the proposed HP-based method and the baseline binpacking-based method are compared in terms of speedup in Figs. 17.1 and 17.2. As seen in these figures, in all cases the proposed HP-based method performs substantially better than the baseline method. This improvement verifies the benefit of reducing total message volume. As seen in the figures, the parallel efficiency of the HP-based method remains above 20 % in almost all instances.

Fig. 17.1 Speedup curves comparing performances of the proposed and baseline SpGEMM algorithms on road networks

Fig. 17.2 Speedup curves comparing performances of the proposed and baseline SpGEMM algorithms on web link matrices

Table 17.2 Properties of matrices tested for SpMM

4.2 SpMM

4.2.1 Dataset

We test 10 matrices for SpMM. The properties of these matrices are given in Table 17.2. These sparse matrices are also from the Stanford Network Analysis Project (SNAP) [18] and the Laboratory for Web Algorithmics (LAW) [4, 5], and they were all downloaded from the University of Florida Sparse Matrix Collection [12]. A common operation performed on such matrices for graph analysis is breadth-first search from multiple sources, which corresponds to multiple SpMM iterations.

Table 17.3 Partitioning and runtime results for SpMM for \(K=512\) and \(K=1024\)

4.2.2 Experimental Setup

We tested our model for SpMM on the SuperMUC supercomputer, which consists of IBM System x iDataPlex servers. A node on this system contains two Intel Xeon Sandy Bridge-EP processors clocked at 2.7 GHz and has 32 GB of memory. The communication interconnect is based on Infiniband FDR10.

We test two models in our experiments. The first is the plain hypergraph partitioning for SpMM described in Sects. 17.3.1 and 17.3.2. This model aims only at reducing total communication volume and is abbreviated as HP. The second model extends HP as described in Sect. 17.3.3; it aims at reducing communication volume and balancing communication volume in two separate phases and is abbreviated as HPVB. The number of columns of the dense matrices is set to \(s=5\). The SpMM kernel is repeated in a loop for a certain number of times, and a warm-up phase is included for more reliable timing. We test for \(K \in \{64, 128, 256, 512, 1024\}\) processors.

Fig. 17.3 Runtimes for SpMM

4.2.3 Partitioning and Runtime Results

We present the partitioning and runtime results in Table 17.3 for \(K=512\) and \(K=1024\). We consider and display two communication cost metrics in the table: total volume, indicated in the “TV” column, and volume load imbalance in percent, indicated in the “VI (%)” column. Communication volume is in terms of communicated words, and volume imbalance is computed on the send volumes of processors. The time of a single SpMM obtained by the two compared methods is given under the “runtime” column, in microseconds. The lower runtime for each test case is indicated in bold.

As seen in the table, HPVB obtains significant improvements over HP in communication volume load imbalance. At \(K=512\) processors, it achieves up to 8\(\times \) improvement, and at \(K=1024\) processors it achieves up to 9\(\times \) improvement. It increases total volume compared to HP, by up to 57 % and 56 % at 512 and 1024 processors, respectively. However, the reduction in maximum volume proves to be more vital for performance, as HPVB almost always performs better than HP in terms of runtime. Out of the 20 instances at \(K=512\) and \(K=1024\) processors, HPVB obtains lower runtimes in 16.

We present the speedups obtained on four test matrices in Fig. 17.3. As seen from the figure, HPVB scales better than HP, as its performance relative to HP usually improves with increasing number of processors.

5 Conclusion

We described efficient parallelization, on distributed systems, of two important sparse matrix kernels that frequently occur in big data applications: sparse matrix–matrix multiplication and sparse matrix–dense matrix multiplication. For these two kernels, we proposed partitioning models that reduce communication overheads and hence improve scalability. Our experiments show that efficient parallel performance for big data analysis on distributed systems requires careful data partitioning models and methods that are capable of exploiting certain communication cost metrics.