
1 Introduction

Graphs are becoming increasingly important for numerous applications, ranging across the World Wide Web, social networks, bioinformatics, computer security and many other domains. Such real world graphs typically exhibit power-law degree distributions, and many of them are directed, such as the Web graph and the Twitter social graph. The growing scale of these directed graphs has made efficient graph computation very challenging. Some previous work (e.g., PowerGraph [4], Ligra [5], Trinity [6]) fits large graphs in distributed shared memory. However, such systems require a terabyte of memory and the high cost of maintaining distributed machines. Reducing graph size to fit in memory, e.g., by lossless compression, is therefore crucial for cutting the cost of large-scale graph computation, and it motivates our work in this paper.

Literature work typically adopts the following two techniques. (1) Compression of adjacency lists, when graphs are represented as adjacency lists. For example, the recent work Ligra+ [3] applies widely used encoding schemes (such as byte coding, run-length byte coding and nibble coding) to adjacency lists for a smaller graph size. (2) Compression of the adjacency matrix, when graphs are represented as an adjacency matrix. This technique (e.g., adopted by SlashBurn [1, 2]) frequently exploits graph clustering algorithms to compress graphs.

Unfortunately, the two approaches above still suffer from issues when compressing real world directed graphs. For example, techniques that compress adjacency lists, including Ligra+, frequently achieve a lower compression ratio than techniques that compress the adjacency matrix. Alternatively, the clustering techniques frequently used for adjacency matrix compression are known to perform poorly on real world directed graphs, because such graphs have no good cuts. Finally, many compression techniques, including both Ligra+ and SlashBurn, can compromise graph computation efficiency due to nontrivial decompression overhead.

To address the above issues, in this paper we propose a novel compression algorithm for real world directed graphs. The proposed approach can be intuitively treated as a hybrid of the two approaches above: we first perform an effective clustering algorithm and then represent the resulting adjacency matrix by lists of encoded numbers. In this way, our approach has the chance to greatly reduce the graph size.

We make the following contributions. (1) The proposed clustering algorithm leverages real world graph structure information (in-degrees, out-degrees and sibling neighbors) to permute graph vertices for high space cost savings. (2) Our scheme to encode the clustered graphs greatly improves graph computation efficiency (e.g., Breadth-First Search (BFS) traversal) with no or trivial decompression overhead. Thus, the proposed graph compression does not compromise graph computation time. (3) Our extensive experiments on real data sets study the tradeoff between space cost savings and graph computation time, and verify the advantages of our work over the state-of-the-art SlashBurn and Ligra+. For example, when compared with SlashBurn on the real Amazon graph, our work uses \(35.77\%\) fewer nonempty blocks (when both SlashBurn and our work use the same approach to encode nonempty blocks, for fairness) and runs BFS 4.79\(\times \) faster. When compared with Ligra+, our scheme achieves \(20.24\%\) lower space cost and an 8.24\(\times \) higher speedup ratio when running the BFS algorithm on an example graph data set.

The rest of the paper is organized as follows. We first give the problem definition and encoding scheme in Sect. 2, and next present the clustering algorithm in Sect. 3. After that, Sect. 4 evaluates our approach, and Sect. 5 reviews literature work. Finally Sect. 6 concludes the paper.

2 Problem Definition and Encoding Scheme

In this section, we first give the problem definition (Sect. 2.1), and present the scheme to encode adjacency matrices (Sect. 2.2).

2.1 Problem Definition

Given a directed graph \(\mathcal {G}\), we model it as an asymmetric binary matrix \(\mathcal {M}\). The binary element \(e_{i,j}\) in the i-th row and j-th column indicates whether or not there exists a directed edge from the i-th vertex to the j-th vertex (where i and j denote vertex IDs). When dividing the matrix \(\mathcal {M}\) into blocks of \(b\times b\) elements, we determine whether each block is empty or not. Specifically, if the \(b\times b\) elements in a block are all zeros, we say that the block is empty, and otherwise non-empty. We can then implicitly measure graph space cost by counting the nonempty blocks, denoted by \(B(\mathcal {M})\).
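To make the block counting concrete, the following minimal Python sketch (with function and variable names of our own choosing) counts \(B(\mathcal {M})\) for a dense 0/1 matrix; blocks at the matrix boundary that contain fewer than \(b\times b\) elements are handled by the slicing.

import numpy as np

def count_nonempty_blocks(M: np.ndarray, b: int) -> int:
    """Count B(M): the number of b-by-b blocks containing at least one 1."""
    n_rows, n_cols = M.shape
    count = 0
    for r in range(0, n_rows, b):
        for c in range(0, n_cols, b):
            if M[r:r + b, c:c + b].any():  # block has at least one 1-element
                count += 1
    return count

For the \(9\times 9\) matrix of Fig. 1(b) with \(b=2\), count_nonempty_blocks(M, 2) would return 10, and 7 for the permuted matrix of Fig. 1(d).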

Our basic idea is to permute matrix rows and columns (i.e., re-order vertex IDs) such that the 1-elements in the permuted matrix are co-clustered. Thus, we have a chance to minimize \(B(\mathcal {M})\) and reduce graph space cost.

Fig. 1. Permutation of graph vertices: (a-b) directed graph and adjacency matrix before permutation, (c-d) directed graph and adjacency matrix after permutation, (e) representation of the permuted graph by lists of encoded numbers.

Example 1

Figure 1(a) gives a directed graph. Figure 1(b) shows the associated binary matrix with \(B(\mathcal {M})=10\) non-empty blocks (we divide the matrix into blocks of \(2\times 2\) elements). After permuting the graph, we have the resulting graph and associated matrix in Fig. 1(c-d). Now we have \(B(\mathcal {M})=7\) non-empty blocks in Fig. 1(d).

Problem 1

Given a directed graph \(\mathcal {G}\) and associated matrix \(\mathcal {M}\), we want to find a permutation of vertex IDs \(\pi : V \rightarrow [n]\) with the objective to minimize the count \(B(\mathcal {M})\) after the permutation.

Note that Problem 1 is NP-hard (see [7]), so no known algorithm solves it in polynomial time. Consider that real world graphs follow power-law degree distributions, with a few ‘hub’ nodes having very high degrees and the majority of nodes having low degrees. Traditional clustering approaches, such as spectral clustering [8], co-clustering [9], cross-associations [10], and shingle-ordering [11], do not work well on such real world graphs because they have no good cuts.

2.2 Graph Representation by Encoding Non-empty Blocks

When a directed graph is represented as a binary matrix as above, we encode the nonempty blocks (each consisting of \(b\times b\) elements) into lists of integers. Specifically, we first use a directory to maintain the matrix row IDs. Each element in the directory (i.e., a group of associated matrix row IDs) refers to a list of integers that encodes the non-empty blocks of those rows. For example, in Fig. 1(b), since we divide the matrix into \(2\times 2\) blocks, each element in the directory contains two row IDs, provided that each of these IDs has at least one non-empty block. Since the row IDs 6 and 7 have only empty blocks, the directory does not contain them. Thus, among the 9 row IDs (from 0 to 8), we have four elements in the directory: \(\{0,1 \}\), \(\{2,3 \}\), \(\{4,5 \}\) and \(\{8\}\).

Next, to encode a non-empty block, we use two numbers. The first number is the left-most column ID of the block. Then, by treating the binary elements inside the block as the binary form of an integer, we use that integer to represent the whole block. For example, for the left nonempty block in row IDs 0 and 1, its left-most column ID is 4, and the \(2\times 2=4\) binary elements 1110 are encoded as the integer 14. Thus, we use the integer pair \(\{4,14\}\) to encode the block. Similar reasoning holds for the other non-empty blocks. Figure 1(e) gives the directory and lists of integer pairs that encode the graph in Fig. 1(c).
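The pair encoding and its decoding can be sketched in a few lines of Python; the row-major, most-significant-bit-first packing below matches the 1110-to-14 example above, while the function names are ours.

def encode_block(block_rows, leftmost_col):
    """Encode a b-by-b block as (leftmost column ID, integer of its bits).
    block_rows: b lists of b binary elements, read row by row with the
    most significant bit first, e.g., [[1, 1], [1, 0]] -> 0b1110 -> 14."""
    value = 0
    for row in block_rows:
        for bit in row:
            value = (value << 1) | bit
    return (leftmost_col, value)

def decode_block(pair, b):
    """Recover the leftmost column ID and the b-by-b binary elements."""
    leftmost_col, value = pair
    bits = [(value >> shift) & 1 for shift in range(b * b - 1, -1, -1)]
    return leftmost_col, [bits[i * b:(i + 1) * b] for i in range(b)]

For the block discussed above, encode_block([[1, 1], [1, 0]], 4) returns (4, 14), and decode_block((4, 14), 2) recovers the original elements with only shift and mask operations, which is why decoding is cheap.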

Note that we could adopt an alternative approach to compress nonempty blocks. For example, SlashBurn adopts gZip to compress nonempty blocks. However, our encoding scheme above offers an obvious benefit: the decoding overhead from the encoded numbers back to the original binary elements is trivial compared with gZip decompression. Our experiments will verify this benefit against other encoding and compression approaches.

It is not hard to see that the space cost of such encoded lists also depends upon \(B(\mathcal {M})\). Thus, in the next section we present the permutation algorithm.

3 Graph Clustering

We first highlight the solution overview (Sect. 3.1), and next give the algorithm details (Sect. 3.2).

3.1 Solution Overview

To understand the proposed clustering algorithm for permuting graph vertices, we first give two observations about real world graph structure. First, let us consider the hub vertices with high in-degrees in real world directed graphs. Due to the power-law in-degree distributions of such graphs, a few hub vertices have a very large number of incoming edges (e.g., many spoke fans in the Twitter social network), indicating very high in-degrees, while the majority of vertices have low in-degrees. Given the hub and spoke vertices, we make the following observation.

Fig. 2. Hubs and spokes

Observation 1

For two hub vertices pointed to by a large number of spoke neighbors, it is not rare that the hub vertices share many common spoke neighbors. Intuitively, if the similarity of such spokes is high, we would like to permute the hub vertices together in the matrix columns. Meanwhile, if two spoke neighbors share common hubs (for example, due to similar interests in social networks), the similarity of such spokes is high and we also place the spokes together in the matrix rows. In Fig. 2, \(hub_1\) and \(hub_2\) share 8 incoming neighbors \(spoke_1...spoke_{8}\), and \(spoke_1...spoke_8\) share the outgoing neighbors \(hub_1\) and \(hub_2\).

We next consider the hubs with high out-degrees. Real world directed graphs also follow power-law out-degree distributions, i.e., a few hub vertices have a very large number of spoke neighbors (i.e., very high out-degrees) and the majority of vertices have low out-degrees. We again make the following observation.

Observation 2

For two hub vertices with high out-degrees, if they share many spoke neighbors, we would like to place the hub vertices together in the matrix rows. Meanwhile, if two spoke vertices share many incoming hub neighbors, we similarly place the spoke vertices together in the matrix columns. In Fig. 2, \(hub_3\) and \(hub_4\) share 8 outgoing neighbors \(spoke_9...spoke_{16}\), and \(spoke_9...spoke_{16}\) share the incoming neighbors \(hub_3\) and \(hub_4\).

For illustration, Fig. 3(a-b) visualizes the matrices \(\mathcal {M}\) before and after permutation (we present the permutation algorithm in the next section). In the matrices, each row (resp. column) indicates a source vertex (resp. destination vertex). If there is a directed edge from source i to destination j (where \(1\le i,j\le |V|\)), we plot a point at the coordinate (i, j).

In Fig. 3(a), the points are randomly distributed in the matrix before permutation. After permutation, instead, we follow the above observations to purposely co-cluster the points in the following areas, as shown in Fig. 3(b). Specifically, (i) for the hubs with high in-degrees (i.e., a large number of incoming spoke neighbors), we purposely permute such hubs and spokes together in the matrix so that the points of these hubs and spokes are co-clustered in the vertical zones; (ii) for the hubs with high out-degrees (i.e., a large number of outgoing spoke neighbors), we again co-cluster the points of such hubs and spokes in the horizontal zone; and (iii) for the vertices with both high in-degrees and high out-degrees, we co-cluster the associated points in the top-left zone. Since all points are co-clustered in the three zones above, no other points remain in the rest of the matrix. Thus, we have a chance to minimize the count of non-empty blocks \(B(\mathcal {M})\).

Fig. 3. Visualization of the Wiki-Vote graph (from left to right): (a-b) the matrices before and after permutation, (c) the matrix after permutation with a different parameter value.

3.2 Algorithm Detail

In this section, we give the details of the algorithm that permutes graph vertex IDs such that the points are co-clustered in the associated zones. Our algorithm outputs a list P of vertices, such that the i-th vertex in P (with \(0\le i\le |V|-1\)) is re-assigned the new vertex ID i. That is, the i-th vertex in P, originally with ID \(i'\), is re-ordered from the old ID \(i'\) to the new ID i.

In the algorithm, suppose that i items (i.e., old vertex IDs) have already been added to the list P; the key question is then which vertex should be selected next from the remaining \((|V|-i)\) IDs. To this end, we leverage the two observations in Sect. 3.1, namely (i) the power-law distribution of vertex degrees and (ii) the similarity of vertices (measured by common neighbors), to design the permutation algorithm. Overall, the algorithm first selects a set of candidate vertices based on vertex degrees (either in-degrees or out-degrees), and then uses common neighbor information to finalize the vertex selection among the candidates.

Candidate Selection: Algorithm 1 leverages degree information to select k candidate vertices from the vertex set S. First, we choose the top-k highest in-degree and top-k highest out-degree vertices, respectively (lines 1–2). Such vertices are the in-degree and out-degree hubs, denoted \(L_{id}\) and \(L_{od}\). After that, we take the union and intersection of \(L_{id}\) and \(L_{od}\), yielding two sets \(L_{\sqcup }\) and \(L_{\sqcap }\).

If the cardinality \(|L_{\sqcup }|\) is smaller than k, at most k vertices remain unprocessed, and we simply return them as candidates (line 4). Otherwise, if \(|L_{\sqcup }|=k\), these vertices are exactly those with the top-k in-degrees and out-degrees, and line 5 returns them. Finally, given \(|L_{\sqcup }|>k\), some (or none) of the vertices in \(L_{id}\) also appear in \(L_{od}\). Thus, we first choose \(L_{\sqcap }\) (line 7) and then select the top-\(k'\) highest in-degree or out-degree vertices among the vertices \((L_{\sqcup }-L_{\sqcap })\) (line 8); together they form the candidates (line 9).

Algorithm 1 (figure): candidate selection based on vertex degrees.
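Since the pseudocode of Algorithm 1 is only available as a figure here, the following Python sketch reconstructs the candidate-selection steps from the description above. It reflects our reading of lines 1–9 rather than the authors' exact code; the degree dictionaries and the tie-breaking by maximum degree are assumptions.

def select_candidates(S, in_degree, out_degree, k):
    """Reconstructed sketch of Algorithm 1: pick at most k candidate hubs from S.
    in_degree / out_degree: dicts mapping vertex ID -> degree (assumed inputs)."""
    S = list(S)
    L_id = sorted(S, key=lambda v: in_degree[v], reverse=True)[:k]   # top-k in-degree hubs
    L_od = sorted(S, key=lambda v: out_degree[v], reverse=True)[:k]  # top-k out-degree hubs
    L_union = set(L_id) | set(L_od)
    L_inter = set(L_id) & set(L_od)
    if len(L_union) <= k:            # lines 4-5: at most k hubs remain
        return list(L_union)
    # lines 7-9: keep the intersection, then fill up with the highest-degree
    # vertices among the remaining ones.
    rest = sorted(L_union - L_inter,
                  key=lambda v: max(in_degree[v], out_degree[v]),
                  reverse=True)
    return list(L_inter) + rest[:k - len(L_inter)]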

Common Neighbors: Besides the power-law degree distribution, common neighbor information is also important to our algorithm. The basic idea is to find vertices that share a large number of common neighbors.

Algorithm 2 gives the details of unifying the selection of candidate vertices with common neighbor information. First, line 3 selects at most k candidate vertices C (see Algorithm 1); line 4 adds all in-neighbors and out-neighbors of C to a new set N. Next, for each pair of members \(u\ne v\in N\), we find their common neighbors Nbr(u, v), and add the number |Nbr(u, v)| of common neighbors (as a value) and the pair (u, v) (as a key) to a key-value pair list L (line 6). After sorting the list L (line 7), we select the keys (u, v) with the highest |Nbr(u, v)| and add k vertices to P, provided they are not already inside P (lines 8–11). If only \(cnt' ({<}k)\) vertices are selected by the steps above, we select the remaining \((k-cnt')\) vertices from the previously selected candidates C to ensure that k vertices are added to P (lines 12–14). We repeat the above steps until all vertices have been added to P (lines 2–15).

Algorithm 2 (figure): vertex permutation using candidates and common neighbors.
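Similarly, the sketch below reconstructs the main loop of Algorithm 2 from the description above, reusing the select_candidates sketch; the tie-breaking among pairs and the fallback order are our assumptions rather than the authors' exact choices.

from itertools import combinations

def permute_vertices(V, in_nbrs, out_nbrs, in_degree, out_degree, k):
    """Reconstructed sketch of Algorithm 2: the i-th vertex of the returned
    list P receives the new ID i. in_nbrs / out_nbrs map a vertex to its
    neighbor sets (assumed inputs)."""
    P, remaining = [], set(V)
    while remaining:                                                  # lines 2-15
        C = select_candidates(remaining, in_degree, out_degree, k)    # line 3
        N = set()
        for c in C:                                                   # line 4: neighbors of C
            N |= in_nbrs[c] | out_nbrs[c]
        L = []                                                        # lines 5-6: rank pairs
        for u, v in combinations(N, 2):
            common = (in_nbrs[u] | out_nbrs[u]) & (in_nbrs[v] | out_nbrs[v])
            L.append((len(common), u, v))
        L.sort(reverse=True)                                          # line 7
        added = 0
        for _, u, v in L:                                             # lines 8-11
            for w in (u, v):
                if w in remaining and added < k:
                    P.append(w); remaining.discard(w); added += 1
            if added >= k:
                break
        for c in C:                                                   # lines 12-14: fill up from C
            if added >= k:
                break
            if c in remaining:
                P.append(c); remaining.discard(c); added += 1
    return P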

In the algorithm above, the running time depends upon the number of members in N (see line 6). Thus, we can limit the number of members \(m\in N\), for example by requiring that each m has at least two (in- and out-) neighbors. In addition, the number k is an important parameter in the algorithm above. Tuning k involves a tradeoff between compression quality and running time. Figure 3(b-c) visualizes two matrices when setting \(k=1\) and \(k=0.005n\), respectively. It is not hard to see that Fig. 3(b) has a higher point density than Fig. 3(c). A larger k can speed up the running time of Algorithm 2, but at the cost of compression quality.

Table 1. Statistics of the real data sets

4 Evaluation

4.1 Experimental Setting

Table 1 lists the five real world directed graphs used in our experiments. The data sets are collected from the Stanford Large Network Dataset Collection (SNAP) [12]. Given the data sets, we evaluate the performance of our scheme to answer the following questions:

  • \(Q_1\): How well does our scheme perform when compared with other graph re-ordering approaches?

  • \(Q_2\): How does our scheme compare to other graph compression approaches?

  • \(Q_3\): How fast can our scheme perform graph computation with no or only partial graph decompression?

In order to answer the questions above, we use the following performance metrics.

  (1) Given a graph’s adjacency matrix, we divide it into blocks of \(b\times b\) elements (each block thus consists of at most \(b\times b\) elements). The number of non-empty blocks measures the density of the adjacency matrix. A smaller number of non-empty blocks means a better re-ordering algorithm and thus leads to less space cost.

  (2) We follow the previous work [1, 2]: we encode the graph adjacency matrix into binary bits and define a cost function by assuming that the compression method achieves the information-theoretic lower bound. We then compute the bits per edge, i.e., (i) the total number of bits required by information-theoretic coding divided by (ii) the number of edges. This metric measures how many bits are used, on average, to store a single edge. A smaller number of bits per edge indicates less space cost to encode a graph.

  (3) We measure the space cost of a compressed graph when maintained on disk. In addition, the compression ratio, measured by the ratio of the space cost before and after a graph compression approach is applied, is also used to measure the quality of a compression approach.

  (4) Finally, we are interested in the speedup ratio of a graph computation algorithm, computed as the running time of the computation on the original graph divided by the time on the compressed graph. The speedup ratio helps us understand (i) the benefit of reduced I/O cost (due to the smaller graph size after compression) in speeding up graph computation and (ii) the decompression overhead introduced, if any. We use Breadth-First Search (BFS)-based graph traversal and Dijkstra-based shortest paths as the two examples of graph computation. In case decompression is required, the running time includes both decompression time and graph computation time. (A small computation sketch of metrics (2) and (4) follows this list.)

Based on the above metrics, we mainly compare our scheme with other re-ordering algorithms, plus another graph compression approach Ligra+.

  • Natural. Natural reordering of graph vertices means keeping the original vertex order, i.e., the original adjacency matrix. For some graphs, the natural ordering provides high locality among consecutive vertices.

  • Degree Sort. The reordering is based on the decreasing degrees of graph vertices.

  • SlashBurn [1, 2] is the work most similar to ours. Note that SlashBurn is originally designed for undirected graphs. To make SlashBurn work for directed graphs, we adapt the asymmetric matrices associated with directed graphs by redundantly adding extra rows (or columns) with respect to the source vertices (or destination vertices) of the directed graphs; the elements of such extra rows (or columns) are all 0-elements. With the adapted matrices, we then apply SlashBurn to reorder them.

  • Ligra [5] and Ligra+ [3]. Ligra+ also adopts a graph reordering approach. However, differing from SlashBurn and our work, it improves locality by assigning neighbor IDs close to the vertex ID [13]. After that, it sorts edges and encodes the differences. It is useful to understand how our work and SlashBurn compare to other graph compression approaches such as Ligra+. Note that Ligra and Ligra+ leverage multicore parallelism to greatly speed up graph computation and compression/decompression.

We implement our approach with GCC, and run all experiments on a Linux workstation with a 6-core Xeon 2.60 GHz CPU, 64 GB RAM and a 1 TB SATA disk. In each iteration, our algorithm selects the top-k degree vertices with similar sibling relationships. Following the poker game, we name our algorithm Ace_Up.

Fig. 4. (a) Degree distribution, and four reordering algorithms: (b) Natural (c) Degree sort (d) SlashBurn (e) Ace_Up

4.2 Comparison of Reordering Algorithms

In this section, we compare the proposed algorithm with three other re-ordering approaches: Natural, Degree sort and SlashBurn.

To capture the overall degree distribution, Fig. 4(a) plots the degree distributions of the five data sets. The x-axis represents vertex degrees (here a vertex degree is the sum of the in-degree and out-degree of a vertex), and the y-axis is the frequency of vertices with the associated degree. In general, all of the directed graphs follow very skewed degree distributions.

Given such directed graphs, we plot the adjacency matrices after the vertex IDs of the graphs are re-ordered by the four algorithms, as shown in Fig. 4(b-e). On all five data sets, our scheme always uses the smallest area to plot the points of the graphs and thus the smallest number of nonempty blocks. Natural reordering instead requires the maximum number of nonempty blocks. Degree sort reordering uses fewer blocks, because the upper-left area of the adjacency matrix is denser than with the Natural approach. Our scheme achieves better results than SlashBurn, mainly because SlashBurn works only on symmetric matrices (i.e., undirected graphs) and we have to add redundant rows (or columns) for directed graphs. Thus, after performing SlashBurn on such matrices, the resulting matrices, as shown in Fig. 4(d), have equal numbers of rows and columns. Instead, in Fig. 4(e), the resulting matrices generated by our scheme have different numbers of rows and columns (as can be recognized from the x and y axes). This result indicates that our scheme adapts very effectively to directed graphs.

Fig. 5. From left to right: (a) nonempty blocks (b) bits per edge (c) space cost (d) compression ratio

Figure 5 directly measures the number of nonempty blocks, the bits per edge and the space cost of the four schemes. For the two metrics of nonempty blocks and bits per edge, our scheme Ace_Up performs best. For example, Ace_Up uses \(63.07\%\) and \(35.77\%\) fewer non-empty blocks than Natural and SlashBurn, respectively; in terms of bits per edge, Ace_Up again uses up to \(41.96\%\) and \(16.54\%\) fewer bits than Natural and SlashBurn, respectively. Note that for the space cost, SlashBurn uses gZip to compress each non-empty block and thus leads to the least space cost, whereas we use integer pairs to encode nonempty blocks (see Sect. 2.2). Our scheme achieves less space cost than Ligra+ but slightly higher space cost than SlashBurn. When our scheme Ace_Up and SlashBurn both adopt the same compression approach (either gZip or our encoding approach), Ace_Up uses less space than SlashBurn. Nevertheless, we will show shortly that the gZip-based SlashBurn incurs much higher BFS traversal time than our scheme.

Finally, Fig. 5(d) measures the compression ratio of the four schemes. In this figure, after running SlashBurn on the directed graphs, we use both gZip and the encoding scheme in Sect. 2.2 to compress the non-empty blocks, giving two versions of SlashBurn, namely SlashBurn-gZip and SlashBurn-encode. It is clear that Ace_Up leads to a higher compression ratio than SlashBurn-encode, though a lower one than SlashBurn-gZip. Among these algorithms, Ligra+ shows the smallest compression ratio.

4.3 Graph Computation Speedup Ratio

In this section, we measure the speedup ratio of two graph computation algorithms (BFS traversal and Dijkstra-based shortest paths). The ratio is computed as the running time of the computation on the original graph divided by the time on the compressed graph. For Ligra+, we measure the ratio as the running time of Ligra (i.e., the graph computation engine without the graph compression technique) divided by the time of Ligra+. A larger ratio indicates faster computation on compressed graphs, and a ratio smaller than 1.0 indicates slower computation.

Fig. 6. (a) Speedup ratio of BFS (left); (b) speedup ratio of Dijkstra

In Fig. 6, Ace_Up achieves much higher speedup ratios than both SlashBurn and Ligra+. For example, to perform BFS on the Amazon data set, Ace_Up achieves 4.42\(\times \) and 8.24\(\times \) higher speedup ratios than SlashBurn and Ligra+, respectively. This is because of the non-trivial decompression overhead incurred by SlashBurn and Ligra+. In particular, both SlashBurn and Ligra+ have speedup ratios smaller than 1.0, indicating that the decompression overhead outweighs the reduced I/O overhead (due to the smaller space cost of compressed graphs). Instead, our scheme Ace_Up is based on encoded integers, and the decoding time from encoded integers back to binary bits is trivial.

Note that although its speedup ratio is smaller than 1.0, the absolute running time of Ligra+ is much smaller than that of our scheme Ace_Up. This is mainly due to the parallel computation engine and parallel decompression offered by the excellent system design of Ligra and Ligra+.

Table 2. Space cost and computation time of various b

4.4 Effect of Parameter b

Finally, we are interested in how the parameter b affects the performance of our scheme in terms of space cost and graph computation. In Table 2, we vary b over 16, 32 and 64, and divide the graph adjacency matrices into \(b \times b\) blocks. The parameter b clearly affects the number of nonempty blocks and the bits per edge: overall, a larger b leads to fewer nonempty blocks and fewer bits per edge. However, in terms of graph computation time (in this table we give BFS traversal time only; due to the space limit of the table, we do not give the running time of the Dijkstra algorithm), a larger b, for example \(b=64\), does not necessarily lead to a smaller running time. This is because a larger b means more elements per block, so we have to spend more time to look up a specific edge encoded in a block (due to the larger number of elements).
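To illustrate the lookup described above, the sketch below tests whether edge (i, j) exists using the (leftmost column, integer) encoding of Sect. 2.2; the flat list of pairs per row group is our simplification of the directory structure. With a larger b, each encoded value covers \(b^2\) elements, so any lookup that decodes or scans a block touches more data.

def has_edge(row_group_blocks, i, j, b):
    """Check whether edge (i, j) exists. row_group_blocks: the (leftmost_col,
    value) pairs of the directory entry covering row i (simplified layout)."""
    for leftmost_col, value in row_group_blocks:
        if leftmost_col <= j < leftmost_col + b:
            # Bit position of element (i mod b, j mod b), packed row by row
            # with the most significant bit first (as in Sect. 2.2).
            pos = (i % b) * b + (j % b)
            return (value >> (b * b - 1 - pos)) & 1 == 1
    return False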

5 Related Work

Graph Compression: Graphs are typically represented by adjacency lists or adjacency matrices, though various file formats are used to store them. When a graph is maintained as adjacency lists of sorted vertex IDs, some previous work [14, 15] compresses the lists using differences between vertex IDs. WebGraph [15] compresses the graph by representing adjacent nodes as small differences from a previous value, instead of the original large IDs. Since large IDs require more space (i.e., word length), maintaining the small differences of consecutive IDs saves space cost. The work in [16] improves the WebGraph framework by proposing a different ordering. Following the same idea of maintaining the differences between consecutive IDs, Ligra+ optimizes the previous work with more sophisticated coding schemes such as run-length encoded byte coding.
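As a generic illustration of this difference-based idea (not WebGraph's or Ligra+'s actual codecs), the sketch below gap-encodes a sorted neighbor list with a simple 7-bit variable-length byte code, so that small gaps occupy a single byte.

def gap_encode(sorted_neighbors):
    """Encode a sorted neighbor list as variable-length byte-coded gaps
    (a generic sketch of the WebGraph/Ligra+ idea, not their actual format)."""
    out, prev = bytearray(), 0
    for v in sorted_neighbors:
        gap = v - prev                 # small gaps need fewer bytes than raw IDs
        prev = v
        while True:
            byte = gap & 0x7F
            gap >>= 7
            out.append(byte | (0x80 if gap else 0x00))   # high bit = continuation
            if not gap:
                break
    return bytes(out)

For example, gap_encode([1000000, 1000003, 1000007]) spends three bytes on the first ID but only one byte on each of the gaps 3 and 4.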

Some work represents a graph by a matrix instead of adjacency lists. GBase [17] and SlashBurn [1, 2] are based on graph adjacency matrices. They divide the matrices into blocks of matrix elements. Considering that the adjacency matrix of a real world graph is usually very sparse, SlashBurn co-clusters the elements more densely inside the adjacency matrix. The basic idea of SlashBurn [1] is to remove the vertices of highest degree, find the giant connected component (GCC) in the rest of the graph, and repeat this process. After clustering the matrix and dividing it into blocks of elements, SlashBurn adopts gZip to compress the blocks for less space cost. SlashBurn works well on power-law graphs. However, it suffers from some limitations. First, it works only on undirected graphs and cannot be directly used to compress directed graphs. Moreover, gZip is known as a heavyweight compression approach, and its decompression overhead is non-trivial: since graph computation cannot be performed directly on the compressed graphs, we have to pay this decompression overhead before graph computation can run on the original uncompressed graphs.

Some work proposes graph storage formats for sparse matrices, such as Compressed Sparse Blocks (CSB) [18]. It is essentially the same as Compressed Sparse Row (CSR) [19] and allows vector multiplication in parallel. It uses a simple tuple representation to store a matrix entry as (row index, column index, nonzero value). In addition, when maintaining only graphs without edge weights, the CSR format in PrefEdge [20] maintains two components (row index and column index) but no element values. Furthermore, some work, such as [21], treats graph compression as a matrix factorization problem. Graph computation can be performed directly on the factorized matrices; however, the factorization incurs information loss.

Graph Computation: There are two styles of storing a data set: in main memory or in external memory. Accordingly, the optimization of graph computation can focus either on reducing query time over the in-memory graph or on improving I/O efficiency.

When graphs are compressed, we frequently have to recover the original graphs by decompression. Non-trivial decompression overhead can negate the I/O reduction achieved by the smaller space cost of compressed graphs on disk. Little work in the literature has studied the optimization of computation efficiency on compressed graphs. For example, by translating many graph computations into matrix operations, Pegasus [22] can benefit from the lower I/O overhead caused by the smaller size of compressed graphs on disk (as shown in GBase [17] and SlashBurn [1, 2]). However, after the graph is loaded from disk into main memory, decompression is still needed before graph computation. As shown in Pegasus, sparse matrix-vector (spMV) multiplication is frequently used in many graph computations. Some previous work [23, 24] has explored the problem of running algorithms on compressed inputs in the spMV context. Though with some promising results, such work studies only the specific spMV computation and cannot be easily extended to other matrix computations (for example, matrix eigenvalue decomposition) that could be used by some graph computations.

Fan et al. [25] give a framework for query-preserving graph compression. They propose two compression methods preserving reachability queries and pattern matching queries based on bounded simulation. Though such queries can be performed directly on the compressed graphs, due to the lossy compression it is unknown whether queries other than reachability can be performed well on the compressed graphs.

6 Conclusion and Future Work

Given real world directed graphs, we study the problem of permuting graph vertices in order to minimize the number of nonempty blocks. The proposed algorithm leverages graph structure information (e.g., in-degrees, out-degrees, and sibling neighbors) to re-order vertex IDs for the permutation. We perform extensive experiments on real world directed graphs and show that our work outperforms the state-of-the-art SlashBurn in terms of space cost savings and graph computation time. As future work, we will continue to study the performance of graph compression. For example, we are interested in how modern hardware (such as multicore machines and clusters) can speed up graph compression and decompression.