1 Introduction

Current single-machine computing environments are a mixture of high-power CPUs and GPUs coupled with large quantities of memory of various speeds. Often these are subsequently networked together into large distributed computational platforms. Cloud computing further complicates the scenario as advanced resources can be purchased for the time needed. These environments present a wide range of opportunities to schedule what are often called pleasingly parallel computations, namely, those that have a large amount of independent computation that can be scheduled simultaneously.

We wish to investigate how to leverage and utilize such resources in the context of a large graph computation. While the focus on large graph computation is often in terms of solving problems on massive graphs with distributed computation, the downside to such computations is that they often involve nearly linear-time algorithms [16]. These have runtimes such as \(O(n \log n)\) and typically involve a small number of passes over all the edges of the graph, for instance, running a connected components analysis or computing a PageRank vector. Consequently, the algorithm performance is largely dominated by how well the computation maps to the IO and memory system strategies of the platform.

Instead, the computation we investigate is the all-to-all personalized PageRank computation. Given an n-node graph, this involves computing the personalized PageRank vector associated with each node. We state the problem formally in Sect. 2. Consequently, there are n such computations that are all independent and decoupled. In terms of the scale, we are targeting graphs with up to 100 million edges and up to 10 million nodes. Real-world instances of such graphs are the LiveJournal social network crawl, with around 4 million vertices and 67 million edges, and the Orkut social network crawl, with 3 million vertices and around 220 million edges.

Because the output from the all-to-all problem would be O(n²) data, we seek to output only summary statistics of the personalized PageRank vectors, including inverse participation ratios for the solutions, which serve as a soft measure of the number of non-zeros, as well as the largest 1000 entries of the vectors. The large values are commonly used as latent measures of node similarity [14, 41, 46]. Hence, a simple strategy for this computation is to load the graph into memory on all computers available, take the fastest single-core algorithm for personalized PageRank, and run it in as many places as possible.

This picture becomes more interesting in light of the heterogeneous nature of computers. First, we can use vector or SIMD instructions to potentially compute multiple PageRank vectors at once, if the algorithm used is amenable to it. Second, large shared memory machines may have a large number of computing cores (over 200 is possible with commercially available systems that cost less than $250k). However, many of these cores share memory bandwidth resources that can impede some algorithms, which suggests that sharing access to a single graph may not scale. Third, GPUs are constantly changing their underlying compute resources. Fourth, the algorithm performance itself is likely to be sensitive to choices of data distribution within the graph due to memory locality. Hence, even in this simple setting there is a rich set of complications that undermine the simplistic expectation that a pleasingly parallel algorithm will scale.

Our goal is to investigate these performance differences in the context of the simple personalized PageRank computation. We chose that computation as it is representative of a wide swath of related computations on graphs, including scalable methods for all-pairs shortest paths. Moreover, the algorithms to compute it are simple. They are specializations of well-known matrix computation algorithms including the power method and Gauss–Seidel method [19]. We can easily investigate a diverse collection of possible implementations that have different memory characteristics. Our focus was to keep the investigation simple and reflective of what might be expected from an informed, but non-expert, user of the algorithms. This is someone who understands how the algorithms work and where the relevant bottlenecks might be, but does not want to re-engineer the algorithm for the absolute maximum level of parallelism or performance. This individual is optimistic that the pleasingly parallel nature of the computation will be sufficient to drive performance. Towards that end, we discounted using GPUs for now, as the toolkit for graph computations on GPUs is still evolving.

We have done all of these experiments in the Julia computing environment to make it easy for others to further investigate our ideas. It is also a high-level programming language that makes it simple to implement a variety of algorithms in a consistent fashion. Regarding the idea that high-level languages may be slow, we initially benchmarked the Julia implementation against a C++ implementation of a similar algorithm [25] and found the runtimes of the methods to be within 10% of each other.

2 The PageRank and AllPageRank Problem

The PageRank problem begins with a graph G, which could be both weighted and directed. However, in the interest of simplicity, we take G to be unweighted, directed, and strongly connected. This greatly simplifies the setting and puts the focus on the relevant pieces of the computation. Let us note that we lose no generality by doing so: a PageRank computation on a graph with multiple strongly connected components can be reduced to a sequence of PageRank computations on the individual strong components, and usually, these additional computations are much smaller because most real-world networks only have a single large strongly connected component (see, among others who make this observation, [35]).

Fix an ordering for G’s vertices from 1 to n and identify each vertex with its index in this order. Let A be the resulting adjacency matrix of G

$$\displaystyle \begin{aligned} A_{ij} = \begin{cases} 1 & \mbox{{$(i,j)$} is a directed {$i \to j$} edge} \\ 0 & \mbox{{$(i,j)$} is not an edge.} \end{cases} \end{aligned} $$

We use the following additional notation:

$$\displaystyle \begin{aligned} \begin{aligned} {\mathbf{d}} & = \mbox{vector of degrees} \quad d_i = \textstyle \sum_{j} A_{ij} \\ {\mathbf{p}} & = \mbox{vector of inverse degrees} \quad p_i = 1/d_i \\ {\boldsymbol{P}} & = \mbox{the stochastic transition matrix} \quad {\boldsymbol{P}} = {\boldsymbol{A}}^T \mbox{Diag}({\mathbf{p}}) \\ {\mathbf{e}}_i & = \mbox{the {$i$}th column of the identity, } {\mathbf{e}}_i \mbox{ has a 1 in the {$i$}th row and {$0$} elsewhere}, \end{aligned} \end{aligned} $$

where we use the Diag(⋅) operator to put the argument along the matrix diagonal. The PageRank problem [14, 28, 39, 40] is to compute the stationary distribution of a random walk that with probability α follows a standard random walk model on G and with probability 1 − α jumps according to a teleportation distribution vector v, where v encodes the probability of jumping to each node. Typically α is between 0.5 and 0.99. Throughout this paper, we use what became the standard value of 0.85 [28]. The stationary distribution corresponds to a solution of the following nonsingular linear system:

$$\displaystyle \begin{aligned} ({\boldsymbol{I}} - \alpha {\boldsymbol{P}}) {\mathbf{x}} = (1-\alpha) {\mathbf{v}}. \end{aligned} $$

Personalized, or seeded, PageRank problems set v to be concentrated on a single node; in this case, v is a column of the identity matrix e i, and the linear system becomes

$$\displaystyle \begin{aligned} ({\boldsymbol{I}} - \alpha {\boldsymbol{P}}) {\mathbf{x}}_i = (1-\alpha) {\mathbf{e}}_i. \end{aligned} $$
(1)

As an aside, we note that a standard feature of most PageRank constructions [14] is the dangling correction vector c. In this case, we do not have this correction vector because we assume that we are given a strongly connected graph.
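
For concreteness, the following is a tiny, self-contained Julia example that builds P and solves the system (1) directly with a sparse factorization. The 4-node graph is our own illustration; at the scales we target, the iterative methods of Sect. 3 replace the direct solve.

```julia
using LinearAlgebra, SparseArrays

# A[i,j] = 1 when i -> j is an edge; this small example graph is strongly connected.
A = sparse([1, 1, 2, 3, 3, 4], [2, 3, 3, 1, 4, 1], ones(6), 4, 4)
d = Vector(vec(sum(A, dims=2)))             # out-degrees d_i = sum_j A_ij
P = sparse(A') * Diagonal(1 ./ d)           # column-stochastic P = A^T Diag(p)
alpha, i = 0.85, 1
ei = zeros(4); ei[i] = 1.0
xi = (I - alpha * P) \ ((1 - alpha) * ei)   # personalized PageRank column x_i from (1)
@assert isapprox(sum(xi), 1.0; atol=1e-10)  # columns of X sum to 1 on a strongly connected graph
```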

Our goal is to compute x i for all i from 1 to n, or more simply, the matrix

$$\displaystyle \begin{aligned} {\boldsymbol{X}} = (1-\alpha) ({\boldsymbol{I}} - \alpha {\boldsymbol{P}})^{-1}. \end{aligned} $$

We wish the entries of X to be of high accuracy, and intend to compute each column of X such that the 1-norm error is provably less than (1 − α)∕n. Because the graph is strongly connected, the matrix X is dense when computed exactly. For a graph with one million vertices, this matrix is too large to store even on a large shared memory machine. We thus define the AllPageRank problem.

Problem 1 (AllPageRank)

Fix a graph G, let A be the binary adjacency matrix indicating the presence of each edge, and let P be a column-stochastic matrix giving transition probabilities on the same graph. The AllPageRank problem is to compute the following entries of (1 − α)(I − αP)−1:

  • the participation ratio for each column x i, which is a soft measure of the number of non-zeros in the column

  • the non-zero values of X ⊙A and X T ⊙A, and

  • the k largest entries in each column, for k = 100 or k = 1000.

Here ⊙ is the elementwise, or Hadamard, product. Note that the transition matrix P need not come from the construction described above and could come from anywhere, such as the common stochastic transformations in PageRank [14]. Nonetheless, we will always use P = A TDiag(p) in this manuscript. The results of AllPageRank could be used to form a nearest neighbor approximation, to form PageRank affinities [46], or simply as a diffusion approximation of the underlying graph. Additional applications of this type of output arise in similar methods that solve protein function inference problems [23, 31]. Finally, this can be related to some idea of a “PageRank effective resistance” on an edge.
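
As a minimal sketch of the per-column summary step, the snippet below computes one common form of the participation ratio (for a probability vector x, 1∕∑ x_i² acts as an effective count of non-zero entries; our exact normalization may differ) together with the indices of the k largest entries. The function name column_summary is illustrative.

```julia
# Sketch of the summary statistics for one computed column x of X; the
# participation-ratio formula is one common choice and is an assumption here.
function column_summary(x::Vector{Float64}, k::Integer=100)
    pr = 1.0 / sum(abs2, x)                    # soft count of non-zeros for a probability vector
    topk = partialsortperm(x, 1:k; rev=true)   # indices of the k largest entries
    return pr, topk, x[topk]
end
```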

We stress that there are applications of the output for PageRank, but that our general goal is to use PageRank as a model computation that is representative of the challenges faced by more general numerical computing problems on graphs. This is akin to PageRank’s widespread use to evaluate the performance of distributed graph computation engines [1, 9, 26, 33, 43, 44]. See additional examples of related computations in Sect. 5.

3 PageRank Algorithms

There are a few classic algorithms for PageRank computations: the power method, the Gauss–Seidel method, and the push method in two variations. We briefly explain these algorithms, give short pseudocode for each, and state an easy-to-compute error bound.

For the following set of algorithms, we will describe how to use them to compute a single vector, although we note that all of them are amenable to computing multiple vectors simultaneously as discussed in Sect. 3.6. We will use the notation x to refer to the solution vector to (I − αP)x = (1 − α)v where v = e i for some fixed i. Each iterate in a high-level description of the method will be written x (k); what exactly constitutes an iteration may vary among the discussions. For instance, for Gauss–Seidel and the push methods, it is often helpful to analyze a single update step within an iteration. We have endeavored to keep the discussion consistent and point to the pseudocode to clarify any ambiguities. In the pseudocode and the discussion of it, however, we will be more explicit about memory and use x, y, and r to denote the vectors of memory associated with an iteration rather than their mathematical interpretations of the solution.

3.1 The Power Method

What is usually called the power method for PageRank is probably better called the Richardson method for the linear system formulation of PageRank [14] because the two iterations are exactly identical in the scenario that each iterate is a probability distribution. The idea underlying both is to unwrap the linear system (1) into the fixed point iteration

$$\displaystyle \begin{aligned} {\mathbf{x}}^{(0)}_i = {\mathbf{e}}_i \qquad {\mathbf{x}}^{(k+1)}_i = \alpha {\boldsymbol{P}} {\mathbf{x}}^{(k)}_i + (1-\alpha) {\mathbf{e}}_i. \end{aligned} $$

The main work at each iteration is the matrix–vector product Px (k). This can be done either by forming a sparse matrix P whose non-zero values are the probabilities from A TDiag(p) or, instead, by storing just the graph structure of A T without any values for the non-zero entries, along with the vector p. To find a point where ||x (k) −x||1 ≤ τ, this method requires at most \(2 \log (\tau )/\log (\alpha )\) iterations [7]. As noted in [14], we can terminate earlier when ∥x (k+1) −x (k)∥1 ≤ τ(1 − α) because that guarantees the same error condition. This helpful circumstance arises due to the relationship between ∥x (k+1) −x (k)∥1 and the residual of the linear system.

In our implementation, this iteration is implemented using two vectors of memory for a compressed sparse column representation of the adjacency matrix A. The pseudocode is in Fig. 1. In this algorithm, we store an iteration in x and use the memory in y to compute the next iterate x (k+1). After the entire update is done, we compare the vectors and swap.

Fig. 1

Pseudocode for the power method to compute the vth column of X = (1 − α)(I − αP)−1. This algorithm takes two vectors of memory and performs random reads from the memory in x and p, but then linearly ordered writes to the memory y
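
To make the memory access pattern concrete, the following is a minimal Julia sketch of this iteration; it mirrors the structure of Fig. 1 but is illustrative rather than our exact pseudocode, and the function name is ours.

```julia
using SparseArrays

# Sketch of the power (Richardson) iteration. A is the adjacency matrix in CSC
# form, so the stored entries of column i are the in-neighbors of node i; the
# graph is assumed strongly connected (every out-degree is positive).
function power_pagerank(A::SparseMatrixCSC, v::Integer; alpha=0.85, tol=1e-8)
    n = size(A, 1)
    rows = rowvals(A)
    d = zeros(Int, n)
    for i in 1:n, nzi in nzrange(A, i)
        d[rows[nzi]] += 1              # out-degree of each in-neighbor
    end
    p = 1.0 ./ d                       # inverse out-degrees
    x = zeros(n); x[v] = 1.0           # x^(0) = e_v
    y = zeros(n)
    maxiter = ceil(Int, 2 * log(tol) / log(alpha))
    for _ in 1:maxiter
        for i in 1:n                   # linearly ordered writes to y
            s = 0.0
            for nzi in nzrange(A, i)   # random reads of x and p at the in-neighbors j
                j = rows[nzi]
                s += x[j] * p[j]
            end
            y[i] = alpha * s + (1 - alpha) * (i == v)
        end
        delta = sum(abs(y[i] - x[i]) for i in 1:n)
        x, y = y, x                    # swap the two vectors of memory
        delta <= tol * (1 - alpha) && break
    end
    return x
end
```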

3.2 Gauss–Seidel

Gauss–Seidel is a simple variant of the power method where we update the solution vector immediately after computing the value update in Fig. 1. This requires only one vector of memory. Writing this update formally is often annoyingly intricate—it involves an idea called a regular splitting [45]—but is an extremely simple change in terms of the code. Thus, we start with the pseudocode in Fig. 2.

Fig. 2

Pseudocode for the Gauss–Seidel method to compute the vth column of X = (1 − α)(I − αP)−1. This algorithm takes one vector of memory. It maintains x as the Gauss–Seidel iterate elementwise scaled by p. This performs random reads from the memory in x, and then linearly ordered writes to the same memory x. This works like the power method from Fig. 1 where the updates are immediately applied in the vector x

There are a few subtle differences from the pseudocode of the power method. First, we initialize the vector x from zero. This choice will turn out to make tracking the error in the Gauss–Seidel iteration much easier [8]. Second, the algorithm actually stores x (k) ⊙p in the memory x, where ⊙ is an elementwise product. This choice is made so that we can compute the quantities in update on lines 14–17 without looking up the values in p. Note that we could have done the same transformation for the power method, but we found it slightly decreased performance. Here, the value of δ tracks the total sum of x (k) = x ⊙d after the loop on lines 13–25. This corresponds to the sum of an iterate x (k) of the Gauss–Seidel method. As we will see next, for Gauss–Seidel starting from 0, we have that \(1 - \sum _{i=1}^n [{\mathbf {x}}^{(k)}]_i\) gives the 1-norm of the error.
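
A minimal Julia sketch of this sweep is below; for readability it stores the plain iterate x rather than the p-scaled iterate of Fig. 2, assumes the graph has no self-loops (so the diagonal of I − αP is all ones), and uses an illustrative function name.

```julia
using SparseArrays

# Sketch of the Gauss-Seidel sweep; illustrative, not the exact pseudocode of
# Fig. 2. Assumes a strongly connected graph with no self-loops.
function gauss_seidel_pagerank(A::SparseMatrixCSC, v::Integer; alpha=0.85, tol=1e-8)
    n = size(A, 1)
    rows = rowvals(A)
    d = zeros(Int, n)
    for i in 1:n, nzi in nzrange(A, i)
        d[rows[nzi]] += 1              # out-degrees
    end
    x = zeros(n)                       # x^(0) = 0, so 1 - sum(x) is the 1-norm error
    while true
        for i in 1:n                   # update x[i] in place (random reads, linear writes)
            s = 0.0
            for nzi in nzrange(A, i)   # in-neighbors j of i
                j = rows[nzi]
                s += x[j] / d[j]
            end
            x[i] = alpha * s + (1 - alpha) * (i == v)
        end
        sum(x) >= 1 - tol && return x  # stop once the error bound drops below tol
    end
end
```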

The error analysis of the method is fairly straightforward. The iterations we analyze are the unscaled iterations that would correspond to multiplying x in the code by d (elementwise) at each step. We call these x (k) as discussed in the previous paragraph. In what follows x is the solution vector. However, let us note that what constitutes an iteration is not the loop on line 11, but the loop on line 13. This is because this method is easiest to analyze if we only consider what happens when a single element of x (k) is changed on Line 23. In our analysis, we will show that each iterate x (k) is bounded above by the true solution x. Formally, this can be stated as x (k) ≤x. We will establish this by showing that iterates only increase the value of x (k) and they never get too large. If x (k) ≤x is the case, then the error

$$\displaystyle \begin{aligned} \|{{\mathbf{x}} - {\mathbf{x}}^{(k)}}\|{}_1 = \sum_i [{\mathbf{x}} - {\mathbf{x}}^{(k)}]_i = \sum_i [{\mathbf{x}}]_i - \sum_i [{\mathbf{x}}^{(k)}]_i = 1 - \sum_i [{\mathbf{x}}^{(k)}]_i. \end{aligned} $$

Here, we only used that the sum of the entire PageRank vector ∑i[x]i = 1 for the true solution on a strongly connected graph. Note that in line 22, we update δ, which tracks the sum of the unscaled vector x (k), and after the full loop on lines 13–24, we have computed δ =∑i[x (k)]i. Consequently, this termination criterion maps to what we use in the algorithm.

Now, it remains to show that we indeed have the solution upper bounding each unscaled iterate. Note that, because x (0) = 0 we immediately have x (0) ≤x. We will also strengthen our setup and note that the residual of the linear system (1)

$$\displaystyle \begin{aligned} {\mathbf{r}}^{(0)} = (1-\alpha) {\mathbf{v}} - ({\boldsymbol{I}} - \alpha {\boldsymbol{P}}){\mathbf{x}}^{(0)} \end{aligned} $$

is also non-negative. The importance of the relationship with the residual is that the residual and error satisfy the following system of equations:

$$\displaystyle \begin{aligned} ({\boldsymbol{I}} - \alpha {\boldsymbol{P}}) ({\mathbf{x}} - {\mathbf{x}}^{(k)}) = {\mathbf{r}}^{(k)}. \end{aligned} $$

The matrix (I − αP) is an M-matrix [27] with a non-negative inverse, so the error vector x −x (k) ≥ 0 when r (k) ≥ 0. Thus, it suffices to show that r (k+1) ≥ 0 given r (k) ≥ 0. In this case, we know that x (k) and x (k+1) are the same in all but one coordinate. Let u be the index that is changed in iteration k. We compute

$$\displaystyle \begin{aligned} {\mathbf{x}}^{(k+1)} = {\mathbf{x}}^{(k)} + \mu_k {\mathbf{e}}_u, \end{aligned} $$

where μ k is the value of the variable update in the pseudocode minus \({\mathbf {e}}_u^T {\mathbf {x}}^{(k)}\). Expanding out the code to get μ k gives

$$\displaystyle \begin{aligned}\mu_k = \begin{cases} \alpha \sum_{j \to u} x^{(k)}_j / d_j - x^{(k)}_u & u \neq v \\ \alpha \sum_{j \to u} x^{(k)}_j / d_j + (1-\alpha) - x^{(k)}_u & u = v \end{cases}. \end{aligned} $$

Note that μ k is exactly the uth element of the residual of the linear system (1)

$$\displaystyle \begin{aligned} {\mathbf{r}}^{(k)} = (1-\alpha) {\mathbf{v}} - ({\boldsymbol{I}} - \alpha {\boldsymbol{P}}){\mathbf{x}}^{(k)} \Rightarrow \mu_k = {\mathbf{e}}_u^T {\mathbf{r}}^{(k)}. \end{aligned} $$
(2)

We have that μ k ≥ 0 because r (k) ≥ 0 by assumption. At this point, we still need to show that r (k+1) ≥ 0, and we have

$$\displaystyle \begin{aligned} {\mathbf{r}}^{(k+1)} & = (1-\alpha) {\mathbf{v}} - ({\boldsymbol{I}} - \alpha {\boldsymbol{P}}) ({\mathbf{x}}^{(k)} + \mu_k {\mathbf{e}}_u) = {\mathbf{r}}^{(k)} - \mu_k ({\boldsymbol{I}} - \alpha {\boldsymbol{P}}) {\mathbf{e}}_u \\ & = ({\mathbf{r}}^{(k)} - \mu_k {\mathbf{e}}_u) + \underbrace{\mu_k \alpha {\boldsymbol{P}} {\mathbf{e}}_u}_{\text{non-negative}}. \end{aligned} $$

Now, we also have that

$$\displaystyle \begin{aligned} {\mathbf{r}}^{(k)} - \mu_k {\mathbf{e}}_u \geq 0 \end{aligned} $$

because μ k is the uth component of r (k) (see (2)). Thus we have r (k+1) ≥ 0.

This justifies that the algorithm in Fig. 2, if it terminates, will have the correct error. To see that it will terminate, note that this same analysis shows that we reduce the sum of the residual at each step of the algorithm. We can also get convergence through classical results about the convergence of Gauss–Seidel on M-matrices [45].

Although there is no sub-asymptotic theory comparing Gauss–Seidel with the power method, ample empirical evidence suggests that, for most graphs, Gauss–Seidel converges in about half the iterations of the power method. The asymptotic theory in Varga [45] shows that Gauss–Seidel is asymptotically faster than the power method, but that comparison is in terms of the spectral radius alone and can be misleading for PageRank, as an example with a random graph from [13] shows. To foreshadow our results, Gauss–Seidel will be the method to beat for computing PageRank with a single thread. This mirrors results found in other scenarios as well [15, 36].

3.3 The Cyclic Push Method

One challenge with Gauss–Seidel is that it requires in-neighbor access to the edges of the graph. These are still accessed consecutively, which makes streaming solutions a possibility. There are nevertheless many graph systems that provide the most efficient access to the out-neighbors of a directed graph. It turns out that there is a way to implement Gauss–Seidel for these systems using something called the push method for PageRank; the big difference, however, is that we maintain two vectors of memory. The first variant of the push method we describe maps exactly to the Gauss–Seidel computation above. The key difference is that it explicitly maintains a residual vector.

Suppose we kept a solution vector x (k) along with a residual vector r (k). Then the single-entry update in Gauss–Seidel corresponds to

$$\displaystyle \begin{aligned} {\mathbf{x}}^{(k+1)} = {\mathbf{x}}^{(k)} + {\mathbf{e}}_u {\mathbf{e}}_u^T {\mathbf{r}}^{(k)}. \end{aligned} $$

(This expression arises from (2) combined with the μ k variation on the Gauss–Seidel update.) This is easy to compute, but then we have to update r (k) to get the new residual r (k+1). In the push method, this second update dominates the work.

Recall the expression for the residual update that arose in our theory on Gauss–Seidel

$$\displaystyle \begin{aligned} {\mathbf{r}}^{(k+1)} = ({\mathbf{r}}^{(k)} - \mu_k {\mathbf{e}}_u) + {\mu_k \alpha {\boldsymbol{P}} {\mathbf{e}}_u}. \end{aligned} $$

To perform this update, all we need to do is set the uth element of r (k) to 0, and then look up the values of the uth column of P. Note that the matrix P = A TDiag(p), and so the uth column is just the uth row of A, which encodes the out-neighbors, scaled by p[u]. The resulting algorithm is given by the pseudocode in Fig. 3.

Fig. 3

Pseudocode for the cyclic push method to compute the vth column of X = (1 − α)(I − αP)−1. This algorithm takes two vectors of memory. It maintains x as the Gauss–Seidel iterate and r as the residual (1 − α)e v − (I − αP)x. This performs random writes to the memory in r. Note that this iteration is mathematically identical to Fig. 2, but it uses compressed row storage for A instead of compressed column storage

This iteration is mathematically identical to Gauss–Seidel. The iteration in this form was described by McSherry [36] as an alternative way of computing PageRank that was more amenable to optimization because we can use properties of the residual to choose when to revisit or skip updating a node. The term “push” comes from the idea that when you update x[i] you “push” an update out to the neighbors of i in the residual vector.
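
The following is a minimal Julia sketch of this update written against the out-neighbor access pattern; it is illustrative rather than the exact pseudocode of Fig. 3. Here At is the CSC form of A^T, so column u stores the out-neighbors of u, which plays the role of compressed row storage of A.

```julia
using SparseArrays

# Sketch of the cyclic push method. At is the CSC form of A^T, so column u
# stores the out-neighbors of u. Assumes a strongly connected graph with no
# self-loops.
function cyclic_push_pagerank(At::SparseMatrixCSC, v::Integer; alpha=0.85, tol=1e-8)
    n = size(At, 1)
    rows = rowvals(At)
    x = zeros(n)
    r = zeros(n); r[v] = 1 - alpha          # r^(0) = (1 - alpha) e_v
    sumx = 0.0
    while sumx < 1 - tol                    # 1-norm error is 1 - sum(x)
        for u in 1:n                        # cyclic sweep over the vertices
            mu = r[u]                       # mu = e_u' r
            mu == 0 && continue
            x[u] += mu; sumx += mu          # x <- x + mu e_u
            r[u] = 0.0
            nz = nzrange(At, u)
            du = length(nz)                 # out-degree of u
            for nzi in nz                   # push alpha*mu/du to each out-neighbor of u
                r[rows[nzi]] += alpha * mu / du
            end
        end
    end
    return x
end
```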

3.4 The Push Method With a Work Queue

The name “push method” actually comes from [2]. That paper utilized the push method to compute a personalized PageRank vector of an undirected graph in constant time (where the constant depends on α and the accuracy τ) for a weaker notion of error. This weak notion corresponds to finding an iterate with error that satisfies 0 ≤x −x (k) ≤ τd. So the error on a node with a large degree could be large. This enabled a number of clever ideas to show that this can be done in work that does not depend on the size of the graph. One of the key ideas is that this algorithm maintains a queue of vertices to process, and hence, avoids storing or working with vectors that are the size of the graph.

In this case, we adopt similar ideas and add a work queue of vertices that have not yet satisfied their tolerance. In comparison with the cyclic push method, this maintains the same amount of memory; in addition, when the residual associated with a vertex rises above a threshold, we add it to a queue of vertices to process. Specifically, if the residual on a node is ω, then we can show that the maximum change to the solution vector due to that entry is ω∕(1 − α). There might be as many as n non-zero entries in the residual, so if we want a solution that is accurate to 1-norm error τ, we can check whether each residual entry is smaller than (1 − α)τ∕n. If it is smaller than this, we can show it will not impact the solution beyond the tolerance.

The pseudocode with the queue is in Fig. 4. The algorithm is identical to Fig. 3, except that we visit vertices in the order in which they were added to the queue. The only small subtlety is that we check whether a vertex is already in the queue in order to avoid adding it multiple times based on the current value of the residual. In Line 25, we check if this is the first time that the element increased beyond the threshold ω. The other small detail is that we keep a running sum of the vector x in δ, which is incremented by the value of μ at each step. In a low-precision implementation, this sum would need to be accumulated at higher precision as it involves an extremely long running sum. As before, the error analysis justifies that when the total sum of the vector x exceeds 1 − τ, we have converged.

Fig. 4

Pseudocode for the push method with a work queue to compute the vth column of X = (1 − α)(I − αP)−1. This algorithm takes two vectors of memory. It maintains x as the Gauss–Seidel iterate and r as the residual (1 − α)e v − (I − αP)x. This performs random writes to the memory in r and picks what amounts to a randomly scattered entry i to process next
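
A minimal Julia sketch of the queued variant is below; it is illustrative rather than the exact pseudocode of Fig. 4, its simple array-backed FIFO never reclaims storage, and the threshold follows the (1 − α)τ∕n rule discussed above.

```julia
using SparseArrays

# Sketch of the push method with a work queue. At is the CSC form of A^T
# (column u = out-neighbors of u). A vertex is enqueued the first time its
# residual exceeds omega = (1 - alpha) * tol / n, which bounds the final
# 1-norm error by tol.
function queue_push_pagerank(At::SparseMatrixCSC, v::Integer; alpha=0.85, tol=1e-8)
    n = size(At, 1)
    rows = rowvals(At)
    omega = (1 - alpha) * tol / n
    x = zeros(n)
    r = zeros(n); r[v] = 1 - alpha
    inqueue = falses(n); inqueue[v] = true
    queue = [v]; head = 1                   # simple array-backed FIFO (storage is never reclaimed)
    while head <= length(queue)
        u = queue[head]; head += 1
        inqueue[u] = false
        mu = r[u]
        x[u] += mu
        r[u] = 0.0
        nz = nzrange(At, u)
        du = length(nz)
        for nzi in nz
            w = rows[nzi]
            r[w] += alpha * mu / du
            if r[w] > omega && !inqueue[w]  # enqueue when the residual first exceeds omega
                inqueue[w] = true
                push!(queue, w)
            end
        end
    end
    return x
end
```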

3.5 Related Algorithmic Advances

It was [21] and [36] that realized that the push formulation offered a number of additional opportunities to accelerate PageRank computation by skipping and optimizing potential updates in a Gauss–Seidel-like fashion. These were later improved upon by [6] and [2] with the idea of the workqueue. The connection to Gauss–Seidel only arose later [8, 11]. The algorithms in our paper do not use the full flexibility of these methods as they are often specialized techniques that arise for web-graphs.

We have ignored here a wide class of methods for PageRank that work via Monte Carlo approaches [3,4,5, 32]. These methods all have trouble getting high-accuracy entries, although they tend to get the top-k lists correct and should be considered for applications that only desire that type of information. Krylov methods are only competitive for PageRank when α is extremely large [18]. There have been numerous attempts to parallelize the computation of a single PageRank vector [17]—especially on graph processing systems [1, 9, 26, 33, 43, 44]. In particular, these methods often utilize ideas closely related to the workqueue notion of the push method. Analysis of these results shows that they often fail to be useful parallelizations of the underlying problem and have significant overhead compared to simple implementations [37].

3.6 Multivector Transformations

The algorithms described so far—and most of the discussions of PageRank that we are aware of—deal with computing a single PageRank or personalized PageRank vector. (The biggest exceptions are a number of techniques that attempt to approximate all PageRank vectors [4, 21].) With the idea in mind that we are considering an educated, but non-expert, user of these algorithms, we note the following. Modern processors feature vector execution units, often called SIMD (single instruction, multiple data) or simply vector instructions. Because the data access patterns for the power method, Gauss–Seidel, and the cyclic push method are entirely independent of both the choice of the right-hand side e i and any elements of the vectors, we can conceptually execute the same iteration on multiple vectors simultaneously. This involves few changes to the code assuming that the language supports some notion of treating a vector of entries like a scalar. Thus, for each of the methods above, we create a variation that processes multiple vectors simultaneously. Our technique to do this in the Julia programming language is to replace a one-dimensional array of data with a one-dimensional array of statically sized vectors. This enables the compiler to unroll and auto-vectorize code that involves multiple entries at once in a way that is consistent with our informed user persona. The code is essentially unchanged from the previous cases and we refer interested readers to our online codes to reproduce these ideas. (See Sect. 6.)
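
To illustrate the transformation, here is a sketch of the Gauss–Seidel sweep from Sect. 3.2 operating on 8 columns at once using StaticArrays.jl; the 8-wide width, the function name, and the seed handling are illustrative choices rather than our exact code.

```julia
using SparseArrays, StaticArrays

# Sketch of the multivector idea: the Gauss-Seidel sweep applied to 8 seeds at
# once by storing one SVector{8,Float64} per node. Assumes a strongly connected
# graph with no self-loops.
function gauss_seidel_pagerank8(A::SparseMatrixCSC, vs::NTuple{8,Int};
                                alpha=0.85, tol=1e-8)
    n = size(A, 1)
    rows = rowvals(A)
    d = zeros(Int, n)
    for i in 1:n, nzi in nzrange(A, i)
        d[rows[nzi]] += 1
    end
    # One 8-wide entry per node; seed[i] has a 1 in slot t exactly when i == vs[t].
    x = fill(zero(SVector{8,Float64}), n)
    seed = [SVector{8,Float64}(ntuple(t -> i == vs[t] ? 1.0 : 0.0, 8)) for i in 1:n]
    while true
        for i in 1:n
            s = zero(SVector{8,Float64})
            for nzi in nzrange(A, i)
                j = rows[nzi]
                s += x[j] / d[j]        # one scalar operation becomes an 8-wide operation
            end
            x[i] = alpha * s + (1 - alpha) * seed[i]
        end
        all(sum(x) .>= 1 - tol) && return x   # all 8 columns meet the error tolerance
    end
end
```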

4 Results

We now conduct a set of experiments using these four PageRank algorithms in the setting of the AllPageRank problem. That is, we run them to compute multiple columns of the matrix X. The primary performance measure we are considering is the number of columns computed per unit time. We run the algorithms for one of two time intervals: 14.4 min and 5 min. Note that 14.4 min is exactly 1/100th of a day, and so the number of vectors computable in 24 h is exactly 100 times greater. For 5 min, the factor is roughly 300 times larger. Note that the AllPageRank problem involves a great deal of computation, and so it is natural to, perhaps, think of running this for a few days. Months or weeks are less reasonable, though.

We consider two parallelization strategies: threads and processes. In the threaded implementation, we load the graph information into memory once and use the high-level language’s threading library to launch a given number of computation threads. These threads continue to compute single columns, or multiple columns simultaneously, of the solution until the time limit is exhausted. They all access the same shared memory copy. The process scenario is largely the same, except we launch independent processes that each have their own copy of the graph information. We do not consider any parallel setup or IO time, but this was negligible for our experiments: it might take 1–5 min to set up an execution that we expect to run for hours. Our code for these experiments is all available online (see Sect. 6 for the reference).
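
A sketch of the kind of threaded driver we describe is below; it assumes a single-column solver such as the gauss_seidel_pagerank sketch from Sect. 3.2 and a placeholder summarize step, and the names are illustrative. The process variant replaces this with independent Julia processes (e.g., via the Distributed standard library), each loading its own copy of the graph.

```julia
# Sketch of a threaded driver: one shared copy of A, each thread repeatedly
# claims the next seed and computes its column until the time budget expires.
function run_threaded(A, seeds; budget_seconds=14.4*60)
    next = Threads.Atomic{Int}(1)
    done = Threads.Atomic{Int}(0)
    deadline = time() + budget_seconds
    Threads.@threads for t in 1:Threads.nthreads()
        while time() < deadline
            k = Threads.atomic_add!(next, 1)   # claim the next seed index
            k > length(seeds) && break
            x = gauss_seidel_pagerank(A, seeds[k])
            # summarize(x)                     # e.g., store the top-k entries and participation ratio
            Threads.atomic_add!(done, 1)
        end
    end
    return done[]                              # number of columns completed
end
```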

Also in keeping with our informed user persona, we did not take any heroic measures to eliminate all simultaneous usage of the machines. We asked other users not to use the machines during our tests, which, we believe, was respected. There were a few processes from other users that appeared to be doing intermittent work. (For example, we might see someone running the Unix “top” command to check whether the machine was in use.)

4.1 Data and Machines

We report on two datasets, each of which is a strongly connected component of a larger graph. These data come from [29, 38].

  • Orkut has 2,997,355 nodes and 220,259,736 directed edges.

  • LiveJournal has 3,828,682 nodes and 65,825,429 directed edges.

The two machines we use are:

  • A 64-core (4 × 16-core) shared memory server with Xeon E7-8867 v3 (2.50 GHz) CPUs and 2 TB of RAM; this is configured in a fully connected topology. Each processor has four memory channels, 45 MB of L3 cache, and 16 × 256 KB of L2 cache.

  • A 192-core (8 × 24-core) shared memory server with Xeon Platinum 8168 (2.70 GHz) CPUs and 6 TB of RAM; this is configured in a hypercube topology with three connections per CPU. Each processor has six memory channels, 33 MB of L3 cache, and 24 × 1 MB of L2 cache.

4.2 Performance on a 64-Core System

We begin our discussion by looking at the results of all the algorithms on the 64-core server as these are the simplest to understand. These are summarized in Table 1, which shows how performance varies with 1, 32, and 64 threads and processes when we compute 1, 8, or 16 vectors simultaneously. In principle, using multiple vectors simultaneously will result in the Julia compiler generating AVX and SIMD instructions on the platform, which can greatly increase the computational throughput. We see that this increases performance by around a factor of 4 or 5. We see only a small change going from 8 to 16 vectors computed simultaneously, and sometimes this will decrease performance (see the threaded results for the power method and the process results for Gauss–Seidel). The results with processes are generally, but not always, faster than the results with the same number of threads.

Table 1 Vectors computed on the LiveJournal graph within 14.4 min

Note that the power method uses more iterations than either the Gauss–Seidel and cyclic push methods, and so we expect it to be slower from an algorithm perspective (although the memory access patterns are more amenable to parallelization). Gauss–Seidel and the cyclic push methods are mathematically identical and so execute the same number of iterations. The difference in performance is entirely due to the memory access patterns. These results show that it is better to have random reads than random writes as the power method is faster than the cyclic push method. Although the Queue Push method should do the least work of all, it seems that the additional cost of maintaining the queue causes the method to run the slowest.

In summary, these results point to challenges in linearly scaling the work involved in this pleasingly parallel computation. They also highlight the need to compute multiple vectors simultaneously. Note that running Gauss–Seidel with one process or thread produces about half the output of the power method with 32 threads computing only one vector at a time.

4.3 Sparse Matrix Ordering

The next experiment we consider is using a sparse matrix ordering scheme to improve the locality of reference among the operations. This is a standard technique in sparse matrix computations that is commonly taught in graduate curricula. We use the METIS algorithm [24] and generate 50 and 100 partitions. We then re-order the matrix so that each partition is a consecutive block. Since the computations with multiple vectors all had uniformly higher performance, we only report the results for the methods that compute eight vectors simultaneously.
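
The reordering step can be sketched as follows, assuming the CSC adjacency matrix A from before and assuming Metis.jl accepts a symmetric sparse adjacency structure directly (otherwise one would first convert with Metis.graph); this is illustrative rather than the exact code used in our experiments.

```julia
using SparseArrays, Metis

# Sketch of the METIS-based reordering: partition the symmetrized structure and
# permute the matrix so each partition becomes a consecutive block.
S = A + sparse(A')                 # symmetrize the directed adjacency structure
parts = Metis.partition(S, 50)     # assign each vertex to one of 50 parts
perm = sortperm(parts)             # group vertices of the same part together
Aperm = A[perm, perm]              # reordered adjacency matrix used by the solvers
```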

Again, these results show a considerable increase in performance for most methods. The performance of Gauss–Seidel increases by 30%, for instance. Notable exceptions include the power method and Queue Push methods on Orkut. The partitions took less than an hour to compute. Since we envision running these computations for over 10 h, the permuted method would overtake the non-permuted one after about 4 h. Consequently, it seems this technique is still worth doing even for these pleasingly parallel computations. In particular, note that the cyclic push method shows a very large change in performance and largely runs faster than the power method in all cases. Given the random write nature of this work, this is perhaps unsurprising, but it is useful to know that this type of algorithm is especially sensitive to ordering (Table 2).

Table 2 The change in the number of vectors computed on LiveJournal and Orkut as we vary the sparse matrix order shows that a small bit of careful ordering dramatically improves performance

4.4 Performance on a 192-Core System

Next, we investigate how performance changes on a 192-core system for the algorithms that run eight vectors simultaneously. Table 3 shows the results for the threaded and independent process scenarios. This table highlights the problem with scaling threaded computation on this particular system. As the number of threads increases, the performance decreases. We investigate this finding further in the next section (Sect. 4.5). In fact, on the Orkut network, no vectors are completed within 14.4 min when using 192 threads for most of the trials. We repeated this trial to verify that the result was consistent—it was.

Table 3 Vectors computed by the 192-core server show problems with scaling the threaded computation

Overall, these results show challenges when using threads on a machine with a more complex memory topology, even when using pleasingly parallel computations.

4.5 Performance Scaling

The final experiment we conduct is a performance scaling study for the best algorithm we found: the SIMD Gauss–Seidel algorithm. We use eight vectors as there was only a minor performance difference (if any) with the 16-vector variant. Here we also use the data that has been reordered with METIS in order to get a sense of scaling when the computation is performing well. We vary the number of threads or processes on each system and report the scaling results in Fig. 5. These show that the threading performance quickly degrades on the machine with 192 cores and that the per-process implementation is needed to get good scaling results. Note also that neither setup scales particularly well for a pleasingly parallel computation.

Fig. 5

Scaling results for threads and processes implementations. The text annotations give the raw number of vectors computed by that method within 5 min as well as the speedup ratio over the 1 thread or process result. (a) 64-core, threaded. (b) 192-core, threaded. (c) 64-core, processes. (d) 192-core, processes

5 Related Problems

AllPageRank is just a simple instance of a more general need for this type of computation. A closely related methodology underlies the network community profile calculation [12, 22, 30, 34]. This setting involves running a local clustering algorithm hundreds or thousands of times—independently—on a shared graph. These computations often take hours to run on graphs of similar size.

A related computation is the GHOST technique used for network alignment [42]. This calculation extracts a subset of vertices from a large graph and then computes an eigenvalue histogram on the induced subgraph. These histograms are used as an invariant and characteristic feature for network alignment methods. (As an aside, we note that there are better ways to get a related concept called the network density of states [10].)

In summary, the style of computation used for the AllPageRank problem occurs repeatedly and is worth understanding given that these computations often consume considerable time and effort from informed users.

6 Conclusion

The focus of this manuscript is on a pleasingly parallel computation: the AllPageRank problem we introduce. Our investigation of computing the vectors involved in this problem on two shared memory parallel systems showed that expecting linear speedup on these problems is unrealistic. Even in this simple case, our results show that two ideas are crucial to get reasonable performance:

  • computing multiple vectors simultaneously

  • using matrix ordering techniques.

Both of these are easy to incorporate into parallel execution libraries that could be designed for this class of tasks, which is distinct from the current focus of distributed graph computation libraries. Our code is available online for others to reproduce our findings on new and emerging systems: https://github.com/YanfeiRen/pagerank

Back to the problem at hand, we are able to compute around 20,000 columns of X for the LiveJournal graph in 14.4 min. This shows that it would take around two days with 192 cores to generate all the information for the AllPageRank problem. For the Orkut graph, it would take around 5 days. We note that both are reasonable and acceptable runtimes to generate an interesting derived dataset. Waiting a week for an experiment is a fairly standard scenario in the physical sciences.

That said, this is still an expensive computation. Making these techniques commonplace on graphs of this scale would likely require another factor of 10 increase in performance so that the results come in 5 h on 192 cores, or, say, 15 h on 64 cores. Monte Carlo techniques may be one possibility, along with reduced precision computation. Our experiments all used 64-bit floating point values. The computation may be possible in 32-bit floating point, although it will require some care as values such as 1∕4,000,000 are within a factor of 10 of the unit roundoff for 32-bit floats. Finally, we note that there are methods that should further accelerate Gauss–Seidel, such as successive over-relaxation (SOR). While there is a negative finding about SOR on general PageRank systems [20], there are many PageRank and closely related systems that use symmetric positive definite matrices [34] where SOR, with the optimal choice of ω, might be productive. Preliminary tests show this yields another 2–3 fold improvement for undirected graphs.

We realize that there are additional strategies that an expert could take to improve performance such as developing custom routines to control memory placement and thread locality. We note, however, that these tools are difficult to access from high-level libraries where our hypothetical informed user resides.