Keywords

1 Introduction

In the last decades, clustering coefficient [1] has emerged as an important measure of the homophily and transitivity of a network. Tracking clustering coefficient for networks which are large and evolving rapidly, such as World Wide Web and online social networks, is an important issue for many real-world applications, such as detecting spammer for search engines [2] or online video networks [3] and detecting events in phone networks [4].

However, there are two challenges in tracking clustering coefficient. First, computing clustering coefficient for evolving graphs with millions or billions of edges is time-consuming. Especially, the efficient algorithms for static graphs [5, 6] are not sufficient for evolving graphs. Second, for most real-world networks, such as online social networks, the network topology and the size of network are evolving and can not be known beforehand. Moreover, it is always limited restrictedly to access the underlying network for the sake of performance or safety. Thus, the basic assumption of “independent random sampling” in most random sampling based algorithms [7, 8] is not realistic.

Aiming to track clustering coefficient on large and evolving networks, we propose an incremental algorithm. The proposed algorithm follows a random walk based sampling method to estimate clustering coefficient [9], only relying on the public interface and requiring no prior knowledge. The proposed algorithm reuses previous random walk and gets result updated via reconstructing partial random walk as the graph evolves, instead of recomputing from scratch. Both theoretical analysis and experimental evaluations on real-world graphs demonstrate that the proposed algorithm reduces the running time significantly comparing with a state-of-art random walk sampling based algorithm without sacrificing accuracy.

2 Tracking Clustering Coefficient

2.1 Clustering Coefficient

Let G = (V, E) stand for a simple graph with n vertices and m undirected edges. The ith vertex is donated by vi, 1 ≤ i ≤ n. The edge between vi and vj is donated by eij, or eji. The degree of vertex vi is donated by di. The adjacency matrix of G, donated by A, is a n × n symmetric matrix, Ai,j = Aj,i = 1 if and only if there is an edge between vi and vj, and Ai,j = Aj,i = 0 otherwise. We assume that there are no edges from any arbitrary vertex vi pointing to itself, thus Ai,i = 0, 1 ≤ i ≤ n.

We define a wedge as a triplet (vj, vi, vk) for any i, j and k ϵ[1, n], if and only if Ai,j = Ai,k = 1, and j < k. For any wedge (vj, vi, vk) with Aj,k = 1, we define it a triangle. Let li stand for the number of triangles with vi lying in the middle. It is obvious that li is equal to the number of connected edges between vi’s neighbors.

The local clustering coefficient for any node vi, donated by ci, is the ratio of the number of edges between vi’s neighbors to the maximal possible number of such edges. For node vi where di ≥ 2, we define ci as \( 2l_{i} /d_{i} (d_{i} - 1) \). For node vi with less than 2 neighbors, ci is defined as 0.

In this paper, we focus on two popular versions of clustering coefficient: the average clustering coefficient [10] and the global clustering coefficient [10]. The average clustering coefficient of a graph, donated by a, is the average of the local clustering coefficient over the set of nodes. It is defined as \( \sum\limits_{i = 1}^{n} {c_{i} } /n \). The global clustering coefficient, also called transitivity in some previous works, is a metric measuring the probability that two neighbors of a node are connected to one another in global. In this paper, we donate it by g and define it as \( 2\sum\nolimits_{i = 1}^{n} {l_{i} } /\sum\nolimits_{i = 1}^{n} {d_{i} \left( {d_{i} - 1} \right)} \).

2.2 Problem Definition

Let G(t) stand for a snapshot of an evolving graph at time t, 0 ≤ t. G(0) is the initial graph. For t > 0, G(t) is a graph with an arbitrary edge inserted into (or removed from) G(t − 1). The average and global clustering coefficient of G(t) are donated by a(t) and g(t) respectively. Their estimates are donated by \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{a} \left( t \right) \) and \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{g} \left( t \right) \) respectively. Our goal is to compute \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{a} \left( t \right) \) and \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{g} \left( t \right) \) efficiently, 0 < t.

It is supposed that \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{a} \left( 0 \right) \) and \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{g} \left( 0 \right) \) are computed via the algorithm in [9] (hereinafter referred to as baseline algorithm). It generates a random walk with r steps, donated by R(0) = {x1(0), x2(0),…, xr(0)}. A variable φk(t) is defined as \( A_{{x_{k - 1} \left( t \right),{\text{x}}_{k + 1} \left( t \right)}} , \, 2 \le k \le r - 1,{ 0} \le t \). There are another four variables defined as follows. Φa(0) is defined as \( \left( {r - 2} \right)^{ - 1} \sum\nolimits_{k = 2}^{r - 1} {\phi_{k} \left( 0 \right)\left( {d_{{x_{k} \left( 0 \right)}} - 1} \right)^{ - 1} } \), Ψa(0) is defined as \( r^{ - 1} \sum\nolimits_{k = 1}^{r} {\left( {d_{{x_{k} \left( 0 \right)}} } \right)^{ - 1} } \), Φg(0) is defined as \( \left( {r - 2} \right)^{ - 1} \sum\nolimits_{k = 2}^{r - 1} {\phi_{k} \left( 0 \right)d_{{x_{k} \left( 0 \right)}} } \) and Ψg(0) is defined as \( r^{ - 1} \sum\nolimits_{k = 1}^{r} {\left( {d_{{x_{k} \left( 0 \right)}} - 1} \right)} \). The approximate average and global clustering coefficient for G(0), are defined as \( {{\varPhi_{a} \left( 0 \right)} \mathord{\left/ {\vphantom {{\varPhi_{a} \left( 0 \right)} {\varPsi_{a} \left( 0 \right)}}} \right. \kern-0pt} {\varPsi_{a} \left( 0 \right)}} \) and \( \varPhi_{g} \left( 0 \right)/\varPsi_{g} \left( 0 \right) \) respectively. It is assumed that for any t (0 < t), R(t−1), Φa(t−1), Ψa(t−1), Φg(t−1) and Ψg(t−1) are stored.

2.3 Tracking Clustering Coefficient Incrementally

Without loss of generality, we present how the proposed algorithm works at time t (0 < t). It is supposed that euv is added at time t. The proposed algorithm checks whether vu and vv accessed by the stored path R(t − 1). If neither vu nor vv is accessed, it returns the previous results \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{a} \left( {t - 1} \right) \) and \( \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\frown}$}}{g} \left( {t - 1} \right) \) without any updating. Otherwise, the proposed algorithm finds set of steps in R(t − 1) where either vu or vv accessed, donated by X(t − 1). X(t − 1) = {xk(t − 1) | xk(t − 1) = vu || xk(t − 1) = vv}. For each entry xk(t − 1) in X(t − 1), it takes a chance to reroute the random walk and update the estimates. The probability of rerouting the random walk is \( 1/d_{{x_{k} \left( {t - 1} \right)}} \).

As rerouting the random walk Rt−1 ) from the kth step, there are two phases, as shown in Fig. 1. Firstly, the proposed algorithm undoes the random walk from xkt− 1), removing r − k steps in R(t − 1) from xk(t − 1). Secondly, the proposed algorithm regenerates random walk with (r − k) steps starting form the (k + 1)th step and gets the estimates updated. As vu (or vv) is accessed at the kth step, the proposed algorithm picks vv (or vu) at the regenerated (k + 1)th step, then moves to one of its neighbors uniformly at random and repeats such random movement for (r − k − 1) times.

Fig. 1.
figure 1

Two phases in rerouting the random walk

We donate the set of removed steps by R = {xh(t-1)} and the added steps by R+={xh(t)}, where k < h ≤ r. R(t) = R(t − 1) − R + R+. And the variables are updated as follows.

$$ \varPhi_{a} \left( t \right) = \varPhi_{a} \left( {t - 1} \right) - \left( {r - 2} \right)^{ - 1} \sum\nolimits_{{x{}_{k} \in R^{ - } }} {\phi_{k} \left( {t - 1} \right)\left( {d_{{x_{k} }} - 1} \right)^{ - 1} } + \left( {r - 2} \right)^{ - 1} \sum\nolimits_{{x{}_{k} \in R^{ + } }} {\phi_{k} \left( t \right)\left( {d_{{x_{k} }} - 1} \right)^{ - 1} } $$
(1)
$$ \varPsi_{a} \left( t \right) = \varPsi_{a} \left( {t - 1} \right) - r^{ - 1} \sum\nolimits_{{x_{k} \in R^{ - } }} {\left( {d_{{x_{k} }} } \right)^{ - 1} } + r^{ - 1} \sum\nolimits_{{x_{k} \in R^{ + } }} {\left( {d_{{x_{k} }} } \right)^{ - 1} } $$
(2)
$$ \varPhi_{g} \left( t \right) = \varPhi_{g} \left( {t - 1} \right) - \left( {r - 2} \right)^{ - 1} \sum\nolimits_{{x_{k} \in R^{ - } }}^{{}} {\phi_{k} \left( {t - 1} \right)d_{{x_{k} }} } + \left( {r - 2} \right)^{ - 1} \sum\nolimits_{{x_{k} \in R^{ + } }}^{{}} {\phi_{k} \left( t \right)d_{{x_{k} }} } $$
(3)
$$ \varPsi_{g} \left( t \right) = \varPsi_{g} \left( {t - 1} \right) - r^{ - 1} \sum\nolimits_{{x_{k} \in R^{ - } }}^{{}} {\left( {d_{{x_{k} }} - 1} \right)} + r^{ - 1} \sum\nolimits_{{x_{k} \in R^{ + } }}^{{}} {\left( {d_{{x_{k} }} - 1} \right)} $$
(4)

To make how the proposed algorithm works clear, we provide an example. As depicted in Fig. 2, G(t) = G(t − 1) + {e25} and R(t − 1) = {v3, v6, v2, v1}. The proposed algorithm reroutes R(t − 1) at the third step (where v2 accessed) with 1/3 probability. Supposing we reroutes R(t − 1), we remove R={v1} and pick R+={v5} to take the place of the removed steps. Finally we get the random walk updated R(t) = {v3, v6, v2, v5} and the estimates are updated according to the forementioned equations respectively.

Fig. 2.
figure 2

An example of a random walk on a graph with an edge added

There’s a little difference in the case of edge removal. It is supposed that euv is removed at time t. The proposed algorithm updates the stored random walk if and only if euv is passed by the previous random walk. Approach of updating the random walk and estimates is analogous with the case of edge addition.

3 Correctness and Complexity

3.1 Correctness

Here we prove the proposed algorithm is correct via comparing the proposed algorithm with the baseline algorithm on a same graph. We demonstrate that the set of random walks generated by the proposed algorithm on graph G(t) is equal to the set of random walks generated by the baseline algorithm.

Proof.

First, we prove that for any random walk R(t) generated by the baseline algorithm on G(t), there is an equal random walk R(t)’ generated by the proposed algorithm on the same graph. There are two cases. In the first case, we suppose that R(t) doesn’t access either vu or vv. The proposed algorithm inherits the random walk R(t − 1) generated by baseline algorithm on G(t − 1). And we could always find a random walk path R(t)’ = R(t − 1) which is equal to R(t). In the second case, we consider situations that R(t) accesses either vu or vv. Let’s suppose that R(t) accesses vu at its kth step. Similar as the first case, we could always find a path R(t − 1) who is equal to R(t) at the first k steps. At the (k + 1)th step, it takes 1/du probability to access each neighbor of vu (including vv) in R(t). At the (k + 1)th step of R(t)’, it takes 1/du probability to access vv (rerouting R(t − 1) from the (k + 1)th step) and (du − 1)/du probability to access one of vu’s neighbors except vv (inheriting from R(t − 1) without rerouting). Thus it is easy to find a random walk R(t)’ which is equal to R(t) from the (k + 1)th step to the end of the random walk path. In summary, for any random walk generated by the baseline algorithm there’s always an equal random walk generated by the proposed algorithm on the same graph.

In a similar way, it is easy to prove that for any random walk generated by the proposed algorithm, there’s an equal random walk generated by the baseline algorithm on the same graph. Thus, the sets of random walk path generated by the proposed algorithm and the baseline algorithm respectively are equal. So the proposed algorithm is correct, due to it is demonstrated that the baseline algorithm is correct in [9, 11].

3.2 Computing Complexity

Here we analysis the complexity of updating estimate of clustering coefficient as graph evolves all the time. Assuming euv is the edge added at time t and Mt is the number of random walks rerouted. Then we have \( E\left[ {M_{t} } \right] \le {{\alpha n} \mathord{\left/ {\vphantom {{\alpha n} m}} \right. \kern-0pt} m} \), where αn = r, the length of a random walk path.

Proof.

It is intuitive that most edges in the graph are not accessed by the random walk. A random walk needs to be updated only if it accesses vu (or vv) at time t − 1 and chooses vv (or vu) as the next step. For an arbitrary vertex vu, the expectation of the number of times vu is accessed by a random walk is πur. The probability that node vu is accessed at each step of a random walk, donated by πu, is equal to du/D, which is proved in [9].The probability of choosing vv as the next step of vu is 1/du. Thus, consider the expectation of Mt, which can be expressed as Eq. (5). As a random walk updated, the amount of work is O(r) at most. In summary, we could get that the upper bound of expected amount of work that the proposed algorithm needs to do is O((αn)2/m) as an arbitrary edge arrives randomly.

$$ E\left[ {M_{t} } \right] = r\sum\limits_{i,j} {\left( {{{\pi_{i} } \mathord{\left/ {\vphantom {{\pi_{i} } {d_{i} }}} \right. \kern-0pt} {d_{i} }} + {{\pi_{j} } \mathord{\left/ {\vphantom {{\pi_{j} } {d_{j} }}} \right. \kern-0pt} {d_{j} }}} \right)\Pr \left[ {i = u,j = v} \right]} \approx 2r\sum\limits_{i} {\left( {{{d_{i} } \mathord{\left/ {\vphantom {{d_{i} } {d_{i} D}}} \right. \kern-0pt} {d_{i} D}}} \right)\Pr \left[ {i = u} \right]} = 2{r \mathord{\left/ {\vphantom {r D}} \right. \kern-0pt} D} = {{\alpha n} \mathord{\left/ {\vphantom {{\alpha n} m}} \right. \kern-0pt} m} $$
(5)

4 Evaluations

4.1 Experiment Setup

In experiments, we mainly compare the accuracy and performance of the proposed algorithm with the baseline algorithm [9], a state-of-art algorithm estimating the clustering coefficient via random walk. We implement both of the algorithms by Java. Our experiments all run on a machine with Intel(R) Core (TM) i7-2600 CPU and 8 GB RAM. The graphs used in the experiments are some public datasets downloaded from SNAP [12]. We deal with all of them as undirected graphs by ignoring the direction of directed edges. Table 1 lists the main characteristics of the graphs.

Table 1. Main parameters of data sets

For generating the evolving graphs, we randomly remove 100000 edges from each graph and use the rest part as the initial graph in our experiments. The initial estimates are computed by the baseline algorithm. Then we add the removed edges one by one and estimate the average and global clustering coefficient respectively via the proposed algorithm and its competitor. For all experiments, we set r = 0.02n by default, which is enough for getting accurate approximations [9, 11]. Moreover, we run the experiments for 100 times on each graph and use the average results in our evaluations.

4.2 Accuracy

We use RMSE (Root Mean Square Error) to measure the accuracy, which is defined as Eq. (6), where c stands for the exact clustering coefficient and \( \hat{c} \) stands for its estimate. We computed the exact clustering coefficient by a node-iteration based method [13]. Table 2 provides a comparison of the average RMSE.

$$ RMSE = \sqrt {E\left[ {\left( {{{\hat{c}} \mathord{\left/ {\vphantom {{\hat{c}} c}} \right. \kern-0pt} c} - 1} \right)^{2} } \right]} $$
(6)
Table 2. Comparison of RMSE

The results demonstrate that the proposed algorithm performs as well as the baseline algorithm in accuracy. The proposed algorithm achieves smaller RMSE in two-thirds of the experiments. RMSE of the proposed algorithm is about 12% smaller than the baseline algorithm in the best case and it is about 6% bigger than the baseline algorithm in the worst one. In summary, the proposed algorithm achieves close (or even smaller) RMSE comparing with the baseline algorithm for all experiments, which means the proposed algorithm performs as well as the baseline algorithm.

4.3 Performance

We measure the running time and the number of random walk rerouting of the proposed algorithm to evaluate the performance. We also define the speedup as the ratio of the running time of the baseline algorithm to the running time of the proposed algorithm. Table 3 provides a detailed comparison of the running time, the number of random walk rerouting and the speedup of the proposed algorithm on each graph.

Table 3. Comparison of the running time and number of times the random walk rerouted

It is obvious that the proposed algorithm reduces the running time significantly for dealing with large and evolving graphs. The speedup of the proposed algorithm comparing with the baseline algorithm for dealing with graph with 100000 evolving edges is in the range of 115 to 549. We also find that estimating the average and global clustering coefficient on a graph via the proposed algorithm cost similar running time.

5 Conclusion

In this paper, we propose an incremental algorithm for tracking approximate clustering coefficient on dynamic graph via random walk method. As an arbitrary edge is added and/or removed, our algorithm replaces partial random walk path around the evolving part and updates the estimate based on previous result. The accuracy and performance improvement of the proposed algorithm are verified through analysis in theory and experiments on real-world graphs. It is demonstrated that the proposed algorithm improves the performance effectively comparing with the state-of-art algorithm based on random walk.