Abstract
The clustering coefficient is an important measure in complex graph analysis. Tracking the clustering coefficient on dynamic graphs, such as the Web, social networks and mobile networks, can help in spam detection, community mining and many other applications. However, computing the clustering coefficient is expensive for real-world graphs, especially large and evolving ones. Aiming to track the clustering coefficient on dynamic graphs efficiently, we propose an incremental algorithm. It estimates the average and global clustering coefficients via random walk and stores the random walk path. As the graph evolves, the proposed algorithm reconstructs the stored random walk path and updates the estimates incrementally. Theoretical analysis indicates that the proposed algorithm is practical and efficient. Extensive experiments on real-world graphs also demonstrate that it matches the accuracy of a state-of-the-art random-walk-based algorithm while significantly reducing the running time of tracking the clustering coefficient on evolving graphs.
1 Introduction
In the last decades, the clustering coefficient [1] has emerged as an important measure of the homophily and transitivity of a network. Tracking the clustering coefficient for networks that are large and evolving rapidly, such as the World Wide Web and online social networks, is an important issue for many real-world applications, such as detecting spammers in search engines [2] or online video networks [3] and detecting events in phone networks [4].
However, there are two challenges in tracking the clustering coefficient. First, computing the clustering coefficient for evolving graphs with millions or billions of edges is time-consuming; in particular, efficient algorithms designed for static graphs [5, 6] are not sufficient for evolving graphs. Second, for most real-world networks, such as online social networks, the topology and size of the network evolve over time and cannot be known beforehand. Moreover, access to the underlying network is often strictly limited for performance or security reasons. Thus, the basic assumption of "independent random sampling" made by most random-sampling-based algorithms [7, 8] is not realistic.
Aiming to track the clustering coefficient on large and evolving networks, we propose an incremental algorithm. It follows a random-walk-based sampling method [9] to estimate the clustering coefficient, relying only on the public interface of the network and requiring no prior knowledge. Instead of recomputing from scratch, the proposed algorithm reuses the previous random walk and updates the result by reconstructing only part of the walk as the graph evolves. Both theoretical analysis and experimental evaluations on real-world graphs demonstrate that the proposed algorithm reduces the running time significantly compared with a state-of-the-art random-walk-based algorithm, without sacrificing accuracy.
2 Tracking Clustering Coefficient
2.1 Clustering Coefficient
Let G = (V, E) stand for a simple graph with n vertices and m undirected edges. The ith vertex is denoted by vi, 1 ≤ i ≤ n. The edge between vi and vj is denoted by eij, or eji. The degree of vertex vi is denoted by di. The adjacency matrix of G, denoted by A, is an n × n symmetric matrix with Ai,j = Aj,i = 1 if and only if there is an edge between vi and vj, and Ai,j = Aj,i = 0 otherwise. We assume that there are no self-loops, thus Ai,i = 0, 1 ≤ i ≤ n.
We define a wedge as a triplet (vj, vi, vk) for any i, j, k ∈ [1, n] such that Ai,j = Ai,k = 1 and j < k. Any wedge (vj, vi, vk) with Aj,k = 1 is called a triangle. Let li stand for the number of triangles with vi lying in the middle; li is exactly the number of edges between vi's neighbors.
The local clustering coefficient of a node vi, denoted by ci, is the ratio of the number of edges between vi's neighbors to the maximum possible number of such edges. For a node vi with di ≥ 2, we define \( c_{i} = 2l_{i} /\left( {d_{i} (d_{i} - 1)} \right) \). For a node vi with fewer than two neighbors, ci is defined as 0.
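The definition above can be illustrated with a short Python sketch (our own illustration, not the paper's code; the helper name and the toy graph are ours):

```python
# Local clustering coefficient c_i = 2*l_i / (d_i*(d_i - 1)) computed
# from adjacency sets of a simple undirected graph.

def local_clustering(adj, i):
    """adj maps each vertex to the set of its neighbors; returns c_i."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:                        # fewer than two neighbors: c_i = 0 by definition
        return 0.0
    # l_i: number of edges between v_i's neighbors, each counted once
    l = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * l / (d * (d - 1))

# Triangle {1, 2, 3} with a pendant vertex 4 attached to 3:
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

Here local_clustering(adj, 1) is 1.0 (the two neighbors of v1 are connected), while local_clustering(adj, 3) is 1/3, since only one of the three possible edges between v3's neighbors exists.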
In this paper, we focus on two popular versions of the clustering coefficient: the average clustering coefficient [10] and the global clustering coefficient [10]. The average clustering coefficient of a graph, denoted by a, is the average of the local clustering coefficients over the set of nodes: \( a = \sum\nolimits_{i = 1}^{n} {c_{i} } /n \). The global clustering coefficient, also called transitivity in some previous works, measures the probability that two neighbors of a node are connected to one another over the whole graph. We denote it by g and define it as \( g = 2\sum\nolimits_{i = 1}^{n} {l_{i} } /\sum\nolimits_{i = 1}^{n} {d_{i} \left( {d_{i} - 1} \right)} \).
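Both graph-level quantities follow directly from the per-node counts; a minimal sketch (our own naming, not the paper's code) that makes the difference between a and g concrete:

```python
# Exact average (a) and global (g) clustering coefficients, following
# the definitions above: a averages c_i uniformly over nodes, while g
# weights each node by its wedge count d_i*(d_i - 1).

def clustering_coefficients(adj):
    """adj maps each vertex to the set of its neighbors in a simple graph."""
    n = len(adj)
    sum_c = 0.0     # sum of local clustering coefficients c_i
    sum_l = 0       # sum of l_i (triangles with v_i in the middle)
    sum_w = 0       # sum of d_i*(d_i - 1), twice the number of wedges
    for i, nbrs in adj.items():
        d = len(nbrs)
        l = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        if d >= 2:
            sum_c += 2.0 * l / (d * (d - 1))
        sum_l += l
        sum_w += d * (d - 1)
    return sum_c / n, 2.0 * sum_l / sum_w

# Triangle {1, 2, 3} plus a pendant vertex 4: a = 7/12 but g = 3/5,
# because g gives the high-wedge vertex 3 more weight.
a, g = clustering_coefficients({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}})
```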
2.2 Problem Definition
Let G(t) stand for a snapshot of an evolving graph at time t ≥ 0, with G(0) the initial graph. For t > 0, G(t) is the graph obtained from G(t − 1) by inserting (or removing) an arbitrary edge. The average and global clustering coefficients of G(t) are denoted by a(t) and g(t), and their estimates by \( \hat{a}\left( t \right) \) and \( \hat{g}\left( t \right) \), respectively. Our goal is to compute \( \hat{a}\left( t \right) \) and \( \hat{g}\left( t \right) \) efficiently for all t > 0.
It is supposed that \( \hat{a}\left( 0 \right) \) and \( \hat{g}\left( 0 \right) \) are computed via the algorithm in [9] (hereinafter referred to as the baseline algorithm). It generates a random walk with r steps, denoted R(0) = {x1(0), x2(0), …, xr(0)}. A variable φk(t) is defined as \( \phi_{k} \left( t \right) = A_{{x_{k - 1} \left( t \right),x_{k + 1} \left( t \right)}} \) for 2 ≤ k ≤ r − 1 and t ≥ 0. Four further quantities are defined as follows: \( \varPhi_{a} \left( 0 \right) = \left( {r - 2} \right)^{ - 1} \sum\nolimits_{k = 2}^{r - 1} {\phi_{k} \left( 0 \right)\left( {d_{{x_{k} \left( 0 \right)}} - 1} \right)^{ - 1} } \), \( \varPsi_{a} \left( 0 \right) = r^{ - 1} \sum\nolimits_{k = 1}^{r} {\left( {d_{{x_{k} \left( 0 \right)}} } \right)^{ - 1} } \), \( \varPhi_{g} \left( 0 \right) = \left( {r - 2} \right)^{ - 1} \sum\nolimits_{k = 2}^{r - 1} {\phi_{k} \left( 0 \right)d_{{x_{k} \left( 0 \right)}} } \), and \( \varPsi_{g} \left( 0 \right) = r^{ - 1} \sum\nolimits_{k = 1}^{r} {\left( {d_{{x_{k} \left( 0 \right)}} - 1} \right)} \). The approximate average and global clustering coefficients of G(0) are \( \varPhi_{a} \left( 0 \right)/\varPsi_{a} \left( 0 \right) \) and \( \varPhi_{g} \left( 0 \right)/\varPsi_{g} \left( 0 \right) \), respectively. It is assumed that for any t > 0, R(t − 1), Φa(t − 1), Ψa(t − 1), Φg(t − 1) and Ψg(t − 1) are stored.
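The four quantities can be sketched directly from a stored walk. The code below is our own reconstruction of these formulas (all names are ours, not the authors' implementation); note that when φk = 0 the corresponding term of Φa vanishes, which also avoids dividing by d − 1 = 0 at a degree-1 step:

```python
# Reconstruction of the baseline estimator [9]: given a stored r-step
# walk R = [x_1, ..., x_r], compute Phi_a/Psi_a and Phi_g/Psi_g.

def walk_estimates(adj, R):
    r = len(R)
    deg = [len(adj[x]) for x in R]
    Phi_a = Phi_g = 0.0
    for k in range(1, r - 1):              # interior steps x_2 .. x_{r-1}
        if R[k + 1] in adj[R[k - 1]]:      # phi_k = A_{x_{k-1}, x_{k+1}}
            Phi_a += 1.0 / (deg[k] - 1)
            Phi_g += deg[k]
    Phi_a /= r - 2
    Phi_g /= r - 2
    Psi_a = sum(1.0 / d for d in deg) / r
    Psi_g = sum(d - 1 for d in deg) / r
    return Phi_a / Psi_a, Phi_g / Psi_g    # estimates of a and g

# Toy walk on the complete graph K4 (true a = g = 1); a walk this short
# is heavily biased, so the values only illustrate the arithmetic.
k4 = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}}
est_a, est_g = walk_estimates(k4, [1, 2, 3, 4, 2, 1, 3])
```

With r = 0.02n steps, as used in the experiments later, the estimates concentrate around the true values [9, 11].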
2.3 Tracking Clustering Coefficient Incrementally
Without loss of generality, we present how the proposed algorithm works at time t > 0. Suppose euv is added at time t. The proposed algorithm checks whether vu or vv is accessed by the stored path R(t − 1). If neither vu nor vv is accessed, it returns the previous results \( \hat{a}\left( {t - 1} \right) \) and \( \hat{g}\left( {t - 1} \right) \) without any update. Otherwise, the proposed algorithm finds the set of steps in R(t − 1) at which either vu or vv is accessed, denoted X(t − 1) = {xk(t − 1) | xk(t − 1) = vu or xk(t − 1) = vv}. For each entry xk(t − 1) in X(t − 1), it takes a chance to reroute the random walk and update the estimates; the probability of rerouting the random walk is \( 1/d_{{x_{k} \left( {t - 1} \right)}} \).
Rerouting the random walk R(t − 1) from the kth step proceeds in two phases, as shown in Fig. 1. First, the proposed algorithm undoes the random walk from xk(t − 1), removing the r − k steps of R(t − 1) after xk(t − 1). Second, it regenerates a random walk of r − k steps starting from the (k + 1)th step and updates the estimates. Since vu (or vv) is accessed at the kth step, the proposed algorithm picks vv (or vu) as the regenerated (k + 1)th step, then moves to one of its neighbors uniformly at random and repeats this random movement r − k − 1 times.
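The two phases can be sketched as follows. This is our own reconstruction, not the authors' code, with two simplifications: the walk is rerouted at the first triggered step (everything after that step is regenerated anyway, and a trigger at the final step would replace nothing), and the adjacency structure is assumed to already contain the new edge.

```python
import random

# Reroute a stored walk R after edge (u, v) is inserted: a visit to u or
# v at step k triggers a reroute with probability 1/d_{x_k}; the suffix
# after step k is then discarded and regrown through the new edge.
def reroute(adj, R, u, v, rng):
    r = len(R)
    for k in range(r - 1):
        x = R[k]
        if x in (u, v) and rng.random() < 1.0 / len(adj[x]):
            new_R = R[:k + 1] + [v if x == u else u]   # cross the new edge
            while len(new_R) < r:                      # regrow the remaining steps
                new_R.append(rng.choice(sorted(adj[new_R[-1]])))
            return new_R
    return R    # no step triggered: keep the walk and the old estimates

# A graph where e25 has just been inserted and R(t-1) = [3, 6, 2, 1];
# vertex 2 now has degree 3, so the reroute fires with probability 1/3.
adj_t = {1: {2}, 2: {1, 5, 6}, 3: {6}, 5: {2}, 6: {2, 3}}
new_walk = reroute(adj_t, [3, 6, 2, 1], 2, 5, random.Random(7))
```

Depending on the coin flip, new_walk is either the untouched [3, 6, 2, 1] or the rerouted [3, 6, 2, 5].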
We denote the set of removed steps by R− = {xh(t − 1)} and the set of added steps by R+ = {xh(t)}, where k < h ≤ r, so that R(t) = R(t − 1) − R− + R+. The stored quantities Φa, Ψa, Φg and Ψg are then updated by subtracting the contributions of the steps in R− and adding the contributions of the steps in R+.
To make clear how the proposed algorithm works, we provide an example. As depicted in Fig. 2, G(t) = G(t − 1) + {e25} and R(t − 1) = {v3, v6, v2, v1}. The proposed algorithm reroutes R(t − 1) at the third step (where v2 is accessed) with probability 1/3. Supposing we reroute R(t − 1), we remove R− = {v1} and pick R+ = {v5} to take the place of the removed steps. Finally we obtain the updated random walk R(t) = {v3, v6, v2, v5}, and the estimates are updated according to the aforementioned update rules.
The case of edge removal differs slightly. Suppose euv is removed at time t. The proposed algorithm updates the stored random walk if and only if euv is traversed by the previous random walk. Updating the random walk and the estimates is analogous to the case of edge addition.
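Whether a removed edge forces any repair reduces to checking the consecutive steps of the stored walk; a minimal sketch (the helper name is ours):

```python
# True iff the stored walk R traverses edge e_uv in either direction,
# i.e. some pair of consecutive steps is exactly {u, v}.
def walk_uses_edge(R, u, v):
    return any({x, y} == {u, v} for x, y in zip(R, R[1:]))
```

For instance, with a stored walk [3, 6, 2, 5], removing e25 forces a repair, while removing an edge not on the path, such as e12, leaves the walk and the estimates untouched.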
3 Correctness and Complexity
3.1 Correctness
Here we prove that the proposed algorithm is correct by comparing it with the baseline algorithm on the same graph. We demonstrate that the set of random walks the proposed algorithm can generate on a graph G(t) is equal to the set of random walks the baseline algorithm can generate.
Proof.
First, we prove that for any random walk R(t) generated by the baseline algorithm on G(t), there is an equal random walk R(t)′ generated by the proposed algorithm on the same graph. There are two cases. In the first case, suppose R(t) accesses neither vu nor vv. The proposed algorithm inherits the random walk R(t − 1) generated by the baseline algorithm on G(t − 1), so we can always find a random walk path R(t)′ = R(t − 1) that is equal to R(t). In the second case, suppose R(t) accesses vu, say at its kth step. As in the first case, we can always find a path R(t − 1) that agrees with R(t) on the first k steps. At the (k + 1)th step, R(t) accesses each neighbor of vu (including vv) with probability 1/du. At the (k + 1)th step of R(t)′, the proposed algorithm accesses vv with probability 1/du (rerouting R(t − 1) from the (k + 1)th step) and accesses one of vu's other neighbors with total probability (du − 1)/du (inheriting from R(t − 1) without rerouting). Thus we can find a random walk R(t)′ equal to R(t) from the (k + 1)th step to the end of the path. In summary, for any random walk generated by the baseline algorithm there is always an equal random walk generated by the proposed algorithm on the same graph.
In a similar way, it is easy to prove that for any random walk generated by the proposed algorithm, there is an equal random walk generated by the baseline algorithm on the same graph. Thus the sets of random walk paths generated by the two algorithms are equal, and the proposed algorithm is correct, since the baseline algorithm is proved correct in [9, 11].
3.2 Computing Complexity
Here we analyze the complexity of updating the estimates of the clustering coefficient as the graph keeps evolving. Assume euv is the edge added at time t, and let Mt be the number of random walks rerouted. Then \( E\left[ {M_{t} } \right] \le {{\alpha n} \mathord{\left/ {\vphantom {{\alpha n} m}} \right. \kern-0pt} m} \), where r = αn is the length of a random walk path.
Proof.
It is intuitive that most edges in the graph are not accessed by the random walk. A random walk needs to be updated only if it accesses vu (or vv) at time t − 1 and chooses vv (or vu) as the next step. The probability that vu is accessed at any given step of a random walk, denoted by πu, equals du/D, where D = Σi di = 2m, as proved in [9]. Hence the expected number of times vu is accessed by an r-step random walk is πur, and the probability of choosing vv as the next step from vu is 1/du. Thus the expectation of Mt can be expressed as Eq. (5): \( E\left[ {M_{t} } \right] = \pi_{u} r/d_{u} + \pi_{v} r/d_{v} = 2r/D = \alpha n/m \). When a random walk is updated, the amount of work is at most O(r). In summary, the upper bound on the expected amount of work the proposed algorithm does, as an arbitrary edge arrives randomly, is \( O\left( {\left( {\alpha n} \right)^{2} /m} \right) \).
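The expectation can be checked numerically: under the stationary distribution, πi/di = 1/(2m) for every vertex, so the expected number of triggered reroutes per inserted edge collapses to r/m = αn/m regardless of which edge arrives. A small sketch (our own names) makes this concrete:

```python
# E[M_t] = r * (pi_u/d_u + pi_v/d_v) with pi_i = d_i / (2m); since
# pi_i/d_i = 1/(2m) for every i, this equals r/m for any inserted edge.
def expected_reroutes(degrees, u, v, r):
    m = sum(degrees.values()) / 2          # number of undirected edges
    pi = {i: d / (2 * m) for i, d in degrees.items()}
    return r * (pi[u] / degrees[u] + pi[v] / degrees[v])

# Toy degree sequence with D = 12, i.e. m = 6: any endpoint pair gives
# r/m = 60/6 = 10 expected reroutes.
degrees = {1: 3, 2: 2, 3: 4, 4: 3}
```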
4 Evaluations
4.1 Experiment Setup
In our experiments, we mainly compare the accuracy and performance of the proposed algorithm with the baseline algorithm [9], a state-of-the-art algorithm for estimating the clustering coefficient via random walk. We implement both algorithms in Java. All experiments run on a machine with an Intel(R) Core(TM) i7-2600 CPU and 8 GB RAM. The graphs used in the experiments are public datasets downloaded from SNAP [12]. We treat all of them as undirected graphs by ignoring the direction of directed edges. Table 1 lists the main characteristics of the graphs.
To generate the evolving graphs, we randomly remove 100,000 edges from each graph and use the remaining graph as the initial graph in our experiments. The initial estimates are computed by the baseline algorithm. Then we add the removed edges back one by one and estimate the average and global clustering coefficients via the proposed algorithm and its competitor. For all experiments, we set r = 0.02n by default, which is sufficient for accurate approximations [9, 11]. Moreover, we run the experiments 100 times on each graph and use the average results in our evaluations.
4.2 Accuracy
We use the RMSE (root mean square error) to measure accuracy, defined as in Eq. (6), where c stands for the exact clustering coefficient and \( \hat{c} \) stands for its estimate. We compute the exact clustering coefficient with a node-iteration based method [13]. Table 2 provides a comparison of the average RMSE.
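The RMSE over a tracked sequence of snapshots takes its standard form; a sketch with our own names:

```python
import math

# Root mean square error between the exact clustering coefficients c(t)
# and their estimates over a sequence of snapshots.
def rmse(exact, estimates):
    return math.sqrt(
        sum((c, e) and (c - e) ** 2 for c, e in zip(exact, estimates)) / len(exact)
    )
```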
The results demonstrate that the proposed algorithm performs as well as the baseline algorithm in accuracy. The proposed algorithm achieves a smaller RMSE in two-thirds of the experiments. The RMSE of the proposed algorithm is about 12% smaller than that of the baseline algorithm in the best case and about 6% larger in the worst. In summary, the proposed algorithm achieves a close (or even smaller) RMSE compared with the baseline algorithm in all experiments, which means it performs as well as the baseline algorithm.
4.3 Performance
We measure the running time and the number of random walk reroutings of the proposed algorithm to evaluate its performance. We also define the speedup as the ratio of the running time of the baseline algorithm to that of the proposed algorithm. Table 3 provides a detailed comparison of the running time, the number of random walk reroutings and the speedup of the proposed algorithm on each graph.
It is obvious that the proposed algorithm reduces the running time significantly when dealing with large and evolving graphs. Its speedup over the baseline algorithm on graphs with 100,000 evolving edges ranges from 115 to 549. We also find that estimating the average and the global clustering coefficient via the proposed algorithm takes similar running time.
5 Conclusion
In this paper, we propose an incremental algorithm for tracking approximate clustering coefficients on dynamic graphs via a random walk method. As an arbitrary edge is added or removed, our algorithm replaces the part of the random walk path around the evolving region and updates the estimates based on the previous result. The accuracy and performance improvement of the proposed algorithm are verified through theoretical analysis and experiments on real-world graphs, which demonstrate that it improves performance effectively compared with the state-of-the-art random-walk-based algorithm.
References
Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442 (1998)
Shen, G., Gao, B., Liu, T.Y., Feng, G., Song, S., Li, H.: Detecting link spam using temporal information. In: 6th IEEE International Conference on Data Mining, pp. 1049–1053. IEEE Press, New York (2006)
Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., Gonçalves, M.: Detecting spammers and content promoters in online video social networks. In: 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 620–627. ACM, New York (2009)
Akoglu, L., Dalvi, B.: Structure, tie persistence and event detection in large phone and SMS networks. In: 8th Workshop on Mining and Learning with Graphs, pp. 10–17. ACM, New York (2010)
Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient algorithms for large-scale local triangle counting. ACM Trans. Knowl. Discov. Data (TKDD) 4(3), 13 (2010)
Park, H.M., Chung, C.W.: An efficient MapReduce algorithm for counting triangles in a very large graph. In: 22nd ACM International Conference on Information & Knowledge Management, pp. 539–548. ACM, New York (2013)
Tsourakakis, C.E., Kang, U., Miller, G.L., Faloutsos, C.: DOULION: counting triangles in massive graphs with a coin. In: 15th ACM SIGKDD International Conference on Knowledge Discovery and Data mining, pp. 837–846. ACM, New York (2009)
Seshadhri, C., Pinar, A., Kolda, T.G.: Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat. Anal. Data Min. ASA Data Sci. J. 7(4), 294–307 (2014)
Hardiman, S.J., Katzir, L.: Estimating clustering coefficients and size of social networks via random walk. In: 22nd International Conference on World Wide Web, pp. 539–550. ACM, New York (2013)
Costa, L.D.F., Rodrigues, F.A., Travieso, G., Villas Boas, P.R.: Characterization of complex networks: a survey of measurements. Adv. Phys. 56(1), 167–242 (2007)
Katzir, L., Hardiman, S.J.: Estimating clustering coefficients and size of social networks via random walk. ACM Trans. Web (TWEB) 9(4), 19 (2015)
Stanford large network dataset collection. http://snap.stanford.edu/data/index.html
Schank, T.: Algorithmic aspects of triangle-based network analysis. Ph.D. thesis, Universität Karlsruhe (TH) (2007)
© 2017 Springer International Publishing AG
Liao, Q., Sun, L., Yuan, Y., Yang, Y. (2017). Tracking Clustering Coefficient on Dynamic Graph via Incremental Random Walk. In: Bouguettaya, A., et al. Web Information Systems Engineering – WISE 2017. WISE 2017. Lecture Notes in Computer Science(), vol 10569. Springer, Cham. https://doi.org/10.1007/978-3-319-68783-4_33