
1 Introduction

Extracting dense subgraphs has been actively studied in recent years and is a fundamental task in graph analysis. There are many definitions of a dense subgraph, e.g., average degree [11, 21, 30], k-truss [22, 40], clique [3, 4, 15], quasi-clique [1, 27], and k-core [12, 24, 32]. All of these definitions capture the same intuition: the vertices in the group are highly connected. Different definitions call for different extraction methods, such as linear programming [23, 26] and core decomposition [8, 42]. Dense subgraph extraction has many practical applications, e.g., biological module discovery [20], story identification [2], and community detection [6, 10].

Most previous work focuses on dense subgraphs in single-layer graphs that exist at a single point in time, and thus cannot be applied to multilayer graphs. There has also been much work on finding dense subgraphs in multilayer networks. Some studies treat the layers as different types of information [13, 46], while others treat different layers as different times [25, 29]; we call the latter temporal networks. Each edge in a temporal network is associated with a timestamp. Densely connected vertices in a temporal network may correspond to a community. For example, in a collaboration network, a dense subgraph may represent a research team, or a set of researchers in the same domain who continuously publish papers together.

In this paper, we study the problem of finding long-lasting groups in a temporal network. We consider not only the cohesiveness of a group but also how long it lasts. We adopt the definition of k-core, a group in which each member has at least k other members as neighbors within the group. We aim to find the largest group in the temporal network that satisfies a given lasting-time constraint and connectivity constraint.

To achieve this goal, we present a model, called the (L, K)-lasting core, based on the well-known concept of k-core. Our model captures a K-core that lasts for L consecutive timestamps. In terms of application scenarios, this model can be used to find a team of researchers who publish papers in the same field every year, or a social group whose members interact with each other frequently. We formally define the problem, propose effective algorithms to solve it, and conduct extensive experiments to evaluate our model and study the impact of algorithm variations. The contributions of this paper are summarized as follows:

  • We propose the notion of the (L, K)-lasting core to integrate social cohesiveness with the temporal dimension and identify important groups in temporal social networks.

  • We develop effective algorithms and techniques to extract (L, K)-lasting cores.

  • We conduct extensive experiments to evaluate the performance of our algorithms.

The rest of this paper is organized as follows. After reviewing related work on dense subgraphs and temporal networks in Sect. 2, we provide the notation and formulate our problem in Sect. 3. The details of the proposed algorithms are described in Sect. 4. We present the experimental results in Sects. 5 and 6, and conclude this paper in Sect. 7.

2 Related Work

2.1 Dense Subgraph

Our work is related to the problem of extracting dense subgraphs, which has been actively studied for years. There are many measures of dense subgraphs, including average degree [11], k-core [12], and clique [4]. For example, Epasto et al. [11] proposed a method for maintaining a densest subgraph and updating it quickly upon edge insertions and deletions. Sariyuce et al. [32] also studied dynamic graphs, but what they maintain is the k-core decomposition. Bomze et al. [4] proposed several methods for the clique problem. In this paper, we propose the notion of the (L, K)-lasting core, which extends the idea of k-core.

2.2 Multilayer Network

We study the problem on temporal networks, a special type of multilayer graph whose layers represent consecutive times. There is much related work on multilayer graphs [13, 25, 29, 46, 47]. Zhang et al. [46] studied two-layer graphs, a special case of multilayer graphs, in which one layer captures friendship between entities and the other captures similarity between entities. Rozenshtein et al. [29] searched for several groups of vertices over different time intervals so as to maximize the sum of the group densities. Galimberti et al. [13] proposed a method to find a subgroup that maximizes the minimum density over the selected layers. Zhu et al. [47] aimed to find several multilayer cores that cover the largest number of vertices. Li et al. [25] deal with temporal networks: they aim to find dense subgraphs under three constraints, \(\theta \), \(\tau \), and k, called the (\(\theta \), \(\tau \))-persistent k-core, where \(\tau \) is larger than \(\theta \) and the union graph of every \(\theta \)-length time interval within a \(\tau \)-length interval must contain a k-core. We give an example of this work in Fig. 1, with \(\theta \) = 2, \(\tau \) = 3, and k = 3. For \(\tau \) = 3, we first choose \(G_1\), \(G_2\), and \(G_3\), and then, for \(\theta \) = 2, intersect every two consecutive snapshots among them, i.e., \(G_1\), \(G_2\) and \(G_2\), \(G_3\). We then find the k-core of each intersection graph. The k-core of \(G_1\), \(G_2\) is a, b, c, d and that of \(G_2\), \(G_3\) is a, b, c, d, e, f, so the (2, 3)-persistent 3-core of this interval is a, b, c, d. The work mentioned above cannot be applied to our problem, in which we want to find a group that lasts over a consecutive period of time. We formulate the problem in the next section.

Fig. 1. An example of the (\(\theta \), \(\tau \))-persistent k-core.

2.3 Community Search

Community search in social networks is an active research field [5, 9, 18, 19, 31, 41, 43]. These works study various community search problems, including enumerating k-vertex connected components [41], extracting dense subgraphs such as small-diameter k-plexes [9], identifying the maximum clique in sparse social networks [5], and proposing the UCF-Index to extract the \((k, \eta )\)-core of uncertain graphs in linear time [43]. In addition to finding dense communities in social networks, recent works also discuss finding sparse anti-communities [16, 34, 36, 37], which have a wide spectrum of application scenarios.

2.4 Dense Subgraphs in Heterogeneous Social Networks

Extracting dense subgraphs in heterogeneous social networks has attracted research attention [7, 14, 17, 25, 33, 35, 39, 45, 47]. These works propose new ideas and enable many new applications, such as enumerating spatial cliques in two-dimensional space for dense subgraph extraction [45]. Moreover, a set of socio-spatial group queries aims at identifying socially dense groups and the corresponding meeting points [14, 38, 44]. In addition to social and spatial relations, SDSQ is proposed for live multi-streaming scenarios in social networks; it considers the social tightness and preferences of the users as well as the diversity of multi-streaming channels [33].

3 Problem Definition

We are given a temporal network \(G = (V, E)\), where V is a set of vertices and \(E = \{(u, v, t)\}\) with \(u, v \in V\) and timestamp \(t \in \mathbb {N}\), indicating that edge (u, v) exists at time t. Given \(t \in \mathbb {N}\), let \(E_{t} = \{(u, v, t)\}\) be the set of edges with timestamp t. We call the graph associated with each time a snapshot, i.e., \(G_t = (V, E_t)\), \(\forall t \in \mathbb {N}\). Given a subset \(C \subseteq V\), the set of edges induced by C at timestamp t is denoted \(E_t(C) = \{(u, v, t) \mid u, v \in C\}\). The degree of vertex \(u \in C\) at time t is then \(d_t(u, C) = | \{ v \in C \mid (u, v, t) \in E_t(C)\}|\).
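To make the notation concrete, the following C++ sketch stores each snapshot as an adjacency set keyed by timestamp and computes \(d_t(u, C)\). The type and function names are our own illustrative choices, not part of the paper.

```cpp
#include <unordered_map>
#include <unordered_set>

using Vertex = int;
using Timestamp = int;

struct TemporalGraph {
    // adj[t][u] = neighbors of u in snapshot G_t
    std::unordered_map<Timestamp,
        std::unordered_map<Vertex, std::unordered_set<Vertex>>> adj;

    void addEdge(Vertex u, Vertex v, Timestamp t) {
        adj[t][u].insert(v);
        adj[t][v].insert(u);
    }

    // d_t(u, C): number of neighbors of u at time t that also belong to C.
    int degree(Vertex u, const std::unordered_set<Vertex>& C, Timestamp t) const {
        auto snap = adj.find(t);
        if (snap == adj.end()) return 0;
        auto it = snap->second.find(u);
        if (it == snap->second.end()) return 0;
        int d = 0;
        for (Vertex v : it->second)
            if (C.count(v)) ++d;
        return d;
    }
};
```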

Fig. 2. An example of the (L, K)-lasting core.

Definition 1

(L-lasting time). An L-lasting time is a time sequence consisting of L consecutive snapshots, e.g., \([0, 1, \ldots , L-1]\).

Definition 2

((L, K)-lasting core). An (L, K)-lasting core of a temporal network \(G = (V, E)\) is a non-empty set of vertices \(C_{(L, K)}\subseteq V\) such that \(d_t(u, C_{(L, K)}) \ge K\) for every \(u \in C_{(L, K)}\) and every \(t \in [t_0, t_1, \ldots ,t_{L-1}]\), where \(K \in \mathbb {N}^+\).

In other words, Definition 2 says that an (L, K)-lasting core is a K-core that persists over an L-lasting time. A maximum (L, K)-lasting core is an (L, K)-lasting core with the most vertices. We now formulate the first problem of this work, which is to find a maximum (L, K)-lasting core.

Problem 1

(Maximum (L, K)-lasting core). Given a temporal network \(G = (V, E)\) and two parameters L and K, find the maximum (L, K)-lasting core of G.

We give an example of an (L, K)-lasting core. In Fig. 2, we have a temporal network with 4 snapshots. Given L = 2 and K = 3, we can observe that the vertices a, b, c, d circled by the red line form the maximum (2, 3)-lasting core of this temporal network. In the next section, we propose a basic algorithm for Problem 1 and show how to speed it up with advanced techniques.

4 Algorithm Design

In this section, we introduce our proposed approaches and techniques for extracting the (L, K)-lasting core in detail. We first give a naive idea and a basic algorithm, which also serves as the baseline in the experiments. We then provide several simple but powerful techniques to speed up the online search. Finally, we explore the order in which snapshots are intersected to minimize the intersection cost.

4.1 Naive Algorithm

The naive algorithm for Problem 1 is detailed in Algorithm 1. We use the concept of a sliding window and let L be the window length, i.e., the window contains L snapshots. The basic algorithm consists of two steps. The first step is to slide the window by changing the starting time t. Let C be \(G_t\), the snapshot at time t; we then intersect C with the remaining \(L-1\) snapshots in the window. The second step is to find \(C_k\), the K-core of the intersection result C. If \(C_k\) is larger than the current maximum (L, K)-lasting core, \(C_k\) becomes the new maximum (L, K)-lasting core.

Algorithm 1. Naive algorithm.

Here, we omit the details of the K-core computation: we recursively remove each vertex with degree smaller than K and update its neighbors' degrees until no vertex in the subgroup C has degree smaller than K, i.e., \(d_t(u, C) \ge K\) for all remaining vertices u. Finally, in line 9, we output the maximum (L, K)-lasting core.
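For concreteness, a minimal C++ sketch of this naive sliding-window procedure follows, assuming each snapshot is given as an undirected adjacency map. The helper names (intersect, kCore, maxLastingCoreNaive) are illustrative and not taken from the paper's Algorithm 1.

```cpp
#include <map>
#include <set>
#include <vector>

using Graph = std::map<int, std::set<int>>;  // vertex -> neighbors (undirected)

// Keep only the edges that appear in both a and b.
Graph intersect(const Graph& a, const Graph& b) {
    Graph r;
    for (const auto& [u, nbrs] : a) {
        auto it = b.find(u);
        if (it == b.end()) continue;
        for (int v : nbrs)
            if (it->second.count(v)) r[u].insert(v);
    }
    return r;
}

// Standard K-core peeling: repeatedly remove vertices with degree < K.
std::set<int> kCore(Graph g, int K) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (auto it = g.begin(); it != g.end(); ) {
            if (static_cast<int>(it->second.size()) < K) {
                int u = it->first;
                for (int v : it->second) {
                    if (v == u) continue;          // ignore self-loops
                    auto nv = g.find(v);
                    if (nv != g.end()) nv->second.erase(u);
                }
                it = g.erase(it);
                changed = true;
            } else {
                ++it;
            }
        }
    }
    std::set<int> core;
    for (const auto& entry : g) core.insert(entry.first);
    return core;
}

// Slide a window of L snapshots, intersect them, peel a K-core, keep the largest.
std::set<int> maxLastingCoreNaive(const std::vector<Graph>& snaps, int L, int K) {
    std::set<int> best;
    for (size_t t = 0; t + L <= snaps.size(); ++t) {
        Graph C = snaps[t];
        for (int i = 1; i < L; ++i) C = intersect(C, snaps[t + i]);
        std::set<int> core = kCore(C, K);
        if (core.size() > best.size()) best = core;
    }
    return best;
}
```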

4.2 Temporal Core Finding-Basic (TCFB)

The naive approach in the previous subsection performs many redundant intersections. To tackle this issue, our idea is to reuse parts of the intersection results. By reusing them, we reduce the number of intersections and improve efficiency. For example, suppose the time sequence currently being processed is [0, 1, 2, 3, 4] and the next one is [1, 2, 3, 4, 5]. The sub-sequence [1, 2, 3, 4] is shared between them. If we first compute the intersection of [1, 2, 3, 4], i.e., of \(G_1, G_2, G_3\), and \(G_4\), and call the result C, then the current time sequence only requires intersecting C with \(G_0\), and the next time sequence only requires intersecting C with \(G_5\). We thus save one computation of the intersection of \(G_1, G_2, G_3\), and \(G_4\). The detailed description is outlined in Algorithm 2.
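To illustrate the reuse, the sketch below intersects the shared block of two consecutive windows once and combines it with the two non-shared snapshots. It reuses the Graph, intersect, and kCore helpers from the previous sketch and is an illustration of the idea, not the paper's Algorithm 2.

```cpp
#include <set>
#include <vector>
// Uses Graph, intersect(), and kCore() from the earlier sketch; assumes L >= 2.

// Process two consecutive windows [t, t+L-1] and [t+1, t+L]: their shared block
// G_{t+1}, ..., G_{t+L-1} is intersected only once and reused for both windows.
void processConsecutiveWindows(const std::vector<Graph>& snaps, size_t t, int L, int K,
                               std::set<int>& best) {
    Graph shared = snaps[t + 1];
    for (int i = 2; i < L; ++i) shared = intersect(shared, snaps[t + i]);

    // Window starting at t: shared block intersected with G_t.
    std::set<int> core = kCore(intersect(shared, snaps[t]), K);
    if (core.size() > best.size()) best = core;

    // Window starting at t+1: shared block intersected with G_{t+L}, if it exists.
    if (t + L < snaps.size()) {
        core = kCore(intersect(shared, snaps[t + L]), K);
        if (core.size() > best.size()) best = core;
    }
}
```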

4.3 Min-Degree Pruning (MDP)

In this subsection, we focus on removing vertices early. Given a time sequence, each vertex has a degree in each snapshot. After intersecting the snapshots and computing the k-core, we obtain the result subgraph C; the degree of each vertex in C is at most that vertex's smallest degree over all snapshots of the time sequence. In other words, given a time sequence T, \(d(u,C) \le \min _{t \in T} d(u,G_t)\). Based on this, we can remove every vertex u with \(\min _{t \in T} d(u,G_t) < K\) before the process. Equipped with this technique, we may traverse fewer vertices for each time sequence and thus reduce the cost. Here we simply eliminate the vertices that cannot satisfy the K constraint before the process and then execute Algorithm 2.
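A small sketch of this pruning step follows, again reusing the Graph type from the earlier sketches; the function names are ours. Before intersecting a window, a vertex is kept only if its minimum degree across the window's snapshots is at least K.

```cpp
#include <algorithm>
#include <set>
#include <vector>
// Uses the Graph type from the earlier sketches.

// Vertices whose minimum degree over snapshots [t, t+L-1] is at least K.
std::set<int> verticesPassingMDP(const std::vector<Graph>& snaps, size_t t, int L, int K) {
    std::set<int> keep;
    for (const auto& [u, nbrs] : snaps[t]) {
        int minDeg = static_cast<int>(nbrs.size());
        for (int i = 1; i < L && minDeg >= K; ++i) {
            auto it = snaps[t + i].find(u);
            minDeg = std::min(minDeg, it == snaps[t + i].end()
                                          ? 0
                                          : static_cast<int>(it->second.size()));
        }
        if (minDeg >= K) keep.insert(u);
    }
    return keep;
}

// Restrict a snapshot to the kept vertices before running the intersections.
Graph restrictTo(const Graph& g, const std::set<int>& keep) {
    Graph r;
    for (const auto& [u, nbrs] : g) {
        if (!keep.count(u)) continue;
        for (int v : nbrs)
            if (keep.count(v)) r[u].insert(v);
    }
    return r;
}
```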

Algorithm 2. Temporal Core Finding-Basic (TCFB).

4.4 Reordering for Intersection Minimization (RIM)

In Algorithm 2, we intersect the snapshots of a time sequence chronologically. We observe that permuting the original time sequence leaves the resulting intersection graph unchanged, but the number of intersection operations performed differs. This leads to a subproblem: if we know the order requiring the fewest intersection operations, we obtain the minimum cost. The subproblem is formulated as follows.

Problem 2

(Minimum times of intersection). Given a time sequence of L snapshots of G, find a permutation of the given time sequence, denoted \(G_0, G_1, \ldots , G_{L-1}\), that minimizes the sum of the numbers of edges of the intermediate intersection graphs \([I_0, I_1, \ldots ,I_{L-2}]\), where \(I_0=G_0\), \(I_1=\) intersection\((I_0, G_1)\), \(I_2=\) intersection\((I_1, G_2)\), ..., \(I_{L-2}=\) intersection\((I_{L-3}, G_{L-2})\); the final intersection of \(I_{L-2}\) with \(G_{L-1}\) yields the intersection graph of the window.

Clearly, if we can solve Problem 2, then we can solve Problem 1 more efficiently. A straightforward way to find the optimal solution of Problem 2 is to enumerate all permutations of the given time sequence and pick the best one. However, searching among L! permutations becomes time-consuming as L grows. We therefore propose a greedy method that chooses the next snapshot based on the previously chosen one. The criterion for choosing the next snapshot is the edge difference between two snapshots, i.e., the number of edges of the previous snapshot that do not exist in the next one. The detailed description is outlined in Algorithm 3. We compute the edge differences offline: in line 4, edgediff(\(G_i\), \(G_j\)) denotes the number of edges of \(G_i\) that do not exist in \(G_j\); note that edgediff(\(G_i\), \(G_j\)) may differ from edgediff(\(G_j\), \(G_i\)). After this computation, we obtain a map Diff that stores the edge difference of each snapshot pair, which we then apply within Algorithm 2. In Fig. 3, panel (a) shows an example set of snapshots, for which we can compute the edge differences Diff[(1,2)] = 3, Diff[(1,3)] = 2, Diff[(1,4)] = 1, Diff[(2,3)] = 4, and Diff[(2,4)] = 1. Using the greedy order, we first choose snapshot 1, next snapshot 2 because Diff[(1,2)] = 3 is the largest, and then snapshot 3. Figure 3(b) shows the resulting cost: snapshot 1 has 4 edges, the intersection of 1 and 2 has 1 edge, the intersection of 1, 2, 3 has 1 edge, and the intersection of 1, 2, 3, 4 has 0 edges, so the cost is 4 + 1 + 1. Finally, we obtain an advanced algorithm called Temporal Core Finding (TCF), which applies MDP and RIM on top of TCFB.
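The following sketch renders this greedy ordering; it is an assumed reading of Algorithm 3 that reuses the Graph type from the earlier sketches and, for brevity, computes the edge differences on the fly rather than offline as the paper does. edgeDiff counts the edges of one snapshot missing from another, and greedyOrder repeatedly picks the unused snapshot with the largest edge difference from the previously chosen one.

```cpp
#include <vector>
// Uses the Graph type from the earlier sketches.

// Number of edges of a that do not exist in b (each undirected edge counted once).
int edgeDiff(const Graph& a, const Graph& b) {
    int d = 0;
    for (const auto& [u, nbrs] : a) {
        auto it = b.find(u);
        for (int v : nbrs)
            if (u < v && (it == b.end() || !it->second.count(v))) ++d;
    }
    return d;
}

// Greedy intersection order for the window [t, t+L-1], starting from snapshot t:
// always pick next the snapshot with the largest edge difference from the previous one.
std::vector<size_t> greedyOrder(const std::vector<Graph>& snaps, size_t t, int L) {
    std::vector<size_t> order = {t};
    std::vector<bool> used(static_cast<size_t>(L), false);
    used[0] = true;
    for (int step = 1; step < L; ++step) {
        size_t prev = order.back();
        int bestDiff = -1;
        int bestIdx = -1;
        for (int i = 1; i < L; ++i) {
            if (used[i]) continue;
            int d = edgeDiff(snaps[prev], snaps[t + i]);
            if (d > bestDiff) { bestDiff = d; bestIdx = i; }
        }
        used[bestIdx] = true;
        order.push_back(t + bestIdx);
    }
    return order;
}
```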

Algorithm 3. Greedy snapshot reordering based on edge differences.
Algorithm 4. Temporal Core Finding (TCF).
Fig. 3. Example of greedy order and optimal order.

4.5 Time Complexity and Optimality

We now analyze the time complexity of the naive algorithm described in Sect. 4.1. We use a sliding window to traverse the given time span. Assume the length of the given time span is N; then we need to process \(N-L+1\) sliding windows. For each sliding window, we have L snapshots and intersect the graphs by iterating over the edges of the snapshot at the starting time, so the time complexity of this part is O(EL). The next part finds the k-core of the intersection graph by recursively removing edges and vertices, which takes O(V+E). We can then derive the time complexity of the naive algorithm, which is O(\(NLE^2\)).

Next, we analyze the time complexity of TCF. The number of sliding windows is still \(N-L+1\). Reusing the intersections of overlapping snapshots roughly halves the number of intersections, which does not affect the asymptotic complexity. MDP removes vertices that cannot be in the solution, but in the worst case it may remove none. RIM changes the order in which snapshots are intersected, which also has no effect on the asymptotic complexity. Therefore, the complexity of the intersection part is still O(EL), and the k-core part is unchanged, so the time complexity of TCF is still O(\(NLE^2\)). Although the asymptotic complexity of TCF does not change, its better performance can be seen later in Sect. 5.

Here we discuss the optimality of the basic algorithm and TCF. The basic algorithm slides a window over every interval, performs the intersections, and finds the k-core, so it clearly finds the optimal solution. In TCF, reusing the intersections of overlapping snapshots only reduces the intersection cost while computing the same result; MDP eliminates only vertices that cannot appear in the solution, so it does not affect the result; and RIM changes the order of the snapshots, but the resulting intersection graph is the same as with the original order. Thus both the naive algorithm and TCF find the optimal solution.

5 Experiments

In this section, we present experimental results and performance comparisons of our methods on different datasets.

Datasets. We use 3 real-world datasets recording interactions with temporal information. Each dataset is associated with a window size that defines how much continuous time each snapshot covers. If multiple interactions between two vertices appear in the same window, they are counted as one interaction. The characteristics of the datasets are reported in Table 1.

Last.fm records the co-listening history of its streaming platform. DBLP is the co-authorship network of the authors of scientific papers in the DBLP computer science bibliography. Youtube [28] is a video-sharing web site that includes a social network; we use two window sizes for this dataset. The synthetic datasets are all generated by the Barabási–Albert preferential attachment model. Synthetic2000 has 2000 vertices and 100 snapshots, with the average degree of each snapshot ranging from 50 to 700. Synthetic5000 has 5000 vertices with average degree ranging from 80 to 900. Synthetic10W has 100000 vertices with average degree ranging from 60 to 300.
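For reference, a minimal sketch of a Barabási–Albert generator of this kind is shown below; the seeding of the initial vertices and the parameter names are our assumptions, since the paper does not describe its generator in detail. Each new vertex attaches m edges preferentially to high-degree vertices, giving an average degree of roughly 2m per generated snapshot.

```cpp
#include <random>
#include <set>
#include <utility>
#include <vector>

// Barabási–Albert preferential attachment: n vertices, m edges per new vertex.
// Assumes 1 <= m < n.
std::vector<std::pair<int, int>> barabasiAlbert(int n, int m, std::mt19937& rng) {
    std::vector<std::pair<int, int>> edges;
    std::vector<int> targets;                    // vertex repeated once per incident edge
    for (int v = 0; v < m; ++v) targets.push_back(v);  // seed vertices
    for (int v = m; v < n; ++v) {
        std::set<int> chosen;
        while (static_cast<int>(chosen.size()) < m) {
            std::uniform_int_distribution<size_t> pick(0, targets.size() - 1);
            chosen.insert(targets[pick(rng)]);   // probability proportional to degree
        }
        for (int u : chosen) {
            edges.emplace_back(u, v);
            targets.push_back(u);
            targets.push_back(v);
        }
    }
    return edges;
}
```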

Methods. We compare our approaches (TCFB, TCF, and the two techniques MDP and RIM) with a brute-force search for the intersection order with the fewest intersection operations (BF), the greedy density method (Jethava) in [23], and the maximal-span-cores algorithm (Galimberti) in [12]. TCFB is described in Sect. 4.2, and TCF is our best method, applying both MDP and RIM.

Implementation. All methods are implemented in C++. The experiments are run on a machine equipped with an Intel CPU at 3.7 GHz and 64 GB of RAM.

Table 1. Datasets.

5.1 Comparison of BF and TCF

First, we find the optimal solution of Problem 2, i.e., the permutation with the minimum number of intersection operations, by testing all permutations of snapshots for each given time sequence; the running time is proportional to L factorial. We then conduct experiments with small L on the DBLP dataset to compare TCF and BF. In Fig. 4(a), L is fixed; as K becomes larger, BF's running time decreases because the k-core computation removes most of the vertices that do not satisfy the K constraint, while TCF's running time changes little because most vertices are removed before processing and the intersection time dominates. In Fig. 4(b), BF's running time grows with L, as expected. Figures 4(c) and (d) show that TCF performs the smallest number of intersection operations under all conditions.

Fig. 4. Comparison of baseline and exhaustive search - DBLP.

6 Methods with Different Techniques

We compare our methods with different techniques applied in this section. In Fig. 5(a), the running time of the intersection part decreases as K grows, but the k-core part dominates the running time, so the total running time appears to stay roughly constant. In Fig. 5(b), the total running time of all four methods decreases as L grows, because more vertices are removed before processing and more edges fail to satisfy the L constraint; therefore, the running times of both the intersection part and the k-core part decrease. Moreover, Fig. 5(c) indicates that TCFB+RIM performs fewer intersection operations than TCFB because reordering the snapshots reduces the cost; both curves stay flat because L is fixed. The intersection counts of both TCFB+MDP and TCF decrease as K grows because more vertices violating the K constraint are removed.

Fig. 5. Methods with different techniques - Youtube #1.

Fig. 6. Methods with different techniques - synthetic datasets #1.

Fig. 7. Methods with different techniques - synthetic datasets #2.

6.1 Synthetic Datasets

We then conduct experiments on the small and large synthetic datasets. For the small synthetic datasets, in Figs. 6 and 7, we observe that the running time and intersection counts of TCFB+MDP and TCF decrease once K exceeds 60 in Figs. 6(a) and (c) and once K exceeds 80 in Figs. 7(a) and (c). This is because the smallest vertex degree of Synthetic2000 is 60 and that of Synthetic5000 is 80, and MDP can only remove vertices when K is larger than the dataset's smallest degree. The other methods do not apply MDP, so their running time and intersection counts remain almost the same. In Figs. 6(b) and 7(b), the curves of TCFB and TCFB+MDP nearly coincide, as do those of TCFB+RIM and TCF. Methods with RIM obtain a large improvement, whereas MDP has no effect in this setting because the fixed K = 30 is too small to remove any vertices. Figures 6(d) and 7(d) show the same pattern: the curves of the methods with RIM nearly coincide, as do those of the methods without RIM. The curves first ascend and then decline because RIM is less effective when L is small, whereas for large L it removes a large part of the cost. For the large synthetic dataset, in Figs. 8(a) and (c), MDP only decreases the intersection count when K is large enough. In Figs. 8(b) and (d), RIM is less effective because this dataset is too sparse for reordering to change the intersection count much, so all curves look almost the same.

Fig. 8. Methods with different techniques - big synthetic datasets.

6.2 Comparison with Other Work

In this subsection, we implement the greedy density method (Jethava) in [23] and the maximal-span-cores algorithm (Galimberti) in [12] and compare them with our methods on the Last.fm dataset. In Fig. 9(a), Jethava takes more time than Galimberti and TCF. For Galimberti and TCF, Fig. 9(b) shows more clearly that Galimberti takes more time than TCF. We then examine the solutions found by each method. In Fig. 9(c), Jethava's solution always has the largest size, while Galimberti's solution is the same as TCF's, which is the optimal solution. Jethava finds the group with the largest density with respect to L but does not consider K, and its solution is too large to reveal meaningful or interesting patterns. Although Galimberti also finds the optimal solution, TCF is more efficient, with a lower running time. Therefore, our method TCF offers better efficiency while achieving the best effectiveness.

Fig. 9. Comparison with other work.

Finally, we summarize the findings. MDP reduces the graph size as K becomes larger, which in turn reduces the number of intersection operations and the running time. When L becomes larger, RIM substantially strengthens the performance. In other words, our techniques provide a good speedup in running time.

7 Conclusions

In this paper, we introduced a model called the (L, K)-lasting core to detect lasting groups in temporal networks. We proposed efficient algorithms and applied advanced techniques to solve this problem. Experiments on different datasets demonstrate their efficiency and scalability, and show that our algorithms can find interesting groups. In future work, we plan to extend our algorithms to temporal hypergraphs, in which edges are arbitrary sets of nodes.