1 Introduction

Dense subgraphs reveal important structural information in graphs and have given rise to many applications. On the one hand, dense subgraphs serve as cornerstones of some graph clustering algorithms [1] and graph partitioning algorithms [2, 3]. On the other hand, they are used on their own in real-world scenarios, e.g., to extract research communities in co-authorship networks [4] and to detect fraud groups in social or commercial networks [5]. For the latter kind of application, almost all existing dense subgraph finding methods are global-oriented: without prior knowledge, they find dense subgraphs over the whole graph and report them as communities or groups. Sometimes, however, these methods are expected to be local-oriented or target-oriented, i.e., to find dense subgraphs related to particular authors, user accounts or product IDs. We call such uses targeted applications.

Methods for finding dense subgraphs can be classified by their density measure, i.e., the way density is defined. Consider a unipartite graph Gu = (V,E), a bipartite graph Gb = (Vu,Vv,E), and a subset of nodes \(S\subseteq V\) or S = Su ∪ Sv with \(S_{u}\subseteq V_{u}\) and \(S_{v}\subseteq V_{v}\). Let E(S) be the set of edges in the subgraph induced by S. The average degree measure of S is defined as |E(S)|/|S| for Gu and |E(S)|/(|Su| + |Sv|) for Gb. The density measure of S presented in Kannan and Vinay’s work [6] is defined as \(|E(S)|/\sqrt {|S_{u}||S_{v}|}\), for Gb only. The edge density measure of S is defined as \(|E(S)|/\binom {|S|}{2}\) for Gu and |E(S)|/(|Su||Sv|) for Gb. Among global-oriented methods, Charikar’s greedy algorithm [7] produces an S for either Gu or Gb whose average degree is at least 1/2 of that of the optimal subset of nodes. Kannan and Vinay proposed a spectral algorithm for Gb that uses their customized density measure to identify an S whose density is within a factor of \(\mathcal {O}(\log n)\) of the optimum [6]. Many works on detecting α-quasi-cliques, a kind of dense subgraph, are based on the edge density measure and its variants [4, 8,9,10].
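The three density measures above can be compared side by side on a small graph. The following is a minimal sketch, assuming edges are given as pairs of node labels and, for bipartite graphs, that the two sides use distinct labels; the function names are illustrative and do not come from the cited works.

```python
from math import comb, sqrt

def induced_edges(edges, nodes):
    """Count |E(S)|: edges with both endpoints inside `nodes`."""
    nodes = set(nodes)
    return sum(1 for u, v in edges if u in nodes and v in nodes)

def average_degree(edges, nodes):
    """|E(S)| / |S|; for a bipartite graph pass S = Su | Sv."""
    return induced_edges(edges, nodes) / len(nodes)

def kannan_vinay_density(edges, su, sv):
    """|E(S)| / sqrt(|Su| |Sv|), defined for bipartite graphs only."""
    return induced_edges(edges, set(su) | set(sv)) / sqrt(len(su) * len(sv))

def edge_density_unipartite(edges, nodes):
    """|E(S)| / C(|S|, 2); needs at least two nodes."""
    return induced_edges(edges, nodes) / comb(len(nodes), 2)

def edge_density_bipartite(edges, su, sv):
    """|E(S)| / (|Su| |Sv|)."""
    return induced_edges(edges, set(su) | set(sv)) / (len(su) * len(sv))

# A triangle with one pendant edge (unipartite) and a tiny bipartite graph.
uni_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
bi_edges = [("u1", "v1"), ("u1", "v2"), ("u2", "v1")]
```

On the triangle {0, 1, 2} both the average degree measure and the edge density equal 1, whereas on the bipartite example the Kannan-Vinay density is 1.5 and the other two measures are 0.75, which illustrates that the measures rank subsets differently.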

Take fraud detection in social networks as an example. These social networks are modeled as bipartite user-user networks. Many global-oriented methods are used to detect extremely dense subgraphs in these networks, because one common fraudulent practice is that fraudsters hire workers to add edges towards their customers. Both the workers and the customers are users in these networks. However, manually reported violations still exist, so it is necessary to develop local dense subgraph finding methods that help people discover fraud groups related to individually reported user accounts. A few such methods exist [1, 10], based on Kannan and Vinay’s density measure and on edge density, but none is based on average degree. Nevertheless, average degree is a very useful density measure for fraud detection, because fraudsters profit by increasing the average number of edges between their hired workers and their customers; the same holds for other similar real-world applications.

In this paper, we formulate a new MDS-N problem, which aims at finding a maximum density subgraph (MDS), measured by the average degree, near or containing a given node (N) in an undirected (unipartite or bipartite) graph. We propose the Slither algorithm and the Slither PageRank (PR) algorithm, both consisting of two steps. The first step, shared by both algorithms, is a graph transformation, which transforms the undirected graph into a connected and weighted graph and reduces the MDS-N problem to the minimum conductance problem. The second step is based on random walks, specifically the lazy random walk for Slither and the personalized PageRank for Slither PR. The theorem behind this step is the Lovász-Simonovits Theorem. We give lower bounds on the densities of the subgraphs found by the two algorithms. Additionally, we wrap the two algorithms in a simple hierarchical repetition framework to improve them. The time complexity of these algorithms is analyzed. All these algorithms are easy to implement.

Our algorithms are mainly compared with other existing local dense subgraph finding algorithms on both unipartite and bipartite graphs, in terms of the densities (i.e., the average degrees) of the found subgraphs, the stability with respect to the choice of the given nodes, and the scalability. Experiments show that our algorithms are the fastest among those that can find subgraphs with high densities. Slither tends to explore while Slither PR tends to exploit, which can be inferred from the time consumed and from other experimental results. Experiments on the hierarchy show that a small hierarchy (no more than two levels) enables Slither and Slither PR to achieve satisfactory results. Finally, we apply the MDS-N problem to Twitter, a large social network. Our proposed algorithms successfully detect four subgraphs with high fraud rates from four given fraudulent user accounts.

2 Related work

2.1 Dense subgraph problems

The densest subgraph problem is generally defined as finding a subset of nodes in an undirected graph whose average degree is maximized. This problem is solvable in polynomial time and can be solved optimally by max-flow methods [11, 12]. Charikar [7] proposed a linear-time greedy algorithm that guarantees a 1/2-approximation for this problem. Another frequently used density measure is edge density, which is calculated by dividing the number of edges in a subgraph by the maximum possible number of edges in it. Directly maximizing this density measure is not meaningful, because a single edge, with two nodes at its ends, already attains the maximum density. Therefore, the α-quasi-clique is introduced, which requires the number of edges to be no less than the maximum possible number of edges times a threshold parameter α ∈ (0,1). There are ways to find a single α-quasi-clique or all α-quasi-cliques at once [8, 9]. Other dense subgraph finding problems include the maximum clique problem [13, 14], the maximal clique problem [15, 16], K-core [17], K-plex [18], Kd-clique [19], K-club [20], etc.
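The greedy idea attributed to Charikar [7] above (repeatedly remove a minimum-degree node and remember the densest intermediate subgraph) can be sketched as follows. This is an illustrative version that uses a binary heap and therefore runs in O(|E| log |V|) rather than linear time; a bucket queue would be needed for the linear bound. The function name and the adjacency-dict input format are assumptions of this sketch.

```python
import heapq

def greedy_peel(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops).
    Node labels must be mutually comparable because they break heap ties.
    Returns (node set, |E(S)|/|S|) of the densest subgraph seen while peeling."""
    deg = {u: len(nb) for u, nb in adj.items()}
    alive = set(adj)
    m = sum(deg.values()) // 2              # edges among alive nodes
    heap = [(d, u) for u, d in deg.items()]
    heapq.heapify(heap)
    best_density = m / len(alive) if alive else 0.0
    removed, best_removed = [], 0
    while len(alive) > 1:
        d, u = heapq.heappop(heap)
        if u not in alive or d != deg[u]:
            continue                        # stale heap entry
        alive.remove(u)
        removed.append(u)
        m -= deg[u]
        for v in adj[u]:
            if v in alive:
                deg[v] -= 1
                heapq.heappush(heap, (deg[v], v))
        density = m / len(alive)
        if density > best_density:
            best_density, best_removed = density, len(removed)
    return set(adj) - set(removed[:best_removed]), best_density

# K4 on {0,1,2,3} with a pendant path 3-4-5.
example_adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
               3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
```

On the example graph the peeling removes nodes 5 and 4 first and reports the K4 with average degree measure 6/4 = 1.5.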

2.2 Local dense subgraph finding

A local dense subgraph finding algorithm aims to find an approximation of the densest subgraph near or containing a specified node, with a running time that depends largely on the number of edges in the subgraph rather than on that of the entire input graph. Andersen [1] proposed an algorithm that finds local dense subgraphs in a bipartite graph, based on the density measure presented in Kannan and Vinay’s work [6]. This algorithm produces a subgraph whose density is proportional to that of the optimal subgraph, with a running time proportional to the square of the number of nodes of the optimal subgraph times the maximum degree of the input graph. Zhang et al. [10] proposed HiDDen, a set of alternative projected gradient based algorithms that find local dense subgraphs, measured by edge density, in an undirected graph. There are also some random-walk-based heuristic algorithms [21, 22], which rely on the observation that short random walks starting from a subgraph with low conductance are inclined to stay within this subgraph, due to the narrow passage between it and the remaining graph. Besides these heuristic algorithms, some local clustering algorithms [2, 3, 23], including PageRank-like algorithms, are further based on the Lovász-Simonovits Theorem [24] and give local bounds on the conductance of the subgraphs they find. Our work differs from all these algorithms in that it finds local dense subgraphs measured by average degree.

2.3 Fraud detection methods

Existing fraud detection methods are globally oriented and do not support detecting a fraud-related subgraph from a particular node. We briefly summarize several representative kinds of methods as follows. 1) Feature-based methods [25,26,27] usually find fraud activities through the explicit analysis of various features, which may or may not be topology-based. 2) Propagation-based methods [21, 22, 28] leverage a set of labeled fraudulent nodes and/or labeled normal nodes, and propagate beliefs of label information throughout the graph to predict the labels of other nodes. Both our work and this kind of method utilize belief propagation and start from particular nodes. However, our work finds a local solution based on only one particular node, and provides theoretical bounds on its density. 3) Dense subgraph mining methods [5, 29,30,31] are based on the key insight that fraudsters will not hire enough workers to make them behave like normal users, but will create lots of links (i.e., edges) towards their customers to make money. The applicability of our algorithms to fraud detection is based on the same insight.

3 Problem formulation

For a targeted application, it is rational to assume that any given node is in a specific dense subgraph, wanted by the application.

Consider an undirected and unweighted graph G = (V,E) with a set of nodes V and a set of edges E. Set n = |V|. Let E(X) be the set of edges in the subgraph induced by X ⊆ V.

Definition 1

We define the density of X as

$$ den(X) = \frac{|E(X)|}{|X|} $$
(1)

Our goal is to find an X ⊆ V that induces a maximum density subgraph near or containing a given node r (MDS-N) in G. In particular, for targeted applications with the assumption r ∈ X, the MDS-N problem is solved by finding the X* such that

$$ den(X^{*}) = \max_{r\in X\subset V} den(X) $$
(2)
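For very small graphs, (1) and (2) can be evaluated directly by enumeration, which is useful as a reference when checking faster methods. The sketch below is exponential in |V| and is only meant to make the definition concrete; the function names are illustrative.

```python
from itertools import combinations

def den(edges, X):
    """Density (1): |E(X)| / |X| for a non-empty node set X."""
    X = set(X)
    return sum(1 for u, v in edges if u in X and v in X) / len(X)

def brute_force_mds_n(nodes, edges, r):
    """Enumerate every X that contains r and return the densest one, as in (2)."""
    others = [v for v in nodes if v != r]
    best, best_den = {r}, 0.0
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            X = {r, *extra}
            d = den(edges, X)
            if d > best_den:
                best, best_den = X, d
    return best, best_den

# K4 on {0,1,2,3} with a pendant path 3-4-5.
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
```

With r = 0 the optimum is the K4 with den = 1.5, while with r = 5 every candidate must include the sparse tail, and the best choice is the whole graph with den = 8/6.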

4 Approach

This section introduces the algorithmic details of our approach to the MDS-N problem. The core of our approach is in Section 4.1, which transforms any undirected (unipartite/bipartite) and unweighted graph into a weighted and connected one, for finding the local densest subgraph. This transformation completes a reduction from the MDS-N problem to the minimum conductance problem. The equivalence of the two problems is established in the proof of Proposition 1 for two cases. For the case sv ∈ S, the equivalence hidden in (8) and (9), defined later, is difficult to formulate, so we consolidate it with the equation for the other case into a single inequality ((5), defined later), which has a unified simple form and further simplifies the proof of Theorem 2. Therefore, once local graph search algorithms, e.g., random-walk-based algorithms, are used to solve the transformed minimum conductance problem, the MDS-N problem is solved as well. In Section 4.2, two random-walk-based algorithms are proposed: Slither, based on the lazy random walk, and Slither PageRank (PR), based on the personalized PageRank. The two subsections together complete the algorithmic flow of our approach. Additionally, we provide a simple hierarchical framework in Section 4.3 to improve the results of this flow.

4.1 Graph transformation

To transform G into a weighted and connected graph Gc = (Vc,Ec,W), we add a source node sv and a sink node tv to the graph G. Then, for each node i ∈ V, we add two edges that connect i with sv and tv respectively. Let d(i) be the degree of node i, and let dmax be the maximum degree of G. The weights on the edges connecting i with sv and with tv are mc and mc − d(i) respectively, where mc is a constant such that mc > dmax. This transformation is inspired by Goldberg’s max-flow algorithm [11]. An example of this transformation with G being a bipartite graph is depicted in Fig. 1. Formally, Gc is written as

  i. Vc = V ∪ {sv, tv}

  ii. Ec = E ∪ {(sv,i) | i ∈ V} ∪ {(tv,i) | i ∈ V}

  iii. wij = 1, (i,j) ∈ E

  iv. wsv,i = mc, i ∈ V

  v. wtv,i = mc − d(i), i ∈ V

  vi. wij = 0, (i,j) ∉ Ec

Fig. 1 An example of an unweighted bipartite graph G being transformed into a weighted connected graph Gc. The added edges are drawn as dashed lines in Gc
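The construction of Gc in items i to vi can be sketched in a few lines. The sketch assumes G is given as an adjacency dict of neighbour sets and represents W as a dict of dicts; the labels "sv" and "tv" and the default mc = dmax + 1 are choices of this illustration (any mc > dmax satisfies the definition).

```python
def transform(adj, mc=None):
    """Build the weight map W of Gc from an unweighted graph G.

    adj: dict node -> set of neighbours. Returns dict-of-dicts W with
    W[i][j] = w_ij; absent entries stand for weight 0 (item vi)."""
    SV, TV = "sv", "tv"  # assumed labels; must not clash with node names in G
    dmax = max((len(nb) for nb in adj.values()), default=0)
    if mc is None:
        mc = dmax + 1
    if mc <= dmax:
        raise ValueError("mc must be larger than the maximum degree of G")
    W = {u: {v: 1 for v in nb} for u, nb in adj.items()}  # item iii
    W[SV], W[TV] = {}, {}
    for i, nb in adj.items():
        W[i][SV] = W[SV][i] = mc              # item iv
        W[i][TV] = W[TV][i] = mc - len(nb)    # item v
    return W

# Path 0 - 1 - 2, so dmax = 2 and the default mc is 3.
W_example = transform({0: {1}, 1: {0, 2}, 2: {1}})
```

A consequence worth noting is that every original node ends up with the same weighted degree, d(i) + mc + (mc − d(i)) = 2mc, which is what makes the volume computation in the proof of Proposition 1 so simple.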

Let S ⊆ Vc. After transforming G into Gc, we build a bridge between subsets of nodes {X} in G and subsets of nodes {S} in Gc. For Gc, there are four possible categories of S derived from all possible cuts, shown in Fig. 2: 1) sv ∈ S and tv ∈ S; 2) sv ∈ S and tv ∉ S; 3) sv ∉ S and tv ∈ S; 4) sv ∉ S and tv ∉ S. As the constructed node tv is definitely not in X, according to the theory of the local Cheeger inequality [32], we adopt a specified subset of nodes Vc −{tv} in Gc, and it is unnecessary to consider the two categories of S that include tv.

Fig. 2 Four categories of subsets of nodes derived by two kinds of cuts

Definition 2

Let (S, Vc − S) be an arbitrary cut of graph Gc with S ⊂ Vc and S ≠ ∅. We define the conductance of (S, Vc − S) as

$$ {\varPhi}(S) = \frac{\mu(\delta(S))}{\min(\mu(S),\mu(V_{c}-S) )} $$
(3)

where δ(S) = {(i,j) ∈ Ec | i ∈ S, j ∈ Vc − S} is a set of edges called the edge boundary of S; \(\mu (S) = {\sum }_{i\in S}{\mu (i)}\) is the volume of S with \(\mu (i) = {\sum }_{j}{w_{ij}}\); and \(\mu (\delta (S)) = {\sum }_{(i,j)\in \delta (S)}{w_{ij}}\) is the volume of δ(S).

Because the subgraph derived from the targeted S is relatively small compared to the entire graph Gc, it is reasonable to assume μ(S) ≤ μ(Vc − S), and (3) simplifies to

$$ {\varPhi}(S) = \frac{\mu(\delta(S))}{\mu(S)} $$
(4)
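Definition 2 translates directly into code. The following minimal sketch assumes a weighted graph stored as a symmetric dict of dicts (W[i][j] = w_ij) and a cut with both sides of positive volume; it computes the general form (3), which coincides with (4) whenever μ(S) ≤ μ(Vc − S). The function names are illustrative.

```python
def volume(W, S):
    """mu(S): sum of weighted degrees of the nodes in S."""
    return sum(sum(W[i].values()) for i in S)

def boundary_weight(W, S):
    """mu(delta(S)): total weight of edges leaving S."""
    S = set(S)
    return sum(w for i in S for j, w in W[i].items() if j not in S)

def conductance(W, S):
    """Phi(S) as in (3); assumes S and its complement both have positive volume."""
    S = set(S)
    rest = set(W) - S
    return boundary_weight(W, S) / min(volume(W, S), volume(W, rest))

# Two unit-weight triangles joined by the single edge c - d.
two_triangles = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
```

For S = {a, b, c} the volume is 7 and a single edge crosses the cut, so Φ(S) = 1/7, reflecting the narrow passage between the two triangles.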

Proposition 1 formally gives the quantitative relationship between {X} in G and {S ⊆ Vc −{tv}} in Gc as follows.

Proposition 1

Let S be an arbitrary subset of nodes in Gc = (Vc,Ec,W), satisfying tv ∉ S and μ(S) ≤ μ(Vc − S). Set X = S −{sv}, which is a subset of nodes in G = (V,E). Let mc > dmax be given. Then

$$ {\varPhi}(S) \geq \frac{1}{1+2\frac{|X|}{n}} - \frac{1}{m_{c}}den(X) $$
(5)

Proof

For the case sv ∈ S, we can calculate

$$ \begin{array}{@{}rcl@{}} \mu(\delta(S)) &=& \sum\limits_{i\in S,j\in V_{c}-S}w_{ij} \\ &=& \sum\limits_{j\in V-X} w_{sv,j} + \sum\limits_{i\in X,j\in V-X} w_{ij} + \sum\limits_{j\in X} w_{tv,j} \\ &=& m_{c}|V-X| + \sum\limits_{i\in X,j\in V-X} w_{ij} + (m_{c}|X| - \sum\limits_{i\in X}d(i)) \\ &=& m_{c}n + \sum\limits_{i\in X,j\in V-X} w_{ij} - \sum\limits_{i\in X} d(i) \end{array} $$
(6)

and

$$ \begin{array}{@{}rcl@{}} \mu(S) &=& \sum\limits_{i\in S,j\in V_{c}-S} w_{ij} + (\sum\limits_{i\in X}d(i) - \sum\limits_{i\in X,j\in V-X} w_{ij} )+ 2 \sum\limits_{j\in X} w_{sv,j}\\ &=& (m_{c}n + \sum\limits_{i\in X,j\in V-X} w_{ij}- \sum\limits_{i\in X} d(i) ) + (\sum\limits_{i\in X} d(i)\\&& - \sum\limits_{i\in X,j\in V-X} w_{ij} )+ 2m_{c}|X|= m_{c}n + 2m_{c}|X| \end{array} $$
(7)

Substituting (6) and (7) into (4), we have

$$ \begin{array}{@{}rcl@{}} {\varPhi}(S) &=& \frac{m_{c}n + {\sum}_{i\in X,j\in V-X} w_{ij} - {\sum}_{i\in X} d(i)}{m_{c}n + 2m_{c}|X|} \\ &=& \frac{1}{1+2\frac{|X|}{n}} -\! \frac{1}{m_{c}}\frac{{\sum}_{i\in X} d(i) - {\sum}_{i\in X,j\in V-X} w_{ij}}{n + 2|X|} \end{array} $$
(8)

Note that

$$ den(X) = \frac{|E(X)|}{|X|} = \frac{{\sum}_{i\in X} d(i) - {\sum}_{i\in X,j\in V-X} w_{ij}}{2|X|} $$
(9)

and constant n > 0. Thus (8) implies

$$ {\varPhi}(S) \geq \frac{1}{1+2\frac{|X|}{n}} - \frac{1}{m_{c}} den(X) $$
(10)
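The step from (8) to (10) can be spelled out as follows. This is only a restatement of the argument above, obtained by substituting (9) into (8), and it assumes X is non-empty so that den(X) is defined:

$$ \begin{array}{@{}rcl@{}} {\varPhi}(S) &=& \frac{1}{1+2\frac{|X|}{n}} - \frac{1}{m_{c}}\cdot\frac{2|X|}{n+2|X|}\, den(X) \\ &\geq& \frac{1}{1+2\frac{|X|}{n}} - \frac{1}{m_{c}}\, den(X) \end{array} $$

The inequality holds because \(0 < \frac{2|X|}{n+2|X|} < 1\) for n > 0 and den(X) ≥ 0.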

For the case sv ∉ S, similarly we get

$$ \begin{array}{@{}rcl@{}} {\varPhi}(S) &=& \frac{ 2m_{c}|X| + {\sum}_{i\in X,j\in V-X} w_{ij} - {\sum}_{i\in X} d(i) }{2m_{c}|X|} \\ &=& 1 - \frac{1}{m_{c}} den(X) \\ &&\geq \frac{1}{1+2\frac{|X|}{n}} - \frac{1}{m_{c}} den(X) \end{array} $$
(11)
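Proposition 1 can also be sanity-checked numerically on a small graph by building Gc, computing Φ(S) in the form (4) directly from the weights, and comparing it with the right-hand side of (5) for every non-empty X, both with and without sv in S. The sketch below is such a check, not a proof; the labels "sv" and "tv" and the function names are assumptions of this illustration.

```python
from itertools import combinations

SV, TV = "sv", "tv"  # assumed labels for the added source and sink nodes

def transform(adj, mc):
    """Weights of Gc: 1 on original edges, mc to sv, mc - d(i) to tv."""
    W = {u: {v: 1 for v in nb} for u, nb in adj.items()}
    W[SV], W[TV] = {}, {}
    for i, nb in adj.items():
        W[i][SV] = W[SV][i] = mc
        W[i][TV] = W[TV][i] = mc - len(nb)
    return W

def phi(W, S):
    """Conductance in the simplified form (4): mu(delta(S)) / mu(S)."""
    boundary = sum(w for i in S for j, w in W[i].items() if j not in S)
    vol = sum(sum(W[i].values()) for i in S)
    return boundary / vol

def proposition_1_holds(adj, mc):
    """Check inequality (5) for every non-empty X, with and without sv in S."""
    W, n = transform(adj, mc), len(adj)
    for k in range(1, n + 1):
        for combo in combinations(adj, k):
            X = set(combo)
            edges_in = sum(1 for u in X for v in adj[u] if v in X) // 2
            bound = 1 / (1 + 2 * len(X) / n) - (edges_in / len(X)) / mc
            for S in (X, X | {SV}):
                if phi(W, S) < bound - 1e-12:
                    return False
    return True

# K4 on {0,1,2,3} with a pendant path 3-4-5; dmax = 4, so mc = 5 is admissible.
example_adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
               3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
```

For X = {0, 1, 2, 3} and sv ∉ S the direct computation gives Φ(S) = 28/40 = 0.7, which matches 1 − den(X)/mc = 1 − 1.5/5 from (11).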

4.2 The Slither and Slither PR algorithms

Having finished the transformation between the two problems in Section 4.1, the next step is to solve the minimum conductance problem on Gc.

Algorithm 1 The Slither and Slither PR algorithms

Before going into the definitions of the variables and parameters of our Slither and Slither PR algorithms (Alg. 1), we note that our algorithms differ from the algorithms derived from the Lovász-Simonovits Theorem only in the added transformation (Line 1 of Alg. 1) and in the \(\arg \max \limits _{j} den\) (Line 8 of Alg. 1), where the latter algorithms minimize a conductance instead. Apart from that, the two kinds of algorithms share the same random walk rules and the same calculation of an important term \({S_{j}^{t}}\), called the sweep-cut.

Here we first present the rules of the lazy random walk. A step of the lazy random walk stays where it is with probability 0.5 and goes from a node i ∈ Vc to a neighboring node j ∈ Vc with probability 0.5wij/μ(i). Let D = diag(μ(1),μ(2),…,μ(nc)) be a diagonal matrix with nc = |Vc|, and let the nc-dimensional column vector pt be the probability distribution over Vc at time t. The lazy random walk, beginning at time 0, can be expressed as

$$ p_{t} = Mp_{t-1} = M^{t}p_{0} $$
(12)

where the matrix M is calculated as

$$ M = (WD^{-1} + I ) / 2 $$
(13)

with I being the nc × nc identity matrix. We set the initial probability distribution p0 as

$$ p_{0} = (0,\cdots,1,\cdots,0)^{\mathrm{T}} st. p_{0}(i)=\begin{cases}1,i=r\\0,otherwise\end{cases} $$
(14)

with a given node r ∈ Vc −{sv,tv}.

Gc is a connected graph. On a connected graph, it is known that a lazy random walk converges to a stationary distribution \(p_{\infty }\), regardless of the value of p0. Set \(2m = {\sum }_{j\in V_{c}}{\mu (j)}\). For a node i, the stationary probability

$$ p_{\infty}(i) = \mu(i) / (2m) $$
(15)

is proportional to μ(i).
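The update (12) with M from (13) does not require forming matrices explicitly. The following sketch, which assumes W is stored as a symmetric dict of dicts and distributions as dicts, performs one lazy step and computes the stationary distribution (15); iterating the step on a small connected graph shows the convergence stated above. The function names are illustrative.

```python
def lazy_walk_step(W, p):
    """One step p_t = M p_{t-1} with M = (W D^{-1} + I) / 2."""
    mu = {i: sum(W[i].values()) for i in W}
    nxt = {i: 0.5 * p.get(i, 0.0) for i in W}      # stay put with probability 0.5
    for i in W:
        mass = p.get(i, 0.0)
        if mass:
            for j, w in W[i].items():              # move to j w.p. 0.5 * w_ij / mu(i)
                nxt[j] += 0.5 * mass * w / mu[i]
    return nxt

def stationary(W):
    """p_inf(i) = mu(i) / (2m), as in (15)."""
    mu = {i: sum(W[i].values()) for i in W}
    two_m = sum(mu.values())
    return {i: mu[i] / two_m for i in W}

# Weighted path a - b - c - d with weights 1, 2, 1.
path = {"a": {"b": 1}, "b": {"a": 1, "c": 2}, "c": {"b": 2, "d": 1}, "d": {"c": 1}}
p = {"a": 1.0}
for _ in range(500):
    p = lazy_walk_step(path, p)
```

After enough steps p agrees with stationary(path) up to floating-point error, independently of the starting node.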

At each t, sort pt(i)/μ(i) in descending order over all i ∈ Vc. Let π(k) be the identifier of the node at position k in the sorted sequence, so that

$$ \frac{p_{t}(\pi(k))}{\mu(\pi(k))} \geq \frac{p_{t}(\pi(k+1))}{\mu(\pi(k+1))}\ (k=1,2,\cdots,n_{c}-1) $$
(16)

Then we set the sweep-cut

$$ {S_{j}^{t}} = \{\pi(1),\pi(2),\ldots,\pi(j)\}\ (1<j\leq n_{c}) $$
(17)
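The ordering (16) and the sweep-cuts (17) can be sketched as below, assuming the distribution pt and the weighted degrees μ are given as dicts keyed by node. The names are illustrative, and materializing every prefix as a set is done for clarity only; an implementation concerned with speed would update one set incrementally.

```python
def sweep_order(p, mu):
    """Nodes sorted by p_t(i) / mu(i) in descending order, i.e. pi(1), pi(2), ..."""
    return sorted(mu, key=lambda i: p.get(i, 0.0) / mu[i], reverse=True)

def sweep_cuts(p, mu):
    """Yield S_j^t = {pi(1), ..., pi(j)} for 1 < j <= n_c, as in (17)."""
    order = sweep_order(p, mu)
    for j in range(2, len(order) + 1):
        yield set(order[:j])

p_example = {"a": 0.5, "b": 0.3, "c": 0.2}
mu_example = {"a": 1, "b": 3, "c": 1}
```

In the example the ratios are 0.5, 0.1 and 0.2, so node c precedes node b even though b carries more probability mass: the normalization by μ is what distinguishes the sweep order from a plain ranking by pt.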

Theorem 1

For some non-negative integer T, let

$$ {\varPhi}^{*} = \min_{0\leq t\leq T} \min_{1<j\leq n_{c}} {\varPhi}({S_{j}^{t}}) $$
(18)

Then for each S ⊆ Vc,

$$ \sum\limits_{i\in S}{p_{t}(i)} - \mu(S)/2m \leq \beta t + \sqrt{\mu(S)} (1-\frac{1}{8}{\varPhi}^{*2})^{t} $$
(19)

with μ(S) ≤ μ(Vc − S) = 2m − μ(S) as we assumed before.

Theorem 1 can be seen as a generalization of the classical Lovász-Simonovits Theorem [24] that additionally introduces a parameter β ∈ [0,1). The classical theorem is built on lazy random walks only, corresponding to β = 0; the range of this parameter was extended by later research [3] as the personalized PageRank [3, 23, 33] became popular. For general β ∈ [0,1), (12) is written as

$$ p_{t} = \beta p_{0} + (1-\beta) Mp_{t-1} $$
(20)

The characteristics of the two kinds of random walks imply that a bigger β makes the corresponding random walk more inclined to exploit than to explore during graph search. Keep this in mind when tuning β; it applies both to the algorithms derived from the Lovász-Simonovits Theorem and to our algorithms.
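Update (20) can be sketched as a lazy step followed by a restart towards p0. As before, the dict-based representation and the function name are assumptions of this illustration; setting beta = 0 recovers the plain lazy random walk (12).

```python
def ppr_step(W, p_prev, p0, beta):
    """One step p_t = beta * p_0 + (1 - beta) * M p_{t-1}, with M from (13)."""
    mu = {i: sum(W[i].values()) for i in W}
    walked = {i: 0.5 * p_prev.get(i, 0.0) for i in W}
    for i in W:
        mass = p_prev.get(i, 0.0)
        if mass:
            for j, w in W[i].items():
                walked[j] += 0.5 * mass * w / mu[i]
    return {i: beta * p0.get(i, 0.0) + (1 - beta) * walked[i] for i in W}

# Unit-weight triangle, walk started at node "a".
triangle = {"a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1}}
start = {"a": 1.0}
one_step = ppr_step(triangle, start, start, beta=0.74)
```

With β = 0.74 the start node keeps 0.74 + 0.26 · 0.5 = 0.87 of the mass after one step, compared with 0.5 for β = 0, which is the sense in which a larger β keeps the walk close to the given node.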

Theorem 2 is obtained by combining Theorem 1 with Proposition 1.

Theorem 2

For some non-negative integer T, let

$$ den^{*} = \max_{0\leq t\leq T} \max_{1<j\leq n_{c}} den({S_{j}^{t}}-\{sv,tv\}) $$
(21)

Then for each S ⊆ Vc,

$$ \begin{array}{@{}rcl@{}} den^{*} &\geq& \beta t + \frac{m_{c}}{1+2\frac{|X|}{n}} - 2\sqrt{2}m_{c} [ 1\\ &&- \mu(S)^{-\frac{1}{2t}} (\sum\limits_{i\in S}{p_{t}(i)} - \mu(S)/2m)^{\frac{1}{t}} ]^{\frac{1}{2}} \end{array} $$
(22)

with X = S −{sv,tv} and the assumption μ(S) ≤ 2m − μ(S).

To identify the lower bounds of den* more clearly, in contrast to the elegant implicit expression in terms of Φ* in (19), we place den* on the left-hand side of (22) and leave its lower bounds on the other side. The variables and parameters in the lower bounds have been defined before. Like Theorem 1, Theorem 2 has a strong algorithmic implication, i.e., it can be turned directly into algorithms. The algorithms, Slither corresponding to β = 0 and Slither PR corresponding to β ∈ (0,1), are formally presented in Alg. 1 for solving the MDS-N problem.

According to (22), to obtain large lower bounds of den, one should set \(S={S_{j}^{t}}\) for each time t, i.e., order pt(i)/μ(i) over all nodes i ∈ Vc. Among these bounds, for each t, the corresponding bound increases as \({\sum }_{i\in {S_{j}^{t}}}{p_{t}(i)}\) increases or as \(\mu ({S_{j}^{t}})\) and \(|X|=|{S_{j}^{t}}-\{sv,tv\}|\) decrease. Thus, to obtain the maximum den(X), all the values of \(({\sum }_{i\in {S_{j}^{t}}}{p_{t}(i)},\mu ({S_{j}^{t}}),|X|)\) should be evaluated for each t.
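Putting the pieces of Sections 4.1 and 4.2 together, the overall flow (transform G, run the walk, sweep at every t, keep the densest prefix) can be sketched as below. This is a reconstruction from the prose only and is not a transcription of Alg. 1: the function name, the "sv"/"tv" labels, the default mc = dmax + 1, the incremental edge counting and the inclusion of the single-node prefix are all assumptions of this sketch.

```python
def slither_sketch(adj, r, T=100, beta=0.0, mc=None):
    """Hedged sketch of the Slither (beta = 0) / Slither PR (0 < beta < 1) flow.

    adj: dict node -> set of neighbours of the unweighted graph G.
    Returns (X, den(X)) for the densest sweep prefix seen over t = 1..T."""
    SV, TV = "sv", "tv"                      # assumed labels, must not occur in adj
    dmax = max(len(nb) for nb in adj.values())
    mc = dmax + 1 if mc is None else mc
    # Step 1: graph transformation of Section 4.1.
    W = {u: {v: 1.0 for v in nb} for u, nb in adj.items()}
    W[SV], W[TV] = {}, {}
    for i, nb in adj.items():
        W[i][SV] = W[SV][i] = float(mc)
        W[i][TV] = W[TV][i] = float(mc - len(nb))
    mu = {i: sum(W[i].values()) for i in W}
    # Step 2: random walk started at r, sweeping after every step.
    p0 = {i: 0.0 for i in W}
    p0[r] = 1.0
    p = dict(p0)
    best, best_den = {r}, 0.0
    for _ in range(T):
        walked = {i: 0.5 * p[i] for i in W}
        for i in W:
            if p[i]:
                for j, w in W[i].items():
                    walked[j] += 0.5 * p[i] * w / mu[i]
        p = {i: beta * p0[i] + (1 - beta) * walked[i] for i in W}
        order = sorted(W, key=lambda i: p[i] / mu[i], reverse=True)
        X, edges_in = set(), 0
        for i in order:
            if i == SV or i == TV:
                continue                     # den is evaluated on S minus {sv, tv}
            edges_in += sum(1 for v in adj[i] if v in X)
            X.add(i)
            d = edges_in / len(X)
            if d > best_den:
                best, best_den = set(X), d
    return best, best_den

# K4 on {0,1,2,3} with a pendant path 3-4-5.
example_adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
               3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
```

On the example graph with r = 0, both beta = 0 and beta = 0.74 recover the K4 with den = 1.5 already at the first step, because the three neighbours of r receive equal probability mass while nodes 4 and 5 receive none.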

4.3 The Hierarchical Algorithms

Algorithm 2 The hierarchical Slither and Slither PR algorithms

Inspired by the hierarchical algorithmic structure adopted in [10] for advancing graph mining, we improve the Slither (PR) algorithm by putting it into a hierarchical framework with K levels, called the hierarchical Slither (PR) algorithm and summarized in Alg. 2. It is a heuristic technique that simply repeats the procedure of Slither (PR): at each round k (k = 1,⋯ ,K) in Alg. 2, a new MDS-N problem with the same given node r is solved, on a new G induced by the subset of nodes X obtained from the previous round, until r ∉ X. Slither and Slither PR are the special cases of their hierarchical versions with K = 1. As K increases, the hierarchical algorithms do not produce worse results than the non-hierarchical ones, meaning that the former have the same lower bounds as the latter.
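The hierarchical framework can be sketched independently of the inner search. In the sketch below, `local_search` stands for any routine that takes an adjacency dict and the given node and returns a node set with its density (for instance a Slither-like routine); it is a hypothetical parameter of this illustration. The stopping rule (stop when r drops out of X or when X no longer shrinks) and the choice to keep the best result over all rounds are assumptions, not a transcription of Alg. 2.

```python
def hierarchical(adj, r, local_search, K=2):
    """Repeat `local_search` on the subgraph induced by the previous result.

    adj: dict node -> set of neighbours; local_search(adj, r) -> (X, den).
    Keeps the densest X over all rounds, so more rounds never hurt."""
    current = adj
    best, best_den = None, -1.0
    for _ in range(K):
        X, d = local_search(current, r)
        if d > best_den:
            best, best_den = set(X), d
        if r not in X or set(X) == set(current):
            break                            # assumed stopping rule
        # New G: the subgraph induced by X.
        current = {u: current[u] & set(X) for u in X}
    return best, best_den
```

With K = 1 the wrapper reduces to a single call of `local_search`, matching the statement that the non-hierarchical algorithms are the special case K = 1.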

4.4 Time complexity

Line 7 in Alg. 1 runs in \(\mathcal {O}(|E_{c}|)\) for both the lazy random walk and PageRank, and Line 8 runs in \(\mathcal {O}(|V_{c}|\log (|V_{c}|))\) for sorting, with |Ec| = |E| + 2n and |Vc| = |V| + 2. Therefore, Slither (PR) runs in a total of \(\mathcal {O}(T(|E_{c}|+|V_{c}|\log (|V_{c}|)))\), and hierarchical Slither (PR) runs in a total of \(\mathcal {O}(KT(|E_{c}|+|V_{c}|\log (|V_{c}|)))\). Note that hierarchical Slither (PR) runs faster in practice than its time complexity suggests, because even the first round of the k-for-loop greatly reduces the size of G.

Importantly, if we move Lines 8-12 in Alg. 1 (and the same procedure in Alg. 2) out of the t-for-loop, Slither (PR) runs in \(\mathcal {O}(T|E_{c}|+|V_{c}|\log (|V_{c}|))\) and hierarchical Slither (PR) runs in \(\mathcal {O}(K(T|E_{c}|+|V_{c}|\log (|V_{c}|)))\). This move replaces every t in Theorem 2 with T, leaving only one lower bound of den, the one with respect to T. This lower bound is no less than the minimum of the original set of bounds. Moreover, this move saves T − 1 rounds of sorting in calculating \({S_{j}^{t}}\), which significantly reduces the runtime of the algorithms. We recommend adopting this move in tasks whose bottleneck is time, i.e., tasks that take large graphs as inputs. In such tasks, T is often set to a small value to save time as well. A small T makes the lower bound with respect to T close to the minimum one, so that adopting the move in our algorithms is both effective and efficient for these tasks.

5 Experiments

In this section, we use HS and HSPR as shorthand for the hierarchical Slither algorithm and the hierarchical Slither PR algorithm respectively. We set mc = dmax + 1 for both HS and HSPR, and β = 0.74 for HSPR. Although any β ∈ (0,1) can be set for HSPR, we choose a comparatively big β to clearly show how the search style of HSPR differs from that of HS. As there is no existing work that handles the MDS-N problem, we compare our proposed algorithms mainly with two local dense subgraph finding algorithms, FindDense [1] and HiDDen [10], which use different density measures. All experiments are carried out on a 2.2 GHz Intel Xeon E5-2407 server with 18 GB RAM.

5.1 Experiments on real-world graphs

In this set of experiments, we configure our algorithms with the setting of K = 10 and T = 100. K = 10 is also set for HiDDen. We evaluate algorithms on six publicly-available real-world graphs, of which, three are unipartite graphs including Co-Author [10], Crocodile [34], and Brightkite [35], and another three are bipartite graphs including Epinions [36], Facebook [37], and Amazon [38]. Co-Author is built on a snapshot of AMiner citation dataset [39] collected until the year 2011, covering five research areas: data mining, machine learning, database, information retrieval and bioinformatics. Crocodile is a Wikipedia page-page network on the topic crocodiles. Brightkite is a location-based online social network. Epinions is an Epinions signed who-trust-whom online social network. Facebook consists of social circles from Facebook. Amazon is built on the category Computers of the Amazon product dataset, crawled from May 1996 to July 2014. The main characteristics of the six real-world graphs are shown in Table 1. Note that a bipartite graph has two sets of nodes.

Table 1 Some real-world graphs used in our experiments

Table 2 shows the densities of the subgraphs found by the algorithms. For each real-world graph, Charikar’s greedy algorithm was used to search for the global densest subgraph. We then randomly sampled ten nodes from it as the given nodes of MDS-N for the local subgraph finding algorithms, and averaged the results. The densities of the subgraphs found by the greedy algorithm, labeled Greedy, are presented as baselines. HS or HSPR achieves the maximum density among all algorithms, even better than the baselines. Between HS and HSPR, HS wins when the densest subgraphs are easier to find by exploration, while HSPR wins when exploitation is better. To further explain the performance of HS and HSPR, we also show the results of hierarchical PageRank (HPR) in Table 2. HPR is a hierarchical version of PageRank with the same parameter setting as HSPR, which locally finds the subgraphs related to cuts with minimum conductance. Most of the time, HPR performs the worst. Except for Co-Author, the only acceptable result of HPR is on Crocodile, and on Crocodile HSPR performs better than HS, showing the superiority of the exploitation search of both HPR and HSPR on this graph. On each of the other four graphs, HSPR is not always the best, but it achieves more than 90% of the density of the subgraph found by HS, in contrast to the worst result of HPR, owing to the graph transformation used in HSPR.

Table 2 The densities of subgraphs found by algorithms, averaged among given nodes sampled from the global densest subgraphs detected by Charikar’s Greedy algorithm

Table 3 presents the results of the top five hierarchies of three hierarchical algorithms: HiDDen, HS, and HSPR. It can be seen that the bigger β of HSPR enables it to converge faster than HS (with smaller k to get its optimum). Both HS and HSPR reach or approach their optimum in two hierarchies, indicating that we could set far smaller hierarchies (k ≪ 10) for them.

Table 3 The densities of the found subgraphs vs. the hierarchies

5.2 Experiments on Synthetic Graphs

In this set of experiments, we configure our algorithms with K = 10 and T = 100. K = 10 is also set for HiDDen. From the real-world bipartite graph Epinions, we randomly sampled a (2000,2000)-node subgraph as the base of two synthetic bipartite graphs, depicted in Fig. 3a with the label “Separate” and in Fig. 3b with the label “Overlapping” respectively. Each synthetic graph was constructed by injecting two dense subgraphs. In “Separate”, we injected a dense subgraph with edge density 0.04 over the node ranges ([0,150],[0,150]) and a dense subgraph with edge density 0.01 over the node ranges ([150,300],[150,300]). In “Overlapping”, we injected a dense subgraph with edge density 0.04 over the node ranges ([0,150],[0,150]) and a dense subgraph with edge density 0.0025 over the node ranges ([0,450],[0,450]). For each synthetic graph, we tested each of the 300 nodes in ([0,150],[0,150]) as a given node of MDS-N, and the averaged results are shown in Table 4. Two metrics, accuracy (\(AC = \frac {TP+TN}{TP+FN+FP+TN}\)) and F-measure (\(F = \frac {2}{1/precision+1/recall}\)), are used.
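The two metrics can be computed from node sets as in the following sketch, which treats the nodes of the injected densest subgraph as positives and all other nodes of the graph as negatives. The function name and the convention F = 0 when there are no true positives are assumptions of this illustration.

```python
def accuracy_and_f(found, truth, universe):
    """Return (AC, F) for a found node set against the ground-truth node set.

    AC = (TP + TN) / (TP + FN + FP + TN); F is the harmonic mean of precision
    and recall, which equals 2TP / (2TP + FP + FN)."""
    found, truth, universe = set(found), set(truth), set(universe)
    tp = len(found & truth)
    fp = len(found - truth)
    fn = len(truth - found)
    tn = len(universe) - tp - fp - fn
    ac = (tp + tn) / len(universe)
    f = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return ac, f

ac, f = accuracy_and_f(found={2, 3, 4}, truth={0, 1, 2, 3}, universe=range(10))
```

In this toy example AC = 0.7 while F = 4/7: because the true subgraph is small relative to the whole graph, the many true negatives inflate accuracy, which is why the F-measure is the more discriminating of the two metrics when the target subgraph is small.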

Fig. 3 The scatter plots of two synthetic bipartite graphs, each injected with two dense subgraphs: (a) Separate: a dense subgraph with edge density 0.04 over the node ranges ([0,150],[0,150]) and a dense subgraph with edge density 0.01 over the node ranges ([150,300],[150,300]); (b) Overlapping: a dense subgraph with edge density 0.04 over the node ranges ([0,150],[0,150]) and a dense subgraph with edge density 0.0025 over the node ranges ([0,450],[0,450])

Table 4 The accuracy (AC) and F-measure (F) of algorithms to find the known densest subgraphs, derived by nodes ([0,150],[0,150]), of the two synthetic bipartite graphs in Fig. 3, averaged among the given nodes in these two subgraphs

In Table 4, the algorithms FindDense, HiDDen, HS and HSPR all find the subgraphs derived by nodes ([0,150],[0,150]) with high accuracy (> 0.95). For F-measure, however, only HS and HSPR reach values > 0.84 on both synthetic graphs. HS performs better than HSPR here, due to their algorithmic characteristics. Compared with HSPR, especially with a big β, HS is more inclined to explore than to exploit when searching for local densest subgraphs, which seems to benefit HS in the experiments of this subsection.

Additionally, Fig. 4 depicts a box plot to show the concentration of 300 values of F-measure for each algorithm evaluated on each synthetic graph. It can be seen that with more exploitation, HSPR gets the narrowest range of values of F-measure for each synthetic graph, demonstrating its great stability.

Fig. 4 The stability of algorithms to find the known densest subgraphs, derived by nodes ([0,150],[0,150]), of the two synthetic bipartite graphs in Fig. 3, with respect to F-measure

5.3 Scalability

In this set of experiments, we configure our algorithms with K = 10 and T = 100. K = 10 is also set for HiDDen. We used the category “Apps for Android” of the Amazon product dataset to construct a user-product bipartite graph, from which we sampled a series of subgraphs with an increasing number of edges to evaluate the scalability of the algorithms. As depicted in Fig. 5, HiDDen runs the slowest, and the gap between its running time and that of the other three algorithms grows rapidly with the number of edges. HSPR converges faster than HS when the number of edges becomes large. The running time of FindDense increases the slowest with the number of edges.

Fig. 5 The scalability of algorithms

5.4 Large graph application: targeted fraud detection on Twitter

In social/commercial networks, fraudsters often hire workers, who are users of these networks, to add links to particular users/products, which leads to unusually dense subgraphs in these networks. These links are added through user behaviors like following, buying, and reviewing. Compared with finding subgraphs sparsely connected to the remaining network, detecting fraud based on dense subgraph finding is naturally camouflage-resistant [5]. This paper provides a way to detect a particular fraud-related subgraph from a given node, which we call targeted fraud detection. Take the large follower-followee social network Twitter as an example. We use the Twitter dataset, which contains 41.7 million users and 1.47 billion social relations crawled in July 2009 [40]. We adopt the criteria in [30], with some modifications, for labeling fraudulent user accounts; they are summarized below.

  • The account is suspended or deleted.

  • The text information of the account is associated with malware or adware, or the account is a follower of such an account.

  • The account has a suspicious username, or is followed by users with suspicious usernames, e.g., usernames having identical prefixes/suffixes.

  • The account has very few tweets or very few different tweets (< 5), but relatively more followees (> 20).

The algorithms FindDense, HiDDen, HS and HSPR were evaluated on the Twitter dataset with four user accounts as the given nodes: “@tweepme”, “@twitbacks”, “@tweepi” and “id = 14868835”. The first three accounts, which publish obvious follower-buying advertisements like “helps you get more followers on twitter quick and easy”, are still active in gaining followers. id = 14868835 is a suspended account found in a forum. Due to the memory limitation of our 18 GB RAM, we cannot load the entire Twitter dataset, which contains 24.3 GB of unstructured follower-followee data. For a given node, we cut a piece from the Twitter dataset as an extracted dataset, consisting of a range of consecutive follower account IDs that contains the given node’s ID, together with all their followees. An extracted dataset has 100 million edges. For instance, the ID of @tweepme is 23711158, and the follower IDs of its corresponding extracted dataset range from 22563769 to 24907792. The node scales of the four extracts corresponding to the four accounts are given in Table 5.

Table 5 The node scales of the extracts corresponding to four given nodes in Twitter network
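The extraction step can be sketched as growing a window of consecutive follower IDs around the given node until an edge budget is reached. This is a minimal sketch under assumptions: the text does not say how the window is positioned around the given node, so the code simply alternates between extending to the left and to the right. The function name `extract_window` and its inputs (a sorted list of follower IDs and their followee counts) are hypothetical.

```python
from bisect import bisect_left


def extract_window(follower_ids, out_degrees, target_id, edge_budget):
    """Grow a window of consecutive follower IDs around target_id.

    follower_ids must be sorted ascending; out_degrees[i] is the number of
    followees of follower_ids[i]. Returns (lowest ID, highest ID, edge count).
    Alternating left/right growth is an assumption, not the paper's procedure.
    """
    i = bisect_left(follower_ids, target_id)
    if i == len(follower_ids) or follower_ids[i] != target_id:
        raise ValueError("target_id is not among the follower IDs")
    lo = hi = i
    edges = out_degrees[i]
    grow_left = True
    last = len(follower_ids) - 1
    while edges < edge_budget and (lo > 0 or hi < last):
        if (grow_left and lo > 0) or hi == last:
            lo -= 1                      # extend the window to a smaller ID
            edges += out_degrees[lo]
        else:
            hi += 1                      # extend the window to a larger ID
            edges += out_degrees[hi]
        grow_left = not grow_left
    return follower_ids[lo], follower_ids[hi], edges
```

With `edge_budget` set to 100 million, the returned ID range plays the role of the range 22563769 to 24907792 reported for @tweepme, although the actual boundaries depend on how the authors placed the window.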

For large datasets, we set k = 1 for the three hierarchical algorithms (i.e., HiDDen, HS and HSPR); for HS and HSPR, we also set T = 10 and moved Lines 10-14 of Alg. 2 out of the t-for-loop. For each algorithm, the densities of the detected fraud-related subgraphs corresponding to the four accounts are listed in Table 6. Averaged over the four accounts, the running time is 2475 s for FindDense, 1545 s for HS, and 1488 s for HSPR. The results of HiDDen are not shown in Table 6 because HiDDen did not produce any result within 5 h. Note that with these settings (k = 1, etc.), HS and HSPR run faster even than FindDense, and their results are still much better than those of FindDense. This means that HS and HSPR can obtain acceptable results, better than those of the other algorithms used to solve the MDS-N problem, with very small hierarchies (e.g., k = 1) and well before convergence (e.g., with T = 10). HSPR's tendency toward more exploitation often gives it an advantage in running time, but in this set of experiments HS, with more exploration, achieves the best results.

Table 6 The densities of the detected fraud-related subgraphs corresponding to four given nodes in Twitter network
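For reference, the average degree measure of a bipartite subgraph defined in the introduction, |E(S)|/(|Su| + |Sv|), can be computed as in the following sketch. We assume here that this is the density on which the MDS-N problem is based; the function name and the edge-list representation are hypothetical.

```python
def average_degree_density(edges, followers, followees):
    """Compute |E(S)| / (|S_u| + |S_v|) for the subgraph induced by S = S_u ∪ S_v.

    edges is an iterable of (follower, followee) pairs of the bipartite graph;
    followers and followees are the node subsets S_u and S_v.
    """
    s_u, s_v = set(followers), set(followees)
    if not s_u and not s_v:
        return 0.0
    # Count each distinct edge whose endpoints both lie in the chosen subsets.
    induced = sum(1 for u, v in set(edges) if u in s_u and v in s_v)
    return induced / (len(s_u) + len(s_v))
```

For example, a subgraph with two followers, two followees and three induced edges has density 3/4.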

Next, we focus on the subgraphs detected by HS. The node scales of these subgraphs are: (598,2726) for @tweepme, (812,2200) for @twitbacks, (732,1756) for @tweepi, and (273,2304) for id = 14868835. From the nodes in each detected subgraph, we randomly selected 50 workers and 50 fraudulent followees, and labeled them according to the previously listed criteria and their profiles and tweets publicly available at https://twitter.com/. The ratios of determined workers and determined fraudulent followees are presented in Fig. 6. For comparison, we add another case, “Random”, by labeling 100 randomly selected users, consisting of 50 followers and 50 followees, and display its ratios of determined fraud in Fig. 6.

Fig. 6 The ratios of determined workers and determined fraudulent followees in subgraphs detected by HS and in 100 randomly selected nodes

Except for “Random”, all ratios in Fig. 6a are over 30%, with @tweepme achieving the highest ratio of 56%. In Fig. 6b, three ratios are over 30% and that of id = 14868835 is 24%. “Random” obtains the lowest ratios: 8% for determined workers and 0% for determined fraudulent followees. This shows that HS can effectively target fraud from a given node in a real-world network.
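The evaluation protocol behind these ratios can be sketched as follows: draw a fixed-size random sample from one side of a detected subgraph and report the fraction of sampled accounts that are determined to be fraudulent. This is only a sketch; the function name `determined_ratio`, the `is_fraud` callback (standing in for the manual labeling) and the fixed seed are assumptions.

```python
import random


def determined_ratio(nodes, is_fraud, sample_size=50, seed=0):
    """Fraction of a random sample of nodes for which is_fraud(node) is True.

    is_fraud stands in for manual labeling against the listed criteria.
    """
    pool = sorted(nodes)                 # fixed order so the seed is reproducible
    if not pool:
        raise ValueError("nodes must not be empty")
    rng = random.Random(seed)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    return sum(1 for n in sample if is_fraud(n)) / len(sample)
```

With a sample of 50, the reported 56% for @tweepme corresponds to 28 determined workers, and the 8% of “Random” to 4.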

6 Conclusion and future work

This paper introduces the MDS-N problem, a local dense subgraph finding problem based on the average degree measure. We present a graph transformation that turns an undirected, unweighted graph into a connected, weighted graph and reduces the MDS-N problem to the minimum conductance problem. After the transformation, the proposed lazy-random-walk-based Slither algorithm and PageRank-based Slither PR algorithm “walk” on the connected, weighted graph to find the densest subgraph related to a particular node. A simple hierarchical repetition framework is used to further improve the two algorithms. Experiments conducted on both unipartite and bipartite graphs show that our algorithms are the fastest among the algorithms that can find subgraphs with high densities. Slither tends to explore while Slither PR tends to exploit during their “walks”. A small hierarchy (no more than two) enables Slither and Slither PR to achieve satisfactory results. We verify the MDS-N problem and the proposed algorithms on the large social network Twitter; the experimental results show that our algorithms can successfully detect local fraud-related subgraphs based on particular fraudulent user accounts.

In the future, we plan to: (I) combine our two hierarchical algorithms into one algorithm with a dynamically adjusted β, so that it can be better applied to different graphs, and (II) utilize graph partitioning techniques before belief propagation (i.e., walks) to decrease time complexity.