1 Introduction

Clusters or communities, defined as groups of vertices that share common properties or play similar roles in a network, are one of the most important patterns in social networks. Fortunato (2010) summarized some recent development of the community detection methods. Modularity-based method (Newman and Girvan 2004) is arguably the most popular methods in finding communities of networks. Greedy techniques (Newman 2004) and annealing methods (Guimera et al. 2004) were developed based on this criterion. However, the modularity method lacks statistical significance for deciding if the detected communities are real ones and are criticized for hardly finding communities smaller than a given scale (Fortunato and Barthélemy 2007). Bayesian models (Handcock et al. 2007; Heard et al. 2010) and Latent Dirichlet Allocation (LDA) models (Blei et al. 2003; Liu et al. 2009; Balasubramanyan and Cohen 2011) offer statistical inference of clusters via given model priors, but the selection of priors and computing times are judged like other applications of Bayesian models.

Recent studies of social networks also paid attentions on networks with attributes. For example, a tendency called “homophily” (McPherson et al. 2001) suggested that people usually interact with others who are similar to themselves with some attributes (Kossinets and Watts 2006). Some studies also discussed how homophily affects network integration (Louch 2000). Zhou et al. (2009) developed a distance-based transition probability, based on similarities of both structure and attribute, to construct a clustering algorithm. Yang et al. (2013) modeled the links of network and node attributes to provide a probability regime to detect community memberships.

However, few of them considered attributes with power-law (PL) distributions. The PL distribution is one of the most commonly found distributions that many data sets follow in networks (Clauset et al. 2009), such as the degrees of proteins in the protein interaction network and the degrees of metabolites in the metabolic network. In addition, the properties and problems of PL distributions are also well addressed in Goldstein et al. (2004) and Clauset et al. (2009). Thus, we intend to construct a community detection method which can accommodate to PL distributions.

To provide a statistical significance of cluster detection without considering the priors, Wang et al. (2008) provided a scanning method for testing clusters based on the idea of cluster detection in the spatial data analysis (Kulldorff 1997). Under the assumption of Poisson random model, Wang et al. (2008) constructed a scan statistic for detecting structure clusters in social networks. Instead of considering the structure clusters, We intend to generalize scan statistics to consider the attribute of networks with PL distributions. Since the scan statistic is originally applied in temporal (Naus 1966) and spatial domains (Kulldorff 1997), the test statistics are basically constructed by the likelihood of attributes. For this reason, we can extend the scan statistic to accommodate to PL distributions with a similar way.

This study aims to extend the use of the scanning method provided in Wang et al. (2008) to the network whose attributes are PL distributed. In Sects. 2 and 3, we introduce the PL distributions including both “discrete” and “continuous” cases, and the scanning method for community detection in social networks. To verify the proposed method, simulation studies are provided in Sect. 4 and a real data set of authorship is analyzed in Sect. 5. We discuss this method and provide some future works.

2 Power-law distributions

Specifying network clusters with PL-distributed attributes is the main focus in this paper. We only mention the topics related to the scanning methods, including density functions and the maximum likelihood estimates (MLEs). If one is interested in the other properties of PL distribution, such as goodness of fit and additional examples of PL distributions, please refer to Goldstein et al. (2004) and Clauset et al. (2009). Note that, we use different notations for density functions and parameters to distinguish between the “discrete” and “continuous” PL distributions.

2.1 Discrete power-law distribution

The probability density function of discrete PL distribution is expressed as

$$\begin{aligned} p(x)=Mx^{-\alpha }, \end{aligned}$$

where M is a normalized constant and is usually expressed as

$$\begin{aligned} M=1/\zeta (\alpha , x_{\min })=1/\sum _{w=0}^\infty \left( w+x_{\min }\right) ^{-\alpha } \end{aligned}$$

when a lower bound \(x_{\min }>0\) is considered. In a special case of \(x_{\min }=1\), \(\zeta (\alpha ,1)\) is equivalent to the Riemann zeta function and is abbreviated as \(\zeta (\alpha )\). Observed from its density function, the probability is confined by the value of \(\alpha\) and has a heavy tail.

A likelihood ratio of PL distribution is required when applying a PL distribution. Suppose \(\{x_1,x_2,\ldots ,x_n\}\) is a random vector following PL distribution and only the parameter \(\alpha\) is considered. The log-likelihood is

$$\begin{aligned} \ln L(\alpha )=-n\ln \zeta (\alpha ,x_{\min })-\alpha \sum _{i=1}^n \ln (x_i). \end{aligned}$$
(1)

By differentiating Eq. (1) at \(\alpha\),

$$\begin{aligned} \frac{\zeta ^{'}(\alpha ,x_{\min })}{\zeta (\alpha ,x_{\min })}=-\frac{1}{n}\sum _{i=1}^n \ln (x_i), \end{aligned}$$
(2)

where \(\zeta ^{'}(\alpha )\) is the derivative of the zeta function. Since Eq. (2) contains an infinity summation, there is no closed form of the MLE for \(\alpha\), and the MLE of this distribution cannot be directly found. Numerical algorithm thus becomes a possible way to obtain its solution. When \(x_{\min }=1\), we can quickly solve it by computing the Riemann zeta function whose derivation is provided in most mathematical softwares like Matlab and Maple. On the other hand, an approximated estimate is considered.

$$\begin{aligned} \hat{\alpha } \simeq 1+n \left[ \sum _{i=1}^n \ln \left( \frac{x_i}{x_{\min }-1/2}\right) \right] ^{-1}. \end{aligned}$$
(3)

when \(x_{\min } \ge 6\) (Clauset et al. 2009). In this paper, the data are restricted to \(x_{\min }=1\), so Eq. (2) is applied to construct the likelihood ratio test of the scanning method. If \(x_{\min } \ge 6\), we would suggest to directly use the approximate estimation Eq. (3). If \(x_{\min }\) is not included in above cases (\(1< x_{\min } <6\)), iterative computation is required to solve the equation (Gillespie 2013). We did not consider the parameter \(x_{\min }\) in this study, because it is easier to assume \(x_{\min }\) is a known constant and this assumption is usually true in our experience. If the \(x_{\min }\) is not certain and has to be estimated, it will be more complicated than the discussion in this study. Clauset et al. (2009) also gave some details of the estimations for parameters \(\alpha\) and \(x_{\min }\).

2.2 Continuous power-law distribution

Similar to what we have done in the case of discrete PL case, the continuous case is introduced as follows. Since the formulation of the continuous PL distribution can be obtained by integration, the estimation procedures are easier. We also assume that the \(x_{\min }\) is a fixed and known parameter. The continuous PL density function is expressed as

$$\begin{aligned} f (x)=K x^{-\beta }, \end{aligned}$$

where K is a normalized constant and is expressed as

$$\begin{aligned} K=\left( \beta -1\right) x_{\min }^{\beta -1}. \end{aligned}$$

Suppose a random sample is collected as \(\{x_1, \ldots , x_n\}\). By directly differentiating the parameter \(\alpha\), the MLE of \(\alpha\) is obtained as

$$\begin{aligned} \hat{\beta }=1+n\left[ \sum _{i=1}^n \ln \frac{x_i}{x_{\min }}\right] ^{-1}. \end{aligned}$$
(4)

3 The scanning method for PL distributions

The scan statistic is one of the most popular cluster detection methods applied in spatial domain (Kulldorff 1997). However, there are few studies and methods applying this approach to detect clusters in social networks. Wang et al. (2008) first used a scan statistic to detect clusters in a social network. We briefly introduce the basic idea of the scanning method and extend it to networks with PL-distributed attributes.

3.1 Scanning window and test statistic for structure pattern in networks

A scanning window is used to separate the studied region/network into two parts, where a likelihood ratio test is used to evaluate the difference between the selected observations and their complementary observations. Usually, a scanning window is circularly expanded and constructed by a center and a radius. In the network structure, the center is one node in a given network, and the radius is the shortest path from the center node to the other nodes. We demonstrate the construction and move of a scanning window via a grid network in Fig. 1.

Fig. 1
figure 1

Example of scanning window in a grid network

Based on this 5 by 5 grid network (Fig. 1a), we clearly observe the distances (the length of shortest paths) among nodes. Thus, it is easy to transform and demonstrate the network into a radius expansion. In this study, we use circular windows to be elective subgraphs. That is, a scanning window is generated based on a center with a corresponding radius, and the set of nodes within the window is the elective subgraph. Take the node 13 as the center for example. In Fig. 1b, the scanning windows with gray boundaries are generated from the center 13, and each window is expanded based on the length of the shortest paths. Suppose we generate a scanning window by the center node 13 and a radius 1. The elective subgraph is demonstrated in Fig. 1c, and we can compare the difference between the selected part and its unselected counterpart. One may check that all the vertices are contained in the windows according to Fig. 1a. In fact, the number of testing regions for a scan statistic is decided based on the number of vertices and the number of radiuses.

Wang et al. (2008) provided explicit descriptions on how to test the structure based on the scan statistics. We recall some notations that appear in the rest of our paper. We only focus on undirected graphs in this study. Let \(G=(V_G,E_G)\) be an undirected graph with vertex set \(V_G=\{v_1,\ldots ,v_{|V_G|}\}\) and edge set \(E_G\), and the degrees of vertices are \({\mathbf {k}}=\{k_1,\ldots ,k_{|V_G|}\}\). In addition, we define the sum of total degrees as \(k_G=\sum _{i=1}^{|V_G|}k_i\) and the total number of edges as \(|E_G|=k_G/2\). Then, by considering the Poisson random graph model (Erdős and Rényi 1959) with degree vector \(\mathbf {k}\), the number of expected edges connecting the pair nodes \((v_i,v_j)\) is expressed as \(e_{ij}=(k_i k_j )/(2 |E_G|)\) for \(i \ne j\), and \(e_{ii}=k_i^2 /(4 |E_G|)\).

Suppose a subgraph \(Z=(V_Z, E_Z)\) is selected. Similar notations is used to describe the quantities of Z: \(k_Z=\sum _{i \in V_Z} k_i\) and \(|E_Z|=k_Z/2\). To test if a network is composed by two different subgraphs, the number of edges in G is equal to \(\text{ Poi }(\lambda =\gamma \mu (Z \bigcap G) + \eta \mu (Z^C \bigcap G) )\), where \(\gamma\) and \(\eta\) represent the strengths for subgraph Z and its complementary subset \(Z^C\), and \(\mu (Z \bigcap G)\) and \(\mu (Z^C \bigcap G)\) represent the expected numbers of edges under the null hypothesis. Under the Poisson random graph assumption, \(\mu (G)\), \(\mu (Z)\), and \(\mu (Z^C)\) are, respectively, defined as

$$\begin{array}{lll} \mu (G)&=\;\;\frac{k_G^2}{4|E_G|}, \\ \mu (Z)&=\;\;\frac{k_Z^2}{4|E_G|}, \quad \text{ and } \\ \mu \left( Z^C\right)&=\;\;\mu (G)-\mu (Z). \end{array}$$

Thus, the likelihood ratio statistic of a selected subgraph Z is

$$\begin{aligned} LR(Z) =\frac{L_Z}{L_0}=\left\{ \begin{array}{ll} \left( {\frac{\left| E_Z\right| }{\mu (Z)}}\right) ^{|E_Z|}\left( {\frac{|E_G|-|E_Z|}{\mu (G)-\mu (Z)}}\right) ^{|E_G|-|E_Z|} &{}\quad \text{ if } \quad \hat{\gamma }>\hat{\eta }\\ 1 &{}\quad \text{ otherwise, } \end{array} \right. \end{aligned}$$
(5)

where \(\hat{\gamma }=\frac{|E_Z|}{\mu (Z)}\) and \(\hat{\eta }=\frac{|E_G|-|E_Z|}{\mu (G)-\mu (Z)}\). By scanning the whole region, the test statistic is the one with the maximum logarithmic likelihood ratio

$$\begin{aligned} \lambda _S(Z)=\max _Z \ln LR(Z). \end{aligned}$$

The subgraph Z with maximum \(LR(\cdot )\) is identified as a cluster if the null hypothesis is rejected.

3.2 The hypothesis and likelihood ratio test for discrete PL distribution

A testing statistic for attribute is constructed in a similar manner. Since the data are divided into two parts via a scanning window, the likelihood ratio evaluates the likelihoods between the values within the selected window and the unselected counterpart. In general, the test suggests \(H_0: F(Z) = F(Z^c)\) vs. \(H_a: F(Z) < F(Z^c)\), where F is the distribution function. Suppose a subgraph is selected as Z and the parameter of interest is \(\theta\). The likelihood ratio statistic is expressed as

$$\begin{aligned} \lambda \left( Z, Z^c\right) =\frac{\sup _{\Theta } L\left( \theta |x\right) }{\sup _{\Theta _0}L\left( \theta |x\right) }, \end{aligned}$$

where \(\Theta _0\) is the parameter space under the null hypothesis and \(\Theta\) is the entire parameter.

In this study, the distribution of interest is PL, and we consider the cluster with higher value of PL distribution, or equivalently, the cluster with a smaller parameter \(\alpha\) in the PL distribution. Thus, it is equivalent to consider the test \(H_0: \alpha _z \ge \alpha _\mathrm{c}\) vs. \(H_a: \alpha _z < \alpha _\mathrm{c}\), where \(\alpha _z\) and \(\alpha _\mathrm{c}\) are the parameters of PL distribution for the selected observations and their complementary observations, respectively.

Set the minimum value of the PL distribution as 1. Based on the null hypothesis that \(H_0: \alpha _{Z}=\alpha _{Z^c}\), the joint likelihood of the distribution is

$$\begin{aligned} L_0\left( \alpha |\mathbf {x}\right)&= \prod _{i=1}^n p(x_i)\\&= \prod _{i=1}^n \zeta \left( \alpha , x_{\min }\right) x_i^{-\alpha }. \end{aligned}$$

By taking logarithm on \(L_0\) and differentiating it with respect to \(\alpha\), the MLE of \(\alpha\) is the solution of

$$\begin{aligned} \frac{\zeta ^{'}(\hat{\alpha }_0)}{\zeta (\hat{\alpha }_0)}=-\frac{1}{n}\sum _{i=1}^n \ln (x_i). \end{aligned}$$

Thus, the denominator of the testing statistic \(\lambda\) is \(\sup _{\Theta _0}L\left( \theta |x\right) = L_0\left( \hat{\alpha }_0|\mathbf {x}\right)\).

Two cases on the numerator of the test statistic are considered; \(\alpha _z \ge \alpha _\mathrm{c}\) and \(\alpha _z<\alpha _\mathrm{c}\). When \(\alpha _z \ge \alpha _\mathrm{c}\), the numerator of the test statistic reduces to the null hypothesis and obtains the same estimate of the denominator. When \(\alpha _z<\alpha _\mathrm{c}\), the joint likelihood is viewed as two parts; one belongs to the selected subgraph Z and the other one is the complementary set of Z, \(Z^c\). Then, the joint likelihood of the distribution for this case is

$$\begin{aligned} L_\Theta (\alpha |{\mathbf {x}})= \prod _{i \in Z} p(x_i) \prod _{j \in Z^c} p(x_j). \end{aligned}$$

Since \(x_i\) and \(x_j\) belong to different \(\alpha\)s and are independent, we separately discuss the estimates of \(\alpha _z\) and \(\alpha _\mathrm{c}\). The MLEs of them are the solutions of

$$\begin{aligned} \frac{\zeta ^{'}(\hat{\alpha }_z)}{\zeta (\hat{\alpha }_z)}&=-\frac{1}{n_z}\sum _{i \in Z} \ln (x_i)\quad \text{ and } \nonumber \\ \frac{\zeta ^{'}(\hat{\alpha }_\mathrm{c})}{\zeta (\hat{\alpha }_\mathrm{c})}&=-\frac{1}{n_\mathrm{c}}\sum _{j \in Z^c} \ln (x_j), \end{aligned}$$
(6)

where \(n_z\) and \(n_\mathrm{c}\) are the numbers of nodes in Z and \(Z^c\), respectively. Denote the estimates as \(\hat{\alpha }_z\) and \(\hat{\alpha }_\mathrm{c}\). The numerator of the testing statistic is \(\sup _{\Theta }L(\theta |x) =L(\hat{\alpha }_z, \hat{\alpha }_\mathrm{c}|\mathbf {x})\) when \(\hat{\alpha }_z<\hat{\alpha }_\mathrm{c}\), and \(\sup _{\Theta }L(\theta |x)=L_0(\hat{\alpha }_0|\mathbf {x})\) otherwise.

According to above description, the test statistic is \(\lambda (Z)=\frac{L(\hat{\alpha }_z,\hat{\alpha }_\mathrm{c}|\mathbf {x})}{L(\hat{\alpha }_0|\mathbf {x})}\) when \(\hat{\alpha }_z<\hat{\alpha }_\mathrm{c}\), and \(\lambda (Z)=1\) otherwise. By considering the real form of the PL distribution, the likelihoods for the null hypothesis and the alternative hypothesis are

$$\begin{aligned} L\left( \hat{\alpha }_0\right)&=\prod _{i=1}^n \left[ \frac{x_i ^{-\hat{\alpha }_0}}{\zeta \left( \hat{\alpha }_0, x_{\min }\right) }\right] , \quad \text{ and } \\ L\left( \hat{\alpha }_z,\hat{\alpha }_\mathrm{c}\right)&=\prod _{i\in Z} \left[ \frac{x_i ^{-\hat{\alpha }_z}}{\zeta \left( \hat{\alpha }_z, x_{\min }\right) }\right] \prod _{j \in Z^c} \left[ \frac{x_j ^{-\hat{\alpha }_\mathrm{c}}}{\zeta \left( \hat{\alpha }_\mathrm{c}, x_{\min }\right) }\right] . \end{aligned}$$

By scanning the whole region via some predetermined radii, the test statistic for detecting clusters is

$$\begin{aligned} \lambda _A(Z)=\max _{Z \in \Omega } \lambda (Z). \end{aligned}$$

The logarithm form of the test statistic is expressed as

$$\begin{aligned} \Lambda _A(Z)&=\ln \lambda _A (Z)\nonumber \\&= -\hat{\alpha }_z\sum _{i \in Z} \ln (x_i) -n_z \ln \left( \zeta \left( \hat{\alpha }_z, x_{\min }\right) \right) -\hat{\alpha }_\mathrm{c}\sum _{j \in Z^c} \ln (x_j)\nonumber \\&\quad \,\, -n_\mathrm{c} \ln \left( \zeta \left( \hat{\alpha }_\mathrm{c}, x_{\min }\right) \right) +\hat{\alpha }_0\sum _{i \in \Omega } \ln (x_i) +n \ln \left( \zeta \left( \hat{\alpha }_0, x_{\min }\right) \right) , \end{aligned}$$
(7)

where n, \(n_z\), and \(n_\mathrm{c}\) are the numbers of nodes in the whole graph, the selected subgraph Z, and the complementary subgraph \(Z^c\), respectively.

3.3 The hypothesis and likelihood ratio test for continuous PL distribution

Similar to the discrete PL distribution, we construct the testing statistic for the continuous PL distribution. Since we have the MLE of the continuous case from Eq. (4), we can directly apply the same construction of the test statistic for the continuous PL distribution. For the null hypothesis, \(H_0: \beta _{Z} \ge \beta _{Z^c}\), the joint likelihood of the distribution is

$$\begin{aligned} L_0\left( \beta |\mathbf {x}\right)&= \prod _{i=1}^n p(x_i)\\&= \prod _{i=1}^n \left( \beta -1\right) x_{\min }^{\beta -1} x_i^{-\beta }. \end{aligned}$$

By taking logarithm on \(L_0\) and differentiating it with respect to \(\beta\), the MLE of \(\beta\) is

$$\begin{aligned} \hat{\beta }_0=1+n\left[ \sum _{i=1}^n \ln \frac{x_i}{x_{\min }}\right] ^{-1}. \end{aligned}$$

Thus, the denominator of the testing statistic \(\lambda\) is \(\sup _{\Theta _0}L(\theta |x) = L_0(\hat{\beta }_0|\mathbf {x})\).

Furthermore, for the case \(\beta _z<\beta _\mathrm{c}\), the MLEs of them are

$$\begin{aligned} \hat{\beta }_Z&=1+n\left[ \sum _{i \in Z} \ln \frac{x_i}{x_{\min }}\right] ^{-1} \quad \text{ and } \nonumber \\ \hat{\beta }_Z^c&=1+n\left[ \sum _{j \in Z^c} \ln \frac{x_j}{x_{\min }}\right] ^{-1}. \end{aligned}$$
(8)

where \(n_z\) and \(n_\mathrm{c}\) are the numbers of nodes in Z and \(Z^c\), respectively.

According to above description, the test statistic is \(\lambda (Z)=\frac{L(\hat{\beta }_z,\hat{\beta }_\mathrm{c}|\mathbf {x})}{L(\hat{\beta }_0|\mathbf {x})}\) when \(\hat{\beta }_z<\hat{\beta }_\mathrm{c}\), and \(\lambda (Z)=1\) otherwise. By considering the real form of the PL distribution, the likelihoods for the null hypothesis and the alternative hypothesis are

$$\begin{aligned} L(\hat{\beta }_0)=\prod _{i=1}^n \left[ \left( \hat{\beta }_0-1\right) x_{\min }^{\hat{\beta }_0-1} x_i^{-\hat{\beta }_0}\right] , \end{aligned}$$

and

$$\begin{aligned} L(\hat{\beta }_z,\hat{\beta }_\mathrm{c})=\prod _{i\in Z} \left[ \left( \hat{\beta }_z-1\right) x_{\min }^{\hat{\beta }_z-1} x_i^{-\hat{\beta }_z}\right] \prod _{j \in Z^c} \left[ \left( \hat{\beta }_\mathrm{c}-1\right) x_{\min }^{\hat{\beta }_\mathrm{c}-1} x_i^{-\hat{\beta }_\mathrm{c}}\right] . \end{aligned}$$

Thus, the test statistic for detecting clusters with the continuous PL distribution is expressed as

$$\begin{aligned} \Lambda _A(Z) =\ln \lambda (Z)&= n_z \ln \left( \hat{\beta _z} -1\right) - {\hat{\beta }_z}\sum _{i \in Z} \ln \frac{x_i}{x_{\min }}\nonumber \\&\quad + n_\mathrm{c} \ln \left( \hat{\beta _\mathrm{c}} -1\right) - {\hat{\beta }_\mathrm{c}}\sum _{i \in Z^c} \ln \frac{x_i}{x_{\min }}\nonumber \\&\quad - n \ln \left( \hat{\beta _0} -1\right) + {\hat{\beta }_0}\sum _{i \in \Omega } \ln \frac{x_i}{x_{\min }}, \end{aligned}$$
(9)

where n, \(n_z\), and \(n_\mathrm{c}\) are the numbers of nodes in the whole graph, the selected subgraph Z, and the complementary subgraph \(Z^c\), respectively.

3.4 Testing procedure

Due to a large set of selected subgraphs, scanning method often suffers the multiple testing problem. The Monte Carlo testing is a suggested solution to this problem Kulldorff (1997). For the case of attribute pattern, we directly apply the method provided in Kulldorff (1997) in which a randomized permutation of observation is suggested. For example, when a PL distribution is applied, the Monte Carlo procedure randomly permutes the observations and assigns the values for each node.

When a new graph is generated, we evaluate the new data by the same test statistics provided in Sects. 3.2 and 3.3. Suppose a simulation with a large number of iteration, such as 99 or 999, is executed. A Monte Carlo p value with R runs is computed as

$$\begin{aligned} p=\frac{\# \left\{ \Lambda _r \ge \Lambda _\mathrm{obs}\right\} +1}{ R+1 }, \end{aligned}$$
(10)

i.e., the probability of finding more extreme values than the observed value. If the p value is smaller than a prespecified criterion (e.g., 0.05), it is statistically significant to declare that there is a cluster. That is, the observed value is less likely to happen under the null hypothesis. The details about the Monte Carlo simulations in other distributions are referred to Kulldorff (1997).

4 Simulation study

4.1 Simulation settings

In this section, we generate a series of random data to testify the type I error and the testing power for detecting a single cluster with power-law distributions based on our proposed method. For the network structure, we restrict the study region with 100 nodes and the edges among them are set to follow a Bernoulli distribution with connection probability \(p_0=1/20\), i.e., expected degree of each node is 5. To verify the power of our proposed method, we set a cluster in the network with a variate size S (10, 15, and 20) with higher connection probability of edges (\(p_\mathrm{c}=1/4\), 1/2, 3/4, and 1). For the consideration of attribute, we set a power-law distribution for usual nodes with the parameters \(\alpha =2.5\) and \(x_{\min }=1\), and set that for cluster nodes with the lower \(\alpha\) (from 1.5 to 2.0 in steps by 0.1).

We illustrated a simulated network in Fig. 2. In this example, we set the cluster size as 20, connection probability as 1/2 for the cluster nodes and 1/20 for the usual nodes, and the parameter \(\alpha\)s of discrete PL attribute is 1.8 and 2.5 for the cluster nodes and usual nodes, respectively. The node labels are the values of discrete PL attributes. From Fig. 2, the cluster nodes clearly have higher connections than other nodes. However, it is not trivial to observe the difference by eye for the cluster of attribute, even if we set a large difference between usual nodes and cluster nodes. Thus, it is necessary to have an automatic tool to specify the clusters.

Fig. 2
figure 2

A simulated cluster network with size 20, connection probability 1/2, and discrete power-law distribution \(\alpha = 1.8\)

4.2 Type I error

For testing the feasibility of our proposed method, we conduct a simulation for type I error. In this subsection, different connection probabilities (\(p_0 =\) 1/20, 1/15, 1/10, and 1/5) are considered to check the performance of our proposed method. In each simulation case, we executed 1,000 runs to assess sampling fluctuations. If at least a subgraph among the constructed network is statistically significant, the type I error occurs. Table 1 shows that the type I errors are very consistent and well performed around 0.05 for both the discrete and continuous power-law distributions.

Table 1 Type I error

We will use \(p_0=1/20\) in the following simulations since \(p_0=1/20\) is the usual case in real data.

4.3 Testing the attribute clusters

In our past studies, we realize that the cluster size and connection strength greatly affect the detection power, so one should pay attention to these changes. The simulation for this purpose is set up as follows. \(p_\mathrm{c}\) are selected from 1/4, 1/2, 3/4, and 1; cluster sizes are selected from 10, 15, and 20; the parameters \(\alpha\) and \(\beta\) of discrete and continuous power-law attributes for cluster nodes are selected from 1.5 to 2.0 in steps by 0.1 and that of usual nodes is fixed as 2.5. We also try to set the \(\alpha\) and \(\beta\) for usual nodes as 3. However, the behavior of each node is too similar to one another (most of them are 1), and the cluster may not be easily noticeable.

Fig. 3
figure 3

The detection powers of discrete PL-attributed clusters with different structure and parameter \(\alpha\)

Fig. 4
figure 4

The detection powers of continuous PL-attributed clusters with different structure and parameter \(\beta\)

Figures 3 and 4 illustrate the detection powers for continuous and discrete PL attributes, respectively. The results show few differences between continuous and discrete cases, and cluster size does not appear to affect the testing power. In contrast, connection probability is the most important factor for detecting clusters. If the clusters are not highly connected, it is hard to see the similarity of attributes even we set a large value for cluster nodes. On the other hand, parameter value has impact when it has a lower value (smaller than 1.7). In Fig. 2, most data are 1 and few nodes have extreme values. It is hard to verify the significant difference. To see the influence of the parameter values, we list the estimation results for some selected cases in Table 2.

Table 2 Estimation results of parameter \(\alpha\) for cluster nodes

Table 2 shows the average values and standard deviations of estimations in 100 simulations for each combination case. The estimated averages are acceptable, but they are underestimated for large true values and are overestimated for small true values. In addition, the standard deviations (the bracket values) show interesting changes. When connection probability gets higher, the standard deviation gets lower. The estimations under the null hypothesis, values of \(\hat{\alpha }_0\) and \(\hat{\beta }_0\), are equally important but not included in Table 2. The average values of the parameters under null hypothesis range between 2.2 and 2.4 and the standard deviations range between 0.12 and 0.16. These results suggest that a good power can be achieved when the cluster parameter is not low.

5 Empirical study

We apply our proposed method to the authorship data in this section. The authors collaboration network was produced from the BibTeX bibliography (Beebe 2002) obtained from the Computational Geometry Database and was well organized in Pajek datasets (http://vlado.fmf.uni-lj.si/pub/networks/data/collab/geom.htm). We only consider the vertices with at least one paper in this study. Thus, the network of interest consists of 6158 vertices and 11,898 edges. The maximum number of papers for an author is 697 in this data. There are strong connections between coauthors. Most researches only consider the network structure, but few of them mentioned that the number of papers itself may form another patterns. We apply our method to the data with PL distribution as a demonstration.

Fig. 5
figure 5

Exploratory data analysis of the coauthor data

First, we inspect the data properties and check if the data follows a PL distribution. The log–log plot and Q–Q plot with estimated parameter \(\alpha\) of PL distribution are shown in Fig. 5. Clearly from Fig. 5, the coauthor data possess the features of the PL distribution such as the heavy tail and logarithm linear form, together with some biased values on both ends.

Following the testing procedures in Sect. 3, the results of structure and attribute clusters are listed in Table 3. We show the results of the most significant cluster of these two types of clusters. In addition, the Newman’s modularity method (Newman and Girvan 2004) is also applied to detect the structure clusters. The modularity measure is defined as

$$\begin{aligned} Q=\frac{1}{2|E_G|}\sum _{ij}\left( A_{ij}-\frac{k_i k_j}{2|E_G|}\right) s_i s_j=\frac{1}{4|E_G|}s^T B s, \end{aligned}$$

where the notations \(|E_G|\), \(k_i\), and \(k_j\) are equivalent to those described in Sect. 3.1, A is the adjacent matrix based on the given network, and s is a \(\pm 1\) vector in which 1 represents an element belongs to the target group and \(-1\) represents an element does not belong to the target group if only two groups are considered.

Table 3 Information of detected clusters

Table 3 lists all the results of our proposed method and the Newman’s modularity method. The bracket of the method column indicates the pattern types of the detected cluster. When considering the structure cluster (S), the scan method detect a larger cluster but with lower connection strength within the cluster \(S_Z.\) The modularity values Q of the Scan(S) and Newman(S) are interesting. It is common that Newman’s method usually finds the best modularity when considering the existence of two clusters. However, the scan method is able to a higher modularity (although there is more than one cluster detected), because the scan method is more flexible to find the clusters in this case. Compare the attribute cluster with the structure clusters, the attribute cluster (A) detected by the proposed method is apparently larger than the structure clusters detected by the other two method. That means the expansion of the attribute seems quicker than the network connection.

6 Discussion

In this paper, we extend the scan statistic to consider PL distributions in both discrete and continuous cases for testing the attribute clusters in networks. We first review the properties and the estimations of PL distributions, and then we generate the likelihood ratio test statistic of PL distributions for the scanning method. We further construct simulations to verify the feasibility of our detecting approach. Finally, we use the method to analyze the authorship data and compare the results with the modularity method from (Newman and Girvan 2004).

In practice, the scanning method can be applied to other distributions such as binomial, Poisson, normal, and multinomial distributions which were constructed in spatial data analysis. The power-law distribution, which is rarely mentioned in many statistical applications, is however one of the most important distributions in social networks. We believe the potential of extending this distribution to other statistical applications. Besides, a truncated power-law distribution draws much attention in recent years (Burroughs and Tebbens 2001). Some studies believe that it is more realistic for real data, but the estimation is more difficult.

There are few problems of the scanning method. One of the most obvious problem is the computing burden. Because the scan statistics are decided by the scanning windows, the number of scanning windows is tremendous in a large network. Furthermore, we applied the Monte Carlo method to obtain a statistical significance. This also increases the computing burden. The other problem is the connection probability and clustered size can influence the detection power. We are not sure if this problem exists in a large but spare network like the authorship data in this study, since it is not feasible in terms of the computing loading for our proposed method to test all generated windows and verify clusters by Monte Carlo procedure in such a large network. To resolve the computing burden listed above, We are looking forward to parallel computing method which can evaluate the Monte Carlo simulations at the same time to facilitate the computing efficiency. In addition, we also try to create a new algorithm to reduce the number of scanning windows rather than searching the whole region.

We only generate circularly scanning windows in this study, but there are many different ways to construct these windows, such as elliptical shape windows (Kulldorff et al. 2006) and flexible scanning windows (Tango and Takahashi 2005) proposed in the spatial statistics. It is a vital future work that we will try to utilize these diverse methods to see if the detection accuracy can be improved.