1 Introduction

The World-Wide-Web (WWW) encodes associative links among a large number of pages. Its structure has grown without any central control, which makes it comparable to the outcome of a random process in which pages link to each other following local probabilistic rules.

Such probabilistic rules are defined through statistical properties of Web graph features. In particular, several investigations show that the WWW is scale-free [1, 5, 10], i.e., both the distributions of incoming and outgoing links are well approximated by a discrete power law [21]. This reflects the fact that the vast majority of documents in the Web have relatively few outgoing and incoming links, while a few pages have an enormous number of links, skewing the mean of the empirical distribution far above the median.

Nonetheless, when analyzing specific portions of the Web, i.e. websites, the scale-free property seems less evident, especially for specific categories (e.g. university homepages) [22, 24]. Note that, differently from what is commonly done in the literature [22], we consider websites as closed sub-systems of the Web whose temporal evolution is independent of the larger system within which they evolved.

In this work, we are interested in developing a method able to assess whether empirical observations follow a power law. Testing power laws on empirical data is usually hard due to the large fluctuations present in the tail of the distribution.

One of the most commonly used statistical tests is the Kolmogorov-Smirnov test [11]. This statistic focuses on the center of the distribution, which makes it unsuitable for testing heavy-tailed distributions. In [11], the authors rely heavily on this test within a bootstrap procedure that behaves well only in small sample size regimes. Indeed, as the sample size grows, the power of the statistical test increases, leading to a higher rate of rejections of the null hypothesis. Moreover, even with small sample sizes, adding a small amount of noise may cause the test to reject.

Since noisy or large samples are the common scenario in the real world, we propose an alternative testing pipeline that leverages the Anderson-Darling test [3] and Monte Carlo sub-sampling. Our pipeline copes with the test-power problem by reducing the sample size while preserving the behavior of the original degree distribution.

We show synthetic experiments in which the state-of-the-art method fails under small perturbations or large sample sizes of the input data. In all these cases, our method proves more stable under variations and provides results with better confidence. Lastly, we present case studies on 3 websites representative of different generative processes. These case studies show that closed sub-portions of the Web do not necessarily follow a power-law distribution, and they suggest that the more centralized the generative process, the less the degree distribution can be associated with a power-law decay.

Outline. The remainder of the paper is organized as follows: Sect. 2 presents the state-of-the-art algorithm for testing empirical power-law distributions; Sect. 3 presents the limitations of such a method together with related synthetic examples; Sect. 4 presents our adaptation based on Monte Carlo sub-sampling, which overcomes the issue of test power on empirical data; Sect. 5 presents a large variety of experiments showing that our method is more stable, together with the case studies; lastly, Sect. 6 concludes with a discussion of the obtained results and future research directions.

2 Discrete Power-Law Distribution: Definition, Fit and Statistical Test

The discrete power-law distribution is defined as

$$\begin{aligned} \mathbb {P}(d_{v}=x)\approx \frac{1}{\xi (x_{min},\alpha )}\, x^{-\alpha }, \end{aligned}$$
(1)

where \(d_v\) is the random variable representing the degree of a node v, \(x_{min}\) is a fixed lower bound on the values x, \(\alpha \) is a scaling parameter, and \(\xi (x_{min},\alpha )=\sum \limits _{x=x_{min}}^{\infty }x^{-\alpha }\) is the Hurwitz zeta function [15].

The parameter \(x_{min}\) is particularly important, as often the degree distribution of a network follows a power law only for degrees x greater than a lower bound. A network is said to be scale-free if the tail of its in-degree and out-degree distributions obeys a discrete power-law decay. In practice, this entails that we have a non-negligible probability of observing nodes with a degree much greater than average (hubs).
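As an illustration, the pmf of Eq. (1) can be written in a few lines of Python, using SciPy's Hurwitz zeta function for the normalization; this is a minimal sketch, with parameter values chosen purely for the example:

```python
import numpy as np
from scipy.special import zeta  # zeta(s, q) = sum_{k=0}^inf (k + q)^(-s)

def discrete_power_law_pmf(x, alpha, x_min):
    """P(d_v = x) = x^(-alpha) / xi(x_min, alpha) for x >= x_min, cf. Eq. (1)."""
    x = np.asarray(x, dtype=float)
    norm = zeta(alpha, x_min)  # xi(x_min, alpha) in the paper's notation
    return np.where(x >= x_min, x ** (-alpha) / norm, 0.0)

# Example: with alpha = 2.7 and x_min = 7 (the values used later in Sect. 3),
# the pmf sums to ~1 over the (truncated) support x >= x_min.
xs = np.arange(7, 100_000)
print(discrete_power_law_pmf(xs, alpha=2.7, x_min=7).sum())
```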

2.1 Maximum Likelihood Estimation

The parameters \(x_{min}\) and \(\alpha \) of an empirical power-law distribution need to be estimated from data. Given as input a vector \(\mathbf {x}\in \mathbb {N}^n\) representing the degrees of n nodes of a graph, we need to perform two different procedures to estimate these two parameters, as described by the pseudo-code in Algorithm 1.

Estimate of \(x_{min}\). First, we pick \(\hat{x}\) as the value that minimizes the difference between the empirical degree distribution and the power-law model fitted with \(x_{min}=\hat{x}\) [11, 12].

In order to minimize such a difference, we need to select a suitable distance. One of the most common is the Kolmogorov-Smirnov (KS) statistic, which is defined as the supremum norm of the difference between the cumulative distribution functions (CDFs) of the empirical data and the best-fit model [18]. Although the KS statistic is widely used, it presents some drawbacks in the detection of heavy-tailed distributions since, being based on the CDF, it mainly penalizes fluctuations in the center of the empirical distribution. A more reliable distance for the comparison of heavy-tailed distributions is the Anderson-Darling (AD) statistic, as it gives more weight to the extreme values of the CDFs [3]. For this reason, we will resort to this statistic in the rest of the paper. The AD distance is defined as

$$\begin{aligned} A^2(\mathbf {x}, F_{x_{min}=x}) = -n-\sum \limits _{i=1}^n\frac{2i-1}{n}\bigg [\ln F_{x_{min}=x}(x_{i})+\ln (1-F_{x_{min}=x}(x_{n+1-i}))\bigg ], \end{aligned}$$

where n is the sample size and \(F_{x_{min}=x}\) is the power-law CDF.
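A direct NumPy transcription of this formula is given below; it assumes the model CDF is passed as a callable, and it sorts the sample in ascending order as the statistic requires:

```python
import numpy as np

def anderson_darling(x, cdf):
    """A^2 between the sorted sample x and a model CDF given as a callable."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = np.clip(cdf(x), 1e-12, 1 - 1e-12)  # guard the logarithms at 0 and 1
    return -n - np.sum((2 * i - 1) / n * (np.log(F) + np.log(1 - F[::-1])))
```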

Note that, if we select \(\hat{x} > x_{min}\), we reduce the size of our training data, and our model will suffer from the statistical fluctuations in the tail of the empirical distribution. On the other hand, if \(\hat{x} < x_{min}\), the maximum likelihood estimate of the scaling parameter \(\hat{\alpha }\) may be severely biased.
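The scan over candidate lower bounds can be sketched as follows, reusing the helpers above. The truncation of the CDF support at a large x_max is ours, for tractability, and the scaling parameter at each candidate is computed with the MLE approximation given in the next subsection:

```python
def power_law_cdf(x, alpha, x_min, x_max=10**6):
    """Discrete power-law CDF, truncated at x_max for tractability."""
    support = np.arange(x_min, x_max + 1)
    cum = np.cumsum(discrete_power_law_pmf(support, alpha, x_min))
    idx = np.clip(np.searchsorted(support, x), 0, len(cum) - 1)
    return cum[idx]

def estimate_xmin(degrees, min_tail=25):
    """Pick the candidate x_min whose fitted power law is closest to the
    empirical tail in Anderson-Darling distance (sketch of Algorithm 1)."""
    degrees = np.asarray(degrees)
    best_x, best_dist = None, np.inf
    for x in np.unique(degrees):
        tail = degrees[degrees >= x]
        if len(tail) < min_tail:  # keep enough tail points, cf. Sect. 4.1
            break
        # MLE approximation of alpha for this candidate (see Sect. 2.1)
        alpha = 1.0 + len(tail) / np.sum(np.log(tail / (x - 0.5)))
        d = anderson_darling(tail, lambda t: power_law_cdf(t, alpha, x))
        if d < best_dist:
            best_x, best_dist = x, d
    return best_x
```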

[Algorithm 1: pseudo-code of the power-law fit, i.e. the estimation of \(x_{min}\) and \(\alpha \)]

Estimate of \(\alpha \). Given the lower bound \(x_{min}\), we estimate the scaling parameter \(\alpha \) by means of maximum likelihood, which provides consistent estimates in the limit of large sample sizes [13].

In the discrete case, a good approximation of the true scaling parameter is obtained mostly in the \(x_{min}\ge 6\) regime [11], and it can be computed as:

$$\begin{aligned} \hat{\alpha }\approx 1+n\bigg [\sum \limits _{i=1}^n\ln {\frac{x_{i}}{x_{min}-\frac{1}{2}}}\bigg ]^{-1}. \end{aligned}$$
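A direct transcription of this estimator, assuming the input is already the tail sample (i.e. the degrees no smaller than \(x_{min}\)):

```python
def estimate_alpha(tail, x_min):
    """MLE approximation of the scaling parameter on the tail sample."""
    tail = np.asarray(tail, dtype=float)
    return 1.0 + len(tail) / np.sum(np.log(tail / (x_min - 0.5)))

# usage: estimate_alpha(degrees[degrees >= x_hat], x_hat)
```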

2.2 Goodness-of-Fit Test

[Algorithm 2: pseudo-code of the goodness-of-fit test via semi-parametric bootstrap]

Once \(\hat{\alpha }\) and \(\hat{x}\) have been estimated, we need to assess whether the observed data are plausibly sampled from the corresponding power-law distribution. To this end, we perform a goodness-of-fit (GoF) test procedure [19].

A goodness-of-fit test measures how well a statistical model fits a set of observations. Given the statistical model under testing, a GoF test uses a statistic that evaluates the discrepancy between the observed values and the values expected under the model. By definition, a statistic is a function that does not depend on the parameters of the model. The output of the GoF procedure is a p-value, corresponding to the probability that the statistic is greater than its realization on the observed data.

Note that, since we estimate the model parameters from the data, we do not know the distribution of the statistic. Thus, we follow a semi-parametric bootstrap approach to estimate such a distribution empirically [11, 25].

In particular, we fix the Anderson-Darling distance as our statistic and perform the procedure described in Algorithm 2. Given n samples, we denote by \(n_{tail}\) the number of samples greater than \(\hat{x}\). Bootstrap is then performed by simulating \(n_{tail}\) examples from a power law with parameters \(\hat{\alpha }\) and \(\hat{x}\), while for the remaining \(n - n_{tail}\) we sample degrees from the empirical data smaller than \(\hat{x}\). We repeat this procedure M times. The value of M depends on the desired accuracy of the p-value: if we want a p-value that approximates its true value with an error smaller than \(\epsilon \), then \(M=\frac{1}{4\epsilon ^{2}}\) (e.g., \(\epsilon = 0.05\) requires \(M = 100\)).

Given the M simulated data sets, we fit to each of them its own power-law model and compute the AD distance. This provides the empirical distribution of the AD statistic, which we use to compute the associated p-value, defined as the fraction of synthetic distances larger than the observed one.
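The whole procedure can be sketched as follows, reusing the helpers above. The tail convention (\(\ge \hat{x}\)), the with-replacement resampling of the body, and the truncation of the sampler's support at a large x_max are our assumptions:

```python
def sample_discrete_power_law(n, alpha, x_min, rng, x_max=10**6):
    """Draw n values from the discrete power law of Eq. (1), truncated."""
    support = np.arange(x_min, x_max + 1)
    pmf = discrete_power_law_pmf(support, alpha, x_min)
    return rng.choice(support, size=n, p=pmf / pmf.sum())

def gof_pvalue(degrees, alpha_hat, x_hat, M=200, rng=None):
    """Semi-parametric bootstrap p-value (sketch of Algorithm 2)."""
    rng = np.random.default_rng(rng)
    degrees = np.asarray(degrees)
    tail = degrees[degrees >= x_hat]  # tail convention assumed: >= x_hat
    body = degrees[degrees < x_hat]
    observed = anderson_darling(tail,
                                lambda t: power_law_cdf(t, alpha_hat, x_hat))
    exceed = 0
    for _ in range(M):
        # n_tail synthetic points from the fitted power law ...
        sim = sample_discrete_power_law(len(tail), alpha_hat, x_hat, rng)
        # ... plus n - n_tail points resampled from the empirical body
        if len(body):
            sim = np.concatenate([sim, rng.choice(body, size=len(body))])
        x = estimate_xmin(sim)  # refit each synthetic data set on its own
        a = estimate_alpha(sim[sim >= x], x)
        exceed += anderson_darling(sim[sim >= x],
                                   lambda t: power_law_cdf(t, a, x)) > observed
    return exceed / M
```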

If p is large (relative to a fixed significance level, e.g. 0.1), we cannot reject the null hypothesis: the difference between the empirical and theoretical distributions may plausibly be attributed to statistical fluctuations. Conversely, if p is smaller than the significance level, we conclude that the empirical data are not power-law distributed.

Fig. 1. On the left, the empirical probability density functions of true power-law data (black line) and noisy power-law data (pink). On the right, the Anderson-Darling test on both samples. Small variations from an exact power-law sample lead to rejecting the null hypothesis.

Fig. 2. On the left, the empirical probability density function of true power-law data. On the right, the Anderson-Darling test. A large sample size (\(5\times 10^{5}\)) leads to rejecting the null hypothesis.

3 Problems of Goodness-of-Fit on Empirical Data

Testing whether empirical data are power-law distributed is a hard task, for the following reasons: a) the probability of rejecting the null hypothesis grows with the sample size; and, as a consequence, b) the procedure is overly sensitive to even a minimal amount of noise. Little attention has been paid to these issues, but we argue that they are crucial, as they heavily affect the final response of the statistical test.

In particular, both problems can be traced back to the power of the test which, for a fixed significance level, is defined as the probability of correctly rejecting the null hypothesis. This probability increases with the sample size; hence, when the number of nodes n is large, we tend to reject the null hypothesis even for truly power-law distributed data (as the power of the test is very close to 1). Indeed, by performing bootstrap, we simulate nearly exact power-law samples, which makes the Anderson-Darling test very sensitive to even minimal fluctuations in the observed distribution.

In Figs. 1 and 2, we show two synthetic experiments where the test fails. In particular:

(a) we generated \(n=10^{5}\) samples from a discrete power-law distribution with parameters \(x_{min}=7\) and \(\alpha =2.7\); we perturbed the data by adding one occurrence to each of the 13 largest degrees in the extreme tail (see Fig. 1, left panel, for the true and perturbed data);

(b) we generated \(n=5\times 10^{5}\) samples from a discrete power-law distribution with parameters \(x_{min}=2\) and \(\alpha =2.7\).

We applied the procedure of Sect. 2 to both datasets, with \(M = 200\) and the significance level set to 0.1. Results are shown on the right side of Figs. 1 and 2. In Fig. 1, the empirical probability density functions of the two samples are indistinguishable from each other except in the extreme tail, where small divergences can be traced. It thus becomes evident that for large sample sizes the test is very sensitive even to little fluctuations in the observed sample. Moreover, example (b) shows that even perfect power-law samples induce the test to fail when the sample size is too large (Fig. 2).
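Both experiments can be reproduced in a few lines with the helpers above; note that our reading of perturbation (a), one extra occurrence of each of the 13 largest observed degrees, is an assumption:

```python
rng = np.random.default_rng(0)

# (a) perturbed power law: one extra occurrence of each of the 13
#     largest observed degrees (assumed reading of the perturbation)
x_a = sample_discrete_power_law(10**5, alpha=2.7, x_min=7, rng=rng)
x_a_noisy = np.concatenate([x_a, np.sort(np.unique(x_a))[-13:]])

# (b) exact power law, but with a very large sample size
x_b = sample_discrete_power_law(5 * 10**5, alpha=2.7, x_min=2, rng=rng)

for sample in (x_a_noisy, x_b):
    x_hat = estimate_xmin(sample)
    a_hat = estimate_alpha(sample[sample >= x_hat], x_hat)
    print(gof_pvalue(sample, a_hat, x_hat, M=200, rng=rng))
```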

Both examples show that the high power of the Anderson-Darling test in large sample size regimes constitutes a drawback of the previously introduced method [11]. Since an observed degree distribution is never exactly drawn from a discrete power law, we propose a variation of the method of Sect. 2 aimed at testing the goodness of fit of heavy-tailed distributions.

Fig. 3. Schematic representation of the proposed pipeline.

4 Monte Carlo Approach

Our proposal is based on the idea of performing iterative Monte Carlo (MC) sub-samplings of different lengths from the original degree sequence. We argue that with this sub-sampling scheme we can reduce the sample size without modifying the trend of the original degree distribution, thus obtaining a more reliable test.

The global scheme of the procedure is shown in Fig. 3. In particular, we define a set of lengths \(\{l_{1},\dots ,l_{max}\}\) and, for each length, we perform r corresponding MC samplings. For each sample, we fit a power-law distribution and assess its plausibility with Algorithm 1 and Algorithm 2, thus obtaining a sequence of r p-values of the Anderson-Darling test. As the final output of the procedure, we take the mean of all the p-value sequences across the different lengths, together with the related standard deviation.

To the best of our knowledge, MC sub-sampling is not commonly exploited to test for power-law decay in degree distributions. Indeed, performing MC does not allow to exactly estimate the parameters of the power-law distribution, since each sub-sample may yield a different set of parameters. Nonetheless, we do not use MC as a fitting method, but rather to assess whether a network plausibly satisfies the scale-free property asymptotically. We argue that using MC to obtain suitable sub-samples of smaller size provides a better understanding of the behavior of the degree sequence while overcoming the drawbacks induced by large sample sizes.
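A minimal sketch of the pipeline in Fig. 3, reusing the fitting and testing helpers of Sect. 2; drawing the sub-samples without replacement is our assumption:

```python
def mc_bootstrap_test(degrees, lengths, r=10, M=200, rng=None):
    """Monte Carlo + Bootstrap (sketch): for each sub-sample length,
    draw r sub-samples, run Algorithms 1 and 2 on each, and aggregate
    the resulting p-values into a mean and standard deviation."""
    rng = np.random.default_rng(rng)
    degrees = np.asarray(degrees)
    pvals = []
    for l in lengths:
        for _ in range(r):
            sub = rng.choice(degrees, size=l, replace=False)  # assumption
            x_hat = estimate_xmin(sub)
            a_hat = estimate_alpha(sub[sub >= x_hat], x_hat)
            pvals.append(gof_pvalue(sub, a_hat, x_hat, M=M, rng=rng))
    return np.mean(pvals), np.std(pvals)

# Sect. 4.1 / Sect. 5: lengths span [n/2, n], e.g. 30 equally spaced values
# n = len(degrees); lengths = np.linspace(n // 2, n, 30, dtype=int)
```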

4.1 Instantiation of Parameters

In order to apply the Monte Carlo approach we need to fix different values, specifically \(l_1\), \(l_{max}\), r and the significance level.

Selecting adequate lengths for the MC sub-samples is not trivial. On the one hand, a too small sub-sample would lead to very different degree sequences, due to the large fluctuations present in the original network; on the other hand, lengths close to that of the original degree sequence would lead to higher rates of rejection of the power-law hypothesis. We therefore set \(l_1 = n/2\), half the length of the observed data. As for \(l_{max}\), we fix it to n: in the case of true power-law samples we want to be able to obtain a high p-value, while in the case of noisy data, including one length equal to the original size does not particularly affect the resulting mean p-value.

The value of r affects the robustness of the final result: the more repetitions, the better the approximation of the true p-value. Nonetheless, its value depends on the available computational power; thus, we leave the choice of this value to the user.

We fixed the significance level at 0.1 for the rejection of the null hypothesis. This is a conservative choice, implying that the power-law hypothesis is ruled out if there is a probability of 1 in 10 or less that data sampled from the true model agree with the model as poorly as the empirical data do.

Lastly, we fixed the maximal admissible \(x_{min}\) so that at least 25 observations separate it from the maximal observed degree. This limits the chances of fitting a power-law distribution on too few observations.

5 Experimental Results

In order to evaluate the performance of the proposed pipeline, we perform four experiments and compare the results with the state-of-the-art method. In the rest of the paper, we will refer to the state-of-the-art method as Bootstrap and to our method as Monte Carlo + Bootstrap.

All the simulations are performed in Python. We used the package powerlaw [2] to fit power-law distributions to empirical data and to compute the AD distances. We provide all the notebooks used for the experiments of this paper in a GitHub repository. For all experiments, we fixed 30 lengths of Monte Carlo re-sampling in the interval \([\frac{n}{2},n]\) and, for each of these lengths, we drew \(r=10\) re-samplings.
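For reference, a hedged sketch of how the powerlaw package can be used for the fitting step; the xmin_distance='Asquare' option (which selects \(x_{min}\) by an Anderson-Darling-type distance) is our assumption about the package version used, and the input data here is a stand-in Zipf sample rather than a real degree sequence:

```python
import numpy as np
import powerlaw  # [2]

data = np.random.zipf(2.7, 10_000)  # stand-in degree sequence
fit = powerlaw.Fit(data, discrete=True, xmin_distance='Asquare')
print(fit.xmin, fit.power_law.alpha)  # estimated lower bound and alpha
```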

5.1 Validation of the Proposed Method on Different Graph Models

In the first experiment, we aim at verifying whether Monte Carlo + Bootstrap is comparable to plain Bootstrap, considering two cases at varying sample sizes:

1. Erdős-Rényi models of size \(\{75\times 10^{3},15\times 10^{4}\}\): we expect both methods to reject the null hypothesis, as the degree distribution of this model is known to follow a binomial distribution [14]. Thus, we use this as a base test to assess the probability of correctly rejecting the power-law hypothesis.

2. Barabási-Albert models of size \(\{75\times 10^{3},15\times 10^{4},3\times 10^{5}\}\): we expect both methods to yield high p-values, as the degree distribution follows a power law [6]. We use this experiment to provide proof of the soundness of the method in the presence of true power-law data.

Each experiment listed above is repeated 10 times to estimate the mean and standard deviation of the p-values. Results are reported in Table 1, where we observe that our approach (Monte Carlo + Bootstrap) always rejects the null hypothesis in the Erdős-Rényi case, as the Bootstrap method does, while in the Barabási-Albert case it always provides p-values with a smaller variance.

Table 1. Results assessing the goodness of the proposed testing pipeline on scale-free graphs (Barabási-Albert) and non-scale-free graphs (Erdős-Rényi), in terms of mean p-value and standard deviation over 10 repetitions of the test for different sample sizes.

5.2 Robustness to Noise

We now want to assess whether our method is indeed more robust under increasing noise in the input empirical distribution. We simulated a sample of size \(n=10^{5}\) from a discrete power law with parameters \(\alpha =2.3\) and \(x_{min}=1\). For different noise levels \(\bar{n} \in \{10,40,70,100\}\), we perturbed the power-law observations by adding \(\bar{n}\) values uniformly sampled from the original observations.
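In code, the perturbation amounts to appending \(\bar{n}\) values resampled uniformly (with replacement, our assumption) from the original sample; a sketch reusing the helpers above:

```python
rng = np.random.default_rng(0)
x = sample_discrete_power_law(10**5, alpha=2.3, x_min=1, rng=rng)
n = len(x)
lengths = np.linspace(n // 2, n, 30, dtype=int)
for n_bar in (10, 40, 70, 100):
    noisy = np.concatenate([x, rng.choice(x, size=n_bar)])
    mean_p, std_p = mc_bootstrap_test(noisy, lengths, r=10)
    print(n_bar, mean_p, std_p)
```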

Fig. 4. Results in terms of p-values for the two testing pipelines as the input data present an increasing number of noisy observations.

Figure 4 shows that the proposed method is, on average, always better than the simple bootstrap approach, while also providing a smaller variance. Moreover, it never rejects the null hypothesis when the noise is small, while it sometimes rejects it in the presence of a high amount of noise (100 added observations). By contrast, the Bootstrap approach, depending on the simulated sample, sometimes rejects the null hypothesis even in the presence of zero noise.

5.3 Benchmark on University of Notre Dame Website

We exploit a widely studied example of empirical data that is assumed to follow a power-law distribution [1, 4, 20], i.e. the web graph of the University of Notre Dame website. This graph was studied in 1999 to obtain information regarding the topology of the Web. In [1], the authors found that the in-degree and out-degree distributions of the graph underlying the hyperlink structure of the domain nd.edu were well approximated by power-law distributions with scaling parameters 2.7 and 2.1, respectively. We downloaded the hyperlink graph from http://snap.stanford.edu/ [16]; the crawl consists of 325729 documents and 1497134 links. We tested the Monte Carlo + Bootstrap approach against the Bootstrap approach on the in-degree distribution of the network, as the empirical data are noisy and we want to provide further validation of our testing procedure.
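A sketch of how the in-degree sequence can be extracted from the SNAP edge list; the archive name web-NotreDame.txt.gz is our assumption about the distributed file:

```python
import numpy as np
import networkx as nx

# read_edgelist transparently decompresses .gz files
G = nx.read_edgelist('web-NotreDame.txt.gz', comments='#',
                     create_using=nx.DiGraph, nodetype=int)
in_degrees = np.array([d for _, d in G.in_degree()])
print(G.number_of_nodes(), G.number_of_edges())  # expected: 325729, 1497134
```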

We performed \(r=5\) MC re-samplings for each of several sizes equally spaced in the interval [162864, 325729]. Monte Carlo + Bootstrap results in a mean p-value of 0.15, meaning that there is no strong evidence against the power-law hypothesis for the in-degree distribution. In contrast, applying the Bootstrap method yields a p-value equal to 0.00, which would lead us to reject the null hypothesis.

Since many works in the literature have argued for the power-law nature of this graph, this result suggests that our testing procedure is more robust and can thus be applied to real-world data with higher reliability.

5.4 Websites Analysis

We now want to apply our procedure in real scenarios to seek evidence of differences in the degree distributions deriving from different generative processes. We considered three websites that we deemed representative of different content-creation strategies: e-commerce, academic, and free encyclopedia. The first category is typically characterized by strong central control over the design and evolution of the information architecture and content generation. Conversely, the last category is completely user-guided, and its evolution is thus likely to be random. We argue that the academic category, as well as the websites of other complex institutions, should be a trade-off between the two, as usually many contributors have access to writing and adding content under mild central control (Fig. 5).

Fig. 5. Log-log plots of the empirical distributions of the considered case studies.

Table 2. Analyzed websites, with the related information about number of nodes, number of edges, and category.

We consider the following websites:

1. Goop, the website of a wellness and lifestyle company. We crawled the entire website using the open-source framework Scrapy; during the crawl, we restricted to the domains goop.com and shop.goop.com;

2. Stanford, the website of Stanford University. We downloaded a crawl performed in 2002, available at http://snap.stanford.edu/;

3. Wikipedia (ES), the website of the free Spanish-language encyclopedia. We downloaded a crawl from 2013 at http://law.di.unimi.it/index.php [8, 9].

Table 2 describes the characteristics of the three considered websites in terms of category, number of nodes, and number of edges. Table 2 also reports the mean p-values obtained with Monte Carlo + Bootstrap on the in-degree distributions. The results seem to validate our hypothesis of an inverse correlation between the centralization of the content generative process and the scale-free property.

6 Discussion

In this paper, we proposed a method for hypothesis testing of power-law distributions in empirical data that overcomes issues related to the power of the test. In particular, our method mitigates the effect of possibly noisy data through Monte Carlo sub-samplings of the empirical distribution. We verified that the proposed method retains the ability to assess whether observations are plausibly sampled from a power law, under different sample sizes and levels of noise; indeed, the method is more reliable than the state-of-the-art on synthetic data. To further assess the reliability of our approach, we also provided a real-world example, the University of Notre Dame website, which is a well-studied dataset and is considered to be scale-free. Our method does indeed provide a p-value higher than the significance level, differently from the state-of-the-art method, which rejects the null hypothesis.

This allowed us to use our method to test different websites corresponding to different content generative processes. From these first insights, we observed that different content generation strategies may induce different connectivity structures in the hyperlink graph.

For future research, we intend to increase the number of real networks studied and to consider current websites related to different generative processes, in order to provide a more comprehensive understanding of specific sub-categories of the Web.

Future research directions may also involve the use of random walks instead of Monte Carlo as a sub-sampling technique on graphs [7, 17] and the comparison with other estimators of power laws in empirical data [23].

To conclude, our pipeline is an attempt to perform statistical testing while accounting for its limits, both theoretical and due to the noisiness of data. We argue that this is fundamental to reliably test assumptions on real-world examples.