Keywords

1 Introduction

Cooperation is among the most fundamental behaviors of living creatures [1]. Animals cooperate in various activities: from hunting and forming territories to grooming and child raising [2]. Humans are the experts of cooperation. From cooperation between states [3], companies [4] to cooperation between individuals [5], it is the bedrock of our society.

As a form of human cooperation, scientific collaboration is the backbone of the scientific world. In this process, scientists share their ideas, their time, and their skills with each other in order to push the boundary of knowledge. Since the start of the twentieth century, the number of scientific articles with more than one author has grown to more than three times the number of single-author articles [6]. There is accumulating evidence that articles resulted from collaborations are cited more frequently than non-collaborated ones [6, 7]. Since the number of citations is the main metric of scientific impact [8], collaborations thus lead to high impact research. Therefore, understanding the evolution of a scientific co-authorship network, in particular, understanding how new collaborations are fostered, is significantly important for policy makers, funding agencies, university managers as well as each scientist.

To understand the evolution of a scientific co-authorship network, it is beneficial to consider this evolution in a larger context of the evolution of a complex network. Two defining characteristics of an evolving complex network are the heavy-tail of the degree distribution and the high value of the clustering coefficient [9]; both are often represented at the same time in scientific co-authorship networks [10, 11].

On the one hand, to explain the heavy tail property, complex network studies have proposed the preferential attachment (PA) mechanism where the probability that a node with degree k receives a new link is proportional to the PA function \(A_k\) [12, 13]. When \(A_k\) is an increasing function on average, nodes with higher numbers of links will receive more new links, and thus hubs are formed and the heavy-tail degree distribution emerges.

On the other hand, one of the most simple mechanisms to explain the high value of the clustering coefficient is transitivity where the probability that a pair of nodes with b common neighbors receives a new link is proportional to the transitivity function \(B_{b}\) [14]. When \(B_{b}\) is an increasing function on average, more triangles are formed between sets of three nodes, and this leads to an increase in the clustering coefficient.

Existing approaches either estimate one mechanism in isolation [13,14,15] or estimate jointly the two mechanisms assuming some parametric forms for \(A_k\) and \(B_b\) [16, 17]. On the one hand, estimating either mechanism in isolation often leads to poor fit, since many real-world networks exhibit simultaneously heavy-tail degree distribution and high-clustering. On the other hand, it is difficult to justify a particular choice of functional forms used in parametric estimation methods. A non-parametric estimation method would allow the functional forms to be learnt from the observed data.

Estimating \(A_k\) and \(B_b\) is the first step towards answering the question of what matters more in the evolution of a complex network: transitivity or PA. While there is some research studying a similar question regarding PA and fitness [18], the question regarding PA and transitivity has curiously remained unexplored, despite its potential to provide deeper understanding on how new cooperation is fostered.

Our contribution is threefold. In our first contribution, we propose a method for the non-parametric joint estimation of the PA function \(A_k\) and the transitivity function \(B_b\). In our model, the probability that a new edge emerges between node i and node j at time step t is

$$\begin{aligned} P_{ij}(t) \propto A_{k_i(t)}A_{k_j(t)}B_{b_{ij}(t)}, \end{aligned}$$
(1)

where \(k_i(t)\) and \(k_j(t)\) are the degrees of nodes i and j at time-step t, respectively, and \(b_{ij}(t)\) is the number of common neighbors between i and j at that time-step. We estimate \(A_k\) and \(B_b\) by maximum likelihood estimation. The log-likelihood of the observed data is maximized by a Minorize-Maximization algorithm [19]. This is an iterative algorithm that increases the log-likelihood value at each iteration until convergence. Although our primary targets for estimation are the non-parametric functions \(A_k\) and \(B_b\), one can apply the same maximum likelihood estimation method for the power-law parametric model where we assume \(A_k = (k + 1)^\alpha \) and \(B_b = (b + 1)^\beta \). The method is discussed more in details in Sect. 2.

We use the co-authorship network between scientists in the complex network field [20] to illustrate the goodness-of-fit of three estimation methods: (a) jointly estimating the two mechanisms, i.e., the proposed method in Eq. (1), (b) estimating the PA function \(A_k\) in isolation, i.e., \(P_{ij}(t) \propto A_{k_i(t)}A_{k_j(t)}\), and (c) estimating the transitivity function \(B_{b}\) in isolation, i.e., \(P_{ij}(t) \propto B_{b_{ij}(t)}\). In each case, after estimation, starting from the same initial snapshot as in the real-world data, we simulated the new edges based on \(P_{i,j}(t)\) while we kept the numbers of new nodes and new edges at each time-step exactly the same as in the real-world network. Figure 1 shows the clustering coefficient C in the final snapshot of the simulated networks of each case over 100 replications. Only the proposed method of joint estimation could satisfactorily reproduce the true clustering coefficient \(C_{*}\). This implies that the true growth mechanism in this network, whatever it might be, is closer to PA + transitivity than either PA or transitivity alone. Using, for example, transitivity alone to explain the growth of the network would lead to over-estimation of the transitivity function \(B_b\), since the edge-increasing effect of the neglected PA mechanism has to be explained by \(B_b\) alone. This over-estimation, in turn, would lead to an overly high C as can be seen in Fig. 1.

Fig. 1.
figure 1

Simulated clustering coefficients in three cases: the proposed method (jointly estimating the two functions), estimating the PA function in isolation, and estimating the transitivity function in isolation in the co-authorship network between scientists in the complex network field. The number of replications is 100. The plotted box in each case shows the population mean and its \(\pm 2\sigma \) confidence interval. The red dashed line represents the value of the true clustering coefficient.

In our second contribution, we propose a method to quantify the contributions of PA and transitivity mechanisms in the growth process of a network, and thus provide a way to answer the question of what matters more: PA or transitivity. The method is discussed in Sect. 3.

In our third contribution, we analyzed three scientific co-authorship networks between scholars in the complex network field [20], physicists in high-energy physics [21], and authors in the Strategic Management Journal (SMJ) [22]. The results are shown in Sect. 4. As opposite to conventional parametric approaches, the non-parametric approach provides us a chance to investigate finer details of PA and transitivity mechanisms. While \(A_k\) in CMP and SMJ is increasing in general, its trend is complex in HEP: it is first increasing, then decreasing, and finally increasing again. On the other hand, while \(B_b\) in HEP shows a clearly increasing trend, it is flat or even decreasing in CMP and SMJ when we exclude \(b = 0\).

Furthermore, in all networks, having at least one common collaborator with another scientist increases at least five times the chance that one will collaborate with that scientist in the future, which is consistent with previous studies [14]. This implies that, if one wants to boost their number of collaborators, one should collaborate with well-connected scientists.

By quantifying the contribution of each mechanism in the growth process, we also found that while transitivity dominates PA in the high-energy physics network, preferential attachment triumphs transitivity in the remaining networks. Concluding remarks are given in Sect. 5.

2 Maximum Likelihood Estimation

In this section, we describe the maximum likelihood estimation method for the model described in Eq. (1). Let \(k_{max}\) be the maximum value of the degree of a node and \(b_{max}\) be the maximum value of the number of common neighbors between a pair of nodes. The variables we want to estimate are \(\varvec{A} = [A_0, A_1,\ldots , A_{k_{max} - 1}]\) and \(\varvec{B} = [B_0, B_1,\ldots , B_{b_{max}-1}]\). The log-likelihood function is:

$$\begin{aligned} l(\varvec{A}, \varvec{B})&= \sum _{t=1}^{T}\sum _{k_1=0 }^{k_{max}-1}\sum _{k_2=k_1}^{k_{max}-1}\sum _{b=0}^{b_{max}-1} m_{k_1,k_2,b}(t) \log {A_{k_1}A_{k_2}B_{b}} \nonumber \\&-\sum _{t=1}^{T}m(t) \log \left( \sum _{k_1 = 0}^{k_{max}-1}\sum _{k_2=k_1}^{k_{max}-1}\sum _{b=0}^{b_{max}-1} n_{k_1,k_2,b}(t) {A_{k_1}A_{k_2}B_{b}} \right) , \end{aligned}$$
(2)

where m(t) is the number of new edges at time-step t, \(m_{k_1,k_2,b}(t)\) is the number of new edges at time-step t between node pairs (ij) that satisfy \((k_{i}(t), k_{j}(t), b_{i,j}(t)) = (k_1, k_2, b)\), and \(n_{k_1,k_2,b}(t)\) is the number of such pairs. By substituting \(A_k = (k + 1)^\alpha \) and \(B_b = (b + 1)^\beta \) in Eq. (2), we can obtain the log-likelihood function of the parametric model. In practice, logarithmic binning is applied to k and b for fast computation [13]. The log-likelihood function in Eq. (2) can be maximized by an MM algorithm [19]. Starting from some initial vectors \(\varvec{A}^{0} = [A_0^{0}, A_1^{0},\ldots , A_{k_{max} - 1}^{0}]\) and \(\varvec{B}^{0} = [B_0^{0}, B_1^0,\ldots , B_{b_{max}-1}^{0}]\), the algorithm updates \(\varvec{A}^{i}\) and \(\varvec{B}^{i}\) in parallel so that \(l(\varvec{A}^{i}\), \(\varvec{B}^{i})\) is monotonically increasing in i.

3 Calculating the Contributions of PA and Transitivity

It is not straightforward to meaningfully quantify the contributions of PA and transitivity. According to Eq. (2), if we define \(\hat{\varvec{A}} = [\hat{A}_0, \hat{A}_1,\ldots , \hat{A}_{k_{max} - 1}]\) and \(\hat{\varvec{B}} = [\hat{B}_0, \hat{B}_1,\ldots , \hat{B}_{b_{max}-1}]\) as the estimated PA and transitivity functions, then \((\hat{\varvec{A}}, \hat{\varvec{B}})\) and \((c\hat{\varvec{A}}, d\hat{\varvec{B}})\) give the same log-likelihood value for any positive constant factors c and d. This means that the magnitudes of \(\hat{\varvec{A}}\) and \(\hat{\varvec{B}}\) cannot be meaningfully compared with each other and summary statistics that bases solely on magnitudes such as the mean are also out of the question.

The contributions \(v_{PA}(t)\) of PA and \(v_{trans}(t)\) of transitivity at time-step t, however, can be meaningfully defined if we look at scale-invariant statistics, such as the variance of logarithmic values. From Eq. (1), we have:

$$\begin{aligned} \log P_{i,j}(t) = \log \hat{A}_{k_i(t)}\hat{A}_{k_j(t)} + \log \hat{B}_{b_{ij}(t)} + S(t), \end{aligned}$$

for some constant S(t) that is independent of i and j. We have the following three observations. Firstly, what matters is the variances in \(\log P_{i,j}(t)\) across node pairs (ij): the more uniform \(\log P_{i,j}(t)\)’s are, the closer the behavior at time-step t of the temporal network to that of an Erdös-Rényi random network where each pair (ij) has the same probability of developing new edges regardless of \(k_i(t)\), \(k_j(t)\) and \(b_{i,j}(t)\). Secondly, PA and transitivity mechanisms influence \(\log P_{i,j}(t)\) through \(\log \hat{A}_{k_i(t)}\hat{A}_{k_j(t)}\) and \(\log \hat{B}_{b_{ij}(t)}\), respectively. Finally, since \(\text {Var}(\log (cX)) = \text {Var}(\log (X) + \log c) = \text {Var}(\log (X)) \) for any positive constant factor c, the variances of \(\log {\hat{A}_{k_i(t)}\hat{A}_{k_j(t)}}\) and \(\log \hat{B}_{b_{i,j}(t)}\) are invariant to the scale of \((\hat{\varvec{A}}, \hat{\varvec{B}})\). From these observations, we define \(v_{PA}(t)\) and \(v_{trans}(t)\) to be the variances of \(\log {\hat{A}_{k_i(t)}\hat{A}_{k_j(t)}}\) and \(\log \hat{B}_{b_{i,j}(t)}\), respectively, where each variance is calculated from all node pairs that exist at time-step t. Note that \((\hat{\varvec{A}}, \hat{\varvec{B}})\) is estimated using all the time-steps as described in Sect. 2. A low \(v_{PA}(t)\) or \(v_{trans}(t)\) indicates that the contribution of the corresponding mechanism is weak at that time-step. For example, when \(\hat{B}_{b} \approx 1\) for all b, i.e., there is almost no transitivity, then \(\log \hat{B}_{b_{i,j}(t)}\) will take approximately the same value for all node pairs, and thus \(v_{trans}(t)\) will be very close to zero. Although \(\hat{A}_k\) and \(\hat{B}_b\) do not change with t, \(v_{PA}(t)\) and \(v_{trans}(t)\) are inherently temporal for two reasons: (1) the set of all node pairs that exist at time-step t might be changing due to the births of new nodes, and (2) \(\log \hat{A}_{k_i(t)}\hat{A}_{k_j(t)}\) and \(\log \hat{B}_{b_{i,j}(t)}\) might be changing due to the changes of \(k_i(t)\), \(k_j(t)\) or \(b_{i,j}(t)\).

4 Results

We analyzed three different scientific co-authorship networks: between authors in the complex network field (CMP) [20], authors of articles in the arXiv Hep-Th (high-energy theory) section (HEP) [21]), and authors of articles published in the SMJ (SMJ) [22]. While new collaborations and repeated collaborations are pooled together in HEP and SMJ, in CMP only new collaborations are considered. Table 1 shows the summary statistics for the three datasets.

Table 1. Summary statistics for three scientific co-authorship networks. |V| and |E| are the total numbers of nodes and edges in the final snapshot, respectively. T is the number of time-steps. \(\varDelta |V|\) and \(\varDelta |E|\) are the increments of nodes and edges after the initial snapshot, respectively. C is the clustering coefficient of the final snapshot.

We note that the number of edges per node in HEP is nearly 100 times greater than those in CMP and SMJ. As we shall see in Figs. 2a and b, this would cause the confidence interval at each estimated \(A_k\) and \(B_b\) in HEP to be very small comparing with those in CMP and SMJ.

Figures 2a shows the estimated PA functions in three networks. The estimated PA functions of CMP and SMJ are clearly increasing in average, which implies the existence of PA in these datasets. The estimated \(A_k\) of HEP is, however, only increasing up to \(k_1 \approx 30\), from there it decreases up to \(k_2 \approx 300\) and then once again increases. This might be caused by the fact that there are papers with hundreds of authors in high-energy physics and such papers are likely to be governed by some mechanism that is different than PA.

Fig. 2.
figure 2

Results in CMP, HEP, and SMJ networks. a: Estimated PA functions \(\hat{A}_k\). The vertical bar at each point indicates the \(\pm 2\sigma \) confidence interval. b: Estimated transitivity functions \(\hat{B}_b\). The vertical bar at each point indicates the \(\pm 2\sigma \) confidence interval. c: Estimated PA exponent \(\alpha \) and transitivity exponent \(\beta \). \(\alpha \) and \(\beta \) are estimated by fitting the parametric forms \(A_k = (k + 1)^\alpha \) and \(B_b = (b + 1)^\beta \). d: The evolutions of \(v_{PA}(t)\) and \(v_{trans}(t)\). \(v_{PA}(t)\) and \(v_{trans}\) are the contributions of PA and transitivity, respectively, at time-step t. The arrow in each dataset indicates the final time-step.

Figure 2b shows the estimated transitivity functions in three networks. In CMP, where only new collaborations are considered, the increase in \(B_1\) from \(B_0\) is nearly 100-fold. This is consistent with what was observed in Newman’s analysis [14], which revealed a 50-fold increase in a co-authorship network from the Los Alamos E-print Archive. The implication of this observation is that one common co-author between a pair of scientists dramatically increase the chance that the two will collaborate in the future, thus one might want to co-author with well-connected scientists if one wants to boost their number of co-authors. When new collaborations and repeated collaborations are pooled together, \(B_1/B_0\) is also very high: this quantity is about 10 and about 80 in HEP and SMJ, respectively, which also indicates the importance of one common collaborator in facilitating collaborations between a pair of authors.

When b is greater than 1, the confidence intervals in CMP and SMJ are too big to draw any conclusions beyond the observation that \(B_b\) in these two datasets fluctuate and show a non-increasing trend. The \(B_b\) in HEP, however, shows a clearly increasing trend when \(b > 1\). This trend, together with Fig. 2d, shows the dominance of transitivity in HEP, which is consistent with the well-known fact that the field of high-energy physics enjoys many long-lasting collaborations [23, 24].

Although the non-parametric \(A_k\) and \(B_b\) in these datasets seem to not follow any simple parametric forms, it is still beneficial to consider the power-law forms \(A_k = (k + 1)^\alpha \) and \(B_{b} = (b + 1)^\beta \) and then estimate the PA exponent \(\alpha \) and transitivity exponent \(\beta \). These exponents reveal the general trend of the two mechanisms and might be important in predicting the future of the network. Figure 2c shows the estimated \(\alpha \) and \(\beta \) in three networks. The estimated values of \(\alpha \) are founded to be from 0 to 1, which are consistent with previous works [13, 14]. This range of \(\alpha \) falls in the linear and sub-linear PA regions, which have been shown to lead to stable degree distributions when there is PA alone [12]. Extrapolating this result to our case, it is likely that some stable degree distribution would arise in the long run, too. This is important since it means that the low-degree authors would still have a chance to develop new collaborations, as opposite to the “winner-take-all” situation when the degree distribution is not stable. The estimated values of \(\beta \) are all greater than 1. There is, however, no study on the effects of such values of \(\beta \) on the asymptotic behavior of the network.

Finally, Fig. 2d shows \(v_{PA}(t)\) and \(v_{trans}(t)\) of three datasets at each time-step. In HEP, \(v_{trans}(t)\) is greater than \(v_{PA}(t)\) for all time-steps t, which suggests that transitivity completely dominates PA in this network. In CMP and SMJ networks, the situation is reversed: \(v_{PA}(t)\) is greater than \(v_{trans}(t)\) for all t, which implies that PA triumphs transitivity in CMP and SMJ.

5 Concluding Remarks

We proposed a non-parametric joint estimation method for the PA and transitivity mechanisms. Applying our method to three scientific co-authorship networks, we found the main driving force behind the evolution of each network is different: it is transitivity in Hep-Th network, and is PA in the remaining two. Furthermore, we found that \(B_1\) increases at least five fold from \(B_0\), which indicates the importance of one common collaborators in facilitating new collaborations between two authors. Among many possible directions for future research, two particularly stand out. The first one is to incorporate the fitness mechanism [25] to express node heterogeneity which is ignored in the PA and transitivity mechanisms. The second promising direction is to explore the evolutions of the PA and transitivity mechanisms themselves by investigating how the functions \(A_k\) and \(B_b\) change with time.