Abstract
We propose a method for the non-parametric joint estimation of preferential attachment and transitivity in complex networks, as opposite to conventional methods that either estimate one mechanism in isolation or jointly estimate both assuming some functional forms. We apply our method to three scientific co-authorship networks between scholars in the complex network field, physicists in high-energy physics, and authors in the Strategic Management Journal. The non-parametric method revealed complex trends of preferential attachment and transitivity that would be unavailable under conventional parametric approaches. In all networks, having one common collaborator with another scientist increases at least five times the chance that one will collaborate with that scientist. Finally, by quantifying the contribution of each mechanism, we found that while transitivity dominates preferential attachment in the high-energy physics network, preferential attachment is the main driving force behind the evolutions of the remaining two networks.
Access provided by CONRICYT-eBooks. Download conference paper PDF
Similar content being viewed by others
Keywords
- Preferential attachment
- Clustering coefficient
- Rich-get-richer
- Transitivity
- Scientific co-authorship networks
- Collaboration networks
1 Introduction
Cooperation is among the most fundamental behaviors of living creatures [1]. Animals cooperate in various activities: from hunting and forming territories to grooming and child raising [2]. Humans are the experts of cooperation. From cooperation between states [3], companies [4] to cooperation between individuals [5], it is the bedrock of our society.
As a form of human cooperation, scientific collaboration is the backbone of the scientific world. In this process, scientists share their ideas, their time, and their skills with each other in order to push the boundary of knowledge. Since the start of the twentieth century, the number of scientific articles with more than one author has grown to more than three times the number of single-author articles [6]. There is accumulating evidence that articles resulted from collaborations are cited more frequently than non-collaborated ones [6, 7]. Since the number of citations is the main metric of scientific impact [8], collaborations thus lead to high impact research. Therefore, understanding the evolution of a scientific co-authorship network, in particular, understanding how new collaborations are fostered, is significantly important for policy makers, funding agencies, university managers as well as each scientist.
To understand the evolution of a scientific co-authorship network, it is beneficial to consider this evolution in a larger context of the evolution of a complex network. Two defining characteristics of an evolving complex network are the heavy-tail of the degree distribution and the high value of the clustering coefficient [9]; both are often represented at the same time in scientific co-authorship networks [10, 11].
On the one hand, to explain the heavy tail property, complex network studies have proposed the preferential attachment (PA) mechanism where the probability that a node with degree k receives a new link is proportional to the PA function \(A_k\) [12, 13]. When \(A_k\) is an increasing function on average, nodes with higher numbers of links will receive more new links, and thus hubs are formed and the heavy-tail degree distribution emerges.
On the other hand, one of the most simple mechanisms to explain the high value of the clustering coefficient is transitivity where the probability that a pair of nodes with b common neighbors receives a new link is proportional to the transitivity function \(B_{b}\) [14]. When \(B_{b}\) is an increasing function on average, more triangles are formed between sets of three nodes, and this leads to an increase in the clustering coefficient.
Existing approaches either estimate one mechanism in isolation [13,14,15] or estimate jointly the two mechanisms assuming some parametric forms for \(A_k\) and \(B_b\) [16, 17]. On the one hand, estimating either mechanism in isolation often leads to poor fit, since many real-world networks exhibit simultaneously heavy-tail degree distribution and high-clustering. On the other hand, it is difficult to justify a particular choice of functional forms used in parametric estimation methods. A non-parametric estimation method would allow the functional forms to be learnt from the observed data.
Estimating \(A_k\) and \(B_b\) is the first step towards answering the question of what matters more in the evolution of a complex network: transitivity or PA. While there is some research studying a similar question regarding PA and fitness [18], the question regarding PA and transitivity has curiously remained unexplored, despite its potential to provide deeper understanding on how new cooperation is fostered.
Our contribution is threefold. In our first contribution, we propose a method for the non-parametric joint estimation of the PA function \(A_k\) and the transitivity function \(B_b\). In our model, the probability that a new edge emerges between node i and node j at time step t is
where \(k_i(t)\) and \(k_j(t)\) are the degrees of nodes i and j at time-step t, respectively, and \(b_{ij}(t)\) is the number of common neighbors between i and j at that time-step. We estimate \(A_k\) and \(B_b\) by maximum likelihood estimation. The log-likelihood of the observed data is maximized by a Minorize-Maximization algorithm [19]. This is an iterative algorithm that increases the log-likelihood value at each iteration until convergence. Although our primary targets for estimation are the non-parametric functions \(A_k\) and \(B_b\), one can apply the same maximum likelihood estimation method for the power-law parametric model where we assume \(A_k = (k + 1)^\alpha \) and \(B_b = (b + 1)^\beta \). The method is discussed more in details in Sect. 2.
We use the co-authorship network between scientists in the complex network field [20] to illustrate the goodness-of-fit of three estimation methods: (a) jointly estimating the two mechanisms, i.e., the proposed method in Eq. (1), (b) estimating the PA function \(A_k\) in isolation, i.e., \(P_{ij}(t) \propto A_{k_i(t)}A_{k_j(t)}\), and (c) estimating the transitivity function \(B_{b}\) in isolation, i.e., \(P_{ij}(t) \propto B_{b_{ij}(t)}\). In each case, after estimation, starting from the same initial snapshot as in the real-world data, we simulated the new edges based on \(P_{i,j}(t)\) while we kept the numbers of new nodes and new edges at each time-step exactly the same as in the real-world network. Figure 1 shows the clustering coefficient C in the final snapshot of the simulated networks of each case over 100 replications. Only the proposed method of joint estimation could satisfactorily reproduce the true clustering coefficient \(C_{*}\). This implies that the true growth mechanism in this network, whatever it might be, is closer to PA + transitivity than either PA or transitivity alone. Using, for example, transitivity alone to explain the growth of the network would lead to over-estimation of the transitivity function \(B_b\), since the edge-increasing effect of the neglected PA mechanism has to be explained by \(B_b\) alone. This over-estimation, in turn, would lead to an overly high C as can be seen in Fig. 1.
In our second contribution, we propose a method to quantify the contributions of PA and transitivity mechanisms in the growth process of a network, and thus provide a way to answer the question of what matters more: PA or transitivity. The method is discussed in Sect. 3.
In our third contribution, we analyzed three scientific co-authorship networks between scholars in the complex network field [20], physicists in high-energy physics [21], and authors in the Strategic Management Journal (SMJ) [22]. The results are shown in Sect. 4. As opposite to conventional parametric approaches, the non-parametric approach provides us a chance to investigate finer details of PA and transitivity mechanisms. While \(A_k\) in CMP and SMJ is increasing in general, its trend is complex in HEP: it is first increasing, then decreasing, and finally increasing again. On the other hand, while \(B_b\) in HEP shows a clearly increasing trend, it is flat or even decreasing in CMP and SMJ when we exclude \(b = 0\).
Furthermore, in all networks, having at least one common collaborator with another scientist increases at least five times the chance that one will collaborate with that scientist in the future, which is consistent with previous studies [14]. This implies that, if one wants to boost their number of collaborators, one should collaborate with well-connected scientists.
By quantifying the contribution of each mechanism in the growth process, we also found that while transitivity dominates PA in the high-energy physics network, preferential attachment triumphs transitivity in the remaining networks. Concluding remarks are given in Sect. 5.
2 Maximum Likelihood Estimation
In this section, we describe the maximum likelihood estimation method for the model described in Eq. (1). Let \(k_{max}\) be the maximum value of the degree of a node and \(b_{max}\) be the maximum value of the number of common neighbors between a pair of nodes. The variables we want to estimate are \(\varvec{A} = [A_0, A_1,\ldots , A_{k_{max} - 1}]\) and \(\varvec{B} = [B_0, B_1,\ldots , B_{b_{max}-1}]\). The log-likelihood function is:
where m(t) is the number of new edges at time-step t, \(m_{k_1,k_2,b}(t)\) is the number of new edges at time-step t between node pairs (i, j) that satisfy \((k_{i}(t), k_{j}(t), b_{i,j}(t)) = (k_1, k_2, b)\), and \(n_{k_1,k_2,b}(t)\) is the number of such pairs. By substituting \(A_k = (k + 1)^\alpha \) and \(B_b = (b + 1)^\beta \) in Eq. (2), we can obtain the log-likelihood function of the parametric model. In practice, logarithmic binning is applied to k and b for fast computation [13]. The log-likelihood function in Eq. (2) can be maximized by an MM algorithm [19]. Starting from some initial vectors \(\varvec{A}^{0} = [A_0^{0}, A_1^{0},\ldots , A_{k_{max} - 1}^{0}]\) and \(\varvec{B}^{0} = [B_0^{0}, B_1^0,\ldots , B_{b_{max}-1}^{0}]\), the algorithm updates \(\varvec{A}^{i}\) and \(\varvec{B}^{i}\) in parallel so that \(l(\varvec{A}^{i}\), \(\varvec{B}^{i})\) is monotonically increasing in i.
3 Calculating the Contributions of PA and Transitivity
It is not straightforward to meaningfully quantify the contributions of PA and transitivity. According to Eq. (2), if we define \(\hat{\varvec{A}} = [\hat{A}_0, \hat{A}_1,\ldots , \hat{A}_{k_{max} - 1}]\) and \(\hat{\varvec{B}} = [\hat{B}_0, \hat{B}_1,\ldots , \hat{B}_{b_{max}-1}]\) as the estimated PA and transitivity functions, then \((\hat{\varvec{A}}, \hat{\varvec{B}})\) and \((c\hat{\varvec{A}}, d\hat{\varvec{B}})\) give the same log-likelihood value for any positive constant factors c and d. This means that the magnitudes of \(\hat{\varvec{A}}\) and \(\hat{\varvec{B}}\) cannot be meaningfully compared with each other and summary statistics that bases solely on magnitudes such as the mean are also out of the question.
The contributions \(v_{PA}(t)\) of PA and \(v_{trans}(t)\) of transitivity at time-step t, however, can be meaningfully defined if we look at scale-invariant statistics, such as the variance of logarithmic values. From Eq. (1), we have:
for some constant S(t) that is independent of i and j. We have the following three observations. Firstly, what matters is the variances in \(\log P_{i,j}(t)\) across node pairs (i, j): the more uniform \(\log P_{i,j}(t)\)’s are, the closer the behavior at time-step t of the temporal network to that of an Erdös-Rényi random network where each pair (i, j) has the same probability of developing new edges regardless of \(k_i(t)\), \(k_j(t)\) and \(b_{i,j}(t)\). Secondly, PA and transitivity mechanisms influence \(\log P_{i,j}(t)\) through \(\log \hat{A}_{k_i(t)}\hat{A}_{k_j(t)}\) and \(\log \hat{B}_{b_{ij}(t)}\), respectively. Finally, since \(\text {Var}(\log (cX)) = \text {Var}(\log (X) + \log c) = \text {Var}(\log (X)) \) for any positive constant factor c, the variances of \(\log {\hat{A}_{k_i(t)}\hat{A}_{k_j(t)}}\) and \(\log \hat{B}_{b_{i,j}(t)}\) are invariant to the scale of \((\hat{\varvec{A}}, \hat{\varvec{B}})\). From these observations, we define \(v_{PA}(t)\) and \(v_{trans}(t)\) to be the variances of \(\log {\hat{A}_{k_i(t)}\hat{A}_{k_j(t)}}\) and \(\log \hat{B}_{b_{i,j}(t)}\), respectively, where each variance is calculated from all node pairs that exist at time-step t. Note that \((\hat{\varvec{A}}, \hat{\varvec{B}})\) is estimated using all the time-steps as described in Sect. 2. A low \(v_{PA}(t)\) or \(v_{trans}(t)\) indicates that the contribution of the corresponding mechanism is weak at that time-step. For example, when \(\hat{B}_{b} \approx 1\) for all b, i.e., there is almost no transitivity, then \(\log \hat{B}_{b_{i,j}(t)}\) will take approximately the same value for all node pairs, and thus \(v_{trans}(t)\) will be very close to zero. Although \(\hat{A}_k\) and \(\hat{B}_b\) do not change with t, \(v_{PA}(t)\) and \(v_{trans}(t)\) are inherently temporal for two reasons: (1) the set of all node pairs that exist at time-step t might be changing due to the births of new nodes, and (2) \(\log \hat{A}_{k_i(t)}\hat{A}_{k_j(t)}\) and \(\log \hat{B}_{b_{i,j}(t)}\) might be changing due to the changes of \(k_i(t)\), \(k_j(t)\) or \(b_{i,j}(t)\).
4 Results
We analyzed three different scientific co-authorship networks: between authors in the complex network field (CMP) [20], authors of articles in the arXiv Hep-Th (high-energy theory) section (HEP) [21]), and authors of articles published in the SMJ (SMJ) [22]. While new collaborations and repeated collaborations are pooled together in HEP and SMJ, in CMP only new collaborations are considered. Table 1 shows the summary statistics for the three datasets.
We note that the number of edges per node in HEP is nearly 100 times greater than those in CMP and SMJ. As we shall see in Figs. 2a and b, this would cause the confidence interval at each estimated \(A_k\) and \(B_b\) in HEP to be very small comparing with those in CMP and SMJ.
Figures 2a shows the estimated PA functions in three networks. The estimated PA functions of CMP and SMJ are clearly increasing in average, which implies the existence of PA in these datasets. The estimated \(A_k\) of HEP is, however, only increasing up to \(k_1 \approx 30\), from there it decreases up to \(k_2 \approx 300\) and then once again increases. This might be caused by the fact that there are papers with hundreds of authors in high-energy physics and such papers are likely to be governed by some mechanism that is different than PA.
Figure 2b shows the estimated transitivity functions in three networks. In CMP, where only new collaborations are considered, the increase in \(B_1\) from \(B_0\) is nearly 100-fold. This is consistent with what was observed in Newman’s analysis [14], which revealed a 50-fold increase in a co-authorship network from the Los Alamos E-print Archive. The implication of this observation is that one common co-author between a pair of scientists dramatically increase the chance that the two will collaborate in the future, thus one might want to co-author with well-connected scientists if one wants to boost their number of co-authors. When new collaborations and repeated collaborations are pooled together, \(B_1/B_0\) is also very high: this quantity is about 10 and about 80 in HEP and SMJ, respectively, which also indicates the importance of one common collaborator in facilitating collaborations between a pair of authors.
When b is greater than 1, the confidence intervals in CMP and SMJ are too big to draw any conclusions beyond the observation that \(B_b\) in these two datasets fluctuate and show a non-increasing trend. The \(B_b\) in HEP, however, shows a clearly increasing trend when \(b > 1\). This trend, together with Fig. 2d, shows the dominance of transitivity in HEP, which is consistent with the well-known fact that the field of high-energy physics enjoys many long-lasting collaborations [23, 24].
Although the non-parametric \(A_k\) and \(B_b\) in these datasets seem to not follow any simple parametric forms, it is still beneficial to consider the power-law forms \(A_k = (k + 1)^\alpha \) and \(B_{b} = (b + 1)^\beta \) and then estimate the PA exponent \(\alpha \) and transitivity exponent \(\beta \). These exponents reveal the general trend of the two mechanisms and might be important in predicting the future of the network. Figure 2c shows the estimated \(\alpha \) and \(\beta \) in three networks. The estimated values of \(\alpha \) are founded to be from 0 to 1, which are consistent with previous works [13, 14]. This range of \(\alpha \) falls in the linear and sub-linear PA regions, which have been shown to lead to stable degree distributions when there is PA alone [12]. Extrapolating this result to our case, it is likely that some stable degree distribution would arise in the long run, too. This is important since it means that the low-degree authors would still have a chance to develop new collaborations, as opposite to the “winner-take-all” situation when the degree distribution is not stable. The estimated values of \(\beta \) are all greater than 1. There is, however, no study on the effects of such values of \(\beta \) on the asymptotic behavior of the network.
Finally, Fig. 2d shows \(v_{PA}(t)\) and \(v_{trans}(t)\) of three datasets at each time-step. In HEP, \(v_{trans}(t)\) is greater than \(v_{PA}(t)\) for all time-steps t, which suggests that transitivity completely dominates PA in this network. In CMP and SMJ networks, the situation is reversed: \(v_{PA}(t)\) is greater than \(v_{trans}(t)\) for all t, which implies that PA triumphs transitivity in CMP and SMJ.
5 Concluding Remarks
We proposed a non-parametric joint estimation method for the PA and transitivity mechanisms. Applying our method to three scientific co-authorship networks, we found the main driving force behind the evolution of each network is different: it is transitivity in Hep-Th network, and is PA in the remaining two. Furthermore, we found that \(B_1\) increases at least five fold from \(B_0\), which indicates the importance of one common collaborators in facilitating new collaborations between two authors. Among many possible directions for future research, two particularly stand out. The first one is to incorporate the fitness mechanism [25] to express node heterogeneity which is ignored in the PA and transitivity mechanisms. The second promising direction is to explore the evolutions of the PA and transitivity mechanisms themselves by investigating how the functions \(A_k\) and \(B_b\) change with time.
References
Nowak, M.A.: Five rules for the evolution of cooperation. Science 314(5805), 1560–1563 (2006). http://science.sciencemag.org/content/314/5805/1560
Dugatkin, L.A.: Cooperation Among Animals: An Evolutionary Perspective. Oxford University Press, Oxford (1997)
Watson, A.: Diplomacy. Routledge, London (1984)
Hamel, G., Doz, Y.L., Prahalad, C.K.: Collaborate with your competitors-and win. Harv. Bus. Rev. 67(1), 133–139 (1989). https://hbr.org/1989/01/collaborate-with-your-competitors-and-win
Johnson, D.W., Johnson, R.T., Smith, K.A.: Active Learning: Cooperation in the College Classroom. Interaction Book Company, Edina (1991)
Larivière, V., Gingras, Y., Sugimoto, C.R., Tsou, A.: Team size matters: collaboration and scientific impact since 1900. J. Assoc. Inf. Sci. Technol. 66(7), 1323–1332. https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.23266
Bornmann, L.: Is collaboration among scientists related to the citation impact of papers because their quality increases with collaboration? An analysis based on data from F1000Prime and normalized citation scores. J. Assoc. Inf. Sci. Technol. 68(4), 1036–1047 (2017). https://doi.org/10.1002/asi.23728
Tahai, A., Meyer, M.J.: A revealed preference study of management journals’ direct influences. Strateg. Manag. J. 20(3), 279–296 (1999). https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-0266%28199903%2920%3A3%3C279%3A%3AAID-SMJ33%3E3.0.CO%3B2-2
Holme, P., Kim, B.J.: Growing scale-free networks with tunable clustering. Phys. Rev. E 65, 026107 (2002)
Newman, M.E.J.: Coauthorship networks and patterns of scientific collaboration. In: Proceedings of the National Academy of Sciences, vol. 101(suppl 1), pp. 5200–5205 (2004). http://www.pnas.org/content/101/suppl_1/5200
Newman, M.E.J.: Scientific collaboration networks. I. network construction and fundamental results. Phys. Rev. E 64, 016131 (2001). https://link.aps.org/doi/10.1103/PhysRevE.64.016131
Krapivsky, P., Rodgers, G., Redner, S.: Degree distributions of growing networks. Phys. Rev. Lett. 86(23), 5401–5404 (2001)
Pham, T., Sheridan, P., Shimodaira, H.: PAFit: a statistical method for measuring preferential attachment in temporal complex networks. PLoS ONE 10(9), e0137796 (2015)
Newman, M.: Clustering and preferential attachment in growing networks. Phys. Rev. E 64(2), 025102 (2001)
Jeong, H., Néda, Z., Barabási, A.: Measuring preferential attachment in evolving networks. Europhys. Lett. 61(61), 567–572 (2003)
Ripley, R., Boitmanis, K., Snijders, T.A.: RSiena: Siena - Simulation Investigation for Empirical Network Analysis, R package version 1.1-232 (2013). https://CRAN.R-project.org/package=RSiena
Krivitsky, P.N., Handcock, M.S.: tergm: Fit, Simulate and Diagnose Models for Network Evolution Based on Exponential-Family Random Graph Models. The Statnet Project. R package version 3.4.0 (2016). http://www.statnet.org, http://CRAN.R-project.org/package=tergm
Kong, J., Sarshar, N., Roychowdhury, V.: Experience versus talent shapes the structure of the web. Proc. Nat. Acad. Sci. U.S.A. 37, 105 (2008)
Hunter, D., Lange, K.: Quantile regression via an MM algorithm. J. Comput. Graph. Stat. 9, 60–77 (2000)
Pham, T., Sheridan, P., Shimodaira, H.: PAFit: an R Package for the Non-parametric Estimation of Preferential Attachment and Node Fitness in Temporal Complex Networks. ArXiv e-prints, April 2017
KONECT: arxiv hep-th network dataset. http://konect.uni-koblenz.de/networks/ca-cit-HepTh. Accessed 03 May 2018
Ronda-Pupo, G.A., Pham, T.: The evolutions of the rich get richer and the fit get richer phenomena in scholarly networks: the case of the strategic management journal. Scientometrics, May 2018. https://doi.org/10.1007/s11192-018-2761-3
Zimmermann, F.: High-energy physics strategies and futurelarge-scale projects. Nucl. Instr. Meth. Phys. Res. Sect. B: Beam Interact. Mater. Atoms 355, 4–10 (2015). http://www.sciencedirect.com/science/article/pii/S0168583X1500350X. Proceedings of the 6th International Conference Channeling 2014: Charged & Neutral Particles Channeling Phenomena, 5–10 October 2014, Capri, Italy
Birnholtz, J.P.: What does it mean to be an author? The intersection of credit, contribution, and collaboration in science. J. Am. Soc. Inf. Sci. Technol. 57(13), 1758–1770. https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.20380
Pham, T., Sheridan, P., Shimodaira, H.: Joint estimation of preferential attachment and node fitness in growing complex networks. Sci. Rep. 6 (2016). https://doi.org/10.1038/srep32558
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Inoue, M., Pham, T., Shimodaira, H. (2018). Transitivity vs Preferential Attachment: Determining the Driving Force Behind the Evolution of Scientific Co-Authorship Networks. In: Morales, A., Gershenson, C., Braha, D., Minai, A., Bar-Yam, Y. (eds) Unifying Themes in Complex Systems IX. ICCS 2018. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-319-96661-8_28
Download citation
DOI: https://doi.org/10.1007/978-3-319-96661-8_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96660-1
Online ISBN: 978-3-319-96661-8
eBook Packages: Physics and AstronomyPhysics and Astronomy (R0)