Introduction

Since the late 1990s, network analysis has shifted its focus from the analysis of small graphs to the large-scale statistical properties of graphs (Newman 2003). This shift has been driven by the availability of computers and communication networks, which allow us to gather and analyze large-scale data, and by the breakdown of boundaries between disciplines, which allows us to uncover the generic properties of complex networks (Albert and Barabási 2002).

The coauthorship network, an important form of social network, has been studied intensively as part of this movement. Most studies to date have examined “hard science” coauthorship networks; less attention has been paid to “soft science” coauthorship networks, such as those of library and information science (LIS), and few studies focus on the coauthorship network of LIS in China. China is currently one of the fastest developing countries in the world. It has recently issued several research policies to boost its national and international research capacity, and substantial national funding has been provided to encourage large-scale collaborative projects (Jin 1999; Wang et al. 2005). Library and information science in China has also changed dramatically in recent years, along with China’s “open” policy and the exponential growth of the Chinese economy (Wu and Yuan 1994). Modern research equipment and the new research policies enable more and more Chinese researchers to collaborate on a broader scale. Identifying the key features of the current coauthorship network of LIS in China is therefore an interesting research topic.

This paper applies network analysis methods to identify the key features of the coauthorship network of LIS in China. We analyze the network with macro-level methods, which capture its global features, and with micro-level methods, which describe the local features of its individual elements. From the micro-level analysis we obtain three sets of centrality values, apply them to impact analysis, and discuss the correlation between centrality measures and citation counts.

In the first part of this article, we review the literature on coauthorship network analysis, tracing its origin and current development. We then introduce the data set and method of this research. In the next part, we construct the coauthorship network of LIS in China and apply both macro-level and micro-level metrics to analyze it. We identify some structural properties of this network and discuss the relationship between citation counts and centrality measures. We conclude by discussing the limitations of this research and proposing directions for future research.

Related work

Watts and Strogatz (1998) showed that many real-world networks have a small average shortest path length but a clustering coefficient significantly higher than expected by random chance. A year later, Barabási and Albert (1999) found that some nodes had many more connections than others and that the degree distribution of the network as a whole followed a power law. In 2002 and 2003, two review articles were published in leading physics (Albert and Barabási 2002) and mathematics (Newman 2003) journals, marking the first phase of maturity in the modeling of social networks.

As for the coauthorship networks investigated in this paper, an early example is the Erdős Number Project, in which the smallest number of coauthorship links between any individual mathematician and the Hungarian mathematician Erdős is calculated (Castro and Grossman 1999). Newman studied and compared the coauthorship graphs of arXiv, Medline, SPIRES, and NCSTRL (Newman 2001a, b) and found a number of network differences between experimental and theoretical disciplines. By mapping the graph containing all relevant publications of the members of COLLNET, an international collaboration network, Yin et al. (2006) found that this scientific community displays many aspects of a small-world network and is vulnerable to disruption. Using Science Citation Index (SCI) data for 1990 and 2000, Wagner and Leydesdorff (2003) found that over those 10 years the global network expanded to include more nations and became more interconnected. Cronin and Shaw (2007) found that physical location played an important role in collaboration in their study of Rob Kling’s intellectual impact and influences. A recent paper by Hou et al. (2008) revealed the coauthorship pattern of Scientometrics using data from SCI. According to Liu et al. (2005), coauthorship analysis has also been applied to various ACM conferences, namely Information Retrieval (SIGIR), Management of Data (SIGMOD) and Hypertext, as well as to mathematics, neuroscience, and information systems.

Methodology

Data collection

We select 18 LIS journals from the Chinese Social Science Citation Index (CSSCI: http://cssci.nju.edu.cn/CSSCIlyqk2008.htm) covering a time span of 6 years (2002–2007). Like the Social Science Citation Index of the Institute for Scientific Information, CSSCI indexes high-quality journals, in this case those published in China. These 18 journals can therefore reveal the collaboration patterns of China’s LIS research.

These source journals are (translated into English based on CSSCI translations): (1) Archives Science Bulletin, (2) Archives Science Study, (3) Document, Information & Knowledge, (4) Information and Documentation Services, (5) Information Science, (6) Information Studies: Theory & Application, (7) Journal of Academic Libraries, (8) Journal of Information, (9) Journal of Library Science in China, (10) Journal of the China Society for Scientific and Technical Information, (11) Library, (12) Library and Information, (13) Library and Information Service, (14) Library Journal, (15) Library Theory and Practice, (16) Library Tribune, (17) Library Work and Study, (18) New Technology of Library and Information Service.

Method

Given an established social network, we can describe its properties on two levels: with macro-level metrics (global graph metrics) and with micro-level metrics (individual actor properties). Macro-level metrics describe the characteristics of a social network as a whole (Liu et al. 2005), with the aim of capturing its overall structural features. Commonly used measures are diameter, mean distance, components, clusters, etc. Micro-level metrics relate to the individual properties of network actors, for example actor position, actor status, and distance to others, which inform us about “the differential constraints and opportunities facing individual actors which shape their social behavior” (Yin et al. 2006, p. 1600). They zoom in to capture the features of individual nodes/actors while taking the topology of the network into account. Micro-level metrics usually refer to centrality, which indicates how central an actor is to the network. Central actors are well connected to other actors (Liu et al. 2005), and centrality metrics measure an actor’s degree (degree centrality), average distance to others (closeness centrality), or the degree to which geodesic paths between any pair of actors pass through the actor (betweenness centrality).

Macro-level metrics

Macro-level metrics are useful for identifying the global structural features of a network. There are many ways of characterizing the structure of a network. In this study, we focus on five key elements of networks: component, distance, cluster, degree distribution, and error and attack tolerance.

Component

In social network analysis, connected subgraphs are called components. A component of a graph is a subset of nodes with the characteristic that there is a path between any node and any other node of the subset (Nooy et al. 2005). A coauthorship network usually consists of many disconnected components, and the focus is usually on the largest one. Nascimento et al. (2003) reported that the largest component in SIGMOD’s coauthorship graph contains about 60% of all authors. Of the four coauthorship networks studied by Newman (2001b), Medline has the largest component, with 92.6% of all authors, while NCSTRL has the smallest largest component, containing 57.2% of all authors. After comparing several coauthorship networks, Kretschmer (2004) suggests that the largest component usually contains more than 40% of all authors. A similar finding is presented by Grossman (2002), who finds that this ratio ranges from 40% to more than 80%. The ratio of the largest component is useful for identifying the collective collaboration pattern of a field and for making comparisons across disciplines.
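As an illustration of the largest-component ratio, the following is a minimal Python sketch that finds the components of a small network by breadth-first search and reports the share of authors in the largest one. The toy network, the adjacency-dictionary representation, and the function name `connected_components` are hypothetical choices made for this example; they are not the data or tooling used in this study.

```python
from collections import deque


def connected_components(adjacency):
    """Return a list of components, each a set of vertices, found by BFS."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy network: two disconnected groups of coauthors.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}

components = connected_components(toy_network)
largest = max(components, key=len)
# Ratio of the largest component: authors in it divided by all authors.
largest_ratio = len(largest) / len(toy_network)
print(f"{len(components)} components, largest holds {largest_ratio:.1%} of authors")
```

In this toy case four of the six authors fall in the largest component.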

Mean distance

A geodesic is the shortest path between two vertices. The distance from vertex u to vertex v is the length of the geodesic from u to v. As defined formally by Watts and Strogatz (1998), and informally by Milgram (1967), many social networks display structures in which most individuals are only a few degrees of distance from one another. According to Yin et al. (2006), a short mean distance indicates that authors can share information more rapidly and is thus a desirable property of coauthorship networks. In this study, we calculate the mean distance of the LIS coauthorship network in China and compare it with those of other disciplines (see Table 1).
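To make the definition concrete, the sketch below computes the mean geodesic distance of a small connected network with breadth-first search. It is only an illustration under stated assumptions: the four-author path network and the function names are hypothetical, and pairs of vertices that cannot reach each other are simply left out of the average.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Geodesic distance from `source` to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def mean_distance(adjacency):
    """Average geodesic length over all ordered pairs of distinct, reachable vertices."""
    total = 0
    pairs = 0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical chain of coauthors: A - B - C - D.
path_network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}

print(round(mean_distance(path_network), 4))
```

For the chain, the six unordered pairs have distances 1, 2, 3, 1, 2 and 1, so the mean is 10/6.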

Table 1 Typical macro-level metrics results

Cluster

Real networks are also clustered: two nodes that share a neighbour have a higher chance of being linked to each other than two nodes in a random network. Coauthorship networks are likely to be characterized by local clusters of individuals who are tied to most of the others (Newman 2001a). The clustering coefficient is a standard measure of how clustered a network is:

$$ C = \frac{3 \times \text{number of triangles on the graph}}{\text{number of connected triples of vertices}} $$

Here a “triangle” is a trio of authors, each of whom is connected to both of the others, and a “connected triple” is a single author connected to two others. The coefficient is the probability that two authors have collaborated with each other, given that both have collaborated with a third author. A clustering coefficient of 0.5, for example, indicates a 50% chance that two authors who both collaborate with a third author also collaborate with each other. In scientific collaboration networks, clustering reflects the tendency of authors to form local clusters with people with whom they share common interests. To the extent that the relations among authors display a tendency towards transitivity, collaboration networks are likely to be characterized by local clusters of individuals who are tied to most of the others (Yin et al. 2006). This definition corresponds to the concept of the “fraction of transitive triples” used in sociology (Wasserman and Faust 1994).
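The definition above can be checked on a small example. The following minimal sketch counts connected triples at their centre vertex and tests whether the two end vertices are also linked; each triangle is found three times this way, which supplies the factor 3 in the numerator. The toy network and the function name `clustering_coefficient` are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adjacency):
    """C = 3 x triangles / connected triples, counted triple by triple."""
    closed = 0   # triples whose two end vertices are also linked (= 3 x triangles)
    triples = 0  # all connected triples, counted once at their centre vertex
    for centre, neighbours in adjacency.items():
        for u, v in combinations(sorted(neighbours), 2):
            triples += 1
            if v in adjacency[u]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical toy network: a triangle A-B-C plus author D, who only works with C.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

coefficient = clustering_coefficient(toy_network)
print(coefficient)
```

The toy network has one triangle and five connected triples (one centred on A, one on B, three on C), so C = 3/5.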

Degree distribution

The degree of a vertex is the number of lines incident with it. Vertices with high degrees act as hubs in the network. These hubs have many more lines than other vertices and connect nodes that would otherwise be unable to reach one another. They are crucial to the robustness of the network as well as to the transmission of information. p(k) is defined as the fraction of nodes in the network that have degree k. Equivalently, p(k) is the probability that a randomly chosen node in the network has degree k (Yin et al. 2006). Albert and Barabási (2002) report that in many real-world networks the degree distribution follows a power law: \( p(k) \sim k^{ - \gamma } \). Networks with power-law degree distributions are often referred to as “scale-free” networks. In this study, we calculate the exponent γ to characterize the degree distribution of this network (see Table 3).
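The following sketch shows one way p(k) and the exponent γ could be estimated. It assumes a least-squares line fitted to (log k, log p(k)), which is a common but not the only way to estimate a power-law exponent; the fitting procedure actually used in this study is not specified here. The function names and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(adjacency):
    """Return {k: fraction of vertices with degree k}."""
    counts = Counter(len(neighbours) for neighbours in adjacency.values())
    n = len(adjacency)
    return {k: c / n for k, c in sorted(counts.items())}


def fit_power_law(pk):
    """Least-squares line through (log k, log p(k)); returns (coefficient, gamma)."""
    points = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # p(k) = coefficient * k^(-gamma), so gamma is the negated slope.
    return math.exp(intercept), -slope


# Hypothetical star network: one hub with four collaborators.
star = {"H": {"a", "b", "c", "d"}, "a": {"H"}, "b": {"H"}, "c": {"H"}, "d": {"H"}}
print(degree_distribution(star))

# Synthetic distribution that follows p(k) = 2.0 * k^(-2.5) exactly.
synthetic = {k: 2.0 * k ** -2.5 for k in range(1, 11)}
coefficient, gamma = fit_power_law(synthetic)
print(round(coefficient, 4), round(gamma, 4))
```

On the synthetic input the fit recovers the coefficient and exponent it was generated with, which is a quick sanity check before applying the same routine to observed degree counts.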

Error and attack tolerance

Error tolerance means that local failures rarely lead to the loss of the global information-carrying ability of the network; it is the ability to transmit information when a fraction of randomly chosen vertices is deleted. Attack tolerance, on the other hand, is the ability of the network to keep transmitting information when a few nodes that play a vital role in maintaining its connectivity are removed. Albert et al. (2000) find that scale-free networks display an unexpected degree of robustness, the ability of their nodes to communicate being unaffected even by unrealistically high failure rates. This is attributed to the redundancy of ties between local clusters and central hubs. However, error tolerance comes at a high price in that these networks are extremely vulnerable to attacks. Such error tolerance and attack vulnerability are generic properties of social networks (Albert and Barabási 2002). The ability to sustain network ties in a collaboration network is important to the cumulative development of deep specialist knowledge (Yin et al. 2006). Two experiments will be conducted to assess the robustness of the LIS coauthorship network in China.

Table 1 shows some typical macro-level metrics results in coauthorship network analysis. The ratio of the largest component of each coauthorship network ranges from 40% to above 90%, which is consistent with Grossman’s (2002) findings. The mean distances of these coauthorship networks range from 3 to 10. Most of the networks’ clustering coefficients are at the 0.1 level, whereas the biology coauthorship network has a value of only 0.07. Newman (2001c) concluded that this is the result of the top-down organization of biology laboratories, which tends to produce tree-like collaboration networks. Such tree-like networks have low clustering coefficients.

Micro-level metrics

Macro-level metrics are useful for identifying the overall collaboration pattern of a scientific collaboration network, but they may not be informative about individual actors. Micro-level metrics are designed to address this. We apply three centrality measures (degree, closeness, and betweenness centrality) to investigate the different contributions of actors in the coauthorship network; such measures are important for understanding power, stratification, ranking, and inequality in social structures (Wasserman and Faust 1994).

Degree centrality

Degree centrality is equal to the number of ties (connections) that a vertex has with other vertices. It is given by the following equation, where \( d(n_{i}) \) is the degree of \( n_{i} \):

$$ C_{D} (n_{i} ) = d(n_{i} ) $$

Generally, authors with a higher degree (more connections) are more central to the structure and tend to have a greater capacity to influence others. Some authors, however, have a high degree because they coauthored a single paper with many authors rather than because they coauthored many papers.

Closeness centrality

A more sophisticated centrality measure is closeness (Freeman 1979), which emphasizes the distance of a vertex to all others in the network by focusing on the geodesic distance from each vertex to all others.

$$ C_{C} (n_{i} ) = \sum\limits_{j = 1,\,j \ne i}^{N} {\frac{1}{{d(n_{i} ,n_{j} )}}} $$

In the above equation, \( C_{C}(n_{i}) \) is the closeness centrality of vertex \( n_{i} \), N is the number of vertices, and \( d(n_{i}, n_{j}) \) is the distance between vertices \( n_{i} \) and \( n_{j} \). Authors with high closeness centrality fall into one of two cases. In the first, an author collaborates with authors from diverse subgroups and thereby shortens the distance to every author in those subgroups. In the second, an author merely collaborates with an author of the first kind, which also results in a high closeness centrality. Authors in the first case are the central authors we intend to identify, whereas those in the second case are noise that we try to avoid.
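The two cases can be seen in a small hypothetical example. The sketch below scores each author by the sum of inverse geodesic distances to all other authors, which is how we read the closeness formula in this section; unreachable authors contribute nothing. The toy network, in which H leads three subgroups and P has coauthored only once with H, and all function names are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Geodesic distance from `source` to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closeness_centrality(adjacency):
    """Sum of 1/d(n_i, n_j) over all other reachable vertices n_j."""
    scores = {}
    for source in adjacency:
        dist = bfs_distances(adjacency, source)
        scores[source] = sum(1.0 / d for target, d in dist.items() if target != source)
    return scores


# Hypothetical network: H links three subgroups (A, B, C and their coauthors);
# P has a single tie, to H; A2, B2, C2 each have a single tie inside a subgroup.
toy_network = {
    "H": {"A", "B", "C", "P"},
    "A": {"H", "A2"},
    "B": {"H", "B2"},
    "C": {"H", "C2"},
    "A2": {"A"},
    "B2": {"B"},
    "C2": {"C"},
    "P": {"H"},
}

scores = closeness_centrality(toy_network)
for author in sorted(scores, key=scores.get, reverse=True):
    print(author, round(scores[author], 3))
```

P and A2 both have exactly one coauthor, yet P scores higher simply because its single coauthor is the central author H. This is the second case described above.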

Betweenness centrality

Betweenness centrality is based on the number of shortest paths passing through a vertex. Vertices with high betweenness play the role of connecting different groups. In the following formula, \( g_{jik} \) is the number of geodesics linking node j and node k that pass through node i, and \( g_{jk} \) is the total number of geodesics linking j and k.

$$ C_{B} (n_{i} ) = \sum\limits_{j,k \ne i} {{\frac{{g_{jik} }}{{g_{jk} }}}} $$

In social networks, vertices with high betweenness are “pivot points of knowledge flow in the network” (Yin et al. 2006, p. 1603). In coauthorship networks, vertices with high betweenness connect groups of authors, where the authors within each group share similar research interests. Authors with high betweenness therefore usually engage in research in different fields and thus show interdisciplinarity.
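As a rough illustration of the betweenness formula, the sketch below counts geodesics by brute force: for every pair (j, k) it adds to vertex i the share of j-k geodesics that pass through i. This is a simple cubic-time approach chosen for readability and is not an efficient algorithm for networks of the size studied here. Each unordered pair is counted once, the toy network is hypothetical, and so are the function names.

```python
from collections import deque
from itertools import combinations


def shortest_path_counts(adjacency, source):
    """BFS returning (dist, sigma): geodesic length and number of geodesics from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                sigma[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                sigma[neighbour] += sigma[node]
    return dist, sigma


def betweenness_centrality(adjacency):
    """Sum over unordered pairs (j, k) of g_jik / g_jk for every other vertex i."""
    info = {v: shortest_path_counts(adjacency, v) for v in adjacency}
    scores = {v: 0.0 for v in adjacency}
    for j, k in combinations(adjacency, 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j:
            continue  # j and k lie in different components
        dist_k, sigma_k = info[k]
        for i in adjacency:
            if i == j or i == k or i not in dist_j:
                continue
            # i lies on a j-k geodesic exactly when the two legs add up to d(j, k).
            if dist_j[i] + dist_k[i] == dist_j[k]:
                scores[i] += sigma_j[i] * sigma_k[i] / sigma_j[k]
    return scores


# Hypothetical network: X bridges two pairs of coauthors, {A, B} and {C, D}.
toy_network = {
    "A": {"B", "X"},
    "B": {"A", "X"},
    "X": {"A", "B", "C", "D"},
    "C": {"D", "X"},
    "D": {"C", "X"},
}

scores = betweenness_centrality(toy_network)
print(scores)
```

Every geodesic between the two groups runs through X, so X collects a full unit from each of the four cross-group pairs while all other authors score zero.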

Table 2 summarizes the macro-level and micro-level metrics for coauthorship network analysis discussed in the paragraphs above.

Table 2 A summary of macro and micro level metrics

Results and analysis

An overview

A simple and widely used coauthorship network model is based on an undirected, binary graph in which each edge represents a coauthorship relationship. If any two authors have coauthored an article, an unweighted edge is created between them. In this study, we take this approach in constructing the coauthorship network of LIS studies in China. The whole coauthorship network contains 11,067 vertices and 14,471 edges (authors without collaboration are excluded). It should be noted that a vertex may stand for multiple authors, because different authors may have the same name. Although this probability is not high (around 0.01%), its effect is magnified in degree centrality, as we discuss later.
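The construction just described can be sketched in a few lines of Python. The sketch assumes each paper is available as a list of author names; the sample records, the placeholder names, and the function name `build_coauthorship_network` are hypothetical. Repeated collaborations collapse into a single unweighted edge, and authors who never collaborate do not enter the network.

```python
from itertools import combinations


def build_coauthorship_network(papers):
    """Undirected, binary graph: one unweighted edge per pair of coauthors."""
    adjacency = {}
    for authors in papers:
        unique = sorted(set(authors))
        if len(unique) < 2:
            continue  # single-author papers contribute no collaboration
        for a, b in combinations(unique, 2):
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency


# Hypothetical paper records (author lists only).
papers = [
    ["Author A", "Author B"],
    ["Author A", "Author B"],              # repeat collaboration: still one edge
    ["Author B", "Author C", "Author D"],  # three authors: three edges
    ["Author E"],                          # solo paper: author is excluded
]

network = build_coauthorship_network(papers)
edge_count = sum(len(neighbours) for neighbours in network.values()) // 2
print(len(network), "vertices,", edge_count, "edges")
```

Because vertices are keyed by name alone, two different authors with the same name would be merged into one vertex, which is the limitation noted above.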

Macro-level structure analysis

In this study, we focus on four key elements of networks: component, distance, degree distribution, and error and attack tolerance. As observed in previous research on coauthorship networks, the LIS coauthorship network is not a single connected graph. The largest component of the network has 5,408 authors, and the second largest component has 42 authors. The scatter plot of Lg (Author per Component) against Lg (No. of Component) is shown in Fig. 1.

Fig. 1 Distribution of components

In this study, the largest component contains 48.87% of all authors, which is consistent with previous findings. Two factors affect this ratio. The first is the number of papers under study: an inadequate time span or too few source journals result in a lower ratio. Our initial analysis included only the data from 2002 to 2006, and as a result the largest component had a ratio of only 31.01%. The second is the nature of the discipline under study: experimental research involves more authors, so disciplines such as biology and chemistry have a larger largest component.

We calculate the mean distance of the largest component, which is 8.8414. The distribution of the mean distance of each author is shown in Fig. 2.

Fig. 2 Distance distribution of largest component

The clustering coefficient of the LIS coauthorship network is 0.425, which means that two authors are much more likely to have collaborated if they have a third common collaborator than are two authors chosen at random from the community. We also construct a random network with the same number of vertices and edges as the LIS coauthorship network. The random network has a mean distance of 13.9, compared with which the mean distance of the LIS coauthorship network is 36.69% shorter, and a clustering coefficient of 0.012, only about a thirtieth of that of the LIS coauthorship network. This finding is consistent with Albert and Barabási (2002), who find that the clustering coefficient of real-world networks is higher than that of random networks. Based on this, we conclude that the LIS coauthorship network in China is a small-world network.

Table 3 shows the degree distribution of all components in the LIS coauthorship network.

Table 3 Degree distribution of LIS coauthorship

The table shows that 43.04% of the authors collaborate with only one other author, and 80% of the authors in the network collaborate with fewer than three authors. Eight authors collaborate with more than 30 authors; they are the hubs of the network. The relationship between the two variables fits the curve \( p(k) = 2.0020\,k^{ - 2.6992} \), with \( R^{2} = 0.9611 \). This shows that the LIS coauthorship network possesses some characteristics of a scale-free network. The result is also consistent with Price’s network of citations, which is probably the earliest published example of a scale-free network (Price 1965); he quoted a value of α = 2.5–3 for the exponent of his network. Other research on scale-free networks has also confirmed his finding (Newman 2003).

To test the error and attack tolerance of the LIS coauthorship network, we conduct two sets of experiments. First, we randomly delete a certain number of vertices from the largest component of the LIS coauthorship network, with the decrements shown in the right column of Table 4. Each decrement is taken from the largest component remaining after the previous step. At each step, we record the size of the largest remaining component, the number of components, and the mean distance and diameter of the largest remaining component. We then start again from the original largest component and delete the hubs of the network. The decrements shown in Table 5 represent the deletion of vertices with more than 30 ties, more than 20 ties, more than 15 ties, more than 10 ties, and more than 6 ties respectively; we record the same set of data as in the random deletion.
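The two deletion experiments can be sketched as follows. This is a simplified illustration under stated assumptions: it records only the component sizes after a deletion (not the mean distance or diameter), uses a hypothetical ten-author network with two hubs, and fixes a random seed so that the random deletion is repeatable. The function names and the tie threshold of 3 are chosen for the toy example and are not the settings of Tables 4 and 5.

```python
import random
from collections import deque


def remove_vertices(adjacency, removed):
    """Copy of the network without the removed vertices and their ties."""
    removed = set(removed)
    return {v: nbrs - removed for v, nbrs in adjacency.items() if v not in removed}


def component_sizes(adjacency):
    """Sizes of all components, largest first."""
    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def random_deletion(adjacency, count, seed=0):
    """Error experiment: delete `count` vertices chosen at random."""
    rng = random.Random(seed)
    return remove_vertices(adjacency, rng.sample(sorted(adjacency), count))


def hub_deletion(adjacency, min_ties):
    """Attack experiment: delete every vertex with more than `min_ties` ties."""
    hubs = [v for v, nbrs in adjacency.items() if len(nbrs) > min_ties]
    return remove_vertices(adjacency, hubs)


# Hypothetical network: two linked hubs, each with four single-tie coauthors.
toy_network = {"H1": {"H2"}, "H2": {"H1"}}
for hub, prefix in (("H1", "a"), ("H2", "b")):
    for index in range(1, 5):
        leaf = f"{prefix}{index}"
        toy_network[hub].add(leaf)
        toy_network[leaf] = {hub}

after_random = random_deletion(toy_network, 2)
after_hubs = hub_deletion(toy_network, 3)
print("random deletion:", component_sizes(after_random))
print("hub deletion:   ", component_sizes(after_hubs))
```

Both runs delete two vertices, but deleting the two hubs leaves eight isolated authors, whereas a random pair usually leaves most of the network connected.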

Table 4 Results of random deletion
Table 5 Results of the deletion of hubs

Table 4 shows that the size of the largest remaining component decreases gradually, while the number of components and the mean distance and diameter of the largest remaining component increase gradually or with some fluctuation as more vertices are deleted. The network breaks down when about 78% of its vertices are deleted. The same pattern appears for the deletion of hubs shown in Table 5. However, although far fewer vertices are deleted in the second experiment, the decrease in the size of the largest remaining component and the increases in the number of components and in the mean distance and diameter of the largest remaining component are sharper when hubs are deleted. The figures indicate that the loss of hubs in the LIS coauthorship network results in considerable fragmentation; this suggests that the community is fragile and depends on central figures to maintain connection across the entire network. When additional, less central vertices are removed, however, fragmentation proceeds much more slowly.

Micro-level structure analysis

We apply three centrality measures (degree, closeness, and betweenness centrality) to investigate the different contributions of vertices in the coauthorship network. We also retrieve the cumulative citation counts of each author’s publications in the CSSCI database from 2002 to 2006. Because of a limitation inherent to CSSCI, citations to a paper can be counted only for its first author.

The effect of identical names is magnified in the degree centrality analysis. Li Yong and Zhou Ningli in Table 6, for example, are common names in China, which suggests that each may stand for more than one author in the network. Some anomalous data also occur: the citation counts of Hu Tiejun, Zhao Pengmo, Hu Dehua, and Xie Yangqun are very small compared with their degree centrality. This may be because they are not the first author of most of the papers they wrote, so CSSCI cannot count the citations to these papers for them.

Table 6 Degree centrality (top 41 authors)

In Table 7, we display the top 20 authors with the highest closeness centrality.

Table 7 Closeness centrality (top 20 authors)

Some anomalous data also show up in this table: there are discrepancies between closeness centrality and citation counts for Jiang Enbo and Feng Ying. This results from the way closeness centrality is calculated. Closeness centrality measures the distance from an author to all other authors in the network, so an author who coauthors with an author of high closeness centrality also has a high closeness centrality. In this analysis, Feng Ying coauthored only once with Chen Ling, who has a high closeness centrality; Jiang Enbo coauthored once with Zhang Xiaolin, who also has a high closeness centrality. They therefore have high closeness centrality but low citation counts. Based on this, we conclude that central authors usually have high closeness centrality (Yin et al. 2006), but high closeness centrality does not necessarily indicate that an author is central to the network.

In Table 8, we display the top 20 authors with the highest betweenness centrality in the largest component.

Table 8 Betweenness centrality (top 20 authors)

The largest component contains 5,408 vertices, of which 2,134 have nonzero betweenness centrality. This indicates that removing any of these vertices will increase distances in the network, while removing any of the other 3,274 will not.

Applying centrality to impact analysis

Before the advent of mature network models and theories, scientific evaluation was based on comparatively simple and static data, such as citation counts, publication counts, and impact factors. Now, with the help of high-performance computers and advanced network theories, we are able to construct networks with thousands or even millions of vertices. Accordingly, many characteristics and properties of networks can be applied to scientific evaluation. Centrality analysis in particular has characteristics that make it a suitable indicator for scientific evaluation.

To verify the correlation of centrality with citation counts, we conduct a correlation analysis of the two values for the authors in Tables 6, 7, and 8. The results are shown in Table 9.

Table 9 Partial correlation between each pair of centrality measures

Table 9 shows that the three centrality measures correlate with citation counts, with betweenness centrality having the most significant correlation. The high correlation of citation counts with centrality suggests that centrality measures also assess, to a certain degree, an author’s scientific productivity and quality. They can serve as indicators, or at least as supplementary indicators, for impact evaluation, providing alternative perspectives to current methods.
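For readers who wish to run a comparable check on their own data, the sketch below computes a rank correlation between a centrality measure and citation counts. It assumes Spearman's coefficient, computed as the Pearson correlation of average ranks; this section does not state which coefficient underlies Table 9, so the choice is an assumption. The numbers are invented for illustration and are not taken from Tables 6 to 9.

```python
def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented illustration data for five hypothetical authors.
betweenness = [10, 20, 30, 40, 50]
citations = [1, 4, 9, 16, 100]

print(spearman(betweenness, citations))
```

A rank-based coefficient is used here because citation counts are typically highly skewed, so a few heavily cited authors would otherwise dominate a product-moment correlation.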

Table 10 shows the rankings of citations and three centrality measures (R D stands for the rank results of degree centrality, R C for closeness centrality, R B for betweenness centrality, and R CC for citation counts within top 20 authors).

Table 10 Overall ranking of top 20 authors in the network

In general, the ranking by citation counts matches the rankings obtained from the centrality measures. There are some discrepancies, however, especially for individual authors. Although the motivation for citation varies, citation counts measure the quality and impact of articles (Garfield and Sher 1963; Frost 1979; Lawani and Bayer 1983; Baird and Oppenheim 1994). Centrality, by contrast, measures both article impact and an author’s field impact: degree centrality measures an author’s collaboration scope, closeness centrality measures an author’s position and virtual distance from others in the field, and betweenness centrality measures an author’s importance to other authors’ communication. Hence, centrality has its value in impact evaluation, since it integrates both article impact and an author’s field impact, which is usually difficult to measure. Their relationship is illustrated in Fig. 3.

Fig. 3 Relation between citation and centrality

The quality of an article is subjective, yet we can measure it indirectly through article impact, which can be quantified by citation counts. Like article quality, an author’s field reputation is difficult to assess, but we can assess it through social capital (Burt 1980; Burt 2002; Cronin and Shaw 2002). Accordingly, centrality measures integrate both article impact (citation counts) and social capital (the author’s field impact), as displayed in Fig. 3.

Another factor contributing to the discrepancy lies in the algorithms of the centrality measures, which have some inherent drawbacks. Authors of papers with multiple coauthors have high degree centrality, and the effect is magnified when a paper has many authors. For instance, if a paper is coauthored by 10 authors, each of them has a degree centrality of 9. The paper creates as many ties as 45 papers coauthored by just two authors each, yet the two situations obviously reflect quite different academic impact. Closeness centrality is a measure of a network property rather than a direct measure of academic impact. Authors involved in interdisciplinary research have a high betweenness centrality even though their role in a specific discipline may not be that significant. In future studies, it will be necessary to improve the algorithms in order to improve the reliability of centrality measures. Some scholars have already embarked on this. Newman (2005) proposed a new betweenness measure that includes contributions from essentially all paths between nodes, not just the shortest, while giving more weight to short paths. Brandes (2008) introduced variants of betweenness measures, such as endpoint betweenness, proxies betweenness, and bounded distance betweenness.
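The arithmetic behind the ten-author example can be written out as follows. In the binary model used here, a paper with m authors links every pair of its authors, so each author gains m − 1 ties and the paper as a whole contributes one tie per pair (m is a symbol introduced for this illustration):

$$ C_{D} (n_{i} ) = m - 1,\qquad {\text{ties per paper}} = \binom{m}{2} = \frac{m(m - 1)}{2} $$

$$ m = 10:\qquad C_{D} (n_{i} ) = 9,\qquad \binom{10}{2} = \frac{10 \times 9}{2} = 45 $$

A two-author paper contributes exactly one tie, which is why one ten-author paper adds as many ties as 45 two-author papers, assuming none of the pairs had already collaborated.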

Conclusion

Using data from 18 core source journals in the field of LIS covering 6 years, we construct the coauthorship network for LIS in China. This network provides rich information about the collaboration patterns of LIS studies in China. We find that it is a small-world network, which indicates that there are many shortcuts in the network through which authors can reach others, or be reached, with far fewer ties. It also indicates that information flows more quickly and more directly in this network, which may inspire more collaboration in the foreseeable future. Through degree distribution analysis, we find that this network also possesses the characteristics of scale-free networks, in which a few authors collaborate widely while most authors collaborate with a limited number of authors.

We also analyze the network at the micro level through degree, closeness, and betweenness centrality. We conduct a correlation analysis between citation counts and centrality values and find that they are highly correlated. Centrality originated in sociology (Freeman 1977, 1979), yet centrality analysis is relatively new to coauthorship network analysis. Several papers have applied centrality measures to coauthorship network analysis (Mutschke 2003; Liu et al. 2005; Yin et al. 2006; Liu et al. 2007), and they all found that centrality measures are useful for impact evaluation. In our study, we find that citation counts are correlated with centrality measures, yet some discrepancies occur for individual authors. On the one hand, this is because citation and centrality evaluate different content; on the other hand, some drawbacks of the centrality algorithms add noise to the results.

The data and method used in this study reflect the collaboration patterns of LIS studies in China only for a given time period; the analysis is static rather than dynamic. In future studies, we intend to apply PageRank to impact analysis. PageRank brought a new approach to information retrieval for better ranking of Web pages. It is query independent and content free. PageRank has been deployed in bibliometrics to evaluate research impact (Liu et al. 2005; Yin et al. 2006; Chen et al. 2007). The PageRank algorithm can be viewed as a state space system of the form:

$$ x(n + 1) = dTx(n) + b $$

The state x(n) is a Markov vector of length ν, T is a Markov matrix on \( \mathbb{R}^{\nu } \), and d, the so-called damping factor, is a scalar in [0, 1]. Furthermore, b is a vector of length ν consisting of positive numbers that sum to 1 − d. PageRank provides an integrated algorithm that combines simple counting (the b part) and the topology of the network (the dTx(n) part) in a simple and efficient way. The damping factor allows us to tune the algorithm to the specific needs of an application, determining whether and to what extent the topology of the network is considered. In the future, we are interested in using PageRank and different weighted PageRank algorithms to analyze coauthorship networks.
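A minimal sketch of this iteration for an unweighted coauthorship network is given below. It rests on several assumptions that are not fixed by the description above: T spreads each author's score equally over that author's coauthors, b is uniform with every entry equal to (1 − d)/ν, the damping factor defaults to the commonly used value 0.85, and every vertex has at least one tie. The toy network and the function name `pagerank` are hypothetical.

```python
def pagerank(adjacency, damping=0.85, tolerance=1e-12, max_iterations=1000):
    """Iterate x(n + 1) = d * T * x(n) + b until the scores stop changing."""
    nodes = sorted(adjacency)
    n = len(nodes)
    b = (1.0 - damping) / n            # uniform b, entries sum to 1 - d
    x = {v: 1.0 / n for v in nodes}    # start from the uniform Markov vector
    for _ in range(max_iterations):
        new_x = {v: b for v in nodes}
        for v in nodes:
            # Column of T for v: equal weight on each coauthor of v.
            share = damping * x[v] / len(adjacency[v])
            for neighbour in adjacency[v]:
                new_x[neighbour] += share
        change = sum(abs(new_x[v] - x[v]) for v in nodes)
        x = new_x
        if change < tolerance:
            break
    return x


# Hypothetical star network: H has coauthored with A, B and C.
star = {"H": {"A", "B", "C"}, "A": {"H"}, "B": {"H"}, "C": {"H"}}

ranks = pagerank(star)
for author in sorted(ranks, key=ranks.get, reverse=True):
    print(author, round(ranks[author], 4))
```

Setting the damping factor to 0 reduces the result to the uniform vector b, so that the topology is ignored entirely, while values close to 1 let the network structure dominate.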