Introduction

A dissertation demonstrates both an original contribution to knowledge and substantial subject knowledge in a discipline, and it provides evidence of significant scholarly achievement (Kushkowski et al. 2003). The doctoral dissertation is not only a measure of research output but also a measure of the production of qualified manpower, a resource that is essential for contemporary knowledge societies (Andersen and Hammarfelt 2011). For these reasons, studies of doctoral dissertations in different disciplines and countries (Anwar 2004; Breimer 1996; Herubel 2007; Macauley et al. 2010; Villarroya et al. 2008; Yaman and Atay 2007) have received considerable attention. Doctoral dissertations in Library and Information Science (LIS) are also a valuable research subject; for example, Anwar (2004) examined the publication output that LIS graduates derived from their doctoral dissertations.

Since the “opening-up” of China to the West in 1978, Library and Information Science in China has developed rapidly along with the growth of the Chinese economy (Wu and Yuan 1994). In 1990, Peking University and Wuhan University were authorized to grant the Library Science doctorate and the Information Science doctorate, respectively, and the recruitment of LIS doctoral students in China started in 1991. Now, 20 years later, understanding the development of the discipline is particularly necessary. Doctoral dissertations offer unique insight into the field by revealing the foci of research and instruction within the institutions that produce LIS scholars (Finlay et al. 2012), and serve a critical function in the exploration of disciplinary development (Sugimoto et al. 2011). Thus, in this study, we map the intellectual structure of the research fields of LIS doctoral dissertations in China by using co-word analysis. This will contribute to a better understanding of the field and provide a basis for its future development. The remainder of this paper is organized as follows. First, we review the literature. Next, we introduce the methodology, including data collection, data processing, and the method of data analysis. Then we analyze the dataset with descriptive statistics, cluster analysis, a strategic diagram, and social network analysis (SNA), and interpret the results. Finally, we draw our conclusions in the last section.

Literature review

This section reviews co-word analysis and its latest developments (especially Latent Dirichlet Allocation (LDA), a recently developed approach to topic modeling) and bibliometric research on dissertations (especially LIS doctoral dissertations).

Co-word analysis and its latest development

Similar to the method of co-citation analysis (Small 1973; Small and Griffith 1974), co-word analysis rests on the assumption that a paper’s keywords constitute an adequate description of its content. Two keywords co-occurring within the same paper indicate a link between the topics to which they refer (Cambrosio et al. 1993). Many co-occurrences of a pair of words within scientific papers suggest that the pair may correspond to a research theme. The main feature of co-word analysis is that it visualizes the intellectual structure of a specific discipline as maps of the conceptual space of the field, and that a time series of such maps traces the changes in this conceptual space (Ding et al. 2001). Many researchers have used co-word analysis to explore the concept network in different fields, for instance, biology (An and Wu 2011; Cambrosio et al. 1993; Rip and Courtial 1984), education (Ritzhaupt et al. 2010), and Library and Information Science (Ding et al. 2000; Milojevic et al. 2011; Uzun 2002; Zhao and Zhang 2011). With the support of the strategic diagram and SNA, co-word analysis, unlike other co-occurrence methods, is able to visualize the intellectual structure of a specific discipline by measuring the association strength of keywords from publications in a research area (Liu et al. 2011).

A topic model is a statistical model for discovering the scientific topics in a literature corpus, and it can be considered the latest development of co-word analysis. One recently developed topic modeling approach is LDA, which was proposed by Blei, Ng, and Jordan (Blei et al. 2003) as a generative probabilistic model for identifying the topics in a set of documents. LDA has been used to study the topical structure of a field, and it performs well. Griffiths and Steyvers (2004) presented a statistical inference algorithm for LDA to analyze abstracts from PNAS. The topics recovered by their algorithm picked out meaningful aspects of the structure of science and revealed some of the relationships between scientific papers in different disciplines. Zheng et al. (2006) extracted a set of major semantic concepts from a protein-related corpus of text words from MEDLINE titles and abstracts by applying the LDA model. They found that the identified concepts are semantically coherent, and most of them are biologically relevant. Extensions of LDA have also been used to understand (Sugimoto et al. 2011) correlations between topics (Blei and Lafferty 2007), authors (Rosen-Zvi et al. 2004), academic networks (Tang et al. 2008), and changes in topics over time (Pruteanu-Malinici et al. 2010; Rzeszutek et al. 2010). Moreover, topic modeling approaches are also used in the field of community detection (Ding 2011), and some models (e.g., CTM, DCTM, etc.) have been proposed to better understand the dynamic features of social networks and to make improved personalized recommendations (Li et al. 2012).

Bibliometric research on LIS dissertations

As a bibliometric approach, citation analysis of LIS dissertations has been conducted to study the sources and the rankings of disciplines and authors (Buttlar 1999; Gao et al. 2009; Sugimoto 2011). A few studies have investigated the topics of LIS dissertations. Schlater and Thomison investigated the methods used in Library Science dissertations (Schlater and Thomison 1974, 1982). Franklin and Jaeger (2007) examined the LIS doctoral dissertations written by African American women between 1993 and 2003, and the research fields were divided into five categories of research topics: information issues, library/librarianship issues, literature, and technology. Sugimoto et al. (2011) identified changes in dominant topics in LIS over time by analyzing the 3,121 doctoral dissertations completed between 1930 and 2009 at North American Library and Information Science programs. In that study, core research areas (library history, citation analysis, and information-seeking behavior) were identified; meanwhile, one of the notable changes in the topics was the diminishing use of the word library (and related terms). Finlay et al. (2012) examined the topicality of LIS dissertations written between 1930 and 2009 at schools with American Library Association (ALA)-accredited university programs in North America. Their results indicated that the percentage of dissertations containing no instance of any of the selected library keywords had steadily risen since 1980; similarly, the percentage of dissertations containing instances of keywords in both the title and abstract had steadily declined.

Similarly, some researchers have analyzed doctoral dissertations in Library Science or Information Science in China from a quantitative perspective. Gao et al. (2009) conducted a citation analysis of 14 doctoral dissertations in LIS at Wuhan University; the results revealed that the cited literature came primarily from Chinese sources. Based on 110 doctoral dissertations in Information Science from the China Doctoral Dissertations Full-text Database (CDFD), Wanfang Data, and the National Science and Technology Library (NSTL), Wang et al. (2009) revealed the distribution of time and themes, as well as the research focuses according to keyword frequency. They found that knowledge management and information services were the two most frequent keywords. Jin (2010) gathered 256 LIS doctoral dissertations in China from 1994 to 2010 and investigated the method systems of these dissertations; the results showed that researchers paid close attention to research methods but ignored methodology. Similarly, Yang (2011) surveyed the research methods of 70 LIS doctoral dissertations from the National Science Library of the Chinese Academy of Sciences from 2000 to 2009, and the findings indicated that conventional methods, such as investigation and experimental methods, were mainly used. Ye (2011) conducted a statistical analysis of Library Science doctoral dissertations published in CDFD, Wanfang Data, and the National Library of China (NLC) from 1994 to 2010, and analyzed the keywords by word frequency statistics. The study found that the keywords publishing, digital libraries, library and information management had frequencies exceeding 10, and that these high-frequency keywords reflect the hot research topics in Library Science.

However, until now, little attention has been paid to the internal and external relationships of research fields in LIS doctoral dissertations in China. Furthermore, the datasets of previous studies in China came only from public doctoral dissertation databases, such as CDFD, Wanfang Data, and NLC, so the data used in these studies were limited. This paper aims to map the intellectual structure of the research fields of LIS doctoral dissertations in China. To obtain more data than previous studies, we gather doctoral dissertations not only from public degree databases but also from the degree databases provided by the universities/institutes that have been authorized to grant LIS doctoral degrees.

Methodology

Data collection and process

Keywords are the most important research elements in co-word analysis and should be extracted from publications once the research area is selected. We gathered data (1994–2011) from 16 databases: six public degree databases and ten degree databases provided by the universities/institutes that have been authorized to grant LIS doctoral degrees.

In China, doctoral dissertations should be submitted to the library/archive of the university/institute when the authors obtain their doctoral degrees. To date, nine universities/institutes have been authorized to grant Library Science, Information Science, or LIS doctoral degrees in China. They are Wuhan University (WHU), Nanjing University (NJU), Peking University (PKU), the National Science Library of Chinese Academy of Sciences (NSLC), Nankai University (NanKai), Jilin University (JLU), Renmin University of China (RUC), Central China Normal University (CCNU), and Sun Yat-sen University (SYSU). Table 1 shows the degree databases of the nine universities/institutes.

Table 1 Doctoral dissertation database of LIS of the nine Universities/Institute in China

In addition to being submitted to the library/archive of the degree-conferring universities/institutes, some doctoral dissertations (full text or bibliographic record) are also submitted to public degree databases. There are six common public degree databases: CDFD, Wanfang Data, NLC, NSTL, the China Academic Library and Information System (CALIS), and the Institute of Scientific and Technical Information of China (ISTIC). Table 2 shows the six public degree databases in China.

Table 2 Public database of doctoral dissertations in China

We obtained the dissertation data in four steps. First, we gathered the bibliographic records of LIS doctoral dissertations from the 16 degree databases. Second, we merged these data and removed duplicates. Some bibliographic records in the public databases, especially the “Keywords” field, are not correct. Therefore, third, we checked all keywords in the dataset one by one by reading the full text or the first 16/24 pages available in the databases above. Unfortunately, some doctoral dissertations still had no full text (or no first 16/24 pages) in the databases or on the Internet. Thus, fourth, we obtained these dissertations through document delivery services to fill the gap.

In this study, a program in Ruby was developed to process the raw data. BibExcel (Persson et al. 2009) was then employed to count how often two keywords appeared together in the same doctoral dissertation. Subsequently, a symmetrical co-occurrence matrix was built from these counts. The value of each cell is the number of dissertations in which the two keywords both appear. The higher the co-occurrence frequency of two keywords, the closer the relationship between them (Ding et al. 2001). The symmetrical co-occurrence matrix was then transformed into a correlation matrix by using the equivalence index (Cahlik 2000; Callon et al. 1991; Coulter et al. 1998). The equivalence index \( E_{ij} \) describes the strength of the association between words i and j in each word pair ij (Neff and Corley 2009; Callon et al. 1991):

$$ E_{ij} = \frac{C_{ij}^{2}}{C_{ii}\,C_{jj}}. $$

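Here \( C_{ij} \) is the number of dissertations in which keywords i and j co-occur, and \( C_{ii} \) and \( C_{jj} \) are the occurrence frequencies of keywords i and j. As with the co-occurrence matrix, the value of each cell in the correlation matrix indicates the closeness of two keywords: the higher the value, the closer the relationship between them.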
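The construction of the co-occurrence matrix and its transformation by the equivalence index can be illustrated with a minimal sketch. It is written in Python rather than with the Ruby program and BibExcel used in this study; the function names and the toy keyword lists are hypothetical and serve only to show the arithmetic.

```python
from itertools import combinations


def cooccurrence_matrix(docs, vocab):
    """Count, for every keyword pair, the dissertations containing both.

    The diagonal C[i][i] holds the number of dissertations containing keyword i.
    """
    index = {word: k for k, word in enumerate(vocab)}
    n = len(vocab)
    C = [[0] * n for _ in range(n)]
    for keywords in docs:
        # Each keyword is counted at most once per dissertation.
        present = sorted({index[w] for w in keywords if w in index})
        for i in present:
            C[i][i] += 1
        for i, j in combinations(present, 2):
            C[i][j] += 1
            C[j][i] += 1  # keep the matrix symmetric
    return C


def equivalence_matrix(C):
    """E_ij = C_ij^2 / (C_ii * C_jj); 0 when either keyword never occurs."""
    n = len(C)
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = C[i][i] * C[j][j]
            if denom:
                E[i][j] = C[i][j] ** 2 / denom
    return E


# Toy data (hypothetical, not taken from the dataset of this study).
docs = [
    ["ontology", "semantic web", "digital library"],
    ["ontology", "semantic web"],
    ["digital library", "knowledge management"],
    ["knowledge management"],
]
vocab = ["ontology", "semantic web", "digital library", "knowledge management"]
C = cooccurrence_matrix(docs, vocab)
E = equivalence_matrix(C)
```

In this toy example, "ontology" and "semantic web" always appear together, so their equivalence index is 1, whereas "ontology" and "knowledge management" never co-occur and receive 0.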

We converted the co-occurrence matrix to a binary matrix with the program developed in Ruby. The average number of co-occurrences between the high-frequency keywords (in the co-occurrence matrix) was 0.41, so we set one as the threshold. If the value of a cell in the co-occurrence matrix was less than one, the value of the corresponding cell in the binary matrix was zero; otherwise, it was one.
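The thresholding rule can be sketched as follows. This is an illustrative Python version, not the Ruby program used in this study; the function name and the small count matrix are hypothetical, and the treatment of the diagonal is left unchanged here because the text does not specify it.

```python
def binarize(matrix, threshold=1):
    """Return a 0/1 matrix: 1 where the co-occurrence count reaches the threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]


# Hypothetical co-occurrence counts for three keywords.
counts = [
    [5, 2, 0],
    [2, 4, 1],
    [0, 1, 3],
]
binary = binarize(counts)
```

With a threshold of one, any pair of keywords that co-occurs at least once is linked in the binary network.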

Method of data analysis

Similar to other studies using co-word analysis, we chose hierarchical cluster analysis, the strategic diagram, and SNA. Hierarchical cluster analysis and the strategic diagram are commonly used in co-occurrence analysis.

A hierarchical clustering algorithm helps us find the clusters, and the result of the clustering can be displayed graphically as a tree that shows the merging process and the intermediate clusters (Yang et al. 2012). Hierarchical clustering coupled with co-word analysis has been used widely in many studies (An and Wu 2011; Ding et al. 2001; Milojevic et al. 2011).

The strategic diagram is mainly used to describe the internal relations within a certain field and the interactions between fields (Law et al. 1988). This diagram (Bredillet 2009) is created by putting the strength of global context on the X axis (called centrality) and putting the strength of local context on the Y axis (called density). Two kinds of indexes (density and centrality) are used to measure the strength of local context and global context, respectively. Centrality is used to measure the strength of a subject area’s interaction with other subject areas. The value of the centrality of a given cluster can be the sum of all external link values (Courtial et al. 1993; Turner 1988) or the square root of the sum of the squares of all external link values (Coulter et al. 1998). In this study, we take the square root of the sum of the squares of all external link values (Coulter et al. 1998) as centrality. Density is used to measure the strength of the links that tie together the words making up the cluster; that is the internal strength of a cluster (He 1999). The density value can be the average value (mean) of internal links (Coulter et al. 1998; Turner 1988), the median value of internal links (Courtial et al. 1993), or the sum of the squares of the value of internal links (Bauin et al. 1991). In this paper, we take the average value (mean) of internal links as density (Coulter et al. 1998).
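The two indicators chosen above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the link values are taken from a symmetric matrix such as the equivalence matrix, every keyword pair inside a cluster counts as an internal link (including pairs with value zero), and the function names and toy matrix are hypothetical.

```python
from math import sqrt


def cluster_density(E, members):
    """Mean of the link values between keyword pairs inside the cluster."""
    links = [E[i][j] for a, i in enumerate(members) for j in members[a + 1:]]
    return sum(links) / len(links) if links else 0.0


def cluster_centrality(E, members):
    """Square root of the sum of squared link values to keywords outside the cluster."""
    inside = set(members)
    outside = [k for k in range(len(E)) if k not in inside]
    return sqrt(sum(E[i][j] ** 2 for i in members for j in outside))


# Hypothetical link values for four keywords in two clusters.
E = [
    [1.0, 0.5, 0.3, 0.0],
    [0.5, 1.0, 0.4, 0.0],
    [0.3, 0.4, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]
cluster_a, cluster_b = [0, 1], [2, 3]
density_a = cluster_density(E, cluster_a)
centrality_a = cluster_centrality(E, cluster_a)
```

For cluster A the only internal link is 0.5, which is its density, and its external links 0.3 and 0.4 give a centrality of the square root of 0.09 + 0.16, that is, 0.5.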

Social network analysis assesses the unique structure of interrelationships among individuals (Lurie et al. 2009), and has been extensively used in social science, management science, scientometrics, etc. SNA can also map the network by using methods of information visualization. We map the co-word network to show the relationships among research topics. In addition, k-core analysis is commonly used in SNA. A k-core is a maximal group of nodes, all of which are connected to at least k other nodes in the group (Eschenfelder 1980; Maimon and Rokach 2005). By varying the value of k (that is, the number of group members to which each member must be connected), different pictures can emerge. As the value of k becomes larger, group sizes decrease and the relationships among the members become tighter (Yang et al. 2012). In bibliometrics, some studies have investigated hot research topics through co-word analysis coupled with k-core analysis (Yang et al. 2012; Zhao and Zhang 2011).
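The definition of a k-core can be made concrete with a small sketch that assigns each node the largest k for which it belongs to a k-core, by repeatedly removing the node with the fewest remaining connections. This is a plain Python illustration of the standard peeling procedure, not the routine implemented in Ucinet; the function name and the toy graph are hypothetical.

```python
def core_numbers(adjacency):
    """Return {node: k}, where k is the largest k-core the node belongs to.

    `adjacency` maps each node to the set of its neighbours (undirected graph).
    """
    degree = {node: len(neigh) for node, neigh in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel off the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


# Toy network: a triangle (a, b, c) with one pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cores = core_numbers(graph)
```

Here the triangle forms a 2-core, while the pendant node belongs only to the 1-core, which mirrors the distinction between core and peripheral themes.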

In this study, the hierarchical cluster analysis and the strategic diagram were produced with SPSS 20. The network maps were obtained by analyzing the original co-occurrence matrix and the binary matrix with Ucinet 6.0 (Borgatti et al. 2002).

Results and discussion

Descriptive statistics of doctoral dissertations and keywords

We obtained 640 LIS doctoral dissertations in this study. Table 3 shows the distribution of institutions to which these dissertations belong. As shown in Table 3, Wuhan University has the largest number of LIS doctoral dissertations, indicating it is the most important institution of LIS doctoral education in China.

Table 3 Distribution of LIS doctoral dissertations among universities/institutes in China

We could not obtain the keywords of two dissertations. In total, we obtained 3,015 keywords (4.7 keywords per dissertation) from the remaining 638 dissertations, and took these 3,015 keywords as the data sample for co-word analysis. Because keyword indexing is not unified, we standardized the keywords by merging synonyms (e.g., “Bibliometric analysis” is replaced by “Bibliometric”). Finally, the 56 keywords with a frequency of more than six were selected, as shown in Table 4. These 56 keywords occur 612 times in total (about 20.3 % of all keyword occurrences), covering the main research topics of LIS doctoral dissertations in China. Note that the keywords “library”, “China”, “countermeasure”, “Information Science” and “information study” have very broad meanings. Such keywords are not informative for this study, and we excluded them from the analysis below.
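The standardization and selection steps can be sketched as follows. This is a minimal Python illustration, not the procedure actually run in this study: the synonym table contains only the one example given above, lowercasing is an added assumption, and the function names and toy keyword lists are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table; only the example from the text is included.
SYNONYMS = {"bibliometric analysis": "bibliometric"}


def standardize(keyword):
    """Normalize case and whitespace, then map synonyms to one preferred form."""
    key = keyword.strip().lower()
    return SYNONYMS.get(key, key)


def high_frequency_keywords(docs, min_count):
    """Count each keyword once per dissertation; keep those occurring more than min_count times."""
    counts = Counter()
    for keywords in docs:
        counts.update({standardize(k) for k in keywords})
    return {word: n for word, n in counts.items() if n > min_count}


# Toy keyword lists (hypothetical).
docs = [
    ["Bibliometric analysis", "Ontology"],
    ["bibliometric"],
    ["Ontology"],
    ["Open access"],
]
selected = high_frequency_keywords(docs, min_count=1)
```

In the study itself the cut-off is a frequency of more than six, which corresponds to `min_count=6` in this sketch.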

Table 4 The top 56 keywords

Words with a high frequency of occurrence and co-occurrence can reflect research focuses to some extent. The ten keywords with the highest frequency of occurrence are knowledge management (39), digital library (34), network (31), ontology (22), information service (20), evaluation (20), electronic government (19), information resource (16), competitive intelligence (16), and library (15). The top ten keywords with high frequency of co-occurrence are network (32), knowledge management (32), digital library (30), information resource (26), ontology (25), knowledge organization (20), information resource management (19), electronic government (18), information retrieval (16), and evaluation (19). Note that knowledge management, digital library, network, ontology, evaluation, electronic government, and information resource have high frequencies of both occurrence and co-occurrence, indicating that these research topics are major focuses and serve as bridges connecting other research topics (Liu et al. 2011) in LIS doctoral dissertations in China.

Cluster analysis

We conducted hierarchical cluster analysis with Ward’s method (Ding et al. 2001; Gordon 1996; Neff and Corley 2009; Lee and Jeong 2008; Liu et al. 2011), using squared Euclidean distance as the distance measure, as recommended by Bacher (2002). The 51 keywords of LIS doctoral dissertations in China were divided into 15 clusters, which indicates that the research fields of LIS doctoral dissertations in China are varied. The dendrogram of the cluster analysis is shown in Fig. 1. The name given to each cluster is shown in Table 5.
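To illustrate what Ward’s method with squared Euclidean distance does, the following minimal sketch merges, at each step, the two clusters whose union increases the within-cluster sum of squares the least. It is a plain Python illustration rather than the SPSS procedure used in this study; treating each keyword’s row of the correlation matrix as its profile is an assumption, and the function name and toy profiles are hypothetical.

```python
def ward_clusters(points, n_clusters):
    """Agglomerative clustering with Ward's criterion on squared Euclidean distance.

    Returns a list of clusters, each a sorted list of row indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[m][d] for m in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares caused by merging a and b.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical keyword profiles (rows of a small similarity matrix).
profiles = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
groups = ward_clusters(profiles, n_clusters=2)
```

Keywords 0 and 1 have similar profiles and are merged first, as are keywords 2 and 3, which yields the two groups expected by inspection.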

Fig. 1 Clusters of 51 keywords

Table 5 Fifteen clusters of research topics of LIS doctoral dissertations in China

As can be seen, cluster 10 has the largest number of keywords, indicating that it is the most concentrated research field. The keywords, that is, the research topics, in cluster 10 receive close attention in LIS doctoral dissertations in China.

Drawing strategic diagram

Centrality and density, the two indicators of strategic diagram, could reflect the strength of relation within and between clusters. The strategic diagram can display the structure of research fields, and it can also reveal the focuses and trends of research fields by dividing the clusters into four quadrants. We calculated the values of centralities and densities of all clusters (as shown in Table 6), and drew the strategic diagram (Fig. 2).

Table 6 Centrality and density of fifteen clusters

Fig. 2 Strategic diagram of clusters

As shown in Fig. 2, the clusters in quadrant I (upper right-hand quadrant) are cluster 1 (information resource), cluster 2 (ontology), cluster 3 (electronic government), cluster 6 (knowledge management) and cluster 14 (digital library). Both the centrality and the density of these clusters are high, indicating that these clusters not only have close internal connections but are also widely connected with other clusters. These fields are the research focuses in LIS doctoral dissertations in China and tend to be mature.

Quadrant II (upper left-hand quadrant) includes only cluster 4 (information retrieval). This cluster has close internal connections, indicating that research in this cluster has reached a relatively stable scale. In contrast to its internal connections, the connections between this cluster and the other clusters are not so close. That is to say, this field is located on the edge of the whole research network.

The clusters in quadrant III (lower left-hand quadrant) are cluster 5 (social network), cluster 7 (evaluation of humanities and social sciences), cluster 8 (performance evaluation), cluster 9 (academic journal), cluster 10 (competitive intelligence), cluster 11 (library management), cluster 12 (bibliometrics) and cluster 15 (open access). These clusters have low centrality and low density; thus, they have loose internal and external connections. These fields are still immature and are located on the edge of the research network.

Quadrant IV (lower right-hand quadrant) contains only cluster 13 (information management). Although this field has loose internal connections, it has attracted the attention of many researchers. Consequently, there is ample room for further development in this field. In other words, information management is likely to become a research trend in the future and needs further study.
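The assignment of clusters to the four quadrants can be sketched as a simple comparison of each cluster’s centrality and density with the point at which the two axes cross. Where that crossing point lies is not stated here, so the sketch takes it as a parameter; the function name and the numeric values are hypothetical and are not taken from Table 6.

```python
def quadrant(centrality, density, origin):
    """Classify a cluster by quadrant; `origin` is the (centrality, density) crossing point."""
    x0, y0 = origin
    if centrality >= x0 and density >= y0:
        return "I"    # high centrality, high density
    if centrality < x0 and density >= y0:
        return "II"   # low centrality, high density
    if centrality < x0 and density < y0:
        return "III"  # low centrality, low density
    return "IV"       # high centrality, low density


# Hypothetical (centrality, density) values for four clusters.
values = {"A": (0.9, 0.8), "B": (0.2, 0.7), "C": (0.1, 0.1), "D": (0.8, 0.2)}
origin = (0.5, 0.5)  # assumed crossing point of the axes
placement = {name: quadrant(c, d, origin) for name, (c, d) in values.items()}
```

Taking the mean or the median of the centrality and density values as the crossing point is a common choice, but either is an assumption in this sketch.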

Social network analysis

We constructed two types of co-word networks with NetDraw. In each network, nodes represent keywords, and a line between two nodes indicates that the two keywords have appeared in the same dissertation.

The first network was generated from the original co-occurrence matrix (Fig. 3). It shows intuitively the relationships among the research topics of LIS doctoral dissertations in China. The relative size of a node is proportional to the frequency of the keyword. Line thickness reflects the closeness of the connection between two keywords: the thicker the line, the closer the connection. As shown in Fig. 3, the “KM (Knowledge management)” node is the largest, which means that this keyword has the highest frequency. The thicker lines between two keywords, such as between “IRM (Information Resource Management)” and “EG (Electronic Government)” or between “Onto (Ontology)” and “Sema-web (Semantic Web)”, represent closer relationships.

Fig. 3 Social network map of the original co-occurrence matrix

We constructed the second network from the binary matrix, which was converted from the original co-occurrence matrix. As shown in Fig. 4, five cores are identified by k-core analysis. To display the cores clearly, different node shapes are used: thirty up-triangle nodes (k = 5) represent the core themes of the network; ten square nodes (k = 4) represent the secondary core themes; six circle nodes (k = 3) are the themes located between the core and the periphery; and three down-triangle nodes (k = 2) and two plus nodes (k = 1) are the peripheral themes.

Fig. 4 k-core analysis of the binary matrix

It should be noted that Figs. 3 and 4 show two different networks of the research topics of LIS dissertations in China. Figure 3 focuses on the relationships between research topics, whereas Fig. 4 focuses on distinguishing core from peripheral research topics.

Conclusions

In this paper, we investigated the intellectual structure of LIS doctoral dissertations in China by using co-word analysis, including hierarchical cluster analysis, the strategic diagram, and SNA. The analysis yields several clear results about the research in LIS doctoral dissertations in China.

The distribution of LIS doctoral dissertations among universities/institutes implies that Wuhan University is the most important institution of doctoral education in LIS in China. The School of Information Management of Wuhan University is the earliest LIS education institute in China. After 92 years of development, it has become a comprehensive, research-oriented LIS education and research institute and the largest of its kind in China.

According to keyword frequency, the strategic diagram, and the k-cores, we identify the research focuses of LIS doctoral dissertations in China, including information resource and allocation, ontology, semantic web, semantic search, electronic government, information resource management, knowledge management, knowledge innovation, knowledge sharing, knowledge organization, network, information service, information need and digital library. Among these research focuses, only a small percentage of topics concern libraries or library-related issues. There may be many reasons for this; for example, researchers may no longer be studying topics that are relevant to the practical field (Finlay et al. 2012). A lack of connections between research, education, and practice is harmful not only to the development of the LIS disciplines but also to the future of the practice.

The hierarchical cluster analysis of the high-frequency keywords suggests that the research fields of LIS doctoral dissertations in China are varied. The strategic diagram reveals that many research fields in LIS doctoral dissertations in China are still immature; accordingly, the well-developed core research fields are few, namely information resource, electronic government, ontology, digital library and knowledge management. In summary, the research fields of LIS doctoral dissertations in China are varied, and many of them are still immature. There may be two reasons for this. First, the development of the LIS disciplines (especially Information Science) is still immature. Taking Information Science as an example, there is no undergraduate education in Information Science in China; that is, Information Science is taught only at the doctoral and master’s levels. The researchers (e.g., Ph.D. candidates) come from different disciplines, such as mathematics, computer science, information systems, management science, economics, engineering, law, and so on. Thus, the research fields are varied because of the researchers’ different disciplinary backgrounds. Second, the development of information technology and the growth of the economy in China have motivated LIS schools to adjust their objectives, curriculum and knowledge structure in order to meet various kinds of social needs (Dong 1997). Accordingly, the research fields of LIS doctoral dissertations have become more specific and specialized. It should be noted that doctoral education in LIS in China is still at a preliminary stage; thus, many research fields are still immature and need to be studied further.

As an exploratory study, this study also has some limitations. A small number of dissertations may contain confidential content, such as material with patenting and/or commercial development possibilities. As a result, the disclosure of these dissertations through submission to degree databases may be delayed for years. Because we collected LIS doctoral dissertations from university/institute dissertation databases and public dissertation databases, the study may have missed a small number of dissertations.

Notwithstanding its limitations, this study has mapped an intellectual structure of the research fields of LIS doctoral dissertations in China. In the future, we will conduct a longitudinal study in order to gain a full understanding of historical and contemporary developments in research. In addition, future studies may compare China with other countries in order to relate the findings to global trends.

The doctoral dissertation is an underdeveloped unit of analysis in contemporary bibliometric research. Like other researchers (Andersen and Hammarfelt 2011), we believe that the analysis of doctoral dissertations can provide valuable insights into the growth and structure of scientific fields and disciplines.