Introduction

Co-citation analysis (CA) is a significant branch of citation analysis in bibliometrics. It can be divided into at least three types according to the object of study: author co-citation analysis (ACA), document co-citation analysis (DCA), and journal co-citation analysis (JCA). H. D. White and B. C. Griffith brought ACA into Library and Information Science (LIS) in the 1980s (White and Griffith 1981) in order to depict the intellectual structure of a given field or fields. The main purpose of ACA is to map scientific domains from the perspective of co-cited authors, taking the author rather than the document or journal as the unit of analysis (Jeong et al. 2014). The basic assumptions of ACA can be summarized as follows: all cited articles play equal roles in co-citation analysis, and the more often two authors are co-cited, the more closely related they are. The four usual steps of ACA are (McCain 1990; Eom 2008a): (1) selection of the author set and retrieval of co-cited author counts; (2) formation of the raw co-citation matrix; (3) transformation of the raw co-citation matrix into the correlation matrix; (4) multivariate analyses (e.g., cluster analysis, multi-dimensional scaling (MDS), factor analysis, etc.). The concepts and methods of ACA have been applied frequently in other disciplines to map scientific domains and academic researchers (Eom 1999; Tsay 2011). Recently, ACA has also been combined with content-based analysis (Jeong et al. 2014) and artificial intelligence technologies (An et al. 2011).

However, following White and Griffith (1981), ACA assumes that every citation in an article contributes equally. This assumption cannot reveal the significance and relevance of citations, because citers may cite different works for different purposes. For example, the article “PageRank for ranking authors in co-citation networks” (Ding et al. 2009) has two references, a coauthorship-related one (Liu et al. 2007) and a PageRank-related one (Bianchini et al. 2005), with Liu and Bianchini as their respective authors. The two authors work in different fields, Library and Information Science (LIS) and Computer Science (CS), even though their studies are co-cited. Dr. Bianchini could therefore appear in the citation network (graph) obtained from the multivariate analyses even when only LIS is under study. If many such cases occur, potentially relevant authors in LIS may be overlooked. In other words, this weakness has been accepted and tolerated even though ACA uses author co-citation relationships as its only information for constructing a knowledge domain. The major purpose of this paper is to reduce such oversights by bringing more general information about citations into ACA. This information can be general descriptive metadata of a citation, such as its published time, the publication in which it appears, and its keywords. In the time perspective, for example, a small difference between two citations’ published time implies that the authors tend to focus on similar issues in the same period. The representation of the authors’ relationship in the knowledge graph might differ across periods because of the various concepts, methods, or even diversified demands of each period. Likewise, similar journals in which two authors’ papers are published, or similar keywords in the citations, imply that the authors tend to research similar issues.

As a result, the proposed method, called Modified Author Co-Citation Analysis (MACA), exploits four kinds of general descriptive metadata in citations to construct a citation network: the authors of a citation, the time when a citation is published, the carrier (i.e. journals, conferences, monographs, and even electronic sources, etc.) where a citation is published, and the keywords of a citation. As in ACA, the author information in citations (i.e. the author co-citation count) is used to establish the co-citation relationships among authors. The published time of the citations of every co-cited author is used to form a co-citation matrix from the time perspective, called the time-based parameter. The carriers of the citations are extracted first and assigned to professional fields according to the issues they focus on. The relationship of co-cited authors in the carrier perspective, called the carrier-based parameter, is then calculated from the similarity of the professional fields of their articles. Similarly, the keywords of the citations are first assigned to professional fields according to their meaning, and these fields are used to calculate the co-cited authors’ relationship in the keyword perspective, called the keyword-based parameter.

Related works are described in “Related works” section. The calculations and explanations of the proposed MACA are detailed in “Modified author co-citation analysis (MACA)” section. The dataset and pre-processing of our studies are expressed and the performance and analysis of the proposed MACA are demonstrated in “Experimental results and discussion” section. Finally, the conclusions are provided in “Conclusion” section.

Related works

ACA has been a hotspot in informetrics and scientometrics. It aims to guide scientific research by finding co-citation relationships between authors in a set of academic articles and by mapping knowledge domains (McCain 1990). Much empirical research has indicated that ACA is effective and applicable in evaluating the development of a discipline and in identifying the micro-structures of a field and its sub-fields, since it can reveal dynamic changes and future developments.

The major steps of ACA are shown in Fig. 1. In the first two steps, an academic dataset is selected by a chosen method (e.g. selection of specific journal(s), snowballing, etc.) and the authors’ names are disambiguated. Author name disambiguation is mainly based on the authors’ affiliations, collaboration records, and research areas. Then the co-cited authors within the dataset are extracted to construct a raw symmetric co-citation matrix based on their co-citation counts, whether first-author or all-author information is counted. In the next step, the raw co-citation matrix is transformed into a correlation matrix for normalization. Several correlation measures (e.g. Pearson’s r, Jaccard, cosine, Euclidean distance, etc.) should be assessed and one selected in this step. Finally, a series of data analysis methods (e.g., factor analysis, cluster analysis, network analysis, and multi-dimensional scaling) is used to produce a more accurate interpretation of the results. For example, to cluster given authors, a hierarchical agglomerative or iterative partitioning method is adopted to analyze the correlated authors. Domain experts then interpret the results before peer review.

Fig. 1 The framework of ACA
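The matrix-building steps of this framework can be illustrated with a minimal Python sketch. It assumes a toy dataset in which each citing article is reduced to the list of authors it cites. The function names, the author labels, the choice of Pearson's r, and the decision to leave the diagonal at zero are illustrative assumptions, not the procedure of any cited study.

```python
from itertools import combinations
from math import sqrt


def raw_cocitation(reference_lists, authors):
    """Count, for every author pair, the citing articles that cite both."""
    index = {name: k for k, name in enumerate(authors)}
    n = len(authors)
    matrix = [[0] * n for _ in range(n)]
    for cited_authors in reference_lists:
        # Each author counts once per citing article.
        present = sorted({index[a] for a in cited_authors if a in index})
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix  # diagonal left at 0 (a simplifying assumption)


def pearson(x, y):
    """Pearson's r between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def correlation_matrix(matrix):
    """Correlate the co-citation profiles (rows) of every author pair."""
    n = len(matrix)
    return [[pearson(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]


# Toy data: three citing articles and the authors each one cites.
authors = ["A", "B", "C"]
reference_lists = [["A", "B", "C"], ["A", "C"], ["B", "C"]]

raw = raw_cocitation(reference_lists, authors)
corr = correlation_matrix(raw)
print(raw)  # [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
```

The resulting correlation matrix is what the multivariate analyses in the final step would take as input.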

Over the past 30 years, research on traditional ACA has centered on four major concerns: (1) data collection methods (White and McCain 1998; Cothill et al. 1989) and database selection (Zhao and Strotmann 2008); (2) raw matrix formation and the definition or modification of ACA; (3) correlation matrix transformation and similarity measurement in ACA (Ahlgren et al. 2003; White 2003a; Bensman 2004; Leydesdorff and Vaughan 2006; Egghe 2009; Mêgnigbêto 2013); (4) further analysis methods (e.g. factor analysis, multi-dimensional scaling, cluster analysis, network analysis, etc.) and visualization (White 2003b; An et al. 2011; Chen 1999; Moya-Anegón et al. 2007).

In the area of raw matrix formation and the definition or modification of ACA, researchers have focused on diagonal values in the raw co-citation matrix (White and McCain 1998; McCain 1991) and on first- or all-author co-citation analysis (Persson 2001; Zhao and Logan 2002; Zhao 2006; Rousseau and Zuccala 2004; Zhao and Strotmann 2008; Schneider and Larsen 2009; Eom 2008b). The latter line of research has made traditional ACA more informative, since more authors’ co-citation relationships are included. However, these studies focused only on author-related information and ignored other available metadata in citations. Some researchers have also studied content-based ACA. Jeong et al. (2014), for example, used the similarity of citances (i.e. citing sentences) to modify traditional ACA, in essence improving the calculation of the raw co-citation “count”. Their results showed that content-based ACA performed better than the previous methods. Nevertheless, content-based ACA requires full-text data in TXT or HTML format and has higher computational complexity. To address these disadvantages, in this paper we modify the construction of the raw co-citation matrix by combining it with other descriptive metadata of citations (i.e., citations’ published time, citations’ published carrier, and citations’ keywords) in order to integrate more types of information and to improve the performance of ACA. This paper tries to modify traditional ACA by adding an “author-based parameter calculation” step (white block in Fig. 1).

Modified author co-citation analysis (MACA)

The framework of the proposed MACA, which analyzes the relationship of two authors by using general descriptive metadata of citations including the published time, keywords, and carrier, is shown in Fig. 2. Obviously, the major difference between ACA and MACA is the stage of constructing raw co-citation matrix. The authors’ names, published time, carriers and keywords of each citation should be abstracted in the first stage. The co-citation matrix of MACA is then constructed by four matrices, called author-based parameter, time-based parameter, carrier-based parameter, and keyword-based parameter, based on the four kinds of corresponding descriptive metadata, respectively. Note that in Fig. 2, the white blocks refer to new steps we introduce, while the green blocks mean traditional steps. The calculations of three different parameters and the co-citation matrix are detailed in the following.

Fig. 2 The framework of the proposed MACA

Calculation of the time-based parameter between two authors

An academic article usually exposes the research interest, professional field, and specific contribution of an author. The published time of an article may also implicitly indicate the period in which the authors worked on it. In the general academic research procedure, researchers usually read the literature first and formulate their problem from those studies, then look for current solutions or algorithms related to the problem. Researchers, especially in engineering, cite recent studies in order to exploit, modify, or compare with them. This implies that two authors’ works could be related, collaborative, or successive when the published times of their articles, especially articles co-cited by the same article, are close.

Nevertheless, the citations in an article often serve different purposes and may belong to different professional fields (Bu et al. 2015; Brooks 1985). For example, one citation may be cited for a mathematical theory used to derive an algorithm, while another, belonging to bibliometrics, is cited for the method used to evaluate the results. When the analysis mainly considers one specific field, the result of combining author co-citation with published time is not affected. However, authors belonging to different professional fields would still be clearly shown in the knowledge graph. In other words, the relationship between the authors of two citations with similar published time should be reflected in the knowledge graph if their studies are in similar research fields.

Three academic researchers and their professional fields are listed in Table 1. Figure 3 shows the histogram of the number of co-citations of each author pair according to the difference in published time. The distributions for the author pairs 1 and 2, 2 and 3, and 1 and 3 are drawn as solid lines with circular, triangular, and square markers at the data points, respectively. A total of 36 articles are co-cited, and 72% of them have a time difference of less than 3 years. A manual examination shows that these articles, close in published time, deal with similar or related issues in network science. The same similarity is observed for the other author pairs. Moreover, there are few co-citations with a difference of more than 5 years, and in such cases one of the two articles could be a literature review or a classic study in a professional field. This implies that the authors’ fields of interest might be related in the same period when their co-cited articles differ only slightly in published time. In other words, authors who have many co-citations with small differences in published time can be placed closer together on the knowledge graph.

Table 1 Three authors and their area of interests
Fig. 3 The distribution of the number of pairwise authors’ co-citations to their time difference

As a result, the calculation of the time-based parameter between two authors rests on two basic assumptions: (1) A small difference between two citations’ published time implies that the authors tend to study similar issues in the same time period. (2) A large difference suggests that, although the authors may research similar issues in different periods, the representation of their relationship should be distinct in the knowledge graph because of the various concepts, methods, or even diversified demands of different periods. Therefore, the time-based parameter of MACA quantifies the relationship in the time dimension between two authors whose works are co-cited in one or more articles. Figure 4 shows the block diagram for calculating the time-based parameter. First, all cited papers written by each author \(A_{i}\) are collected. Then the published time of every cited paper is extracted and fed into the time-based relation calculator, which produces the time-based parameter of MACA.

Fig. 4 The procedure of calculating time-based parameter in MACA (PT published time)

In time-based relation calculator, assume that an article, \(P_{l}\) and \(l \in \left[ {1,n} \right]\), has the references, \(D_{r}\) and \(r \in \left[ {1,m} \right]\), with their authors, \(A_{i}\) and \(i \in \left[ {1,I} \right]\), and their published year, \(t_{r}\) and \(r \in \left[ {1,m} \right]\). Then the average of published time of an author \(A_{i}\) in the article \(P_{l}\) is

$${\text{Pub}}\_{\text{ave}}\_t_{{A_{i} ,P_{l} }} = \frac{1}{m}\mathop \sum \limits_{r = 1}^{m} t_{r}$$
(1)

The time-based parameter of two authors, \(A_{i}\) and \(A_{j}\), is calculated from their average published times in all \(n\) articles and is given by

$${\text{Ave}}\_{\text{FTR}}_{{A_{i} ,A_{j} }} = \frac{1}{n}\mathop \sum \limits_{l = 1}^{n} \left( {1 + { \ln }\left( {1 + \left| {{\text{Pub}}\_{\text{ave}}\_t_{{A_{i} ,P_{l} }} - {\text{Pub}}\_{\text{ave}}\_t_{{A_{j} ,P_{l} }} } \right|} \right)} \right)^{ - 1}$$
(2)

where \({\text{Pub}}\_{\text{ave}}\_t_{{A_{i} ,P_{l} }}\) and \({\text{Pub}}\_{\text{ave}}\_t_{{A_{j} ,P_{l} }}\) are the average published times of the two co-cited authors, \(A_{i}\) and \(A_{j}\), in the same article, \(P_{l}\). Each summand of Eq. (2) has the form \(\left( {1 + \ln x} \right)^{ - 1}\) with \(x \in \left[ {1, + \infty } \right)\), so its value lies in \(\left( {0,1} \right]\), as shown in Fig. 5. It reaches the maximum value 1 when \({\text{Pub}}\_{\text{ave}}\_t_{{A_{i} ,P_{l} }}\) is equal to \({\text{Pub}}\_{\text{ave}}\_t_{{A_{j} ,P_{l} }}\), and it approaches 0 as the difference becomes large. Note that this design has two advantages: (1) the function directly reflects the citation relationship in the time dimension between two authors; (2) it can easily be merged into the calculation of traditional ACA after normalization because its values lie within \(\left[ {0,1} \right]\).

Fig. 5 The value of time-based parameter

For example, Table 2 shows that an article X has four references, \(D_{1}\), \(D_{2}\), \(D_{3}\), and \(D_{4}\), with their authors, \(A\), \(A\), \(B\), and \(B\), respectively. According to Eq. (2), the time relation of each author in the references can be calculated as \(\left( {1 + { \ln }\left( {1 + \left| {\frac{1990 + 2002}{2} - \frac{2010 + 2010}{2}} \right|} \right)} \right)^{ - 1} \approx 0.27\). Suppose that the two authors are also co-cited in another two articles with their time correlation 1.00 and 0.59, respectively. Then their time-based parameter is (0.27 + 1.00 + 0.59)/3 = 0.62.

Table 2 Examples of four papers and their published time
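The worked example above can be reproduced with a short Python sketch of Eqs. (1) and (2). The function names and the input layout (one pair of year lists per co-citing article) are assumptions made for illustration. The average in Eq. (1) is taken here over each author's own references in the article, which is how the worked example computes it.

```python
from math import log


def mean_year(years):
    """Eq. (1): average published year of one author's references in an article."""
    return sum(years) / len(years)


def time_relation(years_i, years_j):
    """Summand of Eq. (2) for one co-citing article."""
    gap = abs(mean_year(years_i) - mean_year(years_j))
    return 1.0 / (1.0 + log(1.0 + gap))


def time_based_parameter(per_article_years):
    """Eq. (2): mean of the per-article time relations.

    per_article_years holds one (years_i, years_j) pair per co-citing article.
    """
    values = [time_relation(yi, yj) for yi, yj in per_article_years]
    return sum(values) / len(values)


# Article X from the example: author A (1990, 2002), author B (2010, 2010).
relation_x = time_relation([1990, 2002], [2010, 2010])
print(round(relation_x, 2))  # 0.27

# Averaging with the two further relations assumed in the text (1.00 and 0.59).
print(round((0.27 + 1.00 + 0.59) / 3, 2))  # 0.62
```

Identical average years give a relation of exactly 1, and the relation decays slowly (logarithmically) as the gap widens.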

Calculation of the carrier-based parameter between two authors

A carrier here is defined as a form of publication, such as a journal, conference, magazine, book, or electronic resource. Carriers often concentrate on a specific professional field because they typically serve a specific group of readers. The articles in a carrier usually share similar issues and characteristics, as seen in special issues, special columns, or distinguishing themes. Authors also prefer to publish their studies in carriers whose focus matches their topics. In other words, authors whose co-cited articles are published in the same or field-related carriers are likely to work in similar or related fields.

For example, an analysis of all articles published from 1999 to 2008 in the Journal of the American Society for Information Science and Technology (JASIST; see Footnote 1) found three major topics: information retrieval and technology, Internet information and information searching behavior, and citation analysis and term co-occurrence research (Li and Gong 2010). A. Spink, a famous scientist in information retrieval, published 22 articles, 1.205% of all papers, on JASIST from 2001 to 2010 (Yang 2013). The author’s particular interest clearly falls within the scope of JASIST. Meanwhile, another author publishing in the carrier might have a research field similar to Dr. Spink’s if their studies are co-cited. A similar example is Y. Ding, who has published many of her articles in Journal of Informetrics and Scientometrics (Ding 2011; Ding et al. 2000, 2013). However, the disciplines of the authors can be distinguished even if their co-cited articles are published in different kinds of carriers, because they are cited for distinct purposes. Again, we take the example of “PageRank for ranking authors in co-citation networks” (Ding et al. 2009), in which two papers, namely “Inside PageRank” (Bianchini et al. 2005) and “Co-authorship networks in the digital library research community” (Liu et al. 2007), were co-cited. The former paper was published in ACM Transactions on Internet Technology, obviously a CS journal, while the latter was published in Information Processing and Management, a typical LIS journal. Their authors belong to the corresponding fields as well.

Thus, the calculation of the carrier-based parameter between two authors rests on three basic assumptions: (1) Information carriers have a specific knowledge range; even cross-disciplinary sources have strong pertinence. As a result, the knowledge range of information carriers can be cataloged and indexed according to their research areas. (2) The papers published in a particular carrier are related to some extent because carriers usually focus on particular issues or have given features. (3) Authors normally submit and publish their articles in information carriers whose concerns match the directions of their studies. Therefore, the carrier-based parameter of MACA quantifies the relationship in the information carrier perspective between two authors whose works are co-cited in one or more articles. Figure 6 shows the block diagram for calculating the carrier-based parameter. First, all cited papers written by each author \(A_{i}\) are collected. The information carriers of the cited papers are extracted and indexed by the field indexer according to their focus areas. Then the field indexes of all cited papers are computed and fed into the carrier-based relation calculator, which produces the carrier-based parameter of MACA.

Fig. 6 The procedure of calculating carrier-based parameter in MACA (PC published carriers)

In carrier-based relation calculator, suppose that there are \(K\) distinct information carriers in dataset, which are divided into \(\xi\) different professional fields. An article, \(P_{l}\) and \(l \in \left[ {1,n} \right]\), has the references, \(D_{r}\) and \(r \in \left[ {1,m} \right]\), with their authors, \(A_{i}\) and \(i \in \left[ {1,I} \right]\), and their information carrier, \(c_{q}\) and \(q \in \left[ {1,K} \right]\). A field distribution matrix, showing the field relation of a reference \(D_{r}\) and its author \(A_{i}\) in article \(P_{l}\), is formulated as \(F = \left( {f_{l,i,j,r} } \right)\)

$$f_{l,i,j,r} = \left\{ { \begin{array}{*{20}l} {1,\quad c_{q} \,{\text{of}}\, D_{r} \,{\text{with}} \,\,A_{i} \,{\text{is }}\,{\text{related}}\, {\text{to }}\,j{\text{th }}\,{\text{field}}} \hfill \\ {0,\quad c_{q} \,{\text{of}}\, D_{r} \,{\text{with}} \,\,A_{i}\,{\text{is }}\,{\text{not}}\, {\text{related}}\, {\text{to}}\, j{\text{th}}\, {\text{field}}} \hfill \\ \end{array} } \right.$$
(3)

where \(j \in \left[ {1,\xi } \right]\) is the field index, and the field relation, \({\text{FR}}\) of an author \(A_{i}\) in article \(P_{l}\) on field \(j\) is further defined as

$${\text{FR}}_{{A_{i} ,P_{l} ,j}} = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if}}\, \mathop \sum \limits_{r = 1}^{m} f_{l,i,j,r} > 0} \hfill \\ 0 \hfill & {{\text{otherwise}}.} \hfill \\ \end{array} } \right.$$
(4)

Then the field correlation between two co-cited authors, \(A_{i}\) and \(A_{z}\), in article \(P_{l}\) is calculated by

$${\text{FCR}}_{{A_{i} ,A_{z} ,P_{l} }} = \frac{1}{\xi }\mathop \sum \limits_{j = 1}^{\xi } {\text{FR}}_{{A_{i} ,P_{l} ,j}} \cdot {\text{FR}}_{{A_{z} ,P_{l} ,j}}$$
(5)

Therefore, the carrier-based parameter of the two authors within the range \(\left[ {0,1} \right]\), shown in Eq. (6), is the average of their field correlation in the dataset.

$${\text{Ave}}\_{\text{FCR}}_{{A_{i} ,A_{z} }} = \frac{1}{n}\mathop \sum \limits_{l = 1}^{n} {\text{FCR}}_{{A_{i} ,A_{z} ,P_{l} }}$$
(6)

For example, Table 3 shows that an article X has four references, \(D_{1}\), \(D_{2}\), \(D_{3}\), and \(D_{4}\), with their authors, \(A\), \(A\), \(B\), and \(B\), respectively. These references belong to \(F_{4}\), \(F_{3}\), \(F_{3}\), and \(F_{5}\), respectively, out of five professional fields in total. According to Eq. (4), Table 4 gives the field relation of each author in the references. The two authors share exactly one field, \(F_{3}\), so the sum in Eq. (5) equals 1 before division by \(\xi = 5\). Suppose that the two authors are also co-cited in two other articles in which this sum is 2 and 1, respectively. Then their carrier-based parameter is [(1 + 2 + 1)/5]/3 \(\approx 0.27\).

Table 3 Examples of four papers and their field distributions
Table 4 The field relation of two authors in article X
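Eqs. (4) to (6) can be sketched in Python as follows. The sketch assumes that every reference's carrier has already been mapped to a single field index between 1 and \(\xi\), as in Table 3. The function names and the field sets of the two additional articles are hypothetical, chosen only so that the authors share 2 and 1 fields as supposed in the example.

```python
def field_relation(ref_fields, n_fields):
    """Eq. (4): 1 for each field covered by any of the author's references."""
    covered = set(ref_fields)
    return [1 if j in covered else 0 for j in range(1, n_fields + 1)]


def field_correlation(fields_i, fields_z, n_fields):
    """Eq. (5): share of the n_fields fields that both authors cover."""
    fr_i = field_relation(fields_i, n_fields)
    fr_z = field_relation(fields_z, n_fields)
    return sum(a * b for a, b in zip(fr_i, fr_z)) / n_fields


def carrier_based_parameter(per_article_fields, n_fields):
    """Eq. (6): mean field correlation over all co-citing articles."""
    values = [field_correlation(fi, fz, n_fields) for fi, fz in per_article_fields]
    return sum(values) / len(values)


XI = 5  # number of professional fields in the example

# Article X: author A's references lie in F4 and F3, author B's in F3 and F5.
fcr_x = field_correlation([4, 3], [3, 5], XI)
print(fcr_x)  # 0.2

# Two further articles with hypothetical field sets sharing 2 and 1 fields.
articles = [([4, 3], [3, 5]), ([1, 2], [1, 2]), ([1], [1])]
ave_fcr = carrier_based_parameter(articles, XI)
print(round(ave_fcr, 2))  # 0.27
```

Because each product in Eq. (5) is 0 or 1 and the sum is divided by \(\xi\), the parameter stays within \(\left[ {0,1} \right]\) without further normalization.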

Calculation of the keyword-based parameter between two authors

Keywords in an article are usually important access points relevant to readers’ interests and authors’ studies. There are several ways of choosing keywords when writing academic papers; one possible method is to fit them into the categories prescribed by the journal’s “instruction to authors”. Keywords are sometimes generated automatically by library information systems at the proof stage (Hartley 2008). According to Hartley, these keywords are selected from the following suggested categories: the discipline (e.g., economics, computer science, mathematics), methods (e.g., experiment, case study, questionnaire, algorithm), data source (e.g., primary, secondary, library), location (e.g., city, institution), or topic (e.g., information security, image processing, natural language processing). Because the number of keywords is limited, researchers have to weigh candidate keywords against each other. In most cases, keywords indicate the main professional field of an article and may expose the authors’ interests in an academic field. As a result, the fields of interest of two authors whose articles are co-cited might be correlated when the keywords of the articles belong to close professional fields.

For example, Table 5 lists the keywords of all articles published by Y. Ding, a productive researcher in LIS, on JASIST before 2015, together with the number of times each is used in her articles. Note that for some of her articles no keywords are found in the PDF version; in those cases the keywords added automatically by the Web of Science system are used in our observation. Some overly general keywords, like “science”, “library”, “time”, etc., are deleted here. These keywords can be roughly divided into eight groups: citation analysis, bibliometrics/scientometrics, social networks, knowledge management, topic modeling, semantic web, scientific collaboration, and scientific evaluation. These groups clearly indicate Dr. Ding’s interests and professional fields. After examining the statements on her website (Footnote 2), we found that her fields of interest include semantic web, healthcare, social network, citation analysis, knowledge engineering, and information retrieval. The keyword groups match her fields of interest except for “healthcare”, which is not involved in her articles published on JASIST. Apparently, the more keywords are collected, the better they reflect the fields of interest of an author. Moreover, L. Bornmann, a well-known sociologist of science, was co-cited with Y. Ding many times, for instance in “Generalized preferential attachment considering aging” (Wu et al. 2014), where Ding et al. (2013) and Bornmann and Daniel (2008) were co-cited. The keywords of the former paper are “content-based citation analysis”, “citation”, “mentioning”, and “citation analysis”, while those of the latter are “reference services” and “bibliometrics systems”. In this case, the articles of both authors, Y. Ding and L. Bornmann, may be related to bibliometrics/scientometrics, and presumably the authors have similar interests in that area. His personal website (Footnote 3) confirms this assumption: the research areas of L. Bornmann include research evaluation, peer review, bibliometrics, and altmetrics, very similar to those of Y. Ding. As a result, the keywords that authors use can reflect their areas of interest.

Table 5 All keywords of Y. Ding’s articles published on JASIST (NK the number of keywords used)

Therefore, the calculation of the keyword-based parameter between two authors rests on three basic assumptions: (1) Like information carriers, keywords can be divided into specific research areas and types. (2) Keywords with the same or similar meaning used in different articles indicate a certain connection between these studies and between their authors’ interests. (3) The more similar keywords appear in two articles, the stronger the relation between them. Therefore, the keyword-based parameter of MACA quantifies the relationship in the keyword perspective between two authors whose works are co-cited in one or more articles. Figure 7 shows the block diagram for calculating the keyword-based parameter. After all cited papers written by each author \(A_{i}\) are collected, the keywords of the cited papers are extracted and indexed by the field indexer according to their focus areas. Then the field indexes of all cited papers are computed and fed into the keyword-based relation calculator, which produces the keyword-based parameter of MACA.

Fig. 7 The procedure of calculating keyword-based parameter in MACA (FK fields of keywords. Note that FK is a vector instead of one value)

In the keyword-based relation calculator, assume that there are in total \(K\) distinct citation keywords in the dataset, which are divided into \(\xi\) different professional fields. Similar to the definition of the carrier-based parameter, an article, \(P_{l}\) and \(l \in \left[ {1,n} \right]\), has the references, \(D_{r}\) and \(r \in \left[ {1,m} \right]\), with their authors, \(A_{i}\) and \(i \in \left[ {1,I} \right]\), and their keywords, \(k_{q}\) and \(q \in \left[ {1,K} \right]\). A matrix showing the field distribution of the keyword \(k_{q}\) in a reference \(D_{r}\) and its author \(A_{i}\) in article \(P_{l}\) is formulated as \(F = \left( {f_{l,i,j,r,q} } \right)\) with

$$f_{l,i,j,r,q} = \left\{ {\begin{array}{*{20}l} {1,\quad k_{q} \,{\text{of}}\, D_{r} \, {\text{with}}\,\, A_{i} \, {\text{is}}\, {\text{related }}\,{\text{to}}\, {\text{the}} \,j{\text{th }}\,{\text{field}}} \hfill \\ {0,\quad k_{q} \,{\text{of }}\,D_{r} \,{\text{with}} \,\, A_{i} \, {\text{is }}\,{\text{not }}\,{\text{related }}\,{\text{to }}\,{\text{the}}\, j{\text{th}}\, {\text{field}}} \hfill \\ \end{array} } \right.$$
(7)

where \(j \in \left[ {1,\xi } \right]\) is the field index. Define the field relation, \({\text{DFR}},\) of a reference \(D_{r}\) of an author \(A_{i}\) in article \(P_{l}\) on field \(j\) as

$${\text{DFR}}_{{A_{i} ,P_{l} ,D_{r} ,j}} = \frac{{\mathop \sum \nolimits_{q = 1}^{K} f_{l,i,j,r,q} }}{{\mathop \sum \nolimits_{j = 1}^{\xi } \mathop \sum \nolimits_{q = 1}^{K} f_{l,i,j,r,q} }} \cdot \varepsilon$$
(8)

where \(\varepsilon \in N^{*}\), normally \(3 \le \varepsilon \le 7\), is an adaptive variable for the normalization of the keyword-based parameter, and the field relation, \({\text{FR}}\), of author \(A_{i}\) in article \(P_{l}\) on field \(j\) can be calculated by

$${\text{FR}}_{{A_{i} ,P_{l} ,j}} = \frac{{\mathop \sum \nolimits_{r = 1}^{m} {\text{DFR}}_{{A_{i} ,P_{l} ,D_{r} ,j}} }}{{{\text{NZ}}\left[ {{\text{DFR}}_{{A_{i} ,P_{l} ,D_{r} ,j}} } \right]}}$$
(9)

Here, \({\text{NZ}}\left[ \cdot \right]\) is a function that returns the number of nonzero entries of its input matrix. Then the field correlation between two co-cited authors, \(A_{i}\) and \(A_{z}\), in article \(P_{l}\) is calculated by

$${\text{FKR}}_{{A_{i} ,A_{z} ,P_{l} }} = \frac{1}{\xi }\mathop \sum \limits_{j = 1}^{\xi } \left[ {1 + \left( {{\text{FR}}_{{A_{i} ,P_{l} ,j}} - {\text{FR}}_{{A_{z} ,P_{l} ,j}} } \right)^{2} } \right]^{ - 1}$$
(10)

Therefore, the keyword-based parameter of the two authors within the range \(\left[ {0,1} \right]\), shown in Eq. (11), is the average of their field correlation in the dataset.

$${\text{Ave}}\_{\text{FKR}}_{{A_{i} ,A_{z} }} = \frac{1}{n}\mathop \sum \limits_{l = 1}^{n} {\text{FKR}}_{{A_{i} ,A_{z} ,P_{l} }}$$
(11)

For example, Table 6 shows an article X that has four references, \(D_{1}\), \(D_{2}\), \(D_{3}\), and \(D_{4}\), with their corresponding authors, \(A\), \(A\), \(B\), and \(B\). Table 7 gives the ten keywords in the citations together with their field distribution. From these, the field distribution of the cited papers in the keyword perspective can be calculated; the results are shown in Table 8.

Table 6 Examples of four papers and their field distributions
Table 7 Examples of the field distribution of overall keywords in citations
Table 8 Examples of field distribution of the referred papers in keyword perspective

The field relation of each reference with its corresponding author on these fields is then computed according to Eq. (8). Here, \(\varepsilon\) is set to 4, and the calculated results are shown in Table 9. Each author’s field distribution is then estimated and shown in Table 10. Hence, following Eq. (10), the field correlation between the two co-cited authors A and B in paper X can be calculated as:

$${\text{FKR}}_{A,B,X} = \frac{1}{5} \times \left[ {\frac{1}{{1 + \left( {2.2 - 1.4} \right)^{2} }} + \frac{1}{{1 + \left( {0.5 - 0.0} \right)^{2} }} + \frac{1}{{1 + \left( {0.5 - \frac{1}{3}} \right)^{2} }} + \frac{1}{{1 + \left( {0.4 - 0.0} \right)^{2} }} + \frac{1}{{1 + \left( {0.4 - \frac{34}{15}} \right)^{2} }}} \right] \approx 0.69$$
Table 9 Field relation of all references in keyword perspective (\(\varepsilon = 4\))
Table 10 The field distribution of the authors in keyword perspective

If A and B are also co-cited in two other papers, Y and Z, with field correlations \({\text{FKR}}_{A,B,Y}\) = \({\text{FKR}}_{A,B,Z}\) = 1, the keyword-based parameter of the two authors is (0.69 + 1 + 1)/3 \(\approx\) 0.897.
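
The arithmetic of Eqs. (10) and (11) in this example can be checked with a short script. The following is a minimal Python sketch, not the authors' implementation. The function names are illustrative, and the two field vectors are simply read off the terms of the worked equation (which vector belongs to which author does not matter, since the differences are squared).

```python
def field_correlation(fr_a, fr_b):
    """Eq. (10): mean over all fields of 1 / (1 + squared difference)."""
    if len(fr_a) != len(fr_b):
        raise ValueError("field vectors must have the same length")
    total = sum(1.0 / (1.0 + (a - b) ** 2) for a, b in zip(fr_a, fr_b))
    return total / len(fr_a)


def keyword_parameter(per_paper_correlations):
    """Eq. (11): average of the per-paper field correlations."""
    return sum(per_paper_correlations) / len(per_paper_correlations)


# Field distributions of the two authors in paper X (five fields),
# taken from the terms of the worked equation in the text.
fr_a = [2.2, 0.5, 0.5, 0.4, 0.4]
fr_b = [1.4, 0.0, 1 / 3, 0.0, 34 / 15]

fkr_x = field_correlation(fr_a, fr_b)

# The text rounds the paper-X value to 0.69 before averaging with the
# values for papers Y and Z (both 1), so the same rounding is applied here.
ave_fkr = keyword_parameter([round(fkr_x, 2), 1.0, 1.0])

print(round(fkr_x, 2), round(ave_fkr, 3))  # prints: 0.69 0.897
```

Without the intermediate rounding the average would come out slightly higher, which is why the sketch rounds the paper-X value first to reproduce the figures quoted in the example.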

Construction of the co-citation matrix based on the three parameters above

The proposed MACA constructs its raw co-citation matrix by synthesizing the author-based co-citation matrix of ACA with the time-based, carrier-based, and keyword-based parameters. Each entry of these matrices should be normalized to the same range for calculation, which is set to \(\left[ {0,1} \right]\) in this paper. Only the author-based co-citation matrix of ACA needs to be normalized, because the other three parameters already lie in this range. The normalized author-based co-citation matrix is defined as

$${\text{Nor}}\_{\text{RCM}}_{{A_{i} A_{z} }} = \frac{{{\text{RCM}}_{{A_{i} A_{z} }} }}{{{\text{Max}}\left( {{\text{RCM}}_{{A_{i} A_{z} }} } \right)}}$$
(12)

where \({\text{RCM}}_{{A_{i} A_{z} }}\) is the entry of the author-based co-citation matrix in ACA for the two authors \(A_{i}\) and \(A_{z}\), and the function \({\text{Max}}\left( \cdot \right)\) outputs the maximal entry of a matrix. Then the co-citation matrix in MACA, denoted \(M = \left( {m_{i,z} } \right)\), is formulated as

$$m_{i,z} = w_{t} \cdot {\text{Ave}}\_{\text{FTR}}_{{A_{i} ,A_{z} }} + w_{c} \cdot {\text{Ave}}\_{\text{FCR}}_{{A_{i} ,A_{z} }} + w_{k} \cdot {\text{Ave}}\_{\text{FKR}}_{{A_{i} ,A_{z} }} + w_{A} \cdot {\text{Nor}}\_{\text{RCM}}_{{A_{i} A_{z} }}$$
(13)

where \(w_{t}\), \(w_{c}\), \(w_{k}\), and \(w_{A}\) indicate the weights of the time-, carrier-, and keyword-based parameters and of the author-based co-citation matrix value, respectively. Each weight is larger than 0, and the weights sum to 1.
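
Eqs. (12) and (13) amount to a max-normalization followed by an entry-wise weighted sum. The sketch below illustrates this in plain Python under stated assumptions: the function names and the toy 3 × 3 matrices are hypothetical, and each weight is paired with the parameter its subscript names (\(w_{c}\) with the carrier-based matrix, \(w_{k}\) with the keyword-based matrix). It is not the authors' implementation.

```python
def normalize_by_max(matrix):
    """Eq. (12): divide every entry by the largest entry of the matrix."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[value / peak for value in row] for row in matrix]


def combine(rcm, ave_ftr, ave_fcr, ave_fkr, w_t, w_c, w_k, w_a):
    """Eq. (13): weighted sum of the three parameters and normalized RCM."""
    if abs(w_t + w_c + w_k + w_a - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    nor_rcm = normalize_by_max(rcm)
    n = len(rcm)
    return [
        [
            w_t * ave_ftr[i][z]
            + w_c * ave_fcr[i][z]
            + w_k * ave_fkr[i][z]
            + w_a * nor_rcm[i][z]
            for z in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical data for three authors (diagonals set to 0).
rcm = [[0, 4, 2], [4, 0, 1], [2, 1, 0]]                              # raw co-citation counts
ave_ftr = [[0, 0.5, 0.2], [0.5, 0, 0.9], [0.2, 0.9, 0]]              # time-based
ave_fcr = [[0, 0.8, 0.4], [0.8, 0, 0.6], [0.4, 0.6, 0]]              # carrier-based
ave_fkr = [[0, 0.897, 0.3], [0.897, 0, 0.7], [0.3, 0.7, 0]]          # keyword-based

m = combine(rcm, ave_ftr, ave_fcr, ave_fkr, w_t=0.2, w_c=0.1, w_k=0.2, w_a=0.5)
print(round(m[0][1], 4))  # 0.2*0.5 + 0.1*0.8 + 0.2*0.897 + 0.5*1.0
```

Because all four inputs lie in \(\left[ {0,1} \right]\) and the weights sum to 1, every entry of the combined matrix also lies in \(\left[ {0,1} \right]\).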

Experimental results and discussion

Dataset and preprocessing

The primary dataset used in the paper consists of the articles published in JASIST from January 2003 to December 2012. All general descriptive metadata of the articles and their citations, including title, authors, published time, published carriers, volume and issue, keywords, and number of pages, are exploited. In total, 2038 articles and 68,606 citations are used after preliminary refinement. For citations, only the first-author information is adopted, and author names are disambiguated and manually filtered. Authors who appear fewer than 10 times are ignored to maintain experimental quality, which leaves 958 authors and 30,512 citations. Finally, the 100 most popular authors, i.e., those who received the most citations, are selected to reduce computational complexity. The diagonal entries of the citation matrices are set to 0 in our experiment. Multiple analyses, including MDS and factor analysis, are executed to show the performance of the proposed MACA, and all MDS results are plotted in two dimensions by using ALSCAL in SPSS 20.0. The factor analysis extracts all principal components whose eigenvalues are greater than 1, and the rotated solution is output by using the “maximum variance” rotation.
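
The author selection and raw co-citation counting described above can be sketched as follows. This is an illustrative Python sketch, not the authors' pipeline: the function names and the toy reference lists are hypothetical, and it assumes that each citing article contributes one count to every pair of distinct first authors it cites. The thresholds are lowered in the toy call so that the small example produces output.

```python
from collections import Counter


def select_authors(cited_first_authors, min_count=10, top_n=100):
    """Drop authors cited fewer than min_count times, keep the top_n most cited."""
    counts = Counter(cited_first_authors)
    frequent = {a: c for a, c in counts.items() if c >= min_count}
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(frequent, key=lambda a: (-frequent[a], a))
    return ranked[:top_n]


def cocitation_matrix(reference_lists, authors):
    """Count, for each author pair, the citing articles that cite both."""
    index = {a: i for i, a in enumerate(authors)}
    n = len(authors)
    matrix = [[0] * n for _ in range(n)]  # diagonal entries stay 0
    for refs in reference_lists:
        present = sorted({index[a] for a in refs if a in index})
        for x in range(len(present)):
            for y in range(x + 1, len(present)):
                i, z = present[x], present[y]
                matrix[i][z] += 1
                matrix[z][i] += 1
    return matrix


# Hypothetical toy data: first authors cited by four citing articles.
reference_lists = [
    ["White", "Salton", "Garfield"],
    ["White", "Salton"],
    ["Salton", "Garfield", "White"],
    ["Cronin"],
]
all_cited = [a for refs in reference_lists for a in refs]

authors = select_authors(all_cited, min_count=2, top_n=2)
rcm = cocitation_matrix(reference_lists, authors)
print(authors, rcm)  # ['Salton', 'White'] [[0, 3], [3, 0]]
```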

Indicating the affiliated professional field of keywords and information carriers

The affiliated professional fields of keywords and information carriers must be provided before the carrier- and keyword-based parameters of the proposed MACA can be calculated. Thesaurus lookup and manual classification are the main ways used to index their professional fields in the paper. The procedure for classifying the field to which a keyword belongs is as follows:

(1) Extracting all keywords of citations in the dataset.
(2) Removing duplicates and applying simple filtering (e.g., “method”, “methods”, and “methodology” are regarded as the same keyword).
(3) Classifying keywords according to subject headings in thesauruses (some keywords may belong to more than one field).
(4) Manually classifying keywords that are not available in thesauruses (a few academic professionals in the LIS area examine the classified results).

In the dataset, a total of 2053 keywords are extracted and six major categories are defined after the procedure. Table 11 shows these keyword fields with examples, basic statistics, and the indexes assigned. The categories could be made more specific, but six fields are enough in our experiments to demonstrate the performance of MACA. Some keywords are classified into more than one field; for example, “information retrieval” belongs to both categories 5 and 6.

Table 11 Catalogs of citations keywords and their indexes

Similarly, the procedure for classifying the field to which an information carrier belongs is as follows:

(1) Extracting all carriers of citations in the dataset.
(2) Searching for the category of each information carrier in Essential Science Indicators (ESI).
(3) Manually classifying carriers that are not available in ESI:
    (a) Downloading the contents and reading more than 50 articles of each carrier in the experimental period.
    (b) Classifying the carriers according to their keywords, the characteristics of their contents, and judgment.
    (c) Having the classified results examined by a few academic professionals in the LIS area.

In the dataset, all primary articles from JASIST focus mainly on LIS. After the procedure, the citation sources are divided into five major categories in our experiment. Table 12 shows these fields of information carriers with the indexes assigned. Some information carriers also belong to more than one field. For instance, the carrier “iConference” may be affiliated with both categories 4 and 5.

Table 12 Catalogs of information carriers and their indexes

Multi-dimensional scaling (MDS)

Multi-dimensional scaling (MDS), usually used for visualizing the level of similarity of individual cases in a dataset, is employed to show the performance of the proposed MACA. MACA combines three parameters with the raw co-citation matrix of ACA. To show their effects separately, the notations for ACA combined with the parameters are defined as follows:

(1) MDS-A: MDS result of the traditional ACA (ACA).
(2) MDS-AT: MDS result of ACA combined with time-based parameter (ACA + T). The weights used for author- and time-based parameters are 0.6 and 0.4, respectively.
(3) MDS-AC: MDS result of ACA combined with carrier-based parameter (ACA + C). The weights used for author- and carrier-based parameters are 0.6 and 0.4, respectively.
(4) MDS-AK: MDS result of ACA combined with keyword-based parameter (ACA + K). The weights used for author- and keyword-based parameters are 0.6 and 0.4, respectively.
(5) MDS-ATC: MDS result of ACA combined with T and C (ACA + TC). The weights used for author-, time- and carrier-based parameters are 0.5, 0.25, and 0.25, respectively.
(6) MDS-ATK: MDS result of ACA combined with T and K (ACA + TK). The weights used for author-, time- and keyword-based parameters are 0.5, 0.25, and 0.25, respectively.
(7) MDS-ACK: MDS result of ACA combined with C and K (ACA + CK). The weights used for author-, carrier-, and keyword-based parameters are 0.5, 0.25, and 0.25, respectively.
(8) MDS-M: The MDS result of ACA combined with all three parameters (MACA). The weights used for author-, time-, carrier-, and keyword-based parameters are 0.5, 0.2, 0.1, and 0.2, respectively.

All the weights used in these combined models were finally decided after experimenting with the possible settings. Every model assigns 0.5 or more to the author-based parameter, because the author co-citation relationship should be the basic element for constructing the network. Figure 8 shows the MDS results of ACA combined with the parameters separately. MDS-A, MDS-AT, MDS-AC, MDS-AK, MDS-ATC, MDS-ATK, MDS-ACK, and MDS-M are shown from the 1st row to the 4th row, left to right, respectively.
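
The eight weight settings listed above can be collected in one table and checked against the two stated constraints (weights summing to 1, author weight of at least 0.5). This is a small illustrative Python sketch; the dictionary layout is hypothetical, and assigning 0.0 to a parameter that a model omits (and 1.0 to the author weight in plain ACA) is an assumption of the sketch rather than something the text states.

```python
# Weights in the order (author, time, carrier, keyword), as listed in the text.
# Assumption: an omitted parameter gets weight 0.0, and plain ACA uses 1.0
# for the author-based matrix.
MODEL_WEIGHTS = {
    "MDS-A":   (1.0, 0.0, 0.0, 0.0),
    "MDS-AT":  (0.6, 0.4, 0.0, 0.0),
    "MDS-AC":  (0.6, 0.0, 0.4, 0.0),
    "MDS-AK":  (0.6, 0.0, 0.0, 0.4),
    "MDS-ATC": (0.5, 0.25, 0.25, 0.0),
    "MDS-ATK": (0.5, 0.25, 0.0, 0.25),
    "MDS-ACK": (0.5, 0.0, 0.25, 0.25),
    "MDS-M":   (0.5, 0.2, 0.1, 0.2),
}


def check_weights(weights):
    """Return True if the weights sum to 1 and the author weight is >= 0.5."""
    w_author = weights[0]
    # Compare with a tolerance because of floating-point rounding.
    return abs(sum(weights) - 1.0) < 1e-9 and w_author >= 0.5


invalid = [name for name, w in MODEL_WEIGHTS.items() if not check_weights(w)]
print(invalid)  # prints: []
```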

Fig. 8 MDS results produced by ACA and MACA-series (ACA + T, ACA + C, ACA + K, ACA + TC, ACA + TK, ACA + CK, MACA)

The area of each aggregation in MDS-M is smaller than that in MDS-A because the JASIST dataset focuses on particular fields. MDS-M also places points in different aggregations farther apart, because the three categories found in the dataset differ from one another. The professional fields of the three categories are shown in Table 13. Authors who have studies in more than one of these fields are located at the junctions of the aggregations. In all MDS results, the points in the category information retrieval/information behavior/user studies are more scattered, because most authors in this aggregation have studies that combine with other fields. In MDS-M, the authors in the semantic-related aggregation are more tightly gathered because they share a focus on algorithms, and their studies are closer to the retrieval-related/behavior-related aggregation. For example, some text mining methods are exploited to construct solutions and explain experimental results in the area of user studies (e.g., Davis 2004; Park and Park 2014). The points in the informetrics-related aggregation are also concentrated because the issues in that field are more specific. Information retrieval-related authors also have studies that are cited together with articles in informetrics-based research (Swanson et al. 2001).

Table 13 Authors’ interests of each aggregation in MDS results

In order to explain the nuances among these algorithms, seven authors in the dataset are selected; their areas of interest and assigned marks are shown in Table 14. The locations of these authors are identified and colored in every MDS graph in Fig. 8. For authors 1–3, the locations on MDS-A are more separated than those on MDS-M. MACA thus indicates that these authors’ studies are relatively similar, and an examination of their work confirms that several of their studies cover semantic and network-based research. Besides, the relative distances between their locations differ on MDS-AT, MDS-AC, and MDS-AK, which indicates that each parameter has a different impact on their correlations. Authors 5 and 6 are already close to each other on MDS-A; because their studies are correlated, their locations on MDS-M and the other graphs reflect this as well. Moreover, the two author groups, 1–3 and 5–6, have similar interests, as in the studies of authors 1 and 5. These five authors should therefore be similar, and their locations on MDS-M are indeed closer in the visualization. The field of author 4 is unlike the others’, and her locations on MDS-A and MDS-M are also far from theirs. Examining the areas of interest of author 4 and the others confirms the dissimilarities shown in the graphs. Furthermore, the position of author 7 in MDS-A is far from those of authors 1, 2, and 3, yet in MDS-M the distance decreases, which implies that his field is to some degree related to theirs. Indeed, author 7 has some related studies, such as semantic-based methods for analyzing researchers’ citing behavior (Case and Miller 2011). That explains why his position moves closer to category III as more general information about the citations is involved.

Table 14 Seven authors and their area of interests

MDS-measurement

In order to evaluate the MDS results more quantitatively, an MDS-measurement, denoted \(\sigma\), is derived from two variables, \(c\) and \(S\), which indicate cohesion and separation. These two variables are mainly used to evaluate the quality of a clustering result (Kaufman and Rousseeuw 1990). Assume that all \(\phi\) authors in MDS graph \(G\) are divided into \(\xi\) categories according to their fields. In category \(p\), \(p \in \left[ {1,\xi } \right]\), there are \(n_{p}\) authors with coordinates \(\left( {x_{i}^{p} ,y_{i}^{p} } \right)\), \(i \in \left[ {1,n_{p} } \right]\), and the coordinate of the central point is \(\left( {x_{c}^{p} ,y_{c}^{p} } \right)\). The Euclidean distance, \(\rho_{i}^{p}\), between a point in the category and the central point is defined as:

$$\rho_{i}^{p} = \sqrt {\left( {x_{i}^{p} - x_{c}^{p} } \right)^{2} + \left( {y_{i}^{p} - y_{c}^{p} } \right)^{2} }, \quad \forall i \in \left[ {1,n_{p} } \right]$$
(14)

The sum of the Euclidean distances between all points and the central points of their categories, with each category scaled by \(1/\left( {2n_{p} } \right)\), is

$$c = \mathop \sum \limits_{{p \in \left\{ {1,2, \ldots ,\xi } \right\}}} \frac{1}{{2n_{p} }}\mathop \sum \limits_{{i \in \left\{ {1,2, \ldots ,n_{p} } \right\}}} \rho_{i}^{p}$$
(15)

Besides, the sum of the Euclidean distances \(c_{sv}\) between every two points \(s\) and \(v\) in different categories is

$$S = \mathop \sum \limits_{s \in p \cap v \notin p} c_{sv}$$
(16)

Then MDS-measurement is defined as

$$\sigma = c/S$$
(17)

Here \(c\) represents the degree of cohesion of the clustering result and \(S\) the degree of separation. Higher cohesion (smaller \(c\)) within the same category and higher separation (larger \(S\)) between different categories are regarded as a good result, so a smaller \(\sigma\) is better. The MDS-measurements of ACA and MACA are shown in Table 15, and \(\sigma \left( {\text{MACA}} \right) < \sigma \left( {{\text{ACA}} + {\text{TK}}} \right) < \sigma \left( {{\text{ACA}} + {\text{K}}} \right) < \sigma \left( {\text{ACA}} \right)\) reveals that \(\sigma\) becomes smaller as more factors are involved in the experiments. This implies that points in the same category are closer and those in different categories are more separated as more elements are involved. In addition, for the same number of factors, the parameter K has more impact than T and C in our experiment. The parameters T, K, and C are also observed to have different impacts on the ACA results. In the difference in MDS-measurement between ACA + T and ACA + TC, the smaller \(c\) with larger \(S\) indicates that the carrier in which citations are published has more impact on points in different categories: this parameter separates authors whose research is in different fields. The same can be observed between ACA + K and ACA + CK, and between ACA + TK and MACA. Among ACA + T, ACA + C, and ACA + K, furthermore, the larger \(c\) with smaller \(S\) indicates that the published time and keywords of citations have more impact on points within one category: these parameters gather authors whose research is in the same field.
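
The MDS-measurement of Eqs. (14)–(17) can be sketched as follows. This is an illustrative Python sketch under two assumptions that the text does not state explicitly: the central point of a category is taken to be the arithmetic mean of its points, and each pair of points from different categories is counted once in \(S\). The function name and the toy coordinates are hypothetical.

```python
from math import hypot


def mds_measurement(categories):
    """Return (c, S, sigma) for a list of categories of (x, y) points."""
    # Eqs. (14)-(15): distances to each category's central point,
    # scaled by 1 / (2 * n_p) and summed over categories.
    c = 0.0
    for points in categories:
        n_p = len(points)
        cx = sum(x for x, _ in points) / n_p  # assumed: centroid as central point
        cy = sum(y for _, y in points) / n_p
        c += sum(hypot(x - cx, y - cy) for x, y in points) / (2 * n_p)

    # Eq. (16): distances between points that lie in different categories
    # (assumed: each unordered pair is counted once).
    s = 0.0
    for p in range(len(categories)):
        for v in range(p + 1, len(categories)):
            for x1, y1 in categories[p]:
                for x2, y2 in categories[v]:
                    s += hypot(x1 - x2, y1 - y2)

    # Eq. (17): smaller sigma means tighter and better separated categories.
    return c, s, c / s


# Hypothetical toy layout: two categories of two points each.
categories = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
c, s, sigma = mds_measurement(categories)
print(c, s, sigma)  # prints: 1.0 40.0 0.025
```

If each cross-category pair were instead counted in both orders, \(S\) would double and \(\sigma\) would halve for every model alike, so the ranking of models by \(\sigma\) would not change.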

Table 15 MDS-measurement results of different models

Factor analysis

Factor analysis is a statistical method that describes the variability among observed, correlated variables in terms of a potentially lower number of unobserved variables, called factors. Table 16 shows the results of factor analysis based on different models for different authors (factor loading >0.3). In total, five factors, denoted 1–5, are obtained; they represent information retrieval and seeking, traditional LIS and information analyses, informetrics and data-science-related research, information (seeking) behavior and user studies, and semantic- or network-based analysis, respectively. An author who loads on more than one factor probably works in several fields. The accumulative contribution value of each factor in the different algorithms is also shown. For example, the accumulative contribution values of factors 1–5 in ACA are 36.9, 57.1, 76.2, 90.9, and 97.8 %, respectively. The prominent 1st factor reveals that the authors whose study field is information retrieval and seeking are popular and authoritative.

Table 16 Factor analysis of all algorithms for three example authors (NK number of keywords, ACT accumulative contribution value; the authors’ factor loadings are larger than 0.3 in this table)

The five factors in the different algorithms also reveal the authors’ areas of interest to varying degrees. For example, Dr. Don R. Swanson, a famous researcher in LIS, has many important studies in different professional areas. According to our investigation, his main areas of interest are information retrieval (Swanson 1979) and user psychology and behavior analysis (Swanson 1977). Factors 1 and 4 in ACA clearly confirm this. The 5th factor identified in ACA + T and ACA + K provides a clue that he has several studies related to that area (Swanson 1960). It cannot be observed in ACA + C, probably because the carriers of these articles do not have strong attributes of the area. Moreover, the 3rd factor, which emerges in four other algorithms, indicates that his research areas are perhaps related to informetrics and data science. An examination of his publications confirms this observation (Swanson et al. 2001). Observations for the other two authors likewise reveal that the professional fields of part of their research can be explored with MACA. As a result, compared with ACA, MACA can obtain more details and nuances from the dataset as more information is imported.

Conclusion

A Modified Author Co-Citation Analysis (MACA) method is proposed in this paper for eliciting a bird’s-eye view of the intellectual structure of a research field. Four kinds of general descriptive metadata, namely the authors of a citation, its published time, its published carrier, and its keywords, are exploited in MACA to construct a co-citation network. The major difference between MACA and ACA lies in the construction of the raw co-citation matrix: in addition to the author co-citation relationship, MACA takes into account the difference in published time and the relationships between professional fields based on carriers and keywords. In our experimental results, more professional fields of an author are explored in MACA, and the distribution of each field indicates the number of research areas. Compared with ACA, MACA gives more detailed and sensitive demonstrations in MDS. The main contributions of the proposed MACA are as follows: (1) adding more information to author co-citation analysis provides a more detailed and nuanced analysis of the dataset; (2) MACA demonstrates the knowledge domain well, while the extra calculation it requires is only slightly more than that of ACA; (3) different additional information has different impacts on the clustering results. For example, the published carriers clearly separate authors whose interests are in different fields, and the keywords of the authors’ articles effectively gather authors with similar interests. As a result, MACA can be another option for understanding researchers and the knowledge map of a study field at a finer granularity. Furthermore, content-based ACA, which exploits the content of an article, can also be combined with MACA to improve the accuracy and efficiency of analysis.

However, the two parameters, carriers and keywords, in MACA are derived from the classification of professional fields. In this paper, a simple way to construct the field categories for analysis is proposed, and we believe there are many other methods to establish them, such as classification, machine learning, and ontology-based methods. Dividing the fields into more categories would reveal more nuances in the clustering results of MACA. Thus, we would like to focus on classification methods and an adaptive number of categories for MACA in the future.