Introduction

Social network analysis (SNA) is a rapidly developing scientific field that has appeared and grown significantly over the past 50 years, in the number of scientific publications and in the different disciplines involved (Borgatti and Foster 2003; Otte and Rousseau 2002). Until the 2000s the field was mostly developed inside different branches of social sciences, it then received significant attention from researchers in the natural sciences, which led to the so-called “invasion of the physicists” (Bonachich 2004) and resulted in the development of Network Science (Freeman 2004, 2011). To a large extent, this increase in interest in the topic was also due to the emergence of the World Wide Web in the 1990s and online social networks in the 2000s. This inevitably led to the extension of thematic areas where the methodology of network analysis is applied.

The usual way to study thematic areas and get important information on the topics developed in different scientific areas is to analyze the keywords used in their publications. In today’s academic world, keywords have become an important part of the information about publications, as it is usually obligatory to provide them with an article or book. However, when keywords are not provided by the author, they can be assigned to the paper by the journal or database, or automatically extracted from the title. Thus, the topical identity of any field can easily be constructed based on the metadata of the academic works.

Although the development of SNA has attracted the attention of a number of researchers, this attention has mostly been given to explorations of co-authorship structures (Batagelj et al. 2014; Leydesdorff et al. 2008; Otte and Rousseau 2002), citation structures of works or journals (Batagelj et al. 2014; Hummon and Carley 1993; Leydesdorff et al. 2008) and bibliographic coupling (Batagelj et al. 2014; Brandes and Pich 2011) between works, authors, or journals in the whole field. Different subfields (Batagelj et al. 2014, 2020; Hummon et al. 1990; Kejžar et al. 2010) and subdisciplines within the field (Borgatti and Foster 2003; Lazer et al. 2009; Otte and Rousseau 2002; Varga and Nemeslaki 2012) have also been studied. However, there are few examples of analysis of the main topics in SNA, where the data comes from one journal (Leydesdorff et al. 2008) or special subtopics (Batagelj et al. 2014, 2020).

Study on the development of SNA (Maltseva and Batagelj 2019), based on the analysis of networks of articles from the Web of Science (WoS) database matching the query “social network*”, influential works, and those published in the main journals in the field (up to 2018), has shown that the number of publications on SNA topics has grown significantly, and on average it doubles every 3 years. This is due to the huge interest to SNA from other disciplines, such as physics, economics, computer science, media studies, and—surprisingly—from behavioral biology. We assume that the growth in the number of articles and disciplines involved should be reflected in the topics observed in the field, and with the analysis of keywords, we observe the scope of contexts in which SNA is applied. In this article, we present the core concepts which unify the field, and, vice versa, the concepts showing disciplinary differences. The extraction of such information is used to compare different units—authors, groups of authors, or journals. This analysis reveals important information for the systematic description of the current development of SNA.

This paper is organized as follows. “Literature review” section presents several previous studies on keywords used in SNA. “Data” section describes the dataset and some issues of the network construction from the original two-mode network connecting works with keywords. “Basic network properties” section presents some statistical properties of this network and a list of the most-used keywords. In “Keywords co-occurrence” section we provide the analysis of the network of keyword co-occurrences. In “Keywords and authors” and “Keywords and journals” sections, we show the possibility of checking the keywords associated with authors, groups of authors or journals, using two-mode networks connecting works with authors and works with journals. We used a temporal version of the original network to get an insight into the dynamics of keyword usage. Our approach to bibliometric network analysis has already shown its productivity in a number of studies of different scientific fields and topics (Batagelj et al. 2014, 2017, 2020; Kejžar et al. 2010). The approach to temporal network analysis in Batagelj and Maltseva (2020) and Batagelj and Praprotnik (2016) is applied to large bibliographic networks for the first time.

Literature review

There are few studies describing the field of SNA using keyword analysis and the datasets used do not cover the whole field—they describe either the keywords associated with one journal (Leydesdorff et al. 2008), or literature on specific topics in SNA (Batagelj et al. 2014, 2019).

Leydesdorf et al. (2008) present an analysis of the topical development of SNA through the analysis of networks of title word co-occurrences of articles published in the journal Social Networks. During the period 1988–2006, 165 title words occurred more than once in a single year, and were included in the analysis. The authors find that over time, “particular issues reappear, notably centrality, measurement and measure, and concepts relating to data collection”, and “less frequently, concepts related to balance, blockmodels or equivalence appear”, which shows the methodological identity of the journal. For different years, the set of words in the network was different, meaning that the title words in the publications of each year provide a specific selection from the larger vocabulary of a discourse shaped and reproduced at the level of the specialty.

To provide more stable results, the semantic domain was enlarged: 6071 titles containing the keywords social network or social networks, were harvested from Google Scholar, and 172 words (occurring 8 or more times in any single year) were used for additional analysis. This resulted in a more stable structure; however, the titles of most publications focused on substantive issues rather than methodological ones (social capital, concepts referring to less-privileged social groups, such as minorities, women, patients, and the elderly). Authors also showed that some words changed their network position over time, such as the theoretical concepts of capital and community, which became central for the application of SNA across the social sciences. The words method and model also moved to the center over time, suggesting the rise of methodological reflection among scholars investigating social networks.

In a bibliometric analysis of the literature on centrality (Batagelj et al. 2014) and the literature on network clustering and blockmodeling (Batagelj et al. 2019), the authors presented lists of the most frequent keywords, constructed from the titles and keywords provided in the full descriptions of articles in the WoS. Most of the top keywords were either expected, or trivial (social, network), or generic with limited value (model, graph, structure, etc.). According to authors, as a tool of explanation, the keywords should be examined with great care in clearly defined contexts—in some groups of closely related works or authors. Kejžar et al. (2010) presented a method to construct such subnetworks of the topics of selected groups of authors from the two-mode networks of works with authors and works with keywords; this method was used in this article.

Previous studies show the importance of the analysis of relatively large datasets, which not only validate the results, but also make the analysis more systematic. We can expect the appearance (and reappearance) of the expected, or trivial and generic, words associated with SNA—which might be regarded as its core concepts. There should also be words associated with substantive issues, devoted to the topics which are being studied in different subfields of SNA. This paper aims to uncover these topics.

Data

Data collection

The data collection, cleaning and network construction were presented in detail by Maltseva and Batagelj (2019). Our dataset consists of publications from the WoS Core Collection database matching the query “social network*”, and other works highly cited in the SNA field, and published in main SNA journals, up to 2018. The first part of the dataset is based on the SN5 data collected for the Viszards session at the Sunbelt 2008 (Batagelj et al. 2014), and contains all the records obtained for the query “social network*” and articles from the journal Social Networks, until 2007. Obtained descriptions of the works can be of two types: with full descriptions (hits), and cited only (terminal, listed only in the CR field of a work description in WoS). We additionally searched for the terminal works without full descriptions which were most frequently cited, and papers on SNA of around 100 social networkers. The final version of SN5 contained 7950 works with a full description (hits), and 193,376 works (hits and cited only). The SN5 data were extended in June 2018 using the same search scheme. Starting from 2007, 576 articles from Social Networks journal were added. Additionally, in 2018, all the articles from the network—related journals contained in the WoS were included—such as Network Science, Social Network Analysis and Mining, Journal of Complex Networks (in total, 431 article). Again, we additionally collected full descriptions for terminal works with high (at least 150) citation frequencies. We also included manual descriptions of important terminal works from the dataset BM on blockmodeling (Batagelj et al. 2019). Finally, our dataset included 70,792 WoS records with complete descriptions (hits).

Using WoS2Pajek 1.5 (Batagelj 2017), we transformed our data into a collection of networks: a one-mode citation network \(\mathbf {Cite}\) on works (from the field CR of the WoS file description) and two-mode networks—the authorship network \(\mathbf {W\!A}\) on works \(\times\) authors (from the field AU), the journal network \(\mathbf {WJ}\) on works \(\times\) journals (from the field CR or J9), and the keyword network \(\mathbf {WK}\) on works \(\times\) keywords (from the fields ID, DE or TI). The keywords are single words—phrases were split to components. They were lemmatized and stopwords were removed. After data cleaning, from 70,792 hits we produced networks with sets of the following sizes: works \(|W| = 1{,}297{,}133\), authors \(|A| = 395{,}971\), journals \(|J| = 69{,}146\), key words \(|K| = 32{,}409\). We removed multiple links and loops and obtained basic networks \(\mathbf {CiteN}\), \(\mathbf {WAn}\), \(\mathbf {WJn}\), and \(\mathbf {WKn}\). For the terminal works only partial information is provided: the name of the first author, journal, publication year, journal issue, and the first page number. That is why it is not correct to use these networks for the analysis of keywords and authors. We constructed reduced networks containing only works with complete descriptions \(\mathbf {CiteR}\), \(\mathbf {WAr}\), \(\mathbf {WJr}\), and \(\mathbf {WKr}\), where the sizes of sets are as follows: works \(|W| = 70{,}792\), authors \(|A| = 93{,}011\), journals \(|J| = 8943\), key words \(|K| = 32{,}409\). The total number of keywords is lower than the total number of documents, which means that the same keywords reappear in papers several times. In this paper, we use these three two-mode networks for the analysis.

Even though the initial search was oriented towards social networks, an additional ‘saturation’ search of the papers which were cited a lot by the field’s representatives, as well as inclusion of the works from journals important for the field, and the most prominent authors allowed us to improve the dataset in sense of the broader inclusion of the publications related to network analysis in general. Thus, the dataset covers not only the works of social scientists, but also influential papers published by physicists, biological scientists, information and computer scientists, etc. This additional search allowed us also to include influential papers, usually published earlier, that could have been overlooked by our search queries because they do not use the contemporary terminology.

Derived networks

A two-mode network can be represented as a two-mode matrix. A pair of two-mode networks can be multiplied, if the second set of nodes in the first network is equal to the first set of nodes in the second network. If all weights in two-mode networks are equal to 1, then the product of the weights will also be equal to 1 and therefore a [uv] element of the product matrix counts the number of ways we can move from node u using the first network through the second set and afterwards using the second network to node v (Batagelj and Cerinšek 2013; Batagelj et al. 2014). In our case, this shared set is the set of works (papers, reports, books, etc.), which links bibliographic networks to each other. Using the multiplication of two-mode networks, we constructed derived networks.

Multiplying a network \({\mathbf {W}}{\mathbf {K}}\) with its transpose, we obtain the network of keyword co-occurrences \({\mathbf {K}}{\mathbf {K}}\) = \({\mathbf {W}}{\mathbf {K}}^T\) * \({\mathbf {W}}{\mathbf {K}}\). The weight of an edge between two nodes \(w[k_1,k_2]\) in the keyword co-occurrence network \({\mathbf {K}}{\mathbf {K}}\) tells us in how many works the keywords \(k_1\) and \(k_2\) were used together. Multiplying different compatible two-mode networks, we construct the network of authors and keywords \({\mathbf {A}}{\mathbf {K}}\) = \({\mathbf {W}}{\mathbf {A}}^T\) * \({\mathbf {W}}{\mathbf {K}}\), counting in how many works author u used the keyword k, and journals and keywords \({\mathbf {J}}{\mathbf {K}}\) = \({\mathbf {W}}{\mathbf {J}}^T\) * \({\mathbf {W}}{\mathbf {K}}\) counting how many times journal j used the keyword k.

Normalization in derived networks

Derived networks can have some deficiencies, such as overrating the contribution of bibliographic entities with many ties (works with many authors or keywords, journals with many works). To deal with such cases, the fractional approach (Batagelj and Cerinšek 2013; Batagelj 2019; Gauffriau et al. 2007) was used. This takes into account the contribution of bibliographic entities (works, authors, or journals), normalizing their weights so that their input to the resulting network is equal to 1.

Let us provide an example of a two-mode network of works \(\times\) keywords \({\mathbf {W}}{\mathbf {K}}\). In a regular network, the outdegree is equal to the number of keywords of the work, and the indegree is equal to the number of works in which the same keywords are used. The normalization creates the network \(\mathbf {nWK}\) where the weight of each arc is divided by the sum of weights of all arcs having the same initial node as this arc (the outdegree of a node):

$$\begin{aligned} n({\mathbf {W}}{\mathbf {K}})[w,k] = \frac{{\mathbf {W}}{\mathbf {K}}[w,k]}{\max (1,\text {outdeg}(w))} \end{aligned}$$

where w is a work and k is a keyword. The contribution of each paper w is equal to 1, and we assume that each keyword takes an equal place among others. The proposed normalization is applied to different two-mode networks \({\mathbf {W}}{\mathbf {K}}\), \({\mathbf {W}}{\mathbf {A}}\), and \({\mathbf {W}}{\mathbf {J}}\), and thus the product networks \(\mathbf {nKK}\), \(\mathbf {nAK}\), and \(\mathbf {nJK}\) are also normalized.

For \({\mathbf {J}}{\mathbf {K}}\), we also applied the TF–IDF approach (term frequency—inverse document frequency) to the normalization (Robertson 2004), which allows us to evaluate the importance of a word to a document in a corpus of documents. A detailed description of each derived network construction and normalization is presented in the corresponding sections below.

Temporal networks

Applying the temporal quantities approach (Batagelj and Maltseva 2020; Batagelj and Praprotnik 2016) to the \(\mathbf {WKr}\) network, we constructed temporal networks, using Python libraries Nets and TQ (Batagelj 2014). These networks can be of two types—instantaneous (with values given per year) \(\mathbf {WKins}\), and cumulative \(\mathbf {WKcum}\). They are stored in the json format. Using the multiplication and normalization of temporal networks, different derived temporal networks can be constructed. The construction of these networks is described in the corresponding sections below.

Basic network properties

Statistical distribution

For the works with full descriptions (\(DC=1\)), the keywords are supposed to be presented in special fields DE (Author Keywords) and ID (Keywords Plus). However, for some publications this information is not provided. In such cases the keywords are constructed by WoS2Pajek from the titles of works. All composite keywords were split into single words, and lemmatization was used to deal with the word-equivalence problem. However, the works which are cited only (\(DC=0\)) do not have keywords—in our case, 95% of the works in the \(\mathbf {WKn}\) network.

In \(\mathbf {WKr}\), the network constructed from works with complete descriptions, the number of keywords in 70,792 works varies from 1 to 84 (Fig. 1, top). The distribution of the number of keywords used in all works (Fig. 1, bottom) shows that large numbers of keywords are mentioned only once (16,164), twice (3919), or three times (1970). The usage of these keywords is episodic, and it shows the wide scope of the contexts where SNA is applied. There are also keywords which are used intensively, constructing the core concepts of the field.

Fig. 1
figure 1

Logarithmic plots with distributions of the number of keywords per paper (top) and number of keywords used in all works (bottom)

Figure 2 presents the temporal distributions of the number of all keywords (top) and unique (different) keywords (bottom) used in SNA publications. The observed rise of the number of keywords used is due to the fast growth in the number of articles on SNA topics starting from 2007, which was shown in Batagelj and Maltseva (2019). In 2007 the number of keywords used was around 30,000; in 2017 it was 160,000. The number of different keywords also shows the growth in the range of scientific fields and disciplines where SNA is applied—in 2005 it was around 3000; in 2017 it was four times larger.

Fig. 2
figure 2

WKins: distributions based on keywords and works

The most used keywords

The most frequent keywords are presented in Table 1. Not surprisingly, the words social and network are mentioned in the largest number of works, followed by analysis, which is trivial, but also shows the relevance of the data to the topic being studied. Some other frequently used words—graph, structure, relationship, role, tie (marked in boldface)—are related to network analysis, while others—datum, base, information, research, theory, model, algorithm, approach, pattern, effect—to scientific research in general (they are generic with limited value). General graph theoretic terms such as node, edge, arc, link, path, connection, do not appear among the top terms. There are also words related to exact substantive topics being studied in network analysis—online, networking, facebook, internet, site, web; health; behavior; education; support; communication; influence; innovation; trust; risk; family; community. We note that keywords can have different meanings in different contexts, therefore their identification in different subgroups (of authors or journals) can give us better understanding of the topic structure of SNA.

Table 1 WKn net indegree: the most used keywords

We counted the proportion of the number of appearances of each keyword to the most frequent keyword appearance for each year based on the \(\mathbf {WKins}\) network. This proportion normalizes the importance of certain keyword over time from 0 to 100%. The proportions for the most used keywords (Table 1) over time are presented in the figures below: the most frequent keywords up to 100% and 50% (Fig. 3), and up to 30%, 14%, and 10% (Fig. 4). It is expected that the keywords social and network get the maximum levels of usage in almost all the years starting from the 1970s. Other keywords presented in Fig. 3 have maximum usage in the 1970s due to the small number of works published in this period. However, it shows that these words—structure, theory, graph, relationship, role, innovation—are used for all of the recent history of SNA. It is interesting that these keywords were very frequently used in the early years (up to 1970s).

Some of the most used keywords, presented in Figs. 3 and  4, have been in use for a long time—these are community, support, health, algorithm, behavior, tie). Some of the words appeared later—in the 1980s and 1990s (trust, technology, service, web, risk), or the 2000s (internet, media, online, facebook). Mentioned since the 1990s, the word detection grew in the 2010s, presumably due to studies of community detection. The word animal, which “surprisingly” appeared at the analysis of the citation network (Maltseva and Batagelj 2019), is presented in the field from 1990s.

Fig. 3
figure 3

Distribution of proportions of keywords: scales of 100% and 50%

Fig. 4
figure 4

Distribution of proportions of keywords: scales of 30%, 14%, and 10%

Keywords co-occurrence

Network construction

We applied the column projection to the normalized reduced \(\mathbf {WKr}\) network to construct the normalized one-mode network \(\mathbf {nKK}\):

$$\begin{aligned} \mathbf {nKK} = n({\mathbf {W}}{\mathbf {K}})^T * n({\mathbf {W}}{\mathbf {K}}) \end{aligned}$$

In this network, the loops were deleted and bidirected arcs were transformed to edges (with the summation of the arc weights). The obtained network \(\mathbf {nKK}\) consists of 32,409 nodes and 2,799,530 edges. The weight \(\mathbf {nKK}[i,j]\) on the edges between the nodes (keywords) is equal to the fractional co-occurrence of keywords i and j in the same works. It holds that \(\mathbf {nKK}[i,j] = \mathbf {nKK}[j,i]\) and \(\sum _{i,j} \mathbf {nKK}[i,j] = |W|\)—each work has value 1 that is redistributed over keywords (Batagelj et al. 2019).

Keyword co-occurrence network analysis

An exploratory analysis showed that in the \(\mathbf {nKK}\) network the most frequent words social, network, and analysis connected most of the other keywords, which is why we excluded these three nodes from the network. Using the Link Islands approach (Batagelj et al. 2014), we searched for subnetworks sized from 2 to 75 nodes. A large number of islands (342) was obtained, where the majority of islands (301) represented only pairs of keywords. The main island includes 75 nodes; there are also some islands of smaller sizes.

A large part of the main island (Fig. 5) consists of keywords on the topic of networking sites and social media (networking, online, site, service, internet, web 2.0, semantic, technology, media, facebook, twitter, technology). Other words connected to this group are information, use, user, privacy, and security, presumably raising the issues of networking service usage. Information is also connected to the words diffusion, innovation, knowledge, and management.

Other central keywords are base, connected to the words model and community (also connected to each other). Model is connected to the words dynamics, complex, spread, influence (with latter connected to maximization), and community—by detection, structure, complex, algorithm, and virtual. Another group of words connected graph is algorithm, model, random, theory, centrality (connected to betweenness), large (connected to scale linked to free). Other locally highly connected groups are formed by the words datum, big, and mining, prediction and link. These nodes, which are the largest part of the main component, form a group of keywords on the methodological issues of SNA.

Some words appearing in this subnetwork are associated with substantial topics in SNA: on health (health, support, life, care, mental, adult, behavior) and education (education, higher, student, learn, e–, learning). Learn is also connected to the word machine, a developing topic in computer science.

Fig. 5
figure 5

The main island of the \(\mathbf {nKK}\) network

Other islands identify some expressions from topics being developed in SNA (strength, weak, tie; corporate - interlock - directorate; triadic - closure; small - world, or some broad topics from substantive studies (organ - donor - donation; persecutory - delusion - paranoia; trade - international - migration), and some stable phrases with limited value (special, issue, introduction).

To go deeper into the meaning of the keywords, we looked at them in different contexts—the contexts associated with selected groups of authors and journals which were found to be important during our previous analysis of co-authorship, citation and bibliographic coupling structures among authors and journals.

Keywords and authors

Network construction

To construct the network of authors and keywords \({\mathbf {A}}{\mathbf {K}}\), we used the normalized reduced networks \(\mathbf {WAr}\) and \(\mathbf {WKr}\). The first network was transposed and then multiplied by the second in the following way:

$$\begin{aligned} \mathbf {nAK} = n({\mathbf {W}}{\mathbf {A}}) ^ T * n({\mathbf {W}}{\mathbf {K}}) \end{aligned}$$

The obtained network is normalized. In this network, the weight \(\mathbf {nAK}[a,k]\) of the edge between the nodes a and k is equal to the fractional use of author a of keyword k. It can be extended to a group of authors C, for a given keyword k:

$$\begin{aligned} \mathbf {nAK}[C,k] = \sum _{a \in C}\mathbf {nAK}[a,k] \end{aligned}$$

In this section, we used the results of the analysis of co-authorship networks between the authors in the field of SNA. From the network \(\mathbf {WAr}\), which consisted of 70,792 works and 93,011 authors, we created collaboration network \(\mathbf {Ct'}\) (Batagelj and Cerinšek 2013; Batagelj et al. 2014). We used normalization proposed by Newman (2001), who interpreted collaboration in a “strict” way—as a collaboration only with others (excluding single authored papers). In this case, for the initial \(\mathbf {WAr}\) network the weight of each arc is divided by the sum of the weights of all arcs having the same initial node (its outdegree) subtracting the initial author (which is 1). Then the network \(\mathbf {Ct'}\) is constructed by the transposition of the regularly normalized \(\mathbf {n(WA)}\) network and multiplying it by the Newman normalized \(\mathbf {n'(WA)}\) network.

$$\begin{aligned} n'({\mathbf {W}}{\mathbf {A}})[w,a] = \frac{{\mathbf {W}}{\mathbf {A}}[w,a]}{\max (1,\text {outdeg}(w)-1)} \end{aligned}$$

then

$$\begin{aligned} \mathbf {Ct'} = \mathbf {n(WA)}^T * \mathbf {n'(WA)} \end{aligned}$$

The obtained \(\mathbf {Ct'}\) is undirected and does not have loops. The contribution of a complete subgraph corresponding to each work is 1. The weights of the edges between the nodes (authors) are equal to the total contribution of the “strict collaboration” of authors i and j to works they wrote together. The total contribution for an author is counted by line weights—it is equal to the sum of the weights of all the works he or she co-authored.

Keywords used by selected groups of authors

To extract the groups of authors collaborating with each other from the \(\mathbf {Ct'}\) network, we used the Islands approach (Batagelj et al. 2014). We generated 14,222 simple islands of between 2 and 50 nodes (in sum, 45,524 nodes, or 45% of all nodes in the network). The sizes and number of islands show that there are many groups of collaborating authors that can be extracted out of the \(\mathbf {Ct'}\) network. There are different ways to identify the islands for the further inspection, based on the size of islands, largest values of line weights, or specific names. To get islands with really strong ties, we removed all the lines lower the threshold of 7.5 from the \(\mathbf {Ct'}\) network and extracted the network of 32 nodes. Then we manually searched for the islands to which these 32 nodes belong, and extracted them. Another approach used was to search for the structures for some well-known authors.

For presenting the keywords associated with groups of authors, we have chosen simple islands represented by BARABASI_A (8 authors), BORGATTI_S, SNIJDERS_T (4 authors each), CHRISTAKIS_K, SKVORETZ_J (3 authors each), WASSERMAN_S, PATTISON_P, VALENTE_T, DOREIAN_P (2 authors each) for the extraction of keywords. The selected islands with the members of each group are presented in Fig. 6. The top-20 keywords for each group are presented in Tables 2,  3 and 4. The top keywords for these clusters are the trivial keywords network and social. Other keywords can provide some description of the topics being studied by selected groups of authors, oriented either on methodological or substantive issues.

Fig. 6
figure 6

Collaboration network: selected simple islands

The island of Borgatti, Everett, Boyd, and Halgin can be attributed to the methodological group, having the keywords graph, centrality, role, regular, equivalence, semigroup, structure, clique, and homomorphism, as can the pair of Doreian and Conti with the words equivalence, evolution, journal, balance, blockmodeling, generalized, regular, ranking. For Robins and Pattison the words are model, graph, random, Markov, logit, logistic, regression, exponential, p, semigroup, asterisk, multirelational. The pair of Wasserman and Faust can be represented with the words correction, model, exchange, stochastic, structure, statistical, blockmodel, equivalence, logit, triad (there are also logistic and regression in 23th and 24th places). The group of 4 authors connected to Snijders has the keywords Markov, random, friendship, behavior, peer, inference, influence, stochastic, actor, longitudinal, orient which reflect their work in stochastic actor-oriented models. The island represented by Skvoretz has the keywords power, exchange, bias, model, correction, theorem, approximation, simulation, dynamic.

Table 2 Keywords used in the clusters of authors from Fig. 6 (1)
Table 3 Keywords used in the clusters of authors from Fig. 6 (2)
Table 4 Keywords used in the clusters of authors from Fig. 6 (3)

The network science representatives—the group of 8 authors with Barabási, Posfai, Albert, and others—can also be attributed to the methodological stream, having the words dynamics, complex, scale, web, community, world, internet, model, free, evolve, and random.

The top keywords for other selected groups cover some substantive issues. The group of Fowler, Christakis, and Shakya have keywords spread, behavior, health, smoking, human, cooperation, obesity, influence, evolution, dynamics. The group of Valente is represented by the words health, diffusion, behavior, innovation, peer, adolescent, influence, smoking, prevention, cigarette, leader. As an example, it is interesting to compare the latter with the description on the official home page of Thomas Valente, who is working on the topics of social networks, behavior change, and program evaluation and uses social network analysis, health communication, and mathematical models to implement and evaluate health promotion programs designed to prevent tobacco and substance abuse, unintended fertility, and STD/HIV infections, and is also engaged in mapping community coalitions and collaborations to improve health care delivery and reduce healthcare disparities.

Some simple islands form larger general islands. The general island formed by the groups of SNIJDERS_T, SKVORETZ_J, WASSERMAN_S, PATTISON_P, and DOREIAN_P is presented in Fig. 7. The keywords for this island are presented in Table 5. We can see that the keywords with largest values are more commonly used words, such as network, social, model, analysis, graph, structure, datum, structural, theory, group, method. However, there are also special words on methodological issues, mentioned in the islands above, such as correction, exchange, equivalence, random, power, markov, evolution, statistical, dynamics, generalized, regression, exponential, blockmodel, logit, p, cluster, logistic, dynamic, blockmodeling. This is a group of authors dealing with methodological issues in SNA.

Fig. 7
figure 7

Collaboration network: a general island formed out of several simple islands

Table 5 Keywords used in the general island of authors (Fig. 7)

Keywords and journals

Network construction

To construct the derived network of journals and keywords, \({\mathbf {J}}{\mathbf {K}}\), we used the normalized reduced networks \(\mathbf {WJr}\) and \(\mathbf {WKr}\). The first network was transposed and then multiplied by the second in the following way:

$$\begin{aligned} \mathbf {nJK} = n({\mathbf {W}}{\mathbf {J}}) ^ T * n({\mathbf {W}}{\mathbf {K}}) \end{aligned}$$

The network is normalized. In this network, the weight, \(\mathbf {nJK}[j,k]\), on the edges between the nodes j and k is equal to the fractional contribution of journal j for given keyword k; or for a group of journals C:

$$\begin{aligned} \mathbf {nJK\text {[C,k]}} = \sum _{j \in C}\mathbf {nJK}[j,k] \end{aligned}$$

We used the TF–IDF approach to line weighting (Robertson 2004), which allows us to evaluate the importance of a word to a document in a corpus of documents. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. In our case, \({\mathbf {T}}{\mathbf {F}}\) shows the number of times a keyword appears in a selected journal, divided by the total number of keywords in the journal, and \(\mathbf {IDF}\) is the logarithm of the number of journals in the corpus divided by the number of journals where the specific keyword appears. We used the reduced networks \(\mathbf {WJr}\) and \(\mathbf {WKr}\) for the \(\mathbf {JKr}\) network construction, and calculated TF–IDF indexes for the keywords in the following way:

$$\begin{aligned}&\mathbf{TF}{-}\mathbf{IDF}(\text {keyword, JOUR}) = {\mathbf {T}}{\mathbf {F}}(\text {keyword, JOUR}) * \mathbf {IDF}(\text {keyword})\\&{\mathbf {T}}{\mathbf {F}}\text {(keyword, JOUR)} = \frac{\#\text { of times keyword appeared in JOUR}}{\text {Total }\#\text { of keywords in JOUR}}\\&\mathbf {IDF}\text {(keyword)} = \log \frac{\# \text { of JOURs}}{\#\text { of JOURs with keyword}} \end{aligned}$$

Keywords in selected journals

In our analysis, we identified journals intensively used in SNA. To present the analysis of the keywords associated with these journals, we have chosen journals Social Networks (SOC NETWORKS), Lecture Notes in Computer Science (LECT NOTES COMPUT SC), Physica A (PHYSICA A), PLOS ONE (PLOS ONE), American Journal of Sociology (AM J SOCIOL), and Animal Behaviour (ANIM BEHAV).

Using the fractional approach to network normalization, we extracted the top keywords associated with the selected journals (Tables 6, 7). As shown above, the most used keywords are trivial, and many frequently used words are generic, giving limited value. However, other words represent the features of the discourse provided by each of the selected journals.

For Social Networks, such keywords are centrality, measure, random, equivalence, role, describing methodological issues, and community, organization, group, exchange, communication, support, friendship, focusing on substantive ones. Comparing this set with the keywords associated with the American Journal of Sociology, we can support the observation of Leydesdorff et al. (2008) that in the social sciences SNA is used in studies on substantive, not methodological issues. Keywords for AM J SOCIOL are market, organization, power, friendship, family, exchange, action, collective, behavior, class, community, culture, state, industry; however, there are also keywords reflecting the traditional terms of SNA (strength, weak, embeddedness, diffusion, small, world).

For Lecture Notes in Computer Science, the special keywords are those describing computer networks and services (online, web, privacy, service, networking, recommendation, mobile, media, twitter) and those representing the computer science issues being studied (detection, semantic, mining). For Physica A, the most used keywords identify the methodological and substantive issues which network scientists are working on (complex, dynamics, evolution, community, detection, spread, small, world, free, scale, epidemic, diffusion, propagation, opinion, behavior, rumor, online); the “traditional” SNA term centrality also appears in the list. The most frequently used keywords for the general scientific journal PLOS ONE are similar to those in Phisica A (dynamics, complex, behavior, community, evolution, online), but health research has a bigger focus (health, population, human, risk, disease, spread, influence). In Animal Behaviour, attention is given to the studies of animals and nature, which are associated with the keywords animal, individual, population, female, male, wild, selection, reproductive, primate, dolphin, fission, macaque, being studied in sense of the behavior (and behaviour), group, structure, dynamics, association, evolution, dominance, organization, interaction, transmission, and success.

The results obtained by the TF–IDF approach (Tables 8,  9) are similar to the results of fractional normalization. However, they even more clearly show the special features of the discourses provided in the selected journals. For all the journals, besides LNCS, the keyword social moved away, and network is far from the first place. In the list of the top keywords in Animal Behaviour, the trivial and generic words with limited value are replaced by the terms from biology. The words structure and structural remain in the lists of the top keywords in all the journals. We can conclude that this approach better identifies the keywords associated with some substantial topics developing in SNA.

Table 6 Selected journals and keywords: fractional approach (1)
Table 7 Selected journals and keywords: fractional approach (2)
Table 8 Selected journals and keywords: TF–IDF index (1)
Table 9 Selected journals and keywords: TF–IDF index (2)

Discussion and conclusion

This paper provides an insight into the topics developed in SNA and reveals important information for its systematic description. As previous studies have shown, the identification of the keywords used in publications can provide important information on the discourse developed in the field and its main streams and topics, either methodological or substantial. However, it was also shown that the results of such an analysis should be examined with a great care. Small samples mean the networks for separate years can be significantly different, both in the set of words and their peripheral or central positions. The most used keywords can be trivial and anticipated before the analysis, or generic with limited value. Last but not least, the results are inevitably connected to the data, which are in turn dependent on the databases used for data collection and the queries used for identifying works: depending on these, the results can reflect certain disciplines, fields or subfields, and can be oriented to methodological or substantive issues.

In this study, we used the keywords obtained from the works published in the WoS database matching the search query “social network*”, influential works, and those published in the main journals indexed in the WoS. The time coverage is from the very first articles published in 1970s, up to 2018. 32,409 keywords were obtained from 70,792 works with complete descriptions, from the fields Author Keywords, Keywords Plus, and titles. The distributions of the numbers of all keywords used and the unique (different) keywords over time show accelerated growth starting from 2007, which was the result of the increasing number of works on SNA in various scientific fields, and disciplines applying SNA in their studies, as shown by Maltseva and Batagelj (2019). The wide scope of the contexts where SNA is applied is also shown by the large number of keywords which are used episodically.

In our analysis, we looked at the distribution of the most frequently used keywords in a two-mode network of works \(\times\) keywords and the islands obtained from the normalized one-mode network of keyword co-occurrence. To go deeper, we placed the keywords into the clearly defined contexts: selected groups of authors closely connected to each other according to their co-authorship, and selected journals representing different disciplines. These results support the conclusions made in previous studies: the most used keywords are trivial—social and network,—and many other frequently used keywords have limited value: they express terms commonly used in research in general (such as datum, base, information, research, theory, algorithm, approach), or in SNA (such as graph, structure, relationship, role, tie). Temporal analysis showed the constant presence and usage of these words (counted as the proportion of the number of the appearances of each keyword to the most frequent keyword appearance for each year).

Another group of topics identified in SNA can be assigned to the methodological stream. These topics appeared in the analysis of co-occurrence networks and (mostly) in the lists of frequent keywords used by selected groups of authors and by selected journals. In the main island of the \(\mathbf {KKn}\) network, we identify the topics of graph theory, dynamic and complex network models, models of spread and influence maximization, agent based models, random graph models, large scale-free networks, community detection algorithm, link prediction, graph centrality, innovation diffusion, semantic web, machine learning, big data, and data mining. Other islands identify some topics traditionally developed in SNA, such as the strength of weak ties, triadic closure, interlock directorates, and small world. With the previous group of trivial and general keywords in SNA, these words can be seen as the core concepts of the field.

Besides these keywords, the list of the most used keywords largely provided the keywords representing some substantive topics developed in SNA: networking sites and social media; community, family, health, education studies; trust and support; innovation and influence. A temporal analysis of the keywords associated with these topics showed that while some of them (community, support, health, behavior) are present in the field from 1970s, others appeared later, in the 1980s and the 1990s (technology, service, web) and in the 2000s (words connected to internet and media studies). Some of the words could change their topical origin over time—for example, the word community, which could be associated with the studies of offline communities in 1970s and 1980s, online communities in the 1990s and 2000s, and the algorithms of community detection from 2010s (as the usage of the word detection also increased from this time). The analysis of the co-occurrence network \(\mathbf {nKK}\) adds other substantive issues connected with networking sites and social media—the topics of information privacy, security and information management.

Methodological and substantive streams are also found in the selected groups of authors and journals representing different disciplines. We identified a set of social network analysts (the group’s representatives are Borgatti, Pattison, Wasserman, Doreian, Snijders, and Skvoretz) working mostly on the the methodological issues of SNA, and several other groups (representatives are Valente and Christakis) who work on substantial issues. The group of network scientists (represented by Barabási) was also attributed to the methodological stream. The analysis of keywords for journals showed the disciplinary differences between the selected sources. The comparison of Social Networks with the American Journal of Sociology showed that the former is mostly methodologically oriented while the later applies the tools of SNA for substantive studies, supporting the previous observation of Leydesdorff et al. (2008). Lecture Notes in Computer Science is devoted to the topics of internet networks and services, developed by the computer scientists. Physica A is in a way similar to the general scientific journal PLOS ONE—both focus on issues developed in network science; however, the latter also focuses on health studies. Animal Behavior publishes works on the social networks of animals. The proposed fractional and TF–IDF approaches showed their strengths in the identification of the keywords for selected subgroups, and the latter was better at identifying keywords associated with substantial topics. We suppose that these approaches can be further used for the extraction of the unit (author or journal) identities and their clustering according to similarity, and this could be a direction for future research.

There are some limitations in the current study. First of all, the initial search was oriented towards social networks, and thus some works related to a broader field of network analysis in general could have been overlooked. At the same time, the search query for “network analysis” would be too broad (beyond the data analysis), including the works related to computer networks, optimization problems on networks, etc. That is why, on the first step, our search was somehow limited. However, on the second step, we extended the results of the original query and added works initially not included in the search. This additional ‘saturation’ search of the papers which were cited a lot by the field’s representatives, as well as inclusion of the works from journals important for the field, and the most prominent authors allowed us to improve the dataset in the sense of broader inclusion of the publications related to network analysis in general. Thus, the obtained dataset covers not only the works of social scientists, but also influential papers published by physicists, biological scientists, information and computer scientists, etc. This search allowed also to include additional influential papers, usually published earlier, that could have been overlooked by our search queries because they do not use the contemporary terminology. Second, our dataset is based on the information available in the WoS. Adding publications from the journals not indexed in the WoS, or the analysis of some smaller datasets (e.g., articles from specific journals) could provide extra results. For the further analysis, the obtained dataset can be extended through the additional search queries, such as “complex network*” and “network science”, and usage of other bibliographic databases, which will make the view of the whole landscape of network analysis more complete and conclusive. Although we do not expect substantial changes in the top level results. Third, even though the choice of authors and journals is motivated by the previous analysis of co-authorship, citation and bibliographic coupling structures among authors and journals, the choice of the authors’ groups and journals is partially subjective. That is why it should be seen as an illustration of a methodological approach. Finally, the approach of temporal network analysis, which is applied to large bibliographic networks for the first time, needs further developments in the reading and visualization of the results. This is one of the tasks for the future.