Introduction

“Science of Science” refers to research on scientific and technological knowledge that explores the fundamental laws governing the development of science and technology. It developed in Poland as early as the 1910s. The approach of Polish researchers, who moved from metaphysics to empirical research and from the analysis of a single scientific discipline to the overall study of science, laid the theoretical foundations for Science of Science as a specific field of study in Poland (Chen et al. 2017). However, the book “The Social Function of Science” by the communist-oriented English researcher Bernal (Bernal 1939) is generally recognized as marking the real birth of the Science of Science. This book was directly influenced by the “Hessen episode” (Zhao and Jiang 1988), which had its deeper origins in Marxist ideology, as Marx argued that “the essence of science is just its social function”. Similar ideas seem to be recurrent: more recently, Zhao and Jiang (1983) also argued that science and society are closely related and cannot be separated from each other.

As a global, interdisciplinary subject, the main objective of Science of Science is to consider all scientific and technological knowledge and activities as a research subject in order to explore the fundamental laws of the development of science and technology. Its field of research should therefore include historical, philosophical, sociological and economic research on science. However, the development of Science of Science on the international scene has not been so straightforward. Figure 1, taken from He et al. (2017), illustrates the scientific development route of Bernal Prize winners. It is divided into three research axes: “Scientometrics”, “Science, Technology and Society (STS)” and “Sociology of Scientific Knowledge (SSK)”. Derek J. de Solla Price inherited and developed Bernal’s scientific ideas and paradigms, deepening and broadening the theory and methods of Science of Science with an emphasis on data and the quantitative analysis of science (Liu et al. 2013). The American sociologist of science R. K. Merton examined the relationship between science, technology and society (STS) as an independent research subject while excluding the possibility of sociological research into the content of scientific knowledge (Pu and Di 1998). As a result, research on the sociology of science exploring the “social perspectives” and “cognitive perspectives” of Science of Science has continuously differentiated during the development of the field (as illustrated, for example, by the creation of SSK, which focuses on the fields of “anthropology” and “ethics”). In fact, Science of Science has gradually moved away from the original paradigm of Bernal’s scientific theory (Footnote 1).

Fig. 1 The development route of Science of Science (He et al. 2017)

In the specific context of China, Bernal’s book (Bernal 1939) has been the subject of great interest since its publication. In particular, the part mentioning China and highlighting the limits of the development of modern science in that country quickly attracted the attention of leading Chinese scientists such as Zhu Kezhen (1890–1974, President of Zhejiang University), Wu Xuezhou (1902–1983, Director of the Chinese Chemistry Institution) and Ren Hongjun (1886–1961, one of the founders of the China Science Society) (Qian and Li 2012) and the comments made there therefore quickly spread throughout China.

The formal birth of Science of Science in China, however, came from Tsien Hsueshen’s initiative in the document entitled “Science and Technology” in 1977, which encouraged the creation of a new research space in China called “Science of Science” (Tsien 1979). Following on from Bernal’s thinking, Tsien stressed that the field of Science of Science should be part of the social sciences (Tsien 1979, 1980). So far, three specialized research institutes have focused their research on Science of Science, namely those in Beijing, Tianjin and Shanghai. In addition, non-specialized institutes based in Beijing, such as CAST (China Association for Science and Technology), CASTED (The Chinese Academy of Science and Technology for Development of the Ministry of Science and Technology of the People’s Republic of China), the Institute of Science and Development of the Chinese Academy of Sciences and the Chinese Institute of Engineering Development Strategies, as well as many colleges and universities throughout China, have also invested significant resources in basic and applied research into Science of Science.

In 2010, on the occasion of the 30th anniversary of the Chinese journal Science of Science and S&T Management, Liu Zeyuan highlighted the borders and main fields of Science of Science in China by mapping the literature in this field with the CiteSpace tool (Footnote 2) (Liu 2017). Two major blocks of knowledge, corresponding to two complementary paths of development of Science of Science in China, were thus isolated: scientometrics, focusing on quantitative analysis, and scientific studies, focusing on philosophical analysis.

In this article, we exploit the research material of the last 40 years in Science of Science in China and introduce a new method to understand and monitor the development of this field more clearly and more accurately. Our objective is therefore to provide relevant indications on the origin of Chinese Science of Science, its structure and its future directions through an original data analysis method that operates in a completely unsupervised manner, without any parameters and without any external knowledge source.

The section "Data collection and preprocessing" presents the data collection and data preprocessing steps. The section "Feature maximization as a global approach for data analysis" describes the feature maximization metric and its associated feature selection process, which is also illustrated with a simple example in that section. Section "Data analysis process" presents our experimental protocol as well as the principle of our original visualization method based on contrast graphs. Section "Data analysis and visualization results" is dedicated to the analysis of our experimental results by 3 experts (Footnote 3) of the domain. Section "Comparison with LDA" provides an additional comparison of the topic extraction capabilities of the method with the well-known LDA approach. Lastly, section "Conclusions and discussion" presents our conclusions and perspectives.

Data collection and preprocessing

Given the relatively blurred borders and vast scope of Science of Science (cf. section "Introduction"), it is not easy to make a complete and accurate extraction of literature relating to this domain. For this reason, in this paper we chose to focus on the evolution of the core content of the domain rather than trying to be exhaustive.

We queried the China National Knowledge Infrastructure (CNKI) database using “Science of Science” as the thematic term (Footnote 4). We extracted 2401 articles belonging to journals referenced in the Chinese core journals list of Beijing University and in CSSCI (Footnote 5) (covering a research period until 2017-10-22). Data cleaning was carried out in a second phase to remove items that did not correspond to research documents (e.g. meeting notices, journal presentations or editorials). After this phase we retained 1334 journal articles. We then used these to retrieve 2677 cited articles (after removing duplicates), 1539 of which were published in specialized journals. We added these 1539 documents to our 1334 core documents to form our experimental data set of a total of 2873 research articles.

We first looked for an indirect way to validate our data collection process. To do this, we tracked the distribution of articles according to their publication dates. The trend we observed (Fig. 2) agrees closely with Liu’s recent observations based on experience of the domain (Liu 2017), which show that Science of Science research in China has gone through three stages: (1) a period of rapid growth (1977–1991), (2) a period of rather difficult development (1992–2003) and (3) a period of rejuvenation (2004–2017).

Fig. 2 Distribution of the number of papers per year in the experimental dataset and correspondence to historical periods for Science of Science in China

In a second phase, the titles, abstracts and keywords of the 2790 articles were extracted (Footnote 6).

The indexing process was quite complex. It started with an initial dictionary of 9679 keywords gathered from the keyword field of the 2790 articles. We used NLPIR-ICTCLAS (Footnote 7), a specific toolbox for Chinese language processing, for word segmentation and tagging of titles and abstracts. Due to the particularities of Science of Science, the software cannot accurately segment certain domain terms expressed by complex multi-character words, such as “科学学”: “Science of Science”, “科学学研究”: “Science Research”, “科学逻辑学”: “Science Logistics”, “科学的社会功能”: “Science Social Function”. We therefore reconstructed these words in a later step.

Among the extracted words, we then used an ad-hoc Python programme to filter the elements labelled as nouns and deleted quantities (numbers, dates, etc.). Then, we carried out a second phase of term cleaning to remove empty and meaningless terms or those covering the whole context of the dataset (e.g. “search”, “analysis”, “year”) and also to merge terms with similar meanings (e.g. “著者分布” and “作者分布”: distribution of authors, “作者合作网” and “作者合作网络”: network of co-authors, “知识图谱” and “知识图谱分析”: knowledge mapping). Once these words were merged with the initial keyword dictionary, this resulted in a dictionary of 13,442 Chinese terms.

The dictionary of terms was then translated into English. Due to a poorer vocabulary in English than in Chinese, the translation was likely to generate new equivalent words (such as “知识地图”: “knowledge geography” and “可视化底图”: “basic visualization map”, “科技评估”: “S&T evaluation” and “科研评价”: “research evaluation”) that needed to be merged again. After this process, we obtained a dictionary of 11,931 English terms. Category labels (resp. “name”, “city”, “country”) were finally attached to the terms representing the corresponding entities (resp. person, place and country).

To eliminate the remaining noise, we applied an additional cleaning pass detailed in Table 1. Firstly, we merged the remaining equivalent words into a single entry (for example, an author may appear with or without his or her first name, such as “Merton” and “R.K. Merton”, and an institution may appear with its acronym or in an expanded form, such as “NSF” and “National Science Foundation”). Secondly, we deleted words or expressions whose meaning was unclear in English and corrected some translation errors. This last process led us to remove 360 entries from the glossary (235 merged entries and 125 deleted entries). A frequency threshold of 6 (Footnote 8) was finally applied to remove low-frequency words. The result was a final dictionary of 1576 terms, which were used to re-index the articles.
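The merging and thresholding steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `clean_lexicon`, the merge table and the stop list are hypothetical, and we assume that the threshold is applied to document frequency and that terms occurring in fewer than 6 documents are dropped, which the text does not specify.

```python
from collections import Counter

def clean_lexicon(doc_terms, merge_map, stop_terms, min_freq=6):
    """Merge equivalent terms, drop unwanted ones, apply a frequency threshold.

    doc_terms  : list of term lists, one per document
    merge_map  : hypothetical table mapping a variant to its canonical form
    stop_terms : hypothetical set of terms to delete
    min_freq   : assumed minimum document frequency for a term to be kept
    """
    # Replace each variant by its canonical form, then drop deleted terms.
    normalized = [
        [m for m in (merge_map.get(t, t) for t in terms) if m not in stop_terms]
        for terms in doc_terms
    ]
    # Count the number of documents in which each canonical term occurs.
    doc_freq = Counter(t for terms in normalized for t in set(terms))
    vocabulary = {t for t, n in doc_freq.items() if n >= min_freq}
    # Re-index every document on the retained vocabulary.
    reindexed = [sorted(set(terms) & vocabulary) for terms in normalized]
    return vocabulary, reindexed

# Illustrative documents only (not taken from the dataset).
docs = (
    [["Merton", "NSF"]] * 3
    + [["R.K. Merton", "National Science Foundation", "year"]] * 3
    + [["rare term"]]
)
merges = {"R.K. Merton": "Merton", "National Science Foundation": "NSF"}
vocabulary, reindexed = clean_lexicon(docs, merges, stop_terms={"year"})
```

Merging before counting matters here: each spelling taken alone would fall below the threshold, whereas the merged entry passes it.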

Table 1 Summary of the lexicon processing steps

Feature maximization as a global approach for data analysis

Most of our further analysis of the Science of Science dataset is based on a feature selection approach that itself relies on the feature maximization metric (Lamirel et al. 2011). We shall first present this important metric before presenting the whole data analysis process. Feature maximization is an unbiased metric which can be used to estimate the quality of a classification, whether it is supervised or not. In unsupervised classification (i.e. clustering), this measure exploits the properties (i.e. the features) of the data associated with clusters for different purposes (cluster labelling and cluster content highlighting, overall display of clustering results as in the contrast graph presented in this paper, optimal clustering model detection). Its main advantages are that it is parameter-free and totally independent of the clustering method and its operating mode, that it works well in high-dimensional spaces, and that it represents a better compromise between discrimination and generalization than the usual metrics (Euclidean, Cosine or Chi square, etc.).

Feature F-measure

Let us consider a partition C which results from a clustering method (Footnote 9) applied to a dataset D represented by a group of features F. The feature F-measure \(FF_{c} \left( f \right)\) of a feature f associated with a cluster c is defined as the harmonic mean of the Feature Recall \(FR_{c} \left( f \right)\) and the Feature Predominance \(FP_{c} \left( f \right)\), which are themselves defined as follows:

$$FR_{c} \left( f \right) = \frac{{\varSigma_{d \in c} W_{d}^{f} }}{{\varSigma_{c \in C} \varSigma_{d \in c} W_{d}^{f} }}$$
(1)
$$FP_{c} \left( f \right) = \frac{{\varSigma_{d \in c} W_{d}^{f} }}{{\varSigma_{{f^{{\prime }} \in F_{c} ,d \in c}} W_{d}^{{f^{{\prime }} }} }}$$
(2)

with

$$FF_{c} \left( f \right) = 2\left( {\frac{{FR_{c} \left( f \right) \times FP_{c} \left( f \right)}}{{FR_{c} \left( f \right) + FP_{c} \left( f \right)}}} \right)$$
(3)

where \(W_{d}^{f}\) represents the weight of the feature f for the data d and \(F_{c}\) represents all the features present in the data associated with the cluster c. Feature Predominance measures the ability of f to describe cluster c. In a complementary way, Feature Recall characterizes f according to its ability to discriminate c from other clusters.
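As a minimal sketch of Eqs. 1 to 3, the three measures can be computed from a data-by-feature weight matrix with NumPy. The function name `feature_f_measure`, the array layout and the toy weights are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def feature_f_measure(W, labels, n_clusters):
    """Return (FR, FP, FF), each of shape (n_clusters, n_features).

    W      : non-negative weights, one row per data item, one column per feature
    labels : cluster index of each data item (values in 0..n_clusters-1)
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    # sums[c, f] = sum of the weights of feature f over the data of cluster c
    sums = np.vstack([W[labels == c].sum(axis=0) for c in range(n_clusters)])
    feature_totals = sums.sum(axis=0, keepdims=True)  # denominator of Eq. 1
    cluster_totals = sums.sum(axis=1, keepdims=True)  # denominator of Eq. 2
    with np.errstate(divide="ignore", invalid="ignore"):
        FR = np.where(feature_totals > 0, sums / feature_totals, 0.0)
        FP = np.where(cluster_totals > 0, sums / cluster_totals, 0.0)
        # Eq. 3: harmonic mean, set to 0 when the feature is absent
        FF = np.where(FR + FP > 0, 2 * FR * FP / (FR + FP), 0.0)
    return FR, FP, FF

# Hypothetical toy weights: six data items, three features, two clusters.
W = [[9, 5, 5], [9, 10, 5], [9, 20, 6], [5, 15, 5], [6, 25, 6], [5, 25, 5]]
labels = [0, 0, 0, 1, 1, 1]
FR, FP, FF = feature_f_measure(W, labels, n_clusters=2)
print(np.round(FF, 3))
```

A convenient sanity check is that, for a given cluster and feature, the harmonic mean simplifies to twice the in-cluster weight of the feature divided by the sum of the feature's total weight and the cluster's total weight.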

Feature Recall is a scale-independent measure but Feature Predominance is not. We have however shown through experiments (Lamirel et al. 2015) that the F-measure, which combines these two measures, is only slightly influenced by feature scaling. Nevertheless, data must be standardized to guarantee fully scale-independent behaviour for this measure. Furthermore, the choice of the weighting scheme for data is not really limited by the approach, but the weights must be positive. The scheme therefore needs to capture the semantic significance and importance of the feature for the data (Footnote 10).

Feature maximization

In a supervised context, feature maximization measurement is capable of generating a powerful feature selection process. In our unsupervised (clustering) context, the selection process can be used to describe or label clusters according to the most typical and representative features. This process is parameter-free and uses the capacity of the Feature F-measure to discriminate between clusters (\(FR_{c} \left( f \right)\) index) and also its ability to faithfully represent cluster data (\(FP_{c} \left( f \right)\) index). The set \(S_{c}\) of features that are characteristic of a given cluster c belonging to a partition C is defined as:

$$S_{c} = \left\{ {f \in F_{c} | {FF_{c} \left( f \right)} > \overline{FF} \left( f \right) \; {\text{and}}\; FF_{c} \left( f \right) > \overline{FF}_{D} } \right\}$$
(4)

with

$$\overline{FF} \left( f \right) = \varSigma_{c^{\prime} \in C} \frac{FF_{c^{\prime}} \left( f \right)}{\left| C_{/f} \right|}\; {\text{and}}\; \overline{FF}_{D} = \varSigma_{f \in F} \frac{\overline{FF} \left( f \right)}{\left| F \right|}$$
(5)

where \(C_{/f}\) represents the subset of C in which the feature f occurs.

Finally, the set of all selected features \(S_{C}\) is the subset of F defined by:

$$S_{C} = \cup_{c \in C} S_{c}$$
(6)

In other words, the features judged relevant for a given cluster are those whose representations are (1) better in this cluster than their average representation in all the clusters and (2) better than the average representation of all the features in the partition in terms of Feature F-measure. Features which never respect the second condition in any cluster are discarded. This latter operation corresponds to a feature selection process.
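The selection rule of Eqs. 4 to 6 can be sketched as below, starting from a matrix of Feature F-measures. The function name `select_features` and the numerical values are hypothetical, and we assume that a feature "occurs" in a cluster when its F-measure there is strictly positive.

```python
import numpy as np

def select_features(FF):
    """Apply Eqs. 4-6 to a (n_clusters, n_features) matrix of F-measures."""
    FF = np.asarray(FF, dtype=float)
    occurs = FF > 0                      # assumed meaning of "f occurs in c"
    n_occ = occurs.sum(axis=0)           # |C_/f| for each feature
    # Eq. 5, left: average F-measure of f over the clusters where it occurs
    ff_bar = np.divide(FF.sum(axis=0), n_occ,
                       out=np.zeros(FF.shape[1]), where=n_occ > 0)
    # Eq. 5, right: overall average over all features
    ff_bar_D = ff_bar.mean()
    # Eq. 4: both conditions must hold for f to belong to S_c
    selected = occurs & (FF > ff_bar) & (FF > ff_bar_D)
    return selected, ff_bar, ff_bar_D

# Hypothetical F-measures for two clusters (rows) and three features (columns).
FF = np.array([[0.446, 0.393, 0.291],
               [0.229, 0.660, 0.248]])
selected, ff_bar, ff_bar_D = select_features(FF)
kept = selected.any(axis=0)              # Eq. 6: union of the S_c
print(selected)
print(kept)
```

With these values the third feature exceeds its own average in the first cluster but never exceeds the overall average, so it is discarded by the second condition.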

Contrast

A specific concept of contrast \(G_{c} \left( f \right)\) can be defined to measure the performance of a retained feature f for a given cluster c. It is an indicator value given by the ratio between the F-measure \(FF_{c} \left( f \right)\) of a feature in the cluster c and the average F-measure \(\overline{FF} \left( f \right)\) of this feature for the whole partition. The contrast of a feature f for a cluster c is expressed as:

$$G_{c} \left( f \right) = FF_{c} \left( f \right)/\overline{FF} \left( f \right)$$
(7)

The active features of a cluster are those for which the contrast is greater than 1. Moreover, the higher the contrast of a feature for one cluster, the better its performance in describing the cluster content.

Conversely, the passive features in a cluster are the selected features present in the cluster’s data for which the contrast is less than 1. By construction of the method, such selected features necessarily have a contrast greater than 1 in one or several other cluster(s) (see Eq. 7 for details). A simple way to exploit the features obtained is to use active selected features and their associated contrast for cluster labelling, as we proposed in Lamirel et al. (2015). We also use this idea in the experimental context discussed in this paper.
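The contrast of Eq. 7 and the resulting split between active and passive features can be illustrated with a short sketch. The helper names `contrast` and `active_passive`, the F-measure values and the vector of selected features are all hypothetical.

```python
import numpy as np

def contrast(FF):
    """Eq. 7: F-measure of each feature divided by its average over clusters."""
    FF = np.asarray(FF, dtype=float)
    n_occ = (FF > 0).sum(axis=0)
    # Average over the clusters in which the feature occurs (1 if it never occurs,
    # which leaves a contrast of 0 for that feature).
    ff_bar = np.divide(FF.sum(axis=0), n_occ,
                       out=np.ones(FF.shape[1]), where=n_occ > 0)
    return FF / ff_bar

def active_passive(FF, kept):
    """Return (G, active, passive) restricted to the selected features `kept`."""
    FF = np.asarray(FF, dtype=float)
    kept = np.asarray(kept, dtype=bool)
    G = contrast(FF)
    active = kept & (G > 1)                 # contrast greater than 1
    passive = kept & (FF > 0) & (G < 1)     # present, contrast less than 1
    return G, active, passive

# Hypothetical F-measures; the third feature is assumed not to be selected.
FF = np.array([[0.446, 0.393, 0.291],
               [0.229, 0.660, 0.248]])
G, active, passive = active_passive(FF, kept=[True, True, False])
print(np.round(G, 2))
```

In this example each selected feature is active in one cluster and passive in the other, which is the situation described in the text.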

Illustrative example

Below we illustrate how the method operates on a toy dataset encompassing two classes (Footnote 11), Men (M) and Women (F), described with 3 features: Nose_Size, Hair_Length, Shoe_Size. Figure 3 shows the source data and how the F-measure of the Shoe_Size feature is calculated in the Men class.

Fig. 3 Principle of feature F-measure computation for sample data

As shown in Fig. 4, the second step consists of calculating the average F-measure of each feature over the classes, and the overall average F-measure for the combination of all features and all classes. In this figure, notation \(\overline{{F\left( {.,.} \right)}}\) stands for the overall average \(\overline{FF}_{D}\) presented in (Eq. 5) and notation \(\overline{{F\left( {x,.} \right)}}\) stands for average of class x which is itself computed as:

$$\overline{{F\left( {x,.} \right)}} = \varSigma_{{f \in S_{x} }} \frac{{FF_{x} \left( f \right)}}{{\left| {S_{x} } \right|}}$$

Fig. 4 Principle of computation of the overall feature F-measure average and elimination of irrelevant features—here ‘Nose Size’

Features whose F-measures are systematically lower than the overall average are eliminated, and thus the Nose_Size feature is removed. The remaining features (i.e. selected features) are considered active in the classes in which their F-measure is above the marginal average:

  1. Shoe_Size is active in the Men class,

  2. Hair_Length is active in the Women class.

The contrast ratio highlights the degree of activity and passivity of selected features as regards their F-measure marginal average in different classes. Figure 5 illustrates how the contrast is calculated for the example presented.

Fig. 5 The contrast computation principle for selected features

In the context of this example, the contrast may be considered to be a function that will have the following effects in virtual terms:

  1. Increase the length of women’s hair,

  2. Increase the size of the men’s shoes,

  3. Decrease the length of the men’s hair,

  4. Reduce the size of women’s shoes.

Data analysis process

The overall architecture of our experimental process is presented in Fig. 6. After the preprocessing steps, the process uses clustering in combination with feature maximization to extract the main research topics from the Science of Science dataset under study. We will show later in this paper that the combination of a suitable clustering approach, such as neural clustering based on growing neural gas (GNG) (Fritzke 1995), with feature maximization outperforms alternative topic extraction approaches such as LDA (Blei et al. 2003). This is only the case as long as an optimal clustering model (i.e. a suitable number of clusters) can be properly identified from the analyzed data. We thus propose to exploit one of our recent and efficient approaches, also based on feature maximization, for the optimal model detection task (Lamirel et al. 2016). Processing the clustering results with a graph approach based on contrast is an original method presented in this paper. It reduces the cognitive overload resulting from the representation of interactions in large datasets and makes it possible to correctly capture the dependencies between extracted topics through shared high-contrast features. The last part of our approach exploits external labels of the data associated with clusters. Firstly, publication dates are used to perform a diachronic analysis of the activity of clusters (i.e. topics) and secondly, author information is used to highlight the most influential authors in the different topics. Dates and author information are also reported on the contrast graph. Details of the different steps of the approach are given in the next sections.

Fig. 6 Overall data analysis process

Clustering and optimal model detection

We use two well-known clustering methods, namely k-means (MacQueen 1967), a winner-takes-all method, and GNG (Fritzke 1995), a winner-takes-most method with Hebbian learning. In the following experiments, we report on the method that produced the best results. In all cases, the GNG method proved to be superior to the k-means method because it relies on a Hebbian, incremental and winner-takes-most learning process, which provides better independence from initial conditions and outliers and avoids producing degenerate clustering results. These kinds of results have also been observed in many of our previous experiments (Lamirel et al. 2011).

The selection of the optimal model relies on the feature maximization metrics presented in the previous section. Our previous experiments on reference datasets show that most of the usual quality estimators (Footnote 12) do not produce satisfactory results in a realistic data context. They were also found to be sensitive to noise and to perform poorly with high-dimensional data (Kassab and Lamirel 2008). A more accurate method is thus to exploit feature maximization, and more especially information related to the activity and passivity of selected features in clusters, to define clustering quality indexes identifying an optimal partition. This kind of partition is expected to maximize the contrast described by Eq. 7. Indeed, the more contrasted the clusters’ features are, the more these clusters are found to be compact or separated. Hence, this approach leads to the definition of three different quality indexes: PC (Positive Contrast), EC (Extended Contrast), and CB (ComBined contrast).

Below we give the expressions of the PC and EC indexes. The CB index represents a weighted combination of the other two. A more precise description of this approach, as well as experiments comparing our indexes with other indexes on real-life data ranging from simple to complex, can be found in Lamirel et al. (2016).

The principle of the PC index corresponds by analogy to that of intra-cluster inertia in the usual models. It is a macro-measure based on the maximization of the average weighted contrast of active features for the optimal partition. For a partition comprising k clusters, it can be expressed as:

$$PC = \mathop {\arg \;\max}\limits_{k} \left( {\frac{1}{k}\mathop \sum \limits_{i = 1}^{k} \frac{1}{{\left| {s_{i}} \right|}}\mathop \sum \limits_{{f \in s_{i}}} G_{i} \left( f \right)} \right)$$
(8)

The EC index’s principle corresponds by analogy to that of the combination between intra-cluster inertia and inter-cluster inertia in the usual models. This index is based on the maximization of the average weighted compromise between the contrast of active features and the inverted contrast of passive features for optimal partition:

$$EC = \mathop {\arg \;\max}\limits_{k} \left[ {\frac{1}{k}\mathop \sum \limits_{i = 1}^{k} \left( {\frac{{\left| {s_{i}} \right|\mathop \sum \nolimits_{{f \in s_{i}}} G_{i} \left( f \right) + \left| {\overline{{s_{i}}}} \right|\mathop \sum \nolimits_{{h \in \overline{{s_{i}}}}} \frac{1}{{G_{i} \left( h \right)}}}}{{\left| {s_{i}} \right| + \left| {\overline{{s_{i}}}} \right|}}} \right)} \right]$$
(9)

where \(n_{i}\) is the number of data associated with the cluster i, \(\left| {s_{i} } \right|\) represents the number of active features in i, and \(\left| {\overline{{s_{i} }} } \right|\), the number of passive features in the same cluster.
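The quantities maximized in Eqs. 8 and 9 can be sketched for a single partition as follows. The function names are hypothetical, the contrast values are illustrative, and we assume that a cluster without any active (or active and passive) feature contributes 0 to the average, a case the equations leave open.

```python
import numpy as np

def pc_value(G, active):
    """Bracketed term of Eq. 8 for one partition."""
    per_cluster = [
        G[i, active[i]].mean() if active[i].any() else 0.0  # assumed fallback
        for i in range(len(G))
    ]
    return float(np.mean(per_cluster))

def ec_value(G, active, passive):
    """Bracketed term of Eq. 9 for one partition."""
    per_cluster = []
    for i in range(len(G)):
        n_a, n_p = int(active[i].sum()), int(passive[i].sum())
        if n_a + n_p == 0:
            per_cluster.append(0.0)                         # assumed fallback
            continue
        numerator = (n_a * G[i, active[i]].sum()
                     + n_p * (1.0 / G[i, passive[i]]).sum())
        per_cluster.append(numerator / (n_a + n_p))
    return float(np.mean(per_cluster))

def best_model(scores):
    """Pick the number of clusters k with the highest index value."""
    return max(scores, key=scores.get)

# Hypothetical contrasts for a partition with two clusters and three features.
G = np.array([[1.32, 0.75, 0.0], [0.68, 1.25, 0.0]])
active = np.array([[True, False, False], [False, True, False]])
passive = np.array([[False, True, False], [True, False, False]])
pc = pc_value(G, active)
ec = ec_value(G, active, passive)
print(pc, ec)
```

In practice one such value would be computed for each candidate number of clusters, and a helper like `best_model` would then return the k that maximizes the chosen index or combination of indexes.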

In our experiment, we vary the number of clusters in a range of up to 1/50 of the number of data. We reject size 1 models for two main reasons—firstly the quality indexes are not intended to produce results in this case and secondly these models correspond to a trivial clustering operation.

We carry out crisp clustering. In this case each data is re-assigned to a single cluster after the clustering process and the usual form of the assignment function af which associates a data d to a given cluster is as follows:

$$af\left( d \right) = \mathop {\arg \;\min }\limits_{k} \left( {Dist\left( {\vec{k},\vec{d}} \right)} \right)$$
(10)

where Dist represents a distance function (generally Euclidean distance is used), \(\vec{k}\) represents the description vector of cluster k and \(\vec{d}\) represents the description vector of document d.
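The assignment function of Eq. 10 amounts to a nearest-prototype rule. A minimal sketch with the Euclidean distance is given below; the function name `assign` and the sample vectors are illustrative assumptions.

```python
import numpy as np

def assign(data, prototypes):
    """Eq. 10: index of the closest cluster description vector for each item."""
    data = np.asarray(data, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    # dists[n, k] = Euclidean distance between item n and prototype k
    dists = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Hypothetical two-dimensional description vectors.
prototypes = [[0.0, 0.0], [10.0, 10.0]]
data = [[1.0, 1.0], [9.0, 8.0], [0.0, 2.0]]
assignments = assign(data, prototypes)
print(assignments)
```

Because each item receives exactly one index, the result is a crisp partition even when the clustering method itself updates several prototypes per input, as winner-takes-most methods do.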

By using the clusters’ associated data, we select the model that optimizes the PC + EC combination of the above-mentioned indexes. This technique makes it possible to obtain the relevant number of clusters highlighting the main Science of Science research topics during the period under study. Expert analysis of the obtained results confirms that the clustering model chosen as optimal by our approach accurately represents all the main research topics in the Science of Science field. Figure 7 presents trends in the evolution of the PC and PC + EC indexes and the optimal point (i.e. the optimal number of clusters or the optimal model), which is found for the model with 13 clusters. Figure 8 presents a description of a cluster based on its most contrasted features and Table 2 presents the list of cluster titles that the expert assigned by exploiting the most contrasted elements.

Fig. 7 Clustering quality evaluation (trends of PC and PC + EC indexes) and optimal model (13 clusters) highlighting. Index values have been rescaled for better visualization

Fig. 8 Example of a description of a cluster through the list of its most contrasted features (here terms). The cluster’s related topic is knowledge mapping

Table 2 List and summary description of the clusters obtained

Contrast graphs

In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V. Contrast graphs are bipartite graphs based on the relations between a set of features S and a set of labels L (Cuxac and Lamirel 2013). Theoretically, the set of labels L could represent any kind of information to which features can be related, and the set of features S is a subset of an overall feature set F (i.e. the original feature space on which a dataset’s data relies) obtained through a feature selection process, such as feature maximization as presented above. In the case of feature maximization, the weight \(c_{{\left( {u,v} \right)}}\) of an edge \(\left( {u,v} \right),\;u \in S, \;v \in L\) represents the contrast of feature u for a label v, as defined by Eq. 7 (Footnote 13).

These kinds of graphs have many interesting properties. Firstly, they reduce the cognitive overload produced by classical graph representations because the associated feature selection process reduces the number of potential connections. Secondly, they can be used to indirectly highlight relationships between labels whenever features have contrasted interactions with several labels. Thirdly, the combination of this approach with a weighted force-directed model (Kobourov 2012) for graph representation highlights the central or most influential labels of the L set and makes it easy to identify the most densely connected labels through their associated features, as these labels appear close together on the graph.
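The construction of a contrast graph can be sketched with plain Python data structures: edges link a feature to a label whenever the contrast exceeds a threshold, and features linked to several labels reveal indirect relationships between those labels. The function names, the threshold parameter `min_contrast`, the feature names and the contrast values are hypothetical; no layout step is included.

```python
def contrast_graph(G, feature_names, label_names, min_contrast=1.0):
    """Return the weighted edges (feature, label, contrast) of a bipartite graph.

    G[c][f] is the contrast of feature f for label (cluster) c.
    """
    return [
        (feature_names[f], label_names[c], G[c][f])
        for c in range(len(label_names))
        for f in range(len(feature_names))
        if G[c][f] > min_contrast
    ]

def shared_features(edges):
    """Features connected to more than one label, with the labels they bridge."""
    by_feature = {}
    for feature, label, _ in edges:
        by_feature.setdefault(feature, set()).add(label)
    return {f: labels for f, labels in by_feature.items() if len(labels) > 1}

# Hypothetical contrasts for three topics (rows) and three terms (columns).
G = [[1.8, 1.2, 0.4],
     [0.3, 1.5, 0.9],
     [0.2, 0.1, 2.1]]
features = ["citation", "network", "policy"]
labels = ["T1", "T2", "T3"]
edges = contrast_graph(G, features, labels)
bridges = shared_features(edges)
print(edges)
print(bridges)
```

Raising `min_contrast` prunes weaker edges and therefore trades completeness for readability.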

We proposed a first original use of contrast graphs for the analysis of the transdisciplinarity between different research domains and time periods in Cuxac and Lamirel (2013). Figure 9 shows a resulting contrast graph where features represent the authors of research papers and labels represent a combination of time period and research domain. The authors who connect domains and time periods clearly appear on the graph represented in the figure. They can be considered as knowledge transmitters, which clearly highlights their major social and scientific role.

Fig. 9 An example of a contrast graph materializing the relationships between authors and scientific domains associated with time periods. The red circles highlight authors who are knowledge transmitters between domains and periods

Exploitation of complementary information through external labels

As Attik et al. (2006) pointed out, external labels are information which is associated with data but does not play any role in the initial data analysis process. However, this information could include important clues to help enhance the precision of the analysis. In the case of the clustering process presented above, external labels can be exploited in a secondary step (i.e. after the clustering process) by evaluating their posterior distribution into clusters through the clusters’ associated data, in order to provide complementary information about these data or the related topics.

In the case of our Science of Science dataset we focus on two kinds of external labels, namely papers’ publication dates and papers’ authors. Publication dates are processed to carry out a diachronic analysis of topic activities, highlighting the importance or activity of each topic in each time period, with the activity considered either individually or relative to the other topics. As shown in the next section on the analysis of the results, this approach helps to precisely understand the chronology of the research activity in an overall research field, such as Science of Science in our specific case. Information on papers’ authors can be processed to highlight the most important contributors who drive or influence a main area of research (i.e. a topic). Such authors may even be considered to be central contributors if they led, influenced or coordinated several areas of research at the same time (Footnote 14).

In the context of our experiment, our external label analysis is based on two different measures—label frequency and label prevalence. Label frequency \(F_{c}^{l}\) of a label l of a type t in a cluster c can be defined as:

$$F_{c}^{l} = Card\{ d \in D | af\left( d \right) = c \wedge l \in Extlab_{t} \left( d \right)\}$$
(11)

where Card is the set cardinal function, D is the whole set of exploited data, af the function defined at Eq. 10 (that provides the cluster associated to data d) and \(Extlab_{t} \left( d \right)\) a function that provides the list of external labels of type t associated to data d.

Label prevalence is a cluster-based measure. A label l is prevalent in a cluster c if:

$${\nexists } c^{\prime } \in C, c^{\prime } \ne c,\;F_{{c^{\prime } }}^{l} > F_{c}^{l} \wedge {\nexists } l^{\prime } \in L_{c} , l^{\prime } \ne l , F_{c}^{{l^{\prime } }} > F_{c}^{l}$$
(12)

where \(L_{c}\) is the set of labels occurring in cluster c through its associated data.

Prevalence is used to highlight the primary influence of a label. It follows from this definition that a label can be prevalent in only one cluster and that some clusters may have no prevalent label.
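The prevalence test of Eq. 12 can be sketched as two checks per label: no other cluster holds the label more often, and no other label is more frequent in the same cluster. The code below is a hedged illustration only; the function name `prevalent_labels` and the toy frequency table are hypothetical, and ties are kept because Eq. 12 uses strict inequalities.

```python
def prevalent_labels(freq):
    """Return, for each cluster, the list of labels that satisfy Eq. 12.

    freq: dict mapping cluster c to a dict {label: F_c^l} (assumed precomputed).
    """
    result = {}
    for c, counts in freq.items():
        result[c] = []
        for label, f in counts.items():
            if f == 0:
                continue  # a label absent from the cluster cannot be prevalent
            # Second part of Eq. 12: another label is more frequent in cluster c.
            beaten_in_cluster = any(
                f2 > f for l2, f2 in counts.items() if l2 != label
            )
            # First part of Eq. 12: the label is more frequent in another cluster.
            beaten_elsewhere = any(
                freq[c2].get(label, 0) > f for c2 in freq if c2 != c
            )
            if not beaten_in_cluster and not beaten_elsewhere:
                result[c].append(label)
    return result


# Hypothetical label frequencies for three clusters.
toy_freq = {
    0: {"Liu": 2, "Chen": 1},
    1: {"Chen": 2, "Zhao": 1},
    2: {"Chen": 1, "Liu": 1},
}

prevalence = prevalent_labels(toy_freq)
print(prevalence)
```

In this toy example, cluster 2 has no prevalent label because each of its labels is more frequent in another cluster, which illustrates the second consequence stated above.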

Data analysis and visualization results

General topic structure of the Science of Science domain

In the specific case of our experiment on Science of Science data, we propose to build a contrast graph between a set of clusters (set L) representing the main research topics in a given field extracted by the clustering process and the most contrasted features (set S) issued from the cluster descriptions (see Fig. 8). This approach combines clustering and contrast graphs in an original way and is particularly useful for Science expertsFootnote 15 to understand the construction of their domain, highlighting the most central topics in the domain (domain generators) as well as those which are the most connected. In the resulting graph, only the edges with a contrast greater than 1.4 are retained for the representation (1074 of the 1576 terms used for clustering are kept). Figure 7 represents the entire graph and Fig. 8 represents a sub-section highlighting closely related domains that share many characteristics.
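The edge selection step can be illustrated as follows. This is a minimal sketch assuming that the contrast of each term in each cluster has already been computed by the feature maximization process; the function names, cluster identifiers, terms and contrast values are hypothetical, and only the threshold of 1.4 comes from the text.

```python
THRESHOLD = 1.4  # contrast threshold quoted in the text


def build_contrast_graph(contrast, threshold=THRESHOLD):
    """Return the (cluster, term, weight) edges whose contrast exceeds the threshold.

    contrast: dict mapping each cluster to a dict {term: contrast value}
              (assumed to be precomputed).
    """
    return [
        (cluster, term, weight)
        for cluster, terms in contrast.items()
        for term, weight in terms.items()
        if weight > threshold
    ]


def shared_terms(edges):
    """Return the terms linked to more than one cluster, with those clusters."""
    clusters_by_term = {}
    for cluster, term, _ in edges:
        clusters_by_term.setdefault(term, set()).add(cluster)
    return {t: cs for t, cs in clusters_by_term.items() if len(cs) > 1}


# Hypothetical contrast values for three clusters.
contrast = {
    "A": {"citation": 2.3, "indicator": 1.9, "policy": 0.8},
    "B": {"citation": 1.6, "visualization": 2.7},
    "C": {"policy": 2.1, "indicator": 1.2},
}

edges = build_contrast_graph(contrast)
retained_terms = {term for _, term, _ in edges}
bridges = shared_terms(edges)
print(len(edges), sorted(retained_terms), bridges)
```

Terms that survive the threshold in several clusters act as bridges between topics, which is what pulls closely related topics together in a force-directed layout of the graph.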

The spatial distribution of the 13 topics is shown in Fig. 10 (a zoomed extract is also presented in Fig. 10b). According to all three of our experts, this graph highlights a very clearly interpretable structure of the Science of Science field in China. In such a model, highly interconnected topics tend to appear at the centre of the representation (see section "Contrast graphs"). In our case, this information on the core domains is represented by two complementary topics: “8# Subject attributes of the Science of Science domain” and “7# Research policy and impacts on society”.

Fig. 10

a Global contrast graph representing the main topics and domain structure of Science of Science in China (cluster 9# is highlighted; detailed information on the clusters is shown in Table 2). b Zoom on a specific area of the global graph

The explanation given by the experts is as follows:

  1. Science of Science is a reflective field and must therefore study its own development along with the models of science themselves (topic 8#).

  2. Science of Science is an applied science that guides practice. Therefore, it must be practice-oriented, study the social function of science and serve research policy (topic 7#).

Three main fields of research appear around the basic information corresponding to usual scientific activities namely “A. Scientific knowledge system”, “B. System of Practice of Science” and “C. Support of scientific activity”. These form the complete logical structure of Science of Science from the points of view of cognition, application and institutional structures respectively.

The field “A. Scientific Knowledge System” is associated with the five related topics “10# History on Science of Science”, “4# Domain structure and peripheral disciplines on Science of Science”, “6# Discipline System”, “0# Quantitative Analysis on science” and “9# Mapping of knowledge on science”. The historical evidence (10#) served as a starting point for the first studies of the structure of Science of Science and peripheral disciplines and for the further construction of the field (4#). Finally, this evidence helped develop quantitative approaches to scientific research (0#). The recent emergence of new data management techniques and software technologies has led to the development of elaborate visualization approaches (9#). Although the discipline system field (6#) is obviously related to the structure of the field (4#), it is a more independent research field related to the study of research methodology, comparative approaches in research and security sciences.

The field “B. System of scientific practice” is composed of four main related topics: “1# Research evaluation”, “3# Innovation management”, “2# Educational sciences and talent culture” and “5# Philosophical foundations of the Science of Science domain”. Today, innovation strategy is one of the Chinese government’s major concerns because it aims to link science and technology to the economy and value-added applications for the development of Chinese society. The role of science and technology in social practice is therefore logically materialized by innovation management (3#) and this field has indeed developed considerably in recent years (see also Fig. 11). Research evaluation (1#) (analysis of scientific inputs and outputs, scientific decision-making) and encouraging education and talent (2#) play complementary roles in innovation strategy. As a result, these areas have also become essential issues in China’s scientific practice system. Finally, the entire system of scientific practice is clearly guided by the philosophical foundations (5#) inherited from Marx’s philosophy and Engels’ dialectic of nature.

Fig. 11

The pathmap of topic change in Chinese Science of Science

The field “C. Support of the system of scientific activity” is composed of two main themes: “11# Publications on Science of Science” and “12# Organization on Science of Science”. This field is clearly linked to the management of the production of scientific research (11#: publications, research periodicals), as well as to the organization of activities in the field (12#: learned societies, conferences and colloquia). These tasks support the successful development of the domain and ensure its sustainability.

The evolution of Science of Science

In the last 40 years, 13 research themes have been observed in Chinese Science of Science. The way they evolved, as materialized using the publication dates of the articles analyzed (see section "Exploitation of complementary information through external labels"), is also very clearly highlighted by our method and presented in Fig. 11. It is additionally justified by the analysis of the experts as described below.

In the 1980s, activity in Science of Science was only just beginning in China. The most widely discussed topic in the academic world was the issue of domain-specific attributes (8#). At that time, researchers tried to identify the nature and general patterns of the field combining Bernal’s thinking with the actual Chinese context (7#). When the National Science Conference was held in 1978, China’s science and technology system began to enter a period of reform and three major journals in the field of Science of Science (11#) were successively created—Science Research Management (1978), Science of Science and S&T Management (1980) and Studies in Science of Science (1983). These journals were supported by the government in their early stages and quickly attracted a large number of articles which meant the publication and management of research results became an important topic.

At the same time, academics studied the history of Science of Science (10#) to find evidence of the theoretical basis and construction of the discipline in China by using the work of the main foreign actors in the field. This is how they laid the theoretical foundations of research policy in China.

Science of Science in China then developed more slowly for a period, until the third council of the Chinese Association for Science of Science and Science and Technology Policy was established in 1997 (12#). The doctoral programme launched in the same year at Dalian University of Technology helped improve the discipline’s institutional system; as a result, science teaching methods and talent promotion became topical themes in 2005 (2#).

Philosophical ideas play a key role in the educational system to guide practice and this point appeared more important in promoting and supporting professional talent in science (5#). Thus, in 2008, Chinese academic circles became more interested in the philosophical origins of Science of Science, and more particularly in the foundations of Marx’s philosophy.

The early development of the method of scientific knowledge mapping in China (Chen and Liu 2005) paved the way for a new field of research aimed at obtaining information on the structure of the fields and peripheral disciplines of Science of Science (4#) such as scientific and technological studies, technological theory, technological philosophy, library science or knowledge economy.

In 2012, Chinese academics stated that scientific activities are a system in themselves. Therefore, to increase effectiveness, these activities must be evaluated and planned using approaches involving scientific and technical systems (1#). There are two main reasons for this change in context. The acceleration of decision-making in science and technology required the rapid and objective assessment of research inputs and outputs, while many new approaches were developed in scientometrics that successfully complemented traditional assessment methods.

In Yang Xiaolin’s book, he relates the words of Wu Mingyu, one of the pioneers in the field of Science of Science in China: “People who engage in science of science should first emphasize the concept of innovation”.Footnote 16 In the same vein, shortly after its creation in 1992, the NSFC (National Natural Science Foundation of China) began to support research on innovation. As awareness of the central role of innovation in China increases, research themes related to innovation are rapidly expanding (“technological innovation”, “global innovation”, “independent innovation”, “collaborative innovation”, “disruptive innovation”, etc.), making innovation management (3#) a very popular theme in 2013.

In 2016, quantitative scientific analysis (0#) and the disciplinary system became more popular, and in 2017, relevant research on science knowledge mapping (9#) became a highlight of scientific research in Science of Science in China.Footnote 17

Figure 12 shows the distribution of the number of papers per year in all thematic groups. This kind of approach can be used to highlight specific periods of thematic activity. These include a growth in activity indicating emerging topics (0#, 9#, 1#, 3#, 6#), seed topics that initiated the field and are characterized by strong initial growth in activity followed by a long decline (8#), topics that had a significant and localized period of maturation/activity in the analyzed period (10#), and topics presenting local peaks of activity that can be explained by particular events in the historical development of the field. For example, the revival of Science of Science initiated by the creation of the third council of the Chinese Association for Science of Science and Science and Technology Policy in 1997 corresponds to such a local peak in topics 6# and 12#.

Fig. 12

Activity trends of the 13 research topics in the Science of Science domain in China. Ascending trends are presented first

A more global representation of the influence of each cluster (i.e. topic) in different periods (using 3-year blocks) can be derived from the previous distributions. This representation, presented in Fig. 13, can then be used to better understand the laws of Science of Science development in China. This point of view can especially help distinguish between important but accidental topics that have a chance of developing in the short term and rational important topics that play a major role in the construction of the domain in the long term.
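The aggregation into 3-year blocks can be sketched as follows. This is an illustrative example only: the text does not specify how the influence shown in Fig. 13 is computed, so the share of papers per topic within each block is used here as an assumed proxy, and the function names, topic identifiers and years are hypothetical.

```python
from collections import Counter, defaultdict


def block_start(year, first_year, width=3):
    """Return the first year of the block of `width` years that contains `year`."""
    return first_year + ((year - first_year) // width) * width


def topic_shares_by_block(papers, first_year, width=3):
    """Return shares[block][topic]: fraction of the block's papers in that topic.

    papers: iterable of (publication_year, topic) pairs, i.e. the publication
            date label of each paper and the cluster it is assigned to.
    """
    counts = defaultdict(Counter)
    for year, topic in papers:
        counts[block_start(year, first_year, width)][topic] += 1

    shares = {}
    for start, topic_counts in sorted(counts.items()):
        total = sum(topic_counts.values())
        shares[start] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Hypothetical papers: (year, topic).
papers = [
    (1980, "T1"), (1981, "T1"), (1982, "T2"),
    (1983, "T1"), (1984, "T3"), (1985, "T3"), (1985, "T3"),
]

shares = topic_shares_by_block(papers, first_year=1980)
for start, dist in shares.items():
    print(f"{start}-{start + 2}: {dist}")
```

Normalizing within each block makes periods with very different publication volumes comparable, at the cost of hiding absolute growth, which is why the yearly counts of Fig. 12 remain a useful complement.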

Fig. 13

Coordinated influence of research topics in Science of Science in China (by 3-year periods)

The topics “0# Quantitative analysis of science”, “9# Science knowledge mapping” and “3# Innovation management” did not appear at the beginning of scientific research in the field of Science of Science in China, and it was only in recent years that the status of these topics became increasingly important. The establishment of the dominant position of the topic “0# Quantitative analysis of science” shows that Science of Science has reached maturity as a subject. The importance of the topic “9# Science knowledge mapping” indicates that Science of Science has become an open subject, integrating computational approaches and information visualization technologies. The growing prosperity of the topic “3# Innovation management” shows that Science of Science is an increasingly practice-oriented domain that emphasizes the economic value of science and technology, and shows its strategic position in China today. In comparison, research topics involving the attributes of the field (8#), the construction of scientific organization and publication processes (11#) and the management of scientific research results (12#) have gradually weakened, which also indicates that scientific research in Science of Science is gradually becoming mature and standardized in China.

Comparison of Chinese and world contexts in Science of Science

We briefly summarize below the main differences between the results of our study of Science of Science in China and those of earlier studies carried out in the international community, such as those by Fortunato et al. (2018) and Zeng et al. (2017).

  1. The states of social background and S&T development are different. Bernal’s foundational work of Science of Science, The Social Function of Science (Bernal 1939), was directly derived from the Soviet scholar Boris Hessen’s report on the social and economic roots of Newton’s mechanics at the 2nd International Congress of History of Science (ICHS) in 1931 (Hessen 1931), and it should be noted that Bernal admitted he was influenced by the “impact of Marxism”. One of the important viewpoints in his book is that “science can be planned”, and he focused more on how to rebuild after the war. One landmark of the establishment of Chinese Science of Science came in 1977, when Tsien Hsueshen was the first to propose the creation of a new discipline of “科学的科学” (Science of Science) in his long essay “Modern science and technology”. This coincided with China’s economic reform and opening of society; another landmark was the advent of the “spring of science” in China (the National Science Conference was held in the early spring of 1978). In 1978, Han Bingcheng, who worked in the Embassy in the Soviet Union, simplified “科学的科学” into “科学学” according to Chinese naming habits for disciplines; he introduced the development of science in the Soviet Union for the first time and was the first to use the concept of “Science of Science”.

  2. The main development path of the core research is different. The development path of Science of Science worldwide mainly involves the “sociology of science” and “scientometrics”, as is clearly shown in Fig. 1. In China, the development path mainly involved the “philosophy of science and technology” and “scientometrics”, with the latter prospering more in recent years. Worldwide research has concentrated more on the role of science and technology and its influence on social and economic development, and has summarized the laws of science and technology development through quantitative analysis. In China, however, research focuses more on how to carry out scientific work and understand the social function of science according to Marxism and Engels’ natural dialectics. The comparatively late construction of China’s scientific output databases (e.g. CNKI was founded in 1999), together with the difficulty of accessing international scientific output data in China, meant that scientometrics did not begin to flourish there until nearly 20 years later.

  3. The technology innovation system is an important hot topic in Science of Science, but the research perspective is slightly different. Chinese scholars of Science of Science entered the field of technological innovation in the 1980s and emphasized the source role of science and technology in the innovation system from the perspective of science & technology management, technological economics, philosophy of science and technology and system science. The content was broadened from a focus on “science and technology activity itself” to the interaction between science, technology and the economy, and to research policy. Indeed, many research results with Chinese characteristics have been adopted by government departments and have then become national strategy. By comparison, some international scholars, especially in the field of scientometrics, like Loet Leydesdorff (Etzkowitz and Leydesdorff 1997), have carried out empirical research on the triple helix innovation theory using data on papers, patents and the economy.

  4. The development paths in China and abroad have recently tended to converge. With the development of information science and data analysis technology, combined with a deeper understanding of the complexity of science and technology activities, scientific big data analysis and complexity science analysis have become hot topics in the field of Science of Science at home and abroad. Science of Science has returned to research on its own methods, which has encouraged advances in Scientometrics 2.0, now the consensus view of development both in China and worldwide.

Comparison with LDA

As our method’s topic extraction capabilities are central to the overall analysis process, we compared it with the LDA approach (Blei et al. 2003), a state-of-the-art method for topic extraction. As input for LDA, we use data similar to that used for the clustering process, namely 2790 articles represented by a bag-of-words model of 1576 terms selected with a frequency threshold of 6.Footnote 18 The GensimFootnote 19 implementation of LDA was chosen for our experiment. For the sake of comparison with our method, the number of topics to be extracted by LDA is fixed at the optimal number of clusters identified by our clustering quality criteria (i.e. 13). The training chunk size and the number of training passes of LDA were varied. The other LDA parameters are set to their default or automatic values (following (Hoffmann et al. 2010) for the grid search of the optimal model). The best result over the varied parameters is kept. For both LDA and clustering with feature maximization, the topics are presented to the experts in the form of their 10 highest ranked terms. The ranking criterion used for the clustering results is the Feature F-measure value (Eq. 3).

The further role of the experts was to check and approve topics through suitable identification of their content using the 10 highest ranked terms for both LDA and clustering with feature maximization topic lists. Suitable identification is assumed to derive from a topic title. The results of the process are presented in Fig. 14 (clustering with feature maximization) and Fig. 15 (LDA).

Fig. 14

Results of topic extraction obtained by the combination of clustering and feature maximization (top 10 ranked terms and topic titles provided by experts). Blue is used for titles which were difficult for the experts to establish using only the top 10 topic words

Fig. 15

Results of topic extraction obtained by LDA (top 10 ranked terms and topic titles provided by experts). Blue is used for titles which were difficult or impossible (??) for the experts to establish using only the top 10 topic words. A single question mark also means unclear content and a double question mark means no possible title (incoherent content). Green is used to represent the potential corresponding topics on the clustering with feature maximization (CF) side

As a result, all three of our experts agreed that they could easily label all the topics generated by clustering with feature maximization with the help of the top 10 terms (see Fig. 14). Conversely, they found that explaining most of the topics produced by LDA was practically impossible, as only 6 topics made some sense, including 5 with very unclear meaning (see Fig. 15). Additionally, for these 5 “unclear topics”, they pointed out that clear descriptions could be found with the clustering with feature maximization method.

An additional remark can be made about the very unstable level of generality of the results produced by LDA. Some topics appear to refer to very specific and incidental contexts (like topic 6LD, which partly describes the main area of work of a specific scientist) while others refer to unclear areas with very large scopes (like topics 7LD to 12LD). Conversely, the results of the clustering with feature maximization method (i.e. all topics obtained) seem to reach a very balanced level of generality.

Finally, regarding the comparison of methods, clustering with feature maximization can highlight most of the important topics of the Science of Science domain as stated in section "Clustering and optimal model detection" whereas LDA failed to highlight any of the main topics like quantitative analysis, knowledge mapping, innovation management, publication and organization in Science of Science, etc. Therefore LDA seems to be an unusable method in the context of our study especially for further fine-grained steps of detection of topic interaction and time-based topic analysis described in section "Data analysis and visualization results".

Conclusions and discussion

Science of Science in China, as a practice-oriented fundamental theoretical research field, was born with the reform and greater openness of the nation. In this work, we use elaborate and original methods of data analysis and knowledge mapping to objectively reveal the historical changes in Science of Science research topics in China and to reflect the central role of this field in the national development process. Our approach has also shown that the rapid development of the Chinese economy and its increasingly active practice of innovation have led to new research topics in the field of Science of Science.

The domain experts considered our approach to be a useful tool. The most specific discoveries they made with it are that research has moved from a period of pre-maturation of the subject to related disciplines and analysis of the structure of knowledge, from qualitative analysis to quantitative and visual analysis, from general research on the social function to more specific research on the economic and strategic function.

The combination of feature maximization and unsupervised learning and the joint use of contrast graphs for visualization is an original approach which we have proposed in this work. Our full-scale experiments were approved by experts in the field and showed that, without supervision, parameters or the support of any external source of knowledge, this method could very effectively reveal the research themes, their interactions and changes in a very complex research field such as Science of Science in China. In this article, we propose in particular a method for visualizing the analysis results using line maximization. This method is very suitable for large-scale data analysis in large dimensions. It also tolerates the integration of a wide range of additional information that can enrich analytical results and provide clarity and precision of results that current competing methods cannot provide. For example, as we have shown, methods such as LDA, which could potentially replace the proposed approach for the part concerning topic extraction, severely suffer from the dependence on parameters that are very difficult to control and working hypotheses that are difficult to verify on the distribution of words especially in the case of analysis of complex data. Consequently, these problems severely limit the quality of their results (level of generality, accuracy and precision) in that context.

Finally, as we have shown, the results we obtained have already proved sufficiently meaningful for all three experts we asked to review our work, but a more detailed qualitative analysis based on semi-directive questionnaires given to several expert reviewers would certainly allow a more thorough validation of these results. We intend this step to be the subject of further work.