Introduction

Topic extraction from scientific literature seems to be as much an art as a science. Different teams within the field of scientometrics use different approaches, based on their familiarity with specific methods, investment in the development of specific tools, long-term experience with the mapping of scientific fields, and in-house experimentation to optimize an approach. Results that apply alternative approaches to the same data set and compare the outcomes are rarely published, and there is a lack of understanding of how differences between approaches affect the results obtained. In what ways do the solutions that they produce differ from one another? Is one approach better than another? What are the ‘knobs and levers’ of each approach, and how do they affect the results? As laid out in the introduction of this special issue (Gläser et al. 2017), there is a growing need for certainty about the extent to which structures emerging from methodological approaches are indeed representations of thematic structures in science, rather than artifacts produced by the methods themselves.

To shed light on these questions, we have applied to the same data set a variety of topic extraction approaches that are documented in articles in this special issue. The data set consists of bibliographic data of documents in the astrophysics literature and is hereafter called the Astro Data Set. In this article we provide a comparative overview of the properties of these approaches and the topic solutions that they deliver. However, due to the fluidity of cognitive structures in science and the multiplicity of reference frames (Gläser et al. 2017), there is no single ground truth that would tell us authoritatively how to divide the documents in the Astro Data Set (or any other set of scholarly articles) into topics. Therefore, how to compare the topic solutions and generate useful descriptions of their differences in the presence of multiple, inaccessible ground truths is a research problem in its own right. For our purposes here, we are interested in descriptions of solutions and their differences that:

  • Capture various dimensions of how a solution differs from other solutions;

  • Reveal the distinctiveness of the perspective that a solution provides into the topical structure of the field;

  • Generate hints for differences between solutions that can be attributed to specific properties of the approaches used.

Ideally, our comparisons would be reviewed by area experts, who could evaluate the merits of the different perspectives created by the solutions [a further research direction discussed in Gläser et al. (2017)]. Meanwhile, we were fortunate to have some domain expertise within our author team, with several authors trained in engineering or physics, including a subarea of astrophysics. We also occasionally discussed our results with astrophysicists outside of the author group. This allowed us to bring this expertise to bear on the interpretation of the topical structures constructed by the various approaches. From the point of view of the reproducibility of the results and their interpretation, however, this can also be seen as problematic. We used two ways to at least acknowledge this problem: (a) being transparent and articulating whenever subject expertise guided the analysis, and (b) trying to find reproducible ways to compare the interpretative dimensions (e.g. in the labeling of structures) across approaches.

In this paper we provide a first insight into how topic extraction approaches construct topics differently and how the resulting topical structures differ. Eventually, we would like to deliver guidance to the scientometrics community and to users of topic extraction results on how to choose among approaches and what to keep in mind when interpreting results. More work remains to be done, which we outline in the final section of this paper.

Comparing approaches, detecting in what respects results agree or differ, and trying to understand why, is the core of this paper. The paper proceeds as follows: First, we provide a framework to characterize the approaches and discuss where differences in their work flows arise (“Overview on topic extraction approaches” section). Second, we introduce the methods that we used to compare topic solutions (“Tools for comparing topic extraction solutions” section). Third, an ensemble-based comparison of the solutions they generate is conducted (“Findings: comparisons across whole solutions” section). This overview-level comparison is complemented by a number of specific comparisons, guided by assumptions about which perspective on the self-organised nature of the emergence of scientific topics is placed in the foreground by each of the methods (“Findings: specific comparisons” section). The paper concludes with a discussion of how the discourse around the comparison of methods and approaches could be further fostered in the scientometric community.

The data set

The Astro Data Set consists of the bibliographic data of 111,616 publications published in the years 2003–2010 in 59 astrophysical journals indexed by Web of Science (see “Data set: journal titles” in Appendix for the list of journal titles). To cover primarily original scientific content, only documents of type Article, Letter, and Proceedings Paper were included, whereas the document types Biographical-Item, Book Review, Correction, Editorial Material, Meeting Abstract, News Item, Reprint and Review were excluded. Reference links between publications were reconstructed by matching bibliographic information using a rule-based script developed by Michael Heinz (Humboldt University).
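The matching script itself is not reproduced in this article. Purely as an illustration of this kind of rule-based matching, and with hypothetical field names, a minimal sketch might link a parsed reference string to a source record when a few bibliographic fields agree:

```python
# Hypothetical sketch of rule-based reference matching; field names and the
# matching rule are illustrative only, not the actual script used here.
def match_reference(ref, record_index):
    """ref: dict with keys 'author', 'year', 'volume', 'page'.
    record_index: {(author, year, volume, page): record_id}."""
    key = (ref['author'].lower(), ref['year'], ref['volume'], ref['page'])
    return record_index.get(key)  # None if the reference remains unmatched
```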

Overview on topic extraction approaches

The selection of the eight topic extraction approaches compared in this paper is opportunistic in that these are the approaches developed and used by the teams that have come together to collaborate and produce this special issue on ‘Same Data, Different Results’.Footnote 1 This means that for each approach used in this comparison there is one member or team in our collaboration who is intimately familiar with that approach. What became apparent in the discussions at a series of workshops over a couple of years is the extent to which each of us had to make informed, and sometimes pragmatic, decisions on what approach to pursue and how to tweak it to meet the specific objectives of our respective research and tool development projects. These discussions led to a framework, or language, to characterize and distinguish features of approaches, including the distinction between a ‘data modeling’ component and a ‘clustering algorithm’ component that is reflected in the organization of Table 2, which provides an overview of the distinguishing properties of the approaches and of the specific solutions that we decided to include in our comparison.Footnote 2

Table 1 provides an overview of the various combinations of data models and clustering algorithms covered by the solutions included in our comparison (and what areas of the potential space of combinations are left unexplored due to resource limitations). The three solutions c (Van Eck and Waltman 2017), hd (Havemann et al. 2017), and u (Velden et al. 2017) are delivered by a set of approaches that model the data as a direct citation network, but use different clustering algorithms; another two solutions, eb and en are delivered by a set of approaches that use the same clustering algorithm, but model the data slightly differently: the first one as a bibliographic coupling network and the second one as a hybrid network based on bibliographic coupling in combination with terms extracted using Natural Language Processing (NLP) (Glänzel and Thijs 2017); another set of approaches models the data as a semantic matrix by interpreting each bibliographic or other metadata field as a semantic entity and applies two different clustering algorithms (Koopman and Wang 2017a), delivering solutions ol and ok, respectively; finally, solution sr (Boyack 2017a) is generated by using the direct citation network of a superset of literature from a global science map and projecting the Astro Data Set onto a clustered version of that map. All eight topic extraction approaches and their results are described in detail in the corresponding companion articles in this special issue.

The way the data is modeled (what features of the articles in the data set are extracted and used to represent the data) and the choice of clustering algorithm that is used to detect regularities in the data and extract groups of articles that represent candidates for ‘topics’ are key differences between approaches. Importantly for our purpose here, the set of approaches in this special issue covers a wide range of ways to model data and a number of clustering algorithms. But there are clearly also dimensions missing, for instance author relations. Note further that all approaches in this sample use a document (or link) clustering algorithm. Future work should also include topic modeling approaches and possibly hybrid document clustering and topic modeling approaches, such as Xie and Xing (2013). Still, the variety within our sample makes it suitable as a first set to explore the question of how approaches and their results differ. The data models cover citation based models, hybrid models (citation and text based), and so-called semantic models. The algorithms used include four of the most popular clustering algorithms, namely k-means (MacKay 2003), Infomap (Rosvall and Bergstrom 2008), Louvain (Blondel et al. 2008), and the Smart Local Moving Algorithm (Waltman and Van Eck 2012, 2013), an improved variant of the Louvain algorithm, along with a new memetic-type algorithm (Havemann et al. 2017). The latter has been designed specifically for the extraction of overlapping, poly-hierarchical topics in the scientific literature.
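None of the teams’ implementations is reproduced here, but as a minimal sketch of how the choice of clustering algorithm alone changes a partition, one might run two of the algorithms named above on the same toy network using python-igraph (assumed to be available):

```python
# Sketch: two of the clustering algorithms named above applied to the same
# stand-in network; the approaches in this issue use their own
# implementations, parameters, and data models.
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=500, m=2000)   # stand-in for a citation network
louvain = g.community_multilevel()         # Louvain (Blondel et al. 2008)
infomap = g.community_infomap()            # Infomap (Rosvall and Bergstrom 2008)
print(len(louvain), len(infomap))          # the partitions generally differ
```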

Table 1 Combinations of data models and clustering algorithms

Figure 1 schematically depicts the steps of a typical topic extraction work flow, which consists of data preprocessing, construction of a data model, and the selection and application of a clustering algorithm. Differences between approaches can occur along any of these steps. Whereas for the purpose of this comparison all teams start from the same data set of source documents (“The data set” section), a first source of divergence during the preprocessing of this raw data is that some teams proceed by mapping this data set to their in-house database (Boyack 2017a; Glänzel and Thijs 2017). As some publications cannot be mapped to an entity in those in-house databases, those teams work with smaller subsets of documents. Also, the information contained in those in-house databases on each publication (e.g. information about references to other publications) may differ from the information used by those teams that worked with the original data set. To give an example, the team that provided solutions en and eb had access to the unique reference codes given by Thomson Reuters to construct citation links between documents (Glänzel and Thijs 2017), whereas other participants worked with the reference links deduced from the rule-based parsing of reference strings mentioned in “The data set” section.

Fig. 1
figure 1

Schematic of a typical topic extraction workflow. Topic extraction approaches can differ in any of these steps, thereby producing variation between solutions

A fundamental difference between the solutions produced by the various approaches is their coverage, ranging from 91 to 100% of the 111,616 documents in the Astro Data Set. The reasons for this variation lie not only in differences in the preprocessing of the data, but also in further steps of the workflow, as described in the following:

Solutions ok and ol As can be seen in Table 2, the most comprehensive solutions are ok and ol, with a coverage of 100%, delivering 31 topics and 32 topics, respectively.

Solutions en and eb Next in terms of coverage are solutions en and eb, which include 97.99 and 97.22% of all documents, delivering 11 topics and 13 topics, respectively. These solutions were generated from a data set that was created by mapping the Astro Data Set onto an in-house version of the Web of Science, which reduced the original set to 110,412 publications (~99%) (Glänzel and Thijs 2017). This subset was further reduced in two steps. First, in the data modeling step, 82 documents were excluded from en because they did not reach a chosen threshold for the minimal lexical similarity between any two documents in the data set. For eb, the data modeling step resulted in 1479 documents being dropped because they did not share any references with any other documents in the data set (and hence did not couple). Second, after the clustering step, all documents were excluded from solutions en and eb that had been assigned by the clustering algorithm to single document clusters or ‘small, irrelevant’ clusters (Glänzel and Thijs 2017), resulting in 954 documents being omitted from solution en, for a final coverage of 109,376 documents, and 421 documents being omitted from solution eb, for a final coverage of 108,512 documents.

Solution sr Solution sr delivers 555 topics. During data preprocessing, the source data was mapped to an in-house database from SCOPUS, resulting in a reduction to 107,888 documents (96.66% of the full Astro Data Set). The clustering step consisted of locating those remaining documents in the global map of science clustered at the region level (Boyack 2017a). In this step, 584 documents could not be located in the global science map, indicating that they were not included in the creation of the global map because of missing reference or citation information. The final sr solution therefore covers 107,304 documents in total, which corresponds to a coverage of 96.14%.

Solutions u, c and hd Solutions c, u, and hd have the lowest coverage. They are based on the direct citation model and include only the documents in the giant component of the direct citation network.Footnote 3 Solutions c and u both deliver 22 topics, and their coverage is nearly the same at 91.23%. Solution c omits three documents that are connected to the giant component of the direct citation network only by future pointing references, whereas those three documents are included in u. Solution hd has a slightly smaller coverage (91.17%) and delivers 111 topics. It covers 66 documents fewer than solution u due to an additional selection process after the clustering step: Out of a total set of 381 valid clusters produced by the approach, only a subset of 113 clusters was selected, to meet criteria for a minimum cluster size of 20 papers and a minimum quality of clusters as measured by the associated cost function (see Havemann et al. 2017, for details). We further decided to include only 111 of the 113 clusters and to omit the two largest clusters of this solution from the comparison, as they provided only limited information about the topical structure of the Astro Data Set.Footnote 4

Finally, a number of parameters usually need to be set in the modeling of the data and in the application of a clustering algorithm that influence the results achieved, such as a minimum threshold for the strengths of links to be considered in a bibliographic coupling network, or a requirement for a minimum size of clusters to be extracted. In Table 2 we list those parameters for each approach.

Table 2 Properties of approaches

Tools for comparing topic extraction solutions

We use a variety of tools to compare solutions and capture differences in how they group documents from the Astro Data Set into topics. We use a quantitative measure to get a first idea of the similarity or disparity of solutions. We use visual mappings of solutions onto various reference frames that support comparing solutions to one another. Finally, we explore a variety of labeling approaches to capture the content of a topic and to compare solutions with regard to the content of the topics that they construct.

Metrics: Normalized Mutual Information

To quantify the degree of similarity between solutions, we used an information theoretic measure that is commonly used in computer science to compare clusterings, namely Normalized Mutual Information (NMI). It treats membership in a cluster as a random variable and quantifies to what extent knowing one clustering reduces uncertainty about the other. See “Comparison metric: Normalized Mutual Information” in Appendix for details on how this measure is defined.
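As a minimal sketch (not the implementation used for Table 3), assuming each solution is represented as a list of cluster labels over the same shared documents, NMI can be computed with scikit-learn. Note that scikit-learn’s default arithmetic-mean normalization is only one convention; the Appendix defines the variant used in this paper.

```python
# Toy NMI comparison of two clusterings over the same six documents.
from sklearn.metrics import normalized_mutual_info_score

solution_a = [0, 0, 1, 1, 2, 2]
solution_b = [1, 1, 0, 0, 0, 2]
print(normalized_mutual_info_score(solution_a, solution_b))
```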

Labeling

Thesaurus terms Our first approach to labeling clusters makes use of thesaurus terms from the Unified Astronomy Thesaurus (UAT), a public domain thesaurus specific to astronomy.Footnote 5 As described in detail in Boyack (2017b), it contains 1915 unique terms at a maximum depth of 12 levels. The Astro Data Set was indexed to generate thesaurus terms for each document, using title and abstract as input.Footnote 6 To generate cluster labels we used the most specific terms assigned to each document plus level 2 terms. Of these we selected as labels the most relevant terms as determined by an NMI measure that compares the distribution of terms in one cluster with that in other clusters (see Koopman and Wang 2017b for details). See “Data files” in Appendix for download information for the corresponding data file.

Natural language terms A second approach to labeling clusters used terms extracted from the titles and abstracts of documents. As for the thesaurus terms, we constructed labels by selecting the ten terms with the highest NMI scores when cluster documents are compared to non-cluster documents. This labeling approach is described in detail by Koopman and Wang (2017b). The labels for all clusters with ≥100 documents are given in “Cluster-level labels” of Appendix. See also “Data files” of Appendix for download information for the corresponding data file.
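As a rough sketch of this type of term selection (the actual scoring is described in Koopman and Wang 2017b), assuming a binary document-term matrix and a boolean cluster-membership vector:

```python
# Sketch: rank terms by the mutual information between term occurrence and
# cluster membership, and return the k best terms as labels. X is a binary
# documents-by-terms array; in_cluster is a boolean vector over documents.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def top_label_terms(X, in_cluster, terms, k=10):
    scores = [normalized_mutual_info_score(in_cluster, X[:, j])
              for j in range(X.shape[1])]
    best = np.argsort(scores)[::-1][:k]
    return [terms[j] for j in best]
```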

Journal signature In Velden et al. (2017) a high-level classification of document clusters is introduced that builds on the observation that groups of clusters share similarities with regard to their most popular and distinctive journal titles (‘journal signature’). Using this approach, six scientific domains were distinguished that seem to correspond to sub-disciplines within Astronomy and Astrophysics: Gravitation and Cosmology, Astroparticle Physics, Astrophysics, Solar Physics, Planetary Science, and Space Science. Based on their journal signature, the 35 largest clusters of each of the seven disjoint cluster solutions included in our comparison were assigned to those domains. Most assignments were straightforward, but some cases were ambiguous and more difficult to decide, when a journal signature exhibited a mixture of characteristics. This high-level grouping of clusters by scientific domains provides yet another reference frame for comparisons, as solutions differ in how they divide up a domain into topics, and in how they shape the interfaces between domains. See “Data files” in Appendix for download information for the corresponding data files.

Visual mapping

Little Ariadne As described in Koopman et al. (2017), Little Ariadne is a special instantiation of Ariadne, a user-friendly tool for browsing bibliographic databases. This specific instance uses the bibliographic information in the Astro Data Set and is available at http://thoth.pica.nl/astro/. In our analysis we use the tool to visualize how the document clusters provided by the eight different approaches relate to one another in an abstract semantic space. Similarity here is based on a semantic matrix that is created by indexing entities such as authors, journals (ISSN), subjects, citations, topical terms, MAI thesaurus terms, and cluster IDs (see Koopman et al. 2017 for details). The visualization we produce with Little Ariadne highlights which clusters from different solutions are very similar to one another, and which solutions produced clusters that are relatively distinct from all clusters produced by the other solutions.

Lexical fingerprint The lexical fingerprint is a method to quantify and visually compare the topical content of individual clusters, within a solution and across all solutions (Koopman and Wang 2017b). It builds on the mutual information based labeling of document clusters described above. The lexical terms that constitute the baseline of the fingerprint are selected in a two-step process: First, for each solution a ranked list of the 50 terms with the highest NMI scores is created. Then a joint set of terms for the fingerprint is created by selecting the 50 highest ranking terms across those lists, excluding terms that appear on only one solution’s list. For the visualization of the fingerprint of a cluster, the joint list of 50 terms is arranged along the x-axis based on their similarity according to the semantic matrix used for Little Ariadne. The y-axis gives the NMI score of a cluster for the respective terms. The resulting lexical fingerprints look like radiation emission spectra, except that the values on the x-axis do not represent continuous values of wavelengths or frequencies but instead are terms, and hence categorical values. See “Data files” in Appendix for download information for the corresponding data file with a list of fingerprint terms and the scores for all clusters with ≥100 documents from all eight solutions.
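A minimal sketch of the two-step vocabulary selection, under one plausible reading of ‘highest ranking’ (terms ranked by their best position on any solution’s top-50 list), could look as follows:

```python
# Sketch of the fingerprint vocabulary selection: keep terms that appear on
# at least two solutions' top-50 lists, rank them by their best list
# position, and take the top 50.
from collections import Counter

def fingerprint_vocabulary(ranked_terms, size=50):
    """ranked_terms: {solution: [terms ordered by NMI score, best first]}."""
    top_lists = {s: terms[:50] for s, terms in ranked_terms.items()}
    counts = Counter(t for lst in top_lists.values() for t in lst)
    shared = [t for t, c in counts.items() if c >= 2]
    best_rank = {t: min(lst.index(t) for lst in top_lists.values() if t in lst)
                 for t in shared}
    return sorted(shared, key=lambda t: best_rank[t])[:size]
```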

Affinity networks The construction of topic affinity networks is a method to map and visualize the internal structure of a solution. The method shows how the extracted document clusters relate to one another based on direct citation links between documents. In the calculation of link strengths between document clusters, only the surplus of citations relative to a random null model (based on cluster sizes) is considered, in order to reduce the ‘cluttering’ of the visualization by the pervasive background of connectivity within the scientific literature (see Velden et al. 2017; Velden and Lagoze 2013, for details of the method). See “Data files” in Appendix for download information for the corresponding data files.
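A minimal sketch of the surplus-citation link weight, assuming a size-proportional null model (see Velden et al. 2017 for the exact definition used):

```python
# Citations observed from cluster a to cluster b, minus the number expected
# if the same total of citation links were placed at random in proportion to
# cluster sizes; only the surplus contributes to the affinity network.
def surplus_weight(observed_ab, total_links, size_a, size_b, n_docs):
    expected_ab = total_links * (size_a / n_docs) * (size_b / n_docs)
    return max(observed_ab - expected_ab, 0.0)
```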

Findings: comparisons across whole solutions

Differences in topic size distributions

Figure 2 shows the cumulative size distribution of the document clusters that are extracted by the eight approaches. Given the overlap of clusters in the hd solution, we removed duplicates from unions of clusters when calculating the cumulative fractional size. The distribution shows that solutions hd, sr and en are highly concentrated, in that they reach a coverage of 75% of the Astro Data Set with their six largest clusters alone.Footnote 7 By contrast, solutions ol and ok show much lower concentration, reaching 75% coverage only when including the 18 (ol) and 20 (ok) largest clusters, respectively.
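The concentration reading of Fig. 2 amounts to asking how many of the largest clusters are needed to reach a target coverage; a minimal sketch:

```python
# Number of largest clusters needed to cover a target fraction of the data
# set (for hd, cluster sizes would first be de-duplicated as described above).
import numpy as np

def clusters_for_coverage(cluster_sizes, n_docs, target=0.75):
    frac = np.cumsum(sorted(cluster_sizes, reverse=True)) / n_docs
    return int(np.searchsorted(frac, target) + 1)
```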

Fig. 2
figure 2

Cumulative fractional size distribution of clusters in each solution. The y-axis indicates what fraction of the total set of 111,616 documents is included, the x-axis corresponds to cluster rank, ordered by cluster size

Degree of similarity between solutions

To get a first idea of the degree of similarity between solutions, we use Normalized Mutual Information as a quantitative measure of the similarity between a pair of solutions (see Table 3). Note that this metric, as well as the topic affinity networks used further below, could only be produced for disjoint cluster solutions, so that hd is excluded from the comparison in this section.

Table 3 Normalized Mutual Information (emphasis: max, min value)
Fig. 3
figure 3

Grouping of clustering solutions based on degree of mutual similarities in cluster membership measured by NMI

The median of the distribution of NMI scores is 0.36. The highest similarity score of NMI = 0.63 is obtained for the pair of solutions c and u. Both are based on the same data model, have nearly identical coverage of the data, and differ only in the clustering algorithm used. Figure 3 shows groupings of solutions at different levels of agreement with respect to the quartile of the NMI score between each pair of solutions (1st quartile: NMI < 0.32, 2nd quartile: 0.32 ≤ NMI < 0.36, 3rd quartile: 0.36 ≤ NMI < 0.44, 4th quartile: NMI ≥ 0.44). For example, the similarity score of each possible pairing of solutions in the set (c, ok, eb, en) is larger than the 1st quartile of similarity scores, i.e. NMI ≥ 0.32. Besides the pair of solutions with the maximal NMI score, we find two overlapping groups of solutions with high similarity scores above the third quartile level (NMI ≥ 0.44), namely (u, c, ol) and (c, ol, ok). In the following we refer to the union of these two groups as the ‘core group’ of solutions. The solutions next most similar to this core group are sr and eb. Solution en is the most dissimilar: it joins a subset of the core group only if we allow for NMI values as low as the 2nd quartile.
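One way to read the groupings of Fig. 3, sketched here under the assumption that a group is a set of solutions in which every pair clears the threshold (a maximal clique in the thresholded similarity graph):

```python
# Sketch: maximal groups of solutions whose pairwise NMI scores all reach a
# given threshold; nmi_scores maps (solution_a, solution_b) -> NMI value.
import networkx as nx

def groups_at(nmi_scores, threshold):
    g = nx.Graph()
    g.add_edges_from((a, b) for (a, b), v in nmi_scores.items() if v >= threshold)
    return [set(c) for c in nx.find_cliques(g)]
```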

To visually inspect the degree of similarity between the solutions, we generate their topic affinity networks (see Figs. 4, 5). An affinity network shows how the different topics within a solution connect to one another based on direct citations, and thereby allows us to visualize the topical structure that a solution imposes on the Astro Data Set. To support the comparison and interpretation of these maps,Footnote 8 we subdivide the affinity network into scientific domains based on the journal signature of the document clusters that constitute the nodes of the network (Velden et al. 2017).

Fig. 4
figure 4

Topic affinity networks for solutions en, eb, u, c. Node size indicates the number of documents, and link strength the relative preference given by publications in one topic to cite publications in the other. Links are directed, colored by their source node, and curve clockwise away from it. Node colors—visible in the online version—indicate a scientific domain based on journal signature: red (Gravitational Physics and Cosmology), yellow (Astroparticle Physics), green (Astrophysics), orange (Solar Physics), blue (Planetary Science), purple (Space Science). The first number in node labels indicates the rank of the node by size, and the second number, in brackets, is the cluster index provided by the creators of the solutions and used in the remainder of the article to identify clusters [Network visualization: Gephi + Force Atlas 2 algorithm, one of the few network layout algorithms that considers edge weights in directed networks]. (Color figure online)

Fig. 5
figure 5

Topic affinity networks for solutions ok, ol, sr. Node size indicates the number of documents, and link strength the relative preference given by publications in one topic to cite publications in the other. Links are directed, colored by their source node, and curve clockwise away from it. Node colors—visible in the online version—indicate a scientific domain based on journal signature: red (Gravitational Physics and Cosmology), yellow (Astroparticle Physics), green (Astrophysics), orange (Solar Physics), blue (Planetary Science), purple (Space Science). The first number in node labels indicates the rank of the node by size, and the second number, in brackets, is the cluster index provided by the creators of the solutions and used in the remainder of the article to identify clusters [Network visualization: Gephi + Force Atlas 2]. (Color figure online)

The affinity networks in Figs. 4 and 5 reveal that in all seven solutions, Astrophysics is the largest and most central domain, in the sense that it interfaces with each of the other domains. Its relative size ranges between 50 and 55% of the documents covered by a solution, with the exception of en, where it includes only about 42% of documents. Interestingly, the neighboring domain of Planetary Science is much larger in en than in all other solutions, suggesting that a number of documents that other solutions have assigned to clusters in the domain of Astrophysics may have been assigned by en to the Planetary Science domain instead (see our detailed investigation further below).

The topology of the affinity networks in Figs. 4 and 5 underscores the similarity between the core group of solutions (u, c, ol and ok) that was already indicated by the quantitative NMI measure. The shared topology consists of an elongated structure with the domains of Gravitational Physics and Cosmology and Astroparticle Physics located at one end, the domain of Astrophysics in the middle, and the domains of Solar Physics, Planetary Science, and Space Science located at the other end. This structure suggests an organization of the field from objects at large scales of space-time and larger distance from earth to smaller objects and closer distance to earth, as discussed in Velden et al. (2017). Solutions eb and sr can be seen as exposing variants of this pattern. Due to their very different numbers of clusters (13 versus 51Footnote 9), they sit at opposite ends of the spectrum of solutions.

Solution eb shows a structure that is similar to the core group with the Astrophysics domain at the center. However, the low number of clusters in eb seemingly suppresses the separate identification of the domain of Space Science. An inspection of cluster labels provided in “Cluster-level labels” of Appendix reveals that topics that in other solutions are a core component of the domain of Space Science, such as those relating to ‘solar wind’ and ‘ionosphere’ (see cluster labels for e.g. c18, u17, u18, ok4, ok25) are included in eb in the Solar Physics domain.

Solution sr is distinct because of its much higher number of clusters (51), an extreme variation in cluster sizes, and its high concentration of documents in a small number of clusters (see also Fig. 2). Interestingly, the cluster size distribution in sr differs significantly across domains: whereas Astrophysics, Solar Physics and Gravitational Physics and Cosmology are each dominated by one or two large topics, the domains of Planetary Science, Space Science and Astroparticle Physics show significant scatter with a large number of small topics.

The topology of connections between domains in the affinity network of sr is similar to that of the core group of solutions: Astrophysics takes a central position and the other domains are split into two groups, with one group attaching to one end (Gravitational Physics, Cosmology and Astroparticle Physics), and the other group of domains attaching to the other end (Solar Physics, Planetary Science, and Space Science). The quantitative measure of similarity, NMI, suggests a relatively high (3rd quartile level) similarity between solutions u, c, and sr but not between sr and ol or ok. Looking at Figs. 4 and 5 this seems plausible because the former three solutions share a high concentration of documents in large clusters in the domains of Astrophysics and Gravitational Physics and Cosmology and a lack of such a concentration in the domain of Space Science. These tendencies are not shared by the ol and ok solutions that show greater scatter of documents across several clusters in Astrophysics and Gravitational Physics and Cosmology, and a concentration of documents in one or two larger clusters for the domain of Space Science.

Finally, the affinity network of solution en looks very distinct from the affinity networks of the other solutions. Besides having only a small number of clusters (11) and no topics assigned to the Space Science domain (two features it shares with the eb solution), en constructs the interface between Solar Physics and the other domains in a unique way. It links Solar Physics with Gravitational Physics and Cosmology through topic en10 (‘gravitational waves’) and with Astrophysics through topics en4 (‘x-ray’) and en6 (‘gamma ray’), that both also interface directly with Gravitational Physics and Cosmology. This moves Solar Physics away from the other end of the elongated structure that characterizes the core group of solutions. This striking topological difference in combination with the low NMI similarity score of en when compared to any of the other solutions suggests a difference in the aggregation of topics that will be further investigated in “Citation based versus semantic data models” section.

What solutions agree about

The visualization tool Little Ariadne can be used to generate a global view of all eight solutions and of how their topic clusters relate to one another based on semantic similarity (see Fig. 6). Relationships between topics are based on the distance measure of the semantic matrix used by Little Ariadne. The large-scale structure of this map is eye-catching, with areas of higher topic concentration and relative voids in between. This large-scale structure corresponds well to the high-level domains that were derived from journal signatures (see “Labeling” section). This suggests that similarity in journal signatures correlates to a large extent with semantic similarity as measured by Ariadne, with one caveat: Fig. 6 suggests a subdivision of the Astrophysics domain into larger objects (galaxies) versus smaller objects (stars), a distinction that is not obvious in the analysis of the journal signatures of the corresponding topics.

Fig. 6
figure 6

Relationships between clusters from all eight solutions as seen by Little Ariadne. The bold labels indicate high-level scientific domains (Velden et al. 2017) that correspond well to the large scale structure of the network of clusters shown here. The dotted-line ovals indicate the approximate location of the clusters that are associated with each of the 13 largest shared document sets (sh1–sh13). The shared document sets are labeled by the top two terms generated using the entropy based labeling method introduced in Koopman and Wang (2017a)

To further explore the agreement between solutions, we looked for sets of documents that are clustered together into a single topic by every solution. Of the 111,616 documents in the data set, 96,921 are included in all solutions. There are 4289 (maximal) document sets that include at least two documents and for which each solution has at least one cluster containing the entire set. We call these ‘shared document sets’ and interpret them as representing ‘hard thematic cores’ of documents that all solutions agree belong together in one topic. The 13 largest shared document sets comprise 23,217 documents and account for about 21% of the documents in the Astro Data Set. Their size and associated clusters are listed in “Shared document sets” of Appendix. The approximate position of the associated clusters in the cluster network is indicated in Fig. 6 by light blue ovals. With the exception of the domain of Space Science, which is not represented in solutions eb and en, all domains contain one or several of the 13 shared document sets, meaning they have thematic cores that are identified unambiguously by all eight topic extraction approaches. Labels for the shared document sets that describe the content of these thematic cores are given in Table 6 in “Shared document sets” of Appendix.
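For the seven disjoint solutions, the shared document sets can be computed by grouping documents by the tuple of clusters they fall into; a simplified sketch (the overlapping hd solution additionally requires checking containment in at least one of its clusters):

```python
# Sketch: group documents covered by every solution by their combined cluster
# assignment; every group of >= 2 documents is a (maximal) shared document set.
from collections import defaultdict

def shared_document_sets(assignments, doc_ids):
    """assignments: {solution_name: {doc_id: cluster_id}} (disjoint solutions)."""
    groups = defaultdict(list)
    for d in doc_ids:
        if all(d in a for a in assignments.values()):
            key = tuple(assignments[s][d] for s in sorted(assignments))
            groups[key].append(d)
    return [docs for docs in groups.values() if len(docs) >= 2]
```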

Upon inspection of the lists of clusters associated with each of the 13 largest shared document sets, we noticed two instances where the majority of solutions place two document sets into the same cluster, whereas a small set of solutions disagrees and separates those two document sets into distinct topics. The first case concerns document sets 5 and 13 in the Gravitational Physics and Cosmology domain. A visual analysis of the lexical fingerprints of the clusters (see Fig. 7) shows that solutions ok and ol distinguish between the topics of ‘inflation’ and ‘dark energy’, whereas all other solutions combine these two topics into one. From a theoretical perspective, inflation (expansion of the early universe) and dark energy (the current phase of expansion) are separate phenomena; however, they are potentially linked, which is the concern of the so-called ‘quintessence’ theory in astrophysics. This suggests that, from a subject expert’s standpoint, detecting the linkage between the topics as well as detecting the distinctiveness of the two topics both provide informative perspectives on the topical structure of the field.

Fig. 7
figure 7

Lexical fingerprints of clusters that include the shared document sets 5 and 13. Whereas most topic solutions assign the two shared paper sets to a single topic (top diagram), solutions ol and ok (below) distinguish the topics of ‘dark energy’ (ok17, ol29) and ‘inflation’ (ok19, ol18)

The second case concerns the domain of Planetary Science, where solutions c, ok, and ol assign the shared document sets 8 and 10 to two distinct topics. The clusters in solutions c, ok, and ol that include document set 10 (ol15, ok26, c14) show a clear signal for the terms ‘mars’ and ‘surface’ (see Fig. 8). The clusters in solutions c, ok, and ol relating to document set 8 (ol12, ok22, c12), however, have only a very weakly expressed fingerprint. This is likely because the most relevant terms for these clusters were suppressed in the construction of the lexical space for the fingerprint analysis, an issue discussed in Koopman and Wang (2017a). Consulting the cluster labels given for these three clusters in “Cluster-level labels” of Appendix, we find ‘asteroid’ and ‘comet’ listed as top terms for those clusters, terms that are not included in the lexical fingerprint. The labels for the shared document set 8 (see Table 6 in “Shared document sets” of Appendix) confirm that the topic of this second shared document set is focused on ‘comets’ and ‘asteroids’. From an astrophysical perspective, a distinction between research on asteroids and comets on the one hand (document set 8) and research on planets in the solar system (document set 10) seems a plausible one to make, and whether to merge the two topics into a more general planetary science topic would seem a matter of resolution.

Fig. 8
figure 8

Lexical fingerprints of the clusters associated with shared document sets 8 and 10. Most topic solutions assign the two shared paper sets to a single topic (top diagram). However, solutions ol, ok, and c (below) assign them to two different topics. One of them is a topic described by the terms ‘mars’ and ‘surface’ (ol15, ok26, c14). The fingerprint for the second topic (ol12, ok22, c12) is less well expressed, because key terms for its characterization such as ‘comet’ or ‘asteroid’ are not included in the vocabulary used for the construction of the lexical fingerprint

Findings: specific comparisons

In this section we compare pairs of solutions that differ with regard to some specific aspect of their approach (e.g. the same data model was used but different clustering algorithms) and explore whether we can develop hypotheses about how differences between solutions link to differences between the approaches.

Local versus global data

Seven out of the eight approaches represent topics constructed from local data in the sense that they are based exclusively on the information contained in the Astro Data Set. By contrast, sr generates topics by mapping the documents of the Astro Data Set onto the partitioning of the STS global map of science, the clustered direct citation network of a much larger data set of publications. The underlying data covers a longer time period, 1996–2012, and publications from all areas of science, about 49 million documents in total (Boyack 2017a).

The topical structure that sr constructs by embedding the Astro Data Set into a global context greatly varies in resolution across the different domains, as can be seen from Figs. 4 and 5. For a detailed analysis see “Details of analysis of local data versus global data” in Appendix. The domains of Gravitation and Cosmology, Astrophysics, and Solar Physics are highly concentrated with almost all documents included in one or two large clusters. In the other domains, documents are dispersed across a larger number of clusters. This suggests that in those domains documents have links to many other parts of the scientific literature outside of the data set. As demonstrated in the companion article on the sr solution (Boyack 2017a), many of the smaller topics in the sr solution are instances where an astronomy-related application constitutes a part of another, much larger discipline. For example, some of the small topics found by sr in Planetary Science and Space Science seem to have clear links to geology or atmospheric and climate science.

This greater resolution of topics at the periphery of the Astro Data Set provides an alternative perspective to the one provided by solutions that construct cohesive clusters in those domains. The appropriateness of either may depend on the purpose of the topic extraction. For example, Boyack (2017a) suggests that a journal based field delineation that neglects the global context of an area of research is increasingly inappropriate to capture topics of research given the increasing interdisciplinarity of research. At the same time, the sr solution lacks topical resolution for the two largest domains that constitute the core of the field. The use of an aggregated version of the global science map to create sr is likely responsible. As discussed in Boyack (2017a), an aggregated version of the global science map was used so that the number of clusters would correspond more closely with the other solutions submitted to the comparison exercise reported here.Footnote 10

Citation based versus semantic data models

Another fundamental distinction between approaches is whether their data model uses citation links or lexical similarities in the metadata of an article (such as title and abstract) to relate documents to one another. While citation is a technically unambiguous signal (either there is a citation from one document to another or there is not), there is the potential issue of a social distortion of citation patterns due to rivalries between authors, who may avoid citing each other’s work even though it is related, or bias due to the Matthew effect in favor of renowned authors, who attract citations even though other works may be equally relevant but do not get cited as much. By contrast, a semantic approach could be seen as being less vulnerable to such behavioral distortions. However, its signals may be technically ambiguous and lead to false positives when the same term is used in different specializations to indicate different concepts. This will occur less often if the data set is focused on a specific scientific area such as the one represented by the Astro Data Set.

Four of the eight solutions included in our comparison are exclusively based on citation information (c, u, eb and hd), whereas three have a semantic component in their data model. All of these latter three, however, are some sort of hybrid, and none is based purely on semantic information. Solution sr is based on a fine-grained clustering of a direct citation network that covers all of science and then uses semantic information to merge clusters into larger topics. Solution en is generated by an explicitly hybrid approach that combines bibliographic coupling and document similarities based on terms extracted using an NLP approach. Finally, solutions ok and ol are based on what could be termed a ‘hypersemantic’ data model: it interprets each type of field in the bibliographic record of a publication as an entity, e.g. author name, article title, journal name, reference. It then constructs a lexical profile for each instance of each entity type, as a vector based on the number of publications in which that instance co-occurs with a given term or subject extracted from the entire data set. To relate articles to one another, for each article the lexical profiles of its entities are combined into one vector. References are one of the entity types included, such that a citation based signal is reintroduced through the back door: two articles that cite the same document will be more similar to each other, since the lexical profiles of the respective instance of the reference entity will be the same. The heterogeneity within each of the two sets of solutions, citation based versus (hybrid) semantic, with regard to resolution (number of clusters), clustering algorithms used, and data models is so great that we restrict a detailed analysis to subsets that reduce this heterogeneity.
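A simplified sketch of this data model (the actual construction is described in Koopman and Wang 2017a) might build entity profiles and article vectors as follows; combining profiles by averaging is one plausible choice, not necessarily the one used:

```python
# Each entity instance gets a lexical profile counting the publications in
# which it co-occurs with each term; an article vector then combines the
# profiles of the article's entities.
import numpy as np

def entity_profiles(pub_terms, pub_entities, term_index, entity_index):
    profiles = np.zeros((len(entity_index), len(term_index)))
    for pub, terms in pub_terms.items():
        for e in pub_entities[pub]:
            for t in terms:
                profiles[entity_index[e], term_index[t]] += 1
    return profiles

def article_vector(pub, pub_entities, profiles, entity_index):
    rows = [profiles[entity_index[e]] for e in pub_entities[pub]]
    return np.mean(rows, axis=0)  # one plausible way to combine profiles
```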

Direct citation (c, u) versus hypersemantic data model (ol, ok)

Based on the NMI calculations, these four solutions form a core group of very similar solutions (see Fig. 3), even though they are based on two very different inputs to their data models. The similarity between these two sets of solutions is also reflected in the affinity networks in Figs. 4 and 5.

One significant difference between the two sets of solutions, however, is not captured by the NMI scores because their calculation is based only on those documents that are included in both solutions that are being compared. The direct citation based solutions and the hypersemantic solutions differ substantially in coverage. Whereas the hypersemantic data model includes all documents in the Astro Data Set, the direct citation based approach is applicable only to documents that have direct citations to other documents in the data set, and solutions c and u specifically included only the giant component of the direct citation network.

Interestingly, we find that a large proportion of the ca. 9000 documents omitted from the citation based solutions contribute to a single large topic in solution ok (ca. 6300 documents) and in solution ol (ca. 7800 documents). Based on the cluster labels, the topic seems to be space missions (see “Details of analysis of citation based versus semantic data models” in Appendix for details). We observe that these two clusters exhibit the lowest within-cluster citation rateFootnote 11 (≤20%), much lower than the within-cluster citation rates in the majority of clusters (≥40%). The resulting sparsely connected direct citation network makes it less likely for a citation based approach to construct a topic out of these documents.
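Read here as the share of a cluster’s citation links that stay inside the cluster (Footnote 11 gives the exact definition used), the within-cluster citation rate can be sketched as:

```python
# cluster_docs: set of document ids; citations: {doc: set of cited doc ids}.
def within_cluster_rate(cluster_docs, citations):
    inside = total = 0
    for d in cluster_docs:
        for cited in citations.get(d, ()):
            total += 1
            inside += cited in cluster_docs
    return inside / total if total else 0.0
```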

Bibliographic coupling (eb) versus hybrid (en)

As reported above, the purely bibliographic coupling solution eb and the hybrid solution en do not expose a great similarity based on their NMI score, although they partially overlap in their data model (a bibliographic coupling network), use the same clustering algorithm (Louvain), and have a similar number of clusters. A first observation from the affinity networks in Fig. 9 is that the addition of a lexical component in the data model for en has led to a greater aggregation of documents into topics: although en covers a slightly larger number of documents (109,376 versus 108,512), it distinguishes only 11 topics, while eb distinguishes 13 topics.

Fig. 9
figure 9

Annotated affinity networks for solutions en (left) and eb (right). Labels used are the entropy-based thesaurus term labels. The labels in bold are human generated to support readability. Nodes are numbered by size, and numbers in brackets refer to the original cluster index given by the creators of the solutions (e.g. 1(No4) refers to the largest cluster in the network, with original cluster index 4)

A more detailed analysis of the differences in the topological structure of the affinity networks in Fig. 9 is documented in “Details of analysis of citation based versus semantic data models” in Appendix. It suggests that some of the distinctive features of solution en when compared to eb can probably be explained by aggregation effects due to the lexical component in the data model of en, such as the relatively larger sizes of the Planetary Science and Astroparticle Physics domains when compared to eb. We find in our analysis that in eb research on extra-solar planets seems to be included primarily in Astrophysics, whereas in en it is split between Planetary Science and Astrophysics, thereby contributing to the bigger size of Planetary Science in en. The search for extra-solar planets is to a large extent about the close observation of stars and variations in their movement or radiation. Hence we can expect publications on the search for extra-solar planets to frequently reference the literature on stellar observations, resulting in close ties in the citation based data model of eb. This connection is weakened in en because its data model also considers lexical similarity, such that the use of terms relating to ‘planets’ in the publications about extra-solar planets strengthens their links into the planetary science literature. We speculate that a similar effect may be at work with regard to the literature on supergravity, resulting in en in a greater aggregation of documents into the Astroparticle Physics domain (see “Details of analysis of citation based versus semantic data models” in Appendix for details).

Second, we notice that in solution en the topic representing the Solar Physics domain is curiously placed at the Gravitation and Cosmology end of the affinity network, in contrast to the affinity networks of the other solutions, which place it at the other end, alongside Planetary Science. In a search of document titles we find that in en the term ‘plasma’ is relatively concentrated, with 71% of occurrences in the single Solar Physics topic, whereas in eb the concentration of the term ‘plasma’ in the Solar Physics topic is considerably lower, at 52% of occurrences. Further, the affinity network of en in Fig. 9 shows that the single document cluster that constitutes Solar Physics has relatively strong citation links to the topics of specific types of radiation sources (‘gamma-ray sources’, ‘x-ray sources’, and ‘gravitational wave sources’). This suggests that, due to the lexical component in the data model of en, documents that could have been placed into those latter topics based on citations were instead subsumed into the solar physics topic because of their use of terms like ‘plasma’.
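The title-based concentration check used here can be sketched as the share of title occurrences of a term that fall into one topic’s documents (the actual search procedure may differ in tokenization details):

```python
# term: lowercase search string; topic_docs: set of doc ids;
# titles: {doc_id: title string}.
def term_concentration(term, topic_docs, titles):
    hits = [d for d, t in titles.items() if term in t.lower()]
    return sum(d in topic_docs for d in hits) / len(hits) if hits else 0.0
```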

These observed effects of a lexical component in the data model on the topics constructed raise questions about a conceptual shift in perspective on topical relatedness and on what constitutes a topic. From a theoretical standpoint, citations constitute part of the scientific discourse and according to Gläser (2006) are an important step in the integration of the scientific knowledge base of a research specialty. By contrast, the lexical identity of terms, even when based on agreement about their semantic meaning, does not reflect the same type of topical relatedness as enacted and constructed in scientific discourse. The semantic component in the hybrid approach of en constructs topical relatedness based on lexical agreement in order to protect against social distortions in citation patterns. However, it does so at the price of de-emphasizing the discursive context that is expressed in citation patterns (that would make a distinction between plasma in the study of active galaxies versus plasma in the context of solar physics). The notable aggregation effects in the construction of topics by solution en discussed in this section suggest that this shift in perspective and its implications for the topical structure it constructs deserve further investigation.

Clustering: local clustering versus global clustering

Seven out of eight approaches use a global clustering approach. The clustering algorithms they use are designed to take information from the entire network into account when defining document clusters, and they produce document clusters that are disjoint, that is, each document is assigned to one cluster only. By contrast, the memetic clustering algorithm used to produce the hd solution builds clusters locally, starting from seeds and evaluating the immediate environment of each cluster to decide on the cluster membership of a node. This approach produces overlapping clusters and assigns to each document a strength of membership. The property of producing overlapping clusters would seem better aligned with theoretical considerations about the poly-hierarchical nature of topics, as argued in Havemann et al. (2017); however, the approach shares with the other approaches the unresolved methodological challenge of evaluating the appropriateness of the topics it constructs.

To explore the difference between the local clustering approach and the global clustering approach we compare solution hd to solution c that was produced using the same data model (a direct citation network) but with a different clustering algorithm. In our exploratory investigation we pursue the following strategy: we select two domains to investigate in detail, namely Astroparticle Physics and Gravitational Physics and Cosmology. We compare the lexical fingerprints of the topics that hd and c constructed in these two domains to see how they differ. We report our findings below (see “Details of analysis of local versus global clustering” in Appendix for details).

When comparing fingerprints of topics in Astroparticle Physics for hd and c (see Fig. 10) we observe that both solutions agree in identifying two major topics, one with a peak at ‘qcd’ (c7: 5363 documents, hd10: 5701 documents) and one with a peak at ‘standard model’ (c8: 5211, hd11: 5165 documents). In addition, hd offers a third distinct topic with a peak at ‘decays’ (hd18: 1812 documents). All other topics identified by hd seem to be variations of these three topics and tend to be smaller.

Fig. 10
figure 10

Comparison of (partial) fingerprints for Astroparticle Physics topics in solutions hd and c

The comparison of lexical fingerprints for topics in the domain Gravitational Physics and Cosmology reveals a similar picture, however without the discovery of a distinct new topic by hd (see “Details of analysis of local versus global clustering” in Appendix): the two solutions agree in the identification of four major topics, and the additional topics that hd identifies in the domain Gravitational Physics and Cosmology all seem to be smaller variants of the major ones.

This suggests that the local clustering approach reproduces the major topics identified by the global clustering approach. Importantly, it further offers an additional, more focused topic (‘decays’) that is not distinguished in the global clustering solution. At times it also produces a scatter of smaller topics that are largely redundant and seem to be close variants of a larger topic within the solution. We further observe with regard to the major topics retrieved that the local approach, since it allows for overlap, at times produces more inclusive topics (see discussion of hd5 versus c2 in “Details of analysis of local versus global clustering” of Appendix).

Discussion

In this paper we focus on the similarity and dissimilarity of different topic extraction methods and the topic solutions that they deliver. On a general level, as for instance relevant to information retrieval, we found substantial overlap in the representations delivered by the different approaches, especially if one views the topical structure of a field as a continuum (a cognitive landscape) rather than a discrete categorization, and focuses less on the rather artificially drawn borders of topics and more on their relative distances, as visualized e.g. by topic affinity networks.Footnote 12 However, for studying the emergence of something new (starting at a microscopic level), and for evaluation applied at a micro level such as groups or institutions, small differences between the topic structures constructed by these approaches matter.

The comparison of approaches in this paper provides first insights into the variability of the solutions delivered, and first suggestions of how specific features of approaches shape the topical structures that they construct. For a detailed analysis of how choices of data models, clustering algorithms, and parameter values link to specific features of solutions, a more systematic and comprehensive experimental design is needed that varies only one variable at a time. One would ideally study the complete space of solutions generated by all combinations of data models and clustering algorithms, together with a systematic scan of the parameters that determine the resolution of a solution. This would allow us to evaluate the relative roles of data model and algorithm in producing similarities and differences, and to explore the possible influence of the number of clusters as determined by the various resolution parameters of the different algorithms. If we had similar numbers of clusters, and each algorithm were forced to distribute the papers among this number of clusters: How likely would it be that the clusters are similar? Under what conditions could they be different? In terms of the effort required, a study of the entire solution space has been outside the scope of this activity. Instead, one of the main contributions of this paper (and its companion papers) is a set of methods for investigating differences between solutions, such as Ariadne, the lexical fingerprint analysis, and the affinity network visualization, which we expect will be valuable in such a future undertaking. One realization is that we still lack tools and methods to compare clustering solutions that generate overlapping topics.

The specific observations that we made in the comparative analysis of solutions give rise to a number of questions. The first observation concerns the similarity of solutions based on the ‘hypersemantic’ data model (ol and ok) to solutions based on a direct citation network (c and u); this was a rather surprising finding. Does it suggest a relative robustness of the topical features that are exposed by these methods?

Further, it has been interesting to see in our preliminary analysis that the local clustering approach hd, which allows overlapping topics to form, not only reproduces the major topics constructed by the other approaches (along with a scatter of smaller, similar topics), but also finds some new topics not detected by the other approaches. Does this suggest a greater sensitivity of this method to detect ‘bridging’ or ‘emerging’ topics that tend to be suppressed by approaches that only allow for disjoint topics?

Finally, the peculiarities of the en solution compared to the other (disjoint) solutions in this comparison might be due to the fact that it reflects a different perspective on the data. But we do not yet have a good grasp on how to identify and characterize such alternate perspectives on topical structures, and it remains an open challenge to establish the validity and usefulness of the different perspectives. We have encountered this issue in our discussion of external versus internal perspectives (see “Local versus global data” section), as well as in our discussion of the hybrid lexical approach in contrast to a citation based approach (see “Citation based versus semantic data models” section). We lack well articulated links between (alternate) theories of what constitutes a scientific topic and the operationalization of topics in topic extraction approaches through the way the data is modeled and the clustering algorithm is designed [see discussion in the introduction to this special issue (Gläser et al. 2017)]. As a next step toward a theory of topical structures in scientific fields, we envision taking a set of empirically extracted topical structures, such as the ones in this article, and exploring the different uses and properties of those perspectives in interaction with topic experts and users of topic extraction results (such as science policy consultants).

Conclusions

It seems evident that uncertainty about the appropriateness of a topical structure constructed by a topic extraction approach cannot be removed. Uncertainty may relate to the accuracy and completeness of the raw data used, to the validity of the operationalization of topics through the choice of data model and clustering algorithm, to the existence of undetected coding bugs, as well as to the interpretation of the topic extraction outcome [see the conceptualization of uncertainty in Arthur Petersen’s work on climate modeling (Petersen 2012)]. To what extent such uncertainty is acceptable would seem to depend on the purpose of the topic identification; it makes a difference whether the identification of topics is done to contribute a metric analysis to a science history argument (Burger and Bujdosó 1985), to be used during consultation in the discourse with experts, or for evaluation purposes (Hicks et al. 2015).

We would like to encourage future work on topic identification to re-use parts of the framework developed here to better describe and distinguish approaches (raw data, data model, algorithm, parameters). Ideally, in times of open science, algorithms and raw data should be shared—using existing Trusted Digital Repositories, which make software and data findable in the long term (Dillo et al. 2013). Because this proves problematic due to the mixed ownership of the data products used, we would like to call on the community—possibly also in collaboration with private information services—to create benchmark data sets that can be shared openly. The lack of such benchmark data sets [already remarked on by the IR community (Mayr and Scharnhorst 2015)] seems to hamper further methodological development in the field of scientometrics, and unnecessarily restrains discussions such as the one conducted in this special issue.

The general lack of benchmark data sets also widens the gap between those operating in the field of bibliometrics as professionals in research evaluation (bibliometrics as a service), those applying bibliometrics occasionally for such purposes, and those applying bibliometrics as one method among others to better understand the dynamics of science. If exclusive access to specific databases and tacit knowledge about the implementation of certain algorithms become the dominant regime, the further development of bibliometric methods comes to a halt, and it becomes more probable that other communities will re-enter the same problem space, simply ignoring lessons learned in the history of bibliometrics.

As a first step, the group behind this special issue on “Same Data, Different Results” reached an agreement with the primary owner of the Astro Data Set, originally Thomson Reuters, now Clarivate Analytics, to enable us to share the Astro Data Set with the wider scientific community. We would like to invite you to join us in constructing topical structures from this data set and in comparing our approaches and results. See the call for participation in a topic extraction challenge in this special issue (Boyack et al. 2017) or the website www.topic-challenge.info for further information.