Introduction

Innovation is a popular idea. One might describe it as a universal goal of businesses, universities, and scientists. Economists and business entrepreneurs see it as a means to sustain growth and increase productivity and revenue (Wong et al. 2005; Szirmai et al. 2011; Naranjo-Valencia et al. 2018). Universities push for the creation of innovative research and teaching styles (Amador et al. 2018; Kryukov and Gorin 2016; Kim 2015). Sociologists and animal behaviorists often explore innovation through new behaviors or ideas that propagate throughout a group (Brosnan and Hopper 2014; Ramsey et al. 2007; Shane 1993). Developmental biologists and geneticists view innovation in terms of gene mutations that lead to persistent changes in the output of regulatory networks (Wagner 2014; Davidson and Erwin 2010; Holland et al. 1994). All of these representations of innovation share a common theme: novelty, or invention, followed by adoption and propagation. A new idea or behavior is created and then integrated into the surrounding community.

The process of innovation is inherently collective, requiring shared consensus on novel strategies. In a sense, the process of innovation spreading represents one of the simplest and most ancient algorithms for computing optimal solutions to shared problems. Determining the conditions that contribute to successful innovation is therefore fundamental to an understanding of the broader subject of collective computation.

Here, we focus on identifying conditions that lead to successful innovations in modern science. Some have argued that innovation is easier at the fringes of fields, where novel connections can arise without being overly constrained by the status quo (Eduardo da Motta e Albuquerque 2007; Fitjar and Rodríguez-Pose 2011; Doloreux 2003). We aim to bring quantitative rigor to this hypothesis, asking: Are innovations indeed more likely to arise from areas outside the mainstream, or are well-established, better-tested lines of research in fact better able to inject long-lasting conceptual change?

Innovation can be a nebulous term with a variety of interpretations, even within a discipline. Economics, business, and finance often view innovation through a monetary and efficiency lens with a goal of greater sustained profit over time (Le Phi et al. 2017; Rondi et al. 2018; He and Tian 2018). Behavioral innovation is considered in terms of individual novelty followed by cultural transmission (Brosnan and Hopper 2014; Ducatez et al. 2015; Akre and Johnsen 2014). In a biological context, innovation is often viewed in a genetic and developmental framework (Werren 2011; Arendt et al. 2016; Peter Gogarten and Townsend 2005). Novelty and persistence repeat in each of these examples. The theme of novelty and persistence is echoed from Schumpeter’s and Brozen’s work, “Invention, Innovation, Imitation” (Brozen 1951). Our novelty equates to their idea of invention. Innovation and imitation are considered together in our notion of persistence. We will therefore base our operational definition of innovation on the joint occurrence of novelty and persistence.

The field of evolutionary medicine is ideal for our study of innovation because it offers a unique opportunity to study innovation within the context of the marriage between two distinct scientific fields: evolutionary biology, and human health and disease. Evolutionary medicine began with the article “Dawn of Darwinian Medicine” and the follow-up book Why We Get Sick: the new science of Darwinian medicine (Williams and Nesse 1991; Nesse and Williams 1994). These ideas expanded into an endeavor to better educate clinicians in the general principles of evolution as they relate to human health and disease (Nesse et al. 2009). The ideological framework of evolutionary medicine is novel and persists today as a collective innovation of sorts built on the individual innovations of its practitioners. We are particularly interested in evolutionary medicine because of the large amount of previous research on logical and biological innovations (Fagerberg 2004; De Quinn 2000; Fagerberg et al. 2005; Quandt et al. 2014; Cavalier-Smith 2002; Wagner and Altenberg 1996; Muller and Wagner 1991; Wagner 2018). This study examines the language of evolutionary medicine to computationally identify innovations and their origins across all publications in the field.

Fig. 1

Example of Bibliographic Coupling. Publication A cites Publication C, and Publication B also cites Publication C (dashed lines). Because they share this citation, Publications A and B are connected by an edge in the bibliographic coupling network (solid line)

We first characterize conceptual connections among publications in the corpus using a bibliographic coupling network (Fig. 1). Two publications are joined by an edge in this network when there is at least one other reference that they both cite. The structure of this network characterizes the relative popularity and connectivity of various concepts: manuscripts with a large degree use concepts that are more closely related to previous work, with clusters of highly-connected manuscripts citing the same popular references (Kessler 1963; Yan and Ding 2012). We expect concepts to be used heterogeneously, with some popular core concepts that accumulate large clusters of manuscripts, and some peripheral concepts that cite references that are dissimilar to other manuscripts in the corpus. We characterize the degree to which the conceptual network displays a distinct core and periphery using the rich club coefficient (Colizza et al. 2006; Opsahl et al. 2008). Plainly, the more connected, centrally located rich clubs form around important concepts—identified here as shared citations—in evolutionary medicine.

Armed with the conceptual network structure of an entire scientific discipline, we are then able to pinpoint where innovative ideas arise within the structure. Defining innovation in a way that includes novelty and persistence, we ask whether innovative science is more likely to occur incrementally, appearing first within well-established conceptual territory, or radically, beginning in less well-trodden conceptual areas.

Methods

Fig. 2

Identification of Innovative Publications. A step-by-step guide to identifying innovative publications in the evolutionary medicine corpus. We begin with a list that includes all members of the EvMed Network and editors, including contributors, from two evolutionary medicine textbooks (Trevathan et al. 1999; Gluckman et al. 2009). Each individual’s publications were downloaded in PDF format using the Web of Science (Thompson Reuters 2012). Those PDF files were converted to plain text using Giles (Damerow et al. 2017). With each publication in plain text format, WordSmith Tools was employed to create word frequency lists, which were shortened to lists of keywords that occur significantly more often than in the Baker-Brown Corpus (Scott 1999; Baker 2006). A total of 531,181 keywords were identified, of which 38,694 are unique. Publications were then ranked by innovation index I, which measures the number of novel and persistent keywords (those that were not present in any previous years but were present in all subsequent years). We selected for further analysis the top 100 publications with the largest I

Defining and creating the corpus

To create a comprehensive corpus of all publications within the field of evolutionary medicine, we started with (1) a list of individuals who self-identify as interested in evolutionary medicine, using The International Society for Evolution, Medicine, and Public Health (ISEMPH) global directory for interested scholars, clinicians, students, and community supporters (EvMed Network; Nesse 2018), and (2) contributing authors to two major textbooks on evolutionary medicine (Trevathan et al. 1999; Gluckman et al. 2009). This corpus has been previously used as an example of bibliographic coupling networks for historians of science (Painter et al. 2019). For each individual in this group, we gathered a comprehensive list of publicly available publications, resulting in a corpus of 6,456 publications. The Web of Science (WoS) database (Thompson Reuters 2012) was queried for each member’s publication history, allowing for the collection of PDFs. While there are undoubtedly publications in the corpus that do not pertain directly to evolutionary medicine, this corpus is overwhelmingly dominated by evolutionary medicine and contains the vast majority of all evolutionary medicine journal articles and books available through the Web of Science.

Defining the bibliographic coupling network

Bibliographic coupling is an indicator of conceptual similarity (Boyack and Klavans 2010; Zhao and Strotmann 2008; Jarneving 2007; Kessler 1963). Introduced in 1963 by M. M. Kessler, bibliographic coupling has recently attracted increasing attention (Kessler 1963; Boyack and Klavans 2010; Zhao and Strotmann 2008; Jarneving 2007; Small 1997). Two publications are considered to be bibliographically coupled if they cite one or more of the same publications (Fig. 1). In this study, we use unweighted edges (with an edge between two papers if they share at least one citation) in order to easily measure the rich-club coefficient, which is most simply defined in terms of unweighted networks (Gollo et al. 2015; Bassett et al. 2013). We make use of the bibliographic coupling network defined at yearly intervals: the network becomes larger over time as publications are added, such that we start with the smallest network in 2007 and end with the largest network in 2017.

The bibliographic coupling networks were created using the citation metadata provided by WoS for each publication. The network graphs are structured as \(G = (V, E)\), where V is the set of nodes representing publications and E is the set of edges between them representing shared citations. If publication A and publication B both cite publication C, the nodes representing A and B share an edge by virtue of C (Kessler 1963).
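As an illustration of the construction just described, the following minimal sketch builds an unweighted bibliographic coupling network from a mapping of publications to the references they cite. It is not the code used in this study: the function name `bibliographic_coupling`, the input format, and the toy data are assumptions made for the example.

```python
from itertools import combinations


def bibliographic_coupling(citations):
    """Build an unweighted bibliographic coupling network.

    `citations` (hypothetical input format) maps each publication ID to the
    set of references it cites. Returns (nodes, edges), where each edge is a
    frozenset {a, b} of two publications sharing at least one reference.
    """
    nodes = set(citations)
    # Invert the mapping: reference -> corpus publications that cite it.
    cited_by = {}
    for paper, refs in citations.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)
    # Any two publications citing the same reference are joined by one edge,
    # no matter how many references they share (unweighted network).
    edges = set()
    for citing in cited_by.values():
        for a, b in combinations(sorted(citing), 2):
            edges.add(frozenset((a, b)))
    return nodes, edges


# Toy data (hypothetical): A and B both cite C, as in Fig. 1; D shares nothing.
toy = {"A": {"C", "X"}, "B": {"C", "Y"}, "D": {"Z"}}
nodes, edges = bibliographic_coupling(toy)
# Bibliographic degree: number of other publications a paper is coupled to.
degree = {n: sum(n in e for e in edges) for n in nodes}
```

In this toy corpus, D has bibliographic degree zero and would sit at the extreme periphery of the network.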

The rich club coefficient \(\rho (k)\) (Colizza et al. 2006; McAuley et al. 2007) measures whether nodes with large degree (the “rich club”) have more connections among themselves than expected. Specifically, \(\rho (k)\) is defined as the ratio of the number of edges among nodes with degree larger than k to the number in a maximally random ensemble that shares the same degree distribution. In Fig. 6, we estimate \(\rho (k)\) using a standard method provided by the Python package networkx, sampling from the ensemble of random networks using the default number of 100N double-edge swaps.
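To make the definition of \(\rho (k)\) concrete, the sketch below reimplements the ratio using only the Python standard library. It is an illustrative stand-in for the networkx routine mentioned above, not the code used for Fig. 6; the function names, the number of random networks, and the `swaps_per_edge` default are assumptions chosen to keep the example small.

```python
import random


def degrees(edges):
    """Degree of every node that appears in an edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


def rich_edges(edges, deg, k):
    """Count edges whose two endpoints both have degree larger than k."""
    return sum(1 for a, b in edges if deg[a] > k and deg[b] > k)


def double_edge_swaps(edges, n_swaps, rng):
    """Randomize a simple graph while preserving every node's degree."""
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a swap here would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # a swap here would create a duplicate edge
        # Rewire a-b, c-d into a-d, c-b; each node keeps its degree.
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def rich_club_ratio(edges, k, n_random=20, swaps_per_edge=10, seed=0):
    """Observed rich-club edge count divided by its randomized average."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    deg = degrees(edge_list)
    observed = rich_edges(edge_list, deg, k)
    total = 0
    for _ in range(n_random):
        shuffled = double_edge_swaps(
            edge_list, swaps_per_edge * len(edge_list), rng
        )
        total += rich_edges(shuffled, deg, k)
    expected = total / n_random
    return observed / expected if expected > 0 else float("nan")


# Toy graph (hypothetical): a 4-node clique whose members each have 2 leaves.
toy_edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
toy_edges += [(i, f"leaf{i}{m}") for i in range(4) for m in range(2)]
rho = rich_club_ratio(toy_edges, k=1)
```

Because the randomized networks share the degree sequence of the original, the number of nodes with degree larger than k is the same in both, so a ratio of edge counts suffices. Values above one indicate a rich club.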

Defining innovation

Figure 2 displays the process for selecting the most innovative papers in the corpus. First, PDFs of all publications were converted to plain text using Giles, an existing platform for text extraction and optical character recognition (OCR) (Damerow et al. 2017). Word counts for each publication were analyzed to identify keywords, defined as those words that occur significantly more often than in the reference Baker-Brown Corpus of General American English (Baker 2006), using WordSmith Tools’ default significance threshold (p-value of \(10^{-6}\)). Many of the keywords existed within various areas of the greater scientific discourse. A reference corpus focused on scientific documents would preclude novel keywords from the corpus on the basis of their prevalence in other unrelated fields of research. Because our study is intra-corpus, comparing years within a single corpus, the general-English Baker-Brown Corpus is an appropriate reference to provide a base of comparison. Next, the keyword lists for each publication were compiled into yearly groups using WordSmith Tools (Scott 1999). A curated stoplist removed irrelevant words that offered nothing to the keyword analysis. The keywords are identified using the WordSmith Tools software and Cressie and Read’s log-likelihood test (Scott 1999; Cressie and Read 1989). Keywords were normalized for consistency through lemmatization (“diseases” changed to “disease”, etc.). Each yearly group was then compared to every subsequent year: 2007 is compared to 2008, 2009, 2010, and so forth.
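The keyword test can be sketched as follows. This example uses the common two-corpus log-likelihood (G2) statistic and refers it to a chi-square distribution with one degree of freedom; whether WordSmith Tools computes keyness in exactly this way is an assumption, and the function names and word counts are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood (G2) keyness statistic for one word."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in (
        (freq_study, expected_study),
        (freq_ref, expected_ref),
    ):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return max(0.0, 2.0 * g2)  # guard against tiny negative rounding error


def p_value(g2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2 / 2.0))


def is_keyword(freq_study, size_study, freq_ref, size_ref, alpha=1e-6):
    """True if the word is significantly over-represented in the study text."""
    over_represented = freq_study / size_study > freq_ref / size_ref
    g2 = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    return over_represented and p_value(g2) < alpha


# Hypothetical counts: a word seen 50 times in a 10,000-word publication
# and 5 times in a 1,000,000-word reference corpus.
flagged = is_keyword(50, 10_000, 5, 1_000_000)
```

Under this formulation, a word is retained only when its G2 value exceeds roughly 23.9, the chi-square value corresponding to a p-value of \(10^{-6}\) with one degree of freedom.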

Innovative publications are identified as having new keywords not present in previous years but that persist in subsequent years. This is consistent with Joseph A. Schumpeter’s and Yale Brozen’s definition of invention, innovation, and imitation (Brozen 1951). Here, invention is the novel appearance of a keyword. Schumpeter and Brozen equate innovation to a change in process. Finally, their imitation is simply the adoption by others. Our framework combines innovation and imitation into persistence. Innovative publications are then ranked by the number of keywords I that are used in later years, with the most innovative having the most keywords adopted into publications by the evolutionary medicine community.
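The innovation index can be expressed compactly in code. The sketch below counts, for one publication, the keywords that are absent from every earlier year and present in every later year of the corpus, following the definition given in the caption of Fig. 2. The data layout (a dictionary of yearly keyword sets) and all names are assumptions made for illustration.

```python
def innovation_index(paper_keywords, paper_year, keywords_by_year):
    """Count a paper's keywords that are both novel and persistent.

    novel:      absent from every earlier year in the corpus
    persistent: present in every later year in the corpus
    `keywords_by_year` (hypothetical layout) maps year -> set of keywords.
    """
    earlier = [kws for year, kws in keywords_by_year.items() if year < paper_year]
    later = [kws for year, kws in keywords_by_year.items() if year > paper_year]
    if not later:
        return 0  # persistence cannot be assessed for the final year
    count = 0
    for word in paper_keywords:
        novel = all(word not in kws for kws in earlier)
        persistent = all(word in kws for kws in later)
        if novel and persistent:
            count += 1
    return count


# Toy corpus (hypothetical keyword sets per year).
keywords_by_year = {
    2007: {"disease", "selection"},
    2008: {"disease", "selection", "microbiome", "virome"},
    2009: {"disease", "microbiome"},
    2010: {"disease", "microbiome"},
}
paper = {"microbiome", "disease", "virome"}
index = innovation_index(paper, 2008, keywords_by_year)

# Ranking papers by index, largest first (paper IDs are hypothetical).
scores = {"p1": index, "p2": 0}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here only "microbiome" qualifies: "disease" already appears in 2007, and "virome" does not persist after 2008.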

Results

Fig. 3

Publication years represented in the corpus. Most publications in evolutionary medicine were published after 2006 (note the log scale). To study innovation, we focus on years 2007 through 2014 (dashed orange lines), selecting 100 papers that contain the largest number of novel and persistent keywords

We created a corpus for evolutionary medicine that aims to include all publications in the field before January 2018 (see Fig. 3). Refer to the Methods section for details on the construction of our corpus. A previous study of evolutionary medicine as a field performed only a bibliometric analysis using metadata from search results (Alcock 2012). As far as we are aware, this is the first attempt at a comprehensive collection of evolutionary medicine full-text publications and associated keywords.

Fig. 4

Publications identified as innovative accumulate citations at a much faster rate. Plotted is the cumulative probability function for citations per year attained by evolutionary medicine papers published between 2007 and 2014 (measuring the proportion of papers with the given citation rate or lower). The 100 papers with largest innovation index I (orange) accumulated an average of 17.2 citations per year, while all papers (blue) accumulated an average of 6.4 citations per year. The 90% confidence intervals for randomly chosen subsets of 100 papers are shown as a blue shaded region. The two distributions differ significantly (KS-test statistic 0.23, \(p < 10^{-4}\))

We then identified manuscripts in the corpus that introduce innovative ideas. Defining an innovation index I that measures the number of novel keywords used by each manuscript that then persist in later years of the dataset (see Methods section and Fig. 2), we select from 4794 papers published between 2007 and 2014 the 100 most innovative papers (Fig. 3). As a test that our measure I is a good indicator of the successful spread of a paper’s ideas, we measure the rate at which innovative papers accumulate citations (Fig. 4). Indeed, compared to their contemporaries, the papers our keyword analysis identifies as innovative accumulate citations at more than twice the background rate.
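The comparison against randomly chosen subsets of 100 papers can be sketched as a simple resampling procedure: draw many random subsets, evaluate the cumulative proportion at a given citation rate for each, and report the central 90% of those values. The function names, the number of draws, and the synthetic citation rates below are assumptions for illustration and do not reproduce the data behind Fig. 4.

```python
import random


def ecdf(values, x):
    """Proportion of values less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


def subset_band(values, x, subset_size=100, n_draws=1000, level=0.90, seed=0):
    """Central interval of ecdf(x) over random subsets drawn without replacement."""
    rng = random.Random(seed)
    samples = sorted(
        ecdf(rng.sample(values, subset_size), x) for _ in range(n_draws)
    )
    lo_idx = int((1 - level) / 2 * n_draws)
    hi_idx = int((1 + level) / 2 * n_draws) - 1
    return samples[lo_idx], samples[hi_idx]


# Synthetic citation rates (hypothetical): 1000 papers with rates 0..999.
all_rates = list(range(1000))
low, high = subset_band(all_rates, x=499)
```

A curve for a selected group of 100 papers that falls outside such a band at some citation rate differs from what random selection alone would typically produce at that rate.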

Fig. 5

Periphery and Core of the 2008 Evolutionary Medicine Bibliographic Coupling Network. The red nodes represent publications and the gray edges represent a shared citation. The periphery of the network is characterized by loose connections that are often isolated from one another. This indicates a more unique set of citations when compared to the overall network. Conversely, the core contains publications with many shared citations resulting in denser areas of the network

Fig. 6

A rich club of scientific citations. Bibliographic coupling within evolutionary medicine has a rich club structure. A rich club coefficient above one indicates that more edges than expected connect the most highly-connected publications. The degree threshold, varied on the horizontal axis, sets the cutoff that defines the core. A clear rich club structure is detected for papers that are connected via bibliographic coupling to more than a few hundred other papers in the corpus

Next we analyze the conceptual network structure within the corpus. Consistent with a previously studied network of co-authorship (Colizza et al. 2006), the field of evolutionary medicine also displays a core-periphery structure in the network of bibliographic coupling (Fig. 5). The degree to which core manuscripts have distinct connectivity patterns from those in the periphery is quantified by the rich-club coefficient (Fig. 6), which measures the tendency of core manuscripts to be preferentially connected to other core manuscripts. Here we define the core as those publications having the largest bibliographic coupling degree. For instance, manuscripts that share a reference with more than 300 other manuscripts in the corpus (which make up 1.3 percent of the corpus) are about 1.5 times more likely to be connected to other manuscripts within that same core group, indicating a preference among these core manuscripts for citing the same set of popular publications. The core and periphery are clearly visible in visualizations of the network structure (e.g. Fig. 5).

This overarching core-periphery structure motivates us to ask whether innovative manuscripts arise more frequently in the core or in the periphery. Ordering manuscripts published between 2007 and 2014 by their bibliographic coupling at the year of publication, in Fig. 7 we compare the cumulative probability function of all 4794 papers from those years to the 100 most innovative papers. We observe that the curve of innovative papers (orange) lies above the confidence intervals for random samples from all papers (blue band), but only for small bibliographic degree. This indicates that, while papers with medium to large bibliographic degree contain innovations at the same rate that is expected based on the proportion of papers published with that degree, those with small bibliographic degree are disproportionately more likely to be innovative. For instance, looking to the far left of the plot, papers that begin with degree zero (that is, the papers they cite have not been cited by any other paper in the corpus to that point) make up over 20% of the innovative papers, but only about 10% of all papers.
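The two-sample Kolmogorov-Smirnov (KS) comparison used for these distributions can be sketched with the standard library alone. The statistic is the largest vertical gap between the two empirical cumulative distributions; the p-value below uses the classical asymptotic Kolmogorov series, which is an assumption here and may differ from the exact procedure used in this study. Function names are hypothetical.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value less than or equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution series."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0:
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Toy samples (hypothetical bibliographic degrees).
d_toy = ks_statistic([0, 0, 1, 5], [2, 3, 8, 13])

# Plugging in the reported statistic and sample sizes (0.158; 100 and 4794).
p_reported = ks_pvalue_asymptotic(0.158, 100, 4794)
```

With the statistic and sample sizes quoted for Fig. 7, this approximation yields a p-value of the same order as the reported 0.013.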

The specifics of our results deserve explicit statement. We found that innovation occurs by way of introducing a new idea, represented by a keyword not previously found in evolutionary medicine, which is then adopted by other practitioners of evolutionary medicine in the following years. Furthermore, those innovative papers that first introduced a new keyword occupy a position at the periphery of the bibliographic network that maps evolutionary medicine publications based on their shared foundational knowledge, i.e. shared references. Whereas the periphery of a bibliographic coupling network is inherently more interdisciplinary, the rich-club represents the field by virtue of evolutionary medicine’s paradigm (Kuhn 1962), which is already the interdisciplinary incorporation of evolutionary biology into medicine (Painter et al. 2019). We observed this phenomenon between 2007 and 2017; however, the workflow of constructing a corpus, extracting keywords, making comparisons, and building networks can be scaled up to handle different disciplines and larger timescales.

Fig. 7

Innovative papers are disproportionately likely in the periphery. The distribution of bibliographic degree at the time of publication (displayed as the cumulative probability of having degree less than or equal to the given value). Compared to all publications in the given years (blue), the 100 most innovative papers (orange) are more likely to have small bibliographic degree, indicating they are more likely to lie outside the rich club. As suggested by comparing to randomly chosen sets of 100 papers (90% confidence intervals shown as blue shaded region), the difference between the distributions is statistically significant (two-sample KS-test statistic 0.158, \(p = 0.013\))

Discussion

In the era of “big data”, the history of science is increasingly encountering datasets of unprecedented comprehensiveness and detail like the one we introduce here for evolutionary medicine. Attempting to characterize scientific fields and their dynamics through an analysis of every published word, we will need new ways to quantify aggregate properties and explain how they arise in networks of individuals.

In this paper, we take a step in this direction by defining two quantitative aggregate measures: a measure of innovation and a characterization of the core-periphery structure of concepts within the studied field. With these quantitative measures, we are then able to rigorously test a hypothesis about collective properties that lead to increased innovation.

Our measure of innovation starts with keyword analysis to objectively identify important, thematic areas of text (Baker 2012; Biber 2011; Bondi and Scott 2010), then selects keywords that display two established characteristics of innovation: novelty and persistence (Brozen 1951). The publications identified by this method as most innovative also accumulate citations at a significantly higher rate than average (Fig. 4). To our knowledge, this is a novel methodology to identify and quantify the nature of innovation through keyword extraction.

Our second quantitative measure characterizes the core-periphery structure of the conceptual network using the degree of bibliographic coupling. Many bibliographic networks display a rich-club network structure, with a core group of papers very tightly connected while a subset of papers is isolated or loosely connected to its neighbors. The rich club can be viewed as containing more popular or mainstream ideas while the periphery represents more novel or fringe ideology. The evolutionary medicine corpus also exhibits a rich club network structure with a distinct core and periphery (Figs. 5 and 6).

Our main finding in this work is that innovative publications occur significantly more often than expected in the periphery of the bibliographic coupling network. We conjecture that this may occur due to the flexibility afforded to these publications through their lack of strong connections to other publications. In the core, documents exhibit a high degree of sameness to other documents. The difficulty of producing persistent novelty may be greater when a publication is based on the same concepts used by many others in the field. Conversely, the publications at the periphery are less constrained by the high level of sameness in the core. It is worth noting that innovations do occur in the core, but not at a significantly different rate than would be expected based on the fraction of papers located there.

Timescales of conceptual dynamics

Though we have focused here on a particular decade (2007–2017), our method should be easily scaled to longer timescales. Over the scale of many decades, we expect to see dynamics for particular innovations that start at the periphery, accumulate new conceptual hubs, and create their own rich club core. The long term trends of innovative concepts could aid in identifying and tracking the behavior of authors and institutions that support such innovations. In particular, our discussion on core-periphery function suggests that authors and institutions may become more conservative after successfully moving from the periphery to the rich club core, a hypothesis that should be quantitatively testable with such longitudinal data.

Core-periphery function in collective computation

Having found that innovations are over-represented in the periphery of a scientific field, we may speculate about functional advantages of the core-periphery structure that is common to a number of scientific networks. While maintaining a thriving core may be crucial for developing existing ideas, allowing for a diverse periphery could be equally important to avoid stasis and to promote necessary adaptation. It is likely that, as previous work on bibliographic coupling networks suggests (Chen 2003; Small 2003; Biscaro and Giupponi 2014; Boyack and Klavans 2010; Kuusi and Meyer 2007; Ferreira 2018; Nettle and Frankenhuis 2019), the paradigm exemplified by a scientific field is, in a sense, defined by the stability of the rich-clubs, and we are more likely to encounter a Kuhn-like scientific revolution (Kuhn 1962) originating in the more volatile, interdisciplinary periphery.

More broadly, trade-offs between robustness and adaptability occur across a range of biosocial systems. For instance, living systems may tune their distance from a collective instability in order to be either more predictable and robust (further from the transition) or sensitive and adaptable (closer to the transition) (Daniels et al. 2017). The core-periphery structure may be viewed as a strategy for specialization of different parts of the network to different levels of flexibility.

In neuroscience, a clear rich-club structure has been found within brain networks (van den Heuvel and Sporns 2011; Bassett et al. 2013; Nigam et al. 2016). There is some evidence that this structure could allow for both robust stereotyped learned behavior (controlled by the core) and higher variance behavior needed for faster adaptation (occurring in the periphery). In particular, highly connected core areas are found to be slower to change during learning compared to peripheral areas (Bassett et al. 2013), and a detailed computational model of the macaque brain displays stable dynamics in the core simultaneous with unstable dynamics in the periphery (Gollo et al. 2015). Looking forward, we may be inspired to look for a similar dynamical pattern in scientific networks in which the rich-club is able to retain robust knowledge by virtue of its slowness to change.

Conclusion

This study proposes a framework for identifying innovation within a scientific field. Persistent novelty is used as a broad-sense identifier of innovation using keywords extracted from the plain text of publications. This is supported by the finding that these innovative publications accumulate citations at a significantly higher rate (see Fig. 4). We find these innovative publications are also significantly more likely to occur in the periphery of the network, outside the rich-club cluster that represents the foundational knowledge status quo (see Fig. 7).

More work is necessary to determine whether similar patterns of innovation are prevalent in other scientific fields and more generally in social systems, including businesses, universities, and other social institutions. If similar patterns of innovation occur, viewing these organizations through a network lens that incorporates functional differences between core and periphery could be crucial to understanding and guiding their evolution.

Our approach could be used to study network dynamics induced by innovations, as one might hypothesize they could serve as seeds for the growth of new rich-clubs. It could also be used to estimate the background frequency of simultaneous, independent innovations. In general, we contend that viewing innovations and their behavior in the context of a broader knowledge map will deepen our conceptual understanding of the origin of successful new ideas in society.

Innovation exists in a variety of forms. Our work identifies innovation through novelty and persistence. This is but one flavor of innovation. Future work from this research project will explore knowledge evolution using biological evolution as a framework. While we explore persistent novelty here, the recombination of existing ideas (in the core or periphery) is strikingly similar to the recombination of existing genes within genomes. We intend to explore and compare rates of keyword novelty with known rates of genetic mutations. In tandem, we will explore how knowledge networks (here, bibliographic coupling), which focus on the product, relate to co-authorship networks, which focus on the producers. We believe these future projects, combined with results presented here, will clarify a blurry, liminal space in which innovation enables a certain kind of possibility.