1 Introduction

“Economic change in all periods depends, more than most economists think, on what people believe.” (Joel Mokyr, The Enlightened Economy)

“Every historical act can only be performed by the ‘collective man’, and this presupposes the attainment of a ‘cultural-social’ unity […], on the basis of an equal and common conception of the world.” (Antonio Gramsci, The Prison Notebooks)

As the scholarly quest for the determinants of economic growth shifted attention away from factors such as labor, land and capital, a large literature identified scientific and technological progress as a key driver of development and prosperity (Bush, 1945, Jones, 2002, Pakes & Sokoloff, 1996, Romer, 1990, Stephan, 2012). In the last few decades, scholars also pointed to the role of culture, i.e., the set of shared beliefs, values, goals and traditions that a population holds and transmits over time, as a further determinant of the institutional choices and economic trajectories of a community (Alesina & Giuliano, 2015, Galor, 2011, Guiso et al., 2006, Landes, 1999, McCloskey, 2016, Mokyr, 2016, Spolaore, 2014, 2020, Williamson, 2000).

We know little, however, about the relationship between science and culture. If they do not only develop independently but also interact with each other, this relationship may represent a further variable of interest to understand economic change. Mokyr (2013, 2016), for example, advanced the idea that certain scientists introduce new sets of beliefs in a population with their discoveries.Footnote 1 The impact of these individuals, therefore, affects not only the production and diffusion of scientific knowledge, but also changes how people, more broadly, interpret the world around them.Footnote 2 Mokyr calls these scientists “cultural entrepreneurs”.

In this paper, we propose an approach to test empirically the impact of scientific progress on the broader culture, and we apply our methodology to one of the major advancements in the history of science: Charles Darwin’s theory of evolution by natural selection. Assessing how scientific progress affects cultural change presents several empirical challenges. First, one would need a long time horizon to analyze the interplay between public discourse and scientific progress. Second, unobserved factors and events (especially over extended periods) make inferring causal links difficult. A further complication is how to define and measure, in the first place, culture and cultural change.

Most of the existing literature in the economics of culture relies on survey-based measures of specific attitudes, such as trust, cooperation, or “civicness”, or on activities whose intensity plausibly correlates with some of those attitudes.Footnote 3 To our knowledge, there have not been attempts to define measures of culture related to new scientific ideas and discoveries. Moreover, most existing measures concern recent times and represent beliefs in a given moment.Footnote 4 To conceptualize and measure culture and cultural change over a long (historical) period, and to identify the inclusion of certain scientific ideas into the broader public discourse, we adopt a different approach. We rely on concepts, sources and tools from such fields of research as the humanities, sociology and ethnography. The underlying claim of these approaches is that language embodies values and beliefs, and is a major channel for communicating and transmitting them over time (Hamilton et al., 2016; Kirby et al., 2007; Lévi-Strauss, 1963; Nguyen et al., 2020; Whorf, 1956). Changes in the use of certain words and phrases, as well as in the meaning of a word, indicate changes in underlying beliefs and views of the world in ways that can be transmitted and further change (in a measurable way).

We study the evolution of certain phrases and expressions by performing digital text analysis on a large text corpus between 1820 and 1899. Written text, of course, represents only a part of language-based communication, together with oral exchanges (Michalopoulos & Xue, 2021). There are, moreover, relevant forms of non-verbal behavior and communication as well. The written word, however, is in general easier to record and measure. Moreover, the growth of the printing industry and the increase in literacy rates, especially since the 19th century, together with the role not only of academic and other non-fiction texts but also of fictional literature, especially through the diffusion of the novel (Lyons, 2003), make written language a major repository (and means of transmission) of broader values and beliefs. In fact, a relevant claim in the digital humanities and cultural linguistics literature is that digital text analysis or “distant reading” allows for the consideration of the “great unread”, i.e., the large quantity of texts that scholars normally do not study, but that, as a whole, represent the broader social and cultural climate and discourse at a given time (Cohen, 1999).

Although there is a general perception that the theory of evolution had broader influence, we know less about what concepts were particularly influential, how their influence evolved and entered the public discourse, and how long it took these ideas to diffuse beyond a narrow scientific community. The information that we retrieve from various text sources makes progress in addressing these questions.

The publication of On the Origin of Species, in 1859, made Darwin’s theory known to a vast public; moreover, the timing of the publication was largely unplanned. We rely on this event as our main source of natural variation. The main corpus on which we perform our text analysis is Google Books, a digitized collection of about eight million volumes. We define the publication year of On the Origin of Species as our reference date and concentrate the analysis on the four decades before and after it (1820–1899). We consider words and expressions that, according to many accounts, embody the key concepts of Darwin’s theory (Desmond & Moore, 1994; Mayr, 1982): Evolution, Survival, Competition, (Natural) Selection, and Adaptation. We compare the evolution of the frequency of use of these Darwinian words with a large number of words not related directly to Darwin’s theory but extensively used in On the Origin of Species. We then complement the frequency analysis on the Google Books corpus with evidence from UK Parliamentary Transcripts and US Congressional Records. With these additional corpora, we explore how certain concepts diffused not only in the cultural discourse, but also in the political arena, thus potentially shaping the policy debate. In addition to frequency of use, we assess semantic changes and the evolution of attitudes toward Darwinian concepts as additional measures of cultural change, applying word-embedding techniques.Footnote 5

We show, first, that some key concepts in Darwin’s theory increased their diffusion in the broader cultural discourse in the years immediately following the publication of On the Origin of Species: Natural Selection, Evolution, Survival and Competition. The patterns of diffusion of these words and expressions were similar in the non-fiction and fiction literature; this indicates that the underlying concepts had a broad impact on culture as well as on the social imaginary as represented, for example, by short stories and novels. Other concepts such as Selection and Adaptation did not experience a change in the rate of diffusion. We also document that some of the key Darwinian ideas entered the policy debate after 1859, but with some delay with respect to the entry of these concepts in the broader public discourse. The effects of On the Origin of Species, moreover, were not specific to the English-speaking world; the Darwinian concepts diffused in non-English speaking countries right after the translation of the book into the corresponding language. Moreover, the translation occurred earlier in countries that industrialized earlier, such as Germany and France, than in “late comers” such as Italy and Spain.

The second set of results concerns changes in the semantics of these words as well as in the types of reactions, or sentiments, that they generated over time. Of interest is, for example, the increase in semantic association between certain words, such as Competition and Life, as well as between Life and Adaptation. The term Evolution, which came mostly from chemistry and physics in the first half of the 1800s, later in the century related more to concepts from biology as well as social and human subjects, indicating a broader reach of this idea in society. We also document an increased similarity between Evolution and words related to the traditional view of the Christian doctrine about the origins of the world, such as Creation and Genesis; this suggests a process of “secularization” of these ideas. Furthermore, Selection became more similar in meaning to other “Darwinian” words, such as Survival, Variation, Fittest and Heredity. Sentiment analysis shows a more positive attitude toward certain Darwinian concepts after the publication of On the Origin of Species, in particular Evolution, and a positive attitude toward Darwin himself.

Finally, we show that the word “Darwin” diffused more in the literature than the names of other major scholars in the same area (Lamarck, Chambers and Wallace), and that the semantic association of the focal concepts that we consider was higher with the name “Darwin” than with the other names. This suggests that these ideas were particularly associated, in the public discourse, with Darwin’s work and not just generically with the progress in the biological sciences of the time or ideas that were “in the air”.

The relationship between scientific discoveries and the public discourse may also contribute to understanding deeper social and political processes, such as the extent to which, to cite Alexander Hamilton’s reflections in the Federalist Papers, a society is based on a “culture of reason and evidence”. If a culture that values scientific inquiry is more likely to promote economic development, and scientific breakthroughs contribute to the evolution of culture in this direction, then studying this relationship acquires additional value. We see our approach as a fruitful one to investigate also the impact of other scientific breakthroughs in history.

In Sect. 2, we provide a brief account of Darwin’s elaboration of the theory of evolution by natural selection. We also explain why the publication of On the Origin of Species provides natural variation that allows studying the effect of Darwin’s theory on the broader public discourse. In Sect. 3, we describe the text-based data that we use and the techniques and empirical strategies that we adopt to extract information about cultural change. Section 4 reports the findings. In Sect. 5, we provide a discussion and propose directions for future research.

2 Historical background and identification

“It is doubtful if any single book, except the ‘Principia’, ever worked so great and so rapid a revolution in science, or made so deep an impression on the general mind.”

Obituary for Charles Darwin, Proceedings of the Royal Society of London, 1888.

2.1 The development of Darwin’s theory of evolution

Charles Darwin’s interest in the evolution of living organisms largely developed during his voyage on the HMS Beagle, a ship of the Royal Navy, from 1831 to 1836. Over those five years, Darwin collected fossils from the places that he visited and observed their geographical distribution. Although his early conjectures built on previous theories (such as Lamarck’s and Chambers’) and considered the possibility of the transformation of one species into another (transmutation), he then developed his own theory of evolution based on the natural selection of the most adaptive (innate) characteristics of a species. Small, gradual variations within a species would emerge randomly, and would lead to the branching of new species. Competition for resources and adaptive capacities would determine whether and where a particular species would be more likely to thrive. The developments in genetic research in the 20th century provided corroboration and foundations for Darwin’s theory (Desmond & Moore, 1994; Mayr, 1982).Footnote 6

Beyond being one of the greatest scientific breakthroughs in history, Darwin’s theory of evolution is widely perceived to have had a broader cultural reach (Desmond & Moore, 1994; Fuller, 2017; Mayr, 1982, 2001). Research in literary criticism analyzed how the production of certain poets and novelists began to reflect the competition and “struggle” for resources, the common origins of species (including humans), and a new conception of the role of nature and God in the creation.Footnote 7 Mokyr (2013, 2016) includes Darwin among a small set of “cultural entrepreneurs”, i.e., scientists whose discoveries affected deeply held and broadly shared popular beliefs. These accounts, however, focus on a narrow set of literary contributions or debates mostly restricted to scientific, political and economic elites, or a few highly successful literary works; this makes it hard to advance inferences about the broader cultural impact of this scientific advance, and about the cultural climate that preceded that breakthrough. Our approach to answering these questions, based on large text corpora, allows going beyond the analysis of a small set of texts and authors as a way to extrapolate general cultural views and trajectories.

2.2 The publication of On the Origin of Species as a source of natural variation

Some features of how Darwin made his work public enable us to identify the impact of his work on the broader cultural discourse. Although Darwin developed his theory over a long period, there is a precise time at which the theory reached the broader public: 1859, the year of publication of On the Origin of Species.Footnote 8 This publication date was largely unplanned. Darwin proceeded slowly initially and had to deal with sickness and deaths in his family that further delayed him. Eventually, however, he “rushed” in order not to lose priority over Alfred R. Wallace, who was researching the same topics and had sent Darwin some of his writings that developed similar concepts and reached similar conclusions about natural selection.

The book and Darwin’s theory received almost immediate attention and diffusion, thanks to presentations at scientific meetings such as the Linnean Society (of a joint paper with Wallace in 1858) and the British Association for the Advancement of Science (in 1860), as well as reviews in the popular press (see for example Gray, 1860; Huxley, 1859).

The unplanned publication date of Darwin’s theory provides the main source of variation for our empirical study. The rapid diffusion of the theory gives us an opportunity to observe the effect on the diffusion of the main concepts, and to establish which ones were especially novel and had an independent impact on the broader public discourse.

To be sure, On the Origin of Species was not the first treatment of evolution. Darwin’s theory was novel in several ways and more coherent than previous ones. However, earlier in the 19th century some related ideas were already elaborated and discussed; examples include the work of Lamarck, the anonymous Vestiges of the Natural History of Creation (later attributed to the Scottish journalist and publisher Robert Chambers), and of course the work of Alfred R. Wallace. Herbert Spencer, moreover, published his Principles of Biology, which applies some Darwinian concepts to society and ethics as well as to the natural sphere, in 1864. Our empirical strategy, however, allows assessing whether the publication of Darwin’s book represented a discontinuous change in the cultural discourse. In the analyses reported below we also attempt to address, with a variety of empirical exercises, the issue of whether the theory of evolution as Darwin presented it was already “in the air”.

3 Data and methods

“The limits of my language mean the limits of my world.” Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922).

To examine the diffusion and the evolution of the meaning and interpretation of scientific concepts over time, we exploit the increasing availability of digitized text corpora, as well as the tools of natural language analysis. We rely on recent work at the intersection of the humanities, linguistics, cultural studies and computer science, which uses the frequency of use of words in large text corpora and their semantic evolution as measures of changes in the public discourse and shared beliefs. Relying on natural language processing techniques, this line of research has explored, for example, the evolution of cultural trends as expressed by both the frequency of use of certain words and phrases and the change of their meaning, the evolution of literary styles, and scholarly influence on various areas of research.Footnote 9

These approaches focus on a particular source of expression and communication of cultural beliefs, i.e., formal language, especially through the written word. This excludes more informal, but not less important, means of development and transmission of values and worldviews. The increasing availability of text in digital form for longer time periods, and the comparability of written text over time because of its “standardized” nature, provide, despite the limitations, a fruitful direction to learn, measure and compare ideas, beliefs and values from the past. In fact, research in economics and economic history is also increasingly using text as a source of data. Kelly et al. (2021), for example, apply text analysis techniques to patent data to build new measures of scientific and technological breakthroughs. Michalopoulos and Xue (2021) rely on transcriptions of oral communications, such as folktales, and relate them to specific cultural traits such as trust, risk-aversion, and gender norms across countries and ethnicities.Footnote 10

By operationalizing concepts with certain words and phrases, our specific objective is to document which ideas emerged as novel in society following our scientific breakthrough of interest, and whether they influenced different cultural spheres over time. The first step in this investigation is to compute relative frequencies of some key words that embody the main concepts in Darwin’s theory of evolution, and that Darwin used extensively in his own work. These frequencies represent a basic measure of the adoption of certain ideas in the broader cultural and social discourse. Our investigation then focuses on word embeddings for the analysis of semantic and sentiment change (Aiden & Michel, 2014; Manovich, 2009; Michel et al., 2011; Roth, 2014).

3.1 Word frequencies

We first rely on Google N-GramsFootnote 11 (Lin et al., 2012) to assess how frequencies of words changed over time in fiction and non-fiction literature. The Google N-Grams data are a result of the Google Books project to build a vast collection of digitized books in partnership with major libraries.Footnote 12 First released in 2010, the data consist of a set of corpora of roughly eight million books, an estimated 6% of all books ever published (Lin et al., 2012). The texts cover roughly a 500-year span and are continuously updated. The database includes different languages (besides English: Italian, French, German, Spanish, Russian, Hebrew, and Chinese). The English corpus alone contains half a trillion words. For the period that we consider (i.e., 1820–1899), there are about 380,000 books containing more than 45 billion words in total.Footnote 13 The data include both fiction and non-fiction books, but not periodicals, and are aggregated by the number of terms considered; for instance, the 1-gram dataset includes single words and their frequency in a given corpus, and n-grams are combinations of n words and their frequency. We compute frequencies from the 1-gram and 2-gram data for each year and express them in per-million-words terms.
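As a minimal sketch of the per-million-words normalization described above, assume that yearly occurrence counts for a term and yearly total word counts for the corpus are already available. The function name, the dictionaries and all numbers below are hypothetical placeholders for illustration; they are not values from the Google Books data.

```python
def per_million(match_counts, total_counts):
    """Relative frequency per one million words, by year.

    match_counts: {year: occurrences of the word or phrase in that year}
    total_counts: {year: total number of words in the corpus in that year}
    Both inputs are hypothetical stand-ins for the n-gram count files.
    """
    freqs = {}
    for year, total in total_counts.items():
        if total <= 0:
            continue  # skip years without text to avoid dividing by zero
        freqs[year] = 1_000_000 * match_counts.get(year, 0) / total
    return freqs


# Made-up illustrative counts, not actual corpus values.
matches = {1858: 12, 1859: 30, 1860: 95}
totals = {1858: 400_000_000, 1859: 410_000_000, 1860: 420_000_000}
frequencies = per_million(matches, totals)
```

When fiction and non-fiction are analyzed separately, the same function would be called with the word totals of the respective sub-corpus as the denominator.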

The ability to separate fiction and non-fiction literature is relevant to us for two reasons. First, one critique of the N-Grams (and Google Books) corpus is that it may over-represent scientific texts (Pechenick et al., 2015). In our study, increases in the frequency of words related to Darwin’s theory may just reflect a disproportionate increase over time of the corpus of scientific books (included in the non-fiction category). Second, separating fiction and non-fiction literature enables the analysis of different types of relationships between Darwinian science and broader culture. The use of Darwin’s concepts in the non-fiction literature may better represent higher-educated or more erudite conversations. Conversely, given the diffusion of the novel, including among the lower-middle classes, and the relatively high literacy rates especially in England and the United States in the 19th century (Lyons, 2003), fictional literature may better measure the social imaginary (Armstrong, 1987; Winans, 1975).

Whilst the Google Books data allow us to measure Darwin’s influence on the broader cultural discourse, we also aim to assess whether his ideas diffused in the political discourse – and thus, potentially, in the policy process. We rely on the digitized collections of the UK Parliamentary Debates (Hansard) and the U.S. Congressional records (ProQuest’s Congressional Record Permanent Digital Collection).Footnote 14 The former includes reports of all discussions occurring in the House of Commons and House of LordsFootnote 15; the latter focuses on debates in the House and Senate.Footnote 16

3.2 Word meaning and embeddings

The analysis of word frequencies is informative, but does not provide insights about how a given word was used and how society perceived it. The semantic changes and the evolution of attitudes toward a concept may be a more appropriate measure of cultural change if one interprets the meaning of a word as the association of that word with other concepts and ideas, and the attitudes toward a concept as whether that concept had a positive or negative reception.

Natural language processing employs word-embedding techniques to determine the meaning of, and sentiment toward, words from large text corpora, and their evolution over time. The main idea of word embedding is that we can evaluate semantic associations between words by analyzing co-occurrence patterns in a text. Two words of similar meaning are unlikely to appear, say, in the same sentence, but they are likely to be surrounded by similar words. For example, we would not expect to read the word “monarch” within five words before or after the word “queen”; however, there is plausibly high overlap between the words that appear immediately before and after “queen”, and those that appear before and after “monarch”.
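The intuition can be illustrated with a toy sketch that collects the words appearing within a fixed window around a target word. The two sentences, the helper name `context_words` and the window size are made up for illustration; the embedding models discussed below use such windows to build training pairs rather than computing set overlaps directly.

```python
def context_words(sentences, target, window):
    """Words found within `window` positions of each occurrence of `target`.

    `sentences` is a list of tokenized sentences; windows do not cross
    sentence boundaries.
    """
    contexts = set()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token == target:
                contexts.update(tokens[max(0, i - window):i])    # words before
                contexts.update(tokens[i + 1:i + 1 + window])    # words after
    return contexts


# Toy sentences invented for this example.
sentences = [
    "the queen ruled the kingdom wisely".split(),
    "the monarch ruled the kingdom wisely".split(),
]
queen_ctx = context_words(sentences, "queen", window=2)
monarch_ctx = context_words(sentences, "monarch", window=2)
shared = queen_ctx & monarch_ctx  # context words the two targets have in common
```

Here the two target words never co-occur, yet they share their entire context, which is the pattern that embedding algorithms exploit.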

The outcome of word-embedding algorithms is a set of vectors that include information about co-occurrence patterns among words. Consider for example a text corpus with V words w (w = 1, 2,…, V). For each word, one can specify a subset of “context words”, i.e., terms that appear within a window of m words before and after w. The objective is to represent each word w as an Nx1 vector, with N < V determined by the researcher, where the entries carry information about how frequently w occurs with the context words. We rely on the Word2Vec approach (SkipGram with negative sampling; Mikolov et al., 2013), a technique that studies of semantic change have used extensively (e.g., Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2017). The Word2Vec model is based on a neural-network structure that we represent, in simplified form, in Fig. 1A. The starting point is the definition of Vx1 one-hot vectors for each focal word w, i.e., vectors of all 0’s except for an entry of 1 in the position corresponding to the word of interest. Two matrices, called the embedding and context matrices, are initially filled with random weights that the training process updates. For each word w, the algorithm multiplies the one-hot vector, or input layer, by the VxN embedding matrix to obtain an Nx1 vector, called the hidden layer. This operation simply “copies” into the hidden layer the row of the embedding matrix that corresponds to the word w. In turn, multiplying the hidden layer by the NxV context matrix produces the Vx1 output layer. The V entries (or scores) in the output layer go through a soft-max activation function, which maps the scores to a probability distribution. The probability vectors have values that range from 0 to 1 and sum to 1.Footnote 17
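The forward pass described above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the vocabulary size, the embedding dimension and the random seed are arbitrary, and the sketch applies a full soft-max over the vocabulary, whereas the negative-sampling variant used in practice approximates this step.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 3  # toy vocabulary size and embedding dimension (N < V)

# Both matrices start with random weights, as described in the text.
embedding = rng.normal(scale=0.1, size=(V, N))  # V x N embedding matrix
context = rng.normal(scale=0.1, size=(N, V))    # N x V context matrix


def one_hot(index, size):
    """Vector of zeros with a single 1 at the position of the word."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def softmax(scores):
    """Map raw scores to probabilities that sum to one."""
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()


def forward(word_index):
    x = one_hot(word_index, V)  # input layer, length V
    hidden = x @ embedding      # hidden layer, length N
    scores = hidden @ context   # output layer, length V
    return hidden, softmax(scores)


hidden, probs = forward(2)
```

Because the input is one-hot, `hidden` equals row 2 of `embedding`, which is why the multiplication amounts to a row lookup.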

Fig. 1

Word2Vec model and Sentiment Analysis with Embedded Vectors. Panel A: Word2Vec. Notes: The diagram illustrates the structure of a Word2Vec model. Each word is encoded as a binary (one-hot) vector of dimension Vx1. The embedding matrix (VxN) and the context matrix (NxV) are initialized with random weights (note that N < V). The multiplication of the initial one-hot vector and the embedding matrix gives us the embedding vector of the input word we are currently considering. This embedding vector forms a hidden layer of dimension Nx1. The multiplication of the hidden layer and the context matrix forms the output vector, which becomes a probability vector after a soft-max transformation. This vector can be readily compared to the one-hot vector that identifies the considered context word (i.e., target vector). The difference between the probability and the target vector modifies the scores of the embedding and context matrices through a backpropagation mechanism, so that the weights are adjusted according to the actual co-occurrence of words. Panel B: Sentiment Analysis. Notes: The figure shows a subset of the pairs of words used in the paper to span the “Morality” dimension. Two embedded vectors for the word Evolution, for the periods 1820–29 and 1890–99 respectively, are drawn and projected on the dimension. This figure exemplifies a change in perception that a word can go through over time.

These vectors can now be compared to the “target” one-hot vector of a given context word c; subtracting the probability vector from the “target” vector yields a vector of errors. Using this information, a backpropagation mechanism (Rumelhart et al., 1986) updates the weights in the embedding and context matrices. The training process proceeds by considering all combinations of words w and context words c.
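A single update of this kind can be sketched as follows. The sketch assumes a full soft-max with a cross-entropy loss and plain gradient descent; the function name `train_step`, the learning rate and the toy dimensions are illustrative choices, and the negative-sampling variant of Word2Vec replaces the full soft-max with a cheaper approximation.

```python
import numpy as np


def softmax(scores):
    exp = np.exp(scores - scores.max())  # stabilized soft-max
    return exp / exp.sum()


def train_step(embedding, context, w, c, lr=0.1):
    """One gradient update for focal word index w and context word index c.

    embedding is V x N and context is N x V; both are updated in place.
    Returns the cross-entropy loss measured before the update.
    """
    hidden = embedding[w].copy()         # row of the embedding matrix for w
    probs = softmax(hidden @ context)    # predicted distribution over V words
    target = np.zeros_like(probs)
    target[c] = 1.0                      # one-hot target for the context word
    error = target - probs               # target minus probability vector
    grad_hidden = context @ error        # error propagated back to the hidden layer
    context += lr * np.outer(hidden, error)  # adjust the context matrix
    embedding[w] += lr * grad_hidden         # adjust the embedding row of w
    return float(-np.log(probs[c]))


rng = np.random.default_rng(1)
V, N = 6, 3  # toy sizes, chosen arbitrarily
embedding = rng.normal(scale=0.1, size=(V, N))
context = rng.normal(scale=0.1, size=(N, V))

# Repeating the update for one (word, context) pair lowers the loss.
losses = [train_step(embedding, context, w=2, c=4) for _ in range(50)]
```

In actual training the update runs over all observed (w, c) pairs in the corpus rather than over one repeated pair.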

The final output consists of a VxN “embedding matrix”. Each row in the matrix is the vector representation of one of the V words w, where each entry is a coordinate in an N-dimensional space and carries information about the context. The embedded vectors satisfy some “linearity” features in the relationship between, for example, the singular and plural form, or feminine and masculine version, of a word. Using a frequent example in the literature, we expect that, for the word vectors corresponding to king, kings, queen, queens, man and woman, the following holds: (king–kings) ≈ (queen–queens) and (king–man) ≈ (queen–woman).

The closer two word vectors are in this N-dimensional space, the stronger the semantic association between the two words. The main metric of the proximity between vectors is the cosine between them (Dubossarsky et al., 2015; Gulordava & Baroni, 2011; Jatowt & Duh, 2014; Kim et al., 2014; Kulkarni et al., 2015). Call \(\gamma\) the angle between two N-dimensional vectors \(u=(u_{1},\dots,u_{N})\) and \(v=(v_{1},\dots,v_{N})\). Then, \(u^{\prime}v=\sqrt{\sum_{i=1}^{N}u_{i}^{2}}\,\sqrt{\sum_{i=1}^{N}v_{i}^{2}}\,\cos(\gamma)=\Vert u\Vert \Vert v\Vert \cos(\gamma)\), or: \(\cos(\gamma)=\frac{u^{\prime}v}{\Vert u\Vert \Vert v\Vert}\in[-1,1]\). The more similar the two vectors, the closer the cosine is to one.
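The cosine formula translates directly into code. The following is a minimal sketch with NumPy; the function name is an illustrative choice, and the explicit check for zero vectors is an added safeguard that the formula itself leaves undefined.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u'v / (||u|| ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(u @ v / norm_product)
```

Vectors pointing in the same direction give a value of one regardless of their length, orthogonal vectors give zero, and opposite vectors give minus one.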

We investigate whether the words that defined the main Darwinian concepts shared context words with different terms before and after the publication of On the Origin of Species. We rely on previously trained Word2Vec embeddings resulting from the N-grams distributed by Google Books (). Figures are available for every decade between 1800 and 1990 and data are designed to enable comparisons across decades. The models use a context window of four context words and parameters as suggested by Levy et al. (2015) to measure semantic changes in cultural shifts.

The measure of semantic similarity, and more generally embedded vectors, involve many dimensions; each vector is projected in an N-dimensional space and the measure of semantic similarity considers all these dimensions to assess whether vectors are located close to or apart from each other. Each dimension explains some of the variance that distinguishes association patterns among all the words in a text corpus. It is hard, however, to interpret each dimension in practice, i.e., to give a univocal explanation of why two vectors are close (or apart) when considering one specific dimension. One might therefore consider projecting these vectors on a limited subset of pre-defined dimensions and evaluating the position of each vector in the new space, to measure the association of a given word with a set of more narrowly defined underlying concepts.Footnote 18 We specify some classes that might be relevant for gauging the sentiments surrounding key Darwinian concepts over time: “goodness”, “importance”, and “morality”. In order to create a given dimension, we have to define a list of words that express it. Following Jenkins (1958), who specified a list of terms for a variety of cultural dimensions, we consider pairs of antonym words related to the three areas that we want to measure. We then average the differences of all the vector pairs: \(\frac{1}{\left|P\right|}\sum_{p=1}^{\left|P\right|}\left(\overline{p_{1}}-\overline{p_{2}}\right),\) where \(\overline{p_{1}}\) and \(\overline{p_{2}}\) are the vectors of the two antonym words in pair p, and \(\left|P\right|\) is the number of pairs. The resulting vector represents an “average” dimension of the general underlying concept we aim to capture (e.g., Morality).

Finally, we calculate the similarity (or projection) of the vectors of key Darwinian words on each dimension and track the similarity over time. Figure 1B provides a simple graphical representation of how, for example, the word “Evolution” moves on the “Morality” spectrum. Each side includes a word that has an antonym on the opposite side (e.g., sinful, virtuous). The average of the differences between the paired vectors forms a new dimension (represented by the underlying thick gray arrow in the figure) that should represent the concept of Morality in a comprehensive way. In practice, we measure the similarity between the word Evolution and the dimension spanned by \((\overrightarrow{good}-\overrightarrow{evil})+(\overrightarrow{moral}-\overrightarrow{immoral})+(\overrightarrow{virtuous}-\overrightarrow{sinful})+\dots\) By measuring the similarity by decade, we are able to assess how much this concept was deemed to be “moral” over time. The more positive a projection,Footnote 19 the stronger the association of Evolution with Morality.
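The construction of a dimension from antonym pairs and the projection of a word on it can be sketched as follows. The two-dimensional vectors, the word keys and the helper names are made up for illustration (actual embeddings have many more dimensions), and the sketch assumes that the projection is measured with cosine similarity.

```python
import numpy as np


def build_dimension(vectors, antonym_pairs):
    """Average of the differences between the two vectors of each antonym pair."""
    diffs = [vectors[positive] - vectors[negative] for positive, negative in antonym_pairs]
    return np.mean(diffs, axis=0)


def projection(word_vector, dimension):
    """Cosine similarity between a word vector and the constructed dimension."""
    norms = np.linalg.norm(word_vector) * np.linalg.norm(dimension)
    return float(word_vector @ dimension / norms)


# Hypothetical toy embeddings, not taken from any trained model.
vectors = {
    "good": np.array([1.0, 0.2]),
    "evil": np.array([-1.0, 0.1]),
    "moral": np.array([0.8, -0.1]),
    "immoral": np.array([-0.9, 0.0]),
    "evolution_1820s": np.array([-0.2, 1.0]),
    "evolution_1890s": np.array([0.6, 0.8]),
}
morality = build_dimension(vectors, [("good", "evil"), ("moral", "immoral")])
early = projection(vectors["evolution_1820s"], morality)
late = projection(vectors["evolution_1890s"], morality)
```

In this toy setup the later vector has the larger projection, which would be read as the word moving toward the positive end of the dimension; with decade-specific embeddings the same calculation would be repeated for each decade.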

4 Findings

In the first part of this section, we analyze the evolution of the relative frequency of key words in Darwin’s theory and expressions as measures of the diffusion of key concepts in the public discourse around the time of the publication of On the Origin of Species in 1859. Then, we move to the analysis of semantic and sentiment changes concerning these words.

4.1 Frequency analysis

We consider terms (1-grams) that, from many accounts (Desmond & Moore, 1994, and Mayr, 1995), as well as our own reading, represent the key concepts in Darwin’s theory: Evolution, Selection, Adaptation, Competition, Survival, and the expression (2-gram) Natural Selection.

We begin by computing the frequency of these words and expressions, overall and separately for fiction and non-fiction books, and in comparison with other frequent nouns in On the Origin of Species. Second, we assess the diffusion of the main Darwinian concepts in languages other than English, in order to explore the diffusion of the theory of evolution in other cultures, and whether it happened with some delay. Third, we attempt to isolate the contribution of Darwin to the public discourse from the general “presence in the air” of ideas about evolution, by tracing the use of the word Darwin itself, as opposed to the names of other scientists engaged in that field. Finally, we explore the diffusion of Darwin’s ideas in the political debate.

4.1.1 Darwinian and “control” concepts; fiction and non-fiction books

Figure 2 reports the frequency of use of the key Darwinian terms Evolution, Selection, Adaptation, Competition, Survival, and the expression (2-gram) Natural Selection. The frequencies are per million words, in each year between 1820 and 1899, for the whole Google Book corpus (Panel A) and for non-fiction and fiction books separately (Panel B).

Fig. 2

Frequencies (per 1 Million Words) of Darwinian Concepts in the Google Books Corpora. Notes: For each year, the graphs show the number of occurrences of the word or phrase reported on top per one million words. In panel A, the gray solid line displays the yearly frequency, whereas the red dashed line is a median band plot with 16 intervals (each of 5 years). Note that also the denominators for the calculation of the relative frequencies are separate for fiction and non-fiction
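The two quantities plotted in the figure can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input series is synthetic, and the median band is taken to be the pair (median year, median frequency) within each consecutive five-year interval, which is one common definition of a median band plot.

```python
from statistics import median

def per_million(word_count, total_words):
    """Occurrences of a word per one million words in a given year."""
    return 1_000_000 * word_count / total_words

def median_bands(freq_by_year, start=1820, end=1899, width=5):
    """Median year and median frequency within consecutive intervals."""
    bands = []
    for lo in range(start, end + 1, width):
        years = [y for y in range(lo, lo + width) if y in freq_by_year]
        if years:
            bands.append((median(years), median(freq_by_year[y] for y in years)))
    return bands

# Synthetic yearly frequencies (not from the Google Books data).
freq_by_year = {year: float(year - 1820) for year in range(1820, 1900)}
bands = median_bands(freq_by_year)   # 16 intervals of 5 years each

example = per_million(90, 75_000_000)  # 90 occurrences in 75 million words
```

For fiction and non-fiction series, `total_words` would be the word total of the respective subcorpus, in line with the note on separate denominators.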

The expression Natural Selection, perhaps the most defining of Darwin’s concepts, was virtually non-existent in both the fiction and non-fiction literature before 1859 and experienced a significant increase in the rate of adoption since then. On the one hand, this may not be surprising, precisely because of the close association of Darwin’s work with the idea of natural selection. On the other hand, we may consider the significant increase in the diffusion of this concept immediately after the publication of On the Origin of Species as a validation of our approach; this initial analysis of frequencies does capture what we might have expected.

Evolution and Survival also saw a substantial increase in their adoption rate in the years immediately following the publication of Darwin’s book. The ideas that underlie these words and expressions, therefore, generated interest not only in specialized or more educated circles, but plausibly also in the more general cultural context.Footnote 20 Moreover, the diffusion of these concepts in the fiction literature lagged the diffusion in the non-fiction literature by a few years. Competition was already present in the first part of the 19th Century, especially in the non-fiction literature, but experienced an increase in the adoption rate after 1860. Selection experienced a weaker increase in relative frequency around the publication of On the Origin of Species.Footnote 21 Although Selection was already present before 1859, Natural Selection, as an expression, appeared after the publication of On the Origin of Species. This suggests the possibility that, after 1859, the word Selection might have experienced a change of meaning and use in the public discourse. We investigate this below.

In Table 1, Slope(1820–59) and Slope(1860–99) are the parameter estimates from spline regressions of the frequency (per million words) of each of the Darwinian words and phrases on a time variable that represents each year between 1820 and 1899 (expressed as t = 20, 21, …, 99), with one knot at year 1859. Slope(1860–99)-Slope(1820–59) is the difference between the two slopes. Table 2 displays results from the same spline regressions, separately for fiction and non-fiction books. Finally, we ran spline regressions with knots at each decade between 1820 and 1899. The estimates are in Table 3; for a more parsimonious exposition, we aggregated the six Darwinian expressions into an index given by the average annual frequency (word-by-word estimates are in Appendix Table A2). Columns (1) through (3) display the estimates separately for fiction and non-fiction books, as well as overall.
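A one-knot linear spline of this kind can be sketched as below. The sketch is illustrative only: it assumes NumPy, uses a noiseless synthetic series, and parameterizes the spline with the basis (1, t, max(t − 59, 0)), so that the coefficient on the last term is the difference between the post- and pre-1859 slopes. The function name is hypothetical and standard errors are not computed.

```python
import numpy as np

def fit_one_knot_spline(t, y, knot=59):
    """OLS fit of a linear spline with a single knot; returns the two slopes."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Columns: intercept, time trend, extra slope after the knot.
    X = np.column_stack([np.ones_like(t), t, np.maximum(t - knot, 0.0)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    _, slope_pre, difference = coef
    return {
        "slope_pre": slope_pre,                # slope for t <= knot
        "slope_post": slope_pre + difference,  # slope for t > knot
        "difference": difference,
    }

# Synthetic frequency series with a slope change after t = 59 (i.e., 1859).
t = np.arange(20, 100)
y = 0.5 + 0.01 * t + 0.20 * np.maximum(t - 59, 0)
est = fit_one_knot_spline(t, y)
```

Adding one `np.maximum(t - k, 0.0)` column per decade boundary would extend the same basis to the multi-knot version.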

Table 1 Spline regression analyses (one knot) – frequency of Darwinian concepts
Table 2 Spline regression analyses (one knot) – frequency of Darwinian concepts, separate for fiction and non-fiction books
Table 3 Spline regression analyses (eight knots) – average frequency of Darwinian and generic words

The estimates reinforce the visual evidence in Figs. 2 and 3. They also confirm the delay in diffusion in the fiction literature that we observed in the graphical representations: the estimated slopes are large and statistically significant starting in the 1870s for the fiction subsample.Footnote 22 The extent of these changes is substantial. For example, the average frequency of the six Darwinian words and expressions oscillated between 5 and 10% of the standard deviation of the yearly frequency of all words present in the Ngram corpus in a given year between 1820 and 1859, and then began to rise up to 30% of the yearly standard deviation from 1860 to the end of the century (see Figure A2 in the Appendix).

Fig. 3

Differences-in-Differences Estimates of the Average Frequency of Darwinian and Generic Concepts in Each Decade between 1820 and 1899. Notes: Each dot in the graph represents the estimate of the parameters \({\delta }_{j}\) from the following regression model: \(\mathrm{ln}\left({y}_{wt}\right)={\alpha }_{w}+{\beta }_{w}1\left(Darwinian\right)+{\sum }_{j=2}^{4}{\gamma }_{j}1(j0\le t\le j9)+{\sum }_{j=6}^{9}{\gamma }_{j}1(j0\le t\le j9)+{\sum }_{j=2}^{4}{\delta }_{j}1(j0\le t\le j9)*1\left(Darwinian\right)+{\sum }_{j=6}^{9}{\delta }_{j}1(j0\le t\le j9)*1\left(Darwinian\right)+{\varepsilon }_{wt}\), where \({y}_{wt}\) is the frequency of use of a word per million words used. Each value on the x-axis corresponds to a decade; for example, 1830 corresponds to 1830–39. The omitted (or baseline) decade is 1850–59 (1850). Because the observed frequency is equal to zero in some cases, we add 0.001 to each frequency (0.001 is half of the lowest positive frequency per million words in our sample). The shaded area represents 95% confidence intervals that we computed using a wild bootstrap procedure (Roodman et al., 2019). Results are almost identical if we use an inverse hyperbolic sine transformation: \(\mathrm{z}=\mathrm{ln}\left(\mathrm{y}+\sqrt{{\mathrm{y}}^{2}+1}\right)\), or if we apply the GMM procedure described in Bellego and Pape (2019) to estimate the parameters (the confidence intervals from bootstrapped standard errors are narrower in this last case).

In Column (4) of Table 3, we report estimates of the average yearly frequency of a group of “control” or “placebo” words to compare to the terms that represent the key Darwinian concepts. We selected the 100 most frequent nouns in On the Origin of Species, and then eliminated Selection, which is among these 100 nouns but also one of the Darwinian words. The remaining ninety-nine words are not specific to the theory of evolution. For these words, we do not detect any particular change in diffusion before and after 1859.Footnote 23

Fig. 4

Frequencies (per 1 Million Words) of the Phrase “Natural Selection” in Six Languages Other than English. Notes: For each year, the figures report the number of occurrences (per million words) of the expression “Natural Selection” in the language indicated on top of a graph. The red dashed lines are a median band plot with 16 intervals (each of 5 years). The vertical dashed lines mark the year of the first published translation of On the Origin of Species in a given language

We further rely on this group of generic words to perform difference-in-difference analyses whose findings are in Table 4 and Fig. 3. In Table 4, we report the estimates from analyses where, for each year, we sum up the frequencies of the six Darwinian concepts on the one hand and of the ninety-nine control nouns on the other hand, and compare the trends in aggregate diffusion before and after 1859. Because the aggregate frequency of the generic words is much higher than the frequency of the Darwinian concepts pooled together, to make more immediate comparisons we transform these frequencies into their natural logarithms and include the logarithm of the time trend in the regression analyses. In this analysis, we also pool together fiction and non-fiction books. The regression model that we estimate is as follows:

Table 4 Differences-in-differences regressions – Darwinian and generic scientific concepts
$$\mathrm{ln}\left({y}_{wt}\right)={\alpha }_{w}+{\beta }_{w}\mathrm{ln}\left(t\right)+{\gamma }_{w}\left(\mathrm{ln}\left(t\right)-\mathrm{ln}\left(59\right)\right)*1\left(t>59\right)+{\delta }_{w}1\left(Darwinian\right)+{\theta }_{w}\mathrm{ln}\left(t\right)*1\left(Darwinian\right)+{\lambda }_{w}\left(\mathrm{ln}\left(t\right)-\mathrm{ln}\left(59\right)\right)*1\left(t>59\right)+{\mu }_{w}\left(\mathrm{ln}\left(t\right)-\mathrm{ln}\left(59\right)\right)*1\left(Darwinian\right)*1\left(t>59\right)+{\varepsilon }_{wt}$$
(1)

The sample thus includes 160 observations, two for each year, with one reporting information about the generic words (\(1\left(Darwinian\right)=0\)), and the other about the six Darwinian concepts (\(1\left(Darwinian\right)=1\)). Columns (1) and (2) of Table 4 display estimates of a simplified version of the model, where the left-hand-side variable is the natural logarithm of the sum of frequencies of Darwinian and generic terms separately, regressed on a time trend and the interaction between the indicator for years greater than 1859 and the difference between the current year and 1859. Estimates of the parameters of the full model are in Column (3). The estimate of the coefficient on the interaction between the indicator for Darwinian words, the indicator for the post-1859 period and the difference between the current year and 1859 (\({\mu }_{w}\)) is positive, large and statistically significant, indicating a much larger relative increase in the frequency of Darwinian concepts after 1859. The estimate of \({\theta }_{w}\) is significantly smaller than the estimate of \({\mu }_{w}\), but it is positive and statistically different from zero; this indicates that, even before 1859, the frequency of Darwinian concepts was increasing at a higher rate than that of the combined generic terms. This is likely due to the trend and diffusion that some Darwinian terms, such as Selection and Adaptation, were already experiencing in the first half of the 19th Century. The trend, however, clearly had an additional, substantial acceleration after the publication of On the Origin of Species.
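To make the structure of this two-group, 160-observation regression concrete, the following is a rough sketch under explicit assumptions. It uses NumPy, a noiseless synthetic outcome, and a design matrix with six columns: intercept, ln(t), the post-1859 spline term, the Darwinian indicator, and the interactions of the indicator with ln(t) and with the post-1859 term. This column list is one reading of Eq. (1), not a reproduction of the estimation code, and the coefficient values are arbitrary.

```python
import numpy as np

def did_design(t, darwinian, knot=59):
    """Design matrix for a log-trend model with a post-knot spline term."""
    t = np.asarray(t, dtype=float)
    d = np.asarray(darwinian, dtype=float)
    ln_t = np.log(t)
    post = (ln_t - np.log(knot)) * (t > knot)   # zero up to and including the knot
    return np.column_stack([np.ones_like(t), ln_t, post, d, ln_t * d, post * d])

years = np.arange(20, 100, dtype=float)         # t = 20, ..., 99
t = np.concatenate([years, years])              # one row per group and year
d = np.concatenate([np.zeros(80), np.ones(80)]) # 0 = generic, 1 = Darwinian
X = did_design(t, d)

# Arbitrary "true" coefficients used only to generate synthetic data.
true_coef = np.array([5.0, 0.1, 0.0, -4.0, 0.3, 2.0])
ln_y = X @ true_coef

coef, *_ = np.linalg.lstsq(X, ln_y, rcond=None)
theta_hat, mu_hat = coef[4], coef[5]            # pre-trend gap, post-1859 acceleration
```

With real data, `ln_y` would be the logarithm of the summed yearly frequencies of each group, and inference would require standard errors, which this sketch omits.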

Second, we define a model where the outcome variable is the annual frequency (from 1820 to 1899) of each of the six Darwinian concepts and of the ninety-nine control nouns separately. Here we estimate the average difference in frequency for the Darwinian words and the generic words in each decade:

$$\mathrm{ln}\left({y}_{wt}\right)={\alpha }_{w}+{\beta }_{w}1\left(Darwinian\right)+{\sum }_{j=2}^{4}{\gamma }_{j}1(j0\le t\le j9)+{\sum }_{j=6}^{9}{\gamma }_{j}1(j0\le t\le j9)+{\sum }_{j=2}^{4}{\delta }_{j}1(j0\le t\le j9)*1\left(Darwinian\right)+{\sum }_{j=6}^{9}{\delta }_{j}1(j0\le t\le j9)*1\left(Darwinian\right)+{\varepsilon }_{wt}$$
(2)

This analysis is on (6 + 99)*80 = 8,400 observations. The omitted time category is the decade 1850–59 (\(50\le t\le 59)\). The \({\delta }_{j}\) coefficients thus indicate the difference between Darwinian and control terms, as compared to the reference difference in the 1850–59 period. Figure 3 displays the estimates of the \({\delta }_{j}\) coefficients and their 95% confidence intervals, and shows that the difference in relative frequency between the Darwinian and generic terms is much larger after the publication of On the Origin of Species than before. The main unit of observation in this analysis is a given term, so we cluster standard errors at that level. We have, therefore, 105 clusters; however, the number of “treated” units (and consequently the number of treated clusters) is small relative to the control ones. As such, to estimate confidence intervals around each of the estimated difference-in-differences parameters, we rely on the subcluster wild bootstrap procedure of MacKinnon and Webb (2018; see also Roodman et al., 2019). Despite the larger estimated confidence intervals, the estimates still show a significant change after 1859.
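The decade-by-decade specification in Eq. (2) can be illustrated with the sketch below. It assumes NumPy, a single common intercept, and a noiseless synthetic outcome; the helper names are hypothetical. It computes point estimates only and does not implement the clustered wild bootstrap used for the confidence intervals.

```python
import numpy as np

DECADES = [2, 3, 4, 6, 7, 8, 9]   # j = 5 (1850-59) is the omitted baseline

def log_frequency(freq, offset=0.001):
    """Log of frequency per million words, shifted so that zeros are defined."""
    return np.log(np.asarray(freq, dtype=float) + offset)

def event_study_design(t, darwinian):
    """Intercept, group indicator, decade dummies, and their interactions."""
    decade = np.asarray(t) // 10
    d = np.asarray(darwinian, dtype=float)
    dummies = np.column_stack([(decade == j).astype(float) for j in DECADES])
    interactions = dummies * d[:, None]
    return np.column_stack([np.ones(len(decade)), d, dummies, interactions])

n_darwinian, n_control = 6, 99
years = np.arange(20, 100)
t = np.tile(years, n_darwinian + n_control)              # 105 words x 80 years
darwinian = np.repeat(np.r_[np.ones(n_darwinian), np.zeros(n_control)], len(years))
X = event_study_design(t, darwinian)

# Synthetic outcome: no pre-1850 differential, growing differential afterwards.
true_delta = np.array([0.0, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0])
ln_y = X @ np.r_[1.0, -3.0, np.zeros(7), true_delta]

coef, *_ = np.linalg.lstsq(X, ln_y, rcond=None)
delta_hat = coef[-7:]   # one estimate per non-baseline decade
```

Each element of `delta_hat` is the Darwinian-minus-generic gap in a decade relative to the same gap in 1850–59, which is what each dot in the corresponding figure represents.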

To assess the robustness of these last two analyses, we also defined a second control group; we selected the 100 words whose frequency between 1855 and 1858, the years immediately before the publication of Darwin’s book, was closest in absolute value to the average frequency of the six Darwinian expressions. Whereas the selection of the first group of words was motivated by the fact that those terms were present in On the Origin of Species, the rationale for this second group is the similar diffusion in the public discourse. Figure A5 in the Appendix presents the same type of plot as the one in Fig. 3, from a regression with the alternative control group; the patterns are remarkably similar.Footnote 24
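The matching rule for this second control group can be sketched as a nearest-neighbor selection on pre-period frequency. The function name, the toy vocabulary, and the frequencies below are hypothetical; the sketch assumes each candidate word is summarized by its average frequency over 1855–1858.

```python
def closest_controls(avg_freq, target, k=100, exclude=()):
    """Return the k words whose average frequency is nearest to the target."""
    candidates = [(abs(freq - target), word)
                  for word, freq in avg_freq.items()
                  if word not in exclude]
    return [word for _, word in sorted(candidates)[:k]]

# Hypothetical average frequencies (per million words) for 1855-1858.
avg_freq = {"alpha": 1.0, "beta": 2.5, "gamma": 9.0, "delta": 1.8, "selection": 2.0}
controls = closest_controls(avg_freq, target=2.0, k=2, exclude={"selection"})
```

Here `target` stands for the average 1855–1858 frequency of the six Darwinian expressions, and `exclude` keeps the Darwinian terms themselves out of the control group.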

In addition to the comparison with two sets of words, we address a further concern that the significant change in the frequency of Darwinian terms after 1859 may be due to an overall change in the composition of texts, at least in the Ngram corpus. Although this corpus does not identify books, but only words and expressions, we can assess whether the total number of words, and the rate of “entry” and “exit” of words in the corpus, was different in the years around the publication of On the Origin of Species as compared to other periods. Figure A6 and Table A5 in the Appendix show that this was not the case.
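One simple way to operationalize word “entry” and “exit” is sketched below. The definitions are assumptions made for illustration, since the text does not spell them out: the entry rate is the share of the current year's vocabulary that was absent in the previous year, and the exit rate is the share of the previous year's vocabulary that is absent in the current year.

```python
def entry_exit_rates(vocab_prev, vocab_curr):
    """Entry and exit rates between two consecutive yearly vocabularies (sets)."""
    entered = vocab_curr - vocab_prev
    exited = vocab_prev - vocab_curr
    return len(entered) / len(vocab_curr), len(exited) / len(vocab_prev)

# Toy vocabularies for two consecutive years.
vocab_1858 = {"a", "b", "c", "d"}
vocab_1859 = {"b", "c", "d", "e", "f"}
entry_rate, exit_rate = entry_exit_rates(vocab_1858, vocab_1859)
```

Comparing these rates in the years around 1859 with those of other periods would indicate whether the composition of the corpus shifted at that time.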

4.1.2 Translation in other languages

Were the effects of On the Origin of Species specific to the social context in which the book was written and first published? Or did the treatise generate a similar impact in other countries upon its translation? Moreover, did the diffusion of scientific concepts in the cultural environment relate to the status of a country’s economic development, literacy rate, or development of the publishing industry? To answer these questions, we study whether the translations of On the Origin of Species generated a similar diffusion of its key concepts in other languages.

As shown in Fig. 4, the phrase Natural Selection substantially increased its diffusion upon the translation of On the Origin of Species. The same holds for such words as Evolution, Survival, and Competition (Figure A7 in the Appendix). Moreover, the frequency of use of most words started increasing right after 1859, indicating that, even in the absence of an official translation, Darwin’s concepts diffused across borders. These results suggest that the cultural effects of On the Origin of Species were not specific to the English-speaking context.

We cannot claim that the translation years are exogenous. For example, the translation might have occurred first in countries where the interest was higher, and this, in turn, might have affected diffusion. The likely endogeneity of the publication year, however, offers an opportunity for additional considerations about the relationship between the broad cultural acceptance of scientific concepts and economic development. For instance, in countries like Italy and Spain, both “late comers” during the Industrial Revolution (Ciccarelli & Nuvolari, 2015), the translation of On the Origin of Species occurred later than the translation into French and German, i.e., the languages of two countries where industrialization occurred earlier. Conclusions for Russian and Chinese terms are more tentative, because the N-gram repository plausibly includes a relatively small number of books in these languages. Nonetheless, Russia was mostly a feudal country until World War I (Markevich & Zhuravskaya, 2018), and had the first translation of Darwin’s book even later; and China was long isolated from the scientific debate, which, according for example to Mokyr (2008), delayed its industrial development. It is perhaps not surprising, given the features of these two countries, that the diffusion of Darwinian words and phrases was extremely limited in Russian and Chinese books in the 19th Century.Footnote 25

The translation year may depend not only on the status of a country’s economic development, but also on the literacy rate or the development of the publishing and translating industries. We collected data on these variables from the countries’ Censuses between 1800 and 1950 for the available years. Darwin’s concepts diffused earlier in countries with a higher literacy rate (like the UK, US, France, and Germany) than in countries with a low literacy rate (such as Italy, Spain and Russia). Similarly, where the publishing or translating industries were more developed, Darwin’s ideas diffused sooner. This is the case of France and Germany. By contrast, countries with less diffusion, like Spain, Italy, Russia, and China, also had less developed publishing industries.

4.1.3 Ideas in the air, substitution and multiple attribution

The various findings that we just reported show that some concepts, as measured by the words that embody them, were only marginally present in the public discourse before the publication of On the Origin of Species. However, even if they entered the public discourse only after 1859, certain terms may have simply substituted existing ones while expressing the same ideas. In Fig. 5 we report the frequency of occurrence of the names of four scientists who contributed, in different ways, to the understanding of evolution. In addition to Darwin, we consider Alfred Russel Wallace, Robert Chambers, and Jean-Baptiste Lamarck. Lamarck’s theory of the transmission of acquired traits is frequently mentioned as an example of a “failed” theory to compare to Darwin’s. Chambers’ Vestiges of the Natural History of Creation introduced, in the 1840s, the idea of an “evolution” of living and non-living beings over time, more as a speculation than as a complete scientific treatment (note that the author was anonymous until 1884). Alfred Russel Wallace’s work was close, in time and content, to Darwin’s. Figure 5 shows that both Darwin and Wallace increased their occurrence in the English book corpus in the second half of the 19th Century, but Darwin’s frequency increased substantially more. Chambers and Lamarck were already present before then, but their frequency remained stable (and low) after 1860.Footnote 26 The estimated differences in the increase of diffusion after 1859 between Darwin and the other names are statistically significant (see also Appendix Table A6).

Fig. 5

Frequencies (per 1 Million Words) of Occurrences of the Names Charles Darwin, Alfred Wallace, Robert Chambers and Jean-Baptiste Lamarck in the English Google Books Corpus. Notes: For each year, the figures report the number of occurrences (per million words) of the name indicated in the legend. When we consider both the first and last names (left panel), we include different combinations of the full names of the four scientists: Alfred Russel Wallace, Alfred Wallace, Charles Darwin, Charles Robert Darwin, Robert Chambers, Jean-Baptiste Lamarck, Jean-Baptiste de Lamarck, Jean Baptiste Lamarck, and Jean Baptiste de Lamarck. The vertical dashed line marks the year of publication of On the Origin of Species (1859)

Because Lamarck was (and originally wrote in) French, we then compare the diffusion of the words Darwin and Lamarck in the French corpus. After 1860, the relative occurrence of the word Darwin in French books surpassed the frequency of Lamarck. We also compare terms related to the study of the emergence and development of new species: Evolution and Transmutation. Although Evolution, which we already analyzed above, is typically associated with Darwin’s work, earlier works in biology (including some of Darwin’s) used the term Transmutation to characterize (gradual or discrete) transformations of plants and animals. By comparing these two words, we want to assess whether the broader literature and cultural discourse also picked up the “newer” word to express these changes. For books in French, we consider the word Transformism (Transformisme in French), which was used by Lamarck. The graphical representation of our findings is in Fig. 6. The general pattern is that Evolution became progressively more frequent than Transmutation, with a significant change in frequency after the 1850s. The substantially larger frequency of Evolution also suggests that this word did not just “replace” words that expressed overall similar concepts, but plausibly represents a broader diffusion of certain new ideas.

Fig. 6

Frequency of Occurrence (per million words) of the Words Darwin, Lamarck, Transmutation, Transformism and Evolution in the English and French Google Books Corpora. Notes: For each year, the figures report the number of occurrences (per million words) of the name indicated in the legend. The vertical dashed lines mark the year of publication of On the Origin of Species (1859)

Overall, this evidence suggests that Darwin, with his own work and especially his 1859 book, caused a discontinuous change in the cultural discourse.

4.1.4 The diffusion of Darwinian concepts in the political arena

We perform frequency analyses on the UK Parliamentary debates and U.S. Congress data to assess whether Darwin’s theory spilled over not only to the cultural discourse, but also to the political debate; arguably, culturally accepted scientific concepts may also affect how laws are shaped.

The UK Parliamentary data include a transcription of the debates in the House of Lords and the House of Commons. The corpus of Congressional Records includes the transcripts of all legislative debates occurring on the floor of the US Congress. It also contains additional materials, such as communications from the president and the executive branch agencies, memorials, petitions, and supplementary information on the current legislation. We argue that these two corpora represent the official and most comprehensive daily account of the political discussion happening in the United Kingdom and United States. Although the text corpora of parliamentary debates are smaller than those we used in the main analysis, we think they can still offer suggestive evidence of the diffusion of the Darwinian concepts in the political debate.

Figure 7 shows an increase in the frequencies of such words and concepts as Evolution, Survival and Natural Selection in both the Parliamentary and Congress debates after 1859. The evidence of an increase in use of these terms is clearer for the US Congress than for the UK Parliament. Overall, these results suggest that, after diffusing in the cultural environment, key Darwinian concepts also reached the political debate. The ten-year median bands, in particular, indicate a change in the use of these words a few years after we see these changes in the Google Books data. The lag may suggest that the cultural diffusion was faster than, and perhaps a pre-condition for, political diffusion.Footnote 27 In the case of the US, the Civil War in the first half of the 1860s may have further delayed the introduction of these new concepts in the legislative debate (Masci, 2019). Another explanation for the delay we observe might be that prior to 1837 each House was only required to keep an internal journal of its proceedings. External reporters could report verbatim debates only after that year. This might have hindered our capacity to fully capture the presence of Darwinian concepts in the initial period of the analysis.

Fig. 7

Frequencies (per 1 Million Words) of Selected Darwinian Words and Phrases in the UK Parliamentary Debates and US Congressional Records. Notes: For each year, the graphs show the number of occurrences of the word or phrase reported on top per one million words. The gray solid line displays the yearly frequency, whereas the red dashed line is a median band plot with 16 intervals (each of 5 years)

4.1.5 Divine versus Darwinian creation

Although, according to many accounts, Darwin did not intend his theory of evolution through natural selection to go against religious (Christian) beliefs and doctrine, implications of his discoveries such as the common origins of species, random variation and the absence of an intelligent design were largely perceived as a major blow to the Christian view of creation. In addition to exploring the diffusion of Darwinian concepts into the political discourse, a further way to assess how influential the cultural diffusion of the theory of evolution was is to investigate how certain topics that concerned both the religious sphere and Darwin’s investigation evolved over time.

We next investigate whether Darwin’s theory had any impact on religion by focusing on specific terms with a strong religious root but also related to Darwin’s theory. We analyze two terms related to the origins of the world, Creation and Genesis, the word Creator, which is one of the characterizations of God in many religions, and the word God itself. We take advantage of a convention in written text that applies when certain words are used in a religious context: writing the initial letter in upper case. In Fig. 8, we report the yearly frequency of use of the words God, Creator, Creation and Genesis with and without an upper-case initial. The increase in the use of the lower-case version of God, Genesis and Creator is visibly faster than that of the upper-case equivalent, perhaps indicating an overall process of relative “secularization” of the cultural domain. More relevant for our analysis, we observe a change in growth rate for the lower-case version of these three words again around 1860, whereas the upper-case equivalent terms follow a trend that does not change meaningfully for the whole eighty-year period around the publication of On the Origin of Species. For the word Creation, plausibly a term with a broader set of uses and meanings than the other three, there is no particular pattern for either the lower-case or the upper-case version. Overall, we interpret this evidence as showing that certain terms and underlying ideas with a strong religious connotation became more relevant also in the non-religious discourse. Below we report our explorations of semantic change where we will also assess the evolution of the use of certain terms with religious connotation by investigating changes in their meaning.
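The case-sensitive comparison can be sketched as follows, assuming a tokenized text in which capitalization has been preserved. The function name and the toy token list are hypothetical; with n-gram data one would instead look up the two case variants directly.

```python
from collections import Counter

def case_split_per_million(tokens, word):
    """Frequencies per million tokens of the capitalized and lower-case forms."""
    counts = Counter(tokens)
    scale = 1_000_000 / len(tokens)
    upper = counts[word.capitalize()] * scale   # e.g., "God"
    lower = counts[word.lower()] * scale        # e.g., "god"
    return upper, lower

# Toy token list of length 10 (not from any corpus).
tokens = ["God", "god", "god", "Creator", "creator", "the", "the", "and", "a", "of"]
upper_freq, lower_freq = case_split_per_million(tokens, "god")
```

A caveat of this simple count is that a capitalized form at the start of a sentence is not necessarily a religious use, so the split is only a proxy.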

Fig. 8

Creation and the Theory of Evolution: Frequencies of Words with Lower- and Upper-Case Initials. Notes: For each year, the graphs show the number of occurrences of the word or phrase reported on top per one million words. The gray solid line displays the yearly frequency of the version of the word with an upper-case initial, whereas the red dashed line shows the frequency of the version starting with a lower-case letter

4.2 Semantic and sentiment analysis

Word embedding techniques require very large sample sizes to produce reliable results and insights. For this reason, in this section we limit the analysis to the Google Books database, and aggregate the data at the decade level.

4.2.1 Semantic analysis

Figure 9 introduces the second part of our study, where we move from the analysis of the frequency of use of certain words and the concepts underlying them, to the analysis of the semantic evolution of certain words and concepts, to see whether this evolution occurred in ways that we can relate to Darwin’s theory. In the graphs, the horizontal axis reports decades (the time unit of reference), and the vertical axis indicates the cosine of the angle between the two word vectors of interest.

Fig. 9

Semantic Associations between Selected Pairs of Words. Notes: The graphs report the similarity between each pair of words, as measured by the cosine of the angle between each pair of word vectors. The weights in the word vectors were calculated with a Word2Vec algorithm. On the x-axis, 1820 represents the decade 1820–29, 1830 represents the decade 1830–39, and so on

One aspect of Darwin’s theory is that life (or existence) includes adaptation, as well as competition, among its defining aspects. There is an increase in the semantic association between Life on the one hand, and Adaptation, Struggle and Competition on the other hand, especially after 1859. For Life and Struggle, we see a trend since the early 19th Century. We also investigate themes that presumably represented a controversy with the religious approach to the origins of species. One implication of Darwin’s theory is that evolution applies to humans in the same way as it applies to other animals; although Darwin did not explicitly treat the human species in his 1859 book, this was the topic of his 1871 The Descent of Man and Selection in Relation to Sex. The semantic evolution of the word Human shows an increase in its similarity with Animal especially in the late 1800s. Furthermore, we investigate whether Darwinian concepts at the basis of the process of the birth of species, in particular Evolution, came to relate to terms that expressed this process in the religious discourse: Creation and, even more specifically, Genesis. The last two panels of Fig. 9 show a growing semantic similarity of Creation and Genesis with Evolution through the 19th century, consistent with a “secularization” of the discourse about the origin of the world. Again, the change in semantic similarity seems to accelerate starting in the 1860–70 decade.Footnote 28

A second analysis of semantic changes focuses, again, on the key words and concepts that we considered so far. However, instead of investigating the similarity of these words with a select sample of other concepts, we “let the data speak” by determining, for each decade, the ten words with the highest semantic connection (cosine similarity) to these key words.Footnote 29 Table 5 reports the findings. We excluded from the rankings the words that had the same root as the focal key word as well as the most obvious synonyms (e.g., Compete or Competitor for Competition); we also defined a lower bound to the relevant cosine similarity to be equal to 0.05.
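The ranking procedure can be illustrated with the sketch below. The two-dimensional vectors are hypothetical toy values, the function names are assumptions, and the exclusion of same-root words is approximated by a simple prefix filter, which is cruder than a manual screening of roots and synonyms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_neighbors(focal, vectors, k=10, min_cos=0.05, exclude_prefix=None):
    """Up to k words most similar to the focal word, above a similarity floor."""
    scored = []
    for word, vec in vectors.items():
        if word == focal:
            continue
        if exclude_prefix and word.startswith(exclude_prefix):
            continue  # crude stand-in for dropping words with the same root
        similarity = cosine(vectors[focal], vec)
        if similarity >= min_cos:
            scored.append((similarity, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

# Hypothetical toy embeddings for one decade.
vectors = {
    "competition": [1.0, 0.0], "competitor": [0.9, 0.1],
    "market": [0.8, 0.6], "trade": [0.6, 0.8],
    "price": [0.0, 1.0], "flower": [-1.0, 0.0],
}
neighbors = top_neighbors("competition", vectors, k=2, exclude_prefix="compet")
```

Running the function once per decade, on that decade's embeddings, would produce a table of nearest neighbors over time for each key word.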

Table 5 Top 10 most similar words for selected Darwinian words

The table identifies a few interesting facts. First, the term Adaptation became, over the 19th Century, less related to physical or “mechanical” terms (such as Mechanism) and increasingly similar to concepts that represented living beings (such as Organism and Reproduction).

Second, substantial changes in meaning and association concern the word Evolution. In the first half of the 19th Century, the terms that were closest to Evolution came mostly from chemistry and physics. Later in the 1800s concepts from biology as well as related to human society were semantically more similar to Evolution. Examples include Social and Progress. Note also how the word Darwinian itself became closely associated with Evolution.

Third, Selection was more closely related to the concept of Choice (and qualifications for the choice such as “careful” or “judicious”) in the first half of the 1800s; the similarity in meaning with Choice remained also later, but Selection also became more similar in meaning to other specific “Darwinian” words, such as Survival, Variation, Fittest and Heredity.

Fourth, very few words had a similarity in meaning with Survival, likely because the word itself was only rarely used in the first half of the 19th Century. Later in the century, the word was increasingly associated with other concepts related to evolutionary theory, notably Fittest, Evolution, Struggle and Selection. The increasing relatedness with Fittest toward the end of the 1880s is likely due also to the publication of the Principles of Biology by Herbert Spencer in 1864, where this concept applies also to society and ethics and not only to the natural sphere. Competition, in contrast, maintained an association with a stable set of words, mostly related to production and markets, throughout the century.Footnote 30

In Fig. 10 we display the semantic association between the five key words in On the Origin of Species that we took as expressing Darwin’s contribution (Evolution, Selection, Survival, Competition and Adaptation), and the names of the four scientists (including Darwin) we considered in Sect. 4.1.3 above. With this exercise, we explore whether the key terms that defined the theory of evolution by natural selection were, in fact, specifically associated with Darwin or were part of a discourse that also included the contributions of other scientists. In general, the similarity of these words with Darwin is systematically positive and greater than the similarity with the other names. Lamarck generally shows higher similarity with the five key words than Chambers and Wallace. This suggests that Darwin and Lamarck remained the two most prominent figures, among students of evolution, in the cultural discourse.

Fig. 10 Semantic Associations between the Key Words in On the Origin of Species and the Names Darwin, Wallace, Chambers and Lamarck. Notes: The graphs report the similarity between the word on top of each chart and each of the four names in the legend. The weights in the word vectors were calculated with a Word2Vec algorithm. On the x-axis, 1820 represents the decade 1820–29, 1830 represents the decade 1830–39, and so on

4.2.2 Sentiment analysis

Figure 11 (Panels A through C) displays the evolution over time of the perceptions or sentiments about the key Darwinian concepts in English books, as well as about Darwin himself. We focus on the proximity to three categories of antonyms: Unimportant vs. Important, Bad vs. Good, and Immoral vs. Moral. These dichotomies help assess whether Darwin’s concepts gained relevance and whether they had a positive or negative connotation in the public discourse.
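One common way to obtain a signed score of this kind is sketched below, purely as an illustration. It assumes that each category is summarized by one vector per pole and that the score is the cosine between a word vector and the axis running from the negative to the positive pole. This construction, the function names, and the two-dimensional toy vectors are assumptions made for the example; they do not document the exact procedure behind Fig. 11.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def polarity(word_vec, positive_vec, negative_vec):
    """Signed score: cosine between a word and the negative-to-positive axis.

    Positive values mean the word lies closer to the positive pole
    (e.g. Good), negative values closer to the negative pole (e.g. Bad).
    """
    axis = [p - n for p, n in zip(positive_vec, negative_vec)]
    return cosine(word_vec, axis)

# Toy two-dimensional vectors, invented for illustration only.
poles = {"good": [1.0, 0.2], "bad": [-1.0, 0.2]}
toy_words = {
    "term_a": [0.5, 0.5],    # hypothetical word leaning toward "good"
    "term_b": [-0.5, 0.5],   # hypothetical word leaning toward "bad"
}

scores = {
    word: polarity(vec, poles["good"], poles["bad"])
    for word, vec in toy_words.items()
}
print(scores)
```

Under these assumptions, computing the score separately for each decade and each of the three antonym categories would trace a line comparable to those plotted in a sentiment panel.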

Fig. 11 Sentiment Analysis of Selected Darwinian Words in the Google Books Corpus. Notes: The graphs report the similarity between the word on top of each chart and the set of antonyms within a certain category. On the y-axis, positive values of the cosine indicate higher similarity with the “positive” end of a category (Important, Good, Moral), whereas negative values indicate closer association with the negative end (Unimportant, Bad, Immoral). On the x-axis, 1820 represents the decade 1820–29, 1830 represents the decade 1830–39, and so on

Although the evidence is not clear-cut, the term Evolution is, especially after 1859, perceived as more important, moral, and good, and so is Survival. Therefore, these two key concepts in Darwin’s theory not only experienced an increase in use and an evolution in their meaning (especially Evolution, as described in Sect. 5.1), but were also received positively. The combination of these three changes (frequency of use, semantics, and sentiments) for some of the Darwinian concepts that we consider corroborates our argument that these ideas had a novel impact on the cultural discourse. The term Darwin also shows a positive reception, with spikes around the publication of On the Origin of Species.

5 Conclusions

To the extent that both cultural and scientific change are major drivers of long-term economic outcomes, the investigation of how these two phenomena interact with each other promises to offer a deeper understanding of their role in enhancing growth.

We focused on one of the greatest scientific breakthroughs, Charles Darwin’s theory of evolution by natural selection, and explored its impact on the public discourse. Given the undoubted importance of Darwin’s theory, there is a widespread perception that it affected culture in many different ways, from changing the interpretation of nature to influencing ideas about race and equality among humans. Existing accounts, however, largely rest on qualitative or narrative evidence limited to scientists or cultural elites, whereas little is known about the wider diffusion of Darwin’s ideas into society. Arguably, to affect cultural change, a scientist should have an impact on the collective imagination of a population. Moreover, it is difficult to identify, from existing accounts, which Darwinian concepts were actually novel in the cultural discourse, and which ones were already part of it. We address these challenges by analyzing the diffusion and the semantic evolution of the key words and phrases that embody Darwin’s main concepts in hundreds of thousands of books, with the use of techniques from machine learning. We rely on the largely unplanned publication date of On the Origin of Species as a source of natural variation, and compare the use of these words and phrases with that of more generic terms that Darwin used.

Our analysis shows that the key concepts expressed by Evolution, Survival, and Natural Selection were those that diffused in fiction and non-fiction literature immediately after the publication of On the Origin of Species. Competition, a theme already present in the broader literature, diffused significantly more rapidly after 1859. The adoption of some of these words and phrases in the broader cultural conversation led also to a change in the meaning of the concepts, providing further evidence of the impact of Darwin’s theory in society at large; overall, the attitude toward these concepts was positive rather than adversarial.

Our approach has several inductive and descriptive aspects. The choice of the concepts on which to focus may seem somewhat arbitrary; however, we based our selection on the main topics that Darwin developed, as well as on the analysis of several interpretations of Darwin’s theory of evolution. Moreover, it is generally hard to provide causal identification with this type of analysis. The unplanned publication date of On the Origin of Species, the reliance on a very large amount of data, and the consistency in the patterns of different words, phrases and concepts give us some confidence about the nature of the patterns that we established.

Finally, this is a single case study, and generalizations about the relationship between major scientific discoveries and their cultural reception are difficult to make. We limited our analysis of the impact of Darwin’s theory to the diffusion of specific ideas into the broader public discourse; as such, we do not claim that our work informs on how any scientific breakthrough pervades cultural attitudes, and we are cautious about implying that our evidence identifies an impact of Darwin’s theory on culture in general. Our contribution is, in fact, to identify empirical approaches that allow for the measurement of otherwise hard-to-measure phenomena, and to propose credible strategies to assess the relationships of interest. We believe that similar approaches enabled by machine learning techniques provide promising tools to explore this relationship beyond the specific historical episode on which we focus. Examples of relevant scientific breakthroughs include the theory of relativity or the indeterminacy principle in physics, the discovery of DNA, and the emergence of biotechnology and genetic engineering. In fact, one could go beyond scientific discoveries and employ a similar approach to explore the cultural antecedents and effects of new technologies as well as of new industries, such as computers and the Internet (see for example Turner, 2010).