Abstract
Cultural phylogenetics has made remarkable progress by relying on methods originally developed in biology. But biological and cultural evolution do not always proceed according to the same principles. So what, if anything, could justify the use of phylogenetic methods to reconstruct the evolutionary history of culture? In this paper, we describe models used to assess the reliability of inference methods and show how these models play an underappreciated role in addressing that question. The notion of reliability is of course central to these models. As we explain, a common way of understanding reliability is in terms of low error rates. A careful look at case studies in cultural phylogenetics suggests that reliability models partly corroborate this understanding of reliability but also raises points of tension. We conclude by hinting at a few ways forward.
Introduction
Cultural phylogenetics is the study of evolutionary relationships among cultural traits. In recent years, one of the most notable achievements in cultural phylogenetics was the reconstruction of the evolutionary history of several language families. An early example is Holden (2002), who found that the expansion of the Bantu language family followed the spread of farming across sub-Saharan Africa from 5000 to 1500 years ago. Another remarkable case is a series of studies by Gray and Atkinson (2003), Atkinson et al. (2005), and Bouckaert et al. (2012) on the origins of the Indo-European language family. Weighing in on what has been described as “the most intensively studied, yet still the most recalcitrant, problem of historical linguistics” (Diamond and Bellwood 2003, p. 601), they found strong support for the hypothesis that all Indo-European languages originated in Anatolia about 8000 to 9500 years ago and not 6000 years ago in the Pontic Steppe. Similar methods have now also been used to recover the phylogenies of Austronesian (Gray et al. 2009), Semitic (Kitchen et al. 2009), and Sino-Tibetan languages (Zhang et al. 2019).
However, phylogenetic methods were originally developed to infer the evolutionary history of biological species. Despite striking parallels, cultural and biological evolution do not always proceed according to the same principles (Boyd and Richerson 1988; Cavalli-Sforza and Feldman 1981; Mesoudi 2011). For the most part, biological evolution unfolds in a vertical line of descent, as traits are inherited from parent to offspring. But cultural transmission is often horizontal, with cultural traits transmitted between individuals that do not stand in a parent-offspring relationship. Because of this, there have long been worries that phylogenetic methods are not applicable to the cultural realm. An early and vivid expression of this can be seen in Gould (1992). He writes: “The basic topologies of biological and cultural change are completely different. Biological evolution is a system of constant divergence without subsequent joining of branches. (...) In human history, transmission across lineages is, perhaps, the major source of cultural change” (p. 64). Since then, concerns about the viability of cultural phylogenetics as a discipline have persisted (Claidière and André 2012; Tëmkin and Eldredge 2007)—see also Borgerhoff Mulder et al. (2006) for a thorough review, as well as Collard et al. (2006) and Evans et al. (2021) for a general defense of the use of phylogenetic methods in the study of cultural evolution.
So what, if anything, could justify the use of phylogenetic methods to reconstruct the evolutionary history of human languages and other cultural traits? In this paper, we argue that some models play an underappreciated role in answering this question. These are what we call reliability models. Reliability models are unique in that their main use is to assess the performance of inference methods, where “performance” is usually understood in terms of how reliable inference methods are. These models are common in cultural phylogenetics and related disciplines because it is often impractical, if not ethically objectionable, to conduct experimental or long-term field studies on cultural traits to determine how inference methods perform. Models do not suffer from these shortcomings. To assess the performance of phylogenetic methods, reliability models rely on computational methods to simulate data under different initial conditions. Upon simulating the data, inference methods are then applied to the simulated data so as to recover the initial conditions that were used in the simulation. Since the initial conditions are programmed in a computer, it is possible to keep a record of these conditions. As a result, it is also possible to determine how successful inference methods are at recovering them. This in turn permits us to assess the quality and estimate the reliability of the inference methods. According to Bokulich’s (2020) taxonomy of data models, reliability models therefore belong to the category of “synthetic data” models as they generate data with no direct input from the world.[1]
Despite addressing such a fundamental question, reliability models have not received sufficient attention in philosophical debates about modeling. In an attempt to redress this issue, we begin section “Error-based accounts of reliability” by first clarifying what is at stake in debates about reliability. To do so, we draw on accounts by Mayo (1996, 2018), Woodward (2000), and Bovens and Hartmann (2004) according to which reliability is largely a matter of low error rates. In section “Inference methods in phylogenetics”, we take a careful look at reliability models in cultural phylogenetics and related fields, paying special attention to studies conducted by Nunn et al. (2010) and others. We then show in section “Reliability models” that reliability models play an important role in justifying the use of inference methods in cultural phylogenetics. We also show that in some ways reliability models are in line with error-based accounts of reliability but that in other ways reliability models give us reasons to question error-based accounts. This is because by understanding reliability simply in terms of error rates we run the risk of overlooking the importance of base rates—i.e., the unconditional probability of possible states of the world—when assessing the performance of inference methods. After hinting at possible ways forward, we conclude in section “Conclusion” with some brief remarks on the import of reliability models to debates about related notions—such as the purpose of robustness analysis, the nature of experimental replications, and the justification of public trust in science.
Error-based accounts of reliability
To understand the use of reliability models, it is important to first get clear on the notion of reliability. When talking about inference making in particular, an intuitive way to think about reliability is in terms of error. Error can consist in accepting a false statement and in rejecting a true statement. In statistical inference, error rates are therefore given by the probability of accepting a false statement and the probability of rejecting a true statement. For a simple example, consider a method to infer the pattern of descent among Dutch, English, and German. Phylogeneticists generally agree that English split from Dutch and German before Dutch split from German, meaning that Dutch and German are more closely related to each other than either language is to English (Gray and Atkinson 2003). Now suppose we were to apply the method to vocabulary, grammatical, or phonological data about these languages. According to an understanding of reliability in terms of error, the method would be reliable if it were unlikely to infer an incorrect pattern of descent—e.g., that the linguistic ancestor of both Dutch and English split from the ancestor of German before Dutch and English became separate languages. Similarly, the method would be reliable if it were likely to infer the correct pattern—namely, that the ancestor of both Dutch and German split from the ancestor of English first. On this common way of understanding reliability, the reliability of an inference method consists in low error rates or, equivalently, in high accuracy rates.[2]
A prominent discussion of reliability along these lines can be found in Mayo (1996, 2018). One of Mayo’s main goals is to vindicate patterns of statistical inference that are widespread across the sciences. A particularly common one is hypothesis testing. When testing a hypothesis, a standard procedure is to first formulate a null hypothesis together with a statistical model of the phenomenon under investigation. This allows us to derive the probability of observing different sets of data under the assumption that the null hypothesis is correct. Upon collecting appropriate data, we can then calculate the probability of observing data at least as extreme as the actual data conditional on the assumption that the null hypothesis is correct. If this probability drops below a certain threshold, we can then say that the test rejects the null hypothesis on the grounds that the observed data would be too unlikely if the hypothesis were in fact correct. Otherwise, we say that the test fails to reject the null hypothesis.
Within this framework, Mayo takes an inference method to be reliable if it has low error probabilities. Error probabilities measure the chance that some error might occur when applying the inference method. They give us a summary description of the multifarious ways in which things might go wrong when making an inference, as error can creep in from a variety of different sources. In the context of hypothesis testing, error probabilities often represent the chance that the test procedure would lead us astray by rejecting a hypothesis when the hypothesis is correct or by failing to reject a hypothesis when the hypothesis is false. These are also called false positives and false negatives, or type-I and type-II errors (see Fig. 1). In the example above, a false positive would be to reject the hypothesis that, say, the ancestor of Dutch and German split from the ancestor of English first when that is the actual pattern of descent among these languages. A false negative would be to fail to reject that hypothesis when that is not the actual pattern of descent.[3] For Mayo, to say that a method is reliable is thus to say that it has low rates of both types of error. When applied to a very large number of data sets, a reliable method for inferring the pattern of descent among different languages would therefore rarely infer an incorrect pattern of descent while often inferring the correct pattern.
Another defense of this way of understanding reliability is due to Woodward (2000). A centerpiece of his counterfactual account is Bogen and Woodward’s (1988) distinction between data and phenomena. Roughly, data is the direct output of a particular experiment and measuring device. It is what provides evidence for the occurrence of a phenomenon. Phenomena, in turn, are stable and general processes that are inferred from the data. They are not local to a particular experimental setting, so it is phenomena that figure in predictions and explanations of comprehensive scientific theories. For example, we could collect data on the presence or absence of certain words in the vocabulary of different languages. The data could then provide evidence for a particular phenomenon, such as the pattern of descent among those languages. This pattern could then be explained by different theories of cultural evolution.
Following this taxonomy, Woodward takes an inference method to be reliable to the extent that there is a pattern of “counterfactual dependence” between the hypothesis that the inference method supports and the corresponding phenomenon. That is, an inference method takes available data and returns support for a hypothesis—a statement or claim about a phenomenon of interest. Now, to say that there is a pattern of “counterfactual dependence” between this hypothesis and the corresponding phenomenon is to say that: (a) if a given hypothesis were true of that phenomenon, then the method would probably support the hypothesis, and (b) if the hypothesis were false, then the method would probably not support the hypothesis. An inference method is therefore reliable if it has a high probability of supporting a hypothesis if and only if the hypothesis is true. In other words, if the method were to support the hypothesis, then the hypothesis would be true and the corresponding phenomenon would in fact occur; and if the hypothesis were true and the phenomenon it describes were in fact to occur, then the method would support the hypothesis.
Although Woodward’s point may seem complicated, it is in fact quite simple. His point is simply that reliability is a matter of error rates. This is because the probability that an inference method would support a hypothesis when the hypothesis is true is just the complement of the probability of a false negative: it is the probability of supporting a hypothesis when the hypothesis is true (a true positive). Similarly, the probability that an inference method would not support a hypothesis when the hypothesis is false is the complement of the probability of a false positive: it is the probability of rejecting a hypothesis when the hypothesis is false (a true negative). These probabilities are the true positive rate (sensitivity) and the true negative rate (specificity), which are the accuracy rates of the inference method (see Fig. 1). This is to say that Woodward takes a method to be reliable if it has high accuracy rates or, equivalently, low error rates. Woodward’s counterfactual account is therefore equivalent to understanding reliability in terms of low error rates.
Bovens and Hartmann (2004) defend a similar view in the context of data generation. In particular, they take reliability to measure the probability that a source produces data consistent with a hypothesis given that the hypothesis is true in comparison to the probability that the source produces the same data when the hypothesis is false. More formally, reliability is given by \(r=1-\frac{q}{p}\), where p is the probability of data consistent with the hypothesis given that the hypothesis is true and q is the probability of generating data consistent with a hypothesis given that the hypothesis is false (with \(0<p,q<1\)). A source is maximally unreliable when \(p=q\) so that it produces data consistent with a hypothesis with the same probability whether or not the hypothesis is true (in this case, \(r=0\)); a source is maximally reliable when \(p=1\) and \(q=0\) so that it always produces data consistent with a hypothesis when the hypothesis is true and never when the hypothesis is false (in this case, \(r=1\)). So r ranges between zero when reliability is low and unity when reliability is high. This is to say that a source is reliable in producing data to the extent that the error rate (q) is comparatively low—see also Merdes et al. (2021) for a recent discussion. Understanding reliability in terms of error is thus widespread in philosophical discussions about data production and data analysis.[4]
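In code, Bovens and Hartmann's measure is a one-liner. The function name and the particular probability values below are our own arbitrary illustrations:

```python
def reliability(p, q):
    """Bovens-Hartmann reliability r = 1 - q/p, where p is the probability
    of hypothesis-consistent data given a true hypothesis and q is the same
    probability given a false hypothesis (0 < p, q < 1 assumed)."""
    return 1 - q / p

# A source that produces confirming data with the same probability
# whether or not the hypothesis is true is maximally unreliable:
print(reliability(p=0.6, q=0.6))    # 0.0
# A source that almost always confirms true hypotheses and almost
# never confirms false ones approaches maximal reliability:
print(reliability(p=0.99, q=0.01))  # ≈ 0.99
```

Note that the limiting case \(p=1\), \(q=0\) yields \(r=1\) exactly, matching the description in the text.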
Error rates certainly capture a central feature of reliability. The reliability of an inference method should be inversely proportional to its error rates. But some qualifications are in order. First, an inference method may be reliable against some background conditions and yet unreliable against others. Similarly, a method may be reliable only if applied to data of sufficiently high quality, or data of a certain sort. For example, an inference method may be reliable when applied to a sufficiently large amount of vocabulary data but unreliable when supplied with a dataset that is too small. Or the method may be reliable if applied, say, to lexical data or slow-changing words but unreliable when applied to phonological data or fast-changing words. The reliability of an inference method therefore depends on its error rates across a range of background conditions.
Second, an inference method may select a hypothesis that is more or less precise than the hypothesis selected by some other method. When this occurs, both methods may differ in reliability even if they exhibit the same error rates or the same error rates across the same range of conditions. For example, an inference method may fail to identify that some languages are related at all. A second method may correctly infer that the languages are related but fail to recover how related they actually are. A third method may not only infer that the languages are related, but also correctly recover a high degree of relatedness among them. The first method should count as less reliable than the other two, whereas the third method is presumably more reliable than the first and the second. If this is right, then precision also matters when it comes to determining the reliability of inference methods. The reliability of an inference method should therefore be a function of how precise the hypotheses are that the method supports.
These qualifications suggest that reliability is a subtle notion that deserves careful scrutiny. But they also pick up important threads. First, the reliability of an inference method should be inversely proportional to its error rates. Second, the reliability of an inference method should also be sensitive to the range of conditions across which its error rates vary. Third, the reliability of an inference method should take into account how precise the hypotheses are that the method supports. Understanding reliability in terms of error rates explicitly takes into consideration the first of these requirements. But there is no reason to think that it could not accommodate the other two, so this is not really a problem. We will get into a more serious problem for this way of understanding reliability below. Before doing so, we turn in the next section to models that assess the reliability of inference methods. Given how prevalent such models are in cultural phylogenetics, case studies in this discipline provide a helpful starting point.
Inference methods in phylogenetics
Phylogenetic inference methods have a long and venerable tradition in biology, arguably dating as far back as Darwin’s (1859) depiction of a tree-like diagram to represent the pattern of descent among biological species. In the study of cultural evolution, phylogenetic approaches actually predate Darwin. Years before the publication of the Origin, Schleicher (1853) had already sketched a tree to represent the ancestral relations among Indo-European languages. Currently, phylogeneticists have at their disposal highly sophisticated methods for recovering phylogenetic relationships (Lemey et al. 2009). Many of these inference methods have also been employed in the study of culture—see Mesoudi (2011) for an overview of many interesting applications. A particularly successful case is in reconstructing the evolutionary history of human languages. Well-studied cases include the phylogenies of major linguistic groups, from the Indo-European (Gray and Atkinson 2003) and the Austronesian families (Gray et al. 2009), to the Bantu (Holden 2002), Semitic (Kitchen et al. 2009), and Sino-Tibetan languages (Zhang et al. 2019).[5]
Reliability models are quite widespread in cultural phylogenetics, as worries about the reliability of phylogenetic methods permeate much of the field. One reason for this is that phylogenetic methods were developed in biology under the assumption that evolution proceeds in a vertical line of descent, with traits inherited from parent to offspring. For the most part this is the case in biology, although recent work has revealed a surprisingly large number of cases that deviate from vertical transmission—see Doolittle and Bapteste (2007) for a still comprehensive review. But cultural evolution often does not proceed in this way. Cultural transmission is often horizontal, meaning that cultural traits can be transmitted between individuals that do not stand in a parent-offspring relationship (see Fig. 2). For example, eye color in humans is a biological trait that is transmitted vertically, as offspring usually resemble their parents. But this is not the case with cultural markers, such as accents. Accents tend to be horizontally acquired, with the accent of second-generation immigrants often being more similar to that of their peers than to that of their parents (Evans et al. 2007; Floccia et al. 2012). A particularly pressing problem in cultural phylogenetics is thus whether to trust inference methods that were developed on the assumption that inheritance is vertical.
Recent reliability models in cultural phylogenetics address precisely this question. A good example is Nunn et al. (2010). The main goal of their study was to assess the reliability of methods for detecting vertical and horizontal transmission. In particular, they chose to evaluate the reliability of two methods: the consistency index (\({\textit{CI}}\)), and the retention index (\({\textit{RI}}\)). For the sake of brevity, we focus here on \({\textit{CI}}\). \({\textit{CI}}\) measures deviation from vertical transmission in a given phylogenetic tree. It is given by \({\textit{CI}}= m/s\), where m is the minimum number of changes that a tree with as much vertical transmission as possible would require to explain the distribution of a trait, s is the minimum number of changes that a particular tree of interest would require, and the ratio m/s ranges between 0 and 1.
If the CI value for a trait in some phylogenetic tree is very high (i.e., close to 1), then the minimum number of changes in that tree comes very close to the minimum number of changes for a tree with the highest degree of vertical transmission. Here, “degree of vertical transmission” refers to the number of traits whose distribution in the tree can be explained by vertical transmission. So when the \({\textit{CI}}\) value is high, this is usually taken to mean that vertical transmission does a good job at accounting for the distribution of the trait in the tree under consideration. If the \({\textit{CI}}\) value is low (i.e., a lot less than 1), then some mechanism other than vertical transmission must partly account for the distribution of the trait. A high \({\textit{CI}}\) value therefore supports the hypothesis that transmission of the trait is mostly vertical, and a low \({\textit{CI}}\) value supports the hypothesis that transmission of the trait is not mostly vertical—either because of horizontal transmission or convergent evolution. Although \({\textit{CI}}\) cannot distinguish between horizontal transmission and convergent evolution, low \({\textit{CI}}\) values are usually taken as evidence of horizontal transmission on the assumption that convergent evolution is rare.
For example, consider a trait that takes one of three possible values (0, 1, or 2). Given the distribution of traits observed in Fig. 3, the minimum number of changes that a tree with as much vertical transmission as possible would require is 2, so \(m=2\). Now suppose that a particular tree is such that the minimum number of changes for that trait is also \(s=2\). This means that \({\textit{CI}}=1\), so that the tree can account for the distribution of traits without invoking any horizontal transmission or convergent evolution. If this particular tree is the correct one for the trait in question, then the high \({\textit{CI}}\) value provides support for the hypothesis that the trait evolved by vertical transmission. Now say that another tree of interest is such that the minimum number of changes for that trait is \(s=3\). In this case, \({\textit{CI}}=0.67\). This \({\textit{CI}}\) value indicates that at some point the trait was acquired independently of common descent, either because of horizontal transmission or convergent evolution. If this turns out to be the correct tree, then the low \({\textit{CI}}\) value supports the hypothesis that there was a high degree of horizontal transmission in the evolution of the trait under the assumption that convergent evolution is indeed rare (see Fig. 3).
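The worked example can be replayed in a few lines of code. The function name is ours; the numbers are those from the example above:

```python
def consistency_index(m, s):
    """CI = m / s, where m is the minimum number of character changes
    that a tree with as much vertical transmission as possible would
    require, and s is the minimum number of changes required by the
    tree of interest."""
    return m / s

# Tree whose minimum number of changes matches the best case (m = s = 2):
print(consistency_index(m=2, s=2))            # 1.0: vertical transmission suffices
# Tree requiring one extra change (s = 3):
print(round(consistency_index(m=2, s=3), 2))  # 0.67: some non-vertical process needed
```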
To assess how reliable \({\textit{CI}}\) is in detecting vertical and horizontal transmission in the context of cultural evolution, Nunn and colleagues first simulated societies arranged in a two-dimensional lattice. Each society occupied a different cell of the lattice and had a set number of traits. At every time step, there was a small probability that each society would either change, go extinct, donate traits to nearby populations, or colonize an adjacent cell with a daughter population. The traits of a society could therefore change when neighboring populations donated their traits, or change independently of other traits due to their intrinsic evolutionary rate. As societies changed and colonized empty cells with their descendants, the simulation also retained a virtual record of the phylogenetic relations among the societies populating the lattice.[6] With this record in hand, they were then able to compare the actual rates of vertical and horizontal transmission with the hypotheses that \({\textit{CI}}\) favored about the rates. Measuring \({\textit{CI}}\)-values for the entire simulated dataset, they were finally able to show that this method is very sensitive to high evolutionary rates. In particular, they found that \({\textit{CI}}\) is reliable in detecting a high degree of vertical transmission but not very reliable in detecting high rates of horizontal transmission. This is because low \({\textit{CI}}\)-values may be due to a high degree of horizontal transmission or to convergence driven by a high evolutionary rate.
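A drastically simplified sketch can convey the overall logic of such a simulation. Everything below—the lattice size, the probabilities, and the update rules—is invented for illustration and is not taken from Nunn et al.'s actual model (which also includes extinction, among other features):

```python
import random

random.seed(42)

SIZE = 5           # lattice is SIZE x SIZE
N_TRAITS = 10
P_MUTATE = 0.01    # per-trait chance of independent change
P_BORROW = 0.05    # per-trait chance of copying from an occupied neighbor
P_COLONIZE = 0.2   # chance of founding a daughter society in an empty cell

def neighbors(cell):
    x, y = cell
    cells = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in cells if 0 <= i < SIZE and 0 <= j < SIZE]

societies = {(0, 0): [0] * N_TRAITS}  # cell -> trait values
parent = {(0, 0): None}               # the 'virtual record' of descent

for _ in range(200):
    for cell in list(societies):
        traits = societies[cell]
        for k in range(N_TRAITS):
            if random.random() < P_MUTATE:
                traits[k] = random.randint(0, 2)    # independent change
            occupied = [n for n in neighbors(cell) if n in societies]
            if occupied and random.random() < P_BORROW:
                donor = random.choice(occupied)
                traits[k] = societies[donor][k]     # horizontal transmission
        empty = [n for n in neighbors(cell) if n not in societies]
        if empty and random.random() < P_COLONIZE:
            daughter = random.choice(empty)
            societies[daughter] = list(traits)      # vertical transmission
            parent[daughter] = cell

# Because every colonization is recorded in `parent`, the true phylogeny
# is known, so any inference method can be scored against it.
print(len(societies), "societies;",
      sum(v is not None for v in parent.values()), "descent links")
```

The key design point is the last comment: since the simulation keeps a complete record of descent, the "ground truth" that real-world data never provides is available by construction.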
Earlier work by Nunn et al. (2006), as well as independent studies by Greenhill et al. (2009), Currie et al. (2010), and Crema et al. (2014), pursues a similar strategy. But the sort of reliability model developed in these studies is by no means an idiosyncrasy of cultural phylogenetics. In biology, a rudimentary form of such models was perhaps the use of “caminalcules” in morphological phylogenetics. Caminalcules were fictional organisms with morphological traits whose evolutionary history was known only to their creator, who used them in early attempts to test general principles of phylogenetic inference (Camin and Sokal 1965). With the advent of molecular phylogenetics soon thereafter, full-fledged reliability models spread. Their use goes at least as far back as Felsenstein’s (1978) mathematical model showing that parsimony-based methods can be inconsistent under certain conditions. Model-based studies of reliability continue to be quite common—for a few recent examples, see Puttick et al. (2017) and Vernygora et al. (2020).
Aside from slight differences, reliability models across these closely related disciplines are thus variations on a common theme: first, reliability models use computational tools to generate data under known conditions; second, reliability models apply inference methods to the simulated datasets; third, reliability models conclude by assessing the reliability of the inference methods. Often their goal is to determine not only whether a single inference method is reliable, but also the conditions under which and the extent to which different methods are or are not reliable. In this way it is also possible to compare the reliability of different inference methods, identify what method performs best under what conditions, and thus choose which one to use.[7]
Reliability models
The case studies of the previous section provide some far-reaching lessons. First, they offer a glimpse into the general format that paradigmatic cases of reliability models generally take. As with many other model-based tools, reliability models typically start out by laying down a specific set of assumptions. These assumptions can be quite complex, so computers are very helpful in storing and manipulating them. It is thus no surprise that Nunn et al. (2010) made extensive use of computer simulations. The process of simulating data can also be quite opaque. So computers are again called for, assisting the derivation of results from the model assumptions—for example, what happens when cultural traits evolve according to different rates of vertical or horizontal transmission. The simulated data is then fed into the inference methods under investigation. With the aid of these methods, we can generate hypotheses about the conditions that produced the data. To be clear, the conditions are known because they are the assumptions that go into building the model. Since the conditions are known in advance, it is possible to determine with great precision how successful the different inference methods are at recovering them. In the studies by Nunn et al. (2010), the simulated dataset was used to estimate \({\textit{CI}}\)-values and thus generate hypotheses about the rates of vertical and horizontal transmission that produced the data.
Second, reliability models corroborate the view that reliability should be a function of error rates. Recall that to say that reliability is a function of error rates is to say that an inference method is reliable when it has low error rates. So for a method to be reliable, it must have a low probability of favoring a hypothesis when the hypothesis is false, and a low probability of not favoring a hypothesis when the hypothesis is correct. In the context of a reliability model, this is to say that a reliable method tends to favor a hypothesis that correctly describes the assumptions that went into building the model. For example, Nunn et al.’s (2010) models show that \({\textit{CI}}\) has low error rates when transmission is vertical: when transmission is vertical, \({\textit{CI}}\) values are generally high; when \({\textit{CI}}\) values are high, transmission is vertical. Given that high \({\textit{CI}}\) values favor the hypothesis that transmission is mostly vertical, their studies show that \({\textit{CI}}\) is indeed reliable in this case. But their models also show that this is not true when \({\textit{CI}}\) values are low. Although \({\textit{CI}}\) values are low if there is a high degree of horizontal transmission, it is not the case that transmission is for the most part horizontal if \({\textit{CI}}\) values are low. This is because low \({\textit{CI}}\) values can also be due to high evolutionary rates. So \({\textit{CI}}\) does not have low error rates in this case. As it turns out, \({\textit{CI}}\) may therefore be a reliable method for inferring that either horizontal transmission or convergent evolution took place. But \({\textit{CI}}\) is not a reliable method when it comes to inferring horizontal transmission specifically—at least not in the case of culture, where evolutionary rates can be much higher than in the biological realm.
However, reliability models highlight that reliability should also be a function of base rates—that is, the unconditional probability of possible states of the world. To see why, consider the following toy example. Suppose that most English words beginning with a certain prefix are loanwords from Arabic. Loanwords are words permanently adopted from one language into another without translation. Suppose also that 90% of the words beginning with the prefix are loanwords from Arabic, and that 10% of the words beginning with the prefix are not loanwords. Now consider two inference methods to detect whether a randomly selected English word beginning with that prefix is in fact a loanword from Arabic. Suppose further that the two inference methods have the following error rates. When a loanword begins with the prefix, method \(M_1\) erroneously supports the hypothesis that the word under investigation is not a loanword in 5% of the cases; when a word begins with the prefix but is not actually a loanword, method \(M_1\) erroneously supports the hypothesis that the word is a loanword in 20% of the cases. This is to say that \(M_1\) has a \(5\%\) false-negative rate and a 20% false-positive rate (see Fig. 4). As for method \(M_2\), let the error rates be the same as with method \(M_1\) but reversed: when a loanword begins with the prefix, method \(M_2\) supports the hypothesis that the word in question is not a loanword in 20% of the cases; and when a word begins with the prefix but is not a loanword, method \(M_2\) supports the hypothesis that the word is a loanword in 5% of the cases. So \(M_2\) has the same two error rates as \(M_1\) but swapped: a 20% false-negative rate and a 5% false-positive rate.
Because of how the example is constructed, both methods exhibit the same pair of error rates: the false-negative rate of method \(M_1\) equals the false-positive rate of method \(M_2\) (0.05), and the false-positive rate of method \(M_1\) equals the false-negative rate of method \(M_2\) (0.2). If reliability were simply a function of error rates, both methods should therefore count as equally reliable. Yet the overall probability that method \(M_1\) favors the incorrect hypothesis is \(0.9 \cdot 0.05 + 0.1 \cdot 0.2=0.065\), whereas the overall probability that method \(M_2\) favors the incorrect hypothesis is \(0.9 \cdot 0.2 + 0.1 \cdot 0.05=0.185\). The discrepancy is due to the large difference between the prior probability of a randomly picked word being a loanword and the prior probability of it not being a loanword. Since method \(M_1\) is less prone to errors than method \(M_2\) on the more common kind of word (loanwords), the overall error probability of method \(M_1\) is substantially lower than that of method \(M_2\). There is thus something valuable about the first method: it has a higher overall probability of supporting the correct hypothesis than the second one.
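The arithmetic of the toy example can be reproduced in a few lines. The following sketch is purely illustrative (the function name and structure are ours, not part of any method discussed here): it computes a method's overall error probability by weighting its false-negative and false-positive rates by the base rates of the two states of the world.

```python
# Overall error probability of a binary detection method, given its
# error rates and the base rate of the target class. Numbers follow
# the toy loanword example: 90% of prefix words are loanwords.

def overall_error(base_rate, false_neg, false_pos):
    """P(method favors the wrong hypothesis), averaged over states:
    P(loanword) * P(miss | loanword) + P(not) * P(false alarm | not)."""
    return base_rate * false_neg + (1 - base_rate) * false_pos

base_rate = 0.9  # prior probability that a prefix word is a loanword

m1 = overall_error(base_rate, false_neg=0.05, false_pos=0.20)
m2 = overall_error(base_rate, false_neg=0.20, false_pos=0.05)

print(round(m1, 3))  # 0.065
print(round(m2, 3))  # 0.185
```

Swapping the two error rates while holding the base rate fixed is enough to nearly triple the overall error probability, which is the asymmetry the example trades on.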
But error-based accounts of reliability ignore the difference between the two methods, lumping them together with respect to their reliability. What should we make of this? At worst, the case seems to suggest that we should stop thinking of reliability simply in terms of error rates. After all, this way of thinking would lead us to make an incorrect pronouncement in this case and take both methods to be equally reliable when in fact the first method is on average less prone to errors than the second one. Straightforward as it sounds, this may be too rash a conclusion. Someone could embrace the difference between the two methods as reflecting a valuable feature that \(M_1\) has, but one that ultimately differs from what they call “reliability”. Confusing as this use of terms may be, here is not the place to legislate on terminological preferences. More importantly, the point is just that we should not gloss over an important distinction: \(M_1\) differs from \(M_2\) in that it performs better overall. To insist that reliability is simply a matter of error rates would be to ignore this difference.
If this is right, then a satisfactory understanding of reliability should at the very least take more into account than just error rates. Attending to error rates alone ignores the effect that prior probabilities can have on the performance and therefore on the reliability of an inference method. But what could a satisfactory understanding of reliability look like? Although we will not be able to answer this question here, a promising way forward may be to take a Bayesian approach to reliability. Bayesians would have no difficulty incorporating the prior probability of hypotheses into a notion of reliability that does justice to the complexities raised above. From an information-theoretic perspective, another option might be to conceive of the reliability of an inference method as a measure of how much information the hypotheses selected by the method carry about the actual state of the world.
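To give a rough sense of how a Bayesian approach might incorporate base rates, one can ask how much a verdict from \(M_1\) should shift our credence that a word is a loanword. The sketch below simply applies Bayes’ theorem to the toy example’s numbers; it is one possible formalization, not one the accounts discussed here are committed to.

```python
# Posterior credence that a prefix word is a loanword, given M1's
# verdict. Illustrative only; numbers are from the toy example above.

def posterior(prior, p_verdict_if_true, p_verdict_if_false):
    """P(hypothesis | verdict), by Bayes' theorem."""
    evidence = prior * p_verdict_if_true + (1 - prior) * p_verdict_if_false
    return prior * p_verdict_if_true / evidence

prior = 0.9  # base rate of loanwords among prefix words

# M1 says "loanword": 95% of loanwords (1 - false-negative rate),
# but also 20% of non-loanwords (false-positive rate).
p_loan_given_yes = posterior(prior, 0.95, 0.20)

# M1 says "not a loanword": 5% of loanwords, 80% of non-loanwords.
p_loan_given_no = posterior(prior, 0.05, 0.80)

print(round(p_loan_given_yes, 3))  # 0.977
print(round(p_loan_given_no, 3))   # 0.36
```

Note that even a negative verdict from \(M_1\) leaves a 36% chance that the word is a loanword, precisely because the prior so strongly favors that hypothesis. On this way of measuring performance, base rates enter the assessment from the start.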
Yet another option would be to consider the overall error rate. There are some potential pitfalls with this approach, however. An obvious choice of measure for the overall error rate would be the expected error rate over possible states of the world. Attractive as this option may seem at first, such a measure would require taking into account base rates—i.e., the prior probabilities of the hypotheses that describe the possible states of the world. Those who defend understanding reliability in terms of error rates, such as Woodward and Mayo, typically claim quite explicitly that reliability is a property of inference methods alone. Since prior probabilities are not properties of an inference method, it is not clear that they would be willing to endorse this move. Be that as it may, reliability should take more into account than just error rates.
Before concluding, an important caveat: reliability models make the concern with prior probabilities so salient precisely because they are ill-suited to offer any real guidance on these matters. Reliability models are formidable tools with which to determine error rates. As such, they excel at determining the conditions under which particular inference methods are likely or unlikely to succeed. Although these models can therefore help us assess the reliability of an inference method given a wide range of conditions, reliability models cannot establish when or how often we should encounter these conditions in the real world. For that, there is no good substitute for empirical work—be it by directly analyzing real-world data, or by coupling real-world data with other models as in the case of generative models (Kandler and Powell 2018). It is thus not surprising that Nunn et al. (2010), for example, do not attempt to estimate how common horizontal transmission or high evolutionary rates really are in cultural evolution, choosing instead to simply report the performance of \({\textit{CI}}\) and other inference methods under these conditions. For present purposes, this means that reliability models should not be expected to help us adjudicate on the difficulty raised above for error-based accounts of reliability. In sum, reliability models may suggest that error rates are not all that matter when it comes to reliability. In line with the toy example above, they may also suggest that the prior probability of a hypothesis being true matters too. But reliability models cannot help us estimate these prior probabilities.
Conclusion
When philosophers consider questions about justification, they typically focus on the justification of theories and hypotheses. But just as crucial to science is the justification of inference methods. Here, we have shown that reliability models play an important role in justifying the use of inference methods. In the particular case of phylogenetics, reliability models allow us to determine the conditions under which inference methods that were developed in biology assuming that transmission is mostly vertical can and cannot be safely applied to the cultural realm where horizontal transmission and high evolutionary rates are the norm. As the studies reviewed above illustrate, there is unfortunately no simple way to resolve this issue: conditions vary depending on the method under consideration and system of interest. But this is itself a valuable lesson that reliability models allow us to draw—namely, that methods borrowed from biological phylogenetics to study culture should neither be rejected out of hand, nor endorsed tout court.
Reliability models in cultural and biological phylogenetics are also instructive for bringing the notion of reliability into sharper focus. For one, these models corroborate a common way of understanding reliability—namely, in terms of error rates. But their use also suggests that error rates are not all that matters. Indeed, reliability models help clarify that another important aspect to consider is the probability of each possible inference prior to the application of any inference method. This is therefore a question that a full understanding of reliability should address, although here we have only hinted at possible ways of doing so. If this is right, then philosophers might also want to revise or perhaps even abandon existing ways of understanding reliability. In either case, this could prove consequential for issues in philosophy and beyond that routinely invoke the notion of reliability—such as the purpose of robustness analysis (Levins 1966; Weisberg 2006; Wimsatt 1981), the nature of experimental replications (Machery 2020), and the justification of public trust in science (Irzik and Kurtulmus 2020; Wilholt 2013).
Notes
A related category of synthetic data models are generative models. Generative models simulate data by specifying a probabilistic causal mechanism of interest. By comparing the distribution of simulated data with data observed in the real world, it is then possible to ascertain how likely it is that the observed data was produced by the hypothesized causal mechanism—for a discussion of generative models in cultural evolution, see Kandler and Powell (2018).
The term “reliability” may also be used in a more general sense to denote the extent to which a study is replicable, reproducible, and reasoned. Here, we restrict our attention to the use of the term when it comes to the performance of inference methods.
A similar notion is also at the core of reliabilist theories of epistemic justification. Following Goldman (1976, 1979), proponents of reliabilism typically take a belief-forming process to be reliable just in case it tends to deliver true beliefs–for more recent discussions, see Alston (1995), Adler (2005), and Comesaña (2009, 2010); see also Goldman (1999) for the notion of reliable belief-forming processes in social epistemology.
Note that the trait-bearing entities in Nunn et al. (2010) are simulated societies rather than simulated individuals. This is important because it underscores the point that phylogenies can be built with either individuals or societies as trait bearers. In the case of accents and eye colors, the trait-bearing entities were individuals; in the case of languages, the trait-bearing entities are societies—i.e., communities of language users.
Bokulich (2020) also discusses models that simulate data. But her focus is on simulations whose purpose is to correct noisy or missing data. They therefore differ from reliability models in that their function is to calibrate methods of data production and data correction, and not to assess the reliability of inference methods.
References
Adler JE (2005) Reliabilist justification (or knowledge) as a good truth-ratio. Pac Philos Q 86(4):445–458
Alston WP (1995) How to think about reliability. Philos Top 23(1):1–29
Atkinson Q, Nicholls G, Welch D, Gray R (2005) From words to dates: water into wine, mathemagic or phylogenetic inference? Trans Philol Soc 103(2):193–219
Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C et al (2018) Redefine statistical significance. Nat Hum Behav 2(1):6–10
Bogen J, Woodward J (1988) Saving the phenomena. Philos Rev 97(3):303–352
Bokulich A (2020) Towards a taxonomy of the model-ladenness of data. Philos Sci 87(5):793–806
Borgerhoff Mulder M, Nunn CL, Towner MC (2006) Cultural macroevolution and the transmission of traits. Evol Anthropol Issues News Rev 15(2):52–64
Bouckaert R, Lemey P, Dunn M, Greenhill SJ, Alekseyenko AV, Drummond AJ, Gray RD, Suchard MA, Atkinson QD (2012) Mapping the origins and expansion of the Indo-European language family. Science 337(6097):957–960
Bovens L, Hartmann S (2004) Bayesian epistemology. OUP Oxford, Oxford
Boyd R, Richerson PJ (1988) Culture and the evolutionary process. University of Chicago Press, Chicago
Camin JH, Sokal RR (1965) A method for deducing branching sequences in phylogeny. Evolution 311–326
Cavalli-Sforza LL, Feldman MW (1981) Cultural transmission and evolution: a quantitative approach. Princeton University Press, Princeton
Claidière N, André J-B (2012) The transmission of genes and culture: a questionable analogy. Evol Biol 39(1):12–24
Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evol Hum Behav 27(3):169–184
Comesaña J (2009) What lottery problem for reliabilism? Pac Philos Q 90(1):1–20
Comesaña J (2010) Evidentialist reliabilism. Noûs 44(4):571–600
Crema ER, Kerig T, Shennan S (2014) Culture, space, and metapopulation: a simulation-based study for evaluating signals of blending and branching. J Archaeol Sci 43:289–298
Currie TE, Greenhill SJ, Mace R (2010) Is horizontal transmission really a problem for phylogenetic comparative methods? A simulation study using continuous cultural traits. Philos Trans R Soc B Biol Sci 365(1559):3903–3912
Darwin C (1859) On the origin of species by means of natural selection. Murray
Diamond J, Bellwood P (2003) Farmers and their languages: the first expansions. Science 300(5619):597–603
Doolittle WF, Bapteste E (2007) Pattern pluralism and the tree of life hypothesis. Proc Natl Acad Sci 104(7):2043–2049
Evans B, Mistry A, Moreiras C (2007) An acoustic study of first- and second-generation Gujarati immigrants in Wembley: evidence for accent convergence. In: Proceedings of the 16th international congress of phonetic sciences (ICPhS XVI), pp 1741–1744
Evans CL, Greenhill SJ, Watts J, List J-M, Botero CA, Gray RD, Kirby KR (2021) The uses and abuses of tree thinking in cultural evolution. Philos Trans R Soc B 376(1828):20200056
Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27(4):401–410
Floccia C, Delle Luche C, Durrant S, Butler J, Goslin J (2012) Parent or community: where do 20-month-olds exposed to two accents acquire their representation of words? Cognition 124(1):95–100
Goldman AI (1976) Discrimination and perceptual knowledge. J Philos 73:771–791
Goldman AI (1979) What is justified belief? Justification and knowledge. Springer, Berlin, pp 1–23
Goldman AI (1999) Knowledge in a social world. Oxford University Press
Gould SJ (1992) Bully for brontosaurus: reflections in natural history. Norton, New York
Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965):435–439
Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in pacific settlement. Science 323(5913):479–483
Greenhill SJ, Currie TE, Gray RD (2009) Does horizontal transmission invalidate cultural phylogenies? Proc R Soc B Biol Sci 276(1665):2299–2306
Holden CJ (2002) Bantu language trees reflect the spread of farming across sub-Saharan Africa: a maximum-parsimony analysis. Proc R Soc Lond Ser B Biol Sci 269(1493):793–799
Hull DL (1988) Science as a process: an evolutionary account of the social and conceptual development of science. University of Chicago Press, Chicago
Irzik G, Kurtulmus F (2020) What is epistemic public trust in science? Br J Philos Sci
Kandler A, Powell A (2018) Generative inference for cultural evolution. Philos Trans R Soc B Biol Sci 373(1743):20170056
Kitchen A, Ehret C, Assefa S, Mulligan CJ (2009) Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Proc R Soc B Biol Sci 276(1668):2703–2710
Lakens D, Adolfi FG, Albers CJ, Anvari F, Apps MA, Argamon SE, Baguley T, Becker RB, Benning SD, Bradford DE et al (2018) Justify your alpha. Nat Hum Behav 2(3):168–171
Lemey P, Salemi M, Vandamme A-M (2009) The phylogenetic handbook: a practical approach to phylogenetic analysis and hypothesis testing. Cambridge University Press, Cambridge
Levins R (1966) The strategy of model building in population biology. Am Sci 54(4):421–431
Machery E (2020) What is a replication? Philos Sci 87(4):545–567
Mayo DG (1996) Error and the growth of experimental knowledge. University of Chicago Press, Chicago
Mayo DG (2018) Statistical inference as severe testing. Cambridge University Press, Cambridge
Merdes C, Von Sydow M, Hahn U (2021) Formal models of source reliability. Synthese 198(23):5773–5801
Mesoudi A (2011) Cultural evolution: how Darwinian theory can explain human culture and synthesize the social sciences. University of Chicago Press, Chicago
Nunn CL, Mulder MB, Langley S (2006) Comparative methods for studying cultural trait evolution: a simulation study. Cross-Cult Res 40(2):177–209
Nunn CL, Arnold C, Matthews L, Mulder MB (2010) Simulating trait evolution for cross-cultural comparison. Philos Trans R Soc B Biol Sci 365(1559):3807–3819
Puttick MN, O’Reilly JE, Tanner AR, Fleming JF, Clark J, Holloway L, Lozano-Fernandez J, Parry LA, Tarver JE, Pisani D et al (2017) Uncertain-tree: discriminating among competing approaches to the phylogenetic analysis of phenotype data. Proc R Soc B Biol Sci 284(1846):20162290
Schleicher A (1853) Die ersten Spaltungen des indogermanischen Urvolkes. Allgemeine Monatsschrift für Wissenschaft und Literatur 3:786–787
Sober E (1991) Reconstructing the past: parsimony, evolution, and inference. MIT Press, Cambridge
Tëmkin I, Eldredge N (2007) Phylogenetics and material cultural evolution. Curr Anthropol 48(1):146–154
Vernygora OV, Simões TR, Campbell EO (2020) Evaluating the performance of probabilistic algorithms for phylogenetic analysis of big morphological datasets: a simulation study. Syst Biol 69(6):1088–1105
Weisberg M (2006) Robustness analysis. Philos Sci 73(5):730–742
Wilholt T (2013) Epistemic trust in science. Br J Philos Sci 64(2):233–253
Wimsatt WC (1981) Robustness, reliability, and overdetermination. In: Scientific inquiry and the social sciences, pp 124–163
Woodward J (2000) Data, phenomena, and reliability. Philos Sci 67:S163–S179
Zhang M, Yan S, Pan W, Jin L (2019) Phylogenetic evidence for Sino-Tibetan origin in northern China in the Late Neolithic. Nature 569(7754):112–115
Acknowledgements
I would like to thank Hannah Read, Michael Weisberg, Robert Brandon, Alex Rosenberg, Kevin Hoover, Carlotta Pavese, Adrian Currie, and Gareth Roberts for extremely helpful feedback on earlier versions of this paper.
Ethics declarations
Conflict of interest
We have no conflict of interest to disclose.
Ventura, R. Reliability models in cultural phylogenetics. Biol Philos 38, 19 (2023). https://doi.org/10.1007/s10539-023-09900-6