Introduction

Cultural phylogenetics is the study of evolutionary relationships among cultural traits. In recent years, one of the most notable achievements in cultural phylogenetics was the reconstruction of the evolutionary history of several language families. An early example is Holden (2002), who found that the expansion of the Bantu language family followed the spread of farming across sub-Saharan Africa from 5000 to 1500 years ago. Another remarkable case is a series of studies by Gray and Atkinson (2003), Atkinson et al. (2005), and Bouckaert et al. (2012) on the origins of the Indo-European language family. Weighing in on what has been described as “the most intensively studied, yet still the most recalcitrant, problem of historical linguistics” (Diamond and Bellwood 2003, p. 601), they found strong support for the hypothesis that all Indo-European languages originated in Anatolia about 8000 to 9500 years ago, rather than in the Pontic Steppe about 6000 years ago. Similar methods have now also been used to recover the phylogenies of Austronesian (Gray et al. 2009), Semitic (Kitchen et al. 2009), and Sino-Tibetan languages (Zhang et al. 2019).

However, phylogenetic methods were originally developed to infer the evolutionary history of biological species. Despite striking parallels, cultural and biological evolution do not always proceed according to the same principles (Boyd and Richerson 1988; Cavalli-Sforza and Feldman 1981; Mesoudi 2011). For the most part, biological evolution unfolds in a vertical line of descent, as traits are inherited from parent to offspring. But cultural transmission is often horizontal, with cultural traits transmitted between individuals that do not stand in a parent-offspring relationship. Because of this, there have long been worries that phylogenetic methods are not applicable to the cultural realm. An early and vivid expression of this can be seen in Gould (1992). He writes: “The basic topologies of biological and cultural change are completely different. Biological evolution is a system of constant divergence without subsequent joining of branches. (...) In human history, transmission across lineages is, perhaps, the major source of cultural change” (p. 64). Since then, concerns about the viability of cultural phylogenetics as a discipline have persisted (Claidière and André 2012; Tëmkin and Eldredge 2007)—see also Borgerhoff Mulder et al. (2006) for a thorough review, as well as Collard et al. (2006) and Evans et al. (2021) for a general defense of the use of phylogenetic methods in the study of cultural evolution.

So what, if anything, could justify the use of phylogenetic methods to reconstruct the evolutionary history of human languages and other cultural traits? In this paper, we argue that some models play an underappreciated role in answering this question. These are what we call reliability models. Reliability models are unique in that their main use is to assess the performance of inference methods, where “performance” is usually understood in terms of how reliable inference methods are. These models are common in cultural phylogenetics and related disciplines because it is often impractical, if not ethically objectionable, to conduct experimental or long-term field studies on cultural traits to determine how inference methods perform. Models do not suffer from these shortcomings. To assess the performance of phylogenetic methods, reliability models rely on computational methods to simulate data under different initial conditions. Inference methods are then applied to the simulated data so as to recover the initial conditions used in the simulation. Since the initial conditions are programmed into a computer, it is possible to keep a record of these conditions. As a result, it is also possible to determine how successful inference methods are at recovering them. This in turn permits us to assess the quality and estimate the reliability of the inference methods. According to Bokulich’s (2020) taxonomy of data models, reliability models therefore belong to the category of “synthetic data” models, as they generate data with no direct input from the world.

Despite addressing such a fundamental question, reliability models have not received sufficient attention in philosophical debates about modeling. In an attempt to redress this issue, we begin section “Error-based accounts of reliability” by first clarifying what is at stake in debates about reliability. To do so, we draw on accounts by Mayo (1996, 2018), Woodward (2000), and Bovens and Hartmann (2004) according to which reliability is largely a matter of low error rates. In section “Inference methods in phylogenetics”, we take a careful look at reliability models in cultural phylogenetics and related fields, paying special attention to studies conducted by Nunn et al. (2010) and others. We then show in section “Reliability models” that reliability models play an important role in justifying the use of inference methods in cultural phylogenetics. We also show that in some ways reliability models are in line with error-based accounts of reliability but that in other ways reliability models give us reasons to question error-based accounts. This is because by understanding reliability simply in terms of error rates we run the risk of overlooking the importance of base rates—i.e., the unconditional probability of possible states of the world—when assessing the performance of inference methods. After hinting at possible ways forward, we conclude in section “Conclusion” with some brief remarks on the import of reliability models to debates about related notions—such as the purpose of robustness analysis, the nature of experimental replications, and the justification of public trust in science.

Error-based accounts of reliability

To understand the use of reliability models, it is important to first get clear on the notion of reliability. When talking about inference making in particular, an intuitive way to think about reliability is in terms of error. Error can consist in accepting a false statement or in rejecting a true statement. In statistical inference, error rates are therefore given by the probability of accepting a false statement and the probability of rejecting a true statement. For a simple example, consider a method to infer the pattern of descent among Dutch, English, and German. Phylogeneticists generally agree that English split from Dutch and German before Dutch split from German, meaning that Dutch and German are more closely related to each other than either language is to English (Gray and Atkinson 2003). Now suppose we were to apply the method to vocabulary, grammatical, or phonological data about these languages. According to an understanding of reliability in terms of error, the method would be reliable if it were unlikely to infer an incorrect pattern of descent—e.g., that the linguistic ancestor of both Dutch and English split from the ancestor of German before Dutch and English became separate languages. Similarly, the method would be reliable if it were likely to infer the correct pattern—namely, that the ancestor of both Dutch and German split from the ancestor of English first. On this common way of understanding reliability, the reliability of an inference method consists in low error rates or, equivalently, in high accuracy rates.

A prominent discussion of reliability along these lines can be found in Mayo (1996, 2018). One of Mayo’s main goals is to vindicate patterns of statistical inference that are widespread across the sciences. A particularly common one is hypothesis testing. When testing a hypothesis, a standard procedure is to first formulate a null hypothesis together with a statistical model of the phenomenon under investigation. This allows us to derive the probability of observing different sets of data under the assumption that the null hypothesis is correct. Upon collecting appropriate data, we can then calculate the probability of observing data at least as extreme as the actual data conditional on the assumption that the null hypothesis is correct. If this probability drops below a certain threshold, we can then say that the test rejects the null hypothesis on the grounds that the observed data would be too unlikely if the hypothesis were in fact correct. Otherwise, we say that the test fails to reject the null hypothesis.
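
To make the procedure concrete, here is a minimal sketch in Python (our own illustration, not drawn from any of the studies discussed): it simulates the distribution of outcomes under a simple null hypothesis and computes the probability of observing data at least as extreme as the actual data.

```python
import numpy as np

# Minimal sketch of null-hypothesis testing by simulation. The null
# hypothesis is that a coin is fair (p = 0.5); we observe 62 heads in 100
# flips and ask how often data at least this extreme would arise by chance.
rng = np.random.default_rng(0)

n_flips, observed_heads = 100, 62
n_sims = 100_000

# Distribution of head counts under the assumption that the null is correct.
sim_heads = rng.binomial(n_flips, 0.5, size=n_sims)

# Two-sided p-value: the probability of a result at least as far from the
# expected 50 heads as the observed result.
p_value = np.mean(np.abs(sim_heads - n_flips / 2) >= abs(observed_heads - n_flips / 2))

alpha = 0.05  # conventional rejection threshold
print(f"p = {p_value:.4f}; reject the null: {p_value < alpha}")
```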

Within this framework, Mayo takes an inference method to be reliable if it has low error probabilities. Error probabilities measure the chance that some error might occur when applying the inference method. They give us a summary description of the multifarious ways in which things might go wrong when making an inference, as error can creep in from a variety of different sources. In the context of hypothesis testing, error probabilities often represent the chance that the test procedure would lead us astray by rejecting a hypothesis when the hypothesis is correct or by failing to reject a hypothesis when the hypothesis is false. These are also called false positives and false negatives, or type-I and type-II errors (see Fig. 1). In the example above, a false positive would be to reject the hypothesis that, say, the ancestor of Dutch and German split from the ancestor of English first when that is the actual pattern of descent among these languages. A false negative would be to fail to reject that hypothesis when that is not the actual pattern of descent. For Mayo, to say that a method is reliable is thus to say that it has low rates of both types of error. When applied to a very large number of data sets, a reliable method for inferring the pattern of descent among different languages would therefore rarely infer an incorrect pattern of descent while often inferring the correct pattern.

Fig. 1 A simple diagram to clarify the notions of false-positive rate, false-negative rate, specificity, and sensitivity. \(S_1\) and \(S_2\) represent two possible states of the world, with \(H_1\) and \(H_2\) representing hypotheses that describe the corresponding states. Suppose that \(S_1\) is the actual state of the world, so that \(H_1\) is true. Then q represents the sensitivity or the true-positive rate of a method that infers the state of the world and p represents the specificity or the true-negative rate of the method; conversely, \(1-q\) and \(1-p\) represent the error rates, i.e. the false-negative or type-II error rate and the false-positive or type-I error rate, respectively

Another defense of this way of understanding reliability is due to Woodward (2000). A centerpiece of his counterfactual account is Bogen and Woodward’s (1988) distinction between data and phenomena. Roughly, data is the direct output of a particular experiment or measuring device. It is what provides evidence for the occurrence of a phenomenon. Phenomena, in turn, are stable and general processes that are inferred from the data. They are not local to a particular experimental setting, so it is phenomena that figure in the predictions and explanations of comprehensive scientific theories. For example, we could collect data on the presence or absence of certain words in the vocabulary of different languages. The data could then provide evidence for a particular phenomenon, such as the pattern of descent among those languages. This pattern could then be explained by different theories of cultural evolution.

Following this taxonomy, Woodward takes an inference method to be reliable to the extent that there is a pattern of “counterfactual dependence” between the hypothesis that the inference method supports and the corresponding phenomenon. That is, an inference method takes available data and returns support for a hypothesis—a statement or claim about a phenomenon of interest. To say that there is a pattern of “counterfactual dependence” between this hypothesis and the corresponding phenomenon is to say that: (a) if the hypothesis were true of that phenomenon, then the method would probably support the hypothesis; and (b) if the hypothesis were false, then the method would probably not support the hypothesis. An inference method is therefore reliable if it has a high probability of supporting a hypothesis if and only if the hypothesis is true—in other words, if the method were to support the hypothesis, then the hypothesis would be true and the corresponding phenomenon would in fact occur; and if the hypothesis were true and the phenomenon described by the hypothesis were in fact to occur, then the method would support the hypothesis.

Although Woodward’s point may seem complicated, it is in fact quite simple: reliability is a matter of error rates. This is because the probability that an inference method would support a hypothesis when the hypothesis is true is just the complement of the probability of a false negative: it is the probability of not rejecting a hypothesis when the hypothesis is true (a true positive). Similarly, the probability that an inference method would not support a hypothesis when the hypothesis is false is the complement of the probability of a false positive: it is the probability of rejecting a hypothesis when the hypothesis is false (a true negative). These probabilities are the true-positive rate (sensitivity) and the true-negative rate (specificity), which together give the accuracy rates of the inference method (see Fig. 1). This is to say that Woodward takes a method to be reliable if it has high accuracy rates or, equivalently, low error rates. Woodward’s counterfactual account is therefore equivalent to understanding reliability in terms of low error rates.
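
To make the equivalence explicit, Woodward’s two conditions can be written in the notation of Fig. 1 (our rendering, with H the hypothesis under test):

\[
\Pr(\text{support } H \mid H \text{ true}) = q = 1 - \text{false-negative rate}, \qquad \Pr(\text{no support for } H \mid H \text{ false}) = p = 1 - \text{false-positive rate}.
\]

High values of both q (sensitivity) and p (specificity) thus amount to low rates of both kinds of error.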

Bovens and Hartmann (2004) defend a similar view in the context of data generation. In particular, they take reliability to measure the probability that a source produces data consistent with a hypothesis given that the hypothesis is true, in comparison to the probability that the source produces the same data when the hypothesis is false. More formally, reliability is given by \(r=1-\frac{q}{p}\), where p is the probability of generating data consistent with the hypothesis given that the hypothesis is true and q is the probability of generating data consistent with the hypothesis given that the hypothesis is false (with \(0<p,q<1\)). A source is maximally unreliable when \(p=q\), so that it produces data consistent with a hypothesis with the same probability whether or not the hypothesis is true (in this case, \(r=0\)); a source is maximally reliable when \(p=1\) and \(q=0\), so that it always produces data consistent with a hypothesis when the hypothesis is true and never when the hypothesis is false (in this case, \(r=1\)). So r ranges between zero when reliability is low and unity when reliability is high. This is to say that a source is reliable in producing data to the extent that the error rate (q) is comparatively low—see also Merdes et al. (2021) for a recent discussion. Understanding reliability in terms of error is thus common in philosophical discussions about data production and data analysis.
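
A quick computational illustration (our own, not Bovens and Hartmann’s) shows how the measure behaves:

```python
def bh_reliability(p: float, q: float) -> float:
    """Bovens-Hartmann reliability r = 1 - q/p.

    p: probability of hypothesis-consistent data when the hypothesis is true.
    q: probability of hypothesis-consistent data when the hypothesis is false.
    Assumes 0 <= q <= p <= 1 with p > 0, so r lies in [0, 1].
    """
    if not (0 <= q <= p <= 1 and p > 0):
        raise ValueError("expected 0 <= q <= p <= 1 with p > 0")
    return 1 - q / p

# A source insensitive to the truth of the hypothesis is maximally unreliable:
print(bh_reliability(p=0.5, q=0.5))  # 0.0
# A source that never produces hypothesis-consistent data when the hypothesis
# is false is maximally reliable:
print(bh_reliability(p=1.0, q=0.0))  # 1.0
# An intermediate case:
print(bh_reliability(p=0.9, q=0.3))  # 0.666...
```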

Error rates certainly capture a central feature of reliability. The reliability of an inference method should be inversely related to its error rates. But some qualifications are in order. First, an inference method may be reliable against some background conditions and yet unreliable against others. Similarly, a method may be reliable only if applied to data of sufficiently high quality, or to data of a certain sort. For example, an inference method may be reliable when applied to a sufficiently large amount of vocabulary data but unreliable when supplied with a dataset that is too small. Or the method may be reliable if applied, say, to lexical data or slow-changing words but unreliable when applied to phonological data or fast-changing words. The reliability of an inference method therefore depends on its error rates across a range of background conditions.

Second, an inference method may select a hypothesis that is more or less precise than the hypothesis selected by some other method. When this occurs, the two methods may differ in reliability even if they exhibit the same error rates, or even the same error rates across the same range of conditions. For example, an inference method may fail to identify that some languages are related at all. A second method may correctly infer that the languages are related but fail to recover how closely related they actually are. A third method may not only infer that the languages are related, but also correctly recover the high degree of relatedness among them. The first method should count as less reliable than the other two, and the third method is presumably more reliable than both the first and the second. If this is right, then precision also matters when it comes to determining the reliability of inference methods. The reliability of an inference method should therefore also be a function of how precise the hypotheses are that the method supports.

These qualifications suggest that reliability is a subtle notion that deserves careful scrutiny. But they also pick up important threads. First, the reliability of an inference method should be inversely related to its error rates. Second, the reliability of an inference method should be sensitive to the range of conditions across which its error rates vary. Third, the reliability of an inference method should take into account how precise the hypotheses are that the method supports. Understanding reliability in terms of error rates explicitly takes into consideration the first of these requirements. But there is no reason to think that it could not accommodate the other two, so this is not really a problem. We will get to a more serious problem for this way of understanding reliability below. Before doing so, we turn in the next section to models that assess the reliability of inference methods. Given how prevalent such models are in cultural phylogenetics, case studies in this discipline provide a helpful starting point.

Inference methods in phylogenetics

Phylogenetic inference methods have a long and venerable tradition in biology, arguably dating as far back as Darwin’s (1859) depiction of a tree-like diagram to represent the pattern of descent among biological species. In the study of cultural evolution, phylogenetic approaches actually predate Darwin. Years before the publication of the Origin, Schleicher (1853) had already sketched a tree to represent the ancestral relations among Indo-European languages. Currently, phylogeneticists have at their disposal highly sophisticated methods for recovering phylogenetic relationships (Lemey et al. 2009). Many of these inference methods have also been employed in the study of culture—see Mesoudi (2011) for an overview of many interesting applications. A particularly successful case is the reconstruction of the evolutionary history of human languages. Well-studied cases include the phylogenies of major linguistic groups, from the Indo-European (Gray and Atkinson 2003) and the Austronesian families (Gray et al. 2009), to the Bantu (Holden 2002), Semitic (Kitchen et al. 2009), and Sino-Tibetan languages (Zhang et al. 2019).

Reliability models are quite widespread in cultural phylogenetics, as worries about the reliability of phylogenetic methods permeate much of the field. One reason for this is that phylogenetic methods were developed in biology under the assumption that evolution proceeds in a vertical line of descent, with traits inherited from parent to offspring. For the most part this is the case in biology, although recent work has revealed a surprisingly large number of cases that deviate from vertical transmission—see Doolittle and Bapteste (2007) for a still-comprehensive review. But cultural evolution often does not proceed in this way. Cultural transmission is often horizontal, meaning that cultural traits can be transmitted between individuals that do not stand in a parent-offspring relationship (see Fig. 2). For example, eye color in humans is a biological trait that is transmitted vertically, as offspring usually resemble their parents. But this is not the case with cultural markers, such as accents. Accents tend to be horizontally acquired, with the accent of second-generation immigrants often being more similar to that of their peers than to that of their parents (Evans et al. 2007; Floccia et al. 2012). A particularly pressing problem in cultural phylogenetics is thus whether to trust inference methods that were developed on the assumption that inheritance is vertical.

Fig. 2 Vertical and horizontal modes of transmission. a Vertical transmission occurs when transmission of a trait is between parent and offspring. b Horizontal transmission occurs when transmission of a trait is between individuals that do not stand in a parent-offspring relation. Circles represent individuals; arrows represent descent from parent to offspring; color fillings represent different trait states; dotted lines represent transmission of traits

Recent reliability models in cultural phylogenetics address precisely this question. A good example is Nunn et al. (2010). The main goal of their study was to assess the reliability of methods for detecting vertical and horizontal transmission. In particular, they chose to evaluate the reliability of two methods: the consistency index (\({\textit{CI}}\)) and the retention index (\({\textit{RI}}\)). For the sake of brevity, we focus here on \({\textit{CI}}\). \({\textit{CI}}\) measures deviation from vertical transmission in a given phylogenetic tree. It is given by \({\textit{CI}}= m/s\), where m is the minimum number of changes that a tree with as much vertical transmission as possible would require to explain the distribution of a trait, s is the minimum number of changes that the particular tree of interest would require, and the ratio m/s ranges between 0 and 1.

If the \({\textit{CI}}\) value for a trait in some phylogenetic tree is very high (i.e., close to 1), then the minimum number of changes in that tree comes very close to the minimum number of changes for a tree with the highest degree of vertical transmission. Here, “degree of vertical transmission” refers to the number of traits whose distribution in the tree can be explained by vertical transmission. So when the \({\textit{CI}}\) value is high, this is usually taken to mean that vertical transmission does a good job at accounting for the distribution of the trait in the tree under consideration. If the \({\textit{CI}}\) value is low (i.e., well below 1), then some mechanism other than vertical transmission must partly account for the distribution of the trait. A high \({\textit{CI}}\) value therefore supports the hypothesis that transmission of the trait is mostly vertical, and a low \({\textit{CI}}\) value supports the hypothesis that transmission of the trait is not mostly vertical—either because of horizontal transmission or because of convergent evolution. Although \({\textit{CI}}\) cannot distinguish between horizontal transmission and convergent evolution, low \({\textit{CI}}\) values are usually taken as evidence of horizontal transmission on the assumption that convergent evolution is rare.

For example, consider a trait that takes one of three possible values (0, 1, or 2). Given the distribution of traits observed in Fig. 3, the minimum number of changes that a tree with as much vertical transmission as possible would require is 2, so \(m=2\). Now suppose that a particular tree is such that the minimum number of changes for that trait is also \(s=2\). This means that \({\textit{CI}}=1\), so that the tree can account for the distribution of traits without invoking any horizontal transmission or convergent evolution. If this particular tree is the correct one for the trait in question, then the high \({\textit{CI}}\) value provides support for the hypothesis that the trait evolved by vertical transmission. Now say that another tree of interest is such that the minimum number of changes for that trait is \(s=3\). In this case, \({\textit{CI}}=0.67\). This \({\textit{CI}}\) value indicates that at some point the trait was acquired independently of common descent, either because of horizontal transmission or convergent evolution. If this turns out to be the correct tree, then the low \({\textit{CI}}\) value supports the hypothesis that there was a high degree of horizontal transmission in the evolution of the trait under the assumption that convergent evolution is indeed rare (see Fig. 3).

Fig. 3 Minimum number of changes on different phylogenetic trees. a On this tree, the minimum number of changes to account for the distribution of the three-state trait depicted here (white, gray, and black) is three, so \(s=3\). b On this tree, the minimum number of changes to account for the same trait is two, so \(s=2\). Circles represent taxa; lines represent descent; circle shades represent different trait states; black bars represent change in trait state
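
To make the computation concrete, here is a minimal sketch of how s, m, and \({\textit{CI}}\) can be computed for a single trait via Fitch’s parsimony algorithm. The code is our own illustration, not Nunn et al.’s: the leaf names, topologies, and trait values are hypothetical stand-ins that merely mirror the structure of Fig. 3.

```python
# A minimal sketch of how s, m, and CI can be computed for a single trait on
# a rooted binary tree, using Fitch's parsimony algorithm to count the
# minimum number of state changes.

def fitch(tree, leaf_states):
    """Return the minimum number of changes for a trait on a rooted tree.

    tree: nested 2-tuples, e.g. (("A", "B"), ("C", "D")); leaves are strings.
    leaf_states: dict mapping each leaf name to its trait state.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):               # leaf: a singleton state set
            return {leaf_states[node]}
        left, right = (visit(child) for child in node)
        if left & right:                        # children agree: intersect
            return left & right
        changes += 1                            # children disagree: one change
        return left | right

    visit(tree)
    return changes

def consistency_index(tree, leaf_states):
    """CI = m/s, where m is the minimum number of changes the trait requires
    on any tree (number of states minus one) and s is the minimum number of
    changes it requires on the given tree."""
    m = len(set(leaf_states.values())) - 1
    s = fitch(tree, leaf_states)
    return m / s

# A three-state trait (0 = white, 1 = gray, 2 = black) on six taxa. The names
# and topologies are hypothetical; they mirror the structure of Fig. 3 rather
# than reproduce its exact trees.
states = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2, "F": 2}
tree_b = (("A", "B"), (("C", "D"), ("E", "F")))   # s = 2, so CI = 2/2 = 1
tree_a = ((("A", "C"), ("B", "D")), ("E", "F"))   # s = 3, so CI = 2/3 ≈ 0.67

print(consistency_index(tree_b, states))  # 1.0
print(consistency_index(tree_a, states))  # 0.666...
```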

To assess how reliable \({\textit{CI}}\) is in detecting vertical and horizontal transmission in the context of cultural evolution, Nunn and colleagues first simulated societies arranged in a two-dimensional lattice. Each society occupied a different cell of the lattice and had a set number of traits. At every time step, there was a small probability that each society would either change, go extinct, donate traits to nearby populations, or colonize an adjacent cell with a daughter population. The traits of a society could therefore change when neighboring populations donated their traits, or change independently of other societies due to their intrinsic evolutionary rate. As societies changed and colonized empty cells with their descendants, the simulation also retained a virtual record of the phylogenetic relations among the societies populating the lattice. With this record in hand, Nunn and colleagues were then able to compare the actual rates of vertical and horizontal transmission with the hypotheses that \({\textit{CI}}\) favored about those rates. Measuring \({\textit{CI}}\)-values for the entire simulated dataset, they were finally able to show that this method is very sensitive to high evolutionary rates. In particular, they found that \({\textit{CI}}\) is reliable in detecting a high degree of vertical transmission but not very reliable in detecting high rates of horizontal transmission. This is because low \({\textit{CI}}\)-values may be due to a high degree of horizontal transmission or to convergence driven by a high evolutionary rate.
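
The following is a highly simplified sketch of this kind of lattice simulation. It is our own illustration: the parameter values, trait representation, and update rules are hypothetical choices meant only to convey the structure of the model, not to reproduce Nunn et al.’s actual implementation.

```python
import random

# A highly simplified sketch of a lattice simulation of cultural evolution
# with vertical and horizontal transmission. All parameter values, the trait
# representation, and the update rules are illustrative choices.

SIZE = 10            # the lattice is SIZE x SIZE cells
N_TRAITS = 10        # number of traits per society
N_STATES = 5         # possible states per trait
P_MUTATE = 0.01      # per-trait chance of independent change per time step
P_DONATE = 0.05      # chance of donating a trait to a neighbor per time step
P_EXTINCT = 0.005    # chance that a society goes extinct per time step
P_COLONIZE = 0.05    # chance of colonizing an empty adjacent cell per step

random.seed(1)

# Each cell holds None or a society, represented as (trait list, society id).
grid = {(r, c): None for r in range(SIZE) for c in range(SIZE)}
grid[(0, 0)] = ([0] * N_TRAITS, 0)   # a single founding society
next_id = 1
ancestry = {0: None}                 # child id -> parent id: the true phylogeny

def neighbors(cell):
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [n for n in candidates if n in grid]

for step in range(500):
    for cell in list(grid):
        society = grid[cell]
        if society is None:
            continue
        traits, sid = society
        if random.random() < P_EXTINCT:
            grid[cell] = None
            continue
        # Independent trait change at the society's intrinsic evolutionary rate.
        for i in range(N_TRAITS):
            if random.random() < P_MUTATE:
                traits[i] = random.randrange(N_STATES)
        # Horizontal transmission: donate one trait to a random neighbor.
        if random.random() < P_DONATE:
            occupied = [n for n in neighbors(cell) if grid[n] is not None]
            if occupied:
                target_traits, _ = grid[random.choice(occupied)]
                i = random.randrange(N_TRAITS)
                target_traits[i] = traits[i]
        # Vertical transmission: colonize an empty cell with a daughter society.
        if random.random() < P_COLONIZE:
            empty = [n for n in neighbors(cell) if grid[n] is None]
            if empty:
                grid[random.choice(empty)] = (traits.copy(), next_id)
                ancestry[next_id] = sid   # record the true phylogenetic relation
                next_id += 1

print(sum(s is not None for s in grid.values()), "societies alive;",
      len(ancestry), "lineages recorded")
```

Because the true ancestry and the true amounts of vertical and horizontal transmission are recorded as the simulation runs, any inference method applied to the resulting trait data can be scored against them.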

Earlier work by Nunn et al. (2006), as well as studies by Greenhill et al. (2009), Currie et al. (2010), and Crema et al. (2014), pursues a similar strategy. But the sort of reliability model developed in these studies is by no means an idiosyncrasy of cultural phylogenetics. In biology, a rudimentary form of such models was perhaps the use of “caminalcules” in morphological phylogenetics. Caminalcules were fictional organisms with morphological traits whose evolutionary history was known only to their creator, who used them in early attempts to test general principles of phylogenetic inference (Camin and Sokal 1965). With the advent of molecular phylogenetics soon thereafter, full-fledged reliability models spread. Their use goes at least as far back as Felsenstein’s (1978) mathematical model showing that parsimony-based methods can be inconsistent under certain conditions. Model-based studies of reliability continue to be quite common—for a few recent examples, see Puttick et al. (2017) and Vernygora et al. (2020).

Aside from slight differences, reliability models across these closely related disciplines are thus variations on a common theme: first, reliability models use computational tools to generate data under known conditions; second, they apply inference methods to the simulated datasets; third, they conclude by assessing the reliability of the inference methods. Often the goal is to determine not only whether a single inference method is reliable, but also the conditions under which and the extent to which different methods are or are not reliable. In this way it is also possible to compare the reliability of different inference methods, identify which method performs best under which conditions, and thus choose which one to use.
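
Abstracted from the details, the common recipe can be rendered as a short, fully toy skeleton. The code below is our own schematic illustration: the “inference methods” are deliberately trivial classifiers for a biased coin, chosen only to show the simulate-infer-score loop and how it supports comparisons across methods and conditions.

```python
import random

# A fully toy rendering of the three-step recipe: (1) simulate data under
# known conditions, (2) apply inference methods, (3) score recovery.
random.seed(0)

def simulate_data(p_heads, n=50):
    """Step 1: generate synthetic data under a known condition."""
    data = [random.random() < p_heads for _ in range(n)]
    truth = "fair" if p_heads == 0.5 else "biased"
    return data, truth

def strict_method(data):
    """A conservative toy method: calls 'biased' only on strong deviations."""
    rate = sum(data) / len(data)
    return "biased" if abs(rate - 0.5) > 0.2 else "fair"

def lenient_method(data):
    """A permissive toy method: calls 'biased' on slight deviations."""
    rate = sum(data) / len(data)
    return "biased" if abs(rate - 0.5) > 0.05 else "fair"

def assess(methods, conditions, n_replicates=2000):
    """Steps 2-3: apply each method to simulated data and score recovery."""
    for p in conditions:
        for method in methods:
            hits = 0
            for _ in range(n_replicates):
                data, truth = simulate_data(p)
                hits += (method(data) == truth)
            print(f"p_heads={p:.1f}  {method.__name__}: "
                  f"recovery rate = {hits / n_replicates:.2f}")

assess([strict_method, lenient_method], conditions=[0.5, 0.6, 0.8])
```

As with the studies above, the point of the exercise is not any single score but the pattern: which method does better varies with the condition that generated the data.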

Reliability models

The case studies of the previous section provide some far-reaching lessons. First, they offer a glimpse into the general format that paradigmatic reliability models take. As with many other model-based tools, reliability models typically start out by laying down a specific set of assumptions. These assumptions can be quite complex, so computers are very helpful in storing and manipulating them. It is thus no surprise that Nunn et al. (2010) made extensive use of computer simulations. The process of simulating data can also be quite opaque. So computers are again called for, assisting the derivation of results from the model assumptions—for example, what happens when cultural traits evolve according to different rates of vertical or horizontal transmission. The simulated data is then fed into the inference methods under investigation. With the aid of these methods, we can generate hypotheses about the conditions that produced the data. To be clear, the conditions are known because they are the assumptions that went into building the model. Since the conditions are known in advance, it is possible to determine with great precision how successful the different inference methods are at recovering them. In the study by Nunn et al. (2010), the simulated dataset was used to estimate \({\textit{CI}}\)-values and thus generate hypotheses about the rates of vertical and horizontal transmission that produced the data.

Second, reliability models corroborate the view that reliability should be a function of error rates. Recall that to say that reliability is a function of error rates is to say that an inference method is reliable when it has low error rates. So for a method to be reliable, it must have a low probability of favoring a hypothesis when the hypothesis is false, and a low probability of not favoring a hypothesis when the hypothesis is correct. In the context of a reliability model, this is to say that a reliable method tends to favor a hypothesis that correctly describes the assumptions that went into building the model. For example, Nunn et al.’s (2010) models show that \({\textit{CI}}\) has low error rates when transmission is vertical: when transmission is vertical, \({\textit{CI}}\) values are generally high; and when \({\textit{CI}}\) values are high, transmission is vertical. Given that high \({\textit{CI}}\) values favor the hypothesis that transmission is mostly vertical, their studies show that \({\textit{CI}}\) is indeed reliable in this case. But their models also show that the same does not hold when \({\textit{CI}}\) values are low. Although \({\textit{CI}}\) values are low if there is a high degree of horizontal transmission, it is not the case that transmission is for the most part horizontal if \({\textit{CI}}\) values are low. This is because low \({\textit{CI}}\) values can also be due to high evolutionary rates. So \({\textit{CI}}\) does not have low error rates in this case. \({\textit{CI}}\) may therefore be a reliable method for inferring that either horizontal transmission or convergent evolution has occurred. But \({\textit{CI}}\) is not a reliable method when it comes to inferring horizontal transmission specifically—at least not in the case of culture, where evolutionary rates can be much higher than in the biological realm.

However, reliability models highlight that reliability should also be a function of base rates—that is, the unconditional probability of possible states of the world. To see why, consider the following toy example. Suppose that most English words beginning with a certain prefix are loanwords from Arabic. Loanwords are words permanently adopted from one language into another without translation. In particular, suppose that 90% of the words beginning with the prefix are loanwords from Arabic, and that 10% of the words beginning with the prefix are not loanwords. Now consider two inference methods to detect whether a randomly selected English word beginning with that prefix is in fact a loanword from Arabic. Suppose further that the two inference methods have the following error rates. When a loanword begins with the prefix, method \(M_1\) erroneously supports the hypothesis that the word under investigation is not a loanword in 5% of the cases; when a word begins with the prefix but is not actually a loanword, method \(M_1\) erroneously supports the hypothesis that the word is a loanword in 20% of the cases. This is to say that \(M_1\) has a 5% false-negative rate and a 20% false-positive rate (see Fig. 4). As for method \(M_2\), let the error rates be the same as with method \(M_1\) but reversed: when a loanword begins with the prefix, method \(M_2\) supports the hypothesis that the word in question is not a loanword in 20% of the cases; and when a word begins with the prefix but is not a loanword, method \(M_2\) supports the hypothesis that the word is a loanword in 5% of the cases. So \(M_2\)'s error rates mirror \(M_1\)'s: a 20% false-negative rate and a 5% false-positive rate.

Fig. 4 A toy example. \(S_1\) corresponds to the state in which an English word with a certain prefix is a loanword, and \(S_2\) corresponds to the state in which the word is not a loanword. \(H_1\) and \(H_2\) correspond to hypotheses describing these states. For method \(M_1\), \(q=0.95\) and \(p=0.8\), so that \(1-q=0.05\) and \(1-p=0.2\); for method \(M_2\), \(q=0.8\) and \(p=0.95\), so that \(1-q=0.2\) and \(1-p=0.05\)

Because of how the example is constructed, both methods exhibit the same pair of error rates: the false-negative rate of method \(M_1\) is equal to the false-positive rate of method \(M_2\) (0.05), and the false-positive rate of method \(M_1\) matches the false-negative rate of method \(M_2\) (0.2). If reliability were simply a function of error rates, both methods should therefore count as equally reliable. Yet the overall probability that method \(M_1\) favors the incorrect hypothesis is \(0.9 \cdot 0.05 + 0.1 \cdot 0.2=0.065\), whereas the overall probability that method \(M_2\) favors the incorrect hypothesis is \(0.9 \cdot 0.2 + 0.1 \cdot 0.05=0.185\). The discrepancy is due to the large difference between the prior probability that a randomly picked word is a loanword and the prior probability that it is not. Since method \(M_1\) is less prone to error than method \(M_2\) on the state that is far more common (that of being a loanword), the overall error probability of method \(M_1\) is also substantially lower than the overall error probability of method \(M_2\). There is thus something valuable about the first method in that it has an overall higher probability of supporting the correct hypothesis than the second one.
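
The calculation is easy to verify (the code is our own; the numbers are those of the toy example):

```python
# Overall error probability = sum over states of (base rate * error rate).
base_rates = {"loanword": 0.9, "not_loanword": 0.1}

# Per-state error rates: the chance of a wrong verdict in each state
# (false-negative rate for loanwords, false-positive rate for non-loanwords).
error_rates = {
    "M1": {"loanword": 0.05, "not_loanword": 0.20},
    "M2": {"loanword": 0.20, "not_loanword": 0.05},
}

for method, errors in error_rates.items():
    overall = sum(base_rates[state] * errors[state] for state in base_rates)
    print(f"{method}: overall error probability = {overall:.3f}")
# M1: overall error probability = 0.065
# M2: overall error probability = 0.185
```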

But error-based accounts of reliability ignore the difference between the two methods, lumping them together with respect to their reliability. What should we make of this? At worst, the case seems to suggest that we should stop thinking of reliability simply in terms of error rates. After all, this way of thinking would lead us to make an incorrect pronouncement in this case and take both methods to be equally reliable when in fact the first method is on average less prone to errors than the second one. Straightforward as it sounds, this may be too rash a conclusion. Someone could embrace the difference between these two methods as representative of a valuable feature that \(M_1\) has, but a feature that ultimately differs from what they call “reliability”. Confusing as this use of terms may be, here is not the place to legislate on terminological preferences. More importantly, however, the point is just that we should not gloss over an important distinction: \(M_1\) differs from \(M_2\) in that it performs better than \(M_2\). To insist that reliability is simply a matter of error rates would be to ignore this difference.

If this is right, then a satisfactory understanding of reliability should at the very least take more into account than just error rates. Attending to error rates alone ignores the effect that prior probabilities can have on the performance and therefore on the reliability of an inference method. But what could a satisfactory understanding of reliability look like? Although we will not be able to answer this question here, a promising way forward may be to take a Bayesian approach to reliability. Bayesians would have no difficulty incorporating the prior probability of hypotheses into a notion of reliability that does justice to the complexities raised above. From an information-theoretic perspective, another option might be to conceive of the reliability of an inference method as a measure of how much information the hypotheses selected by the method carry about the actual state of the world.
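
To give a flavor of both options, the sketch below computes, for the toy example above, the Bayesian posterior probability that a word is a loanword given a positive verdict, and the mutual information between each method’s verdict and the true state. The code is our own illustration of how such measures could be defined, not a proposal from the literature under discussion.

```python
import math

# Toy illustration of two alternatives: Bayesian posteriors and the mutual
# information between a method's verdict and the true state, using the base
# rates and error rates of the loanword example above.

def joint(prior_L, sens, spec):
    """Joint distribution over (state, verdict) for a method with the given
    sensitivity (true-positive rate) and specificity (true-negative rate)."""
    return {
        ("L", "L"): prior_L * sens,              # loanword, verdict "loanword"
        ("L", "N"): prior_L * (1 - sens),        # false negative
        ("N", "L"): (1 - prior_L) * (1 - spec),  # false positive
        ("N", "N"): (1 - prior_L) * spec,        # true negative
    }

def posterior_L_given_verdict_L(j):
    """Bayes: P(loanword | verdict says loanword)."""
    says_L = j[("L", "L")] + j[("N", "L")]
    return j[("L", "L")] / says_L

def mutual_information(j):
    """I(state; verdict) in bits."""
    p_state = {s: j[(s, "L")] + j[(s, "N")] for s in "LN"}
    p_verdict = {v: j[("L", v)] + j[("N", v)] for v in "LN"}
    return sum(p * math.log2(p / (p_state[s] * p_verdict[v]))
               for (s, v), p in j.items() if p > 0)

for name, sens, spec in [("M1", 0.95, 0.80), ("M2", 0.80, 0.95)]:
    j = joint(0.9, sens, spec)
    print(name, round(posterior_L_given_verdict_L(j), 3),
          round(mutual_information(j), 3))
```

On these numbers, \(M_1\) carries more information about the state of the world than \(M_2\) (about 0.21 vs. 0.17 bits), in line with the verdict of the overall error probability, even though a positive verdict from \(M_2\) warrants a slightly higher posterior. That different explications can pull apart in this way is itself a sign that the notion deserves further scrutiny.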

Yet another option would be to consider the overall error rate. There are some potential pitfalls with this approach, however. An obvious choice of measure for the overall error rate would be the expected error rate over possible states of the world. Attractive as this option may seem at first, such a measure would require taking into account base rates—i.e., the prior probabilities of the hypotheses that describe the possible states of the world. Those who defend understanding reliability in terms of error rates, such as Woodward and Mayo, typically claim quite explicitly that reliability is a property of inference methods alone. Since prior probabilities are not properties of an inference method, it is not clear that they would be willing to endorse this move. Be that as it may, reliability should take more into account than just error rates.

Before concluding, an important caveat: reliability models make the concern with prior probabilities so salient precisely because they are ill-suited to offer any real guidance on these matters. Reliability models are formidable tools with which to determine error rates. As such, they excel at determining the conditions under which particular inference methods are likely or unlikely to succeed. Although these models can therefore help us assess the reliability of an inference method across a wide range of conditions, reliability models cannot establish when or how often we should encounter these conditions in the real world. For that, there is no good substitute for empirical work—be it by directly analyzing real-world data, or by coupling real-world data with other models, as in the case of generative models (Kandler and Powell 2018). It is thus not surprising that Nunn et al. (2010), for example, do not attempt to estimate how common horizontal transmission or high evolutionary rates really are in cultural evolution, choosing instead to simply report the performance of \({\textit{CI}}\) and other inference methods under these conditions. For present purposes, this means that reliability models should not be expected to help us adjudicate the difficulty raised above for error-based accounts of reliability. In sum, reliability models may suggest that error rates are not all that matter when it comes to reliability. In line with the toy example above, they may also suggest that the prior probability of a hypothesis being true matters too. But reliability models cannot help us estimate these prior probabilities.

Conclusion

When philosophers consider questions about justification, they typically focus on the justification of theories and hypotheses. But just as crucial to science is the justification of inference methods. Here, we have shown that reliability models play an important role in justifying the use of inference methods. In the particular case of phylogenetics, reliability models allow us to determine the conditions under which inference methods developed in biology, on the assumption that transmission is mostly vertical, can and cannot be safely applied to the cultural realm, where horizontal transmission and high evolutionary rates are the norm. As the studies reviewed above illustrate, there is unfortunately no simple way to resolve this issue: the relevant conditions vary depending on the method under consideration and the system of interest. But this is itself a valuable lesson that reliability models allow us to draw—namely, that methods borrowed from biological phylogenetics to study culture should neither be rejected out of hand nor endorsed tout court.

Reliability models in cultural and biological phylogenetics are also instructive for bringing the notion of reliability into sharper focus. For one, these models corroborate a common way of understanding reliability—namely, in terms of error rates. But their use also suggests that error rates are not all that matter. Indeed, reliability models help clarify that another important aspect to consider is the probability of each possible state of the world prior to the application of any inference method. This is therefore a question that a full understanding of reliability should address, although here we have only hinted at possible ways of doing so. If this is right, then philosophers might also want to revise or perhaps even abandon existing ways of understanding reliability. In either case, this could prove consequential for issues in philosophy and beyond that routinely invoke the notion of reliability—such as the purpose of robustness analysis (Levins 1966; Weisberg 2006; Wimsatt 1981), the nature of experimental replications (Machery 2020), and the justification of public trust in science (Irzik and Kurtulmus 2020; Wilholt 2013).