1 Introduction

The theory of evolution states that the diversity of species can be explained by descent with modification. Darwin [3] was able to provide evidence in favor of his theory despite the limitations of his time. Nowadays, technology is a powerful tool that allows us to generate a huge quantity of evidence in favor of this theory. The support comes from different areas, for instance, Molecular Biology, Paleontology, Biogeography, Biochemistry, and Phylogenetics. The present article belongs to the last of these, which is the study of the evolutionary relationships among groups of organisms, typically based on molecular sequencing data.

As in any other field, data analysis in phylogenetics is performed mainly under two statistical approaches: Frequentist and Bayesian. The latter has gained ground in phylogenetics due to its flexibility in dealing with large datasets and complex evolutionary models. By studying a particular Bayesian measure, the probability that the data have been generated from a tree-like evolutionary model, we assess whether the patterns of evolution in the molecular sequencing data (DNA) could reasonably arise due to chance. In other words, if the theory of evolution is right, the sequence alignments should contain information that connects the species from which the DNA was taken. If so, we assess whether these patterns can be due to chance acting alone.

To evaluate whether the patterns that emerge from the molecular data are due to chance, we use a method known as randomization. This method allows us to detect whether the data contain nonrandom information that links the species in a common evolutionary history. It works by comparing a statistic obtained from the data with the distribution of the same statistic obtained from a set of artificial datasets, generated randomly from the original one, which consequently do not contain any phylogenetic signal. If the data support evolution, their information should be significant enough to be distinguished from that obtained just by chance. This technique was already proposed by [1] in a nonparametric framework.

This article aims to show in a practical way how the theory of evolution is supported by a logical method such as randomization, by studying a Bayesian quantity: the marginal likelihood. First, two toy examples are presented as a means to understand the method; then an application to a real dataset, which contains part of the primate family, is given in order to detect phylogenetic signal. The descriptions of the statistical methods and phylogenetic models are omitted, but the respective references are given.

2 Randomization

Randomization is a method used to assess the effect of a certain factor or treatment on a variable of interest. This is carried out by studying the properties of the distribution of a statistic calculated from randomized datasets. Each of these artificial datasets is generated by randomly assigning the observations to the levels of the factor/treatment, i.e., the experimental units are relabeled. The new data will not show any effect of the factor on the variable: the factor is neutralized, and any difference between its levels is caused by chance. This is analogous to shuffling playing cards to eliminate any kind of intervention.

The method compares the statistic of the original data with the distribution of the same statistic of the randomized data. Such a statistic can be, for example, the mean, median, mode, or variance. The method does not need assumptions about the population, such as normality or equal variances; it works with the data alone to make inferences. The following examples help to understand the method.

2.1 Toy Examples

Consider the marks of a test for 10 students, distinguished by the method of study (A or B) to which the students were randomly allocated. The marks are presented as percentages and are shown in Table 1. The objective is to determine which of the methods of study is more effective. Both examples are developed in the same context, but they differ in the dataset. They could have been treated analytically, but to illustrate randomization in a general way we have used simulations. They have purely didactic purposes, and clear patterns have been arbitrarily assigned.

Table 1 Data for the toy examples
Table 2 Means for each method according to the example. “Difference” depicts the subtraction between the means of Method A and B

2.1.1 Example 1

Clearly, method A presents higher marks than method B (see Table 1, Example 1). This can also be seen by comparing their means (see Table 2). Apparently, method A is better than B. But can this be due to chance acting alone? Randomization can give us an idea.

We generate a new dataset where each mark is assigned randomly to either method A or B. The number of marks per method is set to 5, as in the original dataset. Then, the difference between the means is calculated and registered. This procedure is repeated 10,000 times. The mean differences are plotted in Fig. 1.

We can see that the mean differences are centered around zero. This is expected because, in the randomized datasets, any difference in means is due to chance alone: the effect of the method has been obliterated. The observed difference, calculated from the original data, is 33.86 and is located at the extreme right of the distribution. If chance were acting alone, it would be unusual to get a difference as large as the one observed in the data. Assuming a well-designed experiment, we conclude that method A effectively yields better results than B on average.
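
The procedure of Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marks below are hypothetical (they are not the values of Table 1), and the function name `mean_difference_randomization`, the seed, and the tail proportion computed at the end are our own choices rather than part of the original analysis.

```python
import random
from statistics import mean

def mean_difference_randomization(marks_a, marks_b, n_rep=10_000, seed=1):
    """Return the observed mean difference (A minus B) and the differences
    obtained after randomly reassigning the marks to the two methods."""
    rng = random.Random(seed)
    observed = mean(marks_a) - mean(marks_b)
    pooled = list(marks_a) + list(marks_b)
    n_a = len(marks_a)  # keep the original group sizes
    randomized = []
    for _ in range(n_rep):
        rng.shuffle(pooled)  # relabel the experimental units at random
        randomized.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    return observed, randomized

# Hypothetical marks (in percent); NOT the values of Table 1.
method_a = [85.0, 90.0, 78.0, 92.0, 88.0]
method_b = [55.0, 60.0, 48.0, 62.0, 58.0]

observed, randomized = mean_difference_randomization(method_a, method_b)

# Share of randomized differences at least as large as the observed one.
tail = sum(d >= observed for d in randomized) / len(randomized)

print(f"observed difference: {observed:.2f}")
print(f"centre of the randomized differences: {mean(randomized):.2f}")
print(f"proportion >= observed: {tail:.4f}")
```

A small tail proportion plays the role of the visual check in Fig. 1: it indicates how rarely chance alone produces a difference as large as the observed one.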

Fig. 1 Distribution of the mean differences of the randomized datasets for the toy examples. In Example 1, the observed difference is unlikely to have happened under chance acting alone. On the other hand, in Example 2, this difference could have been just due to chance and have nothing to do with which method was used

2.1.2 Example 2

Now consider the data given in Table 1 for Example 2. In this case, both methods apparently yield similar results. The difference between their means is just −7.03 (see Table 2, Example 2). But again, can this be due to chance acting alone? To answer, we repeat the procedure of Example 1. The results are shown in Fig. 1.

The distribution of the differences between the means of methods A and B for the randomized datasets is centered around zero and is relatively symmetric. Similar characteristics were found in Example 1 because the potential effect of the method of study has been wiped out in both examples. The observed difference, −7.03, is near the center. When chance is acting alone, this difference is highly probable, unlike in Example 1, where the difference in means was unusual under chance acting alone. Thus, assuming a well-designed experiment, we could claim that the methods of study yield similar results on average, and that the observed difference is just due to chance acting alone.

In these cases, we compared the effect of the method of study on the mean mark, but we could have studied any other characteristic, for instance, the standard deviation, the median, or a specific probability. Strictly speaking, the comparison should be carried out by using an appropriate statistical test, for instance, a t-test. In the next case, we will study the probability of the data given the model in order to detect phylogenetic signal in a molecular dataset of five primates.
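
To illustrate that the same logic carries over to other characteristics, the following Python sketch takes the statistic as an argument. It is a sketch under assumptions: the marks are hypothetical, and the names `randomization_distribution` and `two_sided_proportion` are introduced here for illustration only.

```python
import random
from statistics import median, stdev

def randomization_distribution(group_a, group_b, statistic, n_rep=5_000, seed=7):
    """Observed value of statistic(A) - statistic(B) and its distribution
    after randomly relabelling the observations."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = statistic(group_a) - statistic(group_b)
    values = []
    for _ in range(n_rep):
        rng.shuffle(pooled)
        values.append(statistic(pooled[:n_a]) - statistic(pooled[n_a:]))
    return observed, values

def two_sided_proportion(observed, values):
    """Share of randomized values at least as extreme as the observed one."""
    return sum(abs(v) >= abs(observed) for v in values) / len(values)

# Hypothetical marks (in percent); not taken from Table 1.
method_a = [62.0, 71.0, 55.0, 80.0, 66.0]
method_b = [70.0, 64.0, 77.0, 59.0, 73.0]

for name, stat in [("median", median), ("standard deviation", stdev)]:
    obs, vals = randomization_distribution(method_a, method_b, stat)
    print(f"{name}: observed = {obs:.2f}, "
          f"proportion as extreme = {two_sided_proportion(obs, vals):.3f}")
```

Only the `statistic` argument changes between runs; the relabelling step stays the same, which is what makes the method general.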

3 Phylogenetic Analysis

Now we apply the same concept to analyze whether a molecular dataset of a group of primates has information about their common evolutionary history. This is a subset of a dataset that has been previously analyzed in the literature [7]. The subset contains five species of primates: macaque, guereza, orangutan, chimpanzee, and human. The alignment corresponds to mitochondrial DNA and has a length of 15,727 sites. Recall that DNA is composed of four nucleobases: adenine (A), cytosine (C), guanine (G), and thymine (T). An extract of the data is shown in Fig. 2. The relationship among these species is uncontroversial and can be visualized as the tree shown in Fig. 3. Human and chimpanzee share a more recent common ancestor, which makes them more closely related. Orangutan is also part of this clade, but through a more distant ancestor. Macaque and guereza form another clade. All the species are connected through their most recent common ancestor, which is located at the root of the tree (the left vertex in Fig. 3).

Fig. 2 Extract of the mitochondrial DNA for five species of primates. Each column represents a site

Fig. 3 Evolutionary relationship among five members of the primate family. From the top: macaque, guereza, orangutan, chimpanzee, and human. These species are related via common ancestry

In order to eliminate any kind of correlation in the dataset, we permute each site, generating a new dataset. In other words, the bases at each site are randomly reordered among the species. For instance, site 2 = (C,T,C,T,T), displayed in Fig. 2, can be permuted as (T,C,T,T,C). The theory of evolution [3] states that all organisms are related through common ancestors. So, if the data were generated by a tree, they should contain this information, whereas the randomized data should not.
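
The site-wise permutation can be sketched as follows in Python. The alignment below is a made-up toy of five sites, not the real data; only its second column reproduces the site 2 example above. The function name `permute_sites` and the dictionary representation are assumptions made for this illustration.

```python
import random

def permute_sites(alignment, seed=None):
    """Shuffle each column (site) of an alignment independently across taxa.

    `alignment` maps a taxon name to its aligned sequence; all sequences
    must have the same length.
    """
    rng = random.Random(seed)
    taxa = list(alignment)
    n_sites = len(alignment[taxa[0]])
    if any(len(alignment[t]) != n_sites for t in taxa):
        raise ValueError("sequences must have equal length")
    columns = []
    for j in range(n_sites):
        column = [alignment[t][j] for t in taxa]
        rng.shuffle(column)  # break the link between base and species
        columns.append(column)
    # Reassemble one sequence per taxon from the shuffled columns.
    return {t: "".join(col[i] for col in columns) for i, t in enumerate(taxa)}

# Made-up toy alignment (NOT the real data); column 2 is (C,T,C,T,T).
toy = {
    "macaque":    "ACGTA",
    "guereza":    "ATGTA",
    "orangutan":  "ACGCA",
    "chimpanzee": "ATGCG",
    "human":      "ATGCG",
}

for taxon, sequence in permute_sites(toy, seed=3).items():
    print(f"{taxon:<10} {sequence}")
```

Each column keeps its base composition, so site-specific frequencies are preserved while the assignment of bases to species, and with it any tree signal, is scrambled.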

In the previous examples, the mean difference was studied, but now we will study the probability of the data given the model, which will be referred to from now on as the marginal likelihood. Phylogenetics deals with very small probability values, so it is convenient to work with log values. The evolutionary relationship among the species is modeled by the tree displayed in Fig. 3. This tree represents the factor to be tested in this analysis, similar to the method of study that was tested in the previous examples. We describe the evolutionary process along the tree assuming a GTR+\(\varGamma_4\) model, that is, the most general time-reversible substitution model combined with gamma-distributed rate variation among sites discretized into four categories. A readable introduction to these models is given in [9]. The prior distributions on the parameters involved in the model are defined in Appendix 1.

The calculation of the marginal likelihood is a challenging problem in phylogenetics, even for simple models, and therefore requires a numerical approximation. Here, we estimate it via Nested Sampling [8], an algorithm introduced to phylogenetics by [4]. Details of the estimation process are given in Appendix 2.
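
For reference, the quantity being estimated can be written in standard notation. The symbols \(D\) (data), \(M\) (model, here the tree with its substitution model), \(\theta\) (model parameters), \(L\), \(\pi\), \(X\), and \(N\) are our own notation for this sketch and are not taken from the appendices:

\[
Z = p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta = \int L(\theta)\, \pi(\theta)\, d\theta .
\]

Nested sampling rewrites this multidimensional integral as a one-dimensional one over the prior mass \(X(\lambda) = \int_{L(\theta) > \lambda} \pi(\theta)\, d\theta\), and approximates it from a sequence of increasing likelihood values \(L_1 < L_2 < \dots < L_m\):

\[
Z = \int_0^1 L(X)\, dX \approx \sum_{i=1}^{m} L_i \,(X_{i-1} - X_i), \qquad X_0 = 1, \quad X_i \approx e^{-i/N},
\]

where \(N\) is the number of active points. This is the simplest quadrature form of the algorithm; the estimator actually used here may differ in its details.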

We generate 1000 randomized datasets and estimate the log-marginal likelihood of each one. We also estimate this quantity for the original dataset. The results are shown in Fig. 4 and the descriptive statistics in Table 3.

Fig. 4 Log-marginal likelihood of the observed data compared to the distribution of this quantity obtained from randomized datasets. The observed log-marginal likelihood is much higher than one would expect under chance acting alone. The information contained in the molecular data of this primate family is far more likely to have arisen from common ancestry than from chance alone

Table 3 Descriptive statistics for the estimated log-marginal likelihoods from the randomized datasets

The estimates for the randomized data range from −53484 to −51675, with a mean of −51737. In contrast, the observed log-marginal likelihood estimate is −49658 (with a standard deviation of 0.73). It is located to the right of the distribution of the log-marginal likelihoods for the randomized datasets, approximately 26 standard deviations away from its mean.
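
The distance quoted above is a standardized difference, which can be sketched in Python as follows. The function name `standardized_distance` is ours, and the "implied" standard deviation is back-calculated from the figures reported in the text; it is an inference for illustration, not a value read from Table 3.

```python
from statistics import mean, stdev

def standardized_distance(observed, randomized):
    """Number of standard deviations between `observed` and the mean of
    the randomized values."""
    return (observed - mean(randomized)) / stdev(randomized)

# Figures reported in the text.
observed_log_ml = -49658
randomized_mean = -51737
reported_distance_in_sd = 26

gap = observed_log_ml - randomized_mean       # gap in log units
implied_sd = gap / reported_distance_in_sd    # back-calculated, see lead-in

print(f"gap between observed value and randomized mean: {gap}")
print(f"implied standard deviation of the randomized estimates: {implied_sd:.1f}")
```

With the full set of 1000 randomized estimates at hand, `standardized_distance(observed_log_ml, estimates)` would give the same measure directly.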

Following the reasoning of Example 1, we conclude that it would be unusual to obtain a log-marginal likelihood as large as the one observed in the data if chance were acting alone. The probability of the original data under the tree structure is much higher than that of the randomized datasets. This means that the patterns in the DNA are more likely to be explained by the tree-like structure than to have occurred just by chance. In other words, the data contain phylogenetic information that cannot be explained by chance alone. The mitochondrial DNA has retained the common evolutionary history of these species, and our analysis has shown that it would have been highly unlikely to obtain this arrangement of the bases in the data as a result of pure chance. This is evidence that supports the tree structure behind the evolutionary history of these five species of primates, which is consistent with the theory of evolution.

4 Conclusion

A brief introduction to the randomization method has been given. Two toy examples have been studied to explain its logic. Example 1 represented a case in which the treatment had an effect on the studied characteristic, while Example 2 presented a case in which chance was acting alone. Both examples aimed to set up the logic that is used in the analysis of a primate family dataset.

We analyzed a real dataset of five species of primates under a Bayesian statistical approach and used randomization to detect whether it contained nonrandom information. The data were permuted to eliminate any kind of phylogenetic signal, and then the probability that these randomized data came from the tree model (the marginal likelihood) was calculated. This procedure was repeated 1000 times, generating a distribution for the estimates. The probability for the original dataset was much higher than the maximum of the same quantity over the randomized datasets. We would not expect such a probability if there were no tree signal. Therefore, we concluded that chance was not acting alone and that these species have a tree-like relationship. The presence of a hierarchical structure provides evidence for descent from common ancestry.

The results given here are consistent with the theory of evolution and add to the huge amount of evidence that supports it. For instance, 28 morphological datasets were analyzed and found to favor tree-like models [1]; in addition, sequence data for 5 proteins from 11 species contain similar phylogenetic information [5]. In this line, we have shown that Bayesian inference provides the means to detect this phylogenetic signal through the marginal likelihood. In practice, it is unusual to find data that completely lack hierarchical structure [2], and the data analyzed here were no exception.

All the analyses and plots were produced in R [6].