Abstract
The theory of evolution states that the diversity of species can be explained by descent with modification. Therefore, all living beings are related through a common ancestor. This evolutionary process must have left traces in our molecular composition. In this work, we present a randomization procedure to determine whether a group of five species of the primate family, namely, macaque, guereza, orangutan, chimpanzee, and human, has retained these traces in its molecules. First, we present the randomization methodology through two toy examples that illustrate its logic. We then carry out a DNA data analysis to assess whether the data for this group of primates contain phylogenetic information that links the species in a joint evolutionary history. This is done by monitoring a Bayesian measure, the marginal likelihood, which we estimate using nested sampling. We found that it would be unusual to obtain the relationship observed in the data among these primate species if they had not shared a common ancestor. The results are in full agreement with the theory of evolution.
1 Introduction
The theory of evolution states that the diversity of species can be explained by descent with modification. Darwin [3] was able to provide evidence in favor of his theory despite the limitations of his time. Nowadays, technology is a powerful tool that makes it possible to generate a huge quantity of evidence in favor of this theory. The support comes from different areas, for instance, Molecular Biology, Paleontology, Biogeography, Biochemistry, and Phylogenetics. The present article belongs to the last of these, which is the study of the evolutionary relationships among groups of organisms, typically based on molecular sequencing data.
As in any other field, data analysis in phylogenetics is performed mainly under two statistical approaches: Frequentist and Bayesian. The latter has gained ground in phylogenetics due to its flexibility in dealing with large datasets and complex evolutionary models. By studying a particular Bayesian measure, the probability of the data under a tree-like evolutionary model, we assess whether the patterns of evolution in the molecular sequencing data (DNA) could reasonably arise due to chance. In other words, if the theory of evolution is right, the sequence alignments should contain information that connects the species from which the DNA was taken. If so, we assess whether these patterns can be due to chance acting alone.
To evaluate whether the patterns that emerge from the molecular data are due to chance, we use a method known as randomization. This method makes it possible to detect whether the data contain nonrandom information that links the species in a common evolutionary history. It works by comparing a statistic obtained from the data with the distribution of the same statistic obtained from a set of randomized datasets, which are generated randomly from the original one and consequently do not contain any phylogenetic signal. If the data support evolution, their information should be strong enough to be distinguished from that obtained just by chance. This technique was already proposed by [1] in a nonparametric framework.
This article aims to show in a practical way how the theory of evolution is supported by a method as logical as randomization, by studying a Bayesian quantity: the marginal likelihood. First, two toy examples are presented as a means of understanding the method. Then, an application to a real dataset, which contains part of the primate family, is given in order to detect phylogenetic signal. The descriptions of the statistical methods and phylogenetic models are omitted, but the respective references are given.
2 Randomization
Randomization is a method used to assess the effect of a certain factor or treatment on a variable of interest. This is carried out by studying the properties of the distribution of a statistic calculated from randomized datasets. Each of these randomized datasets is generated by randomly assigning the observations to the levels of the factor/treatment, i.e., the experimental units are relabeled. The new data will not show any effect of the factor on the variable. The factor is removed, and any difference between its levels is caused by chance. This is analogous to shuffling playing cards to eliminate any kind of intervention.
The method compares the statistic of the original data with the distribution of the same statistic for the randomized data. Such a statistic can be, for example, the mean, median, mode, or variance. The method does not need to make any assumption about the population, such as normality or equal variances; it works only with the data to make inferences. The following examples help to understand the method.
2.1 Toy Examples
Consider the marks on a test for 10 students, grouped by the method of study (A or B) to which the students were randomly allocated. The marks are given as percentages and are shown in Table 1. The objective is to determine which of the methods of study is more effective. Both examples are developed in the same context but differ in the dataset. They could have been treated analytically, but to illustrate randomization in a general way we have used simulations. They have purely didactic purposes, and clear patterns have been arbitrarily assigned.
2.1.1 Example 1
Clearly, method A shows higher marks than method B (see Table 1, Example 1). This can also be seen by comparing their means (see Table 2). Apparently, method A is better than B. But can this be due to chance acting alone? Randomization can give us an idea.
We generate a new dataset where each mark is assigned randomly to either method A or B. The number of marks per method is set to 5, as in the original dataset. Then, the difference between the means is calculated and registered. This procedure is repeated 10,000 times. The mean differences are plotted in Fig. 1.
We can see that the mean differences are centered around zero. This is expected because, in the randomized datasets, the difference in means is just due to chance. The effect of the method has been obliterated. The observed difference, which was calculated from the original data, is 33.86 and is located at the right extreme of the distribution. If chance were acting alone, it would be unusual to get a difference as large as the one observed in the data. Assuming a well-designed experiment, we conclude that method A effectively yields better results than B on average.
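The procedure described in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the marks below are hypothetical (the values of Table 1 are not reproduced here), and the function name `randomization_test`, its parameters, and the two-sided "at least as extreme" fraction are assumptions made for this illustration, not the author's code.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def randomization_test(group_a, group_b, n_rand=10_000, seed=1):
    """Compare the observed mean difference with its randomization distribution.

    Returns the observed difference, the list of randomized differences, and
    the fraction of randomized differences at least as extreme (in absolute
    value) as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    diffs = []
    for _ in range(n_rand):
        # Relabel the experimental units: shuffle the pooled marks and
        # assign the first n_a to method A and the rest to method B.
        rng.shuffle(pooled)
        diffs.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    n_extreme = sum(1 for d in diffs if abs(d) >= abs(observed))
    return observed, diffs, n_extreme / n_rand


# Hypothetical marks (percentages) with a clear pattern, as in Example 1.
marks_a = [90, 85, 88, 92, 80]
marks_b = [55, 60, 52, 58, 50]

observed, diffs, fraction = randomization_test(marks_a, marks_b)
print("observed difference:", observed)
print("fraction at least as extreme:", fraction)
```

A small fraction indicates that the observed difference would be unusual if chance were acting alone, whereas a large fraction, as in a dataset with no clear pattern, indicates a difference that chance alone can easily produce.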
2.1.2 Example 2
Now consider the data given in Table 1 for Example 2. In this case, both methods yield apparently similar results. The difference between their means is only 7.03 in magnitude (see Table 2, Example 2). But again, can this be due to chance acting alone? To answer this, we repeat the procedure used in Example 1. The results are shown in Fig. 1.
The distribution of the differences between the means of methods A and B for the randomized datasets is centered around zero and is relatively symmetric. Similar characteristics were found in Example 1 because the potential effect of the method of study has been wiped out in both examples. The observed difference, −7.03, is near the center of the distribution. When chance is acting alone, this difference is highly probable, unlike in Example 1, where the difference in means was unusual under chance acting alone. Thus, assuming a well-designed experiment, we could claim that the methods of study yield similar results, on average, and the observed difference is just due to chance acting alone.
In these cases, we compared the effect of the method of study on the mean mark, but we could have studied any other characteristic, for instance, the standard deviation, the median, or a specific probability. Strictly speaking, the comparison should be carried out using an appropriate statistical test, for instance, a t-test. In the next case, we will study the probability of the data given the model in order to detect phylogenetic signal in a molecular dataset of five primates.
3 Phylogenetic Analysis
Now, we apply the same idea to analyze whether a molecular dataset for a group of primates carries information about their common evolutionary history. The data are a subset of a dataset that has previously been analyzed in the literature [7]. The subset contains five kinds of primates: macaque, guereza, orangutan, chimpanzee, and human. The alignment consists of mitochondrial DNA with a length of 15,727 sites. DNA is composed of four nucleobases: adenine (A), cytosine (C), guanine (G), and thymine (T). An extract of the data is shown in Fig. 2. The relationship among these species is uncontroversial and can be visualized as the tree shown in Fig. 3. Human and chimpanzee share a more recent common ancestor with each other than with the other species, which makes them more closely related. Orangutan is also part of this clade, but through a more distant ancestor. Macaque and guereza form another clade. All the species are connected through their most recent common ancestor, which is located at the root of the tree (the left vertex in Fig. 3).
To eliminate any correlation among the species in the dataset, we permute each site, which generates a new dataset. In other words, the bases at each site are reordered randomly among the five species. For instance, site 2 = (C,T,C,T,T), displayed in Fig. 2, can be permuted as (T,C,T,T,C). The theory of evolution [3] states that all organisms are related through common ancestors. So, if the data were generated by a tree, they should contain this information, unlike the randomized data.
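The site permutation described above can be sketched as follows. This is a minimal sketch, not the code used in the paper: the function name `permute_sites` and the four-site toy alignment are hypothetical, chosen only so that site 2 equals (C,T,C,T,T) as in the text, and the sketch assumes that each site is shuffled independently of the others.

```python
import random


def permute_sites(alignment, rng):
    """Shuffle the bases of every site (column) among the species (rows).

    `alignment` is a list of equal-length strings, one per species.
    Each column keeps its base composition but loses any association
    between species, so no phylogenetic signal should remain.
    """
    n_species = len(alignment)
    n_sites = len(alignment[0])
    rows = [[] for _ in range(n_species)]
    for j in range(n_sites):
        column = [seq[j] for seq in alignment]
        rng.shuffle(column)  # independent random reordering of this site
        for i, base in enumerate(column):
            rows[i].append(base)
    return ["".join(row) for row in rows]


# Hypothetical toy alignment; rows are macaque, guereza, orangutan,
# chimpanzee, and human. Site 2 reads (C, T, C, T, T).
toy_alignment = ["ACGT", "ATGA", "ACGT", "ATGC", "ATGC"]

rng = random.Random(42)
randomized = permute_sites(toy_alignment, rng)
for original, shuffled in zip(toy_alignment, randomized):
    print(original, "->", shuffled)
```

Because only the order within each column changes, the base frequencies at every site are preserved; what is destroyed is the pattern of which species share which bases.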
In the previous examples, the mean difference was studied, but now we will study the probability of the data given the model, which will be referred to from now on as the marginal likelihood. Phylogenetics deals with very small probability values, so it is convenient to work with log values. The evolutionary relationship among the species is modeled by the tree displayed in Fig. 3. This tree represents the factor to be tested in this analysis, similar to the method of study that was tested in the previous examples. We describe the evolutionary process along the tree assuming a GTR+\(\varGamma _4\) model, that is, the general time-reversible substitution model, the most general model of its kind, with rate variation across sites described by a discrete gamma distribution with four categories. A readable introduction to these models is given in [9]. The prior distributions on the parameters involved in the model are defined in Appendix 1.
The calculation of the marginal likelihood is a challenging problem in phylogenetics, even for simple models, so it requires a numerical approximation. Here, we estimate it via nested sampling [8], an algorithm introduced to phylogenetics by [4]. Details of the estimation process are given in Appendix 2.
We generate 1000 randomized datasets and estimate the log-marginal likelihood of each one. We also estimate this quantity for the original dataset. The results are shown in Fig. 4 and the descriptive statistics in Table 3.
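For reference, the marginal likelihood can be written in its standard form as follows. The symbols are notation assumed here for illustration and do not appear in the original text: \(D\) denotes the alignment, \(M\) the model (the fixed tree of Fig. 3 together with the substitution model), and \(\theta \) the collection of model parameters with the priors of Appendix 1.

\[ p(D \mid M) = \int p(D \mid \theta , M)\, p(\theta \mid M)\, d\theta \]

The likelihood \(p(D \mid \theta , M)\) is averaged over the prior \(p(\theta \mid M)\), so the quantity measures how well the model as a whole, not a single best-fitting parameter value, predicts the data.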
The estimates for the randomized data range from −53484 to −51675, with a mean of −51737. In contrast, the observed log-marginal likelihood estimate is −49658 (with a standard deviation of 0.73). It lies to the right of the distribution of the log-marginal likelihoods for the randomized datasets, approximately 26 standard deviations above the mean.
Following the reasoning of Example 1, we conclude that a log-marginal likelihood as large as the one observed in the data would be unusual if chance were acting alone. The probability of the original data under the tree model is much higher than that of the randomized datasets. This means that the patterns in the DNA are better explained by the tree-like structure than by chance. In other words, the data contain phylogenetic information that cannot be explained by chance alone. The mitochondrial DNA has retained the common evolutionary history of these species, and our analysis has shown that it would have been highly unlikely to obtain this arrangement of the bases in the data as a result of pure chance. This is evidence in support of the tree structure behind the evolutionary history of these five species of primates, and it is consistent with the theory of evolution.
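The distance quoted above can be checked with simple arithmetic. The notation is assumed here for illustration: \(\hat{\ell }_{\mathrm{obs}}\) is the observed log-marginal likelihood estimate, and \(\bar{\ell }_{\mathrm{rand}}\) and \(s_{\mathrm{rand}}\) are the mean and standard deviation of the estimates for the randomized datasets. The value of \(s_{\mathrm{rand}}\) is not stated in the text, so the last step only gives the value implied by the rounded figure of 26.

\[ z = \frac{\hat{\ell }_{\mathrm{obs}} - \bar{\ell }_{\mathrm{rand}}}{s_{\mathrm{rand}}}, \qquad \hat{\ell }_{\mathrm{obs}} - \bar{\ell }_{\mathrm{rand}} = -49658 - (-51737) = 2079, \qquad s_{\mathrm{rand}} \approx \frac{2079}{26} \approx 80 \]

The gap of 2079 log units is also larger than the whole range of the randomized estimates, \(-51675 - (-53484) = 1809\).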
4 Conclusion
A brief introduction to the randomization method has been given. Two toy examples have been studied to explain its logic. Example 1 represented a case in which the treatment had an effect on the studied characteristic, while Example 2 presented a case in which chance was acting alone. Both examples aimed to set up the logic used in the analysis of a primate family dataset.
We analyzed a real dataset of five species of primates under a Bayesian statistical approach and used randomization to detect whether it contained nonrandom information. The data were permuted to eliminate any phylogenetic signal, and then the marginal likelihood, that is, the probability of these randomized data under the tree model, was estimated. This procedure was repeated 1000 times, generating a distribution for the estimates. The log-marginal likelihood of the original dataset was much higher than the maximum value obtained for the randomized data. We would not expect such a value if there were no tree signal. Therefore, we concluded that chance was not acting alone and that these species have a tree-like relationship. The presence of a hierarchical structure provides evidence for descent from common ancestry.
The results given here are consistent with the theory of evolution and add to the huge amount of evidence that supports it. For instance, 28 morphological datasets were analyzed and found to favor tree-like models [1]; in addition, sequence data for 5 proteins from 11 species contain similar phylogenetic information [5]. In this line, we have shown that Bayesian inference provides the means to detect this phylogenetic signal through the marginal likelihood. In practice, it is unusual to find data that completely lack hierarchical structure [2], and the data analyzed here were no exception.
All the analyses and plots were produced in R [6].
References
1. Archie, J.W.: A randomization test for phylogenetic information in systematic data. Syst. Zool. 38, 239–252 (1989)
2. Baum, D.A., Smith, S.D.: Tree Thinking: An Introduction to Phylogenetic Biology. Roberts and Company Publishers, Greenwood Village (2012)
3. Darwin, C.: On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. J. Murray, London (1859)
4. Maturana Russel, P., Brewer, B.J., Klaere, S., Bouckaert, R.: Model selection and parameter inference in phylogenetics using nested sampling. arXiv:1703.05471v3
5. Penny, D., Foulds, L.R., Hendy, M.D.: Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297, 197–200 (1982)
6. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2015)
7. Roos, C., Zinner, D., Kubatko, L.S., Schwarz, C., Yang, M., Meyer, D., Nash, S.D., Xing, J., Batzer, M.A., Brameier, M., Leendertz, F.H., Ziegler, T., Perwitasari-Farajallah, D., Nadler, T., Walter, L., Osterholz, M.: Nuclear versus mitochondrial DNA: evidence for hybridization in colobine monkeys. BMC Evol. Biol. 11, 77 (2011)
8. Skilling, J.: Nested sampling for general Bayesian computation. Bayesian Anal. 1, 833–860 (2006)
9. Yang, Z.: Molecular Evolution: A Statistical Approach. Oxford University Press, Oxford (2014)
Appendices
Appendix 1
We analyze the dataset assuming a \(\mathrm{GTR}+\varGamma _4\) model and consider the following prior distributions on the parameters involved in the analysis:
- Branch lengths: \(t_i|\mu \sim \) Exp\((1/\mu )\), for \(i =1,\ldots ,8,\) with \(\mu \sim \) Inverse-Gamma(3,0.2).
- Relative rates: \(q_i|\phi \sim \) Exp\((\phi )\), for \(i =1,\ldots ,5,\) with \(\phi \sim \) Exp(1).
- Base frequencies: \(\pi \sim \) Dirichlet(1,1,1,1).
- Gamma shape parameter: \(\lambda \sim \) Gamma(0.5,1).
For more information about the parameters involved in the phylogenetic analysis, see [9].
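To make the hierarchical priors above concrete, the following Python sketch draws one set of parameters from them using only the standard library. It rests on parameterization assumptions that the text does not state: Exp(·) is read as rate-parameterized, Inverse-Gamma(3,0.2) as shape 3 and scale 0.2, and Gamma(0.5,1) as shape 0.5 with unit scale. The function name `sample_priors` is hypothetical, and this is not the author's code.

```python
import random


def sample_priors(rng, n_branches=8, n_rates=5):
    """Draw one parameter set from the priors, under assumed parameterizations."""
    # mu ~ Inverse-Gamma(shape=3, scale=0.2), drawn as the reciprocal of a
    # Gamma(shape=3, rate=0.2) variate; gammavariate takes (shape, scale).
    mu = 1.0 / rng.gammavariate(3.0, 1.0 / 0.2)
    # t_i | mu ~ Exp(rate = 1/mu), so each branch length has mean mu.
    branch_lengths = [rng.expovariate(1.0 / mu) for _ in range(n_branches)]
    # phi ~ Exp(1) and q_i | phi ~ Exp(rate = phi).
    phi = rng.expovariate(1.0)
    relative_rates = [rng.expovariate(phi) for _ in range(n_rates)]
    # pi ~ Dirichlet(1, 1, 1, 1): normalize four Gamma(1, 1) variates.
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(4)]
    total = sum(gammas)
    base_frequencies = [g / total for g in gammas]
    # lambda ~ Gamma(shape=0.5, scale=1).
    gamma_shape = rng.gammavariate(0.5, 1.0)
    return {
        "mu": mu,
        "branch_lengths": branch_lengths,
        "phi": phi,
        "relative_rates": relative_rates,
        "base_frequencies": base_frequencies,
        "gamma_shape": gamma_shape,
    }


rng = random.Random(2017)
draw = sample_priors(rng)
for name, value in draw.items():
    print(name, value)
```

Under the assumed parameterization, \(\mu \) has prior mean 0.2/(3 − 1) = 0.1, so a typical prior branch length is of the order of 0.1.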
Appendix 2
Nested sampling [8] is a Bayesian algorithm whose main purpose is to estimate the marginal likelihood. It requires a tuning parameter, the number of active points, which controls the precision of the estimate: the more active points are used, the more accurate the estimate and the higher the computational cost.
To estimate the observed marginal likelihood, we use 100 active points. This yields a standard deviation of 0.73 for the log-marginal likelihood estimate. For the 1000 randomized datasets, we use five active points in order to get a quick picture of their log-marginal likelihood distribution.
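The role of the active points can be illustrated with a minimal nested sampling sketch on a toy problem. Everything here is an assumption made for illustration and is unrelated to the phylogenetic implementation of [4]: the model has a single parameter with a Uniform(0, 1) prior and a Gaussian likelihood centered at 0.5, so the true log-marginal likelihood is very close to 0, and new points can be drawn directly from the constrained prior because the region of higher likelihood is an interval around 0.5. In realistic models this step requires a more elaborate sampler.

```python
import math
import random


def log_likelihood(theta, mu=0.5, sigma=0.1):
    """Gaussian log-likelihood of the toy model (not a phylogenetic likelihood)."""
    return -0.5 * ((theta - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def nested_sampling(n_active, n_iter, rng, mu=0.5):
    """Estimate the log-marginal likelihood of the toy model."""
    # Active points are initially drawn from the Uniform(0, 1) prior.
    points = [rng.random() for _ in range(n_active)]
    log_z = -math.inf
    log_x_prev = 0.0  # log of the remaining prior mass, starting at X_0 = 1
    for i in range(1, n_iter + 1):
        worst = min(range(n_active), key=lambda k: log_likelihood(points[k]))
        log_l_worst = log_likelihood(points[worst])
        # Deterministic approximation of the prior mass: X_i = exp(-i / N).
        log_x = -i / n_active
        # Weight of the discarded point: w_i = X_{i-1} - X_i.
        log_w = log_x_prev + math.log1p(-math.exp(log_x - log_x_prev))
        log_z = log_add(log_z, log_l_worst + log_w)
        # Replace the worst point by a prior draw with higher likelihood.
        # For this symmetric toy likelihood that region is |theta - mu| < r.
        r = abs(points[worst] - mu)
        points[worst] = rng.uniform(max(0.0, mu - r), min(1.0, mu + r))
        log_x_prev = log_x
    # Add the contribution of the remaining active points.
    for p in points:
        log_z = log_add(log_z, log_likelihood(p) + log_x_prev - math.log(n_active))
    return log_z


rng = random.Random(1)
for n_active in (5, 100):
    estimate = nested_sampling(n_active, 10 * n_active, rng)
    print(n_active, "active points: log-marginal likelihood estimate", estimate)
```

Repeating the run with different seeds shows the trade-off described above: estimates obtained with five active points scatter much more widely around the true value than those obtained with 100, while each run is correspondingly cheaper.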
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Maturana Russel, P. (2018). Bayesian Support for Evolution: Detecting Phylogenetic Signal in a Subset of the Primate Family. In: Polpo, A., Stern, J., Louzada, F., Izbicki, R., Takada, H. (eds) Bayesian Inference and Maximum Entropy Methods in Science and Engineering. maxent 2017. Springer Proceedings in Mathematics & Statistics, vol 239. Springer, Cham. https://doi.org/10.1007/978-3-319-91143-4_20
DOI: https://doi.org/10.1007/978-3-319-91143-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91142-7
Online ISBN: 978-3-319-91143-4
eBook Packages: Mathematics and Statistics (R0)