Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

Berglund-Sonnhammer, Ann-Charlotte; Steffansson, Pär; Betts, Matthew J.; Liberles, David A.

doi:10.1007/s00239-005-0096-1

Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

Published: 07 July 2006

Volume 63, pages 240–250, (2006)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

Journal of Molecular Evolution Aims and scope Submit manuscript

Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

Download PDF

Ann-Charlotte Berglund-Sonnhammer^1,2,
Pär Steffansson¹,
Matthew J. Betts³^nAff4 &
…
David A. Liberles^1,3^nAff5

356 Accesses
60 Citations
Explore all metrics

Abstract

Gene duplication and gene loss as well as other biological events can result in multiple copies of genes in a given species. Because of these gene duplication and loss dynamics, in addition to variation in sequence evolution and other sources of uncertainty, different gene trees ultimately present different evolutionary histories. All of this together results in gene trees that give different topologies from each other, making consensus species trees ambiguous in places. Other sources of data to generate species trees are also unable to provide completely resolved binary species trees. However, in addition to gene duplication events, speciation events have provided some underlying phylogenetic signal, enabling development of algorithms to characterize these processes. Therefore, a soft parsimony algorithm has been developed that enables the mapping of gene trees onto species trees and modification of uncertain or weakly supported branches based on minimizing the number of gene duplication and loss events implied by the tree. The algorithm also allows for rooting of unrooted trees and for removal of in-paralogues (lineage-specific duplicates and redundant sequences masquerading as such). The algorithm has also been made available for download as a software package, Softparsmap.

Inferring Multiple Consensus Trees and Supertrees Using Clustering: A Review

Comparing Methods for Species Tree Estimation with Gene Duplication and Loss

Rooting Species Trees Using Gene Tree-Species Tree Reconciliation

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

As species and their genomes diverge during evolutionary history, the sets of genes and their sequences also diverge. Gene duplication has been proposed as a crucial source of evolutionary innovation in organisms, like Eukaryotes, with small effective population sizes (Ohno 1970; Francino 2005). With duplication comes initial redundancy, followed by neofunctionalization, subfunctionalization, and, most commonly. pseudogenization (see Lynch et al. 2001; Rastogi and Liberles 2005). This differential retention of duplicate genes between species can result in a different phylogenetic tree for individual gene families than for the species as a whole. Further, differential parsing of shared ancestral gene and nucleotide polymorphism (see Blanchette et al. 2004) as well as uncertainty in tree calculation methodologies (especially for EST and partial sequences) can obfuscate the correlation between the evolutionary history of a gene and the species. Additionally, when using Genbank (Benson et al. 2005) or even draft genome sequences as the starting point for phylogenetic analysis, many species will be represented with some genes, while others that are actually present in the species genomes will not be represented in the datasets and falsely appear to have been lost or alternatively falsely appear in many copies. On top of ambiguity at the gene tree level, many vertices in species trees are unresolved and are represented as nonbinary to reflect this species history ambiguity.

Goodman et al. (1979) first introduced a mapping, which was then formalized by Page (1994), to explain the difference between a gene tree and its species tree. Additional algorithms for computing these mappings have also been presented (Zhang 1997; Eulenstein et al. 1998; Zmasek and Eddy 2001b). Another approach, Notung (Chen et al. 2000; Durand et al. 2005), allows consideration of gene tree uncertainty through a bootstrap value threshold and implements a weighting of gene duplication and loss events.

The problem of reconciling a gene tree to a species tree is used to solve two opposite but connected problems. The first problem is to infer a species tree given a set of gene trees, where the gene trees have different evolutionary histories (Guigo et al. 1996; Page 2000; Ma et al. 2000; Page and Cotton 2000; Cotton and Page 2002; Hallett and Lagergren 2002). The other problem is to infer a gene tree or a set of gene trees given a trusted species tree (Arvestad et al. 2003, 2004). The reconciliation can also be used for locating duplication events with respect to a species tree (Guigo et al. 1996) and for orthology analysis (Zmasek and Eddy 2002; Arvestad et al. 2003, 2004). Recent work has also focused on extending this type of approach to differentiating gene duplication events from lateral transfer events (Hallet et al. 2004). In this paper we wish to infer a rooted binary gene tree given a rooted nonbinary species tree and an unrooted, binary, or nonbinary gene tree considering the process of gene duplication.

These previously described algorithms require (or infer) a binary species tree and the approach has been successfully applied on a large scale, when there are no ambiguities in the species tree (Koonin et al. 2004). However, the NCBI taxonomy database (which, while formally a taxonomy, is commonly used as a species tree) (Benson et al. 2005) and other reference species trees are not binary in many places due to uncertainties, between gene trees both from different genes and from morphological characters (for example, the resolution of eutherian mammals). This problem was solved by Koonin et al. (2004) by performing the calculation over their species tree twice for the species in their dataset (once for a topology consistent with a clade of Ecdysozoa and once for a topology consistent with a clade of Coelomata). Here, building on previous algorithmic work, we present a more general mapping using a parsimonious approach toward uncertain speciation events or soft polytomies (for the original definition of soft polytomies, see Maddison 1989). Because the method embraces soft polytomies, we term it soft parsimony, in contrast to previous work, which we term hard parsimony.

One alternative to gene tree-to-species tree mapping for rooting of unrooted trees is midpoint rooting, where the point that is farthest from any extant sequence is designated as the root. However, as heterotachy (different modes of evolution in different subtrees of gene family trees) does not appear to be uncommon, this can falsely assign a root to a more recent fast-evolving branch (see Galtier 2001; Lopez et al. 2002; Siltberg and Liberles 2002).

For the reasons listed above, in the development of a large-scale database for understanding species evolution through the evolution of gene families (Liberles et al. 2001; Roth et al. 2005), it has been necessary to develop a soft parsimony based approach to map gene trees onto species trees. In future implementations of The Adaptive Evolution Database (TAED), an analysis of gene content could be coupled to an analysis of sequence evolution, as lineage-specific duplication has been proposed to play a major role in lineage-specific organismal evolution (for an interesting discussion see Francino 2005). In addition to the bootstrap (or posterior probability) threshold also implemented in Notung, it has been necessary to implement some additional features driven by considerations in the starting dataset (Genbank). Because many species have sparse sampling of genes from their genomes, it has been necessary to minimize, first, gene duplications and, second, gene losses rather than minimizing them together and attributing (with a weight) the loss of a gene to an absence in the genome. Also, because of the redundancy in GenBank (GenBank is an uncurated depository for gene sequences and many genes including those with mutations, splice variants, sequencing errors, etc., appear as multiple independent entries), it has been necessary to treat in-paralogues (lineage-specific duplicates) as redundant entries in an effort to improve gene family signal. The algorithm, as an option, can seek to exclude in-paralogues, as these are then not counted as duplications and are filtered out. An algorithm is presented that enables a mapping with all of the above features using the flexible soft parsimony approach, together with a downloadable software package, Softparsmap.

Methods

Multiple sequence alignments (MSA) were calculated using POA (Grasso and Lee 2004) and phylogenetic trees were built using MrBayes (Huelsenbeck and Ronquist 2001). The parameters used for the tree calculations were as described by Roth et al. (2005). The NCBI taxonomy (Benson et al. 2005) was used as a species tree. The objective of our method is twofold. First, we aim to root the unrooted phylogenetic tree from MrBayes, using the information from the corresponding species tree. Second, we aim to infer a topology of poorly resolved groups in the gene tree based on the species tree with a minimization of duplication and subsequently loss events as optimality criteria, detecting and filtering out redundant copies in the process. The flowchart for the method is illustrated in Fig. 1.

Our approach of rooting the gene tree follows that of Notung, but the methods differ in how the minimum number of duplications and losses are computed. First, the number of duplications is minimized and then the number of losses is minimized for the trees with the minimum number of duplications. Also, our method does not return all binary gene trees that have the minimum number of duplications and losses.

Algorithms

By mapping the vertices of a gene tree to the vertices of the corresponding species tree, each inner gene tree vertex can be labeled as being a duplication or speciation event. Hence, different mappings will describe different evolutionary scenarios. The m-mapping for mapping the vertices in the gene tree to the vertices in the species tree was introduced by Goodman et al. 1979. For any gene tree vertex g, m(g) is the species to which genome g belongs. For our soft parsimony approach we defined another mapping, denoted M, for mapping gene tree vertices to species tree vertices. The two mappings are illustrated in Fig. 2, and the Appendix presents a formal definition of our M-mapping.

The objective of our method is to both root and resolve weakly supported edges of unrooted gene trees. This is done by finding the rooted gene trees corresponding to the unrooted gene tree that has the most parsimonious mapping, i.e., a mapping that results in the fewest duplications and losses.

Our method starts out with an unrooted, binary, or nonbinary gene tree, where all edges have bootstrap values or posterior probabilities attached to them. In the first step of our method a set of rooted gene trees is constructed by applying midedge rooting to each edge in the unrooted gene tree. Next the edges that have bootstrap values or posterior probabilities less then a predefined cutoff value are collapsed. From the resulting set of rooted gene trees with well-supported edges the following is performed. First, the minimum number of duplications is computed by summing over all gene tree vertices.

$$ Dup^{{(S,G)}} = \sum {_{g \in V(G)\backslash L(G)} } dup^{{(S,G)}} (g) $$

(1)

where dup ^(S,G)(g) is the minimum number of duplications associated with gene tree vertex g (see below). Since the number of duplications associated with any gene tree vertex is independent of the number of duplications associated with any other gene tree vertex, the summation over the gene tree vertices can be done in any order. Only the rooted gene trees that minimize the number of duplications are kept, and this results in a subset to the original set of rooted gene trees. Of course this subset might be equal to the original set. Second, for this subset of rooted gene trees the minimum number of losses is computed by summing over all gene tree vertices:

$$ Loss^{(S,G)} = \sum {_{g \in V(G)\backslash L(G)} } loss^{(S,G)} (g) $$

(2)

As in the previous step, only the rooted gene trees that meet the optimality criterion are kept. Here the optimality criterion is that the minimum number of losses should be minimized. Consequently, the resulting subset consists of rooted gene trees that all have the same number of duplications and losses. The weak edges that do not affect the number of duplications and losses are restored as they are encountered in the summation. Third, if the subset of rooted gene trees that minimize the number of duplications and losses has more than one member, the following procedure is applied to choose the preferred rooted gene tree. As soon as only one tree satisfies a criterion, the procedure stops. The first criterion is that the preferred tree should have the most internal vertices (i.e., the most nodes or branching points), the second criterion is that the preferred tree should have the least number of weak edges, and the third criterion is that the preferred tree should have the shortest root distance. However, it is not certain that a gene tree can be chosen from this procedure and in such cases our method returns any of the trees in the subset together with a warning. Fourth, the preferred rooted gene tree might not be binary, and thus the next step is to resolve the remaining collapsed edges. This is done by adding splits from the corresponding species tree and, if necessary, information from adding outgroups to the original unrooted gene tree (see the Appendix for more detail). Finally, in-paralogues are removed from the rooted gene tree by pairwise comparisons of the gene sequences of the in-paralogues. Given the sequences of two in-paralogues we choose to keep one of them according to the following criteria. First, if one of the sequences is complete while the other is only a fragment, the complete one is kept. Second, the longest sequence is kept. Third, the sequence with the highest GI number (most recent entry to GenBank) is kept.

Minimizing the Number of Duplications

As the leaves have no duplications associated with them, the minimum number of duplications for a given gene tree is computed by summing the minimum number of duplications for each inner gene tree vertex as shown in expression (1). Since the numbers of duplications associated with each gene tree vertex are independent, the computations can be done in any order. When the minimum number of duplication is computed for any inner gene tree vertex g to a gene tree G, the subtree rooted at g is considered, i.e., G _g.Given a binary inner gene tree vertex, the minimum number of duplications can readily be computed from

$$ dup^{(S,G)} (g)= \left\{\matrix{1\quad {\rm if\ the\ two\ children\ of\ } g \ {\rm have} \cr {\rm descendents\ within\ the\ same}\cr {\rm extant\ genome} \cr 0\quad{\rm otherwise}}\right. $$

(3)

For any nonbinary inner gene tree vertex g, the minimum number of duplications associated with g is calculated by partitioning the child vertices of g into sets, such that the members of any set do not have descendants in the same extant genome. The partitioning is done such that the number of sets is minimized. An example of how the minimum number of duplications is computed for a gene tree vertex is shown in Fig. 3a. The problem of computing the minimum number of duplications for nonbinary gene tree vertices in this way is NP-complete; see Theorem A4 in the Appendix.

Minimizing the Number of Losses

The minimum number of losses is computed after the minimum number of duplications has been computed. Moreover, the computations are only performed for the rooted gene trees that minimize the number of duplications, as gene loss is a secondary optimization criterion to gene duplication. When the minimum number of losses is computed for any inner gene tree vertex g, the subtree rooted at g with the children of g as leaves is considered. If the gene vertex is nonbinary, so is the subtree, and thus, a refinement of this nonbinary subtree is constructed before the minimum number of losses is computed. Our algorithm separates three types of loss that can occur in the planted subtree rooted at g and containing one of the two children of g. For each gene tree vertex we have to sum over these three types of loss:

$$ \eqalign{loss^{(S,G)} (g)&=\sum\nolimits_{i=1}^2 (loss_1 (g,c_i(g)) +\ loss_2 (g,c_i(g)) \cr &\quad + loss_3 (g,c_i(g)))} $$

(4)

For any binary gene tree vertex the minimum number of losses is computed directly using expression (4), but for nonbinary vertices in the gene tree another approach must be taken. The proposed algorithm for computing the minimum number of losses for a nonbinary gene tree vertex and the corresponding species tree is approximate, and it is presented in detail in the Appendix. For each nonbinary gene tree vertex g, the corresponding partitioning of the child vertices into sets, given from the minimization of duplications algorithm, is used to resolve the uncertainty, i.e., create a binary tree with the children of the current vertex as leaves. This tree is constructed such that the vertices labeled as duplications are as close to the leaves of this tree (as this will count redundant sequences as in-paralogues rather than out-paralogues, enabling them to be filtered from the duplication calculation). This tree is then used together with the corresponding species subtree in expression (2) to compute the minimum number of losses for the current gene vertex. In Fig. 3b, an example of how the minimum number of losses is computed for a gene tree vertex is presented.

Software

Softparsmap is available for download from http://www.ii.uib.no/∼steffpar/softparsmap/ . It is written in Java and requires JDK 1.4.2 or later.

Results and Discussion

The systematic application of a soft parsimony approach for analyzing gene trees in the context of species trees has been performed as part of The Adaptive Evolution Database (TAED) (Roth et al. 2005). Here, vertices with posterior probabilities of <0.7 were collapsed to nonbinary trees. Then the NCBI taxonomy (Benson et al. 2005) was used as a species tree to minimize the number of gene duplication and loss events in the gene family tree to produce a binary result.

A total of 1217 of 11,704 rootable gene families trees in the Chordate half of TAED were modified using this approach. This approach holds equal value for other gene family databases, where identification of orthologous genes is important. From TAED, a sample tree, where the algorithm corrects a tree in the expected manner, is shown in Figs. 4a and b. The example shown is malate dehydrogenase.

As calculated by Mr. Bayes (Huelsenbeck and Ronquist 2001), there is no root of the gene family tree that generates Artiodactyls (hoofed mammals with an even number of toes) as a monophyletic group without inferring a gene duplication and multiple selective loss events. Application of the soft parsimony algorithm results in the expected rooted tree, shown in Fig. 4b.

To walk through this, in the first step, the root is placed on the lineage separating Branchiostoma (amphioxus) from teleost (bony) fish. This results in one implied gene duplication event in eutherian (placental) mammals. Next the branch leading to Sus scrofa below the posterior probability threshold of 0.70 is collapsed. leading to possible nonbinary trees linking it with the Ovis aries/rodent vertex. Then the number of duplication events associated with different resolutions is assessed, and a resolution of the Sus scrofa/Ovis aries grouping as Artiodactyls with the root still on the branch separating amphioxus from teleosts now implies no gene duplication events.

Of course, gene duplication and loss events do occur. Bayesian approaches, which treat such events probabilistically, result in the explanation that such events are rarer and less likely to explain a tree such as that shown in Fig. 4a than, in this case, statistical uncertainty of branching (Arvestad et al. 2003).

Many of the families in TAED that have been corrected are multigene families, where the branching order is different following a gene duplication event and gene loss events. An example of this, where an ancient gene duplication event preceded the divergence of eutherian mammals and where the optimal tree shows a different eutherian mammal topology, is shown in Figs. 5a and b, from the guanine nucleotide binding protein gene family in TAED. This is an example of a tree where hard parsimony would force a less preferred topology on one of the postduplication clades. In this case, Primates are the outgroup to Rodents, Carnivores, and Artiodactyls on one half of the tree and Rodents are the outgroup to Primates, Carnivores, and Artiodactyls on the other half. Without a posterior probability threshold, the hard parsimony approach would infer a gene duplication event. With a posterior probability threshold (in this case), the tree would be corrected to one with even lower support to prevent inference of a gene duplication event using hard parsimony.

The algorithm presented under Methods and applied to TAED is available as software, called Softparsmap, and is available for download at http://www.ii.uib.no/∼steffpar/softparsmap/ . The optional functionality available in the software package includes tree rooting by minimization of gene duplication and loss events, removal of in-paralogues, removal of uncertainties or correction of weakly supported splits based on the reference species tree, gene tree-to-species tree mapping to allow identification of orthologues and paralogues, and comparison of tree topologies.

A fast, flexible, powerful approach is presented that allows a gene tree-to-species tree mapping using a soft parsimony algorithm. This approach has been applied systematically to gene families based on GenBank in TAED, modifying over 10% of gene families. The software package available from this method should be a valuable addition to the evolutionary bioinformatics toolbox.

References

Arvestad L, Berglund AC, Lagergren J, Sennblad B (2003) Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19:I7–I15
Article PubMed Google Scholar
Arvestad L, Berglund AC, Lagergen J, Sennblad B (2004) Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence analysis. RECOMB 2004:326–335
Google Scholar
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) Genbank. Nucleic Acids Res 33:D34–D38
Article PubMed CAS Google Scholar
Blanchette M, Green ED, Miller W, Haussler D (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res 14:2412–2423
Article PubMed CAS Google Scholar
Chen K, Durand D, Farach-Colton M (2000) Notung: a program for dating gene duplications and optimizing gene family trees. J Comput Biol 7:429–447
Article PubMed CAS Google Scholar
Cotton JA, Page RDM (2002) Going nuclear: Gene family evolution and vertebrate phylogeny reconciled. Proc Roy Soc London 269:1555–1561
Article CAS Google Scholar
Durand D, Halldorsson BV, Vernot B (2005) A hybrid micro-macroevolutionary approach to gene tree reconstruction. RECOMB 2005:250–264
Google Scholar
Eulenstein O, Mirkin B, Vingron M (1998) Duplication-based measures of difference between gene and species trees. J Comput Biol 5:135–148
PubMed CAS Google Scholar
Francino MP (2005) An adaptive radiation model for the origin of new gene functions. Nature Genet 37:573–577
Article PubMed CAS Google Scholar
Galtier N (2001) Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol 18:866–873
PubMed CAS Google Scholar
Garey MR, Johnson DS (1979) Computers and intractability, a guide to the theory of NP-completeness. Freeman, New York
Google Scholar
Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 28:132–163
Article CAS Google Scholar
Grasso C, Lee C (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20:1546–1556
Article PubMed CAS Google Scholar
Guigo R, Muchnik I, Smith TF (1996) Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213
Article PubMed CAS Google Scholar
Hallett MT, Lagergren J (2000) New algorithms for the duplication-loss model. RECOMB 2000:138–146
Google Scholar
Hallet M, Lagergren J, Tofigh A (2004) Simultaneous identification of duplications and lateral transfers. RECOMB 2004:347–356
Google Scholar
Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755
Article PubMed CAS Google Scholar
Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2004) A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol 5(2):R7
Article PubMed Google Scholar
Liberles DA, Schreiber DR, Govindarajan S, Chamberlin SG, Benner SA (2001) The Adaptive Evolution Database (TAED). Genome Biol 2(8):research0028.1-0028.6
PubMed CAS Google Scholar
Lopez P, Casane D, Philippe H (2002) Heterotachy, an important process of protein evolution. Mol Biol Evol 19:1–7
PubMed CAS Google Scholar
Lynch M, O’Hely M, Walsh B, Force A (2001) The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789–1804
PubMed CAS Google Scholar
Ma B, Li M, Zhang LX (2000) From gene trees to species trees. SIAM J Comput 30:729–752
Article Google Scholar
Maddison WP (1989) Reconstructing character evolution on polytomous cladograms. Cladistics 5:365–377
Article Google Scholar
Ohno S (1970) Evolution by gene duplication. Springer-Verlag, New York
Google Scholar
Page RDM (1994) Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol 43:58–77
Google Scholar
Page RDM (2000) Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol Phylogenet Evol 14:89–106
Article PubMed CAS Google Scholar
Page RDM, Cotton (2000) GeneTree: a tool for exploring gene family evolution. In: Sankoff D, Nadeau J (eds) Map alignment, and the evolution of gene families. Kluwer Academic, Dordrecht, pp 525–536
Google Scholar
Rastogi S, Liberles DA (2005) Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28
Article PubMed CAS Google Scholar
Roth C, Betts MJ, Steffansson P, Saelensminde G, Liberles DA (2005) The Adaptive Evolution Database (TAED): a phylogeny-based tool for comparative genomics. Nucleic Acids Res 33:D495–D497
Article PubMed CAS Google Scholar
Siltberg J, Liberles DA (2002) A simple covarion-based approach to analyse nucleotide substitution rates. J Evol Biol 15:588–594
Article CAS Google Scholar
Zhang LX (1997) On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol 4:177–187
Article PubMed CAS Google Scholar
Zmasek CM, Eddy SR (2001a) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17:383–384
Article CAS Google Scholar
Zmasek CM, Eddy SR (2001b) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17:821–828
Article CAS Google Scholar
Zmasek CM, Eddy SR (2002) RIO: Analyzing proteomes by automated phylogenomics using resamples inference of orthologs. BMC Bioinform 3:14
Article Google Scholar

Download references

Acknowledgments

We are grateful to Jens Lagergren for helpful discussions and also to three anonymous reviewers for their comments. This work was funded by Vetenskapsrådet, the Swedish Foundation for Strategic Research, and FUGE, the Norwegian national functional genomics platform.

Author information

Matthew J. Betts
Present address: EMBL Heidelberg, Meyerhoffstrasse 1, 69117, Heidelberg, Germany
David A. Liberles
Present address: Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, 82071, USA

Authors and Affiliations

Stockholm Bioinformatics Center, Stockholm University, 10691, Stockholm, Sweden
Ann-Charlotte Berglund-Sonnhammer, Pär Steffansson & David A. Liberles
Linnaeus Centre for Bioinformatics, Uppsala University, BMC Box 598, 75124, Uppsala, Sweden
Ann-Charlotte Berglund-Sonnhammer
Computational Biology Unit, BCCS, University of Bergen, 5020, Bergen, Norway
Matthew J. Betts & David A. Liberles

Authors

Ann-Charlotte Berglund-Sonnhammer
View author publications
You can also search for this author in PubMed Google Scholar
Pär Steffansson
View author publications
You can also search for this author in PubMed Google Scholar
Matthew J. Betts
View author publications
You can also search for this author in PubMed Google Scholar
David A. Liberles
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to David A. Liberles.

Additional information

[Reviewing Editor: Dr. Rafael Zardoya]

Ann-Charlotte Berglund-Sonnhammer and Pär Steffansson contributed equally to this work.

Appendix: Mathematical Properties of the Soft Parsimony Algorithm

Definitions and Notation

A tree T is a connected graph with no cycles, and consists of a vertex set V(T) and an edge set E(T). The leaves of a tree are the vertices with degree 1, and the leaf set of T is denoted L(T), which is a subset of V(T). A tree T is rooted if there is exactly one distinguished vertex, the root, which is denoted r(T). For a rooted tree T, a child of a vertex u ∈ V(T) is denoted $ c_i^T (u) $, and the set of children for a vertex u in T is denoted C ^T(u). The rooted subtree of T rooted at u ∈ V(T) is denoted T _u. The planted subtree rooted at u containing only the child vertex c _i(u) is denoted $ T_{uc_i } (u) $. Further, for a nonroot vertex v ∈ V(T), let p ^T(v) be the parent of v. For any u, v ∈ V(T), let v ≤ ^T u if v ∈ V(T _u) and let v <^T u if $ v \in V(T_u )\backslash \{ u\} $. A rooted tree T is binary if all interior vertices have two children and an unrooted tree is binary if all interior vertices are connected to three other vertices. For a tree T, d = (L ₁, L ₂) is a split in T if there exists an edge e ∈ E(T) such that L ₁, L ₂ are the two leaf sets of the trees formed when e is removed. Given a tree T and a set of vertices $ V^{\prime} \subseteq V(T) $, let LCA ^T(V′) be the last common ancestor of the vertices in V′. Tree indexes are omitted whenever it is clear from the context, and trees are rooted if else is not specified.

A species tree S is a phylogenetic tree that describes the relationship between extant species. A gene tree G is a phylogenetic tree with a leaf labeling function σ:L(G)→L(S). The leaves in G represent genes and the leaves in S represent extant species, and thus a gene g ∈ L(G) belongs to the genome of species σ(g). We denote the set of all extant genomes that are represented by the leaves of the gene tree $ \sum {(G)} = \cup _{g \in L(G)} \sigma (g) $. For two rooted gene trees G and G′, the leaf g ₁∈L(G) equals the leaf g ₂ ∈ L(G′) if g ₁ and g ₂ are representing the same gene. The internal vertex g ₁∈V(G)\L(G) equals g ₂∈V(G′)\L(G′) if $ L(G_{g_1 } ) = L(G^{\prime}_{g_2 } ). $ Further, G is said to refine G′ if for every g ₂∈V(G′) there exists a g ₁∈V(G) such that g ₁ = g ₂. Goodman et al. (1979) introduced a mapping m, mapping the vertices in the gene tree to the vertices in the species tree. More precisely, given a species tree S and a gene tree G, m is a surjective mapping such that for all, $ g\in V(G),m(g) = LCA^S (\sum {(G_g )} ) $.

Gene Duplication and Gene Loss

Here we introduce a new soft parsimony mapping for uncertain species trees and binary gene trees. Further, we define how to compute gene duplication and gene loss events using this mapping.

Given a species tree S and a rooted binary gene tree G such that Σ(G) = L(S), let M be the mapping M: V(G)→2^V(S) such that

1.
For any g ∈ V(G) such that m(g) ∈ L(S), M(g) = {m(g)}.
2.
For any g ∈ V(G) such that m(g) ∉ L(S), $ M(g)= {\left\{ x \in C^S (m(g))|\exists g^{\prime} \in V(G), g^{\prime} < ^G g \wedge m(g^{\prime}) \le^s x \right\}} $ and define $ Z(g) = \cup _{s \in M(g)} L(S_s ) $.

Each mapping M for a given gene tree and species tree describes a hypothesis of how the gene tree evolved with respect to the species tree and, thus, which gene tree vertices are duplication events and which are speciation events.

For any interior vertex g ₁∈V(G)\L(G) the number of duplications associate with g is

$$ dup^{(S,G)} (g) =\left\{\matrix{1 \quad {\rm if} \, |Z(c_1 (g)) \cap Z(c_2 (g))| > 0 \cr 0 \quad {\rm otherwise}}\right. $$

(A1)

and the total number of duplications for G is

$$ Dup^{(S,G)} = \sum {_{g \in V(G)\backslash L(G)} } dup^{(S,G)} (g) $$

(A2)

The number of losses associated with a vertex g∈V(G) and one of its two children c _i(g)∈V(G) is defined as

$$ loss_1 (g,c_i (g)) = |\{ s \in V(S)\} |m(c_i (g)) <^s s <^s m(g)\} | $$

(A3)

$$ loss_2 (g,c_i (g)) =\left\{\matrix{1\quad {\rm if}\, m(c_i(g)) < ^s m(g) \wedge |C^s \cr (m(c_i (g)))\backslash M(c_i (g))| > 0 \cr 0 \quad {\rm otherwise}}\right. $$

(A4)

$$ loss_3 (g,c_i (g))=\left\{\matrix{1\quad {\rm if}\ dup^{(S,G)}\ (g) = 1 \cr \wedge M(g) \ne M(c_i (g)) \cr 0\quad {\rm otherwise}}\right. $$

(A5)

Then the number of losses assigned to an interior vertex g∈V(G)\L(G) and the total number of losses in the gene tree is

$$ \eqalign{loss^{(S,G)} (g) = &\sum\nolimits_{i = 1}^2 {(loss_1 (g,c_i (g))} \cr &+ loss_2 (g,c_i (g)) + loss_3 (g,c_i (g)))} $$

(A6)

and

$$ Loss^{(S,G)} = \sum {_{g \in V(G)\backslash L(G)} } loss^{(S,G)} (g) $$

(A7)

Uncertain Gene Trees

In many real gene trees, vertices with more then two children exist. Here we present a heuristic algorithm for computing the minimum number of duplications and losses over the set of all gene trees refining the given gene tree. Duplications are prioritized before losses.

Let G be any gene tree, and let S be a species tree, such that Σ(G) = L(S). The set of binary gene trees that refine G is denoted BG ^G. Moreover, the set of trees in BG ^G that have the minimum number of duplications is denoted $ BG_{min }^{(S,\;G)} $ and can be expressed as $ BG_{min}^{(S,\;G)}=\{ H^{\prime} \in BG^G\mathop{|\forall}\limits^{\ \ min} H \in BG^G,\, {Dup^{(S,\;H^{\prime})}} \le {Dup^{(S,\;H)}}\mathop{\}}\limits^{min}$

The minimization of the number of losses is constrained by the minimum number of duplications. Thus, the numbers of duplications and losses are defined as

$ Dup_{min }^{(S,\;G)} =Dup_{}^{(S,\;H)} \;{\rm where} \;H \in BG^{(S,\;G)} $ and $ Loss_{min}^{(S,\;G)} =Minimum_{H \in BG_{min}^{(S,\;G) Loss^{(S,\;H)}}} $

Computing the number of duplications $ Dup_{min }^{(S,\;G)} $ is NP-complete (see Theorem A4), but for our heuristic approach, described below, duplications can be computed in polynomial time.