Introduction

As species and their genomes diverge during evolutionary history, the sets of genes and their sequences also diverge. Gene duplication has been proposed as a crucial source of evolutionary innovation in organisms, like Eukaryotes, with small effective population sizes (Ohno 1970; Francino 2005). With duplication comes initial redundancy, followed by neofunctionalization, subfunctionalization, and, most commonly, pseudogenization (see Lynch et al. 2001; Rastogi and Liberles 2005). This differential retention of duplicate genes between species can result in a different phylogenetic tree for individual gene families than for the species as a whole. Further, differential parsing of shared ancestral gene and nucleotide polymorphism (see Blanchette et al. 2004) as well as uncertainty in tree calculation methodologies (especially for EST and partial sequences) can obfuscate the correlation between the evolutionary history of a gene and that of the species. Additionally, when using GenBank (Benson et al. 2005) or even draft genome sequences as the starting point for phylogenetic analysis, many species will be represented by only some of their genes: genes that are actually present in a species' genome may be missing from the dataset and falsely appear to have been lost, or alternatively may falsely appear in many copies. On top of ambiguity at the gene tree level, many vertices in species trees are unresolved and are represented as nonbinary to reflect this ambiguity in the species history.

Goodman et al. (1979) first introduced a mapping, which was then formalized by Page (1994), to explain the difference between a gene tree and its species tree. Additional algorithms for computing these mappings have also been presented (Zhang 1997; Eulenstein et al. 1998; Zmasek and Eddy 2001b). Another approach, Notung (Chen et al. 2000; Durand et al. 2005), allows consideration of gene tree uncertainty through a bootstrap value threshold and implements a weighting of gene duplication and loss events.

The problem of reconciling a gene tree to a species tree is used to solve two opposite but connected problems. The first problem is to infer a species tree given a set of gene trees, where the gene trees have different evolutionary histories (Guigo et al. 1996; Page 2000; Ma et al. 2000; Page and Cotton 2000; Cotton and Page 2002; Hallett and Lagergren 2002). The other problem is to infer a gene tree or a set of gene trees given a trusted species tree (Arvestad et al. 2003, 2004). The reconciliation can also be used for locating duplication events with respect to a species tree (Guigo et al. 1996) and for orthology analysis (Zmasek and Eddy 2002; Arvestad et al. 2003, 2004). Recent work has also focused on extending this type of approach to differentiating gene duplication events from lateral transfer events (Hallett et al. 2004). In this paper we wish to infer a rooted binary gene tree, given a rooted nonbinary species tree and an unrooted gene tree that may be binary or nonbinary, considering the process of gene duplication.

These previously described algorithms require (or infer) a binary species tree, and the approach has been successfully applied on a large scale when there are no ambiguities in the species tree (Koonin et al. 2004). However, the NCBI taxonomy database (which, while formally a taxonomy, is commonly used as a species tree) (Benson et al. 2005) and other reference species trees are not binary in many places due to uncertainties, both between gene trees from different genes and between trees from morphological characters (for example, the resolution of eutherian mammals). This problem was solved by Koonin et al. (2004) by performing the calculation over their species tree twice for the species in their dataset (once for a topology consistent with a clade of Ecdysozoa and once for a topology consistent with a clade of Coelomata). Here, building on previous algorithmic work, we present a more general mapping using a parsimonious approach toward uncertain speciation events or soft polytomies (for the original definition of soft polytomies, see Maddison 1989). Because the method embraces soft polytomies, we term it soft parsimony, in contrast to previous work, which we term hard parsimony.

One alternative to gene tree-to-species tree mapping for rooting of unrooted trees is midpoint rooting, where the point that is farthest from any extant sequence is designated as the root. However, as heterotachy (different modes of evolution in different subtrees of gene family trees) does not appear to be uncommon, this can falsely assign a root to a more recent fast-evolving branch (see Galtier 2001; Lopez et al. 2002; Siltberg and Liberles 2002).

For the reasons listed above, in the development of a large-scale database for understanding species evolution through the evolution of gene families (Liberles et al. 2001; Roth et al. 2005), it has been necessary to develop a soft parsimony-based approach to map gene trees onto species trees. In future implementations of The Adaptive Evolution Database (TAED), an analysis of gene content could be coupled to an analysis of sequence evolution, as lineage-specific duplication has been proposed to play a major role in lineage-specific organismal evolution (for an interesting discussion see Francino 2005). In addition to the bootstrap (or posterior probability) threshold also implemented in Notung, it has been necessary to implement some additional features driven by considerations in the starting dataset (GenBank). Because many species have sparse sampling of genes from their genomes, it has been necessary to minimize, first, gene duplications and, second, gene losses rather than minimizing them together and attributing (with a weight) the loss of a gene to an absence in the genome. Also, because of the redundancy in GenBank (GenBank is an uncurated depository for gene sequences, and many genes, including those with mutations, splice variants, sequencing errors, etc., appear as multiple independent entries), it has been necessary to treat in-paralogues (lineage-specific duplicates) as redundant entries in an effort to improve gene family signal. As an option, the algorithm can exclude in-paralogues, in which case they are filtered out and not counted as duplications. An algorithm is presented that enables a mapping with all of the above features using the flexible soft parsimony approach, together with a downloadable software package, Softparsmap.

Methods

Multiple sequence alignments (MSA) were calculated using POA (Grasso and Lee 2004) and phylogenetic trees were built using MrBayes (Huelsenbeck and Ronquist 2001). The parameters used for the tree calculations were as described by Roth et al. (2005). The NCBI taxonomy (Benson et al. 2005) was used as a species tree. The objective of our method is twofold. First, we aim to root the unrooted phylogenetic tree from MrBayes, using the information from the corresponding species tree. Second, we aim to infer a topology of poorly resolved groups in the gene tree based on the species tree with a minimization of duplication and subsequently loss events as optimality criteria, detecting and filtering out redundant copies in the process. The flowchart for the method is illustrated in Fig. 1.

Fig. 1. A flowchart describing the algorithmic process of rooting an unrooted tree.

Our approach of rooting the gene tree follows that of Notung, but the methods differ in how the minimum number of duplications and losses are computed. First, the number of duplications is minimized and then the number of losses is minimized for the trees with the minimum number of duplications. Also, our method does not return all binary gene trees that have the minimum number of duplications and losses.

Algorithms

By mapping the vertices of a gene tree to the vertices of the corresponding species tree, each inner gene tree vertex can be labeled as being a duplication or speciation event. Hence, different mappings will describe different evolutionary scenarios. The m-mapping for mapping the vertices in the gene tree to the vertices in the species tree was introduced by Goodman et al. (1979). For any gene tree vertex g, m(g) is the species to whose genome g belongs. For our soft parsimony approach we defined another mapping, denoted M, for mapping gene tree vertices to species tree vertices. The two mappings are illustrated in Fig. 2, and the Appendix presents a formal definition of our M-mapping.
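As a point of reference, the classic hard parsimony labeling can be sketched in a few lines. The sketch below implements the standard least-common-ancestor form of the m-mapping (not the authors' M-mapping, whose formal definition is in the Appendix) and labels an inner gene tree vertex as a duplication when it maps to the same species tree vertex as one of its children. The `Node` class, the function names, and the toy trees are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A rooted tree vertex; names are assumed unique within a tree."""
    name: str
    children: List["Node"] = field(default_factory=list)


def parent_and_depth(root: Node) -> Tuple[Dict[str, Optional[str]], Dict[str, int]]:
    """Record the parent and depth of every vertex in the species tree."""
    parent: Dict[str, Optional[str]] = {root.name: None}
    depth: Dict[str, int] = {root.name: 0}
    stack = [root]
    while stack:
        v = stack.pop()
        for c in v.children:
            parent[c.name] = v.name
            depth[c.name] = depth[v.name] + 1
            stack.append(c)
    return parent, depth


def lca(a: str, b: str, parent, depth) -> str:
    """Least common ancestor: repeatedly move the deeper vertex upward."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def lca_mapping(gene_root: Node, species_root: Node, gene_to_species: Dict[str, str]):
    """Return (m, dup): the m-mapping and a duplication flag per inner vertex."""
    parent, depth = parent_and_depth(species_root)
    m: Dict[str, str] = {}
    dup: Dict[str, bool] = {}

    def visit(g: Node) -> None:
        if not g.children:                      # leaf: species of the extant gene
            m[g.name] = gene_to_species[g.name]
            return
        for c in g.children:
            visit(c)
        s = m[g.children[0].name]
        for c in g.children[1:]:
            s = lca(s, m[c.name], parent, depth)
        m[g.name] = s
        # Hard parsimony rule: duplication if a child maps to the same vertex.
        dup[g.name] = any(m[c.name] == s for c in g.children)

    visit(gene_root)
    return m, dup


# Hypothetical toy data: species tree ((A,B)x,C)y and gene tree ((a1,b1),(a2,c1)).
species = Node("y", [Node("x", [Node("A"), Node("B")]), Node("C")])
genes = Node("r", [Node("g1", [Node("a1"), Node("b1")]),
                   Node("g2", [Node("a2"), Node("c1")])])
owner = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
m, dup = lca_mapping(genes, species, owner)
```

In this toy case the root `r` maps to `y`, the same vertex as its child `g2`, so it is labeled a duplication, while `g1` and `g2` are speciations.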

Fig. 2. At the left is a gene tree explaining the evolutionary relationship among the genes a1, b1, c1, d1, d2, e1, and e2. The labels of the inner vertices have been omitted. The circles denote the duplication events detected by the soft parsimony approach. At the right is the species tree corresponding to the gene tree. The leaves of the species tree are labeled with the extant species A, B, C, D, and E, and gene a1 belongs to the genome of species A, gene b1 to genome B, and so on. The two inner vertices are labeled x and y. For every vertex g in the gene tree, the m- and M-mappings are given in the form m(g)|M(g). For the edges in the gene tree where the soft parsimony definition of loss detects gene losses, the losses are printed out as (loss_1, loss_2, loss_3). If the species tree is nonbinary, our soft parsimony definition infers a number of duplications less than or equal to that of the definition introduced by Goodman et al. (1979), as well as a number of losses less than or equal to that of the definition by Guigo et al. (1996). However, if the species tree is binary, the definitions are equivalent, and thus our approach will give the same number of duplications and losses as these other two approaches.

The objective of our method is both to root unrooted gene trees and to resolve their weakly supported edges. This is done by finding the rooted gene trees corresponding to the unrooted gene tree that have the most parsimonious mapping, i.e., a mapping that results in the fewest duplications and losses.

Our method starts out with an unrooted gene tree, which may be binary or nonbinary, where all edges have bootstrap values or posterior probabilities attached to them. In the first step of our method a set of rooted gene trees is constructed by applying midedge rooting to each edge in the unrooted gene tree. Next the edges that have bootstrap values or posterior probabilities less than a predefined cutoff value are collapsed. From the resulting set of rooted gene trees with well-supported edges the following is performed. First, the minimum number of duplications is computed by summing over all inner gene tree vertices:

$$ Dup^{(S,G)} = \sum_{g \in V(G)\setminus L(G)} dup^{(S,G)}(g) \tag{1} $$

where dup^{(S,G)}(g) is the minimum number of duplications associated with gene tree vertex g (see below). Since the number of duplications associated with any gene tree vertex is independent of the number of duplications associated with any other gene tree vertex, the summation over the gene tree vertices can be done in any order. Only the rooted gene trees that minimize the number of duplications are kept, which results in a subset of the original set of rooted gene trees. Of course this subset might be equal to the original set. Second, for this subset of rooted gene trees the minimum number of losses is computed by summing over all inner gene tree vertices:

$$ Loss^{(S,G)} = \sum_{g \in V(G)\setminus L(G)} loss^{(S,G)}(g) \tag{2} $$

As in the previous step, only the rooted gene trees that meet the optimality criterion are kept; here the criterion is that the number of losses is minimal. Consequently, the resulting subset consists of rooted gene trees that all have the same number of duplications and the same number of losses. The weak edges that do not affect the number of duplications and losses are restored as they are encountered in the summation. Third, if the subset of rooted gene trees that minimize the number of duplications and losses has more than one member, the following criteria are applied in order to choose the preferred rooted gene tree, and the procedure stops as soon as only one tree satisfies a criterion. The first criterion is that the preferred tree should have the most internal vertices (i.e., the most nodes or branching points), the second criterion is that the preferred tree should have the fewest weak edges, and the third criterion is that the preferred tree should have the shortest root distance. However, this procedure is not guaranteed to single out one gene tree, and in such cases our method returns an arbitrary tree from the subset together with a warning. Fourth, the preferred rooted gene tree might not be binary, and thus the next step is to resolve the remaining collapsed edges. This is done by adding splits from the corresponding species tree and, if necessary, information from adding outgroups to the original unrooted gene tree (see the Appendix for more detail). Finally, in-paralogues are removed from the rooted gene tree by pairwise comparisons of the gene sequences of the in-paralogues. Given the sequences of two in-paralogues, we choose to keep one of them according to the following criteria. First, if one of the sequences is complete while the other is only a fragment, the complete one is kept. Second, the longer sequence is kept. Third, the sequence with the higher GI number (the more recent entry in GenBank) is kept.
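The selection logic described above can be sketched as a sequence of filters. This is a minimal illustration, not the Softparsmap implementation: it assumes that the duplication count, loss count, number of internal vertices, number of weak edges, and root distance have already been computed for each candidate rooting, and the `Candidate` and `Entry` records and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Precomputed summary of one rooted gene tree (all fields assumed given)."""
    label: str
    duplications: int
    losses: int
    internal_vertices: int
    weak_edges: int
    root_distance: float


def select_preferred(cands: List[Candidate]) -> Tuple[Candidate, bool]:
    """Return (tree, unique); unique=False corresponds to the warning case."""
    # Step 1: keep only the trees with the minimum number of duplications.
    min_dup = min(c.duplications for c in cands)
    pool = [c for c in cands if c.duplications == min_dup]
    # Step 2: among those, keep only the trees with the minimum number of losses.
    min_loss = min(c.losses for c in pool)
    pool = [c for c in pool if c.losses == min_loss]
    # Step 3: tie-breaking criteria, applied in order until one tree remains.
    criteria = [
        lambda c: -c.internal_vertices,   # most internal vertices
        lambda c: c.weak_edges,           # fewest weak edges
        lambda c: c.root_distance,        # shortest root distance
    ]
    for key in criteria:
        if len(pool) == 1:
            break
        best = min(key(c) for c in pool)
        pool = [c for c in pool if key(c) == best]
    return pool[0], len(pool) == 1


@dataclass(frozen=True)
class Entry:
    """A GenBank sequence entry (hypothetical record for illustration)."""
    gi: int
    length: int
    complete: bool


def keep_inparalogue(a: Entry, b: Entry) -> Entry:
    """Choose which of two in-paralogues to keep."""
    if a.complete != b.complete:          # prefer a complete sequence to a fragment
        return a if a.complete else b
    if a.length != b.length:              # then prefer the longer sequence
        return a if a.length > b.length else b
    return a if a.gi > b.gi else b        # then prefer the higher GI number


# Hypothetical candidates: t4 has no losses but more duplications, so it is
# rejected first, which is the point of minimizing duplications before losses.
trees = [
    Candidate("t1", 1, 3, 5, 0, 0.2),
    Candidate("t2", 1, 2, 5, 1, 0.4),
    Candidate("t3", 1, 2, 5, 0, 0.9),
    Candidate("t4", 2, 0, 6, 0, 0.1),
]
chosen, unique = select_preferred(trees)
```

Filtering in two stages rather than scoring a weighted sum reflects the stated design choice: losses are only a secondary criterion because sparse sampling in GenBank can make a present gene look lost.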

Minimizing the Number of Duplications

As the leaves have no duplications associated with them, the minimum number of duplications for a given gene tree is computed by summing the minimum number of duplications for each inner gene tree vertex as shown in expression (1). Since the numbers of duplications associated with each gene tree vertex are independent, the computations can be done in any order. When the minimum number of duplications is computed for any inner gene tree vertex g of a gene tree G, the subtree rooted at g, i.e., G_g, is considered. Given a binary inner gene tree vertex, the minimum number of duplications can readily be computed from

$$ dup^{(S,G)}(g) = \begin{cases} 1 & \text{if the two children of } g \text{ have descendants within the same extant genome} \\ 0 & \text{otherwise} \end{cases} \tag{3} $$

For any nonbinary inner gene tree vertex g, the minimum number of duplications associated with g is calculated by partitioning the child vertices of g into sets, such that the members of any set do not have descendants in the same extant genome. The partitioning is done such that the number of sets is minimized. An example of how the minimum number of duplications is computed for a gene tree vertex is shown in Fig. 3a. The problem of computing the minimum number of duplications for nonbinary gene tree vertices in this way is NP-complete; see Theorem A4 in the Appendix.
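To make the partitioning concrete, the sketch below finds a minimum partition by exhaustive backtracking, which is only practical for vertices with few children given that the problem is NP-complete. It is an illustration under stated assumptions, not the algorithm from the Appendix: each child is represented simply by the set of extant genomes in which it has descendants, and the function names are hypothetical. The number of duplications is taken as the number of sets minus one, as in the Fig. 3a example.

```python
from typing import Dict, List, Optional, Set


def min_partition(children: Dict[str, Set[str]]) -> List[List[str]]:
    """Split children into the fewest groups whose genome sets are disjoint.

    `children` maps a child vertex name to the set of extant genomes in
    which that child has descendants. Exhaustive search: small inputs only.
    """
    names = list(children)

    def try_k(k: int) -> Optional[List[List[str]]]:
        groups: List[List[str]] = [[] for _ in range(k)]
        used: List[Set[str]] = [set() for _ in range(k)]

        def place(i: int) -> bool:
            if i == len(names):
                return True
            genomes = children[names[i]]
            for j in range(k):
                if used[j].isdisjoint(genomes):
                    groups[j].append(names[i])
                    used[j] |= genomes
                    if place(i + 1):
                        return True
                    groups[j].pop()
                    used[j] -= genomes
                if not groups[j]:
                    # All remaining groups are empty too; trying them
                    # would only repeat the same search.
                    break
            return False

        return groups if place(0) else None

    for k in range(1, len(names) + 1):      # smallest feasible k wins
        result = try_k(k)
        if result is not None:
            return result
    return []


def duplications_at_vertex(children: Dict[str, Set[str]]) -> int:
    """Minimum duplications at a vertex: size of the minimum partition minus one."""
    return max(len(min_partition(children)) - 1, 0)


# Children of g from the Fig. 3a example, after collapsing weak edges.
fig3a = {"ab": {"A", "B"}, "bc": {"B", "C"}, "de": {"D", "E"}, "cd": {"C", "D"}}
# Children of g from the Fig. 3b example.
fig3b = {"a1": {"A"}, "b1": {"B"}, "a2": {"A"}, "b2": {"B"}, "a3": {"A"}}
```

For a binary vertex the same function reproduces expression (3): two children that share a genome need two sets (one duplication), and two children that do not share a genome fit in one set (no duplication).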

Fig. 3. (a) An overview of the gene duplication computation process is presented. In the upper left corner is the species tree corresponding to the rooted, binary gene tree in the upper right corner, for which we wish to compute the minimum number of duplications. The weak edges in the gene tree are labeled W. Duplication events are denoted by circles. Here we only show how to compute the minimum number of duplications for the gene tree vertex g. After the weak edges have been collapsed, the child vertices of g are equal to the set {ab,bc,de,cd} and the minimum partitioning of these vertices such that members of the same set do not have descendants in the same extant genome is {{ab,cd},{bc,de}}. The number of duplications associated with g is equal to the size of the minimum partition minus one. In this case the minimum number of duplications associated with g is one. (b) An overview of the gene loss computation process is presented. In the upper right corner is the rooted, binary gene tree for which we wish to compute the minimum number of losses, and in the upper left corner is the corresponding species tree. The gene tree leaves labeled a1, a2, and a3 are genes present in the genome of the extant species A, and the gene tree leaves labeled b1 and b2 are genes present in the genome of the extant species B. The weak edges are labeled W, and duplications are denoted as circles. Here we only show how to compute the minimum number of losses for the gene tree vertex g. After collapsing the weak edges, the children of g are equal to the set {a1,b1,a2,b2,a3}. From the duplication algorithm a minimum partitioning of these vertices into sets such that any members of the same set do not have descendants in the same extant genome is {{a1,b1},{a2,b2},{a3}}, i.e., the minimum number of duplications associated with this gene tree vertex is equal to two. In the first step in the loss algorithm, a rooted tree is constructed for each set in the partition as illustrated in (I). Note that the tree constructed from the set {a3} is a single vertex. The first two trees in (I) are then combined as illustrated in (II). The duplication event associated with the root of this tree can be moved one step closer to the leaves by swapping b1 and a3, illustrated in (III), and the tree is kept for the subsequent steps of the algorithm. However, if the duplication event cannot be moved closer to the leaves, the first and third trees in (I) are combined and the resulting tree is tested to see if the duplication there can be moved closer to the leaves. If so, this tree is kept instead, but if the duplication event cannot be moved closer to the leaves in any of the trees constructed by combining the first tree in (I) with any other tree in (I), the tree constructed first is chosen. Moreover, we continue to build trees from the remaining pairs of trees in (I), if any. The resulting tree or trees are shown in (III). Next, if there is more than one tree in (III), the trees are combined in the same way as in the previous step, and this procedure continues until only one tree remains, as shown in (IV). Now the minimum number of losses for the gene vertex g can be computed, using expression (2), and in this example the minimum number of losses is equal to zero.

Minimizing the Number of Losses

The minimum number of losses is computed after the minimum number of duplications has been computed. Moreover, the computations are only performed for the rooted gene trees that minimize the number of duplications, as gene loss is a secondary optimization criterion to gene duplication. When the minimum number of losses is computed for any inner gene tree vertex g, the subtree rooted at g with the children of g as leaves is considered. If the gene vertex is nonbinary, so is the subtree, and thus, a refinement of this nonbinary subtree is constructed before the minimum number of losses is computed. Our algorithm separates three types of loss that can occur in the planted subtree rooted at g and containing one of the two children of g. For each gene tree vertex we have to sum over these three types of loss:

$$ loss^{(S,G)}(g) = \sum_{i=1}^{2} \bigl( loss_1(g, c_i(g)) + loss_2(g, c_i(g)) + loss_3(g, c_i(g)) \bigr) \tag{4} $$

For any binary gene tree vertex the minimum number of losses is computed directly using expression (4), but for nonbinary vertices in the gene tree another approach must be taken. The proposed algorithm for computing the minimum number of losses for a nonbinary gene tree vertex and the corresponding species tree is approximate, and it is presented in detail in the Appendix. For each nonbinary gene tree vertex g, the corresponding partitioning of the child vertices into sets, given by the duplication minimization algorithm, is used to resolve the uncertainty, i.e., to create a binary tree with the children of the current vertex as leaves. This tree is constructed such that the vertices labeled as duplications are as close as possible to the leaves of the tree (as this will count redundant sequences as in-paralogues rather than out-paralogues, enabling them to be filtered from the duplication calculation). This tree is then used together with the corresponding species subtree in expression (2) to compute the minimum number of losses for the current gene vertex. In Fig. 3b, an example of how the minimum number of losses is computed for a gene tree vertex is presented.

Software

Softparsmap is available for download from http://www.ii.uib.no/~steffpar/softparsmap/. It is written in Java and requires JDK 1.4.2 or later.

Results and Discussion

The systematic application of a soft parsimony approach for analyzing gene trees in the context of species trees has been performed as part of The Adaptive Evolution Database (TAED) (Roth et al. 2005). Here, vertices with posterior probabilities of <0.7 were collapsed to nonbinary trees. Then the NCBI taxonomy (Benson et al. 2005) was used as a species tree to minimize the number of gene duplication and loss events in the gene family tree to produce a binary result.
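The collapsing step can be illustrated with a short sketch. It assumes a simple rooted tree record in which each vertex stores the support (posterior probability or bootstrap value) of the edge to its parent; the `GeneNode` class, the function name, and the toy tree are hypothetical, and only the 0.7 cutoff comes from the text. Edges leading to leaves are never collapsed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneNode:
    """Rooted gene tree vertex; `support` belongs to the edge above the vertex."""
    name: str
    support: Optional[float] = None
    children: List["GeneNode"] = field(default_factory=list)


def collapse_weak_edges(node: GeneNode, cutoff: float = 0.7) -> GeneNode:
    """Return a copy of the tree in which weakly supported inner edges are removed.

    When the edge above an inner vertex has support below `cutoff`, the
    vertex is deleted and its children are attached to its parent, which
    may turn the parent into a nonbinary vertex.
    """
    new_children: List[GeneNode] = []
    for child in node.children:
        collapsed = collapse_weak_edges(child, cutoff)
        is_inner = bool(collapsed.children)
        if is_inner and collapsed.support is not None and collapsed.support < cutoff:
            new_children.extend(collapsed.children)   # splice grandchildren in place
        else:
            new_children.append(collapsed)
    return GeneNode(node.name, node.support, new_children)


# Hypothetical tree: ((pig,(sheep,mouse)Y:0.95)X:0.55,human)root
tree = GeneNode("root", None, [
    GeneNode("X", 0.55, [
        GeneNode("pig"),
        GeneNode("Y", 0.95, [GeneNode("sheep"), GeneNode("mouse")]),
    ]),
    GeneNode("human"),
])
collapsed_tree = collapse_weak_edges(tree)
```

Here the edge above `X` falls below the cutoff, so the root becomes a nonbinary vertex with children `pig`, `Y`, and `human`, while the well-supported vertex `Y` is left intact.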

A total of 1217 of 11,704 rootable gene family trees in the Chordate half of TAED were modified using this approach. This approach holds equal value for other gene family databases, where identification of orthologous genes is important. From TAED, a sample tree, where the algorithm corrects a tree in the expected manner, is shown in Figs. 4a and b. The example shown is malate dehydrogenase.

Fig. 4. (a) The unrooted tree for malate dehydrogenase from MrBayes as calculated for TAED (Roth et al. 2005) before application of the soft parsimony algorithm. The leaf IDs are GenBank protein GIs. (b) The same tree has now been rooted and corrected using the soft parsimony algorithm. The Artiodactyls now form a single clade without implying a gene duplication event, and the in-paralogues along the human and mouse lineages have been filtered out because the original data set from GenBank contained redundancies. Trees here are visualized using ATV (Zmasek and Eddy 2001a).

As calculated by MrBayes (Huelsenbeck and Ronquist 2001), there is no root of the gene family tree that generates Artiodactyls (hoofed mammals with an even number of toes) as a monophyletic group without inferring a gene duplication and multiple selective loss events. Application of the soft parsimony algorithm results in the expected rooted tree, shown in Fig. 4b.

To walk through this, in the first step, the root is placed on the lineage separating Branchiostoma (amphioxus) from teleost (bony) fish. This results in one implied gene duplication event in eutherian (placental) mammals. Next, the branch leading to Sus scrofa, which is below the posterior probability threshold of 0.70, is collapsed, leading to possible nonbinary trees linking it with the Ovis aries/rodent vertex. Then the number of duplication events associated with different resolutions is assessed, and a resolution of the Sus scrofa/Ovis aries grouping as Artiodactyls, with the root still on the branch separating amphioxus from teleosts, now implies no gene duplication events.

Of course, gene duplication and loss events do occur. Bayesian approaches, which treat such events probabilistically, result in the explanation that such events are rarer and less likely to explain a tree such as that shown in Fig. 4a than, in this case, statistical uncertainty of branching (Arvestad et al. 2003).

Many of the families in TAED that have been corrected are multigene families, where the branching order is different following a gene duplication event and gene loss events. An example of this, where an ancient gene duplication event preceded the divergence of eutherian mammals and where the optimal tree shows a different eutherian mammal topology, is shown in Figs. 5a and b, from the guanine nucleotide binding protein gene family in TAED. This is an example of a tree where hard parsimony would force a less preferred topology on one of the postduplication clades. In this case, Primates are the outgroup to Rodents, Carnivores, and Artiodactyls on one half of the tree and Rodents are the outgroup to Primates, Carnivores, and Artiodactyls on the other half. Without a posterior probability threshold, the hard parsimony approach would infer a gene duplication event. With a posterior probability threshold (in this case), the tree would be corrected to one with even lower support to prevent inference of a gene duplication event using hard parsimony.

Fig. 5. (a) The unrooted tree for guanine nucleotide binding protein from MrBayes as calculated for TAED (Roth et al. 2005) before application of the soft parsimony algorithm. (b) Following application of the soft parsimony algorithm, a tree with differential optimal branching of eutherian mammals in two independent clades after a more ancient gene duplication event is shown. While a zebrafish in-paralogue is removed, the ambiguous branching order of Primates, Rodents, Carnivores, and Artiodactyls is tolerated without imposing a solution from one clade to another or inferring any extra gene duplication events. Trees here are visualized using ATV (Zmasek and Eddy 2001a).

The algorithm presented under Methods and applied to TAED is implemented in a software package called Softparsmap, which is available for download at http://www.ii.uib.no/~steffpar/softparsmap/. The optional functionality available in the software package includes tree rooting by minimization of gene duplication and loss events, removal of in-paralogues, removal of uncertainties or correction of weakly supported splits based on the reference species tree, gene tree-to-species tree mapping to allow identification of orthologues and paralogues, and comparison of tree topologies.

A fast, flexible, powerful approach is presented that allows a gene tree-to-species tree mapping using a soft parsimony algorithm. This approach has been applied systematically to gene families based on GenBank in TAED, modifying over 10% of gene families. The software package available from this method should be a valuable addition to the evolutionary bioinformatics toolbox.