Keywords

1 Introduction

Reconstructing species phylogenies from a collection of input trees each inferred from a different part of the genome is becoming the standard practice in phylogenomics (e.g., [1,2,3,4,5]). This two-step approach stands in contrast to concatenation [6], where all of the sequences are combined into a supermatrix and analyzed in one maximum likelihood analysis. The two-step approach promises to effectively account for discordances between gene trees and the species tree [7] (but see recent literature for ongoing debates [8,9,10,11]) and is more efficient than statistical co-estimation of gene trees and the species tree [12] or site-based estimation of the species tree [13]. Among several causes of gene tree discordance [14], incomplete lineage sorting (ILS) is believed to be ubiquitous [15] and is extensively studied. ILS is typically modeled by the multi-species coalescent model (MSCM) [16, 17], where branches of the species tree represent populations, and lineages are allowed to coalesce inside each branch; lineages that fail to coalesce at the root of each branch are moved to the parent branch.

Several methods are proposed to infer a species tree from a collection of input trees (even though these trees need not be inferred from functional genes, following the conventions of the field, we will call them “gene trees”). Examples of summary methods include MP-EST [18], NJst [19], DISTIQUE [20], and STAR [21], which only use the topology of the input gene trees, and GLASS [22] and STEAC [21], which also uses the input branch lengths. While most methods need rooted gene trees as input, NJst and DISTIQUE can take unrooted input. These methods are all proved statistically consistent under the MSCM when the input gene trees are error-free, but no summary method is proved consistent when input trees are inferred from sequence data [23].

One of the statistically consistent methods under the MSCM is ASTRAL [24], which takes as input a collection of unrooted gene tree topologies and produces an unrooted species tree. ASTRAL uses dynamic programming to find the tree that shares the maximum number of induced quartet topologies with the collection of input gene trees. Since this problem is NP-Hard [25], ASTRAL solves a constrained version of the problem exactly, where the search space is limited to a predefined set of bipartitions X. In ASTRAL-I, the set X is the collection of all bipartitions in input gene trees. Showing that this space is not always large enough, ASTRAL-II [26] uses several heuristics to further augment the search space. Using the fact that for unrooted quartet trees the species tree always matches the most likely gene tree [27], ASTRAL is proved statistically consistent, even when solving the constrained problem, and its accuracy has been established in simulations [20, 24, 26, 28, 29]. ASTRAL-II has running time \(O(nk|X|^2)\) for n species and k binary genes. Finally, ASTRAL has the ability to compute branch lengths in coalescent units [14] and a measure of branch support called local posterior probability [30]. Perhaps most importantly, ASTRAL and ASTRAL-II have been adopted by the community as one of the main methods of performing phylogenomics, and many biological analyses have adopted them.

ASTRAL-II has several shortcomings, some of which we address here by introducing ASTRAL-III. While ASTRAL-II can analyze datasets of 1,000 species and 1,000 genes on average in a day, ASTRAL-II has trouble scaling to many tens of thousands of input trees. Datasets with more than ten thousands genomic loci are already available (e.g., [3]) and with the increase in genome sequencing, more will be available in future. Moreover, being able to handle large numbers of input trees enables using multiple trees per locus (e.g., a Bayesian sample) as input to ASTRAL. The limited scalability of ASTRAL with k is because of a \(\varTheta (nk)\) factor in the running time that corresponds to scoring a potential node in the species tree against all nodes of the input gene trees. This computation does not exploit similarities between gene trees, a shortcoming that we fix in ASTRAL-III. Moreover, while ASTRAL-II can handle polytomies in input gene trees, in the presence of polytomies of maximum degree \(d_{m}\), its running time inflates to \(O(d_{m}^3k|X|^2)=O(n^3k|X|^2)\), which quickly becomes prohibitive for input trees with polytomies of large degrees. ASTRAL-III uses a mathematical trick to enable scoring of gene tree polytomies in time similar to binary nodes.

The ability to handle large polytomies in input gene trees is important for two reasons. On the one hand, some of the conditions that are conducive to ILS, namely shallow trees with many short branches, are also likely to produce gene sequence data that are identical between two species. A sensible gene tree (e.g., those produced by FastTree [31]) would leave the relationship between identical sequences unresolved (tools such as RAxML that output a random resolution take care to warn the user about such input data). On the other hand, all summary methods, including ASTRAL, are sensitive to gene tree estimation error  [26, 32,33,34,35,36]. One way of dealing with gene tree error, previously studied in the context of minimizing deep coalescence [37], is to contract low support branches in gene trees and use these unresolved trees as input to the summary method. While earlier studies found no evidence that this approach helps ASTRAL when the support is judged by SH-like FastTree support [26], no study has tested this approach with bootstrap support values. We will for the first time evaluate the effectiveness of contracting low support branches and show that conservative filtering of very low support branches does, in fact, help the accuracy. We note that the main competitors to ASTRAL, namely NJst [19] and its fast implementation, ASTRID [38], are not able to handle polytomies in input gene trees. ASTRAL-III makes it efficient to use unresolved gene trees as input to the species tree. Empirically, we observe that ASTRAL-III improves the running time compared to ASTRAL-II by a factor of 3X-4X for binary trees with large numbers of genes. Moreover, ASTRAL-III finishes on a dataset of 5,000 species and 500 genes in 18–30 h (24 on average). The ASTRAL-III software is publicly available at https://github.com/smirarab/ASTRAL.

2 Background and Notation

2.1 Notations and Definitions

We denote the set of species by L and let \(n=|L|\). Let G be the set of k input gene trees. The set of quartet trees induced by any tree t is denoted by Q(t). We refer to any subset of L as a cluster and refer to clusters with cardinality one as singletons. We define a partition as a set of clusters that are pairwise mutually exclusive (note that we abuse the term here, as the union of all clusters in a partition need not give the complete set). A bipartition (tripartition) is a partition with cardinality two (three); a partition with cardinality at least four corresponds to a polytomy and is referred to as a polytomy in this paper. Let X (the constraint bipartition set) be a set of clusters such that for each \(A\in X\), we also have \(L-A\in X\). We use Y to represent the set of all tripartitions examined in the ASTRAL dynamic programming:

$$\begin{aligned} Y=\{(A',A-A',L-A)| A' \subset A, A \in X, A' \in X, A-A' \in X\}. \end{aligned}$$

We use N(g) to represent the set of partitions correspondent to internal nodes in the gene tree g. We use E to denote the set of unique partitions and the number of times they appear in G:

$$\begin{aligned} E=\{(M, \sum _{g\in G}|N(g) \cap \{M\}|)| M\in N(g), g\in G\} \end{aligned}$$
(1)

and we define D as the sum of the cardinalities of unique partitions in gene trees:

$$\begin{aligned} D=\sum _{(M,c)\in E} |M|. \end{aligned}$$
(2)

Finally, we use [d] to represent the set \(\{1,2 \ldots ,d\}\).

2.2 Background on ASTRAL-I and ASTRAL-II

The problem addressed by ASTRAL is:  

Given: :

a set G of input gene trees

Find: :

find the species tree t that maximizes \(\sum _{g \in G} |Q(g)\cap Q(t)|\).

  Lanford and Scornavacca recently proved this problem is NP-hard [25]. ASTRAL solves a constrained version of this problem where a set of clusters X restricts bipartitions that the output species tree may include (note \(\forall A \in X: L-A \in X\)).

To solve the constrained version, ASTRAL uses a dynamic programming method with the following recursive relation to obtain the optimal tree.

$$\begin{aligned} V(A) = \max _{(A'|A-A'|L-A)\in Y} V(A') + V(A-A') + w(A'|A-A'|L-A) \end{aligned}$$

where the function w(T) scores each tripartition \(T=(A|B|C)\) against each node in each input gene tree. Let partition \(M=(M_1|M_2|\ldots |M_d)\) represent an internal node of degree d in a gene tree. The overall contribution of T to the score of any species tree that includes T is:

$$\begin{aligned} w(T)=\sum _{g\in G}\sum _{M\in N(g)}\frac{1}{2}QI(T,M) \end{aligned}$$
(3)

where, defining \(a_i=|A\cap M_i|\), \(b_i=|B\cap M_i|\), and \(c_i=|C\cap M_i|\), we have:

$$\begin{aligned} QI((A|B|C), M)=\sum _{i\in [d]}\sum _{j\in [d]-\{i\}}\sum _{k\in [d]-\{i,j\}} \frac{a_i+b_j+c_k -3}{2} a_i b_j c_k. \end{aligned}$$
(4)

As previously proved [24], QI(TM) computes twice the number of quartet trees that are going to be shared between any two trees if one includes only T and the other includes only M. ASTRAL-II requires \(\varTheta (d^3)\) time for computing QI(.), making the overall running time \(O(n^3k|Y|)\) with polytomies of unbounded degrees or O(nk|Y|) in the absence of polytomies.

3 ASTRAL-III Algorithmic Improvements

Noting trivially that \(|Y|<|X|^2\), the previously published running time analysis of ASTRAL-II was \(O(nk|X|^{2})\) for binary gene trees and \(O(n^3k|X|^{2})\) for input trees with polytomies. A recent result by Kane and Tao [39] (motivated by the question raised in analyzing the ASTRAL algorithm) indicates that \(|Y|\le |X|^{3/log_3(27/4)}\). This result immediately gives a better upper bound:

Corollary 1

ASTRAL-II runs in \(O(nk|X|^{{1.726}})\) and \(O(n^3k|X|^{{1.726}})\), respectively, with and without polytomies in gene trees.

ASTRAL-III further improves this running time using three new features:

  1. 1.

    A new way of handling polytomies is introduced to reduce the running time for scoring a gene tree to O(n), instead of \(O(n^3)\), in the presence of polytomies, which reduces the total running time to \(O(nk|X|^{{1.726}})\) irrespective of the gene tree resolution.

  2. 2.

    A polytree is used to represent gene trees, and this enables an algorithm that reduce the overall running time from \(O(nk|X|^{{1.726}})\) to \(O(D|X|^{{1.726}})\).

  3. 3.

    An A*-like algorithm is used to trim parts of the dynamic programming DAG.

In addition to these running time improvements, ASTRAL-III changes parameters of heuristics described in ASTRAL-II [26] to expand the size of set X for gene trees that include polytomies. We next describe each improvement in detail.

3.1 Efficient Handling of Polytomies

Recall that ASTRAL-II uses Eq. 4 to score a tripartition against a polytomy of size d in \(\varTheta (d^3)\) time. We now show this can be improved.

Lemma 1

Let QI(TM) be twice the number of quartet tree topologies shared between an unrooted tree that only includes a node corresponding to the tripartition \(T=(A|B|C)\) and another tree that includes only a node corresponding to a partition \(M=(M_1|M_2|\ldots |M_d)\) of degree d; then, QI(TM) can be computed in time \(\varTheta (d)\).

Proof

In \(\varTheta (d)\) time, we can compute:

$$\begin{aligned} S_a =\sum _{i\in [d]}a_i~~~and ~~~ S_{a,b} =\sum _{i\in [d]}a_i b_i \end{aligned}$$
(5)

where \(a_i=|A\cap M_i|\) and \(b_i=|B\cap M_i|\); ditto for \(S_b\), \(S_c\), \(S_{a,c}\) and \(S_{b,c}\). Equation 4, as proved before [26], computes twice the number of quartet tree topologies shared between an unrooted tree with internal node T and another tree with one internal node M. Equation 4 can be rewritten using these intermediate sums as:

$$\begin{aligned} \begin{aligned} QI((A|B|C), M)=&\sum _{i\in [d]}\left( {\begin{array}{c}a_i\\ 2\end{array}}\right) ((S_b - b_i) (S_c - c_i) - S_{b,c} + b_i c_i)\\ +&\sum _{i\in [d]}\left( {\begin{array}{c}b_i\\ 2\end{array}}\right) ((S_a - a_i) (S_c - c_i) - S_{a,c} + a_i c_i)\\ +&\sum _{i\in [d]}\left( {\begin{array}{c}c_i\\ 2\end{array}}\right) ((S_a - a_i) (S_b - b_i) - S_{a,b} + a_i b_i)\\ \end{aligned} \end{aligned}$$
(6)

(the derivation is given in the Appendix A). Computing Eq. 6 instead of Eq. 4 clearly reduces the running time to \(\varTheta (d)\) instead of \(\varTheta (d^3)\).    \(\square \)

ASTRAL needs to score each of the |Y| tripartitions considered in the dynamic programming against each internal node of each input gene tree. The sum of degrees of k trees on n leaves is O(nk) (since that sum can never exceed the number of bipartitions in gene trees) and thus:

Corollary 2

Scoring a tripartition (i.e., computing w) can be done in O(nk).

3.2 Gene Trees as a Polytree

ASTRAL-II scores each dynamic programming tripartition against each individual node of each gene tree. However, nodes that are repeated in several genes need not be recomputed. Recalling the definitions of E and D (Eqs. 1 and 2),

Lemma 2

The score of a tripartition \(T=(A|B|C)\) against all gene trees (i.e., the w(T) score) can be computed in \(\varTheta (D)\).

Proof

In ASTRAL-III, we keep track of nodes that appear in multiple trees. This enables us to reduce the total calculation by using multiplicities:

$$\begin{aligned} w(T)=\sum _{(M,c)\in E}c\times QI(T,M). \end{aligned}$$
(7)

We achieve this in two steps. In the first step, for each distinct gene tree cluster W, we compute the cardinality of the intersection of W and sets A, B, and C once using a depth first search with memoization. Let children(W) denote the set of children of W in an arbitrarily chosen tree \(g\in G\) containing W. Then, we have the following recursive relation:

$$\begin{aligned} |W \cap A| = \sum _{Z\in children(W)}|Z \cap A| \end{aligned}$$
(8)

(ditto for \(|W \cap B|\) and \(|W \cap C|\)). All such intersection values can be computed in a post-order traversal of a polytree. In this polytree, all unique clusters in the gene trees are represented as vertices and parent-child relations are represented as edges; note that when a cluster has different children in two different input trees, we arbitrary choose one set of children and ignore the others. The polytree will include no more than \(D\) edges; thus, the time complexity of traversing this polytree and computing Eq. 8 for all nodes is \(\varTheta (D)\). Once all intersections are computed, in the second step, we simply compute the sum in Eq. 7. Each QI(.) computation requires \(\varTheta (d)\) by Lemma 1. Recalling that \(D=\sum _{(M,c)\in E} |M|\), it is clear that computing Eq. 7 requires \(\varTheta (D)\). Therefore, both steps can be performed in \(\varTheta (D)\).   \(\square \)

Theorem 1

The ASTRAL-III running time is \(O(D|X|^{{1.726}})\) for both binary and unresolved gene trees.

Proof

By results of Kane and Tao [39], the size of the set Y is \(O(|X|^{{1.726}})\), and for each element in Y, by Lemma 2, we require \(O(D)\) to compute the weights, regardless of the presence or absence of polytomies. The running time of ASTRAL is dominated by computing the weights [26]. Thus, the overall running time is \(O(D|Y|)=O(D|X|^{{1.726}})\).

3.3 Trimming of the Dynamic Programming

Our last feature does not improve theoretical running time but can result in some improvements in the experimental running time. Our main insight is that \(U(A)=\frac{w(A|A|L)}{2}-\frac{w(A|A|A)}{3}\) is an upperbound of V(A) (see the Appendix A for proof). Since \(V(A) \le U(A)\), for any \((A'|A-A'|L-A')\in Y\) and \((A''|A-A''|L-A'')\in Y\), we no longer need to recursively compute \(V(A'')\) and \(V(A-A'')\) when:

$$\begin{aligned} \begin{aligned} U(A'') + U(A-A'') + w(A''|A-A''|L-A)&\le \\ V(A') + V(A-A') + w(A'|A-A'|L-A)&\end{aligned} \end{aligned}$$
(9)

Thus, in ASTRAL-III we trim the DAG of the memoized recursive dynamic programming when this calculation indicates that a path has no chance of improving the final score. To heuristically improve the efficiency of this approach, we order all \((A'|A-A'|L-A)\in Y\) according to \(U(A') + U(A-A') + w(A'|A-A'|L-A)\).

4 Experimental Setup

Using simulation studies and on real data, we study two research questions:

  • RQ1: Can contracting low support branches improve the accuracy of ASTRAL?

  • RQ2: How does ASTRAL-III running time compare to ASTRAL-II for large polytomies and many gene trees?

Note that addressing RQ1 in a scalable fashion is made possible only through the running time improvements of ASTRAL-III.

4.1 Datasets

Avian Biological Dataset: Neoaves have gone through a rapid radiation, and therefore, have extremely high levels of ILS [3]. A dataset of 48 whole-genomes was used to resolve this rapid radiation [3]. MP-EST run on 14,446 gene trees (exons, introns, and UCEs) produced a tree that conflicted with strong evidence from the literature and other analyses on the dataset (e.g., the Passerimorphae/Falcons/Seriemas grade was not recovered). This motivated the development of the statistical binning method to reduce the impacts of gene tree error [32, 33]. MP-EST run on binned gene trees produced results that were largely congruent with the concatenation and other analyses. Here, we test if simply contracting low support branches of gene trees produces a tree that is congruent with other analyses and the literature. This analysis is made possible because ASTRAL-III can handle datasets with a large number of polytomies and large k efficiently.

Simulated Avian-Like Dataset: We use a simulated dataset that was previously used in the statistical binning paper [32] to emulate the biological avian dataset. Since estimating the true branch lengths in coalescent units are hard, three versions of this dataset are available: 1X is the default version, whereas 0.5X divides each branch length in half (increasing ILS) and 2X multiplies them by 2 (reducing ILS). The amount of true discordance (ILS), measured by the average RF distances between true species tree and true gene trees, is moderate at 0.35 for 2X, high at 0.47 for 1X, and very high at 0.59 for 0.5X. Moreover, to study the impact of gene tree estimation error, sequence lengths were varied to create four conditions: very high error with 250 bp alignments (0.67 RF distance between true gene trees and estimated gene trees), high error with 500 bp (0.54 RF), medium error with 1000 bp (0.39 RF) and moderate error with 1500 bp (0.30 RF). We use 1000 gene trees, and 20 replicates per condition. Gene trees are estimated using RAxML [40] with 200 replicates of bootstrapping.

Fig. 1.
figure 1

Properties of the S100 dataset. (a) The density plot of the amount of true gene discordance measured by the FN rate between the true species tree and the true gene trees. (b) The density plot of gene tree estimation error measured by FN rate between true gene trees and estimated gene trees for different set of sequence lengths.

SimPhy-Homogeneous (S100): We simulated 50 replicates of a 101-taxon dataset using SimPhy [41] under the MSCM, where each replicate has a different species tree. In order to generate the species trees, we used the birth-only process with birth rate \(10^{-7}\), fixed haploid effective population size of 400K, and the number of generations sampled from a log-normal distribution with mean 2.5 M. For each replicate, 1000 true gene trees are simulated under the MSCM (the exact simulation commands are given in Appendix B and parameters are shown in Table 2). The amount of ILS, measured by the false-negative (FN) rate between true species trees and true gene trees, mostly ranged between 0.3 and 0.6 with an average of 0.46 (Fig. 1). We use Indelible [42] to simulate the nucleotide sequences along the gene trees using the GTR evolutionary model [43] with 4 different fixed sequence lengths: 1600, 800, 400, and 200 bp (Table 2). We then use FastTree2 [31] to estimate both ML and 100 bootstrapped gene trees under the GTR model for each gene of each replicate (\(>2000200\) runs in total). Gene tree estimation error, measured by the FN rate between the true gene trees and the estimated gene trees, depended on the sequence length as shown in Fig. 1 (0.55, 0.42, 0.31, and 0.23 on average for 200 bp, 400 bp, 800 bp, and 1600 bp, respectively). We sample 1000, 500, 200, or 50 genes to generate datasets with varying numbers of gene trees.

4.2 Methods and Evaluation

We compare ASTRAL-III (version 5.2.5) to ASTRAL-II (version 4.11.1) in terms of running time. To address RQ1, we draw the bootstrap support values on the ML gene trees using the newick utility package [44]. We then contract branches with bootstrap support up to a threshold (0, 3, 5, 7, 10, 20, 33, 50, and 75%,) using the newick utility and use these contracted gene trees as input to ASTRAL. Together with the original set, this creates 10 different ways to run ASTRAL.

To measure the accuracy of estimated species trees, we use False Negative (FN) rate. Note that in all our species tree comparisons, FN rate is equivalent to normalized Robinson-Foulds (RF) [45] metric, since the ASTRAL species trees are fully resolved. All running times are measured on a cluster with servers with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz; each run was assigned to a single process, sharing cache and memory with other jobs.

5 Results

5.1 Impact of Contracting Low Support Branches on Accuracy

We investigate the impact of contracting branches with low support (RQ1) on our two simulated datasets (avian and S100) and on the real avian dataset.

Table 1. Species tree error (FN ratio) for all model conditions of the S100 dataset, with true gene trees (true), no filtering (non), and all filtering thresholds (columns).
Fig. 2.
figure 2

The error in species trees estimated by ASTRAL-III on the S100 dataset given \(k=\) 50, 200, 500, or 1000 genes (boxes) and with full FastTree gene trees (non) or trees with branches with \(\le \{0,3,5,7,10,20,33,50\}\)% support contracted (x-axis). Average FN error and standard error bars are shown for all 50 replicates with the four alignment lengths combined (black solid line); average FN error broken down by alignment length (a) or gene tree error (b) is also shown (dashed colored lines). In (b) we divide the replicates based on their average gene tree error (normalized RF) into four categories: \([0,\frac{1}{4}]\),\((\frac{1}{4},\frac{1}{3}]\),\((\frac{1}{3},\frac{1}{2}]\),\((\frac{1}{2},1]\). (Color figure online)

S100: On this dataset, contracting very low support branches in most cases improves the accuracy (Table 1 and Fig. 2); however, the excessive removal of branches with high, moderate or occasionally even low support degrades the accuracy. The threshold where contracting starts to become detrimental depends on the condition, especially the number of gene trees and the alignment length.

As the number of genes increases, the optimal threshold for contracting also tends to increase. Combining all model conditions, the error continues to drop until a 10% contracting threshold with 1000 genes, whereas no substantial improvement is observed after contracting branches with 0% support for 50 genes. The alignment length and gene tree error also impact the effect of contraction. For short alignments (200 bp) and 1000 genes, contracting branches with up to 10% support reduces the species tree error by 21% (from 8.9% with no contraction to 7.0%). As alignment length grows (and gene tree error decreases), benefits of gene tree contraction diminish, so that with 1600 bp genes, the reduction in error is merely from 4.1% to 3.7%. While aggressive filtering at 33% or higher sometimes increases the error compared to no filtering, filtering at 10% is either neutral or beneficial on average for all conditions in this dataset.

Fig. 3.
figure 3

Species trees error for ASTRAL-III on the avian dataset given \(k=1000\) genes with (left) fixed sequence lengths \(=500\) and varying levels of ILS, or (right) fixed ILS (1X) and varying sequence length, in each case both with full FastTree gene trees (non) or trees with branches with \(\le \{0,3,5,7,10,20,33,50\}\)% support contracted (x-axis). Average FN error and standard error bars are shown for all 20 replicates (black solid line) and also for each model condition separately (dashed color lines). (Color figure online)

Avian-Like Simulations: On the avian dataset, overall, contracting low support branches helps accuracy marginally, but the extent of improvements depends on the model condition (Fig. 3). We first fix the sequence length to 500 bp and allow the amount of ILS to change (e.g., from moderate with 2X to very high with 0.5X). With moderate ILS (2X), we see no improvements as a result of contracting low support branches, perhaps because the average error is below 5% even with no contraction. Going to high and very high ILS, we start to see improvements with contracted gene trees. For example, removing branches of up to 5% support reduces the error from 13% to 11% with 0.5X, and from 8% to 7% for the 1X condition. Just like the S100 dataset, aggressive filtering reduces the accuracy, but here, thresholds of 20% and higher seem to be detrimental. When ILS is fixed to 1X and sequence length is varied (Fig. 3), contracting is helpful mostly with short sequences (e.g., 250 bp). With longer sequences, where gene tree estimation error is low, little or no improvement in accuracy is obtained. The best accuracy is typically observed by contracting at 0–5%. Overall, the gains in accuracy comparing no contraction to contraction at 0% are statistically significant with p-value 0.042 (are close to significant with p-values 0.087 and 0.058 for 3% and 5% thresholds) according to a two-way ANOVA test with the sequence length and ILS levels as extra variables. Irrespective of significance, the improvements are not large on this dataset.

Fig. 4.
figure 4

Avian biological results on 14,446 unbinned gene trees. Species trees are shown for the TENT RAxML concatenation tree [3] (left), ASTRAL-III tree with no contraction (middle), and with 33% contraction (right). ASTRAL-III branches conflicting with the TENT tree are in color; red indicates disagreement with strong evidence [3]. (Color figure online)

Avian Biological Dataset: On the avian dataset with 14 K genes, ASTRAL-III managed to complete with 0%, 33%, 50%, and 75% thresholds in 48 h. Results on the runs that did finish are very interesting. The ASTRAL-III tree with no contraction had 11 and 9 branch differences respectively with the TENT and the binned MP-EST analyses from the original paper [3]. Some of these differences were on strong results (Fig. 4) from the avian dataset (e.g., the Columbea group). After contracting branches below 33% BS, the ASTRAL-III tree had only 4 and 3 branch differences, respectively with TENT and the binned MP-EST trees; these differences were among the branches that were deemed unresolved by Jarvis et al. and changed among their analyses as well [3]. ASTRAL-III obtained on collapsed gene trees agreed with all major new findings of Jarvis et al. [3]. For example, at 33% filtering, the novel Columbea group was corroborated, whereas the unresolved tree completely missed this important clade (Fig. 4).

Fig. 5.
figure 5

Average running time of ASTRAL-II versus ASTRAL-III on the avian dataset with 500  bp or 1500 bp alignments with varying numbers of gens (k), shown both in log-log scale (top) and normal scale (bottom). A line (top) or a LOESS curve (bottom) is fit to the data points. ASTRAL-II could not finish on \(2^{14}\) genes in the allotted 48 h time. Line slopes are shown for the log-log plot. Averages are over 4 runs.

5.2 Running Time Improvements

The Impact of the Number of Genes ( k ): We evaluate the improvement of ASTRAL-III compared to ASTRAL-II on the avian simulated dataset, changing the number of genes from \(2^8\) to \(2^{16}\). We allow each replicate run to take up to two days. ASTRAL-II is not able to finish on the dataset with \(k=2^{16}\), while ASTRAL-III finishes on all conditions. ASTRAL-III improves the running time over ASTRAL-II and the extent of the improvement depends on k (Fig. 5). With 1000 genes or more, there is at least a 2.7X improvement. With \(2^{13}\) genes, the largest value where both versions could run, ASTRAL-III finishes on average four times faster than ASTRAL-II (190 versus 756 min). Moreover, fitting a line to the average running time in the log-log scale graph reveals that on this dataset, the running time of ASTRAL-III on average grows as \(O(k^{2.04})\), which is better than that of ASTRAL-II at \(O(k^{2.27})\), and both are better than the theoretical worst case, which is \(O(k^{2.726})\).

Fig. 6.
figure 6

Average and standard error of running times of ASTRAL-II and ASTRAL-III on the S100 dataset for scoring a single weight (Eq. 3). Running time is in log scale for varying numbers of gene trees (boxes) and sequence length (200, 400, 800, and 1600).

The Impact of Polytomies: ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the running time when gene trees include polytomies (Fig. 6). Since ASTRAL-II and ASTRAL-III can potentially weight different numbers of tripartitions, we show the running time per weight calculation (i.e., Eq. 3). As we contract low support branches and hence increase the prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II, whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.

6 Discussion

Comparison to True Gene Trees: Although we observed improvements in the tree accuracy with contracting low support branches, the gap between performance on true gene trees and estimated gene trees remains wide (Table 1). On the S100 dataset, respectively for 50, 200, 500, and 1000 genes, the average error with 1600 bp gene trees were 10.1%, 6.1%, 4.4%, and 3.9% compared to 7.0%, 3.7%, 2.4%, and 1.5% with true gene trees. Thus, while contracting low support branches helps in addressing gene tree error, it is not a panacea. Improved methods of gene tree estimation remain crucial as ever before. Our results also indicate that in the presence of noisy gene trees, increased numbers of genes are needed to achieve high accuracy. For example, on the S100 dataset, with 1000 gene trees of only 200 bp and contracting with a 10% threshold, the species tree error was 7.1%, which matched the accuracy with only 50 true gene trees. The increased data requirement for noisy genes encourages the use of many thousands of gene trees, making scalability gains of ASTRAL-III more relevant.

Arbitrary Resolutions and 0% Filtering: Interestingly, in most datasets, the most substantial improvements were observed when only 0% BS branches were removed, and one can assume that such branches are essentially resolved randomly. As the use of Ultra Conserved Elements [9] continues to gain in popularity, instances where two or more taxa have identical sequences in some genes will become more prevalent. Many tree estimation methods generate binary trees even under such conditions. Removing branches that are arbitrarily resolved make sense and, as our results indicate, will improve accuracy.

Statistical Consistency: While removing branches with low support can improve the accuracy empirically, its theoretical justification is less clear. In principle, branches that have low support are not necessarily expected to have a random distribution among gene trees. Thus, while the empirical results could support the use of (conservative) filtering, it must be understood that the resulting procedure may introduce small biases. Whether ASTRAL remains statistically consistent given contracted gene trees should be studied in future work.

Other Strategies: Beyond removing branches with low bootstrap support, several other strategies could be employed. Our previous results [24] indicate that simply using a concatenation of all bootstrap gene trees as input to ASTRAL increases error, perhaps because of the increased noise in bootstrap replicates [30, 34]. However, it is possible that a fixed sized sample from a Bayesian estimate of each gene tree distribution would improve accuracy.

Fig. 7.
figure 7

The density plots of \(\log _{X} |Y|\) across all ASTRAL-III runs reported in this paper. Size of the dynamic programming space Y is never above \(|X|^{1.312}\) here.

\(|{{\varvec{Y}}}|\) : The ASTRAL-III running time analysis of \(O(|X|^{1.726})\) is based on the fact that \(|Y|\le |X|^{1.726}\) [39]. However, this upper bound is for specialized formations of the set X. Empirically, as the size of set X increases, the size of |Y| in ASTRAL-III does not increase as fast as the worst-case scenario implies. Across all of our ASTRAL-III runs in this paper, |Y| ranged mostly between \(|X|^{1.05}\) and \(|X|^{1.25}\), with an average of \(|X|^{1.11}\) (Fig. 7). Thus, the average running time of ASTRAL seems closer to \(O(D|X|^{1.1})\), though, the exact value depends on the dataset. The size of X is not currently controlled to be a polynomial function of n and k, but such constraints can be imposed in future versions of ASTRAL.

Large \({\varvec{n}}\) : To assess scalability limits of ASTRAL-III, we tested it on 4 replicates of a dataset with 5,000 species and 500 true gene trees (simulation procedure described in Appendix B). ASTRAL-III took between 18 and 30 h to run on this dataset (24 h on average). We also ran ASTRAL-III on similar datasets with 1,000 and 2,000 species. Average over our four replicates, ASTRAL running time increases as \(O(n^{1.9})\). Our attempts to analyze 10K species within 72 h failed. Future work should reproduce these results with more replicates.