1 Introduction

The estimation of a species phylogeny from multiple loci is confounded by biological processes such as horizontal gene transfer and incomplete lineage sorting that cause individual gene tree topologies to differ from that of the overall species tree [19]. Incomplete lineage sorting (ILS), which is modeled by the well-studied multi-species coalescent (MSC) model, is considered to be perhaps the major cause for this discordance [5]. Many methods have been developed to estimate the species tree in the presence of ILS in a statistically consistent manner, which means that as the amount of data increases, the species tree topology estimated by the method converges in probability to the true species tree topology. Examples of methods for species tree estimation that are statistically consistent under the MSC include ASTRAL [20, 21], ASTRID [31], *BEAST [7], BEST [15], the population tree in BUCKy [12], GLASS [22], MP-EST [17], METAL [3], NJst [16], SMRT [4], SNAPP [1], STEAC [18], STAR [18], STEM [11], and SVDquartets [2]. Some of these methods (e.g., ASTRAL, ASTRID, BUCKy-pop, and NJst) estimate just the species tree topology but not the branch lengths in coalescent units, while others (e.g., BEST, *BEAST, and MP-EST) also estimate the branch lengths. In this paper, we will refer to all methods that have been proven to be statistically consistent under the MSC as “coalescent-based species tree estimation methods”.

One of the key assumptions in the proofs of statistical consistency for standard methods is that every gene is present in every species. This assumption is unrealistic for many empirical datasets (e.g., the plant transcriptome dataset studied in [32]), which can have substantial missing data. The impact of missing data on species tree estimation has mostly been investigated from an empirical rather than theoretical standpoint. Early studies focused on the impact of missing data on the estimation of individual gene trees [6, 24, 38], while later studies examined the impact on multi-locus species tree estimation but without any gene tree discord [14, 33,34,35]. Four recent studies have examined the impact of missing data on species tree estimation using multiple loci, when gene trees can differ from the species tree due to ILS [8, 29, 31, 36]. These studies have largely focused on whether it is better to include taxa and/or genes that have substantial amounts of missing data (e.g., taxa that are absent for 50% or more of the genes), and the relative performance of different coalescent-based species tree estimation methods in the presence of missing data. In general these studies have shown that although deleting whole genes from the overall data reduces accuracy compared to having no missing data, including gene data (even if they are highly incomplete) may be on the whole beneficial to species tree estimation efforts, at least for many coalescent-based species tree estimation methods.

Yet, the question of whether coalescent-based methods are statistically consistent under the MSC in the presence of missing data has not been addressed. This paper examines whether standard coalescent-based species tree estimation methods remain statistically consistent in the presence of missing data. We explore this question under a simple i.i.d. model of missing data (where every species is missing from every gene with the same probability \(p>0\), and denoted \(M_{iid}\)), and also under a more general model of missing data where, for some constant k, each subset of k species has non-zero probability of being present in a randomly selected gene. We refer to this as the “full subset coverage” model (denoted \(M_{fsc}\)). The \(M_{fsc}\) model includes the simpler i.i.d. model as a special case, but also includes the models of taxon deletion considered in [8, 36].

In this study, we address the question of whether coalescent-based species tree estimation methods are statistically consistent under the \(M_{iid}\) or \(M_{fsc}\) models of taxon deletion. We focus on coalescent-based species tree methods that operate by computing summary statistics for subsets of the taxon set and using those summary statistics to estimate the species tree. We show that whenever these calculated summary statistics are not impacted by deleting species outside the subset of interest, then the coalescent-based species tree method will be statistically consistent under the \(M_{fsc}\) model of taxon deletion. We also discuss taxon-deletion models under which species tree estimation methods cannot be statistically consistent, and we finish by discussing the impact of missing data on species tree estimation in practice.

2 Background

2.1 Problem Statement and Notation

The multi-species coalescent is a population genetics model that describes the evolution of individual genes within a population-level species phylogeny [10]. Specifically, a species phylogeny \(\mathcal {T}=(T,\varTheta )\) with topology T and branch lengths \(\varTheta \) is given (but unknown) on a set of n-taxa, \(\mathcal {X}=\{x_i\}_{i=1}^n\), where the branch lengths are denominated in “coalescent units.” This species tree then parameterizes a probability density function for a random variable \(G(\mathcal {T})\) defined over all possible phylogenies of \(\mathcal {X}\). For a gene tree \(g\sim G(\mathcal {T})\), an additional assumption can be made regarding a sequence evolution model which may generate a set of sequences \(s_g=(s_{g1},\dots ,s_{gn})\) for each taxon in \(\mathcal {X}\). Let the leaf set of gene tree g be denoted as \(\mathcal {L}(g)\). Given a collection of genes \(g_1,\dots ,g_m \sim G(\mathcal {T})\), the coalescent-based species tree estimation problem is the challenge of estimating the topology T from the input data \(I_m\), which may include the gene trees, the accompanying sequences or both.

Thus coalescent-based species tree estimation methods can work with a variety of different types of inputs. Usually such methods assume that the estimation of gene trees given sequence data can be done with statistical consistency, which is true in the case of the most common models [27]. In this paper, we will consider the input data I to include, broadly, the gene trees themselves (one per gene), with or without branch lengths, or the multiple sequence alignments (one per gene), or both depending on the method. In either case, it is natural to consider the input data I as being potentially restricted to a subset \(\mathcal {X}'\) of the taxa by considering, respectively, the sub-trees of each gene tree restricted to the leaves with taxa in \(\mathcal {X}'\), or the multiple sequence alignment of only the sequences corresponding to taxa in \(\mathcal {X}'\). We will refer occasionally to this restricted data as \(I|_{\mathcal {X}'}\). In contexts where the number of genes may vary and is indexed by m, the input data I on m genes is correspondingly indexed as \(I_m\).

Tuple-Based Methods. We will establish properties about statistical consistency in the presence of missing data for a class of coalescent-based species tree estimation methods that we collectively refer to as “tuple-based methods”. As we will show, nearly all coalescent-based species tree estimation methods that have been proven to be statistically consistent under the MSC are tuple-based, so this restriction covers most of the methods in use.

A coalescent-based method is a “tuple-based” method if there is some \(\ell \in \mathbb {Z}_{\ge 2}\) such that the method operates by computing a set of summary statistics from the input I for every subset of \(\ell \) species, and then uses these summary statistics to compute the species tree. Furthermore a tuple-based method is called an \(\ell \)-tuple-based method (or more simply an \(\ell \)-tuple method) to reflect the specific value of \(\ell \) on which it bases its summary statistics. We write each tuple-based method as a pair \((F,\alpha )\), with F the function that computes the set of summary statistics from I, and \(\alpha \) the function that computes a species tree given F(I). Also, the set of summary statistics computed by an \(\ell \)-tuple method includes one statistic for every tree topology (possibly rooted) on every subset of \(\ell \) species.

Since a “tree” on two species is just a path, the 2-tuple methods compute pairwise distances for every pair of species. Examples of 2-tuple methods include NJst and ASTRID, which operate by computing the “average internode distance” between every pair of species. Other 2-tuple based methods include GLASS [22] and its variants (e.g., [9]), METAL [3], STAR [18], and STEAC [18], which also compute a pairwise distance between every pair of species, but use a different technique to do the calculation. 2-tuple methods then compute a tree on the matrix of pairwise distances, using methods such as Neighbor Joining (NJ) [26] or FastME [13]; thus, NJ and FastME serve as the function \(\alpha \) in the 2-tuple method.

MP-EST and SMRT are 3-tuple methods. MP-EST requires rooted gene trees (and so depends on the strict molecular clock), and uses the frequency of each rooted 3-leaf tree t induced in the input set of gene trees as the summary statistic for t. It then seeks the model species tree (topology and branch lengths) that is most likely to produce the observed distribution of rooted 3-leaf gene tree frequencies. SMRT is a site-based method that estimates rooted three-leaf subtrees from the concatenated gene sequence alignments, and so depends on the strict molecular clock. SMRT then combines the rooted three-leaf subtrees into a tree on the full set of taxa using the Modified Min Cut algorithm [23].

In contrast to 3-tuple methods (e.g., MP-EST and SMRT), 4-tuple methods operate on unrooted gene trees, and so do not depend on the strict molecular clock. 4-tuple methods begin by computing either the most likely tree on every four leaves, or by computing some real-valued statistic for each unrooted tree on every four leaves. An example of a 4-tuple method is ASTRAL [20, 21], which uses the frequency of quartet tree t induced in the input gene trees as the real-valued support for t. Other 4-tuple methods include the population tree in BUCKy [12] (called BUCKy-pop in [37]) and the implementation of SVDquartets [2] within PAUP* [30]; in these two cases, the real-valued support for a quartet tree is either 1 or 0 (i.e., the best quartet tree on every four leaves is determined). All these 4-tuple methods then compute a species tree by applying some quartet amalgamation method to the set of quartet trees, weighted by their support values. For these 4-tuple methods, \(\alpha \) is the quartet amalgamation technique used to construct the tree T from the set of weighted quartet trees.

The number of summary statistics that each type of method computes depends on the value for \(\ell \) and the number n of species: 2-tuple methods compute \(n \atopwithdelims ()2\) summary statistics (one for each pair of species), 3-tuple methods compute \(3 {n \atopwithdelims ()3}\) summary statistics (one for each rooted three-leaf tree), and 4-tuple methods compute \(3{n \atopwithdelims ()4}\) summary statistics (one for each unrooted four-leaf tree).

The proofs of statistical consistency for \(\ell \)-tuple-based methods have the following basic steps: first, they show that as the number m of genes increases, the vector of summary statistics computed by F on input data \(I_m\) converges in probability to a constant vector (which we will refer to as \(F_0\)). Second, they show that \(\alpha (F_0) = T\), where T is the topology of the true species tree. Third, they show that there is some \(\delta > 0\) so that whenever \(L_{\infty }(F_1,F_0) < \delta \) then \(\alpha (F_1)=\alpha (F_0)=T\) (here \(L_{\infty }\) is the infinity-norm, i.e. the maximum absolute difference of individual vector components). It follows that the algorithm \(A=(F,\alpha )\) is statistically consistent under the MSC. Therefore, when we refer to a statistically consistent \(\ell \)-tuple method, we will assume that these properties hold for the method, and then study the impact of missing data on the method.

Proofs of statistical consistency for many coalescent-based methods typically require several extra conditions. For example, the proofs of statistical consistency of SVDquartets, MP-EST, STEM, STAR, and SMRT require that sequences evolve under the strict molecular clock. Similarly, the proofs of statistical consistency for nearly all methods that operate by combining gene trees require completely correct gene trees (for an exception to this rule see [25]), and it is unknown whether any standard coalescent-based methods that estimate species trees by combining gene trees are statistically consistent in the presence of gene tree estimation error. Another complication in the proofs of statistical consistency is the typical requirement that \(\alpha \) provide an exact solution to an optimization problem (e.g., finding the species tree that maximizes some optimization criterion with respect to the input gene data). This is generally not an issue for 2-tuple methods, which use methods like neighbor joining [26] to compute trees from distance matrices, but can be a problem for 3-tuple and 4-tuple methods. For example, 4-tuple methods tend to have two steps, where the first step computes a set of quartet trees (using F) and the second step computes a tree from the set of quartet trees using \(\alpha \). Since quartet tree compatibility is NP-hard [28], quartet amalgamation methods are typically heuristics that have no guarantees (the dynamic programming algorithm in ASTRAL is one of the few exceptions to this), and may not even be guaranteed to return a tree T when given its set of quartet trees. Thus, statistical consistency of coalescent-based methods is complicated, even when there are no missing data.

Extension of Tuple-Based Methods to Missing Data. These tuple-based methods are defined and described on the assumption of gene trees or sequence alignments without missing data, and the statistics or the algorithm may not be fully defined if not all taxa are present. For example, for a given gene tree, the topology for quartet ijkl does not exist if one or more of the species is missing from the gene. Intuitively, if the method would have called for the calculation of a statistic on a particular set of taxa for a particular gene, it is not possible to calculate this if any taxon in the set is not present, so that gene should be excluded for purposes of that statistic. Thus, the natural extension of a tuple-based method \((F,\alpha )\) to inputs with missing data (species missing from genes) is as follows:

Definition 1

Let \(A=(F,\alpha )\) be an \(\ell \)-tuple species tree estimation method. The natural extension of A computes the summary statistics for a given set B of \(\ell \) species based only on those genes that contain all the species in B.

Type 1 and Type 2 \(\ell \) -Tuple Methods. Since the set of summary statistics includes a real number for every tree t on \(\ell \) species, we will let \(F_t(I)\) denote the summary statistic computed by the function F for tree t given input I. For a set B of \(\ell \) species drawn from the full set \(\mathcal {X}\) of species, let \(I|_{B}\) denote the input set I restricted to B; thus, all species in \(\mathcal {X}\setminus B\) are deleted entirely from the input. Then tuple-based methods can be characterized further depending on how they behave on such inputs. Specifically, we will partition \(\ell \)-tuple methods \((F,\alpha )\) into two categories:

  • Type 1: For all inputs I, all sets B of \(\ell \) species from \(\mathcal {X}\), and all trees t on B, \(F_t(I)=F_t(I|_{B})\).

  • Type 2: There is at least one input I, one set B of \(\ell \) species, and one tree t on B such that \(F_t(I) \ne F_t(I|_{B})\).

Thus, a Type 1 \(\ell \)-tuple method has the property that deleting taxa from outside a set B does not impact the summary statistics it computes for any tree on B. Note that taxon deletion impacts both Type 1 and Type 2 methods, in that if enough taxa are deleted from enough genes then accuracy must decrease. As we will see, Type 1 methods are easier to analyze than Type 2 methods, and in particular it is easy to prove that a Type 1 method remains statistically consistent in the presence of missing data for some models of random taxon deletion. Most statistically consistent coalescent-based methods are Type 1 tuple-based methods; for example, ASTRAL, GLASS, METAL, MP-EST, STEAC, and SVDquartets are all Type 1 tuple-based methods. ASTRID, NJst, and STAR are Type 2 methods.

2.2 Taxon Deletion Models

Let \(\mathcal {T}\) be a species tree on a set \(\mathcal {X}\) of n species, with \(\mathcal {X}=\{x_i\}_{i=1}^n\), and let m gene trees evolve within \(\mathcal {T}\) under the multi-species coalescent model. We denote the set of gene trees by \(\mathbf {T}=\{T_i\}_{i=1}^m\), and the set of genes by \(\mathcal {G}=\{g_i\}_{i=1}^m\). To model taxon deletion, we let \(g_i\) denote an arbitrary gene, and \(Y_i=[Y_{i1},\cdots , Y_{in}]^T\) be a random n-dimensional vector where

$$\begin{aligned} Y_{ij}=\mathbb {I}_{\left\{ x_j\text { is present in }g_i\right\} } \end{aligned}$$
(1)

Here each individual \(Y_{ij}\) is a binary random variable that represents whether a species \(x_j\) is present for a random gene \(g_i\).

Exchangeability. For the following lemma, we will assume that if \(T_i\) is generated before \(Y_i\), then the post-deletion tree \(T_i^*=T_i|Y_i=y_i\) is obtained by taking the sub-tree of \(T_i\) restricted to the set of leaves \(\{x_j | y_{ij}=1\}\), that is, the set of leaves whose taxa have not been deleted. If \(Y_i\) is generated before \(T_i\), then \(T_i^*\) is obtained by taking the same sub-tree of the species tree \(\mathcal {T}\), \(\mathcal {T}|Y_i\) and simulating a gene tree within this species sub-tree under the multi-species coalescent.

Lemma 2

If \(T_i\) and \(Y_i\) are independent, the two variables are exchangable and the distribution of \(T_i^*\) does not depend on the sequence.

Proof

If \(Y_i\) is generated first, then the conditional distribution of \(T_i^*\) is equal to the distribution of gene trees under the multi-species coalescent on \(\mathcal {T}|Y_i\), by definition.

If \(T_i\) is generated first, then the pruning operations described above mean that \(T_i^*\) will lie entirely within the subtree \(\mathcal {T}|Y_i\). It remains to show that the probability of any given pattern of coalescence on the remaining branches is identical to the MSC under \(\mathcal {T}|Y_i\). This follows from the memoryless property of coalescence under the MSC: the probability of any two lineages originating within \(\mathcal {T}|Y_i\) coalescing at any given point is not dependent on either lineage’s coalescent history.

It should be noted by this model description, taxa are absent or present independently of the generation of the gene data, including tree topology and sequence evolution, and the two processes are exchangeable. Also, as is the case with the general multi-species coalescent model, gene trees evolve under a process that is i.i.d. with respect to one another.

We will now define the two models for taxon deletion described briefly earlier.

The i.i.d. Model ( \(M_{iid}\) ). \(M_{iid}\) is a family of models parameterized by p, with \(0<p<1\), where p is the probability that a random gene is present in a random species. For the \(M_{iid}\) model for parameter p, we assume that \(Y_{ij}\sim Bernoulli(p)\) for all genes i and all taxa j, and that \(Y_{ij}\) and \(Y_{kj}\) are independent for \(k\ne i\). (By extension of the statement earlier that genes evolve independently of one another, this also implies that \(Y_{ij}\) and \(Y_{ik}\) are independent for genes j, k where \(j\ne k\).)

The Full Subset Coverage Model ( \(M_{fsc}\) ). \(M_{fsc}\) is a family of models parameterized by \(k \ge 2\). We assume that the taxon deletion process is i.i.d. across the genes but we do not assume that it is i.i.d. across species. An \(M_{fsc}\) model for parameter k satisfies the property that for any subset B of at most k species there is a strictly positive probability \(p_B\) (that can depend on B) so that given a random gene, every member of B is present in the data for that gene with probability \(p_B\). Since the number of taxon sets of size at most k is finite, \(p^* = \min \{p_B: B \subseteq \mathcal {X}, |B| \le k\} > 0\); hence, every taxon subset of size at most k appears in a random gene with probability at least \(p^*\). Note that every \(M_{iid}\) model satisfies the property of being an \(M_{fsc}\) model for every k.

Comparison to Previous Models. Most prior studies of the impact of missing data on phylogenomic analysis have been performed under the \(M_{iid}\) model; this model is referred to as R in [36] and as the “random allocation” model in [8]. [36] also considered the G model, where missing data are concentrated in a subset of randomly chosen genes, and then ingroup taxa are deleted under an i.i.d. process from these genes. [36] also studied the S model, where missing data are allowed only in a subset of randomly chosen ingroup species, and that the genes are deleted from the selected species under an i.i.d. process. Note that the S and G models studied in [36] are \(M_{fsc}\) models.

3 Results

3.1 Results under an adversary model of taxon deletion

Theorem 3

Let taxon deletion be dependent on gene tree topology. There exists a dependency structure under which no method is statistically consistent.

Proof

Let \(\mathcal {T}\) and \(\mathcal {T}'\) be possible species trees; note that \(\mathcal {T}'\) has a topology that appears with strictly positive probability under the MSC for species tree \(\mathcal {T}\). For each gene \(g_i\) (whose true gene tree topology we’ll denote as \(t_i\) for clarity), consider the dependency structure where all taxa are present in the data for \(g_i\) with probability 1 if the topology of \(t_i\) is identical to \(\mathcal {T}'\), and all taxa are absent with probability 1 otherwise. Effectively, gene \(g_i\) is observed if and only if it has topology identical to \(\mathcal {T}'\). Then the distribution of observed gene data is not unique to the species tree and identifiability of the species tree is lost, so no method can be statistically consistent.

The theorem above demonstrates that a dependence between gene tree topology and taxon presence can quickly unravel statistical consistency guarantees in the absence of additional assumptions. But such a dependence may exist for some realistic models of gene presence/absence, including a birth/death type model where a gene may be present only for a clade of the tree. Such models are interesting and unsolved, but are beyond the scope of this paper.

3.2 Results for Type 1 Methods Under \(M_{fsc}\)

We now discuss the statistical consistency guarantees of Type 1 tuple-based methods. As we will see, most of the tuple-based methods remain statistically consistent even in the presence of missing data, as long as the process that generates the missing taxa is well-behaved (e.g., not generated by an adversary that biases the method towards the wrong tree).

Let \(A=(F,\alpha )\) be a Type 1 \(\ell \)-tuple method that satisfies the following properties:

  • (i) For all model species trees \(\mathcal {T}= (T,\varTheta )\), as the number m of genes increases, \(F(I_m) \overset{p}{\longrightarrow }F_0\), where \(F_0\) is a constant vector parameterized by \(\mathcal {T}\).

  • (ii) There exists \(\delta > 0\) such that for all vectors of summary statistics \(F_1 \) satisfying \(L_{\infty }(F_1,F_0) < \delta \), \(\alpha (F_1) = \alpha (F_0) = T\).

Theorem 4

Let \(A=(F,\alpha )\) be a Type 1 \(\ell \)-tuple species tree estimation method satisfying the two properties (i) and (ii) above, and assume that the number of species is at least \(\ell \). The natural extension of A is statistically consistent under \(M_{fsc}\) with parameter \(k \ge \ell \), and thus also under \(M_{iid}\) for any parameter p.

Proof

Let \(\mathcal {T}=(T,\varTheta )\) be the model species tree, \(I_m\) be the input dataset containing m genes, C be the number of summary statistics computed by algorithm \(A=(F,\alpha )\) on input \(I_m\). Since \(A=(F,\alpha )\) satisfies condition (i) when there are no missing data, then as the number of genes m increases, \(F(I_m) \overset{p}{\longrightarrow }F_0\), where \(F_0\) is a vector of constants. We will denote the \(i^{th}\) summary statistic computed on input \(I_m\) by \(F_i(I_m)\) and the i-th component of \(F_0\) as \(F_{0_i}\). We write \(F(I_m)=\left( F_1(I_m|_{\mathbf {x}_1}),\dots ,F_C(I_m|_{\mathbf {x}_C})\right) \) where \(\mathbf {x}_i\) denotes a particular set of \(\ell \) taxa. In other words, since A satisfies condition (i) when there are no missing data, for all \(i=1,\dots ,C\) there exist a constant \(F_{0_i}\) such that \(F_i(I_m|_{\mathbf {x}_i})\overset{p}{\longrightarrow }F_{0_i}\) as \(m \rightarrow \infty \). Since the data for each gene are independent of all others, to prove statistical consistency under the \(M_{fsc}\) model we merely require that \(I_m|_{\mathbf {x}_i}\) include an infinite number of genes as \(m\rightarrow \infty \). Under the \(M_{fsc}\) model, \(Pr[\mathbf {x}_i \subseteq L(g)] >0\) for every gene g (where \(\mathcal {L}(g)\) denotes the set of species for gene g). Hence, by the Borel-Cantelli lemma, the number of genes that include all \(\ell \) taxa in \(\mathbf {x}_i\) will also approach infinity. Thus \(I_m|_{\mathbf {x}_i}\) will include an infinite number of genes, and \(F(I_m|_{\mathbf {x}_i})\overset{p}{\longrightarrow }F_{0_i}\). By the definition of the natural extension of A, \(\alpha \) does not change under deleted taxa. Since A satisfies condition (ii), \(\exists \delta >0\) such that \(\forall F_1\) with \(L_{\infty }(F_1,F_0) < \delta \), \(\alpha (F_1)=T\), and so the natural extension of A is statistically consistent under \(M_{fsc}\). Since \(M_{iid}\) is a subset of \(M_{fsc}\), it is also statistically consistent under \(M_{iid}\).

Corollary 5

ASTRAL and METAL are statistically consistent under the MSC even when taxa are deleted under an \(M_{fsc}\) model, provided that each is run in exact mode and so finds globally optimal species trees. MP-EST and STEM are statistically consistent under the MSC even when taxa are deleted under an \(M_{fsc}\) model, if sequence evolution is under a strict molecular clock and they find globally optimal species trees. SVDquartets is statistically consistent under the MSC even when taxa are deleted under an \(M_{fsc}\) model, if sequence evolution is under a strict molecular clock and the quartet amalgamation heuristic used is modified to ensure that it returns a compatibility tree when the input set of quartets is compatible. SMRT is statistically consistent under the MSC even when taxa are deleted under an \(M_{fsc}\) model, if sequence evolution is under the symmetric two-state model with a strict molecular clock.

3.3 Statistical Consistency of Versions of ASTRAL Under \(M_{fsc}\)

ASTRAL-1 [20] (and its improved version, ASTRAL-2 [21]) are coalescent-based methods for estimating species trees that take unrooted gene trees as input, and return a tree that minimizes the quartet tree distance to the input gene trees. Each can be run in exact mode, which guarantees that the tree that is returned has the minimum distance to the input gene trees. However, the exact versions are computationally intensive (running in time that grows exponentially with the number of species), and so heuristic versions are also available. These heuristic versions operate by constraining the search space using the input set of gene trees, and then guarantee that an optimal tree is returned within the search space. The important difference between the two methods is how the search space is constrained, and ASTRAL-2 explicitly enlarges the space compared to ASTRAL-1 when the input gene trees can be incomplete (i.e., when some species are missing from some gene trees). Because the search space is constrained using the input gene trees, the two ASTRAL algorithms depends on the input in a way that makes the analysis of their statistical guarantees non-trivial.

This section shows that both ASTRAL-1 and ASTRAL-2 are statistically consistent under the \(M_{iid}\) model, but not under any \(M_{fsc}\) model. We then present a modification to ASTRAL-1, denoted by ASTRAL*, and which differs from ASTRAL-1 only in how the search space is constrained. We then show that ASTRAL* is statistically consistent under many \(M_{fsc}\) models.

ASTRAL-1. We begin with a formal description of the ASTRAL-1 algorithm.

Notation. We let \(\mathcal {X}\) denote the full set of species, and \(\mathcal {X}'\) denote an arbitrary subset of \(\mathcal {X}\). Every tree t we consider is assumed to be a binary unrooted tree with leaves taken from a subset of \(\mathcal {X}\), and as earlier we denote the leafset of t by \(\mathcal {L}(t)\). Each edge of t defines a bipartition of the set \(\mathcal {L}(t)\) (denoted by \(B|B'\), for some set \(B \subseteq \mathcal {X}\) and \(B' = \mathcal {L}(t) \setminus B\)) obtained by deleting the edge but not its endpoints from t. We will refer to the set of all these bipartitions as Bip(t), and the set of halves of the bipartitions of t as the clades of t. (Note that the term “clades” is normally used only in the context of rooted trees, but we extend the term here to allow us to refer to halves of bipartitions using the same term.) We let \(T_{\mathcal {X}}(X)\) denote the set of unrooted binary trees on leafset \(\mathcal {X}\) that satisfy \(Bip(t) \subseteq X\). If X is not provided, then we assume the set of unrooted binary trees is not constrained, and let \(T_{\mathcal {X}}\) denote the set of all unrooted binary trees on leafset \(\mathcal {X}\).

We let Q(t) denote all 4-leaf homeomorphic subtrees of t induced by a set of four leaves in t, and we note that when t is binary (i.e., fully resolved), then Q(t) contains only binary quartet trees. Let \(\mathcal {Q}\) be the set of all \(\left( {\begin{array}{c}n\\ 4\end{array}}\right) \) 4-taxon subsets of the taxon set \(\mathcal {X}\). Let \(q\in \mathcal {Q}\), let t be an arbitrary tree topology on \(\mathcal {X}\) and let Top(qt) denote the induced quartet subtree topology for quartet q in t.

Definition 6

ASTRAL Optimization Problem

Input: Taxon set \(\mathcal {X}=\{x_i\}_{i=1}^n\), gene trees \(t_1,\cdots ,t_m\), and set X of allowed bipartitions of \(\mathcal {X}\).

Output: Binary tree T where

$$\begin{aligned} T=\arg \max _{t\in T_{\mathcal {X}}(X)} \sum _{q\in \mathcal {Q}}\sum _{i=1}^{m}\mathbb {I}_{\left\{ Top(q,t)=Top(q,g_i)\right\} } \end{aligned}$$

ASTRAL-1 will specifically return the tree that maximizes the optimization criteria in the innermost summation subject to the constraint that every bipartition in the output tree be included in the set \(\mathcal {X}\). Therefore, to show consistency under a model of missing data it is necessary to show not only that the optimization criteria still works, but also that the true topology will be still be included in this constrained search space.

ASTRAL-1 and ASTRAL-2 differ in how they define the default set X of allowed bipartitions, and also use slightly different dynamic programming techniques to assemble the optimal tree from the bottom up. Note that to run ASTRAL-1 or ASTRAL-2 in exact mode, the set X is defined to be all bipartitions on \(\mathcal {X}\). In the default version of ASTRAL-1 (referred to as the “heuristic version”), X is the set of all bipartitions that appear in any gene tree. Hence, when there are no missing data, then as the number of genes increases, the set X will include all possible bipartitions on the taxon set with probability converging to 1 (and hence in particular the bipartitions in the true species tree). However, when there are missing data, then proving that the set X contains all the bipartitions in the species tree takes some care. In particular, if every gene tree is incomplete, then no bipartition in any gene tree is a bipartition of the full set of taxa, and so this default setting will not enable a statistically consistent estimation method.

ASTRAL*: We will modify ASTRAL-1 by changing how it defines the set X of allowed bipartitions, and refer to this modification as ASTRAL*. Specifically, we will add bipartitions to the default setting computed by ASTRAL-1. Hence, ASTRAL*’s extra bipartitions could also be added to ASTRAL-2.

Note that ASTRAL-1, ASTRAL-2, and hence also ASTRAL, when run in heuristic mode, are different from the species tree estimation methods described previously, in that \(\alpha \) depends not only on the summary statistics F(I) but also on the input data I. Therefore, we denote the output of the function by \(\alpha (F,I)\).

For every clade \(C\subset \mathcal {X}\) occurring in a gene tree, we require that ASTRAL* adds the bipartition \(C|C'\) where \(C' = \mathcal {X}\setminus C\), to its set X. (Note that since the trees in this problem are unrooted, a clade and one half of a bipartition are equivalent concepts.) This is a trivial extension of the algorithm for a model of incomplete genes and one that strengthens the conditions under which the method is consistent, as we will see below.

Theorem 7

(1) ASTRAL*, as well as ASTRAL-1 and ASTRAL-2 run in default heuristic mode, are statistically consistent under the MSC for any \(M_{iid}\) model of taxon deletion. (2) ASTRAL-1 is not statistically consistent under an \(M_{fsc}\) model with parameter k if the number of species is greater than k and ASTRAL-1 is run in default heuristic mode. (3) ASTRAL* is statistically consistent under any \(M_{fsc}\) model of taxon deletion with parameter k if the number n of species is at most 2k.

Proof

  1. (1)

    Let \(\mathcal {T}\) be a model species tree, and consider taxon deletion under some \(M_{iid}\) model. We will show that there is non-zero probability that every bipartition in the species tree appears in the search space computed by ASTRAL-1 in its default setting. Since the search space computed by ASTRAL-1 is a subset of the search space computed by ASTRAL-2 and ASTRAL*, the result will follow. Recall that ASTRAL-1 includes all bipartitions \(C|C'\) that appear in any input gene tree. Under the MSC model, every bipartition appears in some gene tree with probability increasing to 1 as the number of genes increases. Under any \(M_{iid}\) model, for every subset of taxa, the probability that none of the taxa in the subset are deleted is strictly greater than 0. Hence, under \(M_{iid}\), the set X of bipartitions allowed in the ASTRAL-1 search space will converge to the set of all possible bipartitions. Therefore, ASTRAL-1 is statistically consistent under \(M_{iid}\). Since the sets of bipartitions computed by ASTRAL* and ASTRAL-2 contain the set of bipartitions computed by ASTRAL-1, ASTRAL-2 and ASTRAL* are also statistically consistent under \(M_{iid}\).

  2. (2)

    Now consider the \(M_{fsc}\) model with parameter k: let \(n > k\), and further let the taxon deletion process be such that every gene has exactly k taxa, (e.g. k taxa sampled uniformly, a valid model under \(M_{fsc}\)). Then if ASTRAL-1 is run in heuristic mode, it will compute a set X that contains no bipartitions on the set \(\mathcal {X}\) of species, and so cannot be statistically consistent under all \(M_{fsc}\) models.

  3. (3)

    We now show that ASTRAL* run in heuristic mode is statistically consistent under any \(M_{fsc}\) model with parameter k when \(n \le 2k\). Let \(C|C'\) be an arbitrary bipartition on \(\mathcal {X}\), and assume without loss of generality that \(|C| \le k\). Hence, under \(M_{iid}\), the probability that C appears in a random gene tree is strictly positive. Under the MSC, any bipartition on \(\mathcal {X}\) appears in a given true gene tree with strictly positive probability. Since this process is independent from the removal of taxa, and since there is non-zero probability that all members of a clade appear in the gene tree, the probability is non-zero that the set C appears as a clade in a random gene tree.

Hence, as the number m of gene trees increases, the probability approaches 1 that C appears as a clade in at least one gene tree. Thus the probability approaches 1 that the set X computed by ASTRAL* will contain \(C|C'\), where \(C' = \mathcal {X}\setminus C\). Therefore, ASTRAL*, run in heuristic mode, will be statistically consistent under the \(M_{fsc}\) model with parameter k, provided that the number of species \(n \le 2k\).

Theorem 8

ASTRAL-1 and ASTRAL*, when run in heuristic mode, are not statistically consistent under the \(MSC_{fsc}\) class of models with parameter k, if the number of species \(n > 2k\).

Proof

Consider a model of taxon deletion where every gene tree has exactly k taxa, selected at random from the full set of taxa. This model satisfies the conditions of the \(M_{fsc}\) models with parameter k. Now assume k \(<\lfloor \) n/2 \(\rfloor \).

Let \(\mathcal {T}\) be a caterpillar tree on a set \(\mathcal {X}\) of n taxa. Then \(\mathcal {T}\) contains a clade B of size \(\lfloor n/2 \rfloor \) whose complement is at least as large; hence both B and \(\mathcal {X}\setminus B\) have more than k species. Hence, under this model of taxon deletion, neither B nor \(\mathcal {X}\setminus B\) will be in any gene tree. Hence, the bipartition \(B|B'\) (where \(B'=\mathcal {X}\setminus B\)) will not be in X (the constraint on the search space) as computed by ASTRAL and ASTRAL*. Hence, neither ASTRAL nor ASTRAL* can recover the true species tree under this random taxon deletion model.

3.4 Statistical Consistency of ASTRID and NJst Under \(M_{iid}\)

As noted earlier, ASTRID, NJst, and STAR are Type 2 methods, and proofs we provided of statistical consistency for Type 1 tuple-based methods do not apply to these methods (or other Type 2 methods). In this section we will show ASTRID and NJst remain statistically consistent under the \(M_{iid}\) models of taxon deletion. However, the statistical consistency of these methods under the more general \(M_{fsc}\) models is unknown.

NJst is a distance-based method that uses the average topological “internode” distance between taxa in the gene trees. The internode distance between two taxa \(x_i\) and \(x_j\) in a tree is the count of individual nodes along the path from \(x_i\) to \(x_j\), denoted \(\rho (x_i,x_j)\). ASTRID is an extension of NJst with the averaging redefined to better accommodate missing taxa and is precisely the natural extension of NJst as defined earlier, where the statistic for each pair is calculated as usual but restricted to the genes in which both members of the pair appear. However, NJst and ASTRID are not Type 1 methods, under the definition provided above, because the internode distance for two taxa \(x_i\) and \(x_j\) can be affected by the presence or absence of a third taxon.

Distance methods are formally 2-tuple methods, and use an algorithm such as neighbor joining [26] to return a tree topology and branch lengths given \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) pairwise distances (collectively, the distance “matrix”). We now state some well-known properties of such methods for reference in the proof below. For a tree with topology \(T=(V,E)\) on n taxa and edge weights \(l_e\), \(e\in E\), if the distance for any two taxa j and k is equal to the sum of the edge weights over edges in the shortest path between leaves j and k, then neighbor-joining will return a tree with topology T, and the distance matrix is said to be additive on the topology T. An equivalent definition of an additive matrix is as follows:

Definition 9

The Four Point Condition. Let \(T=(V,E)\) be a tree on n leaves labeled as \(\mathcal {S}=\{s_i\}_{i=1}^n\) with positive edge weights \(l_e\), \(e\in E\). Let \(D=d_{ij}\) be the matrix of pairwise distances between all pairs of taxa (ij), \(i,j\in \mathcal {S}\). The matrix D is additive on the topology T if and only if for all sets of four leaves \(\{i,j,k,l\}\subset \mathcal {S}\) with quartet-subtree topology ij|kl in T, without loss of generality, the following holds:

$$\begin{aligned} d_{ij}+d_{kl} < d_{ik}+d_{jl} = d_{il}+d_{jk}. \end{aligned}$$

Furthermore, if instead of D as above we are given \(\hat{D}=d_{ij}+\varepsilon _{ij}\), where \(\varepsilon _{ij}\) is an unknown noise term such that for all \(i,j\in \mathcal {S}\), \(|\varepsilon _{ij}|<\frac{1}{2}\min _{a,b\in \mathcal {S}} d_{ab}\), then neighbor-joining applied to the matrix \(\hat{D}\) will also return the topology T with probability 1. Therefore, since NJst is a distance-method, to prove statistical consistency it suffices to show that the metric given by the average internode distance collectively converges to an additive matrix on the true topology.

Theorem 10

Assume that taxa are absent from the data for each gene according to the \(M_{iid}\) model. Then NJst and ASTRID are statistically consistent under the MSC.

First we give a helpful lemma. Note that since \(\rho (x_i,x_j)\) is undefined when either of \(x_i\) or \(x_j\) are removed, the expectation of \(\rho (x_i,x_j)\) is formally undefined as long as the probability of either is nonzero. We nonetheless use the notation \(\mathbf {E}\left[ \rho (x_i,x_j)\right] \) in the lemma and proofs below, which will refer implicitly to the conditional expectation on the event that neither \(x_i\) nor \(x_j\) are removed.

Lemma 11

Under the MSC and \(M_{iid}\), let a, b, c and d be four taxa, and consider the event in the coalescent probability space in which the lineages of these taxa have entered a common population and no pair have coalesced with one another, denoted as \(\mathcal {E}_{abcd}\). Denote the points on each respective lineage in which they enter the common population as A, B, C, and D. Let Y be the random variable representing the taxon deletion process. Then for any two taxa \(\{i,j\} \subset \{a,b,c,d\}\) and respectively \(\left\{ I,J\right\} \subset \{A,B,C,D\}\):

$$\begin{aligned} \mathbf {E}\left[ \rho (i,j) | \mathcal {E} \right] =\mathbf {E}_Y\left[ \rho (i,I)|\mathcal {E}\right] + \mathbf {E}_Y\left[ \rho (j,J)|\mathcal {E}\right] + K \end{aligned}$$

where K is a constant that does not depend on the identities of i and j.

Proof (Lemma 11)

[Lemma 11] Consider any gene tree g meeting the condition of event \(\mathcal {E}\), and let G be the subtree of g sitting between the points A, B, C and D and the root. The topology of G in this model, given that \(g\in \mathcal {E}\), is determined by a set of i.i.d. Exponentially distributed random variables corresponding to the pairwise times-to-coalescence of all remaining lineages concurrent with and including \(\{A,B,C,D\}\). By De Finetti’s theorem, the probability density function of G is unique up to a permutation of the indices of the random variables. Thus for any \(\{I,J\}\subset \{A,B,C,D\}\), since \(\rho (I,J)\) is dependent only on the topology of G, the probability density of \(\rho (I,J)\) does not depend on the identity of I and J, and thus \(\mathbf {E}\left[ \rho (I,J)|\mathcal {E} \right] =K\).

Proof (Theorem 10)

[Theorem 10] In order to show that the method is statistically consistent, we must show the following:

  1. 1.

    For a gene tree generated under the coalescent process in the MSC followed by the removal of taxa subject to \(M_{iid}\), the expected value of the summary statistics \(\rho (x_i,x_j)\) for each pair of taxa \(x_i,x_j\) form an additive matrix that defines the topology of the species tree.

  2. 2.

    The statistics themselves converge to their expectations as the number m of genes approaches infinity.

The second property follows from the i.i.d.-generation of gene trees and the weak law of large numbers; hence, we need only establish the first property.

Let G be a random gene tree on n taxa generated by the MSC on species tree \(\mathcal {T}\), and let \(Y_n\sim M_{iid}\) with probability of deletion p. We will show that for an arbitrary set of four taxa \(\{x_1, x_2, x_a, x_b\}\), the expectations over G and \(Y_n\) of the internode distances are additive on the species tree, which we check by confirming that they obey the four point condition for the species tree \(\mathcal {T}\). In what follows, we will refer to this by saying that the expectations “obey the four point condition”, with the understanding that this is with reference to the species tree \(\mathcal {T}\). Importantly, the event \(\mathcal {E}\) defined in Lemma 11 includes all cases in which the quartet-subtree topology in the gene tree does not match that of the species tree, and implies that the four point condition holds in these cases. As a result, it suffices to show that the four point condition holds when the quartet-subtree in G is identical to the topology in the species tree, which we assume without loss of generality has topology \(x_1 x_2|x_a x_b\). We will show this by induction on the number n of taxa, and begin with the smallest non-trivial case, \(n=4\).

Base Step: \(\mathbf {n=4}\). For \(n=4\) we can write the expected values of the internode distances in closed form and check the four point condition directly:

figure a

Thus to test that the four point condition holds, we have:

figure b

Thus the four point condition holds for \(n=4\).

Induction Step: Assume that for a set \(S_n\) of n taxa, the expected value of the matrix \(D=[\rho (x_i,x_j)]\) for a random gene tree \(G_n\) and taxon-removal variable \(Y_n\) is additive on the true species tree. That is, for any set of n-taxa under the MSC and \(M_{iid}\) and any quartet (abcd), the four point condition holds for \(\mathbf {E}_n[\rho (i,j)]\), \(i,j\in \{a,b,c,d\}\), a fact that we will use below. Here \(\mathbf {E}_n\) denotes the expectation operator under n taxa for notational clarity. We will now show that the same holds for a set of size \(n+1\).

Let \(x_{n+1}\) be a new taxon and let G be generated under the MSC on \(S_n\cup \{x_{n+1}\}\) with taxon-removal variable \(Y_{n+1}=(Y_n,y_{n+1})\) where \(y_{n+1}\sim Bernoulli(p)\) in accordance with \(M_{iid}\). Our approach will be to define three cases, each of which have non-zero probability, and show that regardless of the placement of \(x_{n+1}\) in the species tree, the four point condition holds on \(\{x_1,x_2,x_a,x_b\}\).

Case 1: \(x_{n+1}\) is deleted from G (i.e. \(y_{n+1}=1\)). In this case the conditional values of \(\mathbf {E}_{n+1}[\rho (x,x')]\) for \(x\ne x'\), \(x,x'\in \{x_1,x_2,x_a,x_b\}\) are identical to the n-taxa case, and thus by our assumption they obey the four point condition.

Case 2: \(x_{n+1}\) is not deleted and it coalesces on a branch that is not on the induced quartet-subtree of G for the quartet \(\{x_1,x_2,x_a,x_b\}\). In this case, the coalescence event for \(x_{n+1}\) does not add to the internode distances along any of the branches connecting any two members of the quartet. As a result, again the conditional values of \(\mathbf {E}[\rho (x,x')]\) for \(x\ne x'\), \(x,x'\in \{x_i,x_j,x_k,x_l\}\) are identical to the n-taxa case, where again they obey the four point condition, by the induction assumption.

Case 3: \(x_{n+1}\) is not deleted and coalesces directly with a branch on the induced quartet subtree of G for the quartet \(\{x_1,x_2,x_a,x_b\}\). This case is non-trivial and requires some analysis. For shorthand, we will refer to the quartet-subtree of G restricted to \(\{x_1, x_2,x_a,x_b\}\) as simply q. For a pair of taxa \(i,j\in S_n\) the value of the expected internode distance in the presence of \(x_{n+1}\) is:

$$\begin{aligned} \mathbf {E}_{n+1}\left[ \rho (i,j)\right] =\mathbf {E}_{n}\left[ \rho (i,j)\right] +P(i,j) \end{aligned}$$
(2)

where P(ij) denotes the probability that \(x_{n+1}\) coalesces on a branch in the path from i to j, which would cause \(\rho (i,j)\) to increase by 1. This quantity is non-trivial and depends jointly on both the topology of G and the value of \(Y_{n}\). However, since the first term obeys the four point condition by the induction hypothesis, the proof depends on showing that P(ij) does as well.

To do that, we will fi1rst partition the joint probability space of the MSC and \(M_{iid}\), denoted as \(\mathcal {G}\) and assign each part of this partition to one of the five branches in the subtree q, then show that for \(\{i,j\} \subset \{x_1,x_2,x_a,x_b\}\), P(ij) can be expressed as the sum of probabilities assigned to the branches between i and j. We will label the branches as \(b_1\), \(b_2\), \(b_a\), and \(b_b\) for the four outer branches and \(b_m\) for the middle branch.

Figure 1 describes the partition based on the branch with which \(x_{n+1}\) coalesces (in each column) and the outcome of the taxon-removal process (in each row). For the illustrations in the header of each row, a branch represented with a dotted line represents the case that all lineages coalescing along that branch other than \(x_{n+1}\) are fully removed by the taxon-removal process. The positions of the relative taxa are given in the illustration in the first row header, and all taxon-removal outcomes that allow at least one \(\rho (i,j)\), \(i,j\in \{x_1,x_2,x_a,x_b\}\) to be measured are represented in the rows. The assignment is not unique, as implied by the entries with offer two possibilities, but for the proof either will work so we consider the default to be the first branch listed.

Fig. 1.
figure 1

Table of attachment branches given the latent 5-taxon topology and set of deleted taxa from q, assuming no other taxa coalesce closer to the inner nodes. Branches (in rows) given by dotted lines are pruned as a result of taxon deletion.

For an event \((G,Y_{n+1} ) \in \mathcal {G}\), denote the assignment of \((G,Y_{n+1})\) to branch \(b_e\) as \((G,Y_{n+1}) \rightarrow b_e\), and for that branch let

$$p_e = P\left( \{(G,Y_{n+1})\in \mathcal {G} | (G,Y_{n+1}) \rightarrow b_e\} \right) .$$

In this way, for any pair of taxa \(i,j\in \{x_1,x_2,x_a,x_b\}\), the probability that \(\rho (i,j)\) is incremented by 1 in the presence of \(x_{n+1}\) is given precisely by the sum of \(p_e\) for set of edges e between i and j in the quartet subtree q. This should be apparent by visual inspection of the table. Thus, P(ij) in (2) is the sum of positive edge weights on the topology \(x_1 x_2 | x_a x_b\), which is also the species tree topology on these four taxa, and so is additive on the species tree. Hence, it meets the four point condition, completing the proof.

This proof has been limited to ASTRID, and only provided under the \(M_{iid}\) model of taxon deletion because the independence of taxon deletion (between taxa as well as from gene tree generation) was used when we noted that

$$\begin{aligned} \mathbf {E}_{n+1}[\rho (x,x')]=\mathbf {E}_n[\rho (x,x')]+p_{xx'} \end{aligned}$$
(3)

in case 3. Independence implies that the marginal probability distribution of \(\rho (x,x')\) for n taxa is identical to the conditional distribution given the information that \(x_{n+1}\) has not been deleted.

While that is not strictly necessary for (3) to hold, counterexamples can be difficult to construct, and thus it is non-trivial to characterize conditions that may be weaker than pure independence and still imply that ASTRID is statistically consistent. It is an open question whether ASTRID is statistically consistent under any more general model (e.g., under the \(M_{fsc}\) model). It is also an open question whether other Type II methods (e.g., STAR) are statistically consistent under the \(M_{iid}\) model of taxon deletion.

Results shown here have focused on whether coalescent-based species tree estimation methods remain statistically consistent in the presence of missing data. However, inherent in the question is the assumption that the method being considered is statistically consistent when there are no missing data. Here we consider the conditions under which the different methods we discussed in this paper are statistically consistent when there are no missing data.

4 Discussion

This study examined the theoretical properties of coalescent-based species tree estimation methods, and showed that the Type 1 methods are statistically consistent under a fairly general model of missing data (the “full subset coverage” model, \(M_{fsc}\)), while at least one Type 2 method is statistically consistent under an i.i.d. model of missing data that is a subclass of the \(M_{fsc}\) model. However, these statistical consistency results do not suggest that missing data do not have a negative impact on the accuracy of coalescent-based species tree estimation, nor that the impact may differ between methods. The practical implications of missing data are best understood through experimental studies.

Four such studies [8, 29, 31, 36] examined the impact of missing data on coalescent-based species tree estimation methods. Hovmöller et al. [8] evaluated the impact of missing data on STEM and *BEAST, using two different models of taxon deletion and found that both STEM and *BEAST were highly robust to missing data.

Xi et al. [36] evaluated ASTRAL (v. 4.7.1), MP-EST, and STAR on simulated and biological datasets with missing data. They explored two taxon deletion models, S (missing data restricted to a subset of selected species) and G (missing data restricted to a subset of selected genes), although they never deleted any genes from the outgroup taxa. They found that the impact of missing data depended on the model and the amount of missing data, and that in general ASTRAL and MP-EST were highly robust to missing data, while STAR was much more negatively impacted.

Streicher et al. [29] examined the impact of missing data on the branch support of species trees estimated using NJst and ASTRAL (v. 4.7.6) on a biological dataset of Iguanian lizards. They explored different amounts of missing data, and concluded that in general it is best to include all the data, except (perhaps) when the amount of missing data in the gene or species exceeds 50%.

Finally, Vachaspati and Warnow [31] studied the impact of missing data on ASTRID and ASTRAL-2 (v. 4.7.8) on simulated datasets with 50 species under a model in which all genes had missing data, but the genes that were deleted from each species were selected under an i.i.d. model. They observed that both ASTRID and ASTRAL were substantially impacted by missing data, so that error increased with taxon deletion. However, even with very high missing data rates (i.e., all genes missing 80% of the species), both ASTRID and ASTRAL-2 achieved good accuracy with a large enough number of genes.

Overall these studies suggest that missing data is not especially detrimental to accuracy using MP-EST, ASTRID, ASTRAL, STEM, and *BEAST.

However, one particularly challenge of missing data is the possibility that outgroup taxa may be the missing taxa. When the phylogenomic estimation method requires rooted gene trees, this creates a challenge of using a different rooting method (e.g., midpoint rooting, or maximum likelihood under a strict molecular clock assumption). These rooting techniques are not statistically consistent when the data does not follow a strict clock, however. Only two of the studies above [8, 36] considered methods that require rooted gene trees, but both evolved sequences under a strict molecular clock. Xi et al. [36] used outgroup rooting but did not simulate under conditions where data was missing for the outgroup. Thus our understanding of the empirical impact of missing data on methods that require rooted gene trees would benefit from additional study.

5 Summary

We have shown that if taxon absence/presence is independent of gene tree topology, then it is exchangeable with the MSC. We have further shown that full subset coverage is sufficient for a tuple-based method to be statistically consistent, and that i.i.d. taxon deletion is a necessary condition for NJst and ASTRID to be statistically consistent.