Impossibility of Consistent Distance Estimation from Sequence Lengths Under the TKF91 Model

Fan, Wai-Tong Louis; Legried, Brandon; Roch, Sebastien

doi:10.1007/s11538-020-00801-3

Original Article
Published: 13 September 2020

Volume 82, article number 123, (2020)

Bulletin of Mathematical Biology


Abstract

We consider the problem of distance estimation under the TKF91 model of sequence evolution by insertions, deletions and substitutions on a phylogeny. In an asymptotic regime where the expected sequence lengths tend to infinity, we show that no consistent distance estimation is possible from sequence lengths alone. More formally, we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to one.

1 Introduction

Phylogeny estimation consists in the inference of an evolutionary tree from extant species data, commonly molecular sequences (e.g. DNA, amino acid). A large body of theoretical work exists on the statistical properties of standard reconstruction methods (Steel 2016; Warnow 2017). Typically in such analyses, one assumes that sequences have evolved on a fixed rooted tree, from a common ancestor sequence to the leaf sequences, according to some Markovian stochastic process. Often these processes model site substitutions exclusively, with the underlying assumption being that the data have been properly aligned in a pre-processing step. In contrast, relatively little theoretical work has focused on models of insertions and deletions (indels) together with substitutions, in spite of the fact that such models have been around for some time (Thorne et al. 1991, 1992). See e.g. (Thatte 2006; Daskalakis and Roch 2013; Allman et al. 2015; Fan and Roch 2020).

One extra piece of information available under indel models is the length of the sequence, which itself evolves according to a Markov process on the tree. The notable work of Thatte (2006) shows that leaf sequence lengths alone are in fact enough to reconstruct phylogenies, through a distance-based approach. More specifically, it is shown in (Thatte 2006, (27)) that under the TKF91 model (Thorne et al. 1991) the expectation of the sequence length $N_v$ at a leaf v conditioned on the sequence length $N_u$ at another leaf u separated from v by an amount of time $t_{uv}$ is

$$\begin{aligned} {\mathcal {N}}_v(t) = {\bar{L}} + \left( N_u - {\bar{L}}\right) e^{-\mu t_{uv} (1-\lambda /\mu )} \end{aligned}$$

(1)

where ${\bar{L}} = \frac{\lambda /\mu }{1 - \lambda /\mu }$ is the expected length at stationarity, where $\lambda < \mu $ are the rates of insertion and deletion, respectively (full details on the TKF91 model are provided in Sect. 2). Hence, we see from (1) that the full distribution of sequence lengths suffices to recover $\lambda /\mu $ and all $\mu t_{uv}$’s.
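As a small numerical companion to (1), the sketch below evaluates ${\bar{L}}$ and the conditional expectation, and inverts (1) for $\mu t_{uv}$. The function names are illustrative and not taken from the paper, and the inversion assumes the conditional mean is known exactly, which is a statement about the distribution rather than an estimator from a single sample.

```python
import math


def stationary_mean_length(lam: float, mu: float) -> float:
    """Expected length at stationarity: (lam/mu) / (1 - lam/mu)."""
    if not 0.0 < lam < mu:
        raise ValueError("need 0 < lam < mu")
    ratio = lam / mu
    return ratio / (1.0 - ratio)


def expected_leaf_length(n_u: float, t_uv: float, lam: float, mu: float) -> float:
    """Right-hand side of (1): conditional mean of N_v given N_u = n_u."""
    l_bar = stationary_mean_length(lam, mu)
    decay = math.exp(-mu * t_uv * (1.0 - lam / mu))
    return l_bar + (n_u - l_bar) * decay


def scaled_distance_from_mean(n_u: float, mean_n_v: float, lam: float, mu: float) -> float:
    """Solve (1) for mu * t_uv; assumes n_u differs from the stationary mean."""
    l_bar = stationary_mean_length(lam, mu)
    ratio = (mean_n_v - l_bar) / (n_u - l_bar)
    return -math.log(ratio) / (1.0 - lam / mu)


if __name__ == "__main__":
    m = expected_leaf_length(5, 2.0, 0.5, 1.0)
    print(m, scaled_distance_from_mean(5, m, 0.5, 1.0))
```

With the exact conditional mean as input, the inversion returns $\mu t_{uv}$; the rest of the paper concerns what happens when only one noisy sample of the lengths is available.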

The tree topology can then be recovered using standard results about the metric properties of phylogenies (Steel 2016). That is, the tree is identifiable from the sequence lengths under the TKF91 model in the sense that two distinct tree topologies $T_1 \ne T_2$ necessarily produce distinct joint distributions of sequence lengths at the leaves.

It is also suggested in Thatte (2006)—without a full rigorous proof—that the scheme above could be used to reconstruct phylogenies from a single sample of sequence lengths at the leaves in the limit where $\lambda \nearrow \mu $. The latter asymptotics ensure that the expected sequence length at stationarity ${\bar{L}}$ diverges and serves as a proxy for the amount of data growing to infinity. However, we show that no consistent distance estimator exists in this limit. Formally we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to 1 as $\lambda \nearrow \mu $. Hence, while the tree is identifiable from the distribution of the sequence lengths at the leaves, one sample of the sequence lengths alone cannot be used in a distance-based approach of the type described above to reconstruct the tree consistently as $\lambda \nearrow \mu $. On the technical side our proof follows by noting that, under the TKF91 model, the sequence length is (morally) a sum of independent random variables with finite variances, to which we apply a central limit theorem. One complication is to obtain a limit theorem that is uniform in the parameter $\lambda /\mu $. We expect that our techniques will be useful to analyze other bioinformatics methods under indel processes, for instance methods based on k-mer statistics [see e.g. (Yang and Zhang 2008; Haubold 2013)]. Further intuition on our results is provided in Sect. 3.

Organization The rest of the paper is organized as follows. The TKF91 model is reviewed in Sect. 2. Our main result, together with a proof sketch, is stated in Sect. 3. Details of the proof are provided in Sect. 4.

2 Basic Definitions

In this section, we recall the TKF91 sequence evolution model (Thorne et al. 1991). To simplify the presentation, we restrict ourselves to a two-state version of the model, as we will only require the underlying sequence-length process.

Definition 1

(TKF91 model: two-state version) Consider the following Markov process ${\mathcal {I}} = \{{\mathcal {I}}_{t}\}_{t \ge 0}$ on the space ${\mathcal {S}}$ of binary digit sequences together with an immortal link “$\bullet $”, that is,

$$\begin{aligned} {\mathcal {S}} := ``\bullet '' \otimes \bigcup _{M\ge 1} \{0,1\}^M, \end{aligned}$$

where the notation above indicates that all sequences begin with the immortal link. Positions of a sequence, except for that of the immortal link, are called sites or mortal links. Let $(\nu ,\lambda ,\mu ) \in (0,\infty )^{3}$ and $(\pi _0,\pi _1) \in [0,1]^2$ with $\pi _0 + \pi _1 = 1$ be given parameters. The continuous-time dynamics are as follows: If the current state is the sequence $\vec {x} \in {\mathcal {S}}$, then the following events occur independently:

Substitution Each site is substituted independently at rate $\nu > 0$. When a substitution occurs, the corresponding digit is replaced by 0 and 1 with probabilities $\pi _0$ and $\pi _1$, respectively.
Deletion Each site is removed independently at rate $\mu $.
Insertion Each site, as well as the immortal link, gives birth to a new digit independently at rate $\lambda $. When a birth occurs, the new site is added immediately to the right of its parent site. The newborn site has digit 0 and 1 with probabilities $\pi _0$ and $\pi _1$, respectively.

This indel process is time-reversible with respect to the measure $\Pi $ given by

$$\begin{aligned} \Pi (\vec {x})= \left( 1-\frac{\lambda }{\mu }\right) \left( \frac{\lambda }{\mu }\right) ^M\prod _{i=1}^M\pi _{x_i} \end{aligned}$$

for each $\vec {x}=(x_1,x_2,\cdots ,x_M)\in \{0,1\}^M$ where $M\ge 1$, and $\Pi (``\bullet '') = \left( 1-\frac{\lambda }{\mu }\right) $. We assume further that $\lambda < \mu $. In that case, $\Pi $ is the stationary distribution of ${\mathcal {I}}$.

We will be concerned with the underlying sequence-length process.

Definition 2

(Sequence length) The length of a sequence $\vec {x}\in {\mathcal {S}}$ is defined as the number of sites and is denoted by $|\vec {x}|$. Therefore, if $\vec {x} = (\bullet ,x_1, \ldots ,x_M)$, then $|\vec {x}|=M$.
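Substitutions leave the length unchanged, so under Definition 1 the length is a birth-death chain: from length $n$ it moves to $n+1$ at rate $\lambda (n+1)$ (the $n$ sites plus the immortal link) and to $n-1$ at rate $\mu n$. The following is a minimal Gillespie-style sketch of that chain; the function names and the seeded generator are illustrative choices, not part of the paper.

```python
import math
import random


def simulate_length(n0: int, t: float, lam: float, mu: float, rng: random.Random) -> int:
    """Run the length chain for time t starting from n0 sites."""
    n, clock = n0, 0.0
    while True:
        birth = lam * (n + 1)   # n sites plus the immortal link can insert
        death = mu * n          # only the n sites can be deleted
        total = birth + death
        clock += rng.expovariate(total)
        if clock > t:
            return n
        if rng.random() * total < birth:
            n += 1
        else:
            n -= 1


def expected_length(n0: int, t: float, lam: float, mu: float) -> float:
    """Mean of the chain at time t, as in (1)."""
    l_bar = (lam / mu) / (1.0 - lam / mu)
    return l_bar + (n0 - l_bar) * math.exp(-(mu - lam) * t)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [simulate_length(10, 1.0, 0.5, 1.0, rng) for _ in range(4000)]
    print(sum(draws) / len(draws), expected_length(10, 1.0, 0.5, 1.0))
```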

Under $\Pi $, the sequence-length process $|{\mathcal {I}}|$ is stationary and is geometrically distributed. Specifically the stationary distribution of the length process $|{\mathcal {I}}|$ is

$$\begin{aligned} \gamma ^{(\lambda )}_M:= \left( 1-\frac{\lambda }{\mu }\right) \left( \frac{\lambda }{\mu }\right) ^M,\quad M\in {\mathbb {Z}}_+. \end{aligned}$$

(2)
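One way to sanity-check (2) is through detailed balance for the length chain: the flow from $M$ to $M+1$, namely $\gamma ^{(\lambda )}_M \lambda (M+1)$, should equal the flow back, $\gamma ^{(\lambda )}_{M+1}\mu (M+1)$. The sketch below checks this numerically; the helper names are ours.

```python
def gamma_pmf(m: int, lam: float, mu: float) -> float:
    """Stationary probability of length m, as in (2)."""
    r = lam / mu
    return (1.0 - r) * r ** m


def detailed_balance_gap(m: int, lam: float, mu: float) -> float:
    """Absolute difference between upward and downward stationary flows."""
    up = gamma_pmf(m, lam, mu) * lam * (m + 1)
    down = gamma_pmf(m + 1, lam, mu) * mu * (m + 1)
    return abs(up - down)


if __name__ == "__main__":
    lam, mu = 0.9, 1.0
    print(max(detailed_balance_gap(m, lam, mu) for m in range(50)))
    # The mean of the geometric law should match lam / (mu - lam).
    print(sum(m * gamma_pmf(m, lam, mu) for m in range(2000)), lam / (mu - lam))
```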

We are interested in this process on a rooted tree T. Denote the index set by $\Gamma _{T}$. The root vertex $\rho $ is assigned a state ${\mathcal {I}}_{\rho } \in {\mathcal {S}}$, drawn from stationary distribution on ${\mathcal {S}}$. This state is then evolved down the tree according to the following recursive process. Moving away from the root, along each edge $e = (u,v) \in E$, conditionally on the state ${\mathcal {I}}_{u}$, we run the indel process for a time $\ell _{(u,v)}$. Denote by ${\mathcal {I}}_{t}$ the resulting state at $t \in e$. Then the full process is denoted by $\{{\mathcal {I}}_{t}\}_{t \in \Gamma _{T}}$. In particular, the set of leaf states is ${\mathcal {I}}_{\partial T} = \{{\mathcal {I}}_{v}:v \in \partial T\}$.

Setting Throughout this paper, we let ${\mathbb {P}}_{\vec {x}}$ be the probability measure when the root state is $\vec {x}$. If the root state is chosen according to a distribution $\nu $, then we denote the probability measure by ${\mathbb {P}}_{\nu }$. We also denote by ${\mathbb {P}}_{M}$ the conditional probability measure for the event that the root state has length M.

For our purposes, it will suffice to focus on the space ${\mathcal {T}}_{2}$ of star trees with two leaves that have the same finite distance $h\in (0,\infty )$ from the root and are labeled as $\{1,2\}$. This distance h is the height of the tree. The indel process on a tree $T\in {\mathcal {T}}_{2}$ reduces to a pair of indel processes $({\mathcal {I}}_{t}^1,{\mathcal {I}}_{t}^2)_{t\ge 0}$ that are independent upon conditioning on the root state ${\mathcal {I}}_{\rho }={\mathcal {I}}_{0}^1={\mathcal {I}}_{0}^2$. We always assume the root state is chosen according to the equilibrium distribution $\Pi $. So the distribution of $({\mathcal {I}}_{0}^1,{\mathcal {I}}_{0}^2)\in {\mathcal {S}} \times {\mathcal {S}}$ is

$$\begin{aligned} {\widehat{\nu }}_0(\vec {x},\vec {y}) = {\left\{ \begin{array}{ll} \Pi (\vec {x}) &{} \text { if } \vec {x} = \vec {y}, \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

3 Main Result

Our main theorem is an impossibility result: the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to 1 as $\lambda \nearrow \mu $. Following (Thatte 2006), we consider the asymptotic regime where $\lambda \nearrow \mu $, which implies that the expected sequence length at stationarity tends to $+\infty $. Recall that the total variation distance between two probability measures $\tau _1$ and $\tau _2$ on a countable measure space E is

$$\begin{aligned} \Vert \tau _1 - \tau _2\Vert _{TV} = \sup _{A \subseteq E} \left| \tau _1(A) - \tau _2(A)\right| . \end{aligned}$$

(3)

Theorem 1

(Impossibility of distance estimation from sequence lengths) Let $T^{1}$ and $T^{2}$ be two trees in ${\mathcal {T}}_2$ with heights $h_1> h_2>0$, respectively. For $i\in \{1,2\}$, we consider a TKF91 process on tree $T^i$ and let $\vec {N}^{(i)} = (N_{1}^{(i)},N_{2}^{(i)})\in {\mathbb {Z}}_{+}^2$ be the pair of sequence lengths at the leaves $\partial T^i$. Let

$$\begin{aligned} {\mathcal {L}}_{i} = {\mathbb {P}}_{\Pi }(\vec {N}^{(i)} \in \cdot ) \end{aligned}$$

be the distribution of $\vec {N}^{(i)}$ under ${\mathbb {P}}_{\Pi }$. Then for any fixed deletion rate $\mu \in (0,\infty )$,

$$\begin{aligned} \limsup _{\lambda \nearrow \mu }\Vert {\mathcal {L}}_{1} - {\mathcal {L}}_{2}\Vert _{TV} < 1. \end{aligned}$$

(4)

From (3), we see that (4) implies that there is no test that can distinguish between ${\mathcal {L}}_{1}$ and ${\mathcal {L}}_{2}$ with probability going to 1 as $\lambda \nearrow \mu $.

Proof idea We first give a heuristic argument that underlies our formal proof. Without loss of generality, assume that the deletion rate is $\mu = 1$. The stationary length M at the root is geometric with mean and standard deviation both of order $1/(1-\lambda )$. So we can think of the root length as roughly $M \approx C/(1-\lambda )$ with significant probability. Ignoring the small effect of the immortal link and conditioning on M, the lengths at the leaves are sums of independent random variables, specifically the progenies of the M mortal links of the root. Here the progeny of a site is its descendants including itself.

The mean and variance of these variables can be computed explicitly from continuous-time Markov chain theory [see (11) below; see also (Thatte 2006, (27), (31))]. As $\lambda \nearrow 1$, the difference in expectation between heights $h_1$ and $h_2$ turns out to be

$$\begin{aligned} M e^{-(1-\lambda )h_1} - M e^{-(1-\lambda )h_2} \approx \frac{C}{1-\lambda }[-(1-\lambda )h_1 + (1-\lambda ) h_2] \approx C(h_2 - h_1), \end{aligned}$$

(5)

while the variance is of order

$$\begin{aligned} M \frac{e^{-(1-\lambda )h_i} (1- e^{-(1-\lambda )h_i})}{1-\lambda } \approx \frac{C}{1-\lambda } \frac{ (1-\lambda )h_i}{1-\lambda } \approx \frac{C h_i}{1-\lambda }. \end{aligned}$$

(6)

The key observation is that the variance (6) is much larger than the square of the expectation difference (5): the former grows like $1/(1-\lambda )$, while the latter stays bounded. Hence, by the central limit theorem, one can expect significant overlap between the length distributions under $h_1$ and $h_2$, making them hard to distinguish even as $\lambda \nearrow 1$. We formalize this argument next.
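To make the heuristic concrete, the sketch below evaluates the left-hand sides of (5) and (6) for a few values of $\lambda $ with $\mu = 1$ and root length $M = C/(1-\lambda )$. The function name and the sample values of $h_1$, $h_2$ and $C$ are illustrative assumptions.

```python
import math


def heuristic_scales(lam: float, h1: float, h2: float, c: float = 1.0):
    """Return (mean gap from (5), sd under h1, sd under h2) with M = c / (1 - lam)."""
    m = c / (1.0 - lam)
    decay1 = math.exp(-(1.0 - lam) * h1)
    decay2 = math.exp(-(1.0 - lam) * h2)
    mean_gap = m * (decay1 - decay2)
    # Variance as written in (6), i.e. up to the bounded factor (1 + lam).
    sd1 = math.sqrt(m * decay1 * (1.0 - decay1) / (1.0 - lam))
    sd2 = math.sqrt(m * decay2 * (1.0 - decay2) / (1.0 - lam))
    return mean_gap, sd1, sd2


if __name__ == "__main__":
    for lam in (0.9, 0.99, 0.999):
        gap, sd1, sd2 = heuristic_scales(lam, h1=2.0, h2=1.0)
        print(lam, gap, sd1, sd2, abs(gap) / sd1)
```

The mean gap stays close to $C(h_2-h_1)$ while the standard deviations grow, so the ratio in the last column shrinks as $\lambda \nearrow 1$.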

We observe that (4) is equivalent to

$$\begin{aligned} \liminf _{\lambda \nearrow 1}\sum _{\vec {y} \in {\mathbb {Z}}_{+}^2}{\mathbb {P}}_{\Pi }(\vec {N}^{(1)} = \vec {y}) \wedge {\mathbb {P}}_{\Pi }(\vec {N}^{(2)} = \vec {y}) > 0. \end{aligned}$$

(7)

Indeed the total variation distance between two probability measures $\tau _1$ and $\tau _2$ on a countable space E can also be written as

$$\begin{aligned} \Vert \tau _1 - \tau _2\Vert _{TV} = 1 - \sum _{\sigma \in E}\tau _1(\sigma )\wedge \tau _2(\sigma ). \end{aligned}$$
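The two expressions for the total variation distance can be compared on a small example. In the sketch below, distributions are plain dictionaries from outcomes to probabilities (a representation chosen here for illustration); the supremum in (3) is attained at the event $\{\sigma : \tau _1(\sigma ) > \tau _2(\sigma )\}$.

```python
def tv_by_best_event(p: dict, q: dict) -> float:
    """Evaluate (3) at the maximizing event A = {s : p(s) > q(s)}."""
    support = set(p) | set(q)
    return sum(max(p.get(s, 0.0) - q.get(s, 0.0), 0.0) for s in support)


def tv_by_overlap(p: dict, q: dict) -> float:
    """One minus the overlap sum of pointwise minima."""
    support = set(p) | set(q)
    return 1.0 - sum(min(p.get(s, 0.0), q.get(s, 0.0)) for s in support)


if __name__ == "__main__":
    p = {0: 0.5, 1: 0.5}
    q = {0: 0.25, 1: 0.25, 2: 0.5}
    print(tv_by_best_event(p, q), tv_by_overlap(p, q))
```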

The rest of the proof is to establish (7). It involves a series of steps:

1. We first reduce the problem to a sum of independent random variables by conditioning on the root sequence length and ignoring the immortal link. In particular, we use the fact that there is a fairly uniform probability that M is in an interval of size $1/(1-\lambda )$ around $1/(1-\lambda )$. We remove the effect of the immortal link by conditioning on its having no descendant, an event of positive probability.
2. The central limit theorem (CLT) implies that there is a significant overlap between the two sums. More precisely, we need a local CLT [see e.g. (Durrett 2010)] to derive the sort of pointwise lower bound needed in (7). However, the bound we require must be uniform in $\lambda $ and we did not find in the literature a result of quite this form. Instead, we use an argument based on the Berry-Esséen theorem [again see e.g. (Durrett 2010)]. We first establish overlap over $\Omega (\sqrt{M})$ constant size intervals for the sum of the first $M-1$ mortal links, and then we use the final mortal link to match the probabilities on common point values under heights $h_1$ and $h_2$.
3. Finally we bound the sum in (7).

4 Proof

In this section, we give the details of the proof of Theorem 1. We follow the steps described in the previous section.

Step 1. Reducing the problem to a sum of independent random variables We first show that ${\mathbb {P}}_{\Pi }$ in (7) can be replaced by ${\mathbb {P}}_M$ where M is of the order of the expected sequence length $1/(1-\lambda )$ under $\Pi $. That is, we condition on the length of the ancestral sequence. After that, we further ignore the progenies of the immortal link so that each leaf sequence consists of i.i.d. progenies of the M sites in the ancestral sequence. These two simplifications are achieved in (8) and (9), respectively.

Precisely, for any $\lambda _* \in (0,1)$ and $0< c_1< 1< c_2 < +\infty $, using (2)

$$\begin{aligned}&\liminf _{\lambda \nearrow 1}\sum _{\vec {y} \in {\mathbb {Z}}_{+}^2}{\mathbb {P}}_{\Pi }(\vec {N}^{(1)} = \vec {y}) \wedge {\mathbb {P}}_{\Pi }(\vec {N}^{(2)} = \vec {y}) \nonumber \nonumber \\&\quad \ge \inf _{\lambda \in (\lambda _*,1)}\sum _{\vec {y} \in {\mathbb {Z}}_{+}^2}{\mathbb {P}}_{\Pi }(\vec {N}^{(1)} = \vec {y}) \wedge {\mathbb {P}}_{\Pi }(\vec {N}^{(2)} = \vec {y}) \nonumber \nonumber \\&\quad = \inf _{\lambda \in (\lambda _*,1)}\sum _{\vec {y} \in {\mathbb {Z}}_{+}^2} \left[ \sum _{M \in {\mathbb {Z}}_+} \gamma ^{(\lambda )}_M \,{\mathbb {P}}_{M}(\vec {N}^{(1)} = \vec {y}) \right] \wedge \left[ \sum _{M \in {\mathbb {Z}}_+} \gamma ^{(\lambda )}_M \,{\mathbb {P}}_{M}(\vec {N}^{(2)} = \vec {y})\right] \nonumber \nonumber \\&\quad = \inf _{\lambda \in (\lambda _*,1)} \sum _{\vec {y} \in {\mathbb {Z}}_{+}^2} \sum _{M \in {\mathbb {Z}}_+} (1-\lambda ) \lambda ^M \left[ {\mathbb {P}}_{M}(\vec {N}^{(1)} = \vec {y}) \wedge {\mathbb {P}}_{M}(\vec {N}^{(2)}= \vec {y})\right] \nonumber \nonumber \\&\quad \ge \inf _{\lambda \in (\lambda _*,1)} \sum _{M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] } (1-\lambda ) \lambda ^M \sum _{\vec {y} \in {\mathbb {Z}}_{+}^2} {\mathbb {P}}_{M}(\vec {N}^{(1)} = \vec {y}) \wedge {\mathbb {P}}_{M}(\vec {N}^{(2)}= \vec {y}) \nonumber \nonumber \\&\quad \ge c_3 \inf _{\lambda \in (\lambda _*,1)} \inf _{M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] } \sum _{\vec {y} \in {\mathbb {Z}}_{+}^2} {\mathbb {P}}_{M}(\vec {N}^{(1)} = \vec {y}) \wedge {\mathbb {P}}_{M}(\vec {N}^{(2)}= \vec {y}), \end{aligned}$$

(8)

where $c_3:=\inf _{\lambda \in (\lambda _*,1)}(1-\lambda )\sum _{M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] }\lambda ^{M}$. Note that $c_3\in (0,\infty )$ because the expression $ (1-\lambda )\sum _{M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] }\lambda ^{M}$ is continuous in $\lambda $ and tends to $e^{-c_1}-e^{-c_2}$ as $\lambda \rightarrow 1$.
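The limit of the window mass appearing in $c_3$ can be checked numerically. The sketch below sums $(1-\lambda )\lambda ^M$ over the integers in $\left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $ and compares with $e^{-c_1}-e^{-c_2}$; the function names and the sample constants are ours, and floating-point rounding may shift an endpoint of the window by one integer.

```python
import math


def window_mass(lam: float, c1: float, c2: float) -> float:
    """(1 - lam) * sum of lam**M over integers M in [c1/(1-lam), c2/(1-lam)]."""
    lo = math.ceil(c1 / (1.0 - lam))
    hi = math.floor(c2 / (1.0 - lam))
    return (1.0 - lam) * sum(lam ** m for m in range(lo, hi + 1))


def limiting_mass(c1: float, c2: float) -> float:
    """Limit of the window mass as lam tends to 1."""
    return math.exp(-c1) - math.exp(-c2)


if __name__ == "__main__":
    for lam in (0.9, 0.99, 0.999):
        print(lam, window_mass(lam, 0.5, 2.0), limiting_mass(0.5, 2.0))
```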

Let ${\mathcal {Z}}_0$ be the event that the immortal link of the root sequence produces no mortal link in either leaf sequence. Let ${\mathbb {P}}_{M,\bullet }$ be the probability conditioned on that event, and $c_4$ be a lower bound on the probability of ${\mathcal {Z}}_0$ uniform in $\lambda \in (\lambda _*,1)$. Under ${\mathbb {P}}_{M,\bullet }$, the two components of $\vec {N}^{(1)}$ are conditionally independent and each is a sum of M i.i.d. random variables corresponding to the progenies of mortal links. Hence, (8) is at least

$$\begin{aligned}&c_4 c_3 \inf _{\lambda \in (\lambda _*,1)} \inf _{M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] } \sum _{\vec {y} \in {\mathbb {Z}}_{+}^2} {\mathbb {P}}_{M,\bullet }(\vec {N}^{(1)} = \vec {y}) \wedge {\mathbb {P}}_{M,\bullet }(\vec {N}^{(2)}= \vec {y}) \nonumber \\&\quad {=} c_4 c_3 \inf _{\lambda \in (\lambda _*{,}1)} \inf _{M \in \left[ \frac{c_1}{1-\lambda }{,}\frac{c_2}{1{-}\lambda }\right] } \sum _{\vec {y} \in {\mathbb {Z}}_{+}^2}\! \left[ p_{M{,}y_1}^{(\lambda )}(h_1) \,p_{M,y_2}^{(\lambda )}(h_1) \right] \wedge \! \left[ p_{M{,}y_1}^{(\lambda )}(h_2) \,p_{M{,}y_2}^{(\lambda )}(h_2) \right] {,} \end{aligned}$$

(9)

where we let $p_{i,j}^{(\lambda )}(t) = {\mathbb {P}}_{i,\bullet }(|{\mathcal {I}}_{t}| = j)$ for $i,j\in {\mathbb {Z}}_+$ be the transition probabilities of the length process without the immortal link.

The sum in (9) leads us to study the overlap between the probability distributions $p_{M, \,\cdot }^{(\lambda )}(t):=\{p_{M, j}^{(\lambda )}(t)\}_{j\in {\mathbb {Z}}_+}$ for $t=h_1, h_2$ and $M\in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $. The central limit theorem is what we need. However, because of our need for a bound that is uniform in $\lambda $, we shall apply the Berry-Esséen theorem. Specifically, we apply the latter bound to the progenies of the first $M-1$ mortal links of the root sequence. The idea is to show that $\Omega (\sqrt{M})$ summands in (9) have value $\Omega (1/\sqrt{M})$, for each of $h_1$ and $h_2$ separately, and then use the last mortal link to “match” all these values between $h_1$ and $h_2$.

Step 2a. Establishing a uniform bound for $p_{M-1, \cdot }^{(\lambda )}(t)$ Note that $p_{M, \cdot }^{(\lambda )}(t)$ is the distribution of $S_{M}(t):=\sum _{i=1}^ML_t^i$, where $\{L^i_t\}_{i\ge 1}$ are i.i.d. random variables having the distribution of the progeny length of a single mortal link at time $t>0$.

Let the mean and the variance of $L^i_t$ be

$$\begin{aligned} \beta :=\beta (\lambda ,t):={\mathbb {E}}[L^i_t]\quad \text {and}\quad \sigma ^2:=\sigma ^2(\lambda ,t):={\mathbb {E}}|L^i_{t}-\beta |^2. \end{aligned}$$

(10)

As is expected, the distribution $p_{M, \,\cdot }^{(\lambda )}(t)$ is approximately Gaussian with mean $\beta M$ and variance $\sigma ^2 M$. We quantify this statement in the bound (12) below, after proving some moment bounds.

Lemma 1

Let $\beta (\lambda ,t)$ and $\sigma (\lambda ,t)$ be the mean and the standard deviation of $L^i_t$ defined in (10) and consider the absolute third moment $\rho (\lambda ,t) := {\mathbb {E}}|L^i_{t}-\beta |^3$. For any $t\in (0,\infty )$,

$$\begin{aligned} \beta (\lambda ,t)=e^{-(1-\lambda )t}\quad \text {and}\quad \sigma ^2(\lambda ,t)=\frac{1+\lambda }{1-\lambda } e^{-(1-\lambda )t} (1- e^{-(1-\lambda )t}). \end{aligned}$$

(11)

Furthermore,

$$\begin{aligned} 0<\inf _{\lambda \in [\lambda _*,1)}\sigma (\lambda ,t)< \sup _{\lambda \in [\lambda _*,1)}\sigma (\lambda ,t)<\infty \quad \text {and}\quad \sup _{\lambda \in [\lambda _*,1)}\rho (\lambda ,t) <\infty . \end{aligned}$$

Proof

For (11), see e.g. (Daskalakis and Roch 2013, (3), (4)).

Moreover, from (Thorne et al. 1991, (8)–(10)) or (Thatte 2006, (7)–(8)), the probability that a mortal link has n descendants including itself is

$$\begin{aligned} {\mathbb {P}}(L^i_t=n)= {\left\{ \begin{array}{ll} (1-\eta (\lambda , t))(1-\lambda \eta (\lambda ,t))[\lambda \eta (\lambda , t)]^{n-1} \qquad &{}\text {for }n\ge 1 \\ \eta (\lambda , t) \qquad &{}\text {for }n=0 \end{array}\right. }, \end{aligned}$$

where $\eta (\lambda , t) = \frac{1- e^{-(1-\lambda )t}}{1-\lambda e^{-(1-\lambda )t}}$. It can be seen from L’Hospital’s rule that $\eta (\lambda , t)$ is continuous as a function of $\lambda $ around 1 and that $\eta (\lambda , t) = \frac{t}{1+t} + O(|1-\lambda |)$ as $\lambda \rightarrow 1$. From this explicit formula for the probability mass function of $L^i_t$, which we note is a geometric sequence, it follows that all moments of $L^i_t$ are bounded from above uniformly in $\lambda \in [\lambda _*,1)$.

To show that the variance is bounded from below uniformly in $\lambda \in [\lambda _*,1)$, we note (again using L’Hospital’s rule) that $ \sigma ^2(\lambda ,t)$ is continuous in $\lambda $ around 1, strictly positive and tends to 2t as $\lambda \rightarrow 1$. Hence, the variance is bounded from below, uniformly in $\lambda \in [\lambda _*,1)$. $\square $
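The closed forms in (11) and the limits used in the proof can be checked against the explicit probability mass function. The sketch below (with $\mu = 1$) sums the truncated series numerically; the function names and the truncation level are illustrative assumptions.

```python
import math


def eta(lam: float, t: float) -> float:
    """eta(lam, t) = (1 - e^{-(1-lam)t}) / (1 - lam e^{-(1-lam)t})."""
    e = math.exp(-(1.0 - lam) * t)
    return (1.0 - e) / (1.0 - lam * e)


def progeny_pmf(lam: float, t: float, n_max: int) -> list:
    """P(L_t = n) for n = 0, ..., n_max (mu = 1)."""
    h = eta(lam, t)
    q = lam * h
    return [h] + [(1.0 - h) * (1.0 - q) * q ** (n - 1) for n in range(1, n_max + 1)]


def numeric_moments(lam: float, t: float, n_max: int = 5000):
    """Mean and variance obtained by summing the truncated series."""
    pmf = progeny_pmf(lam, t, n_max)
    mean = sum(n * p for n, p in enumerate(pmf))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))
    return mean, var


def closed_form_moments(lam: float, t: float):
    """Mean and variance from (11)."""
    e = math.exp(-(1.0 - lam) * t)
    return e, (1.0 + lam) / (1.0 - lam) * e * (1.0 - e)


if __name__ == "__main__":
    for lam in (0.5, 0.9, 0.99):
        print(lam, numeric_moments(lam, 1.0), closed_form_moments(lam, 1.0))
    # Near lam = 1 the variance approaches 2t and eta approaches t / (1 + t).
    print(closed_form_moments(1.0 - 1e-6, 1.0)[1], eta(1.0 - 1e-6, 1.0))
```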

Let $F_{M}^{(\lambda )}(t)$ be the cumulative distribution function (CDF) of the probability distribution $p_{M, \cdot }^{(\lambda )}(t)$. That is,

$$\begin{aligned} F_{M}^{(\lambda )}(t)(x)=\sum _{j:j\le x}p_{M, j}^{(\lambda )}(t)={\mathbb {P}}(S_{M}(t)\le x). \end{aligned}$$

Lemma 2

For each $t>0$, there exists a constant $C>0$ such that

$$\begin{aligned} \sup _{\lambda \in [\lambda _*,1)}\sup _{x \in {\mathbb {R}}}\Big |F_{M}^{(\lambda )}(t)\Big (M\beta (\lambda ,t)\,+\,x\,\sigma (\lambda ,t)\,\sqrt{M}\Big ) - {\mathcal {N}}(x)\Big | \le \frac{C}{\sqrt{M}}, \end{aligned}$$

(12)

for all $M\in {\mathbb {Z}}_+$, where ${\mathcal {N}}$ is the CDF of the standard normal distribution.

Proof

Since $\beta (\lambda ,t), \sigma ^2(\lambda ,t), \rho (\lambda ,t) \in (0,\infty )$, the Berry-Esséen theorem [as stated e.g. in Durrett (2010)] applies and asserts that

$$\begin{aligned} \sup _{x \in {\mathbb {R}}}\left| {\mathbb {P}}\left( \frac{S_{M-1} - (M-1)\beta (\lambda ,t)}{\sigma (\lambda ,t)\sqrt{M-1}}\le x\right) \,-\, {\mathcal {N}}(x) \right| \le \frac{3\rho (\lambda ,t)}{\sigma ^3(\lambda ,t) \sqrt{M-1}} \end{aligned}$$

(13)

for all $\lambda \in [0,1)$. By Lemma 1, for each $t>0$, the right-hand side is bounded from above uniformly for $\lambda \in [\lambda _*,1)$. $\square $
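The Berry-Esséen bound can be examined numerically by computing the exact law of a sum of i.i.d. progeny sizes through repeated convolution and measuring its Kolmogorov distance to the normal CDF. The sketch below does this for $\mu = 1$; the helper names, the truncation of the progeny law at 80 terms and the cutoff of the sum at twelve standard deviations above the mean are our assumptions.

```python
import math


def progeny_pmf(lam, t, n_terms=80):
    """Truncated PMF of one mortal link's progeny size at time t (mu = 1)."""
    e = math.exp(-(1.0 - lam) * t)
    h = (1.0 - e) / (1.0 - lam * e)
    q = lam * h
    return [h] + [(1.0 - h) * (1.0 - q) * q ** (n - 1) for n in range(1, n_terms)]


def moments(pmf):
    """Mean, standard deviation and absolute third central moment."""
    mean = sum(n * p for n, p in enumerate(pmf))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))
    rho = sum(abs(n - mean) ** 3 * p for n, p in enumerate(pmf))
    return mean, math.sqrt(var), rho


def sum_pmf(m, pmf, n_max):
    """PMF of the sum of m i.i.d. copies, restricted to {0, ..., n_max}."""
    dist = [1.0] + [0.0] * n_max
    for _ in range(m):
        new = [0.0] * (n_max + 1)
        for i, x in enumerate(dist):
            if x == 0.0:
                continue
            for j, y in enumerate(pmf):
                if i + j > n_max:
                    break
                new[i + j] += x * y
        dist = new
    return dist


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kolmogorov_gap(m, lam, t):
    """Return (sup-distance to the normal CDF, bound 3 rho / (sigma^3 sqrt(m)))."""
    pmf = progeny_pmf(lam, t)
    beta, sigma, rho = moments(pmf)
    n_max = int(m * beta + 12.0 * sigma * math.sqrt(m)) + 10
    dist = sum_pmf(m, pmf, n_max)
    gap, cdf = 0.0, 0.0
    for j, p in enumerate(dist):
        phi = normal_cdf((j - m * beta) / (sigma * math.sqrt(m)))
        gap = max(gap, abs(cdf - phi))  # just below the jump at j
        cdf += p
        gap = max(gap, abs(cdf - phi))  # at the jump
    return gap, 3.0 * rho / (sigma ** 3 * math.sqrt(m))


if __name__ == "__main__":
    for m in (25, 100):
        print(m, kolmogorov_gap(m, 0.9, 1.0))
```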

Step 2b. Controlling the overlap of $p_{M-1, \cdot }^{(\lambda )}(h_1)$ and $p_{M-1, \cdot }^{(\lambda )}(h_2)$ in (9) To quantify the overlap between $p_{M-1, \cdot }^{(\lambda )}(h_1)$ and $p_{M-1, \cdot }^{(\lambda )}(h_2)$, we first compare their expectations. For simplicity, we write $\beta _i:=\beta (\lambda ,h_i)$, $\sigma _i:=\sigma (\lambda ,h_i)$ and $\rho _i:=\rho (\lambda ,h_i)$ for $i=1,2$, where these functions are defined in Lemma 1. From the formula of $\beta $ in (11), we have

$$\begin{aligned} |\beta _1 - \beta _2| \le (1-\lambda )(h_1 - h_2) \end{aligned}$$

and so, for $M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $, the means of $S_{M-1}$ for $h_1$ and $h_2$ are close in the sense that

$$\begin{aligned} |\beta _1(M-1) - \beta _2(M-1)| \le c_6 \end{aligned}$$

(14)

for some $c_{6} > 0$ not depending on $\lambda $; for instance, since $M-1 \le \frac{c_2}{1-\lambda }$, one can take $c_6 = c_2(h_1-h_2)$.

Now consider the interval of length roughly twice the standard deviation of $p_{M-1, \cdot }^{(\lambda )}(h_1)$, centered around its mean $\beta (\lambda ,h_1)(M-1)$. We partition this interval into roughly $\frac{2\sqrt{M-1}}{K}$ pieces of constant length $\sigma _1 K$, where $K > 0$ is an arbitrary constant. Precisely, we consider the sub-intervals

$$\begin{aligned} {\mathcal {J}}^M_r(K):= \big (\beta _1(M-1) + r\sigma _1 K ,\; \beta _1(M-1) + (r+1) \sigma _1 K \big ) \end{aligned}$$

(15)

for $r \in \Lambda ^M_K$, where

$$\begin{aligned} \Lambda ^M_K:=\left\{ -\left[ \frac{\sqrt{M-1}}{K}\right] ,\, \,\ldots , -1, 0,\,1, \,\ldots , \left[ \frac{ \sqrt{M-1}}{K}\right] -1\right\} \end{aligned}$$

(16)

and [x] denotes the largest integer smaller than or equal to x.

Lemma 3 says that there exist positive constants $K = c_7$ large enough and $c_8$ small enough, depending on $c_6$ but not on $\lambda \in (\lambda _*,1)$, such that each of these intervals contains mass at least $\frac{c_8}{\sqrt{M-1}}$ under both probability distributions $p_{M-1, \,\cdot }^{(\lambda )}(h_1)$ and $p_{M-1, \,\cdot }^{(\lambda )}(h_2)$. See Fig. 1. Write $p_{M-1, \,A}^{(\lambda )}(t)=\sum _{j\in A}p_{M-1, \,j}^{(\lambda )}(t)$ for simplicity.

Lemma 3

There exist positive constants $c_7,c_8$ such that $c_7\inf _{\lambda \in [\lambda _*,1)}\sigma _1 >1$ and, with ${\mathcal {J}}_{r}={\mathcal {J}}^M_{r}(c_7)$ and $\Lambda ^M=\Lambda ^M_{c_7}$,

$$\begin{aligned} p_{M-1, \,{\mathcal {J}}_{r}}^{(\lambda )}(h_1) \wedge p_{M-1, \,{\mathcal {J}}_{r}}^{(\lambda )}(h_2) \ge \frac{c_8}{\sqrt{M-1}} \end{aligned}$$

for all $r \in \Lambda ^M$, $M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $ and $\lambda \in [\lambda _*,1)$.

Proof

The Berry-Esséen theorem (13) implies that

$$\begin{aligned} \sup _{r\in \Lambda ^M_K}\left| p_{M-1, \,{\mathcal {J}}_{r}}^{(\lambda )}(h_1) - \int _{\tilde{{\mathcal {J}}_{r}}}\frac{1}{\sqrt{2\pi }}e^{-\frac{x^2}{2}}\,dx \right| \le \frac{6\rho _1}{\sigma _1^3 \sqrt{M-1}} \end{aligned}$$

(17)

for all $\lambda \in [\lambda _*,1)$, $M\ge 2$ and $K>0$, where ${\mathcal {J}}_{r}={\mathcal {J}}^M_r(K)$ is defined in (15), $\Lambda ^M_K$ is defined in (16), and

$$\begin{aligned} \tilde{{\mathcal {J}}_{r}}:=\frac{{\mathcal {J}}_{r}-(M-1)\beta _1}{\sigma _1\sqrt{M-1}}\,=\,\left( \frac{rK}{\sqrt{M-1}},\,\frac{(r+1)K}{\sqrt{M-1}} \right) . \end{aligned}$$

Then $\{\tilde{{\mathcal {J}_{r}}}\}_{r\in \Lambda ^M}$ is roughly an equi-partition of the interval $(-1,1)$ into $\frac{2\sqrt{M-1}}{K}$ sub-intervals of length $\frac{K}{\sqrt{M-1}}$. Furthermore, since $\tilde{{\mathcal {J}}_{r}}\subset [-1,1]$ for all $r \in \Lambda ^M_K$,

$$\begin{aligned} \int _{\tilde{{\mathcal {J}}_{r}}}\frac{1}{\sqrt{2\pi }}e^{-\frac{x^2}{2}}\,dx \ge \frac{K}{\sqrt{M-1}}\,\frac{e^{-1/2}}{\sqrt{2\pi }}. \end{aligned}$$

From (17), there exists an absolute constant C large enough such that when $K=C\sup _{\lambda \in (\lambda _*,1)}\frac{\rho _1}{\sigma _1^3}$, we have

$$\begin{aligned} \inf _{r \in \Lambda ^M_K} p_{M-1, \,{\mathcal {J}}_{r}}^{(\lambda )}(h_1) \ge \frac{c}{\sqrt{M-1}} \end{aligned}$$

for some constant $c > 0$ that depends neither on $M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $ nor $\lambda \in [\lambda _*,1)$. Therefore, we let $c_7:=C\sup _{\lambda \in (\lambda _*,1)}\frac{\rho _1}{\sigma _1^3}$ and take $K=c_7$.

We now repeat the above argument for $h_2$, using (14). Similarly to (17), inequality (13) implies that

$$\begin{aligned} \sup _{r\in \Lambda ^M}\left| p_{M-1, \,{\mathcal {J}}_{r}}^{(\lambda )}(h_2) - \int _{{\mathcal {E}}_{r}}\frac{1}{\sqrt{2\pi }}e^{-\frac{x^2}{2}}\,dx \right| \le \frac{6\rho _2}{\sigma _2^3 \sqrt{M-1}} \end{aligned}$$

for all $\lambda \in [\lambda _*,1)$ and $M\ge 2$, where ${\mathcal {J}}_{r}={\mathcal {J}}^M_r(c_7)$ is defined in (15) using $h_1$, and

$$\begin{aligned} {\mathcal {E}}_{r}:=\frac{{\mathcal {J}}_{r}-(M-1)\beta _2}{\sigma _2\sqrt{M-1}}\,=\, \frac{(\beta _1-\beta _2)\sqrt{M-1}}{\sigma _2} + \frac{\sigma _1}{\sigma _2}\left( \frac{rc_7}{\sqrt{M-1}},\,\frac{(r+1)c_7}{\sqrt{M-1}} \right) . \end{aligned}$$

Here we denote $a+bI=\{a+bx:\,x\in I\}$ for any interval I and $a,b\in {\mathbb {R}}$.

By (14), ${\mathcal {E}}_{r}\subset \left[ -A,A\right] $ for all $r \in \Lambda ^M$ and $M\ge 2$, where $A:=c_6+\sup _{\lambda \in (\lambda _*,1)}\frac{\sigma _1}{\sigma _2}\in (0,\infty )$. Hence, as before, we have

$$\begin{aligned} \inf _{r \in \Lambda ^M} p_{M-1, \,{\mathcal {J}}_{r}}^{(\lambda )}(h_2) \ge \frac{c'}{\sqrt{M-1}} \end{aligned}$$

for some constant $c'>0$ that depends neither on $M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $ nor $\lambda \in [\lambda _*,1)$, even though ${\mathcal {J}}_{r}$ is constructed using $h_1$.

The proof is complete by taking $c_8=\min \{c,c'\}$. $\square $
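For intuition about Lemma 3, one can compute the exact mass that the laws of $S_{M-1}$ under $h_1$ and $h_2$ assign to each sub-interval in (15). The sketch below does this for $\mu = 1$ by convolution. The helper names and the choice $K=1$ in the usage are ours, and the constant $K$ here is simply an input, not the $c_7$ constructed in the proof.

```python
import math


def progeny_pmf(lam, t, n_terms=80):
    """Truncated PMF of one mortal link's progeny size at time t (mu = 1)."""
    e = math.exp(-(1.0 - lam) * t)
    h = (1.0 - e) / (1.0 - lam * e)
    q = lam * h
    return [h] + [(1.0 - h) * (1.0 - q) * q ** (n - 1) for n in range(1, n_terms)]


def mean_sd(pmf):
    mean = sum(n * p for n, p in enumerate(pmf))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))
    return mean, math.sqrt(var)


def sum_pmf(m, pmf, n_max):
    """PMF of the sum of m i.i.d. copies, restricted to {0, ..., n_max}."""
    dist = [1.0] + [0.0] * n_max
    for _ in range(m):
        new = [0.0] * (n_max + 1)
        for i, x in enumerate(dist):
            if x == 0.0:
                continue
            for j, y in enumerate(pmf):
                if i + j > n_max:
                    break
                new[i + j] += x * y
        dist = new
    return dist


def interval_masses(m, lam, h1, h2, k):
    """Rows (r, mass under h1, mass under h2) for the intervals J_r built from h1."""
    pmf1, pmf2 = progeny_pmf(lam, h1), progeny_pmf(lam, h2)
    beta1, sigma1 = mean_sd(pmf1)
    n = m - 1
    r_max = int(math.sqrt(n) / k)                      # integer part in (16)
    n_max = int(n * beta1 + r_max * sigma1 * k) + 1    # covers the last right endpoint
    d1, d2 = sum_pmf(n, pmf1, n_max), sum_pmf(n, pmf2, n_max)
    rows = []
    for r in range(-r_max, r_max):
        lo = n * beta1 + r * sigma1 * k
        hi = lo + sigma1 * k
        inside = [j for j in range(max(0, math.floor(lo)), min(n_max, math.ceil(hi)) + 1)
                  if lo < j < hi]                      # integers in the open interval
        rows.append((r, sum(d1[j] for j in inside), sum(d2[j] for j in inside)))
    return rows


if __name__ == "__main__":
    rows = interval_masses(20, 0.9, 2.0, 1.0, 1.0)
    scale = math.sqrt(19)
    for r, a, b in rows:
        print(r, a * scale, b * scale)
```

Multiplying each mass by $\sqrt{M-1}$, as in the printout, gives the quantity that Lemma 3 bounds from below by $c_8$.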

Step 2c. Matching $p_{M, \cdot }^{(\lambda )}(h_1)$ and $p_{M, \cdot }^{(\lambda )}(h_2)$ by the last mortal link Lemma 3 establishes overlap of $p_{M-1, \cdot }^{(\lambda )}(h_1)$ and $p_{M-1, \cdot }^{(\lambda )}(h_2)$ over constant size intervals. The next lemma uses the final mortal link to establish overlap of $p_{M, \cdot }^{(\lambda )}(h_1)$ and $p_{M, \cdot }^{(\lambda )}(h_2)$ over specific values.

Lemma 4

There exists a positive constant $c_9$ such that

$$\begin{aligned} \inf _{j_{r+1}^{*} \in {\mathcal {J}}_{r+1}\cap {\mathbb {Z}}_+ }\,p_{M,j_{r+1}^{*}}^{(\lambda )}(h_1) \wedge p_{M,j_{r+1}^{*}}^{(\lambda )}(h_2) > \frac{c_8 c_9}{\sigma ^*_1c_7\sqrt{M-1}} \end{aligned}$$

for all $r \in \Lambda ^M$, $M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $ and $\lambda \in [\lambda _*,1)$, where $\sigma ^*_1:=\sup _{\lambda \in (\lambda _*,1)}\sigma _1$.

Proof

By Lemma 3, ${\mathcal {J}}_r$ contains at least one integer, say $j_r^{(1)}$, with mass at least $\frac{c_8}{\sigma _1^*c_7 \sqrt{M-1}}$ under the probability measure $p_{M-1, \,\cdot }^{(\lambda )}(h_1)$. This is because (i) ${\mathcal {J}}_r$ is non-empty since $c_7\inf _{\lambda \in [\lambda _*,1)}\sigma _1 >1$ by Lemma 3, and (ii) there are at most $\left[ c_7\sigma ^*_1\right] $ many integers in ${\mathcal {J}}_r$.

Similarly, there exists $j_{r}^{(2)}$ with mass at least $\frac{c_8}{\sigma _1^*c_7 \sqrt{M-1}}$ under $p_{M-1,\,\cdot }^{(\lambda )}(h_2)$. Hence,

$$\begin{aligned} p_{M-1, \,j_{r}^{(1)}}^{(\lambda )}(h_1) \wedge p_{M-1, \,j_{r}^{(2)}}^{(\lambda )}(h_2)\ge \frac{c_8}{\sigma _1^*c_7 \sqrt{M-1}}. \end{aligned}$$

Let $j_{r+1}^*$ be an arbitrary integer in ${\mathcal {J}}_{r+1}$. The progeny of the M-th mortal link has a probability at each integer in $[0,\,2 \sigma _1^* c_7]$ bounded from below by a positive constant ${\tilde{c}}$ uniformly for all such integers and all $\lambda \in [0,1)$. It follows that

$$\begin{aligned} p_{M,j_{r+1}^*}^{(\lambda )}(h_1)= & {} \sum _{k=0}^{j_{r+1}^*}p_{M-1,k}^{(\lambda )}(h_1)\,p_{1,j_{r+1}^*-k}^{(\lambda )}(h_1) > p_{M-1,j_r^{(1)}}^{(\lambda )}(h_1)\,p_{1,j_{r+1}^*-j_r^{(1)}}^{(\lambda )}(h_1) \\\ge & {} \frac{c_8 {\tilde{c}}}{\sigma _1^*c_7 \sqrt{M-1}} \end{aligned}$$

and similarly for $h_2$. Taking $c_9 := {\tilde{c}}$ completes the proof. $\square $
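The key step in the proof above is that the law of the total progeny of $M$ links is the convolution of the law for $M-1$ links with the law of a single link, so each value of the convolution is bounded from below by any single term of the sum. The following minimal sketch checks this numerically. The single-link distribution `link` is a hypothetical placeholder chosen for illustration, not the TKF91 progeny law, and the function names are assumptions introduced here.

```python
def convolve(p, q):
    """Pmf of X + Y for independent X ~ p and Y ~ q (lists indexed by value)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def total_progeny_pmf(link, m):
    """Pmf of the sum of m independent copies of the single-link progeny."""
    pmf = [1.0]  # point mass at 0: no links, no progeny
    for _ in range(m):
        pmf = convolve(pmf, link)
    return pmf


def single_term_bound_holds(prev, link, full, tol=1e-12):
    """Check full[j] >= prev[k] * link[j - k] for every admissible pair (j, k)."""
    for j, mass in enumerate(full):
        for k in range(len(prev)):
            if 0 <= j - k < len(link) and mass < prev[k] * link[j - k] - tol:
                return False
    return True


# Hypothetical single-link progeny pmf on {0, 1, 2, 3} (illustration only).
link = [0.3, 0.4, 0.2, 0.1]
M = 6
prev = total_progeny_pmf(link, M - 1)  # plays the role of p_{M-1, .}
full = convolve(prev, link)            # plays the role of p_{M, .}

print(single_term_bound_holds(prev, link, full))
```

In the lemma, the single term is chosen so that the first factor is the heavy atom supplied by Lemma 3 and the second factor is at least ${\tilde{c}}$.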

Step 3. Putting everything together

Lemma 4 implies that the sum in (9) is at least a positive constant, uniformly in $M \in \left[ \frac{c_1}{1-\lambda },\frac{c_2}{1-\lambda }\right] $ and $\lambda \in (\lambda _*,1)$, because that sum is

$$\begin{aligned}&\,\sum _{\vec {y} \in {\mathbb {Z}}_{+}^2}\left[ p_{M,y_1}^{(\lambda )}(h_1) \,p_{M,y_2}^{(\lambda )}(h_1) \right] \wedge \left[ p_{M,y_1}^{(\lambda )}(h_2) \,p_{M,y_2}^{(\lambda )}(h_2) \right] \\&\quad \ge \, \sum _{y_1\in \cup _{r\in \Lambda ^M}{\mathcal {J}}_{r+1}\cap {\mathbb {Z}}_+,\;y_2\in \cup _{r\in \Lambda ^M}{\mathcal {J}}_{r+1}\cap {\mathbb {Z}}_+ }\left[ p_{M,y_1}^{(\lambda )}(h_1) \wedge p_{M,y_1}^{(\lambda )}(h_2) \right] \\&\qquad \cdot \,\left[ p_{M,y_2}^{(\lambda )}(h_1) \wedge p_{M,y_2}^{(\lambda )}(h_2) \right] \\&\quad \ge \, \left( \frac{c_8 c_9}{\sigma _1^*c_7 \sqrt{M-1}}\right) ^2\,\Big |\{y_1\in \cup _{r\in \Lambda ^M}{\mathcal {J}}_{r+1}\cap {\mathbb {Z}}_+,\;y_2\in \cup _{r\in \Lambda ^M}{\mathcal {J}}_{r+1}\cap {\mathbb {Z}}_+\}\Big |\\&\quad \ge \, \left( \frac{c_8 c_9}{\sigma _1^*c_7}\right) ^2. \end{aligned}$$

The proof of (7) and hence that of Theorem 1 are complete.
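The first inequality in the display of Step 3 uses the elementary fact that $\min (ab, cd) \ge \min (a,c)\,\min (b,d)$ for non-negative numbers, so the overlap of two product measures is at least the product of the coordinate-wise overlaps. The sketch below verifies this on two hypothetical distributions `p` and `q`, which merely stand in for $p_{M, \cdot }^{(\lambda )}(h_1)$ and $p_{M, \cdot }^{(\lambda )}(h_2)$ and are not derived from the TKF91 model.

```python
def marginal_overlap(p, q):
    """Sum over y of min(p[y], q[y])."""
    return sum(min(a, b) for a, b in zip(p, q))


def product_overlap(p, q):
    """Sum over (y1, y2) of min(p[y1] * p[y2], q[y1] * q[y2])."""
    n = len(p)
    return sum(
        min(p[y1] * p[y2], q[y1] * q[y2])
        for y1 in range(n)
        for y2 in range(n)
    )


# Two hypothetical pmfs on {0, ..., 4} (illustration only).
p = [0.10, 0.20, 0.40, 0.20, 0.10]
q = [0.05, 0.15, 0.30, 0.30, 0.20]

lhs = product_overlap(p, q)        # overlap of the two product measures
rhs = marginal_overlap(p, q) ** 2  # product of coordinate-wise overlaps

print(lhs >= rhs)
```

Restricting the sum to a subset of pairs, as in the proof, can only decrease the right-hand side, so the inequality is preserved.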

References

Allman ES, Rhodes JA, Sullivant S (2015) Statistically consistent k-mer methods for phylogenetic tree reconstruction. J Comput Biol 24(2):153–171
Daskalakis C, Roch S (2013) Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann Appl Probab 23(2):693–721
Durrett R (2010) Probability: theory and examples, 4th edn. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge
Fan W-TL, Roch S (2020) Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling. Bull Math Biol 82(2):21
Haubold B (2013) Alignment-free phylogenetics and population genetics. Briefings Bioinf 15(3):407–418
Steel M (2016) Phylogeny: discrete and random processes in evolution. In: CBMS-NSF regional conference series in applied mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA
Thatte BD (2006) Invertibility of the TKF model of sequence evolution. Math Biosci 200(1):58–75
Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124
Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34(1):3–16
Warnow T (2017) Computational phylogenetics: an introduction to designing methods for phylogeny estimation, 1st edn. Cambridge University Press, Cambridge, USA
Yang K, Zhang L (2008) Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Res 36(5):e33–e33

Acknowledgements

SR was supported by NSF grants DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892, and DMS-1916378, as well as a Simons Fellowship and a Vilas Associates Award. BL was supported by DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892 (to SR). WTF was supported by NSF grants DMS-1614242 (to SR) and DMS-1855417, and ONR-TCRI N00014-20-1-2411.

Author information

Authors and Affiliations

Department of Mathematics, Indiana University, Bloomington, US
Wai-Tong Louis Fan
Center of Mathematical Sciences and Applications, Harvard University, Cambridge, US
Wai-Tong Louis Fan
Department of Mathematics, University of Wisconsin–Madison, Madison, US
Brandon Legried & Sebastien Roch

Corresponding author

Correspondence to Sebastien Roch.

About this article

Cite this article

Fan, WT.L., Legried, B. & Roch, S. Impossibility of Consistent Distance Estimation from Sequence Lengths Under the TKF91 Model. Bull Math Biol 82, 123 (2020). https://doi.org/10.1007/s11538-020-00801-3

Received: 26 May 2020
Accepted: 31 August 2020
Published: 13 September 2020
DOI: https://doi.org/10.1007/s11538-020-00801-3
