Wald Space for Phylogenetic Trees

Lueg, Jonas; Garba, Maryam K.; Nye, Tom M. W.; Huckemann, Stephan F.

doi:10.1007/978-3-030-80209-7_76

Jonas Lueg¹⁰,
Maryam K. Garba¹¹,
Tom M. W. Nye¹² &
…
Stephan F. Huckemann¹⁰

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12829))

Included in the following conference series:

International Conference on Geometric Science of Information

2541 Accesses
4 Citations

Abstract

The recently introduced wald space models phylogenetic trees from an evolutionary perspective. We show that it is a stratified space and propose algorithms to compute geodesics. In application we compute a Fréchet mean of three trees of different topologies that is fully resolved, unlike in BHV-space. Both, preliminary results on geodesics and on means suggest that wald space features less stickiness than BHV-space, making it an alternative model for statistical investigations.

Acknowledging DFG HU 1575/7, DFG GK 2088, DFG SFB 1465 and the Niedersachsen Vorab of the Volkswagen Foundation.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Information geometry for phylogenetic trees

Article Open access 15 February 2021

New Gromov-Inspired Metrics on Phylogenetic Tree Space

Article 02 January 2018

Cluster Matching Distance for Rooted Phylogenetic Trees

1 Introduction

Phylogenetic trees reflect biological species’ evolution. They are built from genetic variation over a set of taxa. Curiously, building them for the same set of taxa, from different genes, however, often result in fundamentally different trees, e.g. Rokas et al. (2003). This generates a call for statistics, for instance averaging over different trees while controlling their uncertainty. Also, this is a call for geometry, designing suitable spaces of trees that are both, biologically meaningful and numerically tractable.

A seminal model has been proposed twenty years ago by Billera et al. (2001), abbreviated as the BHV-model. It has the favorable property of being a Riemann stratified space of globally nonpositive curvature, thus admitting unique geodesics and unique Fréchet means. Additionally, since it is locally flat, an abundance of successful algorithms have been developed for their computation that suffer only from inherent combinatorial complexity, e.g. Owen (2011); Bačák (2014); Miller et al. (2015); Brown and Owen (2018).

While this model is mathematically intriguing, more recently new models have been developed with geometries more closely reflecting stochastic biological fundamentals of gene mutations, e.g. Moulton and Steel (2004); Shiers et al. (2016); Garba et al. (2020). In Garba et al. (2018), metrics for phylogenetic trees based on the information geometry of the two-state and four-state model were proposed (four states because gene entries are taken from one of the four nucleotide bases). This study was continued in Garba et al. (2020, 2021) and - as a further simplification - a continuous model has been proposed with moments matching those of the two-state model.

In this contribution, we briefly review the definition of our new wald space (cf. Garba et al. (2020)) and propose algorithms to compute geodesics and Fréchet means. On the one hand, the wald space is geometrically more challenging. It is a stratified space that is isometrically embedded in the space of positive symmetric $N\times N$ matrices $\mathcal {P}$ (where $N\in \mathbb {N}$ is the number of taxa) equipped with the well known affine invariant geometry of globally nonpositive curvature – hence the need for algorithms as sophisticated as those of the BHV-space. On the other hand, we believe it is biologically more meaningful than the BHV-space. For example, in BHV-space the distance of two different trees with edge lengths becoming arbitrary large diverges to infinity. In wald space, such trees converge to the completely disconnected forest, a member of the wald space, along with other forests. Hence these two trees become more and more similar. Simulations and data analyses reveal advantage of wald space: degenerate trees seem to be less sticky (sticky means have degenerate limiting distributions) in wald space than in BHV-space, cf. Hotz et al. (2013); Huckemann et al. (2015); Barden et al. (2013, 2018), thus more easily allowing for statistical inference.

Wald space was first proposed at the Oberwolfach workshop 1804 (2018) in the black forest which is the Schwarzwald in German.

2 Wald Space

Let $N\in \mathbb {N}$ denote the number of taxa. A phylogenetic forest $(F,\ell )$ is

(i)
a forest $F = (V,E)$ with a finite number of vertices V, undirected edges E such that any two vertices $u,v \in V$ are connected by at most one edge denoted by $\{u,v\}$ and labeled vertices $L = \{1,\dots ,N\} \subseteq V$, where $v\in V\setminus L$ implies that $\deg (v)\ge 3$,
(ii)
with a mapping $\ell :E\rightarrow (0,\infty )$.

Two phylogenetic forests are equivalent, $(F_1,\ell _1)\sim (F_2,\ell _2)$, if their label sets agree $L_1 = L = L_2$ and if there is a graph isomorphism $f:V_1\rightarrow V_2$ such that

(i)
$f(u) = u$ for all $u\in L$, and
(ii)
$\ell _1(\{u, v\}) = \ell _2(\{f(u), f(v)\})$ for all $\{u, v\}\in E_1$.

Definition 1

Every equivalence class $W=[F,\ell ]$ is called a wald and all equivalence classes form the wald space ${\mathcal {W}_{}}$, its geometric structure is defined further below. Disregarding the edge lengths map $\ell $, every equivalence class of forests F with regards to (i) above, is a wald topology. For a given wald $W=[F,\ell ]$, the grove of W is ${\mathcal {W}_{W}}$ which comprises all $W'=[F',\ell '] \in {\mathcal {W}_{}}$ where $F'$ and F have the same wald topology.

In the following, for any connected $u,v\in V$, E(u, v) is the set of edges along the unique path connecting u and v. For $u=v$, we set any sum over E(u, u) equal zero.

With this notation, the map $\phi $ sending $W=[F,\ell ]$ to the $N\times N$ matrix with coordinate entry at $u,v \in L$,

$$\begin{aligned} \big (\phi (W)\big )_{uv} = \big (\phi ([F,\ell ])\big )_{uv} := {\left\{ \begin{array}{ll} \exp \Big (-\sum _{e\in E(u,v)} \ell (e)\Big ),&{}\text {if } u \text { and } v \text { are connected},\\ 0,&{}\text {else}, \end{array}\right. } \end{aligned}$$

(1)

is well defined and maps ${\mathcal {W}_{}}$ injectively into the set of symmetric positive $N \times N$ matrices $\mathcal {P}$, cf. Garba et al. (2020).

Recall from Garba et al. (2020, 2021) that the affine invariant Riemannian metric on $\mathcal {P}$ corresponds to the Fisher information geometry for zero-mean nondegenerate N-dimensional Gaussians induced by tree-indexed Gaussian processes, a continuous generalisation of the two-state model. This metric has the advantage of turning $\mathcal {P}$ into a Riemannian manifold of global nonpositive curvature (e.g. Lang (1999)), guaranteeing unique geodesics and unique Fréchet means (e.g. Sturm (2003)). The squared distance induced on $\mathcal {P}$ is given by

$$\begin{aligned} d_\mathcal {P}^2(P,Q) = \mathrm {Tr} \bigg [\log \Big (\sqrt{P}^{-1}Q\sqrt{P}^{-1}\Big )^2\bigg ] = \sum _{i = 1}^N \log (\mu _i)^2, \end{aligned}$$

where $\sqrt{P}$ is the unique positive definite square root of $P$ and $\mu _i$ are the eigenvalues of $P^{-1}Q$.

Definition 2

The metric $d_{{\mathcal {W}_{}}}$ of the wald space is the pullback of $d_\mathcal {P}$ under $\phi $, which is given for $W_1,W_2\in {\mathcal {W}_{}}$ by

$$\begin{aligned} d_{{\mathcal {W}_{}}}(W_1, W_2) = \inf _{\begin{array}{c} \gamma :[0,1]\rightarrow {\mathcal {W}_{}}\\ \phi \circ \gamma ~\mathrm{cont.}\;\mathrm{path},\\ \gamma (0)=W_1, \gamma (1)=W_2 \end{array}} L_{d_\mathcal {P}}(\phi \circ \gamma ), \end{aligned}$$

where $L_{d_\mathcal {P}}(\gamma )$ is the length of the path $\gamma $ measured in $d_\mathcal {P}$. If no such path exists, we set $d_{{\mathcal {W}_{}}}(W_1,W_2) = \infty $.

As previously noted, trees with edge lengths $\ell $ tending to infinity move infinitively far apart in the BHV geometry. In the wald geometry the distance between these trees goes to zero. This is reflected in the following reparametrization $W = [F,\lambda ]$ with $\lambda := 1 - \exp (-\ell )$, recasting (1) as

$$\begin{aligned} \big (\phi (W)\big )_{uv} = \big (\phi ([F,\lambda ])\big )_{uv} := {\left\{ \begin{array}{ll} \prod _{e\in E(u,v)} \big (1 - \lambda (e)\big ),&{}\text {if } u \text { and } v \text { are connected},\\ 0,&{}\text {else}. \end{array}\right. } \end{aligned}$$

In particular, if $W=[F,\lambda ]$, $F=(V,E)$, has |E| edges, vectorizing $\lambda \in (0,1)^{|E|}$, we have the following identification for the grove of W:

$$ {\mathcal {W}_{W}}\cong (0,1)^{|E|}\,.$$

Theorem 1

1.
For every wald $W = [F,\lambda ]$, $F=(V,E)$ with grove ${\mathcal {W}_{W}}$, the mapping $(0,1)^{|E|}\cong {\mathcal {W}_{W}} {\mathop {\rightarrow }\limits ^{\phi }}\mathcal {P}$ is an embedding.
2.
If $W = [F,\lambda ]$ with a fully resolved (i.e. binary) tree F then ${\mathcal {W}_{W}}$ is an open subset of ${\mathcal {W}_{}}$.

Proof

cf. Lueg et al. (2021).

In consequence, ${\mathcal {W}_{}}$ is a stratified space with strata given by groves. As BHV-space can be viewed as a subset of wald space, cf. Garba et al. (2020), BHV-orthants are subsets of groves. In contrast to BHV-space, groves are not only connected to the star stratum (trees without interior edges), they are also connected to forest strata including the completely disconnected forest (consisting of N isolated vertices, no edges), which lies on the boundary of the star stratum.

3 Geodesics in Wald Space

We propose different algorithms to compute geodesics between two fully resolved trees $W_1$ and $W_2$, where Algorithm 4 is only applicable if $W_1$ and $W_2$ lie in a common grove ${\mathcal {W}_{W}}$. Dropping the embedding map $\phi $, we consider wald space ${\mathcal {W}_{}}$ as a subset of the ambient space $\mathcal {P}$. To this end, for $P,Q\in \mathcal {P}$, denote the unique geodesic between P and Q by $\gamma _{P,Q}:[0,1]\rightarrow \mathcal {P}$, the Riemann exponential and logarithm by $\mathrm {Exp}_P^{(\mathcal {P})}:T_P\mathcal {P}\rightarrow \mathcal {P}$ and $\mathrm {Log}_P^{(\mathcal {P})}:\mathcal {P}\rightarrow T_P\mathcal {P}$, respectively, the orthogonal tangent space projection by $\pi _W:T_P\mathcal {P}\rightarrow T_W{\mathcal {W}_{}}$ and define the projection $\pi :\mathcal {P}\rightarrow {\mathcal {W}_{}}, P\mapsto \pi (P):= \mathop {\mathrm {argmin}}\nolimits _{W\in {\mathcal {W}_{}}} d_{\mathcal {P}}(P,W)$, where $\pi $ is only well-defined for $P\in \mathcal {P}$ close enough to ${\mathcal {W}_{}}$. The following is a very simple but naive algorithm.

The next algorithm makes small (approximately geodesic) steps and successively takes the geodesic from the newest point to the destination (note the $X_{i-1}$ and $Y_{i-1}$ in the subscript in the update step).

The following two algorithms are inspired by Schmidt et al. (2006). They update a given path iteratively and perform a straightening of the path, eventually leading to a geodesic (cf. Figs. 1–4).

Exploiting the manifold structure of groves, for two walds $W_1,W_2\in {\mathcal {W}_{[F]}}$ with the same fully resolved tree F, we change Algorithm 3 slightly and thus avoid using the projection.

We measure the quality of a proposal $(X_1,\dots ,X_n)$, $3\le n\in \mathbb {N}$ by its length,

$$\begin{aligned} L(X_1,\dots ,X_n) = \sum _{i=1}^{n-1} d_{\mathcal {P}}(X_i, X_{i+1}) \end{aligned}$$

and its energy,

$$\begin{aligned} E(X_1,\dots ,X_n) = \frac{1}{2}\sum _{i = 1}^{n-1} d_{\mathcal {P}}(X_i, X_{i+1})^2. \end{aligned}$$

4 Comparing Fréchet Means

For illustration, we take $n=3$ trees $W_1,\dots ,W_n\in {\mathcal {W}_{}}$ from Nye et al. (2016) depicted in Fig. 5, each of which having $N=5$ leaves (3 taxa and the root were removed from the original trees for computational tractability). We compute their Fréchet means

$$\begin{aligned} W^*\in \mathop {\mathrm {argmin}}\limits _{W\in {\mathcal {W}_{}}}\sum _{k = 1}^n d_{{\mathcal {W}_{}}}\big (W_k, W\big )^2 \end{aligned}$$

in BHV-space and in wald space, cf. Fig. 6. For computation we use the algorithm of Sturm (2003). In general, the computation of other types of means is also possible (e.g. the Riemannian 1-center, cf. Arnaudon et al. (2013)).

While in BHV-space, the Fréchet mean is unique, in wald space its uniqueness is dubious. For both spaces we have performed 15 iterations after which the final subsequent iterates were less than 0.05 apart, respectively. Remarkably, the mean tree in BHV-space is a star tree. In wald space, however, it is a fully resolved tree.

References

Arnaudon, M., Nielsen, F.: On approximating the Riemannian 1-center. Computational Geometry 46(1), 93–104 (2013)
Article MathSciNet Google Scholar
Barden, D., Le, H., Owen, M.: Central limit theorems for Fréchet means in the space of phylogenetic trees. Electron. J. Probab 18(25), 1–25 (2013)
MATH Google Scholar
Barden, D., Le, H., Owen, M.: Limiting behaviour of Fréchet means in the space of phylogenetic trees. Annals of the Institute of Statistical Mathematics 70(1), 99–129 (2016). https://doi.org/10.1007/s10463-016-0582-9
Bačák, M.: Computing Medians and Means in Hadamard Spaces. SIAM Journal on Optimization 24(3), 1542–1566 (2014)
Article MathSciNet Google Scholar
Billera, L., Holmes, S., Vogtmann, K.: Geometry of the space of phylogenetic trees. Advances in Applied Mathematics 27(4), 733–767 (2001)
Article MathSciNet Google Scholar
Brown, D. G. and M. Owen (2018, May). Mean and Variance of Phylogenetic Trees. arXiv:1708.00294 [math, q-bio, stat]. arXiv: 1708.00294
Garba, M.K., Nye, T.M., Boys, R.J.: Probabilistic Distances Between Trees. Systematic Biology 67(2), 320–327 (2018)
Article Google Scholar
Garba, M. K., Nye, T. M. W., Lueg, J., Huckemann, S. F.: Information geometry for phylogenetic trees. Journal of Mathematical Biology 82(3), 1–39 (2021). https://doi.org/10.1007/s00285-021-01553-x
Garba, M. K., T. M. W. Nye, J. Lueg, and S. F. Huckemann (2021). Information metrics for phylogenetic trees via distributions of discrete and continuous characters. In: Nielsen, F., Barbaresco, F. (Eds.) GSI 2021, LNCS 12829, pp. 701–709 (2021). https://doi.org/10.1007/978-3-030-80209-7_75
Hotz, T., Huckemann, S., Le, H., Marron, J.S., Mattingly, J., Miller, E., Nolen, J., Owen, M., Patrangenaru, V., Skwerer, S.: Sticky central limit theorems on open books. Annals of Applied Probability 23(6), 2238–2258 (2013)
Article MathSciNet Google Scholar
Huckemann, S., Mattingly, J.C., Miller, E., Nolen, J.: Sticky central limit theorems at isolated hyperbolic planar singularities. Electronic Journal of Probability 20(78), 1–34 (2015)
MathSciNet MATH Google Scholar
Lang, S.: Fundamentals of Differential Geometry. Graduate Texts in Mathematics. Springer-Verlag, New York (1999)
Book Google Scholar
Lueg, J., T. Nye, M. Garba, and S. F. Huckemann (2021). Phylogenetic wald spaces. manuscript
Google Scholar
Miller, E., Owen, M., Provan, J.S.: July). Polyhedral computational geometry for averaging metric phylogenetic trees. Advances in Applied Mathematics 68, 51–91 (2015)
Article MathSciNet Google Scholar
Moulton, V., Steel, M.: Peeling phylogenetic ‘oranges’. Advances in Applied Mathematics 33(4), 710–727 (2004)
Google Scholar
Nye, T. M., X. Tang, G. Weyenberg, and Y. Yoshida (2016). Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees. arXiv preprint arXiv:1609.03045
Owen, M.: Computing geodesic distances in tree space. SIAM Journal on Discrete Mathematics 25(4), 1506–1529 (2011)
Article MathSciNet Google Scholar
Rokas, A., B. L. Williams, N. King, and S. B. Carroll (2003, October). Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425(6960), 798–804
Google Scholar
Schmidt, Frank R., Clausen, Michael, Cremers, Daniel: Shape Matching by Variational Computation of Geodesics on a Manifold. In: Franke, Katrin, Müller, Klaus-Robert., Nickolay, Bertram, Schäfer, Ralf (eds.) DAGM 2006. LNCS, vol. 4174, pp. 142–151. Springer, Heidelberg (2006). https://doi.org/10.1007/11861898_15
Chapter Google Scholar
Shiers, N., Zwiernik, P., Aston, J.A., Smith, J.Q.: The correlation space of gaussian latent tree models and model selection without fitting. Biometrika 103(3), 531–545 (2016)
Article MathSciNet Google Scholar
Sturm, K.: Probability measures on metric spaces of nonpositive curvature. Contemporary mathematics 338, 357–390 (2003)
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Felix-Bernstein-Institute for Mathematical Statistics in the Biosciences, Georg-August-Universität, Göttingen, Germany
Jonas Lueg & Stephan F. Huckemann
Department of Mathematical Sciences, Bayero University, Kano, Nigeria
Maryam K. Garba
School of Mathematics, Statistics and Physics, Newcastle University, Newcastle upon Tyne, UK
Tom M. W. Nye

Authors

Jonas Lueg
View author publications
You can also search for this author in PubMed Google Scholar
Maryam K. Garba
View author publications
You can also search for this author in PubMed Google Scholar
Tom M. W. Nye
View author publications
You can also search for this author in PubMed Google Scholar
Stephan F. Huckemann
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jonas Lueg .

Editor information

Editors and Affiliations

Sony Computer Science Laboratories Inc., Tokyo, Japan
Frank Nielsen
THALES Land & Air Systems, Meudon, France
Frédéric Barbaresco

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Lueg, J., Garba, M.K., Nye, T.M.W., Huckemann, S.F. (2021). Wald Space for Phylogenetic Trees. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2021. Lecture Notes in Computer Science(), vol 12829. Springer, Cham. https://doi.org/10.1007/978-3-030-80209-7_76

Download citation

DOI: https://doi.org/10.1007/978-3-030-80209-7_76
Published: 14 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80208-0
Online ISBN: 978-3-030-80209-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us