Abstract
The recently introduced wald space models phylogenetic trees from an evolutionary perspective. We show that it is a stratified space and propose algorithms to compute geodesics. As an application, we compute a Fréchet mean of three trees of different topologies that is fully resolved, unlike in BHV-space. Preliminary results on both geodesics and means suggest that wald space features less stickiness than BHV-space, making it an alternative model for statistical investigations.
Acknowledging DFG HU 1575/7, DFG GK 2088, DFG SFB 1465 and the Niedersachsen Vorab of the Volkswagen Foundation.
1 Introduction
Phylogenetic trees reflect the evolution of biological species. They are built from genetic variation over a set of taxa. Curiously, building them for the same set of taxa from different genes often results in fundamentally different trees, e.g. Rokas et al. (2003). This calls for statistics, for instance averaging over different trees while controlling their uncertainty. It also calls for geometry, namely designing suitable spaces of trees that are both biologically meaningful and numerically tractable.
A seminal model was proposed twenty years ago by Billera et al. (2001), abbreviated as the BHV-model. It has the favorable property of being a Riemannian stratified space of globally nonpositive curvature, thus admitting unique geodesics and unique Fréchet means. Additionally, since it is locally flat, an abundance of successful algorithms has been developed for their computation that suffer only from inherent combinatorial complexity, e.g. Owen (2011); Bačák (2014); Miller et al. (2015); Brown and Owen (2018).
While this model is mathematically intriguing, more recently new models have been developed with geometries more closely reflecting stochastic biological fundamentals of gene mutations, e.g. Moulton and Steel (2004); Shiers et al. (2016); Garba et al. (2020). In Garba et al. (2018), metrics for phylogenetic trees based on the information geometry of the two-state and four-state model were proposed (four states because gene entries are taken from one of the four nucleotide bases). This study was continued in Garba et al. (2020, 2021) and, as a further simplification, a continuous model has been proposed with moments matching those of the two-state model.
In this contribution, we briefly review the definition of our new wald space (cf. Garba et al. (2020)) and propose algorithms to compute geodesics and Fréchet means. On the one hand, the wald space is geometrically more challenging. It is a stratified space that is isometrically embedded in the space \(\mathcal {P}\) of symmetric positive definite \(N\times N\) matrices (where \(N\in \mathbb {N}\) is the number of taxa), equipped with the well-known affine invariant geometry of globally nonpositive curvature; hence the need for algorithms as sophisticated as those for BHV-space. On the other hand, we believe it is biologically more meaningful than BHV-space. For example, in BHV-space the distance between two different trees whose edge lengths become arbitrarily large diverges to infinity. In wald space, such trees converge to the completely disconnected forest, a member of the wald space, along with other forests. Hence these two trees become more and more similar. Simulations and data analyses reveal an advantage of wald space: degenerate trees seem to be less sticky (sticky means have degenerate limiting distributions) in wald space than in BHV-space, cf. Hotz et al. (2013); Huckemann et al. (2015); Barden et al. (2013, 2018), thus more easily allowing for statistical inference.
Wald space was first proposed at the Oberwolfach workshop 1804 (2018) in the Black Forest, which is the Schwarzwald in German.
2 Wald Space
Let \(N\in \mathbb {N}\) denote the number of taxa. A phylogenetic forest \((F,\ell )\) is
(i) a forest \(F = (V,E)\) with a finite number of vertices V, undirected edges E such that any two vertices \(u,v \in V\) are connected by at most one edge denoted by \(\{u,v\}\) and labeled vertices \(L = \{1,\dots ,N\} \subseteq V\), where \(v\in V\setminus L\) implies that \(\deg (v)\ge 3\),
(ii) with a mapping \(\ell :E\rightarrow (0,\infty )\).
Two phylogenetic forests are equivalent, \((F_1,\ell _1)\sim (F_2,\ell _2)\), if their label sets agree \(L_1 = L = L_2\) and if there is a graph isomorphism \(f:V_1\rightarrow V_2\) such that
(i) \(f(u) = u\) for all \(u\in L\), and
(ii) \(\ell _1(\{u, v\}) = \ell _2(\{f(u), f(v)\})\) for all \(\{u, v\}\in E_1\).
Definition 1
Every equivalence class \(W=[F,\ell ]\) is called a wald, and all equivalence classes form the wald space \({\mathcal {W}_{}}\); its geometric structure is defined further below. Disregarding the edge length map \(\ell \), every equivalence class of forests F with respect to (i) above is a wald topology. For a given wald \(W=[F,\ell ]\), the grove of W is \({\mathcal {W}_{W}}\), which comprises all \(W'=[F',\ell '] \in {\mathcal {W}_{}}\) where \(F'\) and F have the same wald topology.
In the following, for any connected \(u,v\in V\), E(u, v) is the set of edges along the unique path connecting u and v. For \(u=v\), we set any sum over E(u, u) equal to zero.
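With this notation, the map \(\phi \) sending \(W=[F,\ell ]\) to the \(N\times N\) matrix \((\rho _{uv})_{u,v=1}^N\) with coordinate entry at \(u,v \in L\),
\[\rho _{uv} = \begin{cases} \exp \Big (-\sum _{e\in E(u,v)} \ell (e)\Big ) & \text {if } u \text { and } v \text { are connected},\\ 0 & \text {otherwise}, \end{cases} \qquad (1)\]
is well defined and maps \({\mathcal {W}_{}}\) injectively into the set \(\mathcal {P}\) of symmetric positive definite \(N \times N\) matrices, cf. Garba et al. (2020).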
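As an illustration, the following minimal sketch evaluates the matrix of a small forest. It assumes that the entry at \(u,v\) is \(\exp (-\sum _{e\in E(u,v)}\ell (e))\) for connected labels and 0 otherwise, which matches the reparametrization \(\lambda = 1-\exp (-\ell )\) used later. The function name `wald_matrix`, the 0-based labels and the edge dictionary format are hypothetical choices made for this example only.

```python
import math
from collections import deque


def wald_matrix(n_labels, edges):
    """Return the N x N matrix of a phylogenetic forest as nested lists.

    n_labels: number N of labelled vertices, assumed to be 0, ..., N-1
              (the text uses 1, ..., N).
    edges:    dict mapping vertex pairs (u, v) to positive edge lengths;
              interior vertices may use any other hashable identifiers.
    """
    # Adjacency list of the forest.
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))

    rho = [[0.0] * n_labels for _ in range(n_labels)]
    for u in range(n_labels):
        # Breadth-first search: in a forest the path from u to any vertex
        # of its component is unique, so this yields the sum over E(u, v).
        dist = {u: 0.0}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            for y, length in adj.get(x, []):
                if y not in dist:
                    dist[y] = dist[x] + length
                    queue.append(y)
        for v in range(n_labels):
            if v in dist:  # labels in other components keep the entry 0
                rho[u][v] = math.exp(-dist[v])
    return rho


# A star tree on labels 0, 1, 2 with interior vertex "a"; label 3 is isolated.
example = wald_matrix(4, {(0, "a"): 1.0, (1, "a"): 2.0, (2, "a"): 0.5})
```

In this example the diagonal entries equal 1 (empty sum), and the row and column of the isolated label vanish off the diagonal.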
Recall from Garba et al. (2020, 2021) that the affine invariant Riemannian metric on \(\mathcal {P}\) corresponds to the Fisher information geometry for zero-mean nondegenerate N-dimensional Gaussians induced by tree-indexed Gaussian processes, a continuous generalisation of the two-state model. This metric has the advantage of turning \(\mathcal {P}\) into a Riemannian manifold of global nonpositive curvature (e.g. Lang (1999)), guaranteeing unique geodesics and unique Fréchet means (e.g. Sturm (2003)). The squared distance induced on \(\mathcal {P}\) is given by
\[d_\mathcal {P}^2(P,Q) = \mathrm {tr}\left( \log \left( \sqrt{P}^{-1} Q \sqrt{P}^{-1}\right) ^2\right) = \sum _{i=1}^N (\log \mu _i)^2,\]
where \(\sqrt{P}\) is the unique positive definite square root of \(P\) and \(\mu _i\) are the eigenvalues of \(P^{-1}Q\).
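The eigenvalue form of the distance is straightforward to evaluate numerically. The sketch below assumes the standard normalization \(d_\mathcal {P}^2(P,Q)=\sum _i (\log \mu _i)^2\) with \(\mu _i\) the eigenvalues of \(P^{-1}Q\); the function name `spd_distance` is a hypothetical choice, and a Cholesky factor is used in place of \(\sqrt{P}\) because it leads to the same eigenvalues.

```python
import numpy as np


def spd_distance(P, Q):
    """Affine invariant distance between symmetric positive definite P and Q."""
    # With P = C C^T (Cholesky), the matrix C^{-1} Q C^{-T} is symmetric and
    # has the same eigenvalues as P^{-1} Q, so eigvalsh applies.
    C = np.linalg.cholesky(P)
    A = np.linalg.solve(C, Q)        # C^{-1} Q
    M = np.linalg.solve(C, A.T).T    # C^{-1} Q C^{-T}
    mu = np.linalg.eigvalsh((M + M.T) / 2.0)  # symmetrize against round-off
    return float(np.sqrt(np.sum(np.log(mu) ** 2)))


P = np.array([[1.0, 0.3], [0.3, 1.0]])
Q = np.array([[1.0, 0.6], [0.6, 1.0]])
print(spd_distance(P, Q))
```

Affine invariance means that `spd_distance(A @ P @ A.T, A @ Q @ A.T)` agrees with `spd_distance(P, Q)` for every invertible `A`, which gives a quick sanity check.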
Definition 2
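The metric \(d_{{\mathcal {W}_{}}}\) of the wald space is the pullback of \(d_\mathcal {P}\) under \(\phi \), which is given for \(W_1,W_2\in {\mathcal {W}_{}}\) by
\[d_{{\mathcal {W}_{}}}(W_1,W_2) = \inf _{\gamma } L_{d_\mathcal {P}}(\gamma ),\]
where the infimum is taken over all continuous paths \(\gamma :[0,1]\rightarrow \phi ({\mathcal {W}_{}})\) with \(\gamma (0)=\phi (W_1)\) and \(\gamma (1)=\phi (W_2)\), and \(L_{d_\mathcal {P}}(\gamma )\) is the length of the path \(\gamma \) measured in \(d_\mathcal {P}\). If no such path exists, we set \(d_{{\mathcal {W}_{}}}(W_1,W_2) = \infty \).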
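As previously noted, trees with edge lengths \(\ell \) tending to infinity move infinitely far apart in the BHV geometry. In the wald geometry the distance between these trees goes to zero. This is reflected in the following reparametrization \(W = [F,\lambda ]\) with \(\lambda := 1 - \exp (-\ell )\), recasting the matrix entries (1) of \(\phi (W)\) as
\[\rho _{uv} = \begin{cases} \prod _{e\in E(u,v)} \big (1-\lambda (e)\big ) & \text {if } u \text { and } v \text { are connected},\\ 0 & \text {otherwise}. \end{cases}\]
In particular, if \(W=[F,\lambda ]\), \(F=(V,E)\), has |E| edges, vectorizing \(\lambda \in (0,1)^{|E|}\), we have the following identification for the grove of W:
\[{\mathcal {W}_{W}} \cong (0,1)^{|E|}.\]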
Theorem 1
1. For every wald \(W = [F,\lambda ]\), \(F=(V,E)\) with grove \({\mathcal {W}_{W}}\), the mapping \((0,1)^{|E|}\cong {\mathcal {W}_{W}} {\mathop {\rightarrow }\limits ^{\phi }}\mathcal {P}\) is an embedding.
2. If \(W = [F,\lambda ]\) with a fully resolved (i.e. binary) tree F then \({\mathcal {W}_{W}}\) is an open subset of \({\mathcal {W}_{}}\).
Proof
cf. Lueg et al. (2021).
In consequence, \({\mathcal {W}_{}}\) is a stratified space with strata given by groves. As BHV-space can be viewed as a subset of wald space, cf. Garba et al. (2020), BHV-orthants are subsets of groves. In contrast to BHV-space, groves are not only connected to the star stratum (trees without interior edges), they are also connected to forest strata including the completely disconnected forest (consisting of N isolated vertices, no edges), which lies on the boundary of the star stratum.
3 Geodesics in Wald Space
We propose different algorithms to compute geodesics between two fully resolved trees \(W_1\) and \(W_2\), where Algorithm 4 is only applicable if \(W_1\) and \(W_2\) lie in a common grove \({\mathcal {W}_{W}}\). Dropping the embedding map \(\phi \), we consider wald space \({\mathcal {W}_{}}\) as a subset of the ambient space \(\mathcal {P}\). To this end, for \(P,Q\in \mathcal {P}\), denote the unique geodesic between P and Q by \(\gamma _{P,Q}:[0,1]\rightarrow \mathcal {P}\), the Riemannian exponential and logarithm by \(\mathrm {Exp}_P^{(\mathcal {P})}:T_P\mathcal {P}\rightarrow \mathcal {P}\) and \(\mathrm {Log}_P^{(\mathcal {P})}:\mathcal {P}\rightarrow T_P\mathcal {P}\), respectively, and the orthogonal tangent space projection by \(\pi _W:T_P\mathcal {P}\rightarrow T_W{\mathcal {W}_{}}\). Further, define the projection \(\pi :\mathcal {P}\rightarrow {\mathcal {W}_{}}, P\mapsto \pi (P):= \mathop {\mathrm {argmin}}\nolimits _{W\in {\mathcal {W}_{}}} d_{\mathcal {P}}(P,W)\), where \(\pi \) is only well-defined for \(P\in \mathcal {P}\) close enough to \({\mathcal {W}_{}}\). The following is a very simple but naive algorithm.
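The ambient ingredients \(\gamma _{P,Q}\), \(\mathrm {Exp}_P^{(\mathcal {P})}\) and \(\mathrm {Log}_P^{(\mathcal {P})}\) have well-known closed forms for the affine invariant geometry, for instance \(\gamma _{P,Q}(t) = \sqrt{P}\,(\sqrt{P}^{-1}Q\sqrt{P}^{-1})^t\sqrt{P}\). The sketch below implements only these building blocks; it is not one of the algorithms of this section, and the projections \(\pi \) and \(\pi _W\) onto wald space are not covered. All function names are hypothetical.

```python
import numpy as np


def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigenvalues."""
    w, U = np.linalg.eigh(S)
    return (U * fun(w)) @ U.T  # U diag(fun(w)) U^T


def _whiten(P, S):
    """Return P^{1/2} and the symmetrized matrix P^{-1/2} S P^{-1/2}."""
    R = _sym_fun(P, np.sqrt)
    Rinv = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    M = Rinv @ S @ Rinv
    return R, (M + M.T) / 2.0


def spd_geodesic(P, Q, t):
    """Point gamma_{P,Q}(t) on the geodesic from P (t=0) to Q (t=1)."""
    R, M = _whiten(P, Q)
    return R @ _sym_fun(M, lambda w: w ** t) @ R


def spd_log(P, Q):
    """Riemannian logarithm Log_P(Q), a symmetric tangent vector at P."""
    R, M = _whiten(P, Q)
    return R @ _sym_fun(M, np.log) @ R


def spd_exp(P, X):
    """Riemannian exponential Exp_P(X) for a symmetric tangent vector X."""
    R, M = _whiten(P, X)
    return R @ _sym_fun(M, np.exp) @ R


P = np.array([[2.0, 0.5], [0.5, 1.0]])
Q = np.array([[1.0, 0.2], [0.2, 3.0]])
midpoint = spd_geodesic(P, Q, 0.5)
```

By construction `spd_exp(P, t * spd_log(P, Q))` coincides with `spd_geodesic(P, Q, t)` up to round-off, so either form can be used for small geodesic steps in the ambient space.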
The next algorithm makes small (approximately geodesic) steps and successively takes the geodesic from the newest point to the destination (note the \(X_{i-1}\) and \(Y_{i-1}\) in the subscript in the update step).
The following two algorithms are inspired by Schmidt et al. (2006). They update a given path iteratively and perform a straightening of the path, eventually leading to a geodesic (cf. Figs. 1–4).
Exploiting the manifold structure of groves, for two walds \(W_1,W_2\in {\mathcal {W}_{[F]}}\) with the same fully resolved tree F, we change Algorithm 3 slightly and thus avoid using the projection.
We measure the quality of a proposal \((X_1,\dots ,X_n)\), \(3\le n\in \mathbb {N}\) by its length,
and its energy,
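For a discrete path these two quantities can be evaluated directly from the ambient distance. The sketch below assumes the common discretizations \(L = \sum _{i=1}^{n-1} d_\mathcal {P}(X_i,X_{i+1})\) and \(E = \frac{1}{2}\sum _{i=1}^{n-1} d_\mathcal {P}^2(X_i,X_{i+1})\); the factor \(\frac{1}{2}\) is an assumption, since normalizations of the discrete energy vary. The function names are hypothetical.

```python
import numpy as np


def spd_distance(P, Q):
    """Affine invariant distance from the eigenvalues of P^{-1} Q."""
    mu = np.linalg.eigvals(np.linalg.solve(P, Q)).real
    return float(np.sqrt(np.sum(np.log(mu) ** 2)))


def path_length(points):
    """Sum of distances between consecutive points of a discrete path."""
    return sum(spd_distance(a, b) for a, b in zip(points[:-1], points[1:]))


def path_energy(points):
    """Half the sum of squared consecutive distances (assumed normalization)."""
    return 0.5 * sum(spd_distance(a, b) ** 2
                     for a, b in zip(points[:-1], points[1:]))


# Three equidistant points on the geodesic from I to e^2 * I.
path = [np.eye(2), np.e * np.eye(2), np.e ** 2 * np.eye(2)]
print(path_length(path), path_energy(path))
```

With this normalization the Cauchy-Schwarz inequality gives \(L^2 \le 2(n-1)E\), with equality exactly when consecutive points are equidistant, which is why the energy also penalizes unevenly spaced proposals.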
4 Comparing Fréchet Means
For illustration, we take \(n=3\) trees \(W_1,\dots ,W_n\in {\mathcal {W}_{}}\) from Nye et al. (2016) depicted in Fig. 5, each having \(N=5\) leaves (3 taxa and the root were removed from the original trees for computational tractability). We compute their Fréchet means
\[W^* \in \mathop {\mathrm {argmin}}\limits _{W} \sum _{j=1}^n d(W,W_j)^2,\]
where \(d\) denotes the respective metric, in BHV-space and in wald space, cf. Fig. 6. For computation we use the algorithm of Sturm (2003). In general, other types of means can also be computed (e.g. the Riemannian 1-center, cf. Arnaudon et al. (2013)).
While the Fréchet mean is unique in BHV-space, its uniqueness in wald space is not guaranteed. For both spaces we performed 15 iterations, after which successive iterates were less than 0.05 apart. Remarkably, the mean tree in BHV-space is a star tree. In wald space, however, it is a fully resolved tree.
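The iteration of Sturm (2003) moves the current estimate along the geodesic towards a randomly drawn sample point, with step size \(1/(k+1)\) in step \(k\). The sketch below runs this scheme in the ambient space \(\mathcal {P}\) only, where geodesics are available in closed form; it is not the wald space or BHV-space computation reported above, and the function names, iteration count and seed are hypothetical choices.

```python
import numpy as np


def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigenvalues."""
    w, U = np.linalg.eigh(S)
    return (U * fun(w)) @ U.T


def spd_geodesic(P, Q, t):
    """Point at parameter t on the affine invariant geodesic from P to Q."""
    R = _sym_fun(P, np.sqrt)
    Rinv = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    M = Rinv @ Q @ Rinv
    return R @ _sym_fun((M + M.T) / 2.0, lambda w: w ** t) @ R


def sturm_mean(samples, n_iter=2000, seed=0):
    """Approximate the Frechet mean of SPD matrices by Sturm's iteration."""
    rng = np.random.default_rng(seed)
    mean = samples[rng.integers(len(samples))]
    for k in range(1, n_iter):
        target = samples[rng.integers(len(samples))]
        # Step a fraction 1/(k+1) of the way towards the drawn sample.
        mean = spd_geodesic(mean, target, 1.0 / (k + 1))
    return mean


samples = [np.diag([1.0, 1.0]), np.diag([np.e, 1.0]), np.diag([np.e ** 2, 1.0])]
estimate = sturm_mean(samples)
```

For these commuting matrices the Fréchet mean is the entrywise geometric mean \(\mathrm {diag}(e,1)\), so the estimate can be compared against a known value; the stochastic iteration only approaches it slowly, at roughly the rate \(1/\sqrt{k}\).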
References
Arnaudon, M., Nielsen, F.: On approximating the Riemannian 1-center. Computational Geometry 46(1), 93–104 (2013)
Barden, D., Le, H., Owen, M.: Central limit theorems for Fréchet means in the space of phylogenetic trees. Electron. J. Probab 18(25), 1–25 (2013)
Barden, D., Le, H., Owen, M.: Limiting behaviour of Fréchet means in the space of phylogenetic trees. Annals of the Institute of Statistical Mathematics 70(1), 99–129 (2018). https://doi.org/10.1007/s10463-016-0582-9
Bačák, M.: Computing Medians and Means in Hadamard Spaces. SIAM Journal on Optimization 24(3), 1542–1566 (2014)
Billera, L., Holmes, S., Vogtmann, K.: Geometry of the space of phylogenetic trees. Advances in Applied Mathematics 27(4), 733–767 (2001)
Brown, D.G., Owen, M.: Mean and variance of phylogenetic trees. arXiv:1708.00294 [math, q-bio, stat] (2018)
Garba, M.K., Nye, T.M., Boys, R.J.: Probabilistic Distances Between Trees. Systematic Biology 67(2), 320–327 (2018)
Garba, M. K., Nye, T. M. W., Lueg, J., Huckemann, S. F.: Information geometry for phylogenetic trees. Journal of Mathematical Biology 82(3), 1–39 (2021). https://doi.org/10.1007/s00285-021-01553-x
Garba, M. K., Nye, T. M. W., Lueg, J., Huckemann, S. F.: Information metrics for phylogenetic trees via distributions of discrete and continuous characters. In: Nielsen, F., Barbaresco, F. (eds.) GSI 2021. LNCS, vol. 12829, pp. 701–709. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-80209-7_75
Hotz, T., Huckemann, S., Le, H., Marron, J.S., Mattingly, J., Miller, E., Nolen, J., Owen, M., Patrangenaru, V., Skwerer, S.: Sticky central limit theorems on open books. Annals of Applied Probability 23(6), 2238–2258 (2013)
Huckemann, S., Mattingly, J.C., Miller, E., Nolen, J.: Sticky central limit theorems at isolated hyperbolic planar singularities. Electronic Journal of Probability 20(78), 1–34 (2015)
Lang, S.: Fundamentals of Differential Geometry. Graduate Texts in Mathematics. Springer-Verlag, New York (1999)
Lueg, J., Nye, T., Garba, M., Huckemann, S.F.: Phylogenetic wald spaces. Manuscript (2021)
Miller, E., Owen, M., Provan, J.S.: Polyhedral computational geometry for averaging metric phylogenetic trees. Advances in Applied Mathematics 68, 51–91 (2015)
Moulton, V., Steel, M.: Peeling phylogenetic ‘oranges’. Advances in Applied Mathematics 33(4), 710–727 (2004)
Nye, T.M., Tang, X., Weyenberg, G., Yoshida, Y.: Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees. arXiv preprint arXiv:1609.03045 (2016)
Owen, M.: Computing geodesic distances in tree space. SIAM Journal on Discrete Mathematics 25(4), 1506–1529 (2011)
Rokas, A., Williams, B.L., King, N., Carroll, S.B.: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425(6960), 798–804 (2003)
Schmidt, F.R., Clausen, M., Cremers, D.: Shape Matching by Variational Computation of Geodesics on a Manifold. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 142–151. Springer, Heidelberg (2006). https://doi.org/10.1007/11861898_15
Shiers, N., Zwiernik, P., Aston, J.A., Smith, J.Q.: The correlation space of Gaussian latent tree models and model selection without fitting. Biometrika 103(3), 531–545 (2016)
Sturm, K.: Probability measures on metric spaces of nonpositive curvature. Contemporary Mathematics 338, 357–390 (2003)
© 2021 Springer Nature Switzerland AG
Cite this paper
Lueg, J., Garba, M.K., Nye, T.M.W., Huckemann, S.F. (2021). Wald Space for Phylogenetic Trees. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2021. Lecture Notes in Computer Science(), vol 12829. Springer, Cham. https://doi.org/10.1007/978-3-030-80209-7_76
DOI: https://doi.org/10.1007/978-3-030-80209-7_76
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80208-0
Online ISBN: 978-3-030-80209-7
eBook Packages: Computer Science (R0)