Abstract
Topological data analysis has arisen as a promising tool to extract information on the structure of a wide variety of datasets. We analyze here its potential in two types of cancer studies. First, we compare time series of images from simulations of metastatic invasion in epithelial tissues. Calculating bottleneck distances between persistence diagrams, we can characterize and classify the advancing interfaces of cellular aggregates. Second, we compare mRNA expression values for genes involved in the cell cycle extracted from pancreas cancer tissue. We discuss how persistence information from different distances can provide insight into patient/gene clusters.
1 Introduction
Clinical and experimental studies of illness generate large amounts of data of very different natures. Consider cancer, for instance. Laboratory analyses of gene expression lead to large files containing measurements for different genes [15], see Fig. 1a. In contrast, experimental observations of normal and malignant cells [9] yield time series of images, see Fig. 1b. Being able to extract meaningful information from large biomedical datasets, regardless of their nature, is a challenge that requires the development of adequate mathematical and computational tools.
Topological data analysis (TDA) furnishes a framework that provides dimensionality reduction and robustness to noise [2] when studying data clouds, with a certain independence with respect to the metrics selected. Recent studies have pointed out the potential of TDA in biological applications [8, 13, 16]. Biomedical data can often be seen as point clouds in a space of dimension D. Whereas for images D is the spatial dimension, for gene expression datasets D is the number of patients or genes in the study. We will see how to use TDA to extract information in both settings. The paper is organized as follows. Section 2 applies TDA to automatically classify interfaces between healthy and malignant cells in two-dimensional images. Section 3 proposes a topology-based hierarchical clustering procedure for gene expression data. Finally, Sect. 4 summarizes our conclusions.
2 Classification of Interfaces
Competition between two different media (fluids, for instance) or populations is a ubiquitous phenomenon in many fields. Usually, an interface separating the two components forms. Being able to automatically characterize such an interface is important to identify patterns or stages in biological applications. Given several images representing the evolution of fragmented interfaces, our strategy proceeds in the following steps [1]:
1. Extract from each image a point cloud \(X\) defining the interface.
2. Build a Vietoris-Rips filtration \(V(X, r)\) for each point cloud based on the Euclidean distance, that is, a family of simplicial complexes formed by joining with edges and triangles the points at a distance smaller than a variable parameter \(r\), see [17].
3. Calculate the Betti numbers associated with each filtration: \(\mathrm{betti}_0(r)\) (number of components) and \(\mathrm{betti}_1(r)\) (number of holes) as the filtration parameter \(r\) varies.
4. For each identified component in each filtration, calculate the persistence interval \([r_b, r_d]\), that is, the filtration parameter values at which it appears, \(r_b\) (birth), and disappears, \(r_d\) (death). These intervals define the \(H_0\) homology.
5. For each identified hole in each filtration, calculate the persistence interval \([r_b, r_d]\). These intervals define the \(H_1\) homology.
6. Plot the persistence diagrams formed by the points \((r_b, r_d)\) defining the persistence intervals for components and holes in each filtration, see Fig. 2.
7. Calculate the bottleneck distance [11] between the \(H_1\) persistence diagrams.
8. Use k-means or a hierarchical clustering [10] approach to group the interfaces in clusters according to the level of detail required.
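Step 7 can be illustrated with a minimal sketch of the bottleneck distance between two small persistence diagrams. It assumes the diagrams are already available as lists of finite (birth, death) pairs, for instance from a persistent homology library. The function names are hypothetical, and the brute-force search over thresholds is a simple stand-in for the geometric algorithm of [11], suitable only for small diagrams.

```python
def _cost(p, q):
    """L-infinity cost between two diagram points; None stands for the diagonal."""
    if p is None and q is None:
        return 0.0
    if p is None:
        return (q[1] - q[0]) / 2.0
    if q is None:
        return (p[1] - p[0]) / 2.0
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def _has_perfect_matching(cost, threshold):
    """Augmenting-path bipartite matching using only edges with cost <= threshold."""
    n = len(cost)
    match_right = [-1] * n

    def try_assign(u, seen):
        for v in range(n):
            if cost[u][v] <= threshold and not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_assign(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    return all(try_assign(u, [False] * n) for u in range(n))


def bottleneck_distance(diag_a, diag_b):
    """Smallest t such that the diagrams can be matched with all costs <= t.

    Each diagram is padded with diagonal slots (None), one per point of the
    other diagram, so that any point may be matched to the diagonal.
    """
    left = list(diag_a) + [None] * len(diag_b)
    right = list(diag_b) + [None] * len(diag_a)
    if not left:
        return 0.0
    cost = [[_cost(p, q) for q in right] for p in left]
    # The optimal value is one of the pairwise costs: try them in increasing order.
    for t in sorted({c for row in cost for c in row}):
        if _has_perfect_matching(cost, t):
            return t
```

The resulting pairwise distances between frames form the matrix that step 8 passes to k-means or hierarchical clustering.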
For the simulation considered in Fig. 1b, a set of 12 images is classified by k-means into 3 groups: the first three frames correspond to initial stages in which the interface is close to an unfragmented smooth curve, the last two frames correspond to late stages of the invasion period with many fragments and interpenetration, while the remaining frames correspond to an intermediate stage in which fragments may detach and reattach, see Fig. 2.
The study of images involves point clouds in two- or three-dimensional spaces. Medical records containing the values of several variables monitored over a collection of patients belong to higher dimensional spaces. Their study presents new difficulties.
3 Grouping Data
Gene studies in cancer patients have provided large amounts of information which may help to identify genetic features of sickness [15]. We consider here measurements of mRNA gene expression data for pancreas cancer available in [6], taken from the TCGA (The Cancer Genome Atlas) study. In this case, data take the form of numeric matrices \(M = (m_{j,i})\) containing values for a collection of genes \(i = 1, \ldots, N\), from tissue samples corresponding to different patients \(j = 1, \ldots, J\).
The first step consists in normalizing the data. To do so [7], we calculate the means \(\mu_i\) and standard deviations \(\sigma_i\) for each gene over the patients and compute the normalized values \(\tilde {m}_{j,i} = \frac {m_{j,i} -\mu _i}{3\sigma _i}\). Then, we select a distance and a clustering strategy to group either patients using information from genes, or genes using information from patients.
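As an illustration, the normalization can be sketched as follows. This is a minimal sketch under two assumptions that the text does not fix: the standard deviation is taken as the population standard deviation, and genes with zero variance are mapped to 0 to avoid dividing by zero. The function name `normalize` is hypothetical.

```python
from statistics import mean, pstdev


def normalize(matrix):
    """Return (m[j][i] - mu_i) / (3 * sigma_i) for a patients-by-genes matrix.

    Rows are patients j, columns are genes i. mu_i and sigma_i are computed
    per gene over all patients (population standard deviation: an assumption).
    """
    columns = list(zip(*matrix))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [
            (value - mu) / (3.0 * sigma) if sigma > 0 else 0.0  # constant gene: assumption
            for value, mu, sigma in zip(row, mus, sigmas)
        ]
        for row in matrix
    ]
```

For example, `normalize([[1.0, 10.0], [3.0, 10.0]])` scales the first gene to ∓1/3 and leaves the constant second gene at 0.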
3.1 Distance Selection
To compare genes or patients, we can use a number of distances [5]:
- The Euclidean distance between two columns or rows \(m^1\) and \(m^2\) is their distance as vectors in a \(D\)-dimensional space, \(d(m^1, m^2) = \|m^1 - m^2\|_2\).
- The Earth Mover’s distance (EMD) provides the minimum cost of turning one column (resp. row) into the other [13]
$$\displaystyle \begin{aligned} emd(m^1,m^2)= \frac{\sum_{k=1}^D \sum_{\ell=1}^D c_{k,\ell} d_{k,\ell}}{ \sum_{k=1}^D \sum_{\ell=1}^D d_{k,\ell}}, \end{aligned}$$
where \(d_{k,\ell } = |m^1_k - m^2_\ell |\) is the ground distance, and \(c_{k,\ell}\) minimizes the cost \(\sum _{k=1}^D \sum _{\ell =1}^D c_{k,\ell } d_{k,\ell }\) subject to the constraints \(c_{k,\ell} \geq 0\), \(1 \leq k, \ell \leq D\), \( \sum _{k=1}^D \sum _{\ell =1}^D c_{k,\ell } = D, \) \( \sum _{k=1}^D c_{k,\ell } \leq 1, \; 1 \leq \ell \leq D, \) \( \sum _{\ell =1}^D c_{k,\ell } \leq 1, \; 1 \leq k \leq D. \) The EMD identifies patterns regardless of their location. The distance between two patient profiles that are equal except for a peak located at different genes would be small, which is inadequate as different genes may define different illnesses.
- Considering a set \(S\) of columns (resp. rows) \(m^1, m^2, \ldots, m^L\), the Fermat α-distance between any two of them relative to that set is [14]
$$\displaystyle \begin{aligned} d_{S,\alpha}(m^1,m^2)=\min\Bigl\{\sum_{\ell=1}^{k-1}\| y^{\ell+1}-y^\ell \|_2^\alpha \;\bigg\vert\; (y^1,\ldots,y^k) \text{ path from}\ m^1\ \text{to}\ m^2\ \text{in}\ S\Bigr\}, \end{aligned}$$
for any α > 1. For α = 1, the formula reduces to the Euclidean distance. The Fermat distance compares items in a set weighting information from all the other items in the same set, which is interesting when we want to compare gene profiles weighting information from cohorts of patients [3].
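The three distances listed above can be sketched in a few lines. This is a minimal sketch, not the implementation used for the figures. The function names are hypothetical. For the EMD, the constraints force \(c_{k,\ell}\) to be a doubly stochastic matrix, so the minimum cost is attained at a one-to-one assignment, and for values on the real line matching the sorted entries is optimal. The denominator follows the formula as displayed above. The Fermat distance is computed with Dijkstra's algorithm on the complete graph over the set \(S\), given here as a list `points`.

```python
import heapq
import math


def euclidean(m1, m2):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))


def emd(m1, m2):
    """EMD as displayed in the text, for two profiles of equal length D.

    Numerator: minimal assignment cost, obtained by matching sorted values.
    Denominator: sum of all ground distances |m1_k - m2_l|.
    Returns 0.0 when all ground distances vanish (an assumption).
    """
    numerator = sum(abs(a - b) for a, b in zip(sorted(m1), sorted(m2)))
    denominator = sum(abs(a - b) for a in m1 for b in m2)
    return numerator / denominator if denominator > 0 else 0.0


def fermat(points, source, target, alpha):
    """Fermat alpha-distance between points[source] and points[target].

    Shortest path in the complete graph on `points`, where the edge between
    two points costs their Euclidean distance raised to the power alpha.
    """
    n = len(points)
    dist = [math.inf] * n
    dist[source] = 0.0
    done = [False] * n
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue
        done[u] = True
        if u == target:
            return d
        for v in range(n):
            if not done[v]:
                nd = d + euclidean(points[u], points[v]) ** alpha
                if nd < dist[v]:
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
    return dist[target]
```

Two profiles that differ only in the position of a peak, such as `[0, 0, 5, 0]` and `[0, 5, 0, 0]`, have zero EMD in this sketch, which illustrates the location insensitivity discussed above.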
3.2 Distance and Topology Based Clustering
Figure 3 represents gene-gene and patient-patient distances for different gene (resp. patient) orderings. Regardless of the ordering, we can use such distance matrices in hierarchical clustering algorithms [10] and select a natural number of clusters based on inconsistency criteria [12]. Grouping genes (resp. patients) by their clusters we obtain the panels in Fig. 3, which uncover hidden relations in the data.
Moreover, using any of these distances on the point cloud of patients \(m_{j,\cdot} = (m_{j,1}, \ldots, m_{j,N})\), \(j = 1, \ldots, J\), or the point cloud of genes \(m_{\cdot,i} = (m_{1,i}, \ldots, m_{J,i})\), \(i = 1, \ldots, N\), we can implement a procedure similar to that described in Sect. 2; only the distance changes. We construct a filtration and calculate the Betti numbers, as well as the persistence diagrams and intervals. With this information, we can compare datasets from different cancer types or patient studies to identify distinctive features and profiles. In addition, the \(H_0\) homology provides an additional clustering strategy, different from usual hierarchical clustering. For a fixed filtration parameter value, each component of the simplicial complex constructed for that filtration value defines a cluster. As the filtration parameter varies, we have a topology-based hierarchical clustering strategy. Figure 4 displays the same data as Fig. 1a when genes and patients are rearranged following the components of filtrations for a fixed filtration value.
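The component-based clustering at a fixed filtration value can be sketched with a union-find structure. This is a minimal sketch, assuming a precomputed symmetric distance matrix `dist` (from any of the distances of Sect. 3.1) and a filtration value `r`. The function name `components_at_scale` is hypothetical.

```python
def components_at_scale(dist, r):
    """Label the connected components of the Vietoris-Rips complex at scale r.

    Two items are joined by an edge when their distance is smaller than r.
    Returns one integer label per item, numbered in order of first appearance.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if dist[a][b] < r:
                parent[find(a)] = find(b)

    labels = {}
    return [labels.setdefault(find(x), len(labels)) for x in range(n)]
```

Calling the function for increasing values of `r` merges clusters progressively, which gives the hierarchy described above.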
4 Conclusions
We have discussed the potential of persistence studies based on different distances combined with clustering strategies to extract information from point clouds of data of medical interest. Applied to time series of images of cellular arrangements, this approach provides a tool to automatically classify specific image features. Applied to gene expression data, it opens new perspectives to gain a better understanding of hidden relations. Similar techniques could be exploited to study clinical data from other illnesses, immune disorders for instance [4].
References
1. L.L. Bonilla, A. Carpio, C. Trenado, Tracking collective cell motion by topological data analysis, PLoS Comput. Biol. 16 (2020) e1008407.
2. G. Carlsson, Topology and data, Bull. Amer. Math. Soc. 46 (2009) 255–308.
3. A. Carpio, L.L. Bonilla, J.C. Mathews, A.R. Tannenbaum, Fingerprints of cancer by persistent homology, bioRxiv 777169, 2019.
4. A. Carpio, A. Simón, L.F. Villa, Clustering methods and Bayesian inference for the analysis of the time evolution of immune disorders, arXiv:2009.11531, 2020.
5. A. Carpio, A. Simón, A. Torres, L.F. Villa, Pattern recognition in data as a diagnosis tool, Journal of Mathematics in Industry 12 (2022) 3.
6. E. Cerami, J. Gao, U. Dogrusoz et al., The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data, Cancer Discov. 2 (2012) 401–404.
7. Y. Chen, F.D. Cruz, R. Sandhu, A.L. Kung, P. Mundi et al., Pediatric sarcoma data forms a unique cluster measured via the Earth Mover’s Distance, Sci. Rep. 7 (2017) 7035.
8. M.R. McGuirl, A. Volkening, B. Sandstede, Topological data analysis of zebrafish patterns, Proc. Nat. Acad. Sci. 117 (2020) 5113–5124.
9. S. Moitrier, C. Blanch, S. Garcia, K. Sliogeryte et al., Collective stresses drive competition between monolayers of normal and Ras-transformed cells, Soft Matter 15 (2019) 537–545.
10. L. Kaufman, P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis, Hoboken: Wiley-Interscience, 1990.
11. M. Kerber, D. Morozov, A. Nigmetov, Geometry helps to compare persistence diagrams, ACM J. Exp. Algorithmics 22 (2017) 1.4.
12. T. Kovacheva, A hierarchical clustering approach to find groups of objects, Proceedings of the IV Congress of Mathematicians, Macedonia, 2008, pp. 359–373.
13. A.H. Rizvi, P.G. Camara, E.K. Kandror, T.J. Roberts et al., Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development, Nat. Biotech. 35 (2017) 551–560.
14. F. Sapienza, P. Groisman, M. Jonckheere, Weighted geodesic distance following Fermat’s principle, Proc. 6th International Conference on Learning Representations (ICLR), 2018.
15. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, Pan-cancer analysis of whole genomes, Nature 578 (2020) 82–93.
16. C. Topaz, L. Ziegelmeier, T. Halverson, Topological data analysis of biological aggregation models, PLoS ONE 10 (2015) e0126383.
17. A. Zomorodian, G. Carlsson, Computing persistent homology, Discrete and Computational Geometry 33 (2005) 249–274.
Acknowledgements
Research supported by Spanish MICINN grants MTM 2017-84446-C2-1-R and PID2020-112796RB-C21.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Carpio, A. (2022). Cancer Fingerprints by Topological Data Analysis. In: Ehrhardt, M., Günther, M. (eds) Progress in Industrial Mathematics at ECMI 2021. ECMI 2021. Mathematics in Industry(), vol 39. Springer, Cham. https://doi.org/10.1007/978-3-031-11818-0_4
DOI: https://doi.org/10.1007/978-3-031-11818-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11817-3
Online ISBN: 978-3-031-11818-0
eBook Packages: Mathematics and Statistics (R0)