Cancer Fingerprints by Topological Data Analysis

Carpio, Ana

doi:10.1007/978-3-031-11818-0_4

Part of the book series: Mathematics in Industry (TECMI, volume 39)

Included in the following conference series:

European Consortium for Mathematics in Industry

Abstract

Topological data analysis has arisen as a promising tool to extract information on the structure of a wide variety of datasets. We analyze here its potential in two types of cancer studies. First, we compare time series of images from simulations of metastatic invasion in epithelial tissues. By calculating bottleneck distances between persistence diagrams, we can characterize and classify the advancing interfaces of cellular aggregates. Second, we compare mRNA expression values for genes involved in cell cycles extracted from pancreas cancer tissue. We discuss how persistence information from different distances can provide insight into patient/gene clusters.

1 Introduction

Clinical and experimental studies of illness generate large amounts of data of differing nature. Consider cancer, for instance. Laboratory analyses of gene expression lead to large files containing measurements for different genes [15], see Fig. 1a. By contrast, experimental observations of normal and malignant cells [9] yield time series of images, see Fig. 1b. Being able to extract meaningful information from large biomedical datasets, regardless of their nature, is a challenge that requires the development of adequate mathematical and computational tools.

Topological data analysis (TDA) furnishes a framework that provides dimensionality reduction and robustness to noise [2] when studying data clouds, with a certain independence with respect to the metrics selected. Recent studies have pointed out the potential of TDA in biological applications [8, 13, 16]. Biomedical data can often be seen as point clouds in a space of dimension D. Whereas for images D is the spatial dimension, for gene expression datasets D is the number of patients or genes in the study. We will see how to use TDA to extract information in both settings. The paper is organized as follows. Section 2 applies TDA to automatically classify interfaces between healthy and malignant cells in two-dimensional images. Section 3 proposes a topology-based hierarchical clustering procedure for gene expression data. Finally, Sect. 4 summarizes our conclusions.

2 Classification of Interfaces

Competition between two different media (fluids, for instance) or populations is a ubiquitous phenomenon in many fields. Usually, an interface separating the two components forms. Being able to automatically characterize such an interface is important to identify patterns or stages in biological applications. Given several images representing the evolution of fragmented interfaces, our strategy proceeds in the following steps [1]:

1. Extract from each image a point cloud X defining the interface.
2. Build a Vietoris-Rips filtration V(X, r) for each point cloud based on the Euclidean distance, that is, a family of simplicial complexes formed by joining with edges and triangles the points at a distance smaller than a variable parameter r, see [17].
3. Calculate the Betti numbers associated with each filtration: betti₀(r) (number of components) and betti₁(r) (number of holes) as the filtration parameter r varies.
4. For each identified component in each filtration, calculate the persistence interval [$r_b$, $r_d$], that is, the filtration parameter values at which it appears, $r_b$ (birth), and disappears, $r_d$ (death). These intervals define the H₀ homology.
5. For each identified hole in each filtration, calculate the persistence interval [$r_b$, $r_d$]. These intervals define the H₁ homology.
6. Plot the persistence diagrams formed by the points ($r_b$, $r_d$) defining the persistence intervals for components and holes in each filtration, see Fig. 2.
7. Calculate the bottleneck distance [11] between the H₁ persistence diagrams.
8. Use k-means or a hierarchical clustering [10] approach to group the interfaces in clusters according to the level of detail required.

Fig. 2 Persistence diagrams representative of the initial, intermediate and late stages in the invasion process
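
Steps 2 to 4 can be illustrated for H₀ without building the full complex. The sketch below is a minimal illustration, not the implementation used for the figures. It relies on the fact that components of a Vietoris-Rips complex only merge when an edge appears, so a union-find pass over the sorted pairwise Euclidean distances yields the death values. The function names `h0_persistence` and `betti0` are hypothetical, and holes (H₁) are not computed here.

```python
import math


def h0_persistence(points):
    """Return the H0 persistence intervals (birth, death) of a point cloud.

    Every point is born as its own component at r = 0. A component dies
    at the length of the edge that merges it into another component.
    """
    n = len(points)
    # All pairwise Euclidean distances, sorted by increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    intervals = []
    for r, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j       # two components merge at scale r
            intervals.append((0.0, r))
    if n > 0:
        intervals.append((0.0, math.inf))  # one component never dies
    return intervals


def betti0(intervals, r):
    """Number of components at scale r (edges join points at distance < r)."""
    return sum(1 for birth, death in intervals if birth <= r <= death)


if __name__ == "__main__":
    cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    bars = h0_persistence(cloud)
    print(bars)
    print([betti0(bars, r) for r in (0.5, 2.0, 4.5)])
```

Evaluating `betti0` on a grid of r values gives the curve betti₀(r) of step 3, and the list of intervals is the H₀ part of the persistence diagram of step 6.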

For the simulation considered in Fig. 1b, a set of 12 images is classified by k-means into 3 groups: the first three frames correspond to initial stages in which the interface is close to an unfragmented smooth curve, the last two frames correspond to late stages of the invasion period with many fragments and interpenetration, while the remaining frames correspond to an intermediate stage in which fragments may detach and reattach, see Fig. 2.
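
The bottleneck distance of step 7 can be sketched for small diagrams as follows. This is a brute-force illustration under stated assumptions, not the geometric algorithm of [11]: each diagram is padded with copies of the diagonal, the cost between two off-diagonal points is their L∞ distance, sending a point to the diagonal costs half its persistence, and a binary search over the candidate costs finds the smallest threshold that still admits a perfect matching. The function name `bottleneck_distance` is hypothetical, and diagrams are assumed to contain only finite intervals.

```python
def bottleneck_distance(diag_a, diag_b):
    """Bottleneck distance between two diagrams given as lists of (birth, death)."""
    n, m = len(diag_a), len(diag_b)
    size = n + m
    if size == 0:
        return 0.0

    def cost(i, j):
        # Rows: points of A, then m diagonal slots.
        # Columns: points of B, then n diagonal slots.
        if i < n and j < m:
            (b1, d1), (b2, d2) = diag_a[i], diag_b[j]
            return max(abs(b1 - b2), abs(d1 - d2))   # L-infinity distance
        if i < n:
            b, d = diag_a[i]
            return (d - b) / 2.0                     # point of A to the diagonal
        if j < m:
            b, d = diag_b[j]
            return (d - b) / 2.0                     # point of B to the diagonal
        return 0.0                                   # diagonal to diagonal

    def has_perfect_matching(threshold):
        # Augmenting-path bipartite matching using only edges with cost <= threshold.
        match_col = [-1] * size

        def try_row(i, seen):
            for j in range(size):
                if not seen[j] and cost(i, j) <= threshold:
                    seen[j] = True
                    if match_col[j] == -1 or try_row(match_col[j], seen):
                        match_col[j] = i
                        return True
            return False

        return all(try_row(i, [False] * size) for i in range(size))

    # The optimal value is one of the pairwise costs: binary search over them.
    candidates = sorted({cost(i, j) for i in range(size) for j in range(size)})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if has_perfect_matching(candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]


if __name__ == "__main__":
    print(bottleneck_distance([(0.0, 4.0)], [(0.0, 5.0)]))
```

Applying this function to every pair of H₁ diagrams gives a symmetric matrix of distances between frames, which is the input expected by the clustering in step 8. For diagrams with many points, a dedicated library would be preferable to this cubic-time sketch.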

The study of images involves point clouds in two- or three-dimensional spaces. Medical records containing the values of several variables monitored over a collection of patients belong to higher dimensional spaces. Their study presents new difficulties.

3 Grouping Data

Gene studies in cancer patients have provided large amounts of information which may help to identify genetic features of the disease [15]. We consider here measurements of mRNA gene expression data for pancreas cancer available in [6], taken from the TCGA (The Cancer Genome Atlas) study. In this case, data take the form of numeric matrices $M = (m_{j,i})$ containing values for a collection of genes i = 1, …, N, from tissue samples corresponding to different patients j = 1, …, J.

The first step consists in normalizing the data. To do so [7], we calculate the means $\mu_i$ and standard deviations $\sigma_i$ for each gene over the patients and compute the normalized values $\tilde{m}_{j,i} = \frac{m_{j,i} - \mu_i}{3\sigma_i}$. Then, we select a distance and a clustering strategy to group either patients using information from genes, or genes using information from patients.
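
The normalization step can be sketched as follows. This is a minimal illustration using only the standard library. The function name `normalize_genes` is hypothetical, the population standard deviation is an assumption (the text does not say whether the sample or population version is used), and mapping a gene with zero variance to 0 is also an assumption made to avoid a division by zero.

```python
from statistics import mean, pstdev


def normalize_genes(matrix):
    """Normalize a patients-by-genes matrix gene by gene.

    matrix[j][i] is the expression of gene i in patient j. Each entry is
    replaced by (m - mu_i) / (3 * sigma_i), with mu_i and sigma_i computed
    over the patients.
    """
    columns = list(zip(*matrix))            # one tuple of values per gene
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    normalized = []
    for row in matrix:
        new_row = []
        for value, mu, sigma in zip(row, mus, sigmas):
            if sigma == 0:
                new_row.append(0.0)         # assumption: constant gene maps to 0
            else:
                new_row.append((value - mu) / (3 * sigma))
        normalized.append(new_row)
    return normalized


if __name__ == "__main__":
    data = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]   # 3 patients, 2 genes
    for row in normalize_genes(data):
        print(row)
```

After this step every non-constant gene has zero mean and standard deviation 1/3 over the patients, so most values fall within [-1, 1].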

3.1 Distance Selection

To compare genes or patients, we can use a number of distances [5]:

The Euclidean distance between two columns or rows $m^1$ and $m^2$ is their distance as vectors in a D-dimensional space, $d(m^1, m^2) = \|m^1 - m^2\|_2$.
The Earth Mover’s distance (EMD) provides the minimum cost of turning one column (resp. row) into the other [13]
$$\displaystyle \begin{aligned} emd(m^1,m^2)= \frac{\sum_{k=1}^D \sum_{\ell=1}^D c_{k,\ell} d_{k,\ell}}{ \sum_{k=1}^D \sum_{\ell=1}^D d_{k,\ell}}, \end{aligned}$$
where $d_{k,\ell} = |m^1_k - m^2_\ell|$ is the ground distance, and $c_{k,\ell}$ minimizes the cost $\sum_{k=1}^D \sum_{\ell=1}^D c_{k,\ell} d_{k,\ell}$ subject to the constraints $c_{k,\ell} \geq 0$, $1 \leq k, \ell \leq D$, $\sum_{k=1}^D \sum_{\ell=1}^D c_{k,\ell} = D$, $\sum_{k=1}^D c_{k,\ell} \leq 1, \; 1 \leq \ell \leq D$, $\sum_{\ell=1}^D c_{k,\ell} \leq 1, \; 1 \leq k \leq D$. The EMD identifies patterns regardless of their location. The distance between two patient profiles that are equal except for a peak located at different genes would be small, which is inadequate as different genes may define different illnesses.
Considering a set S of columns (resp. rows) $m^1, m^2, \ldots, m^L$, the Fermat α-distance between any two of them relative to that set is [14]
$$\displaystyle \begin{aligned} d_{S,\alpha}(m^1,m^2)=\min\Bigl\{\sum_{\ell=1}^{k-1}\| y^{\ell+1}-y^\ell \|_2^\alpha \;\bigg\vert\; (y^1,\ldots,y^k) \text{ path from}\ m^1\ \text{to}\ m^2\ \text{in}\ S\Bigr\}, \end{aligned}$$
for any α ≥ 1. When α = 1, we recover the Euclidean distance, since by the triangle inequality the direct path is then optimal. The Fermat distance compares items in a set weighting information from all the other items in the same set, which is interesting when we want to compare gene profiles weighting information from cohorts of patients [3].
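
The three distances above can be sketched as follows, as an illustration under stated assumptions rather than the code used in the study. The function names are hypothetical. For the EMD, only the minimal transport cost $\sum c_{k,\ell} d_{k,\ell}$ is computed and the normalization is left out: under the stated constraints the optimal plan can be taken to be a permutation, and for the one-dimensional ground distance $|m^1_k - m^2_\ell|$ an optimal permutation pairs the sorted values. The Fermat distance is obtained with the Floyd-Warshall shortest path algorithm on the complete graph over S, with edge weights $\| y - y' \|_2^\alpha$.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two profiles of equal length."""
    return math.dist(u, v)


def emd_transport_cost(u, v):
    """Minimal transport cost between two profiles (unnormalized EMD numerator).

    Pairing the sorted values solves the assignment problem for the
    ground distance |u_k - v_l| in one dimension.
    """
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v)))


def fermat_distances(vectors, alpha):
    """All-pairs Fermat alpha-distances relative to the set `vectors`."""
    n = len(vectors)
    # Direct edges: Euclidean length raised to the power alpha.
    dist = [
        [math.dist(vectors[a], vectors[b]) ** alpha for b in range(n)]
        for a in range(n)
    ]
    # Floyd-Warshall: allow paths through intermediate items of the set.
    for k in range(n):
        for a in range(n):
            for b in range(n):
                through_k = dist[a][k] + dist[k][b]
                if through_k < dist[a][b]:
                    dist[a][b] = through_k
    return dist


if __name__ == "__main__":
    p1 = [0.0, 0.0, 5.0, 0.0]   # peak at the third gene
    p2 = [0.0, 5.0, 0.0, 0.0]   # same peak at the second gene
    print(euclidean(p1, p2), emd_transport_cost(p1, p2))
    print(fermat_distances([(0.0,), (1.0,), (2.0,)], alpha=2)[0][2])
```

The first example reproduces the remark on the EMD: two profiles that differ only in the location of a peak are far apart in the Euclidean distance but have zero transport cost. The second shows how, for α > 1, the Fermat distance favors paths made of short hops through other items of the set.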

3.2 Distance and Topology Based Clustering

Figure 3 represents gene-gene and patient-patient distances for different gene (resp. patient) orderings. Regardless of the ordering, we can use such distance matrices in hierarchical clustering algorithms [10] and select a natural number of clusters based on inconsistency criteria [12]. Grouping genes (resp. patients) by their clusters we obtain the panels in Fig. 3, which uncover hidden relations in the data.

Moreover, using any of these distances on the point cloud of patients $m_{j,\cdot} = (m_{j,1}, \ldots, m_{j,N})$, j = 1, …, J, or the point cloud of genes $m_{\cdot,i} = (m_{1,i}, \ldots, m_{J,i})$, i = 1, …, N, we can implement a procedure similar to that described in Sect. 2; only the distance changes. We construct a filtration and calculate the Betti numbers, as well as the persistence diagrams and intervals. With this information, we can compare datasets from different cancer types or patient studies to identify distinctive features and profiles. Moreover, the H₀ homology provides an additional clustering strategy, different from usual hierarchical clustering. For a fixed filtration parameter value, each component of the simplicial complex constructed for that filtration value defines a cluster. As the filtration parameter varies, we have a topology-based hierarchical clustering strategy. Figure 4 displays the same data as Fig. 1a when genes and patients are rearranged following the components of filtrations for a fixed filtration value.
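
The H₀-based clustering at a fixed filtration value can be sketched as follows. This is a minimal illustration, assuming a precomputed symmetric distance matrix (built with any of the distances of Sect. 3.1) and the convention that two items are joined when their distance is smaller than r. The function name `clusters_at_scale` is hypothetical.

```python
def clusters_at_scale(dist, r):
    """Label the connected components of the complex at filtration value r.

    dist is a symmetric matrix of pairwise distances. Items i and j are
    joined by an edge when dist[i][j] < r, and each connected component
    defines one cluster. Labels are numbered by order of first appearance.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < r:
                parent[find(i)] = find(j)

    label_of_root = {}
    labels = []
    for i in range(n):
        root = find(i)
        if root not in label_of_root:
            label_of_root[root] = len(label_of_root)
        labels.append(label_of_root[root])
    return labels


if __name__ == "__main__":
    positions = [0.0, 1.0, 5.0, 6.0]
    matrix = [[abs(a - b) for b in positions] for a in positions]
    for r in (0.5, 2.0, 4.5):
        print(r, clusters_at_scale(matrix, r))
```

Sweeping r from small to large values merges clusters progressively, which gives the hierarchy mentioned above. Sorting genes or patients by their label at a chosen r produces a rearrangement of the kind displayed in Fig. 4.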

4 Conclusions

We have discussed the potential of persistence studies based on different distances combined with clustering strategies to extract information from point clouds of data of medical interest. Applied to time series of images of cellular arrangements, they provide a tool to automatically classify specific image features. Applied to gene expression data, they open new perspectives to gain a better understanding of hidden relations. Similar techniques could be exploited to study clinical data from other illnesses, immune disorders for instance [4].

References

1. L.L. Bonilla, A. Carpio, C. Trenado, Tracking collective cell motion by topological data analysis, PLoS Comput Biol 16 (2020) e1008407.
2. G. Carlsson, Topology and data, Bull. Amer. Math. Soc. 46 (2009) 255–308.
3. A. Carpio, L.L. Bonilla, J.C. Mathews, A.R. Tannenbaum, Fingerprints of cancer by persistent homology, bioRxiv 777169, 2019.
4. A. Carpio, A. Simón, L.F. Villa, Clustering methods and Bayesian inference for the analysis of the time evolution of immune disorders, arXiv:2009.11531, 2020.
5. A. Carpio, A. Simón, A. Torres, L.F. Villa, Pattern recognition in data as a diagnosis tool, Journal of Mathematics in Industry 12 (2022) 3.
6. E. Cerami, J. Gao, U. Dogrusoz et al., The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data, Cancer Discov 2 (2012) 401–404.
7. Y. Chen, F. D. Cruz, R. Sandhu, A. L. Kung, P. Mundi, et al., Pediatric sarcoma data forms a unique cluster measured via the Earth Mover’s Distance, Sci. Rep. 7 (2017) 7035.
8. M.R. McGuirl, A. Volkening, B. Sandstede, Topological data analysis of zebrafish patterns, Proc. Nat. Acad. Sci. 117 (2020) 5113–5124.
9. S. Moitrier, C. Blanch, S. Garcia, K. Sliogeryte et al., Collective stresses drive competition between monolayers of normal and Ras-transformed cells, Soft Matter 15 (2019) 537–545.
10. L. Kaufman, P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis, Hoboken: Wiley-Interscience, 1990.
11. M. Kerber, D. Morozov, A. Nigmetov, Geometry helps to compare persistence diagrams, ACM J. Exp. Algorithmics 22 (2017) 1.4.
12. T. Kovacheva, A hierarchical clustering approach to find groups of objects, Proceedings of the IV Congress of Mathematicians, Macedonia, 2008, pp. 359–373.
13. A.H. Rizvi, P.G. Camara, E.K. Kandror, T.J. Roberts et al., Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development, Nat. Biotech. 35 (2017) 551–560.
14. F. Sapienza, P. Groisman, M. Jonckheere, Weighted Geodesic Distance Following Fermat’s Principle, Proc. 6th International Conference on Learning Representations (ICLR), 2018.
15. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, Pan-cancer analysis of whole genomes, Nature 578 (2020) 82–93.
16. C. Topaz, L. Ziegelmeier, T. Halverson, Topological data analysis of biological aggregation models, PLoS ONE 10 (2015) e0126383.
17. A. Zomorodian, G. Carlsson, Computing persistent homology, Discrete and Computational Geometry 33 (2002) 249–274.

Acknowledgements

Research supported by Spanish MICINN grants MTM 2017-84446-C2-1-R and PID2020-112796RB-C21.

Author information

Authors and Affiliations

Universidad Complutense de Madrid, Madrid, Spain
Ana Carpio

Corresponding author

Correspondence to Ana Carpio.

Editor information

Editors and Affiliations

Angewandte Mathematik und Numerische Analysis, Bergische Universität Wuppertal, Wuppertal, Germany
Matthias Ehrhardt
Angewandte Mathematik und Numerische Analysis, Bergische Universität Wuppertal, Wuppertal, Germany
Michael Günther

Cite this paper

Carpio, A. (2022). Cancer Fingerprints by Topological Data Analysis. In: Ehrhardt, M., Günther, M. (eds) Progress in Industrial Mathematics at ECMI 2021. ECMI 2021. Mathematics in Industry, vol 39. Springer, Cham. https://doi.org/10.1007/978-3-031-11818-0_4

DOI: https://doi.org/10.1007/978-3-031-11818-0_4
Published: 11 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11817-3
Online ISBN: 978-3-031-11818-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
