
1 Mathematical Foundations

Our goal is to develop a flexible theory that can cope with data sampled non-uniformly from an underlying manifold embedded in a high-dimensional ambient space. The underlying approach is based on combining multiple local approximations of an underlying manifold on which the input data \(X = \{x_1, x_2, \ldots , x_N\}\) is assumed to lie. Since most existing theory, such as the Nerve theorem, implicitly assumes a uniform distribution with respect to the ambient metric, other approaches are called for. For example, we can adapt to local density by locally approximating the Riemannian metric of the manifold at each point, yielding N incompatible local approximations of inter-point distance. This incompatibility is in turn tackled by transferring the problem to fuzzy topological representations, where natural means of combination are available. Given a single fuzzy topological representation of the data, the various subtopics of unsupervised learning fall out as natural further operations.

The foundation of this approach is based on the work of David Spivak on metric realization for fuzzy simplicial sets [10], and categorical versions of fuzzy sets by Michael Barr [1] – it is this work that allows us to transfer the problem to one of fuzzy topological representations. A sketch of the approach is as follows: Let I be the unit interval \((0,1] \subseteq \mathbb {R}\) with topology given by intervals of the form [0, a) for \(a\in (0, 1]\). The category of open sets (with morphisms given by inclusions) can be imbued with a Grothendieck topology in the natural way for any poset category.

Definition 1

A presheaf \(\mathscr {P}\) on I is a functor from \(I^\text {op}\) to Sets. A fuzzy set is a presheaf on I such that all maps \(\mathscr {P}(a\le b)\) are injections.

Presheaves on I form a category with morphisms given by natural transformations. We can thus form a category of fuzzy sets by simply restricting to those presheaves that are fuzzy sets. We note that such presheaves are trivially sheaves under the Grothendieck topology on I. A section \(\mathscr {P}([0, a))\) can be thought of as the set of all elements with membership strength at least a. We can now define the category of fuzzy sets.

Definition 2

The category Fuzz of fuzzy sets is the full subcategory of sheaves on I spanned by fuzzy sets.

Defining fuzzy simplicial sets is simply a matter of considering presheaves on \(\varvec{\varDelta }\) valued in the category of fuzzy sets rather than the category of sets.

Definition 3

The category of fuzzy simplicial sets sFuzz is the category with objects given by functors from \(\varvec{\varDelta }^{\text {op}}\) to Fuzz, and morphisms given by natural transformations.

Alternatively, a fuzzy simplicial set can be viewed as a sheaf over \(\varvec{\varDelta }\times I\), where \(\varvec{\varDelta }\) is given the trivial topology and \(\varvec{\varDelta }\times I\) has the product topology. We will use \(\varDelta ^n_{<a}\) to denote the sheaf given by the representable functor of the object \(([n], [0, a))\). The importance of this fuzzy (sheafified) version of simplicial sets is their relationship to metric spaces. We begin by considering the larger category of extended-pseudo-metric spaces.

Definition 4

An extended-pseudo-metric space \((X, d)\) is a set X and a map \(d: X\times X\rightarrow \mathbb {R}_{\ge 0}\cup \{\infty \}\) such that

  1. \(d(x,y)\ge 0\), and \(x=y\) implies \(d(x,y)=0\);

  2. \(d(x,y)=d(y,x)\); and

  3. either \(d(x,z)\le d(x,y)+d(y,z)\) or \(d(x,z) = \infty \).

The category of extended-pseudo-metric spaces EPMet has as objects extended-pseudo-metric spaces and non-expansive maps as morphisms. We denote the subcategory of finite extended-pseudo-metric spaces FinEPMet.

The choice of non-expansive maps in Definition 4 is due to Spivak, but we note that it closely mirrors the work of Carlsson and Mémoli in [3] on topological methods for clustering as applied to finite metric spaces. This choice is significant since pure isometries are too strict and do not provide large enough Hom-sets.

In [10] Spivak constructs a pair of adjoint functors, \(\mathsf {Real}\) and \(\mathsf {Sing}\) between the categories sFuzz and EPMet. These functors are the natural extension of the classical realization and singular set functors from algebraic topology (see [5]). We are only interested in finite metric spaces, and thus use the analogous adjoint pair \(\mathsf {FinReal}\) and \(\mathsf {FinSing}\). Formally we define the finite realization functor as follows:

Definition 5

Define the functor \(\mathsf {FinReal}:\mathbf {sFuzz}\rightarrow \mathbf {FinEPMet}\) by setting

$$ \mathsf {FinReal}(\varDelta ^n_{<a}) \triangleq (\{x_1, x_2, \ldots , x_{n+1}\}, d_a), $$

where

$$ d_a(x_i, x_j) = \begin{cases} -\log (a) & \text {if } i \ne j,\\ 0 & \text {otherwise}, \end{cases} $$

and extending to morphisms as follows. A morphism \((\sigma , \le ):([n], [0, a)) \rightarrow ([m], [0, b))\) exists only when \(a\le b\), in which case we define

$$ \mathsf {FinReal}((\sigma , \le )): \mathsf {FinReal}(\varDelta ^n_{<a}) \rightarrow \mathsf {FinReal}(\varDelta ^m_{<b}) $$

to be the map

$$ (\{x_1, x_2, \ldots , x_{n+1}\}, d_a)\mapsto (\{x_{\sigma (1)}, x_{\sigma (2)}, \ldots , x_{\sigma (n+1)}\}, d_b), $$

which is non-expansive since \(a \le b\) implies \(d_a \ge d_b\).
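To make Definition 5 concrete, the following sketch (in Python with NumPy; the function name and the distance-matrix encoding are our own illustrative choices) realizes \(\varDelta ^n_{<a}\) as \(n+1\) points at mutual distance \(-\log (a)\):

```python
import numpy as np

def fin_real_simplex(n: int, a: float) -> np.ndarray:
    """FinReal applied to the representable sheaf Delta^n_{<a}: a finite
    extended-pseudo-metric space of n + 1 points in which every pair of
    distinct points is at distance -log(a). Returns the distance matrix."""
    assert 0.0 < a <= 1.0, "membership strength a must lie in (0, 1]"
    return -np.log(a) * (1.0 - np.eye(n + 1))
```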

Since \(\mathsf {FinReal}\) preserves colimits it admits a right adjoint, the fuzzy singular set functor \(\mathsf {FinSing}\). To define the fuzzy singular set functor we require some further notation. Given a fuzzy simplicial set X, let \(X^n_{<a}\) be the set \(X([n], [0, a))\). We can then define the fuzzy singular set functor in terms of the action of its image on the objects of \(\varvec{\varDelta }\times I\).

Definition 6

Define the functor \(\mathsf {FinSing}:\mathbf {FinEPMet}\rightarrow \mathbf {sFuzz}\) by

$$ \mathsf {FinSing}(Y)^n_{<a} \triangleq \hom _{\mathbf{FinEPMet}}(\mathsf {FinReal}(\varDelta ^n_{<a}), Y). $$
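Unwinding Definition 6 for finite metric spaces: a non-expansive map from \(\mathsf {FinReal}(\varDelta ^n_{<a})\) to Y is simply a choice of \(n+1\) points of Y whose pairwise distances are all at most \(-\log (a)\), so an n-simplex \((y_0, \ldots , y_n)\) has membership strength \(\exp (-\max _{i\ne j} d(y_i, y_j))\). Restricted to the 1-skeleton this gives the following sketch (function name ours), which computes the fuzzy set of edges directly from a distance matrix:

```python
import numpy as np

def fuzzy_edge_strengths(dist: np.ndarray) -> np.ndarray:
    """1-skeleton of FinSing(Y) for a finite extended-pseudo-metric space:
    the edge (y_i, y_j) has membership strength exp(-d(y_i, y_j)), and the
    diagonal (the 0-simplices) gets exp(0) = 1."""
    return np.exp(-np.asarray(dist, dtype=float))
```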

With the necessary theoretical background in place, the means to handle the family of incompatible metric spaces described earlier becomes clear. Each metric space in the family can be translated into a fuzzy simplicial set via the fuzzy singular set functor, distilling the topological information while still retaining metric information in the fuzzy structure. Resolving the incompatibilities of the resulting family of fuzzy simplicial sets can be done by taking a (fuzzy) union or intersection across the entire family. The result is a single fuzzy simplicial set which captures the relevant topological and underlying metric structure of the manifold \(\mathcal {M}\). Thus if \(d_i\) is the distance metric induced by the local approximation at the point \(x_i\), then the fuzzy topological representation of the manifold is given by either

$$ \bigcup _{i=1}^N \mathsf {FinSing}((X, d_i)) \quad \text { or }\quad \bigcap _{i=1}^N \mathsf {FinSing}((X, d_i)), $$

depending on the kind of combination required. For those less comfortable with fuzzy set semantics, this can be phrased directly in terms of pushouts (or colimits) and pullbacks (or limits) of simplicial sheaves.
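As a concrete illustration, the sketch below combines the N local fuzzy 1-skeletons elementwise. We use the standard fuzzy union (pointwise maximum) and intersection (pointwise minimum); other t-(co)norms are possible, and the UMAP reference implementation instead takes the union via the probabilistic t-conorm \(a + b - ab\).

```python
import numpy as np

def combine_local_skeletons(local_strengths, mode="union"):
    """Combine N local fuzzy 1-skeletons (each an N x N membership matrix
    built from one local metric d_i) into a single fuzzy representation.
    Union takes the pointwise max, intersection the pointwise min."""
    stacked = np.stack(local_strengths)
    return stacked.max(axis=0) if mode == "union" else stacked.min(axis=0)
```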

2 Dimension Reduction

We can make use of the general theory outlined in Sect. 1 for the purposes of dimension reduction. The resulting algorithm, called Uniform Manifold Approximation and Projection (UMAP), is sketched here. For a fuller presentation see [9].

For the purposes of dimension reduction we assume the manifold hypothesis – that the data is (approximately) sampled from some manifold \(\mathcal {M}\) embedded in the ambient feature space. If we assume the data to be uniformly distributed on \(\mathcal {M}\) (with respect to a Riemannian metric g) then any ball of fixed volume should contain approximately the same number of points of X regardless of where on the manifold it is centered. Conversely, a ball centered at \(X_i\) that contains exactly the k-nearest-neighbors of \(X_i\) should have fixed volume regardless of the choice of \(X_i \in X\). We can approximate geodesic distance from \(X_i\) to its neighbors by normalising distances with respect to the distance to the \(k^{\text {th}}\) nearest neighbor of \(X_i\).

For real data it is safe to assume that the manifold \(\mathcal {M}\) is locally connected. In practice this can be realized by measuring distance in the extended-pseudo-metric space local to \(X_i\) as geodesic distance beyond the nearest neighbor of \(X_i\). Since this sets the distance to the nearest neighbor to be equal to 0, this is only possible in the more relaxed setting of extended-pseudo-metric spaces. It ensures, however, that each 0-simplex is the face of some 1-simplex with fuzzy membership strength 1, meaning that the resulting topological structure derived from the manifold is locally connected. We note that this has a similar practical effect to the truncated similarity approach of Lee and Verleysen [7], but derives naturally from the assumption of local connectivity of the manifold. We may then use the results of the previous section to combine together the local fuzzy simplicial sets and arrive at a single global fuzzy topological representation.
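Putting the pieces together, a rough sketch of the construction of the local fuzzy 1-skeletons might look as follows. We use scikit-learn for the neighbor search; subtracting the nearest-neighbor distance \(\rho _i\) enforces local connectivity, and dividing by the \(k^{\text {th}}\)-neighbor distance performs the local normalisation. (The UMAP reference implementation [9] refines the normalising factor by a further search; that detail is elided here.)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_memberships(X: np.ndarray, k: int):
    """One row of edge membership strengths per point x_i, measured in the
    extended-pseudo-metric space local to x_i."""
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    rho = dist[:, 1]                       # distance to nearest neighbor
    sigma = np.maximum(dist[:, k], 1e-12)  # k-th neighbor distance (guarded)
    # geodesic distance beyond the nearest neighbor, locally normalised;
    # the nearest neighbor always receives membership strength exp(0) = 1
    w = np.exp(-np.maximum(dist[:, 1:] - rho[:, None], 0.0) / sigma[:, None])
    return w, idx[:, 1:]
```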

Having obtained a representation of the manifold it only remains to find a low dimensional representation of the data that suitably approximates the manifold.

Let \(Y = \{Y_1,\ldots ,Y_N\} \subseteq \mathbb {R}^d\) be a low dimensional (\(d \ll n\), where n is the dimension of the ambient space) representation of X such that \(Y_i\) represents the source data point \(X_i\). In contrast to the source data, where we want to estimate a manifold on which the data is uniformly distributed, we know the manifold for Y is \(\mathbb {R}^d\) itself. Therefore we know the manifold and manifold metric a priori, and can compute the fuzzy topological representation directly. Of note, we still want to incorporate the distance to the nearest neighbor as per the local connectedness requirement. This can be achieved by supplying a parameter that defines the expected distance between nearest neighbors in the embedded space.

Given fuzzy simplicial set representations of X and Y, a means of comparison is required. If we consider only the 1-skeleton of the fuzzy simplicial sets we can describe each as a fuzzy graph, or, more specifically, a fuzzy set of edges. To compare two fuzzy sets we will make use of fuzzy set cross entropy. For these purposes we will revert to classical fuzzy set notation. That is, a fuzzy set is given by a reference set A and a membership strength function \(\mu :A\rightarrow [0,1]\). Comparable fuzzy sets have the same reference set. Given a sheaf representation \(\mathscr {P}\) we can translate to classical fuzzy sets by setting \(A = \bigcup _{a\in (0,1]} \mathscr {P}([0,a))\) and \(\mu (x) = \sup \{a\in (0, 1] \mid x \in \mathscr {P}([0, a))\}\).
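For a finite family of membership levels this translation is immediate; the sketch below (names ours) maps each level a to its section \(\mathscr {P}([0, a))\) and recovers the classical pair \((A, \mu )\), the supremum becoming a maximum over the finitely many levels:

```python
def to_classical_fuzzy_set(sections):
    """Translate a (finitely sampled) sheaf representation into a classical
    fuzzy set (A, mu). `sections` maps a level a in (0, 1] to the set
    P([0, a)) of elements with membership strength at least a."""
    A = set().union(*sections.values())
    mu = {x: max(a for a, s in sections.items() if x in s) for x in A}
    return A, mu
```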

Definition 7

The cross entropy C of two fuzzy sets \((A, \mu )\) and \((A, \nu )\) is defined as

$$ C((A, \mu ), (A, \nu )) \triangleq \sum _{a\in A} \mu (a)\log \left( \frac{\mu (a)}{\nu (a)}\right) + (1 - \mu (a))\log \left( \frac{1 - \mu (a)}{1 - \nu (a)}\right) . $$
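A direct transcription of Definition 7 in code, with a small \(\epsilon \) guarding the logarithms when \(\nu (a)\) approaches 0 or 1 (a numerical convenience, not part of the definition):

```python
import numpy as np

def fuzzy_cross_entropy(mu: np.ndarray, nu: np.ndarray, eps: float = 1e-12):
    """Cross entropy of two fuzzy sets over a common reference set, given as
    membership-strength vectors mu and nu indexed by the elements of A."""
    mu = np.clip(np.asarray(mu, dtype=float), 0.0, 1.0)
    nu = np.clip(np.asarray(nu, dtype=float), eps, 1.0 - eps)
    attract = mu * np.log(np.clip(mu, eps, 1.0) / nu)
    repel = (1.0 - mu) * np.log(np.clip(1.0 - mu, eps, 1.0) / (1.0 - nu))
    return float(np.sum(attract + repel))
```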

Similar to t-SNE, we can optimize the embedding Y with respect to the fuzzy set cross entropy C by using stochastic gradient descent. However, this requires a differentiable fuzzy singular set functor. If the expected minimum distance between points is zero, the fuzzy singular set functor is differentiable for these purposes; for any non-zero value, however, we need to make a differentiable approximation (chosen from a suitable family of differentiable functions).
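For instance, the UMAP implementation [9] approximates the exact low dimensional membership strength with a smooth family of the form \(1/(1 + a\,d^{2b})\), where a and b are fit to the chosen minimum distance; the values below are purely illustrative:

```python
def smooth_membership(sq_dist, a=1.577, b=0.895):
    """Differentiable approximation to the low dimensional membership
    strength of an edge with squared length sq_dist; a and b are fit from
    the expected minimum distance between embedded points."""
    return 1.0 / (1.0 + a * sq_dist ** b)
```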

This completes the algorithm: by using manifold approximation and patching together local fuzzy simplicial set representations we construct a topological representation of the high dimensional data. We then optimize the layout of data in a low dimensional space to minimize the error between the two topological representations.

We believe this algorithm [9] to be state-of-the-art for dimension reduction techniques.

3 Clustering

Clustering data requires a clear definition of what constitutes a cluster. Here we follow Hartigan [6], Stuetzle [11], Chaudhuri et al. [4], and others in viewing the problem as that of approximating the ‘cluster tree’ formed by level sets of the probability density function (pdf) from which the data was drawn.

Specifically, a statistically oriented view of density clustering begins with the assumption that there exists some unknown density function from which the observed data is drawn. From the density function f, defined on a metric space \((\mathcal {X}, d)\), one can construct a hierarchical cluster structure, where a cluster is a connected subset of an f-level set \(\{x \in \mathcal {X} \mid f(x) \ge \lambda \}\). As \(\lambda \ge 0\) varies these f-level sets nest in such a way as to construct an infinite tree, which is referred to as the cluster tree. Each cluster is a branch of this tree, extending over the range of \(\lambda \) values for which it is distinct. The goal of a clustering algorithm is to suitably approximate the cluster tree, converging to it in the limit of infinite observed data points.

For our purposes we assume that the data is drawn from a pdf defined on some manifold. In contrast to the dimension reduction case, where it was beneficial to adjust the metric so as to have the data uniformly distributed on the manifold, we specifically want to preserve the distribution. Indeed we may wish to go a step further – since we are ultimately interested in the high density regions of the pdf on the manifold, it may be beneficial to emphasise these regions. This is due to the finite sampling producing “noise” points that can confound density approximations built on limited data. Thus, while the dimension reduction case normalised distances based on the uniform distribution assumption, we will instead do the opposite.

Specifically, we have two issues when considering distances between points: first, we need some level of global normalisation to bring distances into a common range regardless of the scaling of the source data. This can be achieved, similar to the dimension reduction case, by normalising by a k-nearest-neighbour distance – in this case, however, we select the global average of knn distances since we do not want to locally normalise the data. Next, we may wish to emphasise the pdf by extending distances in sparse regions of the space, while contracting them in dense regions. This would be similar to Wishart’s mode analysis [12]. In practice we can simply do the opposite of the dimension reduction case; instead of normalising by the knn-distance of each point we can expand distances by a factor of some power \(\alpha \) of the knn-distance. In some practical cases \(\alpha \) may be zero, but in high noise datasets values in the range (0, 1] are desirable. Thus we can define a clustering distance local to \(x_i\) as

$$ d_{C_i}(x_i, x_j) = \frac{d_{\mathbb {R}^n}(x_i, x_j)}{\left( \frac{1}{N}\sum _{k=1}^N r_k\right) } \cdot \left( \frac{r_i}{\left( \frac{1}{N}\sum _{k=1}^N r_k\right) }\right) ^\alpha $$

where \(r_i\) is the distance to the \(k^\text {th}\) nearest neighbor of \(x_i\).
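In code the local clustering metric is a direct transcription of the formula above (here `dist` is the \(N \times N\) ambient distance matrix and `r` the vector of \(k^\text {th}\) nearest neighbor distances; row i holds distances in the metric local to \(x_i\)):

```python
import numpy as np

def clustering_distances(dist: np.ndarray, r: np.ndarray, alpha: float = 0.5):
    """d_{C_i}(x_i, x_j): globally normalise by the average knn distance,
    then exaggerate density by the factor (r_i / mean(r))^alpha."""
    r_bar = r.mean()                 # global average knn distance
    scale = (r / r_bar) ** alpha     # per-point density exaggeration
    return (dist / r_bar) * scale[:, None]
```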

We now have the familiar state of having multiple incompatible local metric spaces, which we can convert and combine into a single fuzzy topological representation. Again we wish to do the opposite of the dimension reduction approach by using logical conjunction to join the fuzzy simplicial sets – i.e. we wish to use intersection in this case. In categorical terms this is simply a matter of taking the limit of a family of simplicial sheaves rather than the colimit.

In practical terms this represents the requirement that a given simplex should have membership strength at most the lowest membership strength it receives in any of the local representations. The result is a fuzzy topological representation of the manifold that preserves density information.

Our next concern is converting this topological representation S into actual density estimates for which we can find level sets. From an algebraic topology point of view the way is clear: we want to consider \(\pi _0(S)\), the connected components of the fuzzy simplicial set S, assuming that such an object can be defined. We first note that S itself is a functor \(S:\varvec{\varDelta }^\text {op}\rightarrow \mathbf{Fuzz}\), where, in turn, Fuzz is a subcategory of the presheaf category \(\mathbf{Sets}^{I^\text {op}}\). It follows, via some trivial symbol pushing, that S also defines a functor \(S:I^\text {op}\rightarrow \mathbf{sSet}\). Since the connected component functor \(\pi _0\) is defined on sSet, we can post-compose with it to define a functor \(\pi _0\circ S\)

$$ I^{\text {op}}\xrightarrow {\ S\ } \mathbf{sSet}\xrightarrow {\ \pi _0\ } \mathbf{Sets}, $$

which defines a fuzzy set.

Expressing this fuzzy set in classical fuzzy set notation \((A, \mu )\) we have

$$ A = \bigcup _{a\in I} \pi _0 \circ S ([0, a)) $$

and

$$ \mu (a) = \sup \{i\in (0, 1] \mid a \in \pi _0 \circ S([0, i))\}. $$

This is a very large set. Similar to HDBSCAN* [2], or the pruning of Stuetzle and Nugent [11], we can dismiss connected components with size less than some parameterized bound m. Thus we arrive at the fuzzy subset \((A, \nu )\) where

$$ \nu (a) = \begin{cases} \sup \{i\in (0, 1] \mid a \in \pi _0 \circ S([0, i))\} & \text {if } |a| \ge m,\\ 0 & \text {otherwise}. \end{cases} $$
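Computationally, for a finite sample and a finite grid of membership levels, the pruned fuzzy set \((A, \nu )\) can be approximated by sweeping a threshold over the combined fuzzy 1-skeleton and retaining the sufficiently large connected components at each level. A sketch using SciPy (function names ours):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def pruned_components(W: np.ndarray, levels, m: int):
    """For each membership level a, keep the connected components of the
    graph of edges with strength >= a containing at least m points. The
    components nest as a decreases, tracing out the (pruned) cluster tree."""
    kept = []
    for a in levels:
        n_comp, labels = connected_components(csr_matrix(W >= a),
                                              directed=False)
        for c in range(n_comp):
            members = frozenset(np.flatnonzero(labels == c))
            if len(members) >= m:
                kept.append((a, members))
    return kept
```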

Finally we can further simplify the set by placing an equivalence relation on the carrier set A. First define a binary relation \(\smallsmile \) on elements of A by

$$ a \smallsmile b \iff |a \triangle b| \le m, $$

where \(\triangle \) is set symmetric difference. Clearly this relation is symmetric and reflexive, and thus the transitive closure \(\sim \) of \(\smallsmile \) is an equivalence relation on A. The resulting fuzzy quotient set effectively provides a simplified cluster tree, and methods similar to those of HDBSCAN* [2] can be used to extract a flat clustering if desired.

We note that all of this theory goes through much more cleanly than, for example, the topological interpretations of HDBSCAN* (section 2.3 of [8]). Furthermore the resulting clustering algorithm, while producing very similar results to HDBSCAN*, is more robust to hyper-parameter selection. We therefore contend, based on the limited experiments so far conducted, that this represents a new state-of-the-art clustering algorithm.

4 Anomaly Detection

Anomaly detection is the task of finding samples that are “inconsistent with the rest of the data”. Ultimately such an approach requires a model of the expected distribution of the data through which to determine which samples are “surprising”. While parametric techniques such as Gaussian mixture models can be used, they impose significant distributional assumptions. Ideally an anomaly detection technique would make use of a non-parametric approximation of the distribution of the data.

Following the work of the previous sections such a non-parametric density estimate can be easily constructed. Specifically we can define a fuzzy set \((X, \varphi )\) with carrier set the source data X and fuzzy membership given by

$$ \varphi (x) = \sup \{ \nu (a) \mid a\in A \text { and } x\in a\}, $$

where \((A, \nu )\) is defined as per the procedure in the previous section. This provides a [0, 1]-valued membership strength for each data point. By simply taking the fuzzy complement of this set we arrive at an outlier score with 1 representing a maximal outlier and 0 representing a maximal inlier.
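Reusing the hypothetical `pruned_components` sketch from the clustering section, the scores are immediate:

```python
import numpy as np

def outlier_scores(kept, n_points: int) -> np.ndarray:
    """phi(x) = sup{ nu(a) : x in a } over the retained components; the
    outlier score is the fuzzy complement 1 - phi(x)."""
    phi = np.zeros(n_points)
    for level, members in kept:
        for i in members:
            phi[i] = max(phi[i], level)
    return 1.0 - phi
```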

In effect we are taking the fuzzy set \((X, \varphi )\) to be a (scaled) approximation of the pdf from which the data was drawn. Because the early steps of the clustering construction use the noise-offsetting density exaggeration, outlying points have limited impact on this approximation, and hence are readily identified as outliers.

We believe this represents a powerful new approach to anomaly detection that demands further research.

5 Summary

We have presented a mathematical formulation in terms of topology and category theory that allows for simple extensions to yield algorithms for a wide variety of unsupervised learning tasks. These algorithms demonstrate state-of-the-art performance in their tasks, despite being drawn from a single unified framework. Thus the topological approaches to unsupervised learning outlined here not only provide several new algorithms, but also a rich field for further research into unsupervised learning in general.