Keywords

1 Introduction

1.1 Hierarchy and Other Symmetries in Data Analysis

On one level this chapter is about symmetries in data, such that the data represent complex phenomena, and the symmetries provide a model for understanding these complex phenomena. Hierarchy gives rise to a rich expanse of symmetries and we will be concerned mostly with hierarchy in this article.

Partitioning a set of observations [47, 70, 71] leads to some very simple symmetries. This is one approach to clustering and data mining. But such approaches, often based on optimization, are not of direct interest to us here. Instead we will pursue the theme pointed to by Simon [69], namely that the notion of hierarchy is fundamental for interpreting data and the complex reality which the data expresses. Our work is very different from the marvelous view of the development of mathematical group theory—but viewed in its own right as a complex, evolving system—presented by Foote [24].

Weyl [79] makes the case for the fundamental importance of symmetry in science, engineering, architecture, art and other areas. As a “guiding principle”, “Whenever you have to do with a structure-endowed entity …try to determine its group of automorphisms, the group of those element-wise transformations which leave all structural relations undisturbed. You can expect to gain a deep insight in the constitution of [the structure-endowed entity] in this way. After that you may start to investigate symmetric configurations of elements, i.e. configurations which are invariant under a certain subgroup of the group of all automorphisms; …” [79, p. 144].

1.2 About This Chapter

Theoretical and applied results that are based on ultrametric topology have been studied in fields such as the following:

  • In data analysis, both because of the fitting of tree structures and/or visualizations to data sets, to provide a possible way to present a range of partitions to the user, and also to provide for a genealogical model to be fit to data.

  • In physics in order to take account of phenomena at very small spatial and time scales, where it is found that discreteness of structures is represented well by p-adic number systems; and also for any systems that involve movement between discrete states that are characterized by their energy levels. p-Adic number systems can represent ultrametric topology and vice versa.

    It can be added that, as a consequence of applications in physics, the future holds much promise for ultrametric topology-based theory and analysis methods in quantum computing and quantum information theory.

  • A further field of use of ultrametric topology arises from being able to show that a considerable number of search and discovery algorithms developed in recent years have an interpretation or vantage point in terms of ultrametric topology.

    Computer programming theory also avails of ultrametrics, for example in order to have a framework for non-monotonic reasoning and for multivalued logic.

This chapter will review the state of the art in these fields and will stress the common aspects of methods and applications.

In Sect. 3, we describe ultrametric topology as an expression of hierarchy.

In Sect. 4, we look at the generalized ultrametric context. This is closely linked with analysis based on lattices.

In Sect. 5, p-adic encoding provides a number theory vantage point on ultrametric topology.

Section 6 deals with application to search and discovery, including work in massive and possibly high dimensional spaces.

2 Backgrounders on Hierarchical Clustering, p-Adic Numbers, Ultrametric Topology

2.1 A Brief Introduction to Hierarchical Clustering

For the reader new to analysis of data a brief introduction is now provided on hierarchical clustering. Along with other families of algorithm, the objective is automatic classification, for the purpose of data mining, or knowledge discovery. Classification, after all, is fundamental in human thinking and machine-based decision making. But we draw attention to the fact that our objective is unsupervised, as opposed to supervised classification, also known as discriminant analysis or (in a general way) machine learning. So here we are not concerned with generalizing the decision-making capability of training data, nor are we concerned with fitting statistical models to data so that these models can play a role in generalizing and predicting. Instead we are concerned with having “data speak for themselves”. That this unsupervised objective of classifying data (observations, objects, events, phenomena, etc.) is a huge task in our society is unquestionably true. One may think of situations when precedents are very limited, for instance.

Among families of clustering, or unsupervised classification, algorithms, we can distinguish the following: (a) array permuting and other visualization approaches; (b) partitioning to form (discrete or overlapping) clusters through optimization, including graph-based approaches; and—of interest to us in this article—(c) embedded clusters interrelated in a tree-based way.

For the last-mentioned family of algorithm, agglomerative building of the hierarchy from consideration of object pairwise distances has been the most common approach adopted. For comprehensive background texts, see [33, 34, 45, 80].

2.2 A Brief Introduction to p-Adic Numbers

The real number system and a p-adic number system for given prime, p, are potentially equally useful alternatives. p-Adic numbers were introduced by Kurt Hensel in 1898.

Whether we deal with Euclidean or with non-Euclidean geometry, we are (nearly) always dealing with reals. But the reals start with the natural numbers, and from associating observational facts and details with such numbers we begin the process of measurement. From the natural numbers, we proceed to the rationals, allowing fractions to be taken into consideration.

The following view of how we do science or carry out other quantitative study was proposed by Volovich in 1987 [76, 77]. See also the surveys in [20, 25]. We can always use rationals to make measurements. But they will be approximate, in general. It is better therefore to allow for observables being “continuous, i.e. endow them with a topology”. Therefore we need a completion of the field \(\mathbb{Q}\) of rationals. To complete the field \(\mathbb{Q}\) of rationals, we need Cauchy sequences and this requires a norm on \(\mathbb{Q}\) (because the Cauchy sequence must converge, and a norm is the tool used to show this). There is the Archimedean norm such that: for any \(x,y \in \mathbb{Q}\), with | x |  <  | y | , then there exists an integer N such that | Nx |  >  | y | . For convenience here, we write: | x |  for this norm. So if this completion is Archimedean, then we have \(\mathbb{R} = \mathbb{Q}_{\infty }\), the reals. That is fine if space is taken as commutative and Euclidean.

What of alternatives? Remarkably all norms are known. Besides the \(\mathbb{Q}_{\infty }\) norm, we have an infinity of norms, | x |  p , labelled by primes, p. By Ostrowski’s theorem [61] these are all the possible norms on \(\mathbb{Q}\). So we have an unambiguous labelling, via p, of the infinite set of non-Archimedean completions of \(\mathbb{Q}\) to a field endowed with a topology.

In all cases, we obtain locally compact completions, \(\mathbb{Q}_{p}\), of \(\mathbb{Q}\). They are the fields of \(p\)-adic numbers. All these \(\mathbb{Q}_{p}\) are continua. Being locally compact, they have additive and multiplicative Haar measures. As such we can integrate over them, such as for the reals.

2.3 Brief Discussion of p-Adic and m-Adic Numbers

We will use p to denote a prime, and m to denote a non-zero positive integer. A p-adic number is such that any set of p integers which are in distinct residue classes modulo p may be used as p-adic digits. (Cf. remark below, at the end of Sect. 5.2, quoting from [29]. It makes the point that this opens up a range of alternative notation options in practice.) Recall that a ring does not allow division, while a field does. m-Adic numbers form a ring; but p-adic numbers form a field. So a priori, 10-adic numbers form a ring. This provides us with a reason for preferring p-adic over m-adic numbers.

We can consider various p-adic expansions:

  1. 1.

    \(\sum _{i=0}^{n}a_{i}p^{i}\), which defines positive integers. For a p-adic number, we require a i  ∈ 0, 1, … p − 1. (In practice: just write the integer in binary form.)

  2. 2.

    \(\sum _{i=-\infty }^{n}a_{i}p^{i}\) defines rationals.

  3. 3.

    \(\sum _{i=k}^{\infty }a_{i}p^{i}\) where k is an integer, not necessarily positive, defines the field \(\mathbb{Q}_{p}\) of p-adic numbers.

\(\mathbb{Q}_{p}\), the field of p-adic numbers, is (as seen in these definitions) the field of p-adic expansions.

The choice of p is a practical issue. Indeed, adelic numbers use all possible values of p (see [5] for extensive use and discussion of the adelic number framework). A biotechnology example is considered as follows, by Dragovich and Dragovich [19] and Khrennikov [38]. Desoxyribonucleic acid (DNA) is encoded using four nucleotides: A, adenine; G, guanine; C, cytosine; and T, thymine. In RNA (ribonucleic acid) T is replaced by U, uracil. In [19] a 5-adic encoding is used, since 5 is a prime and thereby offers uniqueness. In [38] a 4-adic encoding is used, and a 2-adic encoding, with the latter based on 2-digit boolean expressions for the four nucleotides (00, 01, 10, 11). A default norm is used, based on a longest common prefix—with p-adic digits from the start or left of the sequence. (See Sects. 5.3 and 6.3 where a longest common prefix norm or distance is used.)

3 Ultrametric Topology

3.1 Ultrametric Space for Representing Hierarchy

Consider Fig. 1 illustrating the ultrametric distance and its role in defining a hierarchy. An early, influential paper is Johnson [37] and an important survey is that of Rammal et al. [63]. Discussion of how a hierarchy expresses the semantics of change and distinction can be found in [58].

Fig. 1
figure 1

The strong triangular inequality defines an ultrametric: every triplet of points satisfies the relationship: d(x, z) ≤ max{d(x, y), d(y, z)} for distance d. Cf. by reading off the hierarchy, how this is verified for all x, y, z: \(d(x,z) = 3.5;d(x,y) = 3.5;d(y,z) = 1.0\). In addition the symmetry and positive definiteness conditions hold for any pair of points

The ultrametric topology was introduced by Krasner [39], the ultrametric inequality having been formulated by Hausdorff in 1934. Essential motivation for the study of this area is provided by Schikhof [66] as follows. Real and complex fields gave rise to the idea of studying any field K with a complete valuation | . | comparable to the absolute value function. Such fields satisfy the “strong triangle inequality” | x + y | ≤ max( | x | , | y | ). Given a valued field, defining a totally ordered Abelian (i.e. commutative) group, an ultrametric space is induced through \(\vert x - y\vert = d(x,y)\). Various terms are used interchangeably for analysis in and over such fields such as p-adic, ultrametric, non-Archimedean, and isosceles. The natural geometric ordering of metric valuations is on the real line, whereas in the ultrametric case the natural ordering is a hierarchical or rooted tree.

3.2 Some Geometrical Properties of Ultrametric Spaces

An ultrametric space is quite different from a metric one. In an ultrametric space everything “lives” on a tree. For various properties that ensue, see [40, Chap. 0, part IV].

In an ultrametric space, all triangles are either isosceles with small base or equilateral. We have here very clear symmetries of shape in an ultrametric topology. These symmetry “patterns” can be used to fingerprint data sets and time series: see [51, 54] for many examples of this.

Some further properties that are studied in [40] are: (a) every point of a circle in an ultrametric space is a centre of the circle. (b) In an ultrametric topology, every ball is both open and closed (termed clopen). (c) An ultrametric space is zero-dimensional (see [7, 74]). It is clear that an ultrametric topology is very different from our intuitive, or Euclidean, notions. The most important point to keep in mind is that in an ultrametric space everything “lives” in a hierarchy expressed by a tree.

For an n × n matrix of positive reals, symmetric with respect to the principal diagonal, to be a matrix of distances associated with an ultrametric distance on X, a sufficient and necessary condition is that a permutation of rows and columns satisfies the following form of the matrix:

  1. 1.

    Above the diagonal term, equal to 0, the elements of the same row are non-decreasing.

  2. 2.

    For every index k, if

    $$\displaystyle{d(k,k + 1) = d(k,k + 2) =\ldots = d(k,k +\ell +1)}$$

    then

    $$\displaystyle{d(k + 1,j) \leq d(k,j)\ \text{for}\ \,k + 1 < j \leq k +\ell +1}$$

    and

    $$\displaystyle{d(k + 1,j) = d(k,j)\ \text{for}\ j > k +\ell +1}$$

    Under these circumstances,  ≥ 0 is the length of the section beginning, beyond the principal diagonal, the interval of columns of equal terms in row k.

To illustrate the ultrametric matrix format, consider the small data set shown in Table 1. A dendrogram produced from this is shown in Fig. 2. From the abscissa height of the lowest node or cluster containing the two terminals, the ultrametric distances, also termed cophenetic distances, matrix can be read off this dendrogram. This is shown in Table 2. Finally a visualization of this matrix, illustrating the ultrametric matrix properties discussed above, is shown in Fig. 3.

Table 1 Input data: eight iris flowers characterized by sepal and petal widths and lengths
Fig. 2
figure 2

Hierarchical clustering of seven iris flowers using data from Table 1. No data normalization was used. The agglomerative clustering criterion was the minimum variance or Ward one

Table 2 Ultrametric matrix derived from the dendrogram in Fig. 2
Fig. 3
figure 3

A visualization of the ultrametric matrix of Table 2, where bright or white  = highest value, and black  = lowest value

3.3 Clustering Through Matrix Row and Column Permutation

Direct clustering of the data matrix with no changing of the data matrix values—“non-destructive”, therefore—also comes under the heading of block model clustering.

Figure 3 shows how an ultrametric distance allows a certain structure to be visible (quite possibly, in practice, subject to an appropriate row and column permuting), in a matrix defined from the set of all distances. A generalization opens up for this sort of clustering-by-visualization scheme. An optimized way to do this was pursued in [43, 44]. Comprehensive surveys of clustering algorithms in this area, including objective functions, visualization schemes, optimization approaches, presence of constraints, and applications, can be found in [42, 72]. See also [17, 50].

For all these approaches, underpinning them are row and column permutations that can be expressed in terms of the permutation group, S n , on n elements.

4 The Generalized Ultrametric and Formal Concept Analysis

In this section, we consider an ultrametric defined on the power set or join semilattice. Comprehensive background on ordered sets and lattices can be found in [15]. A review of generalized distances and ultrametrics is in [67].

4.1 Link with Formal Concept Analysis

Typically hierarchical clustering is based on a distance (which can be relaxed often to a dissimilarity, not respecting the triangular inequality, and mutatis mutandis to a similarity), defined on all pairs of the object set: \(d: X \times X \rightarrow \mathbb{R}^{+}\); i.e., a distance is a positive real value. Usually we require that a distance cannot be 0-valued unless the objects are identical. That is the traditional approach.

A different form of ultrametrization is achieved from a dissimilarity defined on the power set of attributes characterizing the observations (objects, individuals, etc.) X. Here we have: d: X × X → 2J, where J indexes the attribute (variables, characteristics, properties, etc.) set.

Fig. 4
figure 4

Top: Example data set consisting of five objects, characterized by three boolean attributes. Bottom: Lattice corresponding to this data and its interpretation

This gives rise to a different notion of distance that maps pairs of objects onto elements of a join semilattice. The latter can represent all subsets of the attribute set, J. That is to say, it can represent the power set, commonly denoted 2J, of J.

As an example, consider, say, n = 5 objects characterized by three boolean (presence/absence) attributes, shown in Fig. 4 (top). Define dissimilarity between a pair of objects in this table as a set of 3 components, corresponding to the 3 attributes, such that if both components are 0, we have 1; if either component is 1 and the other 0, we have 1; and if both components are 1, we get 0. This is the simple matching coefficient [36]. We could use, e.g., Euclidean distance for each of the values sought; but we prefer to treat 0 values in both components as signaling a 1 contribution. We get then d(a, b) = 1, 1, 0 which we will call d1,d2. Then, d(a, c) = 0, 1, 0 which we will call d2, etc.

We create lattice nodes shown in Fig. 4, as follows.

The set d1,d2,d3 corresponds to: d(b, e) and d(e, f)

The subset d1,d2 corresponds to: d(a, b), d(a, f), d(b, c), d(b, f), and d(c, f)

The subset d2,d3 corresponds to: d(a, e) and d(c, e)

The subset d2 corresponds to: d(a, c)

Clusters defined by all pairwise linkage at level ≤ 2:

a, b, c, f

a, c, e

Explanation is as follows: a, b, c, f all share a 1 for attribute v 1 and a 0 for attribute v 2, or vice versa. Then a, c, e share a 1 for attribute v 2 and a 0 for attribute v 3, or vice versa. See this specification of these clusters in the middle part, “Lattice vertices found”, of Fig. 4.

Finally:

Clusters defined by all pairwise linkage at level ≤ 3:

a, b, c, e, f

In Formal Concept Analysis [15, 28], it is the lattice itself which is of primary interest. In [36] there is discussion of, and a range of examples on, the close relationship between the traditional hierarchical cluster analysis based on \(d: I \times I \rightarrow \mathbb{R}^{+}\), and hierarchical cluster analysis “based on abstract posets” (a poset is a partially ordered set), based on d: I × I → 2J. The latter, leading to clustering based on dissimilarities, was developed initially in [35].

4.2 Applications of Generalized Ultrametrics

As noted in the previous subsection, the usual ultrametric is an ultrametric distance, i.e. for a set I, \(d: I \times I\longrightarrow \mathbb{R}^{+}\). The generalized ultrametric is also consistent with this definition, where the range is a subset of the power set: d: I × I → Γ, where Γ is a partially ordered set. In other words, the generalized ultrametric distance is a set. Some areas of application of generalized ultrametrics will now be discussed.

In the theory of reasoning, a monotonic operator is rigorous application of a succession of conditionals (sometimes called consequence relations). However, negation or multiple valued logic (i.e. encompassing intermediate truth and falsehood) requires support for non-monotonic reasoning.

Thus [31]: “Once one introduces negation …then certain of the important operators are not monotonic (and therefore not continuous), and in consequence the Knaster-Tarski theorem [i.e. for fixed points; see [15]] is no longer applicable to them. Various ways have been proposed to overcome this problem. One such [approach is to use] syntactic conditions on programs …Another is to consider different operators …The third main solution is to introduce techniques from topology and analysis to augment arguments based on order …[the latter include:] methods based on metrics …on quasi-metrics …and finally …on ultrametric spaces”.

The convergence to fixed points that are based on a generalized ultrametric system is precisely the study of spherically complete systems and expansive automorphisms discussed in Sect. 5.4 below. As expansive automorphisms we see here again an example of symmetry at work. (Cf. too the quotation from Weyl at the end of Sect. 1.1.)

5 Hierarchy, Ultrametric Topology and the p-Adic Number System

A dendrogram is widely used in hierarchical, agglomerative clustering, and is induced from observed data. By expressing a dendrogram in p-adic terms, we open up a wide range of possibilities for seeing symmetries and attendant invariants.

5.1 p-Adic Numbers and Their Importance

Rizzi [65] considered ultrametrics and ultramines (i.e. an analogous topology for similarities as opposed to dissimilarities, both of which satisfy the strong triangular inequality). He also discussed the representation of ultrametrics and ultramines using p-adic numbers.

The importance of p-adic representation for physics on very small scales has been made by Volovich from the 1980s. See [20, 78]. Such scales are of the order of the Planck length, a fundamental constant (1. 6 × 10−35 m). p-Adic description of very large scales has similarly been proposed.

A hierarchy, as a branching process, is a very good means of expressing discrete energy states associated with energy basins requiring at least a requisite quantity of energy to enable a particle to move to another energy basin and possibly, energy level.

Volovich [78] poses the general principle that the fundamental physical laws should be invariant under the change of the number field. Furthermore the p-adic number fields, it is argued, have great benefit at very small scales and at very large scales. This leads to the following ambitious statement: “If these ideas are true then number theory and the corresponding branches of algebraic geometry are …the ultimate and unified physical theory”.

5.2 p-Adic Encoding of a Dendrogram

We will introduce now the one-to-one mapping of clusters (including singletons) in a dendrogram H into a set of p-adically expressed integers (a fortiori, rationals, or \(\mathbb{Q}_{p}\)). The field of p-adic numbers is the most important example of ultrametric spaces. Addition and multiplication of p-adic integers, \(\mathbb{Z}_{p}\) (cf. expression in Sect. 2.3), are well defined. Inverses exist and no zero-divisors exist.

A terminal-to-root traversal in a dendrogram or binary rooted tree is defined as follows. We use the path \(x \subset q \subset q^{{\prime}}\subset q^{{\prime\prime}}\subset \ldots q_{n-1}\), where x is a given object specifying a given terminal, and \(q,q^{{\prime}},q^{{\prime\prime}},\ldots\) are the embedded classes along this path, specifying nodes in the dendrogram. The root node is specified by the class q n−1 comprising all objects.

A terminal-to-root traversal is the shortest path between the given terminal node and the root node, assuming we preclude repeated traversal (backtrack) of the same path between any two nodes.

Fig. 5
figure 5

Labelled, ranked dendrogram on eight terminal nodes, \(x_{1},x_{2},\ldots,x_{8}\). Branches are labelled + 1 and − 1. Clusters are: \(q_{1} = (x_{1},x_{2}),q_{2} = (x_{1},x_{2},x_{3}),q_{3} = (x_{4},x_{5}),q_{4} = (x_{4},x_{5},x_{6}),q_{5} = (x_{1},x_{2},x_{3},x_{4},x_{5},x_{6}),q_{6} = (x_{7},x_{8}),q_{7} = (x_{1},x_{2},\ldots,x_{7},x_{8})\)

By means of terminal-to-root traversals, we define the following p-adic encoding of terminal nodes, and hence objects, in Fig. 5.

$$\displaystyle\begin{array}{rcl} & & x_{1}: +1 \cdot p^{1} + 1 \cdot p^{2} + 1 \cdot p^{5} + 1 \cdot p^{7} \\ & & x_{2}: -1 \cdot p^{1} + 1 \cdot p^{2} + 1 \cdot p^{5} + 1 \cdot p^{7} \\ & & x_{3}: -1 \cdot p^{2} + 1 \cdot p^{5} + 1 \cdot p^{7} \\ & & x_{4}: +1 \cdot p^{3} + 1 \cdot p^{4} - 1 \cdot p^{5} + 1 \cdot p^{7} \\ & & x_{5}: -1 \cdot p^{3} + 1 \cdot p^{4} - 1 \cdot p^{5} + 1 \cdot p^{7} \\ & & x_{6}: -1 \cdot p^{4} - 1 \cdot p^{5} + 1 \cdot p^{7} \\ & & x_{7}: +1 \cdot p^{6} - 1 \cdot p^{7} \\ & & x_{8}: -1 \cdot p^{6} - 1 \cdot p^{7} {}\end{array}$$
(1)

If we choose p = 2, the resulting decimal equivalents could be the same: cf. contributions based on + 1 ⋅ p 1 and \(-1 \cdot p^{1} + 1 \cdot p^{2}\). Given that the coefficients of the p j terms (1 ≤ j ≤ 7) are in the set \(\{-1,0,+1\}\) (implying for x 1 the additional terms: \(+0 \cdot p^{3} + 0 \cdot p^{4} + 0 \cdot p^{6}\)), the coding based on p = 3 is required to avoid ambiguity among decimal equivalents.

A few general remarks on this encoding follow. For the labelled ranked binary trees that we are considering (for discussion of combinatorial properties based on labelled, ranked and binary trees, see [49]), we require the labels + 1 and − 1 for the two branches at any node. Of course we could interchange these labels and have these + 1 and − 1 labels reversed at any node. By doing so we will have different p-adic codes for the objects, x i .

The following properties hold: (a) Unique encoding: the decimal codes for each x i (lexicographically ordered) are unique for p ≥ 3; and (b) Reversibility: the dendrogram can be uniquely reconstructed from any such set of unique codes.

The p-adic encoding defined for any object set can be expressed as follows for any object x associated with a terminal node:

$$\displaystyle{ x =\sum _{ j=1}^{n-1}c_{ j}p^{j}\ \text{where}\ c_{ j} \in \{-1,0,+1\} }$$
(2)

In greater detail we have:

$$\displaystyle{ x_{i} =\sum _{ j=1}^{n-1}c_{ ij}p^{j}\ \text{where}\ c_{ ij} \in \{-1,0,+1\} }$$
(3)

Here j is the level or rank (root: n − 1; terminal: 1), and i is an object index.

In our example we have used: \(c_{j} = +1\) for a left branch (in the sense of Fig. 5), \(c_{j} = -1\) for a right branch, and c j  = 0 when the node is not on the path from that particular terminal to the root.

A matrix form of this encoding is as follows, where { ⋅ }t denotes the transpose of the vector.

Let x be the column vector \((x_{1},x_{2},\ldots,x_{n})^{t}\).

Let p be the column vector \((p^{1},p^{2},\ldots,p^{n-1})^{t}\).

Define a characteristic matrix C of the branching codes, + 1 and − 1, and an absent or non-existent branching given by 0, as a set of values c ij where i ∈ I, the indices of the object set; and \(j \in \{ 1,2,\ldots,n - 1\}\), the indices of the dendrogram levels or nodes ordered increasingly. For Fig. 5 we therefore have:

$$\displaystyle{ C =\{ c_{ij}\} = \left (\begin{array}{rrrrrrr} 1& 1& 0& 0& 1& 0& 1\\ - 1 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0& - 1& 0& 0& 1& 0& 1\\ 0 & 0 & 1 & 1 & - 1 & 0 & 1 \\ 0& 0& - 1& 1& - 1& 0& 1\\ 0 & 0 & 0 & - 1 & - 1 & 0 & 1 \\ 0& 0& 0& 0& 0& 1& - 1\\ 0 & 0 & 0 & 0 & 0 & - 1 & - 1 \end{array} \right ) }$$
(4)

For given level j, ∀i, the absolute values | c ij  | give the membership function either by node, j, which is therefore read off columnwise or by object index, i, which is therefore read off rowwise.

The matrix form of the p-adic encoding used in Eqs. (2) or (3) is:

$$\displaystyle{ \mathbf{x} = C\mathbf{p} }$$
(5)

Here, x is the decimal encoding, C is the matrix with dendrogram branching codes (cf. example shown in expression (4)), and p is the vector of powers of a fixed prime p.

The tree encoding exemplified in Fig. 5, and defined with coefficients in Eqs. (2) or (3), (4) or (5), with labels + 1 and − 1 was required (as opposed to the choice of 0 and 1, which might have been our first thought) to fully cater for the ranked nodes (i.e. the total order, as opposed to a partial order, on the nodes).

We can consider the objects that we are dealing with to have equivalent integer values. To show that, all we must do is work out decimal equivalents of the p-adic expressions used above for \(x_{1},x_{2},\ldots\). As noted in [29], we have equivalence between: a p-adic number; a p-adic expansion and an element of \(\mathbb{Z}_{p}\) (the p-adic integers). The coefficients used to specify a p-adic number, [29] notes (p. 69), “must be taken in a set of representatives of the class modulo p. The numbers between 0 and p − 1 are only the most obvious choice for these representatives. There are situations, however, where ther choices are expedient”.

We note that the matrix C is used in [14]. A somewhat trivial view of how “hierarchical trees can be perfectly scaled in one dimension” (the title and theme of [14]) is that p-adic numbering is feasible, and hence a one-dimensional representation of terminal nodes is easily arranged through expressing each p-adic number with a real number equivalent.

In [46], what is termed a nest (i.e. cluster nesting) indicator function is defined, based on the set \(\{a_{w},-b_{w},0\},a_{w},b_{w} \in \mathbb{R}^{+}\) in the same way that the set {1, −1, 0} is used above for the matrix C. Orthonormality properties of the nest indicator functions are studied.

5.3 p-Adic Distance on a Dendrogram

We will now induce a metric topology on the p-adically encoded dendrogram, H. It leads to various symmetries relative to identical norms, for instance, or identical tree distances.

We use the following longest common subsequence, starting at the root: we look for the term p r in the p-adic codes of the two objects, where r is the lowest level such that both sequences have non-zero, i.e. + 1 or − 1 coefficients, for the p r term.

Let us look at the set of p-adic codes for \(x_{1},x_{2},\ldots\) above (Fig. 5 and relations (1)), to give some examples of this.

  • For x 1 and x 2, we find the term we are looking for to be p 1, and so r = 1.

  • For x 1 and x 5, we find the term we are looking for to be p 5, and so r = 5.

  • For x 5 and x 8, we find the term we are looking for to be p 7, and so r = 7.

Having found the value r, the distance is defined as p r [3, 29].

This longest common prefix metric is also known as the Baire distance. In topology the Baire metric is defined on infinite strings [41]. It is more than just a distance: it is an ultrametric bounded from above by 1, and its infimum is 0 which is relevant for very long sequences, or in the limit for infinite-length sequences. The use of this Baire metric is pursued in [60] based on random projections [75], and providing computational benefits over the classical O(n 2) hierarchical clustering based on all pairwise distances. This is further discussed in Sect. 6.3.

The longest common prefix metric leads directly to a p-adic hierarchical classification (cf. [4]). This is a special case of the “fast” hierarchical clustering to be discussed in Sect. 6.3.

Compared to the longest common prefix metric, there are other related forms of metric, and simultaneously ultrametric. In [27], the metric is defined via the integer part of a real number. In [3], for integers x, y we have: \(d(x,y) = 2^{-\text{order}_{p}(x-y)}\) where p is prime, and order p (i) is the exponent (non-negative integer) of p in the prime decomposition of an integer. Furthermore, let S(x) be a series: \(S(x) =\sum _{i\in \mathbb{N}}a_{i}x^{i}\). (\(\mathbb{N}\) is the set of natural numbers.) The order of S(x) is the rank of its first non-zero term: order\((S) =\inf \{ i: i \in \mathbb{N};a_{i}\neq 0\}\). (The series that is all zero is of order infinity.) Then the ultrametric similarity between series is: \(d(S,S^{{\prime}}) = 2^{-\text{order}(S-S^{{\prime}}) }\).

5.4 Scale-Related Symmetry

Scale-related symmetry is very important in practice. In this subsection we introduce an operator that provides this symmetry. We also term it a dilation operator, because of its role in the wavelet transform on trees (see [55] for discussion and examples). This operator is p-adic multiplication by 1∕p.

Consider the set of objects {x i  | i ∈ I} with its p-adic coding considered above. Take p = 2. (Non-uniqueness of corresponding decimal codes is not of concern to us now, and taking this value for p is without any loss of generality.) Multiplication of \(x_{1} = +1 \cdot 2^{1} + 1 \cdot 2^{2} + 1 \cdot 2^{5} + 1 \cdot 2^{7}\) by \(1/p = 1/2\) gives: \(+1 \cdot 2^{1} + 1 \cdot 2^{4} + 1 \cdot 2^{6}\). Each level has decreased by one, and the lowest level has been lost. Subject to the lowest level of the tree being lost, the form of the tree remains the same. By carrying out the multiplication-by-1∕p operation on all objects, it is seen that the effect is to rise in the hierarchy by one level.

Let us call product with 1∕p the operator A. The effect of losing the bottom level of the dendrogram means that either (1) each cluster (possibly singleton) remains the same; or (2) two clusters are merged. Therefore the application of A to all q implies a subset relationship between the set of clusters {q} and the result of applying A, {Aq}.

Repeated application of the operator A gives Aq, A 2 q, A 3 q, \(\ldots\). Starting with any singleton, i ∈ I, this gives a path from the terminal to the root node in the tree. Each such path ends with the null element, which we define to be the p-adic encoding corresponding to the root node of the tree. Therefore the intersection of the paths equals the null element.

Benedetto and Benedetto [1, 2] discuss A as an expansive automorphism of I, i.e. form-preserving, and locally expansive. Some implications [2] of the expansive automorphism follow. For any q, let us take \(q,Aq,A^{2}q,\ldots\) as a sequence of open subgroups of I, with \(q \subset Aq \subset A^{2}q \subset \ldots\), and \(I =\bigcup \{ q,Aq,A^{2}q,\ldots \}\). This is termed an inductive sequence of I, and I itself is the inductive limit [64, p. 131].

Each path defined by application of the expansive automorphism defines a spherically complete system [27, 66, 74], which is a formalization of well-defined subset embeddedness. Such a methodological framework finds application in multi-valued and non-monotonic reasoning, as noted in Sect. 4.2.

6 Exploiting Ultrametric Embeddings for Search and Discovery

6.1 Remarkable Symmetries in Very High Dimensional Spaces

In the work of [62, 63] it was shown how as ambient dimensionality increased distances became more and more ultrametric. That is to say, a hierarchical embedding becomes more and more immediate and direct as dimensionality increases. A better way of quantifying this phenomenon was developed in [51]. What this means is that there is inherent hierarchical structure in high dimensional data spaces.

It was shown experimentally in [51, 62, 63] how points in high dimensional spaces become increasingly equidistant with increase in dimensionality. Both [30] and [18] study Gaussian clouds in very high dimensions. The latter finds that “not only are the points [of a Gaussian cloud in very high dimensional space] on the convex hull, but all reasonable-sized subsets span faces of the convex hull. This is wildly different than the behavior that would be expected by traditional low-dimensional thinking”.

That very simple structures come about in very high dimensions is not as trivial as it might appear at first sight. Firstly, even very simple structures (hence with many symmetries) can be used to support fast and perhaps even constant time worst case proximity search [51]. Secondly, as shown in the machine learning framework by Hall et al. [30], there are important implications ensuing from the simple high dimensional structures. Thirdly, [56] shows that very high dimensional clustered data contain symmetries that in fact can be exploited to “read off” the clusters in a computationally efficient way. Fourthly, following [16], what we might want to look for in contexts of considerable symmetry are the “impurities” or small irregularities that detract from the overall dominant picture.

See Table 3 exemplifying the change of topological properties as ambient dimensionality increases. It behoves us to exploit the symmetries that arise when we have to process very high dimensional data.

Table 3 Typical results, based on 300 sampled triangles from triplets of points

6.2 Partial Ultrametric Embedding

In [57], we discuss permutation representations of a data stream. Since hierarchies can also be represented as permutations, there is a ready way to associate data streams with hierarchies. In fact, early computational work on hierarchical clustering used permutation representation to great effect (cf. [68]).

To analyse data streams in this way, in [54] we develop an approach to ultrametric embedding of time-varying signals, including biomedical, meteorological, financial and other. As opposed to the classical way of inducing a hierarchy, through use of an agglomerative hierarchical clustering algorithm, we look for the ultrametric relationship—the strong triangular inequality—and, when found, count such particular cases of adherence to inherent hierarchical properties in the data. The most non-ultrametric time series are found to be chaotic ones. Eyegaze trace data was found to be remarkably high in ultrametricity, which are likely to be due to extreme saccade movements. Some initial questions were raised in that work [54] in regard to the EEG data used, for sleeping, petit mal and irregular epilepsy cases.

This work has been pursued by Khrennikov and his colleagues in modelling multi-agent systems. See [22]. Furthermore this work uses Bose–Einstein and Fermi–Dirac statistical distributions (derived from quantum statistics of energy states of bosons and fermions, i.e. elementary particles with integer, and half odd integer, spin). In [21, 22] multi-agent behaviours are modelled using such energy distributions. The framework is an urn model, where balls can move, with loss of energy over time, and with possibilities to receive input energy, but potentially shared with other balls. See the cited works for a full description of the Monte Carlo system set up. Sequences of actions (and moves), viz. their histories, are coded such that triangle properties can be investigated (cf. also [54]). That leads to a characterization of how ultrametrically embeddable the data is, ab initio (and not through imposing any hierarchical or other structure on the data with retrospective goodness of fit assessment). In [22], the case is presented for such analysis of behavioural histories being important for study of social and economic complexity.

Quantum statistical distributions have been noted in the foregoing work [21, 22]. van Rijsbergen [73] has set out various ways in which a quantum physics formalism makes clearer what is being done in information retrieval and in data analysis generally.

The quantifying of the inherent ultrametric content of text, and finding that some are much more inherently hierarchical than others, was pursued in [59]. As data, the following were used: tales from the Brothers Grimm, Jane Austen novels, dream reports, air accident reports, and James Joyce’s Ulysses.

6.3 Ultrametric Baire Space and Distance

A Baire space consists of countably infinite sequences with a metric defined in terms of the longest common prefix: the longer the common prefix, the closer a pair of sequences. This longest common prefix metric allows us to define the Baire distance [11, 48, 60]. In this description of the Baire distance, we consider to begin with scalar or univariate values. Below we will generalize to multivariate data such as in the case, for example, of documents with presence or absence, or frequencies of occurrence, on a term set or some other features.

Take the longest common prefixes at issue here as coming from precision of any value. Without loss of generality, take these values as decimal, i.e. base 10, or m-adic with m = 10. We take x and y to be bounded by 0 and 1. Each of them is of some precision, and we take the integer \(\left \vert K\right \vert \) to be the maximum precision.

Thus we consider ordered sets x k and y k for k ∈ K, or, we will write, for k = 1, \(2,\ldots,\left \vert K\right \vert \). The cardinality of the set K is the precision with which a number, x, is measured. So, x k with k = 1 is the first decimal place of precision; with k = 2, we have the second decimal place;...; and with \(k = \left \vert K\right \vert \) we have the \(\left \vert K\right \vert \) th decimal place.

Consider as examples x = 0. 478; and y = 0. 472. Start from the first decimal position. For k = 1, we find \(x_{1} = y_{1} = 4\). For k = 2, \(x_{2} = y_{2} = 7\). But for k = 3, x 3y 3.

We now introduce the following distance (case of x and y, with 1 attribute, hence unidimensional, and where subscript k is the digit of precision):

$$\displaystyle{ d_{\mathrm{B}}(x,y) \equiv d_{\mathrm{B}}(x_{K},y_{K}) = \left \{\begin{array}{ll} 1 &\;\;\text{if}\;\;x_{1}\neq y_{1} \\ \text{inf}\;\;10^{-k}&\;\;\;\;\;\;x_{k} = y_{k},\;\;\;1 \leq k \leq \left \vert K\right \vert \end{array} \right. }$$
(6)

We call this d B value Baire distance, which can be shown to be an ultrametric [5154, 60] distance. In the properties of a metric we generally have \(d(x,y) = 0\ \text{iff}x = y\) whereas for the Baire distance this reflexivity property is relaxed by having the 0 value replaced by the definably minimal value.

When dealing with binary data, 2 is a convenient base. In definition (6) we used a base of 10 for ease of coding when working with real numbers.

It is seen that this distance splits a unidimensional string of decimal values into a 10-way hierarchy, in which each leaf is associated with a grid cell. From Eq. (6) we can read off the distance between points assigned to the same grid cell. All pairwise distances of points assigned to the same cell are the same.

Relative to agglomerative hierarchical clustering, the Baire-based hierarchy is such that each node of this tree is associated with a grid (more strictly, in what we have described, interval) cell. Cell assignments at a particular level can be used to count the number of values x, y that are associated with that cell, and these counts define local density. As we have described the inducing of a hierarchy, this has been in a top-down manner (cf. how agglomerative hierarchical clustering, in that it is agglomerative, is consequently bottom-up). It follows from this algorithm that we can read the hierarchy off the data in a single scan, by having the target data structure—here, a regular 10-way tree—and assigning each value to all its appropriate nodes in the tree. For ease of characterizing this tree, or hierarchical clustering, we refer to it as a Baire tree or Baire hierarchy. The minimum Baire distance corresponds to a partial match of the values at each level [13].

For data with higher dimensionality, random projections can be used. Random projection is simple. Forming the random matrix R and projecting the d × N data matrix X into the k dimensions is of order O(dkN). If X is sparse with c non-zero entries per column, the complexity is of order O(ckN).

In fact random projection can be seen as a class of hashing function. Hashing is much faster than alternative methods because it avoids the pairwise comparisons required for partitioning and classification. If two points (p, q) are close, they will have a very small \(\|p - q\|\) (Euclidean metric) value; and they will hash to the same value with high probability; if they are distant, they should collide with small probability.

Clustering using the Baire distance has been successfully applied to areas such as chemoinformatics [60], astronomy [12] and text retrieval [10].

In [60], this principle of binning data is used on a large, high dimensional chemoinformatics data set. The application of merging databases of chemical compounds is important. In [60] 1.2 million compounds were used, characterized using a particular coding scheme by 1052-valued presence/absence vector. Use of the Baire distance and the associated hierarchical clustering were compared in detail with k-means partitioning (through partitions derived from the hierarchical clustering).

We studied stability of results, and effectiveness relative to other clustering methods, in particular k-means partitioning, in [12]. The main domain of application in that work was astronomy, and in particular clustering of redshifts in order to facilitate regression of (more expensively observed but better quality) spectroscopic redshifts on (more easily observed but with less signal resolution) photometric redshifts.

6.4 Approximating an Ultrametric for Similarity Metric Space Searching

In [51] we show that, in much work over the years, nearest neighbour searching has been made more efficient through the use of more easily determined feasibility bounds. An early example is Fukunaga and Narendra [26], a chapter review is in [50], and a survey can be found in [9]. Rendering given distances as ultrametric is a powerful way to facilitate nearest neighbour searching. Furthermore “stretching the triangular inequality” (a phrase used by [8]) so that it becomes the strong triangular inequality, or ultrametric inequality, gives a unifying view of some algorithms of this type.

Fast nearest neighbour finding often makes use of pivots to establish bounds on points to be searched, and points to be bypassed as infeasible [6, 9].

A full discussion can be found in [51]. Fast nearest neighbour searching in metric spaces often appeals to heuristics. The link with ultrametric spaces gives rise instead to a unifying view.

Hjaltason and Samet [32] discuss heuristic nearest neighbour searching in terms of embedding the given metric space points in lower dimensional spaces. From our discussion in this section, we see that there is evidently another alternative direction for facilitating fast nearest neighbour searching: viz., taking the metric space as an ultrametric one, and if it does not quite fit this perspective, then “stretch” it (transform it locally) so that it does so.

7 Conclusions

There are many exciting perspectives opened up by our work on the theme of symmetry-finding through hierarchy in very large data collections, with insights and perspectives from many application domains that are data-based and motivated, and indeed driven, by complex problem-solving.

“My thesis has been that one path to the construction of a nontrivial theory of complex systems is by way of a theory of hierarchy”. Thus Simon [69, p. 216]. We have noted symmetry in many guises in the representations used, in the transformations applied, and in the transformed outputs. These symmetries are non-trivial too, in a way that would not be the case were we simply to look at classes of a partition and claim that cluster members were mutually similar in some way. We have seen how the p-adic or ultrametric framework provides significant focus and commonality of viewpoint.

Furthermore we have highlighted the computational scaling properties of our algorithms. They are fully capable of addressing the data and information deluge that we face and providing us with the best interpretative and decision-making tools. The full elaboration of this last point is to be sought in each and every application domain, and face to face with old and new problems.