
1 Introduction

The self-organizing map (SOM) algorithm [1] has proven, over the years, to be a powerful and convenient tool for clustering and visualizing data. While the original algorithm was designed for numerical vectors, the data available in applications have become more and more complex, frequently being too rich to be described by a fixed set of numerical attributes only. This is the case, for example, when data are described by relations between objects (individuals involved in a social network) or by measures of resemblance/dissimilarity (professional trajectories).

During the past twenty years, the SOM algorithm has been extended to handle relational data, either described by kernels (see [2] for the online version and [3] for the batch version) or by dissimilarities (see [4] for the online version and [5] for the batch version). All these extensions are based on the same underlying principle: the dissimilarity or the kernel implicitly defines a Euclidean (or pseudo-Euclidean) space in which the prototypes can be expressed as convex combinations of the embedded input data. However, when the goal is to explore large data sets, the relational approaches may rapidly become infeasible. Indeed, complex relational data often have a large dimensionality. Moreover, kernel and relational SOM rely on the knowledge of the dissimilarity matrix for the entire data set, which leads to at least quadratic complexity for the algorithms. As stressed in [5], the algorithms will be slow for data sets with 10,000 observations and impossible to run on a normal computer for 100,000 input data. In addition to the complexity issue, expressing prototypes as convex combinations of the entire data set has a second drawback, as emphasized in [6]: the interpretability of the prototypes and of the model is lost.

In order to tackle these two issues, several approaches were introduced for relational data, all of them seeking a sparse representation of the prototypes and a computational cost that is linear in the number of observations. [7] use the natural sparsity of the prototypes in batch relational k-means in order to reduce the complexity. The natural sparsity is enhanced by selecting the K (K fixed) closest inputs to each prototype. In [5], the complexity is reduced using iterative “patch clustering”. First, the data are split into P patches of size \(n_{P}\) (P fixed). A prototype-based clustering algorithm in batch version (neural gas or SOM) is then run on a patch \(P_{t}\) and the resulting prototypes, which may be viewed as compressed representations of the data already seen, are added as new data points to the next patch, \(P_{t+1}\). Moreover, the full vector of coefficients is replaced by the K closest input data (K fixed). With this method, linear time and a constant space representation are obtained. Another technique consists in using the Nyström approximation [8] for the dissimilarity matrix. This technique also leads to a computational cost that is linear in the number of input data, but it is strongly dependent on the intrinsic dimensionality of the given dissimilarity matrix, which has to be of low rank and entirely known in advance. All the approaches cited above are batch algorithms.
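To make the last technique concrete, the following minimal sketch illustrates the general Nyström idea, assuming a symmetric dissimilarity (or kernel) matrix and landmark columns chosen uniformly at random; the function name, sampling scheme and example data are illustrative and not taken from [8].

```python
import numpy as np

def nystrom_approximation(D, m, rng=None):
    """Rank-m Nystrom approximation of a symmetric N x N matrix D,
    built from m randomly chosen landmark columns: D ~ C W^+ C^T."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    landmarks = rng.choice(n, size=m, replace=False)
    C = D[:, landmarks]                      # N x m block of sampled columns
    W = D[np.ix_(landmarks, landmarks)]      # m x m block between landmarks
    return C @ np.linalg.pinv(W) @ C.T       # low-rank reconstruction

# Example: a squared Euclidean dissimilarity matrix (low intrinsic rank),
# for which the approximation is accurate with few landmarks.
X = np.random.default_rng(0).normal(size=(500, 3))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
D_approx = nystrom_approximation(D, m=50, rng=0)
```

As the sketch suggests, the quality of the reconstruction hinges on the landmark block capturing the (low) rank of the full matrix, which is why the technique depends on the intrinsic dimensionality of the dissimilarity matrix.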

In the online framework, [9] propose a bagging approach for kernel SOM. The data are split into B subsamples of size \(n_{B}\) (B fixed), the online kernel SOM is trained on each subsample and, after training, the K most representative observations are chosen for each prototype (K fixed). A final map is then trained on the selected representative observations. The algorithm has the advantage of being parallelizable, although it does not exploit all the advantages of an online implementation.

In the present paper, we propose a sparse version of the online relational SOM algorithm, which takes further advantage of the online setting. Instead of expressing prototypes as convex combinations of the entire data set from the beginning, the size and the composition of the prototypes are sequentially increased with each new input fed to the algorithm. When the size of the prototypes becomes too large, prototypes are made sparse by deleting all the insignificant coefficients. Different approaches for selecting the most interesting observations are reported in [6]. In this manuscript, we use a slightly different technique: the coefficients are interpreted as a probability distribution and the most probable observations are selected, i.e., a global probability mass \(\nu \) is fixed and the largest coefficients summing to \(\nu \) are kept. In this way, the prototypes are given more flexibility: they are no longer represented by a fixed number K of observations, but by the number of observations needed for an “almost complete” description of the prototype (if \(\nu \) is chosen close to 1).

The rest of the paper is organized as follows: Sect. 2 recalls the online relational SOM, while Sect. 3 introduces its sparse version. The equivalent algorithm for kernels is briefly described in Sect. 4, while Sect. 5 contains some examples on real data sets.

2 Online Relational SOM

In this section we briefly recall the principles of the online relational SOM (RSOM) algorithm, as introduced in [4]. Throughout the rest of the paper, let us suppose that the input data, \(x_{1},\ldots ,x_{N}\), belong to some arbitrary space \(\mathcal {G}\) and can be described through a dissimilarity measure \(\delta \), such that \(\delta _{ij}=\delta \left( x_{i},x_{j}\right) \). The dissimilarity measure is assumed to satisfy some basic properties: symmetry \(\left( \delta _{ij}=\delta _{ji}\right) \) and non-negativity \(\left( \delta _{ij}\ge 0\right) \), for all \(i,j=1,\ldots ,N\), as well as \(\delta _{ii}=0\), for all \(i=1,\ldots ,N\).

The online RSOM algorithm aims at mapping the input data onto a low-dimensional grid (usually a two-dimensional rectangle), composed of U units, each of them described by a prototype \(p_{u}\), \(u=1,\ldots ,U\). The units are linked together by a neighborhood relationship H, expressed as a function of the distance between the units on the grid, \(d\!\left( u,u^{\prime }\right) \). The distance on the grid, d, may be chosen, for example, as the length of the shortest path between the units. The U prototypes are initialized either as random convex combinations of the input data or randomly among the input data.
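As an illustration, here is a minimal sketch of this grid structure, assuming a rectangular grid, the Manhattan (shortest-path) distance between unit coordinates, and a Gaussian neighborhood profile; these are common choices, given as assumptions rather than the exact ones used in the cited implementations.

```python
import numpy as np

def grid_distances(n_rows, n_cols):
    """Pairwise distances d(u, u') between the units of a rectangular grid,
    here the shortest-path (Manhattan) distance between unit coordinates."""
    coords = np.array([(i, j) for i in range(n_rows) for j in range(n_cols)])
    return np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)

def neighborhood(dist_on_grid, radius):
    """A neighborhood kernel H with H(0) = 1 that vanishes as its argument
    grows (a Gaussian profile is one common choice)."""
    return np.exp(-(dist_on_grid ** 2) / (2.0 * radius ** 2))

d_grid = grid_distances(5, 5)          # 25 x 25 distance matrix for a 5 x 5 map
H = neighborhood(d_grid, radius=2.0)   # neighborhood values between all pairs of units
```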

The extension of the original SOM algorithm is based on two key ideas:

  • First, prototypes are written as (symbolic) convex combinations of the input data, \(p_{u}=\sum _{i=1}^{N}\beta _{u,i}x_{i}\), with \(\beta _{u,i}\ge 0\) and \(\sum _{i=1}^{N}\beta _{u,i}=1\), for all \(u=1,\ldots ,U\). This definition is justified by the fact that such a dissimilarity can be viewed as deriving from a dot product between the images of the data by a mapping function \(\phi \) into a pseudo-Euclidean space [10]: the prototypes are thus truly convex combinations of \((\phi (x_i))_i\) in this space (see [4, 5] for further explanations).

  • Second, the distance between an input observation \(x_{i}\) and a prototype \(p_{u}\) can be written only in terms of the dissimilarity matrix of the input data and the coefficients \(\beta _{u,i}\) as follows:

    $$\begin{aligned} \Vert x_{i}-p_{u}\Vert ^{2}=\mathbf {\Delta }_{i}\beta _{u}^{T}-\frac{1}{2}\beta _{u}\mathbf {\Delta }\beta _{u}^{T} \ , \end{aligned}$$
    (1)

    where \(\mathbf {\Delta }=\left( \delta _{ij}\right) _{i,j=1,\ldots ,N}\), \(\mathbf {\Delta }_{i}\) represents the i-th row of the matrix \(\mathbf {\Delta }\) and \(\beta _{u}=\left( \beta _{u,1},\ldots ,\beta _{u,N}\right) \) is the vector of coefficients for the prototype \(p_{u}\).

Expressing the prototypes as convex combinations of the input data and computing the distances between observations and prototypes as in Eq. (1) is, in fact, a generalization of the original SOM algorithm. Indeed, one can easily see that the two are equivalent if the dissimilarity \(\delta \) is the squared Euclidean distance and if the prototypes of the original SOM are initialized within the convex hull of the input data.
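The distance of Eq. (1) only requires the dissimilarity matrix and the coefficient vectors. A minimal sketch, assuming the coefficients are stored as a \(U\times N\) matrix whose rows sum to one, could read as follows (names and example data are illustrative).

```python
import numpy as np

def prototype_distances(Delta, Beta):
    """Squared distances ||x_i - p_u||^2 of Eq. (1) for all observations and
    all prototypes, given the N x N dissimilarity matrix Delta and the U x N
    coefficient matrix Beta (rows are convex combinations)."""
    cross = Delta @ Beta.T                                          # term Delta_i beta_u^T
    self_term = 0.5 * np.einsum('un,nm,um->u', Beta, Delta, Beta)   # term beta_u Delta beta_u^T / 2
    return cross - self_term[None, :]                               # N x U matrix of distances

# Example with the squared Euclidean dissimilarity, for which Eq. (1)
# reduces to the usual SOM distance for prototypes in the convex hull.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Delta = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Beta = rng.dirichlet(np.ones(100), size=25)    # 25 random convex combinations
dist = prototype_distances(Delta, Beta)        # 100 x 25 distance matrix
```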

This general framework allowing an elegant writing of the algorithm for complex data described by dissimilarities was introduced initially for the online version of kernel SOM (KSOM) in [2]. Afterwards, extensions and rediscoveries were described for batch relational SOM in [5], batch kernel SOM in [3] and online relational SOM in [4]. A detailed and complete comparison of these methods and their equivalences may be found in [11].

The distance computation in Eq. (1) may be theoretically justified in the very general setting of dissimilarities by extending the Hilbert embedding for kernels to a pseudo-Euclidean embedding, as shown, for example, in [5].

The online relational SOM algorithm is summarized in Algorithm 1. The neighborhood function H is assumed to satisfy \(H:\mathbb {R}\rightarrow \mathbb {R}\), \(H(0)=1\) and \(\lim _{x\rightarrow +\infty }H(x)=0\). In the setting of Algorithm 1, \(H^{t}\) decreases piecewise linearly, while \(\mu (t)\) vanishes at the rate \(\frac{1}{t}\).

[Algorithm 1: Online relational SOM]
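Algorithm 1 itself is not reproduced here; the following sketch of its main loop reuses the helpers grid_distances and neighborhood introduced above and assumes a linearly decreasing neighborhood radius and a learning rate \(\mu (t)=1/t\). The schedules and default values are illustrative assumptions, not the exact settings of [4].

```python
import numpy as np

def online_rsom(Delta, n_rows=5, n_cols=5, n_iter=2500, rng=None):
    """Sketch of the main loop of the online relational SOM (Algorithm 1).

    Delta is the full N x N dissimilarity matrix; the prototypes are stored
    as a U x N matrix Beta of convex-combination coefficients."""
    rng = np.random.default_rng(rng)
    n, U = Delta.shape[0], n_rows * n_cols
    d_grid = grid_distances(n_rows, n_cols)        # helper from the sketch in Sect. 2
    radius0 = max(n_rows, n_cols) / 2.0
    # random initialization of the prototypes among the input data
    Beta = np.zeros((U, n))
    Beta[np.arange(U), rng.choice(n, U, replace=False)] = 1.0
    for t in range(1, n_iter + 1):
        k = rng.integers(n)                        # one observation chosen at random
        # assignment step: best matching unit according to Eq. (1)
        dists = Delta[k] @ Beta.T - 0.5 * np.einsum('un,nm,um->u', Beta, Delta, Beta)
        bmu = int(np.argmin(dists))
        # update step: move every prototype toward the (symbolic) observation
        radius = radius0 * (1.0 - t / n_iter) + 1e-2   # decreasing neighborhood radius
        h = neighborhood(d_grid[bmu], radius)          # helper from the sketch in Sect. 2
        one_hot = np.zeros(n)
        one_hot[k] = 1.0
        Beta += (h / t)[:, None] * (one_hot - Beta)    # learning rate mu(t) = 1/t
    return Beta
```

Note that the update keeps each row of Beta a convex combination, since the new coefficients are a convex combination of the old ones and of the indicator vector of the selected observation.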

3 Sparse Online Relational SOM

Similarly to relational SOM, prototypes are written as convex combinations of the observations but, in this case, they are restricted to the input data already fed to the algorithm and, more particularly, to the most significant of them. In order to guarantee the sparsity of this representation, as well as properties similar to those of the original online relational SOM algorithm, several issues have to be addressed.

  1.

    Prototypes have to be initialized at random among the input data. Hence, the observations have to be presented to the algorithm in random order. The first U observations are then used as initial values for the U prototypes.

  2.

    The dissimilarity between a new input data and a prototype, written as a convex combination of the most significant past observations, has to be computed. This can be achieved using the following formula \(\Vert x_{k}-p_{u}\Vert ^{2} = \sum _{j\in I(t)}\beta _{u,j}\delta \!\left( x_{k},x_{j}\right) - \frac{1}{2} \sum _{i\in I(t)}\sum _{j\in I(t)} \beta _{u,i} \beta _{u,j}\delta \!\left( x_{i},x_{j}\right) \), where \(p_{u}=\sum _{j\in I(t)}\beta _{u,j}x_{j}\) and I(t) contains the indices of the most significant inputs already fed to the algorithm before \(x_{k}\) is chosen.

  3.

    Prototypes are sparse combinations of the input data. Hence, prototypes are periodically updated and only the most significant coefficients are retained. The updates may be performed throughout the iterations using either a deterministic design (the number of updates is fixed and the updates are uniformly distributed during the learning of the map) or a random design (the updates are distributed according to some geometric distribution, whose parameter may depend on the total number of iterations and on the size of the neighborhood). Sparsity could be achieved by selecting the Q largest coefficients, where Q is a fixed integer. However, in order to allow for more flexibility in the expression and interpretability of the prototypes, the most significant coefficients are selected according to their value, by fixing a threshold: let \(0<\nu \le 1\) be the selected threshold (if \(\nu =1\), the algorithm is no longer sparse and reduces to the original one). A sketch of this sparsification step is given after this list.

    For \(u=1,\ldots ,U\), the coefficients of each prototype are sorted in descending order, \(\beta _{u,(1)},\ldots ,\beta _{u,(\sharp I(t))}\), where \(\beta _{u,(1)}=\max _{i\in I(t)}\beta _{u,i}\) and \(\beta _{u,(\sharp I(t))} =\min _{i\in I(t)}\beta _{u,i}\). Let \(N_{u}\) be the smallest integer \(n\in \{1,\ldots ,\sharp I(t)\}\) such that \(\sum _{i=1}^{n}\beta _{u,(i)}\ge \nu \). The most significant coefficients are then updated as follows

    $$ \beta _{u,(i)}=\left\{ \begin{array}{ll} \frac{\beta _{u,(i)}}{\sum _{j=1}^{N_{u}}\beta _{u,(j)}} & \text{ if } i\le N_{u}, \\ 0 & \text{ if } i>N_{u}. \end{array} \right. $$
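A minimal sketch of this sparsification step, assuming the coefficients of each prototype are stored as one row of a matrix; the renormalization follows the formula above.

```python
import numpy as np

def sparsify(Beta, nu=0.99):
    """For each prototype (row of Beta), keep the largest coefficients whose
    cumulative mass reaches the threshold nu, set the others to zero and
    renormalize the kept coefficients (Sect. 3, item 3)."""
    Beta_sparse = np.zeros_like(Beta)
    for u in range(Beta.shape[0]):
        order = np.argsort(Beta[u])[::-1]            # coefficients in descending order
        cumulated = np.cumsum(Beta[u][order])
        n_u = np.searchsorted(cumulated, nu) + 1     # smallest n with cumulative mass >= nu
        kept = order[:n_u]
        Beta_sparse[u, kept] = Beta[u, kept] / Beta[u, kept].sum()
    return Beta_sparse
```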

The sparse online relational SOM algorithm is summarized in Algorithm 2.

[Algorithm 2: Sparse online relational SOM]
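The following sketch of the main loop of Algorithm 2 reuses grid_distances, neighborhood and sparsify from the previous sketches. The presentation order of the observations, the sparsification schedule and the default values are illustrative assumptions.

```python
import numpy as np

def sparse_online_rsom(Delta, n_rows=5, n_cols=5, n_iter=2500, nu=0.9,
                       n_sparsify=10, rng=None):
    """Sketch of the main loop of the sparse online relational SOM
    (Algorithm 2); coefficients are only stored for the observations
    already presented to the algorithm (the index set I(t))."""
    rng = np.random.default_rng(rng)
    n, U = Delta.shape[0], n_rows * n_cols
    d_grid = grid_distances(n_rows, n_cols)          # helper from the sketch in Sect. 2
    radius0 = max(n_rows, n_cols) / 2.0
    order = rng.permutation(n)                       # observations presented in random order
    seen = list(order[:U])                           # I(t): indices already fed to the algorithm
    Beta = np.eye(U)                                 # initialization on the first U observations
    for t in range(1, n_iter + 1):
        k = int(order[(U + t - 1) % n])              # next observation of the stream
        sub = Delta[np.ix_(seen, seen)]
        # assignment step: Eq. (1) restricted to the indices in I(t)
        dists = Delta[k, seen] @ Beta.T - 0.5 * np.einsum('ui,ij,uj->u', Beta, sub, Beta)
        bmu = int(np.argmin(dists))
        if k not in seen:                            # the new observation enlarges the support
            seen.append(k)
            Beta = np.hstack([Beta, np.zeros((U, 1))])
        one_hot = np.zeros(len(seen))
        one_hot[seen.index(k)] = 1.0
        radius = radius0 * (1.0 - t / n_iter) + 1e-2
        h = neighborhood(d_grid[bmu], radius)        # helper from the sketch in Sect. 2
        Beta += (h / t)[:, None] * (one_hot - Beta)
        if t % max(1, n_iter // n_sparsify) == 0:    # periodic sparsification (Sect. 3, item 3)
            Beta = sparsify(Beta, nu)
            # a full implementation would also drop from `seen` the indices whose
            # coefficients are zero for every prototype, to keep the storage sparse
    return Beta, seen
```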

4 The Kernel Version

In some cases, data may be described by a kernel K instead of a dissimilarity. We recall that a kernel is a symmetric similarity \(\left( K\!\left( x_{i},x_{j}\right) =K\!\left( x_{j},x_{i}\right) \right) \) which satisfies the following positivity constraint: \(\forall M>0, \ \forall \left( x_{i}\right) _{i=1,\ldots ,M}\in \mathcal {G}, \ \forall \left( \alpha _{i}\right) _{i=1,\ldots ,M}\in \mathbb {R}, \ \sum _{i,j=1}^{M}\alpha _{i}\alpha _{j}K\!\!\left( x_{i},x_{j}\right) \ge 0\). According to [12], there exists a Hilbert space \(\mathcal {H}\), also called the feature space, as well as a feature map \(\psi :\mathcal {G}\rightarrow \mathcal {H}\), such that \(K\!(x,x^{\prime })=\langle \psi (x),\psi (x^{\prime })\rangle _{\mathcal {H}}\). Similarly to the dissimilarity case, the prototypes are defined as convex combinations of (the images by \(\psi \) of) \((x_i)_i\). The distance between an input observation \(x_{k}\) and some prototype \(p_{u}\) is then computed as the squared distance induced by the kernel, \( \Vert x_{k}-p_{u}\Vert ^{2} = K\!\!\left( x_{k},x_{k}\right) -2\sum _{i\in I(t)}\beta _{u,i} K\!\!\left( x_{k},x_{i}\right) +\sum _{i,j\in I(t)}\beta _{u,i}\beta _{u,j} K\!\!\left( x_{i},x_{j}\right) \). The sparse online relational SOM can thus be immediately adapted to kernels. Algorithm 2 has to be modified only in the assignment step, which becomes

Assignment step: find the unit of the closest prototype

$$ f^{t}\left( x_{k}\right) \leftarrow \arg \min _{u=1,\ldots ,U} \left[ \beta _u^{t-1}\mathbf {K}_{I(t-1)}\left( \beta _u^{t-1}\right) ^{T} -2 \sum _{j\in I(t-1)} \beta _{u,j}^{t-1}K\!\!\left( x_{k},x_{j}\right) \right] \ , $$

where \(\mathbf {K}_{I(t-1)}=\left( K\!\!\left( x_{i},x_{j}\right) \right) _{i,j\in I(t-1)}\).
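A minimal sketch of this modified assignment step; the argument names are illustrative, and the constant term \(K\!\left( x_{k},x_{k}\right) \) is dropped since it does not depend on the unit.

```python
import numpy as np

def kernel_assignment(k_new, K_sub, Beta):
    """Assignment step of the kernel version: k_new contains K(x_k, x_j) for
    j in I(t-1), K_sub is the kernel matrix restricted to I(t-1), and Beta is
    the current coefficient matrix (one row per prototype)."""
    scores = np.einsum('ui,ij,uj->u', Beta, K_sub, Beta) - 2.0 * Beta @ k_new
    return int(np.argmin(scores))    # index of the best matching unit
```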

5 Examples

The sparse version introduced in the present manuscript was compared to the online relational SOM on two real data sets. For the sparse version, several values were considered for the threshold \(\nu \). The sparse updates were performed either with a uniform deterministic design (fixed number of updates) or at random, according to a geometric distribution. The performances of the sparse RSOM and the online RSOM were then compared in terms of average computational time (in seconds), quantization and topographic errors, and sparsity (number of non-zero coefficients). All scripts were implemented in the free statistical software environment R.
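As an illustration of the two quality criteria, the following sketch computes a quantization error (mean distance, in the sense of Eq. (1), between each observation and its best matching prototype) and a topographic error (fraction of observations whose two best matching units are not adjacent on the grid). These are standard definitions, given here as assumptions; they may differ in detail from the ones used to produce the tables below.

```python
import numpy as np

def map_quality(Delta, Beta, d_grid):
    """Quantization and topographic errors of a fitted (relational) map, from
    the dissimilarity matrix Delta, the coefficient matrix Beta and the matrix
    d_grid of distances between units on the grid."""
    dist = Delta @ Beta.T - 0.5 * np.einsum('un,nm,um->u', Beta, Delta, Beta)[None, :]
    best_two = np.argsort(dist, axis=1)[:, :2]          # two closest units per observation
    quantization = dist[np.arange(len(dist)), best_two[:, 0]].mean()
    topographic = (d_grid[best_two[:, 0], best_two[:, 1]] > 1).mean()
    return quantization, topographic
```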

Table 1 Average results for Astraptes fulgerator (100 random initializations)
Table 2 Average results for Astraptes fulgerator (100 random initializations, updates were made with a random design)

Astraptes fulgerator. The first data set was introduced in [13]. It contains information on 465 Amazonian butterflies, each of them described by a sample of their DNA. Each input observation is a DNA sequence of length 350. The Kimura distance for genetic sequences, as introduced in [14], was computed and the resulting distance matrix was used as input for relational and sparse relational SOM. For both algorithms, 100 different initializations with 2,500 iterations each were performed on a square grid of size \(5\times 5\). The results are summarized in Tables 1 and 2 for the deterministic and random designs respectively.

Professional trajectories. The second example comes from [15]. It contains information about 2,000 people who graduated from high school in 1998 and were monitored during the 94 following months. For each individual, a categorical sequence of length 94, giving their monthly professional status, is available. In all, there are nine possible situations, ranging from permanent contracts to unemployment. The dissimilarity used for these data is the optimal matching (OM) distance, as introduced in [16]. Here, 100 different initializations with 10,000 iterations each were performed on a square grid of size \(10\times 10\). The sparse version was compared to the standard online relational SOM (itself run with 100 different initializations and 10,000 iterations). The results for the deterministic design are summarized in Table 3 (due to lack of space, we do not report the results for the random design, which are quite similar).

It is interesting to note that the sparsity has a strong influence on the computational time: increasing the number of updates tends to decrease the computational time, since the prototypes are regularly cleared of unnecessary coefficients. For this larger data set, the computational time of the sparse version is at least 10 times smaller than that of the standard version. On the other hand, the performances, measured in terms of quantization and topographic errors, can be affected by excessive sparsity, but the best ones remain close to those of the standard version.

Table 3 Average results for “professional trajectories” (100 random initializations, updates were made with a deterministic design)

6 Conclusion and Future Work

A sparse version of the online relational SOM algorithm was proposed, in which the composition of the prototypes is sequentially enlarged and the prototypes are periodically sparsified. The algorithm was compared with the online RSOM on two real data sets: the sparse version achieved performances very similar to those of the original algorithm, while improving computational time and prototype representation.