1 Introduction

High dimensional data visualization is used to reveal hidden patterns and relationships in data that are otherwise hard to illustrate. One of the most popular and classical methods for dimension reduction and data visualization is principal component analysis (PCA) [14]. Linear PCA is liable to lose useful information in highly nonlinear data sets, and several nonlinear PCA methods have therefore been proposed, such as the work presented in [24]. Linear and nonlinear PCA are computationally inexpensive on large data sets; however, the memory required by these algorithms is not affordable for very large data sets. Sammon's mapping [23] is a nonlinear mapping algorithm for the visualization of multivariate data, which has been shown to be superior to PCA. Neural networks, especially the self organizing maps [15], are able to learn complex nonlinear relationships among variables in a sample of data points.

The self organizing map (SOM) is one of the well known data mining tools, the aim of which is to visualize a high dimensional data space on, usually, a 2-Dim grid [4, 10, 12, 20, 25]. The SOM contains a set of neurons that gradually adapts to the input data space by competitive learning and creates ordered prototypes. These ordered prototypes preserve the topology of the mapped data and make the SOM very suitable for cluster analysis [31]. The adaptation is based on a similarity measure, usually the Euclidean distance, and on repositioning the neurons in a 2-Dim space using a learning algorithm. The performance of the SOM strongly depends on the learning algorithm [6, 7, 9, 11].

Different versions of the SOM for the visualization of high dimensional data have been introduced in [5, 13, 17, 29, 30, 32, 33]. An alternative to the SOM algorithm, the ViSOM, is introduced in [33] to improve the topological and quantization errors by restricting the contraction of neurons. The authors proposed different updating rules for the winner neurons and their neighborhood neurons, which is computationally expensive for very large data sets. The analysis and experimental results show that the ViSOM may offer attractive advantages over the commonly used SOM, PCA and Sammon's mapping. The PRSOM [30] is an extension of the ViSOM, in which the sequential updating rules are extended to optimize a cost function. Unlike the hard assignment in the SOM and ViSOM, the assignment in the PRSOM is soft, such that an input data point belongs to a neuron with a certain probability. However, the computational complexity of both the ViSOM and the PRSOM is shown to be higher than that of the classical SOM and depends quadratically on the number of neurons [30].

In [3], a new self organizing model, the so-called growing grid (GG), is proposed for the problem of data visualization. The network automatically chooses a height/width ratio suitable for the data distribution. Moreover, locally accumulated statistical values are used to determine where to insert new units (neurons). The growing neural gas (GNG), introduced in [8], is an improvement of the neural gas (NG) algorithm [18]. More specifically, it is an incremental version of the NG algorithm that does not require the network size to be set in advance. The GNG algorithm is able to make the topological relations of the input data explicit.

The growing hierarchical SOM (GHSOM) generates multiple independent layers of maps on top of the first layer to cluster input data points in a hierarchical manner, while allowing the network architecture to adapt simultaneously [21]. The GHSOM is reported to be more suitable than the SOM for the visualization of high dimensional document data sets, although manipulating the multiple layers of the GHSOM requires a high computational effort on very large data sets. Following the training phase of the GHSOM, a novel visualization method, namely the ranked centroid projection (RCP), is introduced in [32] for projecting the document vectors onto a hierarchy of output maps. The results of the DB index show that the RCP produces better results, which indicates that, for the purpose of classification and categorization, the RCP is to be preferred over the SOM, PCA and Sammon's mapping [32].

The expanding SOM (ESOM) for data visualization is proposed in [13]. The ESOM extends the SOM by employing an additional factor, the expanding coefficient, which is used to push neurons away from the center of all data points during the learning process. The authors show that the ESOM has advantages over the SOM and the growing models of the SOM, where the growth criteria of the latter increase the computational time of the learning process. To cope with the problem of selecting the learning rate in the conventional SOM, [5] proposed the RPSOM in which, for each input data point, several rivals of the BMU are chosen adaptively and their associated models are penalized by moving them slightly away from the input data point [5]. The RPSOM converges much faster than the SOM and the two-phase SOM [27] and, moreover, outperforms them in terms of quantization error.

In this paper, an approach similar to the RPSOM is proposed to improve the performance of the RPSOM and the SOM in terms of quantization error. Instead of penalizing the rivals of the BMU, which increases the number of training steps, an adaptive constraint parameter is used in the learning process. The constraint parameter is chosen as a decreasing function of the iteration counter, and we consider three such functions: linear, hyperbolic and sigmoid. This parameter restricts the updating of the neighborhood of the BMU to only those neighbors that are close to it in the n-dimensional space. Such an approach leads to faster convergence and a better local minimum of the quantization error than those obtained by the RPSOM and the SOM. Furthermore, the distortion error and topology preservation are improved. The proposed algorithm is tested on eight real-world data sets.

The rest of the paper is organized as follows. The basic self organizing map is presented in Sect. 2. The modified learning algorithm of the SOM and the adaptive constraints are introduced in Sect. 3. The CSOM algorithm and its implementation are discussed in Sect. 4. Numerical results are presented in Sect. 5, and Sect. 6 concludes the paper.

2 Self Organizing Maps

The SOM is an unsupervised neural network [15] that usually contains a 2-Dim array of neurons. Assume that we are given the set of m input data vectors \(A=\{x_1,\ldots ,x_m\}\) where \(x_i \in \mathbb {R}^n,~i=1,\ldots ,m\). In the SOM, a set of q neurons, \({\varPsi }=\{w_1,\ldots ,w_q\}\), \(w_j \in \mathbb {R}^n\), is given.

One data point \(x_i, ~i \in \{1,\ldots ,m\}\) at a time is presented to the network. The data point \(x_i\) is compared with all weight vectors. The nearest \(w_j,~j=1,\ldots ,q\) is selected as the best matching unit (BMU) for the i-th data point. The data point \(x_i\) is mapped to the best matching neuron.

The set of neighborhood weights \(N_c = \{w_l : p(c,l) \le r, ~l \ne c\}\) around the BMU is updated, where \(p(c,l)\) is the distance between the BMU and the neighborhood neuron l in the 2-Dim coordinates of the network topology and r is a predefined radius. Here \(p(c,l) \in \mathbb {N}\) and \(0<p(c,l)\le r\). Usually the quantization error is used to assess the quality of the map; therefore, the aim is to solve the following problem:

$$\begin{aligned} \text {minimize}\,E = \frac{1}{m}\sum _{i=1}^m \Vert x_i - w_c\Vert , \end{aligned}$$
(1)

where \(w_c\) is the weight of the BMU of \(x_i,~i =1,\ldots ,m\). A general description of the SOM algorithm is as follows.

Algorithm 1

SOM algorithm

Step 1. :

Initialize the dimension of the network, the maximum number of iterations (T), a radius (r) of the network and weight vectors \(w_j, ~j=1,\ldots ,q\). Set stopping criterion \(\epsilon _{0}\) and iteration counter \(\tau :=0\).

Step 2. :

Select data \(x_i, i=1,\ldots ,m\) and find its closest neuron c, that is

$$\begin{aligned} c := \underset{j=1,\ldots ,q}{\text {argmin}} \Vert x_i - w_j\Vert . \end{aligned}$$
(2)
Step 3. :

Update the set of neighborhood neurons \(w_j \in N_c\) using the following equation:

$$\begin{aligned} w_j := w_j + \alpha (\tau )h(\tau )(x_i - w_j). \end{aligned}$$
(3)

(Here h is a neighborhood function and \(\alpha (\tau )\) is a learning rate at the iteration \(\tau \).)

Step 4. :

If all input data have been presented to the network, go to Step 5; otherwise go to Step 2.

Step 5. :

Calculate E using (1). If \(E < \epsilon _{0}\) or \(\tau > T\), terminate; otherwise set \(\tau := \tau + 1\) and go to Step 2.

Note that the neighborhood function in equation (3) of Algorithm 1 is as follows.

$$\begin{aligned} h(\tau ) = \exp \left( -\frac{r^2}{2 \sigma (\tau )^2}\right) , \end{aligned}$$
(4)

where

$$\begin{aligned} \sigma (\tau ) = \eta \frac{T-\tau }{T}, \quad ~ \eta \in \mathbb {R}, \end{aligned}$$
(5)

and usually \(\eta \ge 1\).

The neighborhood function h in Step 3 of Algorithm 1 plays an important role in the SOM. Usually h is a decreasing exponential function of \(\tau \). The value of the neighborhood function depends on the iteration \(\tau \) and on the distance in the output space (r) of each neuron in the set \(N_{c}\). The learning rate \(\alpha \) is a decreasing linear function of \(\tau \) that reduces the effect of the neighborhood function h as \(\tau \rightarrow T\).
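For illustration, the following minimal sketch implements one pass of Algorithm 1 with the Gaussian neighborhood of Eqs. (4)-(5). It is not the authors' implementation (which was written in Java); the function name `som_epoch`, the initial learning rate `alpha0` and the linear decay of \(\alpha \) are assumptions made only for this sketch.

```python
import numpy as np

def som_epoch(X, W, grid, tau, T, r=2, alpha0=0.5, eta=1.0):
    """One pass of Algorithm 1 over the data set X (illustrative sketch).

    X    : (m, n) array of input data vectors.
    W    : (q, n) array of weight vectors, updated in place.
    grid : (q, 2) array of 2-Dim grid coordinates of the neurons.
    """
    alpha = alpha0 * (T - tau) / T              # decreasing learning rate (assumed linear)
    sigma = max(eta * (T - tau) / T, 1e-6)      # Eq. (5)
    h = np.exp(-r**2 / (2.0 * sigma**2))        # Eq. (4)
    for x in X:
        c = np.argmin(np.linalg.norm(W - x, axis=1))   # Eq. (2): the BMU
        p = np.linalg.norm(grid - grid[c], axis=1)     # p(c, l) in the output space
        N_c = np.where((p > 0) & (p <= r))[0]          # neighborhood set N_c
        W[N_c] += alpha * h * (x - W[N_c])             # Eq. (3)
    # quantization error E of Eq. (1)
    return np.mean(np.min(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1))
```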

In this paper, for a given \(w_{j},~j \in \{1,\ldots ,q\}\), we define the following set:

$$\begin{aligned} S_j =\{x_k : d(x_k, w_j) < d(x_k, w_l), \quad l\ne j, \quad l=1,\ldots ,q \} \end{aligned}$$
(6)

where

$$\begin{aligned} d(x,y) =\Vert x-y\Vert = \left( \sum _{t=1}^n (x^t - y^t)^2 \right) ^{1/2}, \quad x, y \in \mathbb {R}^n \end{aligned}$$

is the Euclidean distance.

The learning procedure of the SOM is explored in detail in the next section.

3 CSOM Learning Algorithm

In this section the learning process of the self organizing map is analyzed and the quantization performance of the SOM is examined critically. As mentioned in Sect. 2, the set \(N_c\) in Step 3 of Algorithm 1 contains all neighborhood neurons that are connected to the BMU. These neurons are selected using the parameter r, which is the radius around the BMU in the topological 2-dimensional map. The SOM updates all neighborhood neurons of the BMU in the topological map using (3), although the weights of some of these neurons may be far from the BMU in the n-dimensional space. In this case, the value of E in (1) deviates from its optimal value, since neighborhood neurons in the network topology that are not in the neighborhood of the BMU are relocated according to (3) (see Fig. 1). The relocation of the set of neighborhood neurons \(M_{c},~M_{c} \subset N_{c}\), which are far from the winning neuron in the n-dimensional space, is illustrated in Fig. 1. Note that the set \(M_{c}\) depends on the iteration \(\tau \).

Fig. 1 Contraction of neighbor neurons of the BMU

Assume that the SOM algorithm is learning the input data points. At some stage, the relocation of the neighborhood neurons in the set \(M_{c}\) increases the inter-neuron distances (see Fig. 1). Let a given data point \(x_{m}\) activate the nearest neuron in the map at iteration \(\tau \). The neighborhood neurons of the BMU move towards the data point \(x_{m}\) along the line through \(x_{m}\) and \(w_{j},~w_{j} \in N_{c}\). Based on the updating formula (3), the distance that neuron \(w_{j}\) moves is \(\Vert {\varDelta } w_{j}\Vert \), where

$$\begin{aligned} {\varDelta } w_{j} = \beta (x_{m}-w_{j}), \quad 0 \le \beta <1. \end{aligned}$$

Therefore, by applying the updating rule to the neuron \(w_{j}\), there exist data points \(x_{i} \in S_{j}\) for which the value

$$\begin{aligned} {\varDelta } d(x_{i},w_{j}) > 0, \end{aligned}$$

after updating the neuron \(w_{j}\), if

$$\begin{aligned} d(x_{m},w_{j})\le \cos (\theta _{m,i}) d(x_{m},x_{i}), \end{aligned}$$
(7)

where \(\theta _{m,i}\) is the angle between the vectors \((x_{i} - x_{m})\) and \((w_{j} - x_{m})\) (see Fig. 1). Let \(Q_{j} \subset S_{j}\) be the subset of data points such that every \(x_{i} \in Q_{j}\) satisfies condition (7). Thus, the increase in the inter-neuron distance, \({\varDelta } D_{j}\), is:

$$\begin{aligned} {\varDelta } D_{j} = \sum _{i=1}^{|Q_{j}|} {\varDelta } d(x_{i},w_{j}), \end{aligned}$$
(8)

where \(|\cdot |\) is the cardinality of a set. Furthermore, consider a subset of neurons \(F_{c} \subset N_{c}\) containing the neurons \(w_{j}\) that are far from the BMU in the n-dimensional space. Therefore, the value of E in (1) increases by:

$$\begin{aligned} {\varDelta } E = \sum _{j=1}^{|F_{c}|} {\varDelta } D_{j}, \end{aligned}$$
(9)

after the data point \(x_{m}\) is presented to the network and (3) is applied to the set \(N_{c}\).
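The effect described by condition (7) can be verified numerically. The following small check uses made-up 2-dimensional points (chosen only for illustration) and shows that, when (7) holds, applying the update \({\varDelta } w_{j} = \beta (x_{m}-w_{j})\) increases \(d(x_{i},w_{j})\).

```python
import numpy as np

# Made-up 2-dimensional points: x_m activates the BMU, w_j is a neighborhood
# neuron of the BMU, and x_i is a data point currently assigned to w_j.
x_m = np.array([0.0, 0.0])
w_j = np.array([2.0, 0.0])
x_i = np.array([5.0, 1.0])

# Condition (7): d(x_m, w_j) <= cos(theta_{m,i}) * d(x_m, x_i), where
# theta_{m,i} is the angle between (x_i - x_m) and (w_j - x_m).
cos_theta = np.dot(x_i - x_m, w_j - x_m) / (
    np.linalg.norm(x_i - x_m) * np.linalg.norm(w_j - x_m))
lhs = np.linalg.norm(x_m - w_j)
rhs = cos_theta * np.linalg.norm(x_m - x_i)
print("condition (7) holds:", lhs <= rhs)        # True for these points

# Apply Delta w_j = beta * (x_m - w_j) and observe that d(x_i, w_j) grows,
# i.e. Delta d(x_i, w_j) > 0.
beta = 0.3
w_j_new = w_j + beta * (x_m - w_j)
print("d before:", np.linalg.norm(x_i - w_j))    # ~3.16
print("d after :", np.linalg.norm(x_i - w_j_new))  # ~3.74
```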

3.1 Modified Learning Algorithm

In this section we design an algorithm to minimize \({\varDelta } E\) given by (9) for \(w_j,~j=1,\ldots ,|N_{c}|\) at iterations \(\tau >T/\rho \). The idea is to modify the update formula (3) in Step 3 of Algorithm 1 by adding a constraint parameter. We introduce a constraint parameter \(\gamma \in \mathbb {R}\), which defines the subset \(C_c\) of the neighborhood neurons \(N_c\) that are close to the BMU:

$$\begin{aligned} C_c =\{w_j : d(w_c, w_j) < \gamma d_{min},~w_j \in N_c\}, \end{aligned}$$
(10)

where \(\gamma > 1\) and

$$\begin{aligned} d_{min} = \min \left\{ d(w_c, w_j): w_j \in N_c \right\} . \end{aligned}$$
(11)

We also define

$$\begin{aligned} \psi = \frac{d_{max}}{d_{min}}, \end{aligned}$$
(12)

and

$$\begin{aligned} d_{max} = \max \{d(w_c, w_j): w_j \in N_c\}. \end{aligned}$$
(13)

One can see from (10) that \(|C_c| < |N_c|\) for values of \(\gamma \) close to 1 and that the two sets coincide as \(\gamma \rightarrow \infty \). The set \(C_c\) contains neighborhood neurons in the 2-Dim topological space while preserving the adjacency of the \(w_j \in C_c\) in the n-dimensional space. If, in Step 3 of Algorithm 1, we update only the neighborhood neurons of the set \(C_c\), then the weight vectors that are far from the best matching unit are not considered in Eq. (3). Therefore, following (10):

$$\begin{aligned} d_{min} \le d(w_c, w_j) < \gamma d_{min}~~ \forall ~w_j \in C_c , \quad \gamma > 1. \end{aligned}$$
(14)

Since the distance \(d(w_c, w_j),~w_j \in C_c\) is sufficiently small, it is easy to see that \({\varDelta } d(x_{i},w_{j})\) in (8), where

$$\begin{aligned} {\varDelta } d(x_{i},w_{j}) < d(w_{c},w_{j}), \end{aligned}$$

is controlled by the upper bound \(d(w_{c},w_{j})\) (see Fig. 2).

Fig. 2 The updating procedures in the SOM algorithm

Therefore, by limiting the value of \(d(w_{c},w_{j})\), we have:

$$\begin{aligned} {\varDelta } d(x_{i},w_{j}) < \gamma d_{min}. \end{aligned}$$

One can see that the values \({\varDelta } D_{j}\) in (9) are minimized.

Thus, the constrained learning algorithm can be summarized as follows.

Algorithm 2

Constrained learning algorithm

Step 1. :

Set \(C_c = \emptyset \), select \(\gamma \) and find \(d_{min}\) from \(N_c\) using Eq. (11).

Step 2. :

Select a neuron \(w_i \in N_c\); if \(\Vert w_i - w_c\Vert <\gamma d_{min}\), then set

$$\begin{aligned} C_c := C_c \cup \{w_i\}. \end{aligned}$$
Step 3. :

If all neurons in \(N_c\) have been visited, then go to Step 4; otherwise go to Step 2.

Step 4. :

Update the set of neighborhood neurons \(w_j \in C_c\) using Eq. (3).

Thus, we minimize (9) by applying Algorithm 2 in Step 3 of Algorithm 1 at iteration \(\tau \).
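The construction of the set \(C_c\) in Algorithm 2 is straightforward to vectorize. The sketch below is an illustrative helper (the name `constrained_neighborhood` and the numpy-based data layout are assumptions, not part of the original Java implementation); the update of Step 4 is then applied to the returned indices exactly as in Eq. (3).

```python
import numpy as np

def constrained_neighborhood(W, c, N_c, gamma):
    """Algorithm 2 (sketch): keep only the neighbors of the BMU c that are
    close to it in the n-dimensional input space, Eq. (10).

    W     : (q, n) array of weight vectors.
    c     : index of the BMU.
    N_c   : integer array of indices of the topological neighborhood of c.
    gamma : constraint parameter, gamma > 1.
    """
    d = np.linalg.norm(W[N_c] - W[c], axis=1)   # d(w_c, w_j), w_j in N_c
    d_min = d.min()                             # Eq. (11)
    return N_c[d < gamma * d_min]               # the set C_c of Eq. (10)
```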

3.2 Adaptive Selection of Parameter \(\gamma \)

In this section, we propose linear, hyperbolic and sigmoid formulations of the constraint parameter \(\gamma \) in Eq. (10) with respect to the iteration \(\tau \). Considering Algorithm 2, one can see that a fixed value of \(\gamma \) reduces the interaction of neurons in the early iterations and, thus, the neurons do not converge accurately to the domain of the input data. From Eq. (10) it is easy to see that \(C_c \equiv N_c\) for large values of \(\gamma \). Therefore, the behavior of the algorithm for large values of \(\gamma \) is similar to that of the classical SOM.

In Step 2 of Algorithm 2 we consider \(\gamma \) not as a fixed constant but as a parameter depending on the iteration \(\tau \). Moreover, \(\gamma (\tau )\) decreases as \(\tau \) increases. Hence, the constraint functions allow the algorithm to behave like the SOM in the early stages and to apply the constrained learning gradually. In order to have this property we require \(\gamma \) to satisfy the following conditions: there exists \(\bar{\tau } >1\) such that:

$$\begin{aligned} {\left\{ \begin{array}{ll} \gamma (\tau ) > \psi &{}\quad 1<\tau <\bar{\tau } \\ 1 < \gamma (\tau ) \le \psi &{}\quad \bar{\tau } \le \tau < T\\ \end{array}\right. } \end{aligned}$$
(15)

Clearly, there are different ways to choose \(\gamma \) satisfying the conditions (15). In this paper, we define \(\gamma \) as a linear, a hyperbolic and a sigmoid constraint function of \(\tau \).

3.2.1 Linear Formulation

In the case of a linear constraint function, \(\gamma \) can be defined as follows:

$$\begin{aligned} \gamma (\tau ) = \frac{\rho }{(1-\rho )}(\psi - 1)\biggl (\frac{\tau }{T} - \frac{1}{\rho }\biggr ) + \psi , \end{aligned}$$
(16)

where \(\rho \in \mathbb {N}\) defines \(\bar{\tau } = T/\rho \), the iteration from which the constrained learning algorithm (Algorithm 2) is applied. It is easy to see that the function (16) satisfies condition (15).

3.2.2 Hyperbolic Formulation

The hyperbolic expression for \(\gamma \) is given as

$$\begin{aligned} \gamma (\tau ) = {\left\{ \begin{array}{ll} \infty &{}\quad 1<\tau <\bar{\tau } \\ \textstyle \frac{\psi }{\tau - \varphi } +1 &{}\quad \bar{\tau } < \tau < T\\ \end{array}\right. } \end{aligned}$$
(17)

where

$$\begin{aligned} \varphi = \bar{\tau } - \frac{d_{max}}{d_{max} - d_{min}}, \end{aligned}$$

and \(\bar{\tau } = T/\rho \).

3.2.3 Sigmoid Formulation

Finally, we propose the following sigmoid function to define \(\gamma \):

$$\begin{aligned} \gamma (\tau ) = \varepsilon (\psi - 1)\tanh \bigg (\frac{- \tau + \bar{\tau }}{\delta }\bigg ) + \frac{1}{2} (\psi + 1), \end{aligned}$$
(18)

where \( 0< \varepsilon < 1\) and \(1<\delta < T\) define the slope of the change with respect to the iterations. It should be noted that if \(\delta \) is chosen close to 1, then the constraint function (18) changes rapidly.
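The three schedules (16)-(18) can be written compactly as follows. This is a hedged sketch: the function names are illustrative, \(\psi = d_{max}/d_{min}\) is assumed to be computed from the current neighborhood as in Eq. (12), and the defaults \(\varepsilon = 0.83\) and \(\delta = T/4\) are the values used later in the experiments.

```python
import numpy as np

def gamma_linear(tau, T, psi, rho):
    """Eq. (16): equals psi at tau = T/rho and decreases towards 1 as tau -> T."""
    return rho / (1.0 - rho) * (psi - 1.0) * (tau / T - 1.0 / rho) + psi

def gamma_hyperbolic(tau, T, d_min, d_max, rho):
    """Eq. (17): unconstrained (infinite gamma) before tau_bar = T/rho."""
    tau_bar = T / rho
    if tau <= tau_bar:
        return np.inf
    phi = tau_bar - d_max / (d_max - d_min)
    return (d_max / d_min) / (tau - phi) + 1.0      # psi / (tau - phi) + 1

def gamma_sigmoid(tau, T, psi, rho, eps=0.83, delta=None):
    """Eq. (18); eps = 0.83 and delta = T/4 are the settings used in Sect. 4."""
    delta = T / 4.0 if delta is None else delta
    tau_bar = T / rho
    return eps * (psi - 1.0) * np.tanh((tau_bar - tau) / delta) + 0.5 * (psi + 1.0)
```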

To illustrate the idea, the performance of the proposed algorithm is shown in Fig. 3. Figure 3a shows that from iteration 2 onwards (\(\tau > \bar{\tau }\)) the algorithm applies the constraint to the learning process; \(\gamma d_{min}\) decreases and is always less than \(d_{max}\) as \(\tau \rightarrow T\) and, consequently, the cardinality of \(C_c\) in (10) decreases (see Fig. 3b). The algorithm prevents further modification of the neighborhood neurons that are far from the BMU in the n-dimensional space.

Fig. 3 Performance of constraint \(\gamma \) using Eq. (16) on a \(20 \times 20\) SOM where \(r=3,~\rho =5,~T=5\) and \(|A|=20\). a Performance of \(\gamma \), b comparison of \(|C_c|\) with and without the constraint

4 The CSOM Algorithm and Its Implementation

In this section, we apply the adaptive formulations of the constraint \(\gamma \) in Step 2 of Algorithm 2. Then Algorithm 1 for solving the vector quantization problem (1) can be rewritten as follows.

Algorithm 3

CSOM algorithm

Step 1. :

Initialize the dimension of the network, the maximum number of iterations (T), the radius (r) of the network and the weight vectors \(w_j, j=1,\ldots ,q\). Set \(\tau :=1\).

Step 2. :

Select data \(x_i\) and find closest neuron c, where

$$\begin{aligned} c = \underset{j}{\text {argmin}} \Vert x_i - w_j\Vert . \end{aligned}$$
(19)
Step 3. :

Compute the set \(N_c\), and \(d_{min}\) and \(d_{max}\) using Eqs. (11) and (13), respectively.

Step 4. :

Compute \(\gamma (\tau )\) using Eqs. (16), (17) or (18) and set \(C_c = \emptyset \).

Step 4.1 :

Select a neuron \(w_i \in N_c\); if \(\Vert w_i - w_c\Vert < \gamma (\tau )d_{min}\), then set

$$\begin{aligned} C_c := C_c \cup \{w_i\}. \end{aligned}$$
Step 4.2 :

If all neurons in \(N_c\) have been visited, then go to Step 5; otherwise go to Step 4.1.

Step 5. :

Update the set of neighborhood neurons \(C_c\) using the following equation:

$$\begin{aligned} w_j(t+1) = w_j(t) + \alpha (\tau )h(\tau )(x_i - w_j(t)),~w_j \in C_c. \end{aligned}$$
(20)
Step 6. :

If all input data \(x_i\) have been presented to the network, go to Step 7; otherwise go to Step 2.

Step 7. :

Calculate \(E_{\tau }\) using Eq. (1). If \(\tau > T\), terminate; otherwise set \(\tau := \tau + 1\) and go to Step 2.

In Algorithm 3, the weight vectors \(w_j, j=1,\ldots ,q\) are initialized with a uniform pattern in order to be comparable with the basic SOM. The maximum number of iterations T is set according to the size and dimension of the data set. We set T large for small data sets because the computational cost of obtaining a stable network over the input data is low for such data sets. The topology of the network is rectangular [15] with the same number of neurons in each column and row (i.e. \(n \times n\)). Each interior neuron is connected to 8 neighborhood neurons, whereas border neurons have at most 5 neighbors. Furthermore, the radius r of the map is set to 2 and 3 for small and large numbers of neurons, respectively (see Table 1). The dimension of the map is selected based on the size and dimensionality of the data set: for very large data sets \((0.5\times 10^5<|A|<0.8\times 10^5)\) with high dimension (high time complexity) the size of the map is set to \(15 \times 15\), whereas for 2-dimensional very large data sets \((|A|>0.8\times 10^5)\) it is set to \(25 \times 25\).

Table 1 Initialization of SOM parameters in Algorithm 3

In Step 4 of Algorithm 3, we set the value of \(\rho \in \{\rho : 2\le \rho \le 0.7\times T,~\rho \in \mathbb {N}\}\) for Eqs. (16) and (17), which depends on the size of the data set and the maximum number of iterations T. If the number of data points is large, then \(\rho \) is chosen so that the constrained learning is applied only after approximately the first T/2 iterations. If the size of the data set is small, then \(\rho \) is chosen so that the constraint is applied from the early iterations onwards (\(\rho \ge 2\)). In all experiments, the parameters \(\varepsilon \) and \(\delta \) in Eq. (18) are set to 0.83 and T/4, respectively. The parameter \(\eta \) in Eq. (5) is set to 1 for all data sets.
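Putting the pieces together, the following compact sketch of Algorithm 3 uses the linear constraint (16). It is only an illustration under stated assumptions: the original implementation is in Java, the initial learning rate `alpha0` and the random (rather than uniform-pattern) weight initialization are choices of the sketch, and the function name `csom` is hypothetical.

```python
import numpy as np

def csom(X, grid_size=10, T=100, r=2, rho=5, alpha0=0.5, eta=1.0):
    """Illustrative sketch of Algorithm 3 (CSOM) with the linear constraint (16).

    X : (m, n) array of input data points.
    Returns the trained weights W and the final quantization error E of Eq. (1).
    """
    m, n = X.shape
    # 2-Dim rectangular grid of q = grid_size**2 neurons and their coordinates.
    gy, gx = np.mgrid[0:grid_size, 0:grid_size]
    grid = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
    q = grid.shape[0]
    # Weight initialization inside the bounding box of the data (the paper uses
    # a uniform pattern; random initialization is an assumption of this sketch).
    rng = np.random.default_rng(0)
    W = X.min(0) + rng.random((q, n)) * (X.max(0) - X.min(0))

    for tau in range(1, T + 1):
        alpha = alpha0 * (T - tau + 1) / T                  # decreasing learning rate
        sigma = max(eta * (T - tau) / T, 1e-6)              # Eq. (5)
        h = np.exp(-r**2 / (2.0 * sigma**2))                # Eq. (4)
        for x in X:
            c = np.argmin(np.linalg.norm(W - x, axis=1))    # Eq. (19): the BMU
            p = np.linalg.norm(grid - grid[c], axis=1)      # grid distance p(c, l)
            N_c = np.where((p > 0) & (p <= r))[0]           # topological neighborhood
            d = np.linalg.norm(W[N_c] - W[c], axis=1)       # d(w_c, w_j)
            d_min, d_max = d.min(), d.max()
            psi = d_max / max(d_min, 1e-12)                 # Eq. (12)
            gamma = (rho / (1.0 - rho) * (psi - 1.0)
                     * (tau / T - 1.0 / rho) + psi)         # Eq. (16)
            C_c = N_c[d < gamma * d_min]                    # Eq. (10)
            W[C_c] += alpha * h * (x - W[C_c])              # Eq. (20)
    E = np.mean(np.min(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1))
    return W, E
```

Before \(\bar{\tau } = T/\rho \) the linear \(\gamma \) exceeds \(\psi \), so \(C_c = N_c\) and the sketch behaves like the classical SOM; the constraint takes effect automatically afterwards.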

5 Numerical Results

To demonstrate the efficiency of the proposed algorithm, numerical experiments were carried out using a number of real-world data sets. Algorithm 3 was coded in the NetBeans IDE on the Java platform and tested on a Mac OS X machine with a 2.7 GHz Core i7 CPU and 10 GB of RAM. Eight data sets are used in the experiments: 2 small (Iris and Wine), 2 medium size (TSPLIB1060 and Image Segmentation), 2 large (D15112 and Gamma Telescope) and 2 very large (NE and Pla85900). The Iris, Wine, Image Segmentation and Gamma Telescope data sets have 4, 13, 19 and 10 attributes, respectively, and can be found in the UCI Machine Learning Repository [19]. TSPLIB1060, D15112 and Pla85900 are 2-dimensional data sets from the TSPLIB library [22], and the NE data set is from [26]. A brief description of the data sets is presented in Table 2.

Table 2 The brief description of data sets

5.1 Validation of Parameter \(\rho \) in Adaptive Constraints

The sensitivity of the quantization error to different values of the parameter \(\rho \) in (16), (17) and (18) on the Iris and Wine data sets is presented in Fig. 4. One can see that the sigmoid constraint is more sensitive to this parameter than the other two constraint functions on both data sets. Furthermore, on both data sets, the hyperbolic constraint is the least sensitive to the parameter \(\rho \) and produces high quality maps with lower quantization error.

Fig. 4 The sensitivity of the quantization error to different values of the parameter \(\rho \in \{\rho : 2\le \rho \le 0.7\times T,~\rho \in \mathbb {N}\}\) on the Iris and Wine data sets. E is the value of the quantization error. a Iris, b Wine

In order to validate the effect of the parameter \(\rho \), the quantization errors obtained using different settings of this parameter on four data sets, Iris, Wine, TSPLIB1060 and Image Segmentation, are presented in Table 3. Furthermore, the average and standard deviation of the quantization errors are calculated to check the robustness of this parameter for each constraint. From these results one can see that, on the high dimensional data sets Iris, Wine and Image Segmentation, the CSOM with the hyperbolic constraint obtains a lower average quantization error and a sufficiently low deviation compared with the other two constraints. The sigmoid constraint achieves the lowest average quantization error and deviation on the TSPLIB1060 data set. The linear constraint is the second-best constraint, obtaining smaller deviations on all four data sets.

Table 3 The quantization error, E, using different settings of parameter \(\rho \) in CSOM (Linear, Hyperbolic and Sigmoid functions)

5.2 Comparison with the SOM

The relative errors e of the quantization error E, computed using Eq. (1), for different numbers of iterations on all data sets are presented in Tables 4 and 5. The error e is calculated using the following formula:

$$\begin{aligned} e = \frac{E - E_{best}}{E_{best}} \,\times \,100\,\%. \end{aligned}$$
(21)

In these tables, t stands for the CPU time used by an algorithm, and the numbers in these tables should be multiplied by the factor indicated after the name of each data set to obtain the true values of \(E_{best}\). We include results for different numbers of iterations of each algorithm to demonstrate their performance. As presented in these tables, in the early iterations the CSOM performs the same as the classical SOM until the constraint is applied to the learning process; after that, the CSOM performs much better than the SOM. From these results one can see that the CSOM outperforms the SOM on all data sets. The hyperbolic constraint finds the best solution (providing the lowest value of E) on 5 data sets (Iris, Wine, Image Segmentation, Gamma Telescope and NE), the sigmoid constraint on 3 data sets (TSPLIB1060, D15112 and Pla85900), and the linear constraint is the second-best on 3 data sets (TSPLIB1060, Gamma Telescope and Pla85900).

Table 4 Results for small and medium size data sets
Table 5 Results for large and very large data sets

The CSOM finds a significantly better solution than the SOM on 5 data sets (Iris, Wine, TSPLIB1060, Image Segmentation and Pla85900), with improvements ranging from 6.2 % on Pla85900 up to 37.48 % on the Wine data set. The smallest improvement gained by the CSOM is on the D15112 data set, where the obtained solution is only 0.57 % better than that of the SOM. On the other two data sets (Gamma Telescope and NE), the CSOM reduces the value of the quantization error E by about 5 %. Note that on most data sets the CPU time required by the constrained SOM is slightly larger than that of the SOM. This is due to the fact that the CSOM tries to find the nearest neighborhood neurons in the n-dimensional space. Since both NE and Pla85900 are very large data sets, the SOM algorithm requires more iterations than on smaller data sets to spread the neurons over the whole data set. Therefore, on these data sets the constraints are applied at iterations close to T/2 to ensure that the values calculated by the constraint functions (16)-(18) are meaningful.

Fig. 5 Convergence of CSOM versus SOM. a Iris, b Wine, c Image Segmentation, d NE

To illustrate the performance of the proposed algorithm, Fig. 5 presents the values of E obtained by the SOM and the CSOM as functions of the iteration \(\tau \) on the Iris, Wine, Image Segmentation and NE data sets. From these figures it is clear that both the SOM and the CSOM converge as \(\tau \rightarrow T\). Comparing the values of E on all data sets, one can see that the CSOM converges significantly faster than the SOM within the first 20 iterations. The CSOM with the hyperbolic constraint produces the best results on all 4 data sets.

5.3 Complexity Comparison with SOM

The time complexity of the self organizing map is linear with respect to the number of data points, but it depends quadratically on the number of neighborhood neurons updated around the winner neuron. The most time consuming steps of the SOM algorithm are Steps 2 and 3, where Eq. (3) is evaluated. In Table 6, the total number of evaluations of (3), \(N=T\rho n\), is presented for all data sets. From these results, one can see that the number N obtained by the CSOM is considerably smaller than that of the classical SOM on all data sets. The complexity is improved significantly by the proposed algorithm on 4 data sets: Iris, Image Segmentation, D15112 and NE.

Table 6 Total number of calculations of (3), N, for all data sets

5.4 Distortion Error

Note that the error E reflects the quantization quality of the network. However, there is a distortion measure that can be used to evaluate the overall quality of the map. Unlike the quantization error, the distortion measure \(\xi \) considers both the vector quantization and the topology preservation of the SOM. The distortion measure is defined as follows [1]:

$$\begin{aligned} \xi = \sum _{x_i \in A} \sum _{w_j \in {\varPsi }} h_{cj} \Vert x_i - w_j\Vert ^2,~~ j\ne c, \end{aligned}$$
(22)

where c is the BMU of \(x_i\) and \(h_{cj}\) is the neighborhood function of the neurons c and j at the maximum iteration T, defined by Eq. (4).
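A direct implementation of (22) is shown below. It is a sketch under one interpretation of the formula: \(h_{cj}\) is taken as a Gaussian of the grid distance between the neurons c and j with the width \(\sigma \) evaluated at the final iteration; the function name and the numpy-based data layout are assumptions.

```python
import numpy as np

def distortion(X, W, grid, sigma_T):
    """Distortion measure of Eq. (22) (sketch).

    h_cj is modelled as exp(-p(c, j)^2 / (2 * sigma_T^2)), where p(c, j) is the
    grid distance between neurons c and j; this reading of Eq. (4) is an
    assumption of the sketch.
    """
    xi = 0.0
    for x in X:
        c = np.argmin(np.linalg.norm(W - x, axis=1))      # BMU of x
        p = np.linalg.norm(grid - grid[c], axis=1)        # grid distances p(c, j)
        h = np.exp(-p**2 / (2.0 * sigma_T**2))            # neighborhood h_cj
        mask = np.arange(W.shape[0]) != c                 # j != c
        xi += np.sum(h[mask] * np.sum((x - W[mask])**2, axis=1))
    return xi
```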

Table 7 Results of distortion measure on all data sets

Table 7 presents the numerical results of the distortion measure (22) on all data sets. From these results, one can see that the proposed algorithm outperforms the SOM on all data sets. The improvement in the distortion error \(\xi \) is significant on eight data sets (Wine, TSPLIB1060, Image Segmentation, TSPLIB3038, D15112, Gamma Telescope, NE and Pla85900). The value \(n_d\) in Table 7 is the number of neurons that are never activated (dead neurons) by the input data points. The number of dead neurons, \(n_d\), for the CSOM is equal to or less than that of the SOM on all data sets. This means that the proposed algorithm activates more neurons and distributes the neurons more efficiently than the classical SOM.

5.5 Topographic Product

In this section the preservation of neighborhood relations by the CSOM is compared with that of the SOM using the topographic product. This measure demonstrates how well nearby neurons in the input space are mapped onto neighboring locations in the output space. Many topology preservation measures have been proposed; a comprehensive review can be found in [2, 28].

The topographic product is a measure of the preservation of neighborhood relations in maps between spaces. The basic idea is that, for each neuron, the ordering of the remaining neurons is determined in both the output and the input space based on their distances in each space. The error is minimized if, for each neuron, these orderings are identical, i.e. the k-th nearest neuron in the output space is also the k-th nearest neuron in the input space.

The notation for nearest-neighbor indices is defined as \(n_{k}^{A}(j)\), which denotes the k-th nearest neighbor of node j, with the distances measured in the output space:

$$\begin{aligned} n_{k}^{A}(j) : d^{A}\big (j, n_{k}^{A}(j)\big ) = \min _{j'\in A\setminus \{j,n_{1}^{A}(j),\ldots ,n_{k-1}^{A}(j)\}} d^{A}(j, j'). \end{aligned}$$

In the same way let \(n_{k}^{V}(j)\) denote the k-th nearest neighbor of j, but with the distances measured in the input space.

$$\begin{aligned} n_{k}^{V}(j) : d^{V}\big (w_{j}, w_{n_{k}^{V}(j)}\big ) = \min _{j'\in A\setminus \{j,n_{1}^{V}(j),\ldots ,n_{k-1}^{V}(j)\}} d^{V}(w_{j}, w_{j'}). \end{aligned}$$

Using this nearest-neighbor indexation, the following ratios are defined:

$$\begin{aligned} Q_{1}(j,k) = \frac{d^{V}(w_{j}, w_{n_{k}^{A}(j)}) }{d^{V}(w_{j}, w_{n_{k}^{V}(j)}) }, \quad Q_{2}(j,k) = \frac{d^{A}\bigg (j, n_{k}^{A}(j)\bigg )}{d^{A}\bigg (j, n_{k}^{V}(j)\bigg )}. \end{aligned}$$

The topographic product formula is defined as:

$$\begin{aligned} P = \frac{1}{N(N-1)} \sum _{j=1}^{N} \sum _{k=1}^{N-1} \log (P_{3}(j,k)), \end{aligned}$$
(23)

where

$$\begin{aligned} P_{3}(j,k) = \left( \prod _{l=1}^{k} Q_{1}(j,l)Q_{2}(j,l) \right) ^{1/2k}. \end{aligned}$$

The topographic products of the SOM and the CSOM on all data sets are presented in Table 8. From (23) one can see that the desired value of P is 0. The CSOM algorithm outperforms the SOM algorithm in all cases. The improvement of the topographic product value is significant on Wine, TSPLIB1060, Image Segmentation, D15112 and NE.

Table 8 Results of topographic product on all data sets

5.6 Topology Preservation

In this section we examine the topology preservation of the proposed algorithm. We choose 2-dimensional data sets so that the maps can be displayed easily. Figures 6, 7 and 8 show the topology of the SOM and the CSOM on the TSPLIB1060, D15112 and NE data sets. It should be noted that only active neurons are shown in these figures. In Fig. 6a one can see that some of the SOM's neurons for the TSPLIB1060 data set are far from the mapped data points, which leads to an increase in the quantization error E. This is due to the absorption of these neurons by their topological neighbors which, in fact, are far from them in the n-dimensional space. On the contrary, one can see from Fig. 6b that the CSOM spreads the neurons over the mapped data more accurately by overcoming the above-mentioned deficiency of the SOM. These improvements can be clearly seen by comparing Fig. 6a and b at the bottom-left and right of the maps, where a gap exists between two groups of data points. The visualization of the neurons on the D15112 data set shows similar results: the CSOM converges to the domain of the data points more accurately, as can be seen by comparing the top-right side of Fig. 7a and b. Figure 8 presents the topology of the SOM and the CSOM on the NE data set. The CSOM in Fig. 8b is more accurate than the SOM in Fig. 8a, where some active neurons are located outside the region of the data set (they are indicated by an arrow).

Fig. 6 Topology preservation of CSOM versus SOM. a SOM on TSPLIB1060, b CSOM on TSPLIB1060

Fig. 7 Topology preservation of CSOM versus SOM. a SOM on D15112, b CSOM on D15112

Fig. 8 Topology preservation of CSOM versus SOM. a SOM on NE, b CSOM on NE

5.7 Comparison with Other Algorithms

In this section the CSOM is compared with similar topology-preserving algorithms in terms of accuracy and computational time. The CSOM is compared with the Growing Grid (GG) [3], Growing Neural Gas (GNG) [8], Growing Hierarchical SOM (GHSOM) [21], Principal Component Analysis (PCA) [24], Sammon's Mapping [23], Fuzzy Sammon Mapping [16] and RPSOM [5] algorithms. The maximum number of neurons and iterations for the GG, GNG and GHSOM algorithms are set to 400 and 4000, respectively. The same settings as presented in Table 1 are used for the RPSOM algorithm.

The value of e computed using (21) and the best value of the quantization error, \(\mathbf {E_{best}}\), are presented in Table 9. Note that in this table LCSOM, HCSOM and SCSOM denote the Linear, Hyperbolic and Sigmoid CSOM, respectively. On all data sets except Gamma Telescope, the CSOM produces a lower quantization error than the other algorithms. The GNG algorithm obtains a lower error on the Gamma Telescope data set in comparison with the other algorithms. From the results presented in Table 9, one can see that the PCA is the second best algorithm on the Iris, D15112 and Gamma Telescope data sets. Moreover, the RPSOM is the second best algorithm on the Wine, NE and Pla85900 data sets. The results obtained by the GG and GNG algorithms are close to those obtained by the CSOM on the TSPLIB1060 and Image Segmentation data sets, respectively. The RPSOM algorithm generates satisfactory results on the NE and Pla85900 data sets in comparison with the CSOM. Note that the PCA, Sammon's Mapping and Fuzzy Sammon algorithms failed on the two very large data sets, NE and Pla85900, due to their high computational time requirement (more than five hours).

Table 9 Results of quantization error on all data sets

The CPU times required by all algorithms are presented in Table 10. The CSOM is the best algorithm on the Iris, Wine, NE and Pla85900 data sets in terms of both accuracy and the required CPU time. The PCA and GNG algorithms are fast on most of the data sets, but the errors generated by these algorithms are quite far from satisfactory. From these results one can see that the CSOM outperforms the RPSOM algorithm in all cases.

Table 10 The CPU time required by all algorithms

6 Conclusion

In this paper, we developed a modified learning algorithm for the self organizing map. The aim was to propose a learning algorithm that restricts the neighborhood adaptation to only those neurons that are not far from the best matching unit in the n-dimensional space. To this end, we introduced an adaptive constraint parameter. This parameter is a decreasing function of the iterations, which makes it applicable to the SOM learning process. The adaptive constraint parameter is selected as a linear, hyperbolic or sigmoid function. Experiments on eight real-world data sets demonstrate the superiority of the proposed algorithm over the SOM in terms of accuracy. The results show that the CSOM converges much faster than the SOM and, in all cases, with less computational complexity; moreover, it improves the topology preservation and produces promising clustering results. The proposed algorithm also outperforms similar topology-preserving algorithms, especially on very large data sets, in terms of accuracy and computational time.

The CSOM is designed for offline training on large data sets; however, it can be modified to be suitable for realtime environments. The constrained learning algorithm updates a set of neurons based on parameters that can be kept in memory for the next incoming data points. The modification of the CSOM for realtime training is left for future work.