1 Introduction

This paper proposes a novel growing grid approach, the constrained growing grid (CGG), for organizing high-dimensional data using a self-organizing map-like technique with a constrained learning function. The self-organizing map (SOM), proposed by Teuvo Kohonen [1], is a powerful clustering algorithm for the visualization of high-dimensional data. It projects high-dimensional data onto a low-dimensional map, usually a two-dimensional grid of units, and the geometric relationship between units in the grid represents the relationship between the high-dimensional data. In other words, the SOM abstracts the most important data relationships so that they can be visualized on a two-dimensional map, which makes it a robust tool for various tasks such as data organization, machine monitoring, pattern recognition, data mining, image analysis, communication, market segmentation, cloud resource management and 3-D shape reconstruction, e.g. [2–9].

The SOM depends significantly on certain predefined parameters, including a fixed topological structure, which may be difficult to decide before learning. This has inspired the development of many extended SOM-like models, such as the growing cell structure (GCS) [10], growing neural gas (GNG) [11], growing neural gas with utility (GNG-U) [12], grow when required (GWR) [13], dynamic adaptive self-organizing hybrid (DASH) [14], self-organizing incremental neural network (SOINN) [15], local adaptive receptive field self-organizing map (LARFSOM) [16], and growing self-reconstruction map (GSRM) [6]. A common goal of these extended models is to project a data set from a high-dimensional space onto a low-dimensional space while keeping its inner structure as faithful as possible. Although these extended SOM-like models succeed in removing the need to predefine a fixed topological structure, their non-grid topology makes them difficult to present in a two-dimensional space without information loss, so they lose the benefits of visualization.

Unlike such non-grid extended SOM-like models, the growing grid (GG) proposed by Fritzke [17] is an incremental variant of the SOM in terms of the network topology. GG develops a growing grid structure based on the input data set without a predefined topological structure. Thus, GG turns the static SOM into a dynamic SOM while keeping a two-dimensional grid as its topological structure for visualization. From the theoretical viewpoint of self-organization [18], the best matching unit (BMU) of GG should be the most active unit with the greatest reaction to its associated input vectors. However, the learning function of the traditional GG may not make the BMU the most active unit during learning, which goes against the competitive learning principle [19]. This paper modifies the original learning function of GG and proposes the constrained growing grid (CGG) to pursue better performance than the traditional GG model.

The remainder of this paper is organized as follows. Section 2 briefly reviews related literature, including neural gas (NG), growing cell structure (GCS), growing neural gas with utility criterion (GNG-U), grow when required (GWR), dynamic adaptive self-organizing hybrid (DASH), growing self-reconstruction map (GSRM) and other recent related work. Sections 3 and 4 introduce the GG growing process and learning process, respectively. In Section 5, we propose the constrained neural learning algorithm. Section 6 introduces the five evaluation measures used in this paper: average quantization error (AQE), error entropy, average map scope (AMS), BMU activation rate (BAR) and topographic error (TE). Section 7 presents the experiment design and results. Finally, the conclusion and possible future work are provided in Sections 8 and 9, respectively.

2 Related work

Due to the main deficiency of the SOM, i.e. the need for a fixed topological structure before learning, several extended SOM-like neural learning models have been proposed [6, 10–14, 16]. Different modifications of the SOM suggest different enhancements from different viewpoints. For example, Martinetz and Schulten [20] proposed the neural gas (NG) model to alleviate the requirement of prior knowledge about the topological structure of the SOM. Like the SOM, the NG model uses a fixed number of output units and a learning rate that decays over time during learning. Unlike the SOM, the NG model defines the relationships between output units by connecting the best matching unit (BMU) and the second best matching unit (SMU) of each input vector while learning, which is called competitive Hebbian learning (CHL) [21]. All output units are updated according to their rank order of distance to the input vector, relative to the other output units, rather than according to the absolute distance. Without a topological constraint, the NG units act like a gas and spread throughout the input space.

The growing cell structure (GCS) is a dynamic neural model that maintains its units with a triangular connectivity [10]. GCS starts with three units and adds one unit after every predefined period. A new unit is inserted by splitting the connection between the unit with the greatest error and its farthest direct neighbor. A unit with a low probability density, meaning that few input vectors are mapped onto it, is removed together with its direct neighbors on the corresponding triangle. The growing neural gas (GNG) [11] is a neural model that applies the GCS growing mechanism with the CHL topology. GNG starts with two units and connects each input vector’s BMU to its SMU. At each learning iteration, the age of every connection directly linked to the BMU is increased by 1, but the age of the connection between the BMU and SMU is reset to zero. After every predefined period, a new unit is inserted by splitting the connection between the unit with the highest error in the whole structure and the unit with the highest error in its direct neighborhood. The unit-pruning function works through the connection-trimming function, which removes connections whose age exceeds a predefined threshold and units that have no neighbors. Both GCS and GNG have two fixed learning rates, applied to the BMU and its direct neighbors respectively.

Fritzke [12] proposed the growing neural gas with utility criterion (GNG-U), an extension of GNG able to handle rapid changes of input patterns. GNG-U contains all GNG features but employs an extra unit-pruning function. GNG-U uses the utility of a unit to evaluate the contribution of this unit to the topological structure. The utility is an accumulated value that sums, for each input vector, the difference between the distances to the BMU and to the SMU. In other words, the utility of a unit is the amount by which the global error would increase if this unit were removed. A small utility means that the unit is not activated frequently, or that it can be replaced by the SMU with no harm to the model. If the ratio of the largest error to the smallest utility in a model is greater than the utility-removal threshold, the unit with the smallest utility is pruned.

Marsland et al. [13] argued that a model should grow because it cannot represent the input vectors well, rather than simply because a fixed predefined period has elapsed. They proposed the grow when required (GWR) model to discover the variation in a data set. GWR starts with two units and applies a CHL topology. At each iteration, all connections to the BMU are aged by 1, but the age of the connection between the BMU and SMU is reset to zero. A unit is inserted near the BMU when the distance between an input vector and its BMU is greater than the unit-growing threshold. GWR also offers a unit-pruning function, which is performed through the removal of aged connections.

As selecting appropriate thresholds for the modification of the topological structure is a difficult task, Hung and Wermter [14] proposed the dynamic adaptive self-organizing hybrid (DASH), which combines a dynamic structure, hierarchical training, non-stationary data learning and parameter self-adaptation. DASH starts with two units and stops when all units represent their associated input vectors well, or when the number of inputs is too small to build a sub-map. Recursive learning continues for each individual unit whose average quantization error does not meet the requirement of DASH. DASH uses a CHL topology to connect the BMU and SMU for an input stimulus. A connection is trimmed if it is relatively old compared with the other connections, and a unit without any connections is removed. However, once the model has met its quality requirement, the connection-trimming function is restrained.

Do Rêgo et al. [6] proposed the growing self-reconstruction map (GSRM), which extends the GNG model for the field of three-dimensional shape reconstruction. GSRM starts with three units randomly chosen from the input vectors. The main difference between GSRM and GNG is that GSRM uses an extension of the CHL to update the connections between units. The original CHL simply builds the connection between the BMU and SMU, whereas the extended competitive Hebbian learning (ECHL) both builds and removes connections in order to produce a triangular two-manifold mesh representation of a three-dimensional target object. A more complete literature review in the field of SOMs with a time-varying structure can be found in [22].

Although many extended SOM-like models have successfully used a time-varying structure in order to relieve the need for a predefined topological structure before the presentation of input data, their non-grid topology prevents them from being presented in a two-dimensional space without information loss. Therefore, these non-grid SOM-like models lose the visualization benefit of the original SOM, which is its major feature compared with other clustering techniques.

Unlike such non-grid SOM-like models, the growing grid (GG) is an incremental variant of the SOM in terms of the network topology [17]. By keeping a grid structure, GG uses a competition-based learning technique, which aims to make the BMU learn more than its neighbors for each input vector and to represent the relationships between input vectors on a two-dimensional map without information loss. However, the extent of modification is decided not only by the geometric relationship between the BMU and its neighbors but also by the distance between the input vector and each of its associated output units. Consequently, the BMU may not always move the most towards its associated input vector, precisely because its distance to this input vector is the shortest. Therefore, we propose a constrained learning function for GG that obeys the competitive learning principle [19, 22, 23].

3 The GG growing process

Like the SOM, GG is a neural clustering approach. More specifically, GG is a dynamic variant of the SOM. GG contains growing and fine-tuning stages, which use the same learning process but different learning rates: a fixed learning rate at the growing stage and a learning rate that decays over time at the fine-tuning stage. The growing process of GG is achieved by inserting rows or columns of units into the grid. GG starts with 2×2 units in a grid architecture and develops the grid periodically by inserting a whole column or row of units (Fig. 1). The units grow at the position between the most frequently activated unit and its farthest direct neighboring unit (Fig. 1b). Thus, by keeping a grid structure formed by the positional units, GG inherits from the SOM the ability to provide a two-dimensional view effortlessly. The growing frequency depends on the learning length and the target unit number, as shown in (1). After every \(g_{t}\) learning steps, a whole column or row of units is inserted. A criterion for stopping growth must be set in order to prevent the grid from growing indefinitely. Like the SOM, GG uses the predefined learning length as its natural stopping criterion. Unlike the SOM, which uses two predefined learning lengths for rough learning and fine-tuning learning respectively, GG starts its fine-tuning stage once its unit number reaches a predefined target number of units at the growing stage [17].

$$ g_{t} =\max \left(1,\;\left\lceil {\frac{l}{U-u_{t} }} \right\rceil \right), $$
(1)

where l is the learning length, U is the target unit number, t is time, and \(u_{t}\) denotes the number of units built at time t.
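
For concreteness, the growth schedule in (1) can be computed as in the short Python sketch below; the function and argument names are ours, not from the paper, and the sketch assumes growth stops before the target unit number is reached.

```python
import math

def growth_interval(learning_length, target_units, current_units):
    """Number of learning steps between row/column insertions, as in (1).

    `learning_length` stands for l, `target_units` for U and `current_units`
    for u_t; all three names are placeholders, not from the paper.
    Growth is assumed to stop before current_units reaches target_units.
    """
    return max(1, math.ceil(learning_length / (target_units - current_units)))

# Example: with a learning length of 200 and 4 of the 25 target units built,
# a new row or column is inserted every ceil(200 / 21) = 10 learning steps.
```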

Fig. 1 The growing process of the Growing Grid (GG) model. GG adds a set of units after every \(g_{t}\) learning steps, as in (1). Units are represented as circles. Lateral connections between units are represented as lines. Circle A denotes the most frequently activated unit and circle B is the farthest direct neighbor of circle A. Grey circles are new units at each growing iteration

4 The GG learning process

The learning process is the means by which units of the grid are modified to represent their associated input samples as faithfully as possible. Like the SOM, GG uses a topological structure in which adjacent units contain similar weights, so that the units self-organize into an ordered map [1]. Based on the current topological structure, units in the output layer are interconnected with lateral connections. This interconnection influences the difference in activation between output units during learning.

At each learning step, the first core rule is to decide the best matching unit (BMU) for the current input vector. The units of the current GG topology compete against each other by comparison with the input vector. The winner is the BMU, i.e. the unit most similar to the input vector, which represents the input vector best, as in (2). The second core rule is to make the BMU, together with its neighbors, move towards the input vector. The BMU should then be the unit with the largest activation value, and units in a nearby neighborhood are more active than units in a neighborhood further away.

$$ \left\| {X(t)-W_{b} (t)} \right\|\le \min\limits_{j=1}^{K} \left\| {X(t)-W_{j} (t)} \right\| $$
(2)

where \(\left \| {\;.\;} \right \|\) is the Euclidean norm, X is the input vector, \(W_{j}\) is the weight vector of output unit j, t denotes time, K is the total number of output units in the current topological grid and b denotes the BMU.
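
The BMU search in (2) amounts to a nearest-neighbor query over the current units; a minimal NumPy sketch, with placeholder names, is given below.

```python
import numpy as np

def best_matching_unit(x, weights):
    """Index b of the unit whose weight vector is closest to x, as in (2).

    `x` is a 1-D input vector and `weights` a (K, D) array with one weight
    vector per output unit; both names are placeholders.
    """
    distances = np.linalg.norm(weights - x, axis=1)  # Euclidean norm to every unit
    return int(np.argmin(distances))
```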

In Fig. 2, the input vector is represented as circle X. Units in neighborhood 1 are more active than those in neighborhood 2. The learning process is a way of making these units represent the input vectors better. In other words, as in vector quantization, the learning process is a way to reduce the quantization error. This is carried out by a neighborhood function and a learning rate in the update function of the GG model, as in (3). At the growing stage, the GG model uses a fixed learning rate so as to represent its associated input samples quickly. At the fine-tuning stage, the topological structure is fixed and the GG model fine-tunes its output units with a learning rate that decays over time, in order to pursue a better representation.

$$ W_{j} (t+1)=W_{j} (t)+\alpha (t)A_{b} (j,t)(X(t)-W_{j} (t)), $$
(3)

where t denotes time, α(t) is the learning rate function, \(A_{b}(j,t)\) is a bell-shaped neighborhood function centered on the BMU b, as in (4), X is the input vector, and \(W_{j}\) is the weight vector of output unit j.

$$ A_{b} (j,t)=e^{-d_{bj}^{2} /2{\delta_{t}^{2}} }, $$
(4)

where \(\delta_{t}\) is the neighborhood radius at time t and \(d_{bj}=\left | {r_{b} -r_{j} } \right |\) is the city-block distance between the grid positions \(r_{b}\) and \(r_{j}\) of output units b and j.
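
Putting (3) and (4) together, one learning step of the original GG rule can be sketched as follows; the names and array shapes are our assumptions, not the paper's notation.

```python
import numpy as np

def gg_update(x, weights, grid_pos, bmu, lr, radius):
    """One step of the standard GG update (3) with the bell-shaped
    neighborhood (4).

    `weights` is a (K, D) array of unit weight vectors, `grid_pos` a (K, 2)
    array of unit coordinates on the grid, `lr` plays the role of alpha(t) and
    `radius` of delta_t; all names are placeholders.
    """
    # City-block distance d_bj between the BMU and every unit on the grid
    d = np.abs(grid_pos - grid_pos[bmu]).sum(axis=1)
    # Bell-shaped neighborhood function A_b(j, t) centered on the BMU, as in (4)
    h = np.exp(-(d ** 2) / (2.0 * radius ** 2))
    # Move every unit towards x, scaled by the learning rate and neighborhood
    return weights + lr * h[:, None] * (x - weights)
```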

Fig. 2 An example of the neighborhood. The input vector is represented as circle X. Units in neighborhood 1 are more active than those in neighborhood 2

In Fig. 3, the output units before learning are denoted as light grey circles, such as units A, B, C, D, E, etc. Unit B is the best matching unit (BMU) because it is the nearest unit to the input vector X. In this example, only the winner and its direct neighbors are updated. That is, the winner B and its neighbors A, C, D and E move towards the input vector X.

Fig. 3 An example of 3×3 GG learning influenced by the current topological structure. The winner B and its neighbors A, C, D and E move towards the input vector X; the modified units are shown as dark circles

5 Constrained neural learning

The GG algorithm is a robust tool for mapping tasks, but it has also been proposed as a biological model of topological map formation. From the viewpoint of biological function, the competition feature enhances the winning unit’s reaction, while the cooperation feature makes the BMU and its neighbors work together in modifying the synaptic strengths [24]. The lateral connection of output units is crucial for self-organization and the formation of a topologically ordered network. The topologically ordered network is defined and proven in [25, 26], as in (5).

$$ A_{b} (b_{1},t)>A_{b} (b_{2},t)\quad \mathit{if}\quad \left\| {X(t)-W_{b_{1}} (t)} \right\|<\left\| {X(t)-W_{b_{2}} (t)} \right\| $$
(5)

where \(A_{b}(\cdot,t)\) is the neighborhood function centered on unit b and \(\left \| {\;.\;} \right \|\) is the Euclidean norm.

According to the GG algorithm, the update rule (3) forces each output unit within the defined neighborhood to move towards the input vector by an amount proportional to the distance between that output unit and the input vector. However, besides the neighborhood function, this distance also determines the strength of activation. Compared with its neighbors, the BMU may therefore not move the most towards the input vector, since the BMU is the closest unit to the input vector. In other words, the BMU may not always be the most responsive unit to a given input vector. Although the neighborhood function reduces this possibility, output units in the BMU’s direct neighborhood may still be overactive.

On the other hand, from a biological viewpoint, neurons or output units do not change their behavior abruptly but in a smooth way [27]. The adaptation rule of the traditional GG can make output units move back and forth during learning, so that the cortical column may not spread suitably. Thus, we constrain the adaptation of each output unit to the region between the input vector and its associated BMU. The modification of the traditional GG algorithm is proposed in (6). As unit b is the best matching unit, its Euclidean norm or distance to the input vector X at time t (i.e., \(\left \| {X(t)-W_{b} (t)} \right \|\)) is the shortest among all units. In (6), the activation of each neighbor of the BMU (i.e. unit j, where j is one of b's neighbors and \(j\ne b\)) is constrained by the ratio of \(\left \| {X(t)-W_{b} (t)} \right \|\) to \(\left \| {X(t)-W_{j} (t)} \right \|\). If \(X(t)\ne W_{j}(t)\) and \(j\ne b\), then \(\frac {\left \| {X(t)-W_{b} (t)} \right \|}{\left \| {X(t)-W_{j} (t)} \right \|}\) is less than or equal to 1. If \(X(t)\ne W_{j}(t)\) and \(j=b\), then \(\frac {\left \| {X(t)-W_{b} (t)} \right \|}{\left \| {X(t)-W_{j} (t)} \right \|}\) equals 1. Therefore, the modification in (6) guarantees that the BMU has the strongest activation for the input vector, while the update of the BMU itself is unaffected.

$$ W_{j} (t+1)=\left\{ {\begin{array}{ll} W_{j} (t)+\alpha (t)A_{b} (j,t)(X(t)-W_{j} (t))\dfrac{\left\| {X(t)-W_{b} (t)} \right\|}{\left\| {X(t)-W_{j} (t)} \right\|}, & \mathit{if}\;X(t)\ne W_{j} (t), \\ W_{j} (t), & \mathit{otherwise} \end{array}} \right. $$
(6)

where t denotes time, α(t) is the learning rate function and \(A_{b}(j,t)\) is the neighborhood function centered on the BMU b.
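
A sketch of the constrained rule (6), mirroring the unconstrained sketch in Section 4 and using the same assumed names and array shapes:

```python
import numpy as np

def cgg_update(x, weights, grid_pos, bmu, lr, radius, eps=1e-12):
    """One step of the constrained update (6): each unit's move is scaled by
    ||X - W_b|| / ||X - W_j||, so the BMU always reacts the most.

    Same placeholder names and shapes as the unconstrained sketch above; `eps`
    only guards the case X(t) = W_j(t), where (6) leaves the unit unchanged.
    """
    d = np.abs(grid_pos - grid_pos[bmu]).sum(axis=1)      # grid distance to the BMU
    h = np.exp(-(d ** 2) / (2.0 * radius ** 2))           # neighborhood A_b(j, t)
    dist = np.linalg.norm(weights - x, axis=1)            # ||X(t) - W_j(t)||
    ratio = np.where(dist > eps, dist[bmu] / np.maximum(dist, eps), 0.0)
    # ratio equals 1 for the BMU itself and is at most 1 for every other unit
    return weights + lr * (h * ratio)[:, None] * (x - weights)
```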

6 Evaluation measures

In this research, the average quantization error (AQE) [3], error entropy [28], average map scope (AMS) [23], BMU activation rate (BAR) and topographic error (TE) [3, 29] are used to evaluate the explanation ability, the smoothness of the hyper-surface, the border effect, the BMU activation and the topology preservation of the GG algorithms, respectively.

6.1 Average quantization error

The quantization error, also called the distortion measure for SOM-like models, was suggested by Kohonen as a measurement used in the vector quantization technique [3]. The quantization error is defined as the sum of the Euclidean norms between every input vector and its BMU. A trained topological structure with a smaller quantization error produces a smaller distortion for its input samples. Thus, a better GG model should have a smaller quantization error. As the number of output units affects the quantization error, this paper uses the average quantization error (AQE) instead. Given input samples \(X_{i}\), the average quantization error is defined as in (7).

$$ AQE=\frac{1}{N}\sum\limits_{i=1}^{N} {\left\| {X_{i} -W_{i} } \right\|} , $$
(7)

where N is the total number of input samples, \(X_{i}\) is the weight vector of input sample i, and \(W_{i}\) denotes the weight vector of the BMU of \(X_{i}\).
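
A minimal NumPy sketch of (7), with placeholder array names:

```python
import numpy as np

def average_quantization_error(samples, weights):
    """AQE as in (7): mean Euclidean distance from each input sample to the
    weight vector of its BMU.

    `samples` is an (N, D) array of input vectors and `weights` a (K, D) array
    of unit weight vectors; both names are placeholders.
    """
    # Pairwise distances from every sample to every unit, shape (N, K)
    dists = np.linalg.norm(samples[:, None, :] - weights[None, :, :], axis=2)
    # For each sample, the distance to its BMU is the row minimum
    return float(dists.min(axis=1).mean())
```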

6.2 Error entropy

The GG models use their trained grids to represent input vectors. An ideal SOM-like model should build a smooth hyper-surface with its units evenly distributed over the surface [3]. Based on this concept, Wu and Takatsuka [28] proposed a measure, namely error entropy, to evaluate the distribution of error over the units of the map. Entropy [30], from information theory, measures the uncertainty of variables and is able to evaluate the degree of uniformity of the error distribution on the map. This paper uses the same definition of error entropy as in the work of Wu and Takatsuka [28] and Hung [23], which is shown in (8). A map with greater error entropy has a more even error distribution and hence a smoother surface to represent the input vectors. A unit that does not represent any input samples contributes zero to the error entropy.

$$\begin{array}{@{}rcl@{}} && ErrorEntropy=-\left(\sum\limits_{j=1}^{K} {e_{j} \log_{2} e_{j} } \right)\\ &&where\;\left\{ {\begin{array}{l} e_{j} =\frac{UQE_{j} }{TQE},\;if\;UQE_{j} >0 \\ e_{j} =1,\;\quad \;\;if\;UQE_{j} =0 \end{array}} \right., \end{array} $$
(8)

where K is the total number of units, and \(UQE_{j}\) and TQE denote the quantization error of unit j and the total quantization error of the map, as shown in (9) and (10), respectively.

$$ UQE_{j} =\sum\limits_{i=1}^{L} {\left\| {X_{i} -W_{j} } \right\|} , $$
(9)

where \(W_{j}\) is the weight vector of unit j, which is the BMU for input sample i, and L is the total number of input samples whose BMU is unit j.

$$ TQE=\sum\limits_{i=1}^{N} {\left\| {X_{i} -W_{i} } \right\|} , $$
(10)

where \(X_{i}\) is the weight vector of input sample i, \(W_{i}\) is the weight vector of the BMU for input sample i and N is the total number of input samples.
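
Under our reading of (8)–(10), the error entropy can be computed as in the following sketch; the names are placeholders and ties for the BMU are assumed to be broken by the first minimum.

```python
import numpy as np

def error_entropy(samples, weights):
    """Error entropy as in (8)-(10); `samples` is (N, D), `weights` is (K, D)."""
    dists = np.linalg.norm(samples[:, None, :] - weights[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)                      # BMU index of each sample
    uqe = np.zeros(weights.shape[0])                # per-unit error UQE_j, as in (9)
    np.add.at(uqe, bmu, dists.min(axis=1))          # accumulate sample errors per BMU
    tqe = uqe.sum()                                 # total error TQE, as in (10)
    e = np.where(uqe > 0, uqe / tqe, 1.0)           # e_j = 1 for empty units, adding 0
    return float(-(e * np.log2(e)).sum())
```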

6.3 Average map scope

One well-known problem of the SOM-like model is a border effect, whereby units at the border of a trained map do not represent their associated input samples well compared with those units inside the map [3, 23]. The border effect happens because units at the border of the map have fewer neighbors than those inside the map. Thus, the whole map does not stretch out enough. The result of this effect is that input samples whose BMUs are at the border are not represented well compared with those samples whose BMUs are inside the trained map. For example, in Fig. 4a and b, each input sample is illustrated by a small grey dot and each output unit is represented by a circle. The trained map in Fig. 4b stretches wider than that in Fig. 4a, so the border effect in Fig. 4a is greater than that in Fig. 4b. Hung [23] proposed the average map scope (AMS) to evaluate the border effect, which is also used in this research. AMS is defined as the average distance of border units to the center of the map as in (11). A trained map of GG with a smaller border effect should have a larger average map scope.

$$ Average\;map\;scope=\frac{1}{B}\sum\limits_{i=1}^{B} {\left\| {W_{i} -W_{c} } \right\|} , $$
(11)

where B is the total number of border units, \(W_{i}\) denotes the weight vector of border unit i, and \(W_{c}\) denotes the average weight vector of all units, as shown in (12).

$$ W_{c} =\frac{1}{K}\sum\limits_{j=1}^{K} {W_{j} } , $$
(12)

where K is the total number of units and \(W_{j}\) is the weight vector of unit j.
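
A short sketch of (11) and (12), assuming the caller supplies a mask marking the border units of the grid:

```python
import numpy as np

def average_map_scope(weights, is_border):
    """AMS as in (11)-(12): mean distance of the border units to the map centre.

    `weights` is a (K, D) array of unit weight vectors and `is_border` a
    boolean mask marking units on the border of the grid; both names are
    placeholders.
    """
    centre = weights.mean(axis=0)                          # W_c in (12)
    border = weights[is_border]                            # border unit weight vectors
    return float(np.linalg.norm(border - centre, axis=1).mean())
```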

Fig. 4 (a) A 6×5 GG with a greater border effect. (b) A 6×5 GG with a smaller border effect

6.4 The BMU activation rate

As the traditional GG algorithm follows the competitive learning principle [17], the unit closest to the input vector should be more active than the others. However, according to the original update rule shown in (3), the BMU may not produce the greatest response to the input stimulus, because the distance between the BMU and its associated input vector is the shortest. Although the bell-shaped neighborhood function in (4) may reduce this effect, it is interesting to test whether the BMU moves more than its neighbors towards their associated input vector during learning. This test can be done with the BMU activation rate (BAR), shown in (13).

$$\begin{array}{@{}rcl@{}} BAR&=&\frac{1}{S}\sum\limits_{i=1}^{S} {r(x_{i} )} , \\ r(x_{i})&=&\left\{ {\begin{array}{l} 1,\;\mathit{if}\;\mathit{the}\;\mathit{BMU}\;\mathit{moves}\;\mathit{the}\;\mathit{most}\;\mathit{towards}\;x_{i} \\ 0,\;\mathit{otherwise} \end{array}} \right. \end{array} $$
(13)

where S is the total number of learning steps. The value of \(r(x_{i})\) equals 1 if the BMU moves the most towards its associated input vector \(x_{i}\), and 0 otherwise.
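
The indicator \(r(x_{i})\) can be evaluated per learning step as sketched below; interpreting "moves the most towards \(x_{i}\)" as the largest reduction in distance to \(x_{i}\) is our assumption, and the names are placeholders.

```python
import numpy as np

def bmu_moved_most(x, old_weights, new_weights, bmu):
    """One reading of r(x_i) in (13): 1 if the BMU got closer to x than any
    other unit did during this learning step, 0 otherwise.

    `old_weights` and `new_weights` are the (K, D) weight matrices before and
    after one update; BAR is the mean of this indicator over all S steps.
    """
    # Reduction in distance to x achieved by each unit in this step
    gain = (np.linalg.norm(old_weights - x, axis=1)
            - np.linalg.norm(new_weights - x, axis=1))
    return int(gain.argmax() == bmu)
```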

6.5 Topographic error

Like the SOM, one main goal of a GG model is to project input vectors from a high-dimensional space onto a low-dimensional space and to keep their internal structure as faithfully as possible. Therefore, nearby input vectors should be projected onto output units that are located nearby in the output space. For example, if two input vectors X and Y have a neighborhood relationship, their BMUs (i.e. \(W_{X}\) and \(W_{Y}\)) should also be neighbors. This characteristic of the projection preserves the topology of the data and is a useful evaluation criterion for the quality of neural clustering [31]. There are several measures to evaluate topology preservation, such as the topographic product [32], topographic function [33], topographic error [29] and some other alternatives [34]. Generally speaking, the tests for topology preservation require huge computational resources for the calculation of the distances between all units and the distances between all input vectors. This research uses the topographic error to measure topology preservation, as this approach reduces the computational burden, as suggested by Kohonen [3]. The topographic error is defined as the proportion of all data vectors whose first and second BMUs are not adjacent units, as in (14). The lower the topographic error, the better the GG preserves the topology.

$$ TE=\frac{1}{N}\sum\limits_{i=1}^{N} {u(x_{i} )},\quad u(x_{i} )=\left\{ {\begin{array}{l} 1,\;\mathit{if}\;\mathit{the}\;1^{st}\;\mathit{and}\;2^{nd}\;\mathit{BMUs}\;\mathit{are}\;\mathit{not}\;\mathit{neighbors} \\ 0,\;\mathit{otherwise} \end{array}} \right. $$
(14)

where neighbors denotes units with a direct connection between each other, N is the total number of input samples and \(x_{i}\) is an input vector.
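
A sketch of (14) for a rectangular grid, with placeholder names and the assumption that direct connections are the four lattice neighbors (city-block grid distance 1):

```python
import numpy as np

def topographic_error(samples, weights, grid_pos):
    """TE as in (14): the share of samples whose first and second BMUs are not
    directly connected on the grid.

    `samples` is (N, D), `weights` is (K, D) and `grid_pos` is a (K, 2) array
    of unit coordinates on the grid; all names are placeholders.
    """
    dists = np.linalg.norm(samples[:, None, :] - weights[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)                 # units sorted by distance per sample
    first, second = order[:, 0], order[:, 1]          # 1st and 2nd BMU indices
    grid_dist = np.abs(grid_pos[first] - grid_pos[second]).sum(axis=1)
    return float((grid_dist != 1).mean())
```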

7 Experiments

To test the performance of the traditional GG and the proposed constrained growing grid (CGG), we use fifteen data sets, including a square data set [23], animal data set [35], iris data set [36], ionosphere data set [37], sentiment polarity data set [38, 39] and the fundamental clustering problem suite (FCPS) [40], which consists of ten data sets (i.e., Atom, Chain Link, Engy Time, Golf Ball, Hepta, Lsun, Target, Tetra, Wing Nut, Two Diamonds). The whole learning length includes two stages, i.e. the growing and fine-tuning stages. The GG model grows its units only at the growing stage, with a fixed, larger learning rate. In the second stage, the GG model fine-tunes its current topological structure with a smaller learning rate, which decays over time. We extend the traditional GG model in two ways to test our constrained neural learning rule. The first, namely fine-tune CGG, keeps the same learning rule as the traditional GG at the growing stage and constrains the reaction of non-BMUs to their associated input vectors only at the fine-tuning stage. Thus we can evaluate the performance of the constrained neural learning rule in a stable topological structure of GG. The second, namely full CGG, uses the constrained neural learning rule at both the growing and fine-tuning stages. The full CGG realizes a real winner-take-more neural learning model, in which the best matching unit has the strongest reaction to its associated input stimulus.

For comparison, we treat the traditional GG as the benchmark model. The traditional self-organizing map (SOM) and its constrained version (CSOM) are also provided. All growing grid models use the same parameters, such as the learning rate, the learning length and the target unit number. In this paper, the fixed learning rate at the growing stage is 0.1 and the initial decaying learning rate at the fine-tuning stage is 0.005, which are the values used in the original GG paper [17]. The initial decaying learning rate for the SOM models is 0.1. The whole learning length of all models is set to 200 epochs and the predefined target unit number is set to 25. Different parameter settings were tried and led to similar conclusions. To obtain more general results, each model is tested 10 times and the average results are used for comparison. Finally, a t-test is used to test whether or not the difference between the proposed CGG and the traditional GG models is statistically significant.

7.1 The square data set

The square data set is used to demonstrate the ability of the SOM, constrained SOM, growing grid, fine-tune CGG and full CGG to learn a square data distribution. The square data set, proposed by Hung [23], uses 900 two-dimensional input samples from a 3×3 grid between (0, 0) and (3, 3), sampled at intervals of 0.1. Thus, this data set contains 900 evenly distributed points in a square. According to the experiments in Table 1, the traditional GG model outperforms the traditional SOM model when evaluated by the criteria of average quantization error (AQE), error entropy, BMU activation rate (BAR) and average map scope (AMS). The static SOM model is better than the dynamic GG model when evaluated by the topographic error criterion. The constrained SOM is better than the traditional one. The fine-tune CGG model is better than the traditional GG model only on the BAR and AMS criteria. One possible reason is that the effect of the constrained learning rule is limited once the learning rate has decayed to a very small value. However, if the constrained learning rule is used at both the growing and fine-tuning stages, our proposed full CGG model outperforms the GG and fine-tune CGG models.

Table 1 The Performance Comparison Evaluated by AQE, error entropy, BAR, AMS and TE for a square data set

We further use t-test to measure the degree of paired difference between GG models. Based on the criteria of AQE, BAR and AMS, the full CGG is significantly better than GG, which clearly indicates that our proposed model is an improvement on the traditional GG model on these criteria. In terms of the criteria of error entropy and TE, the full CGG model does not significantly outperform GG, the p-value in the t-test being greater than 0.05.

It is useful to evaluate different models by their convergence maps. The whole learning length of GG includes the growing and fine-tuning stages. Generally speaking, all convergence maps in Figs. 5, 6, 7, 8 and 9 are smooth due to taking the average results of 10 runs. In terms of the TE evaluation criterion (Fig. 9), the GG models appear to be slightly unstable during learning. However, the full CGG model performs better than other GG models in all evaluation criteria at each learning epoch.

Fig. 5 A convergence comparison of models evaluated by AQE for a square data set

Fig. 6 A convergence comparison of models evaluated by error entropy for a square data set

Fig. 7 A convergence comparison of models evaluated by BAR for a square data set

Fig. 8 A convergence comparison of models evaluated by AMS for a square data set

Fig. 9 A convergence comparison of models evaluated by TE for a square data set

7.2 The animal data set

The square data set contains input vectors which are clearly separated and thus the algorithms should have a good topological structure. The famous animal data set used in the literature [35] to demonstrate the ability of self-organizing semantic maps is also used in this paper. The animal data set contains 16 animals with 13 features and forms a 16 by 13 input matrix. Each row uses a schematic description of an animal, based on the presence (=1) or absence (=0) of some of the 13 different features. According to the experiments in Table 2 and Figs. 10, 11, 12, 13 and 14, the traditional GG model performs better than the traditional SOM model in all criteria except BAR. The fine-tune CGG model performs significantly better than the traditional GG model, evaluated by the criteria of AQE, error entropy and BAR. The constrained SOM and full CGG models outperform their traditional counterparts in all criteria except TE. In terms of the TE criterion, all GG models show instability in the convergence process (Fig. 14). The constrained GG models do not perform better than the benchmark for this criterion.

Fig. 10 A convergence comparison of models evaluated by AQE for an animal data set

Fig. 11 A convergence comparison of models evaluated by error entropy for an animal data set

Fig. 12 A convergence comparison of models evaluated by BAR for an animal data set

Fig. 13 A convergence comparison of models evaluated by AMS for an animal data set

Fig. 14 A convergence comparison of models evaluated by TE for an animal data set

Table 2 Performance Comparison Evaluated by AQE, error entropy, BAR, AMS and TE for an animal data set

7.3 The iris data set

The well-known iris data set collects iris flowers of three species, i.e. setosa, virginica and versicolor, each of which contains 50 input samples [36]. Each flower is illustrated by four features, i.e. the length of sepals, the length of petals, the width of sepals and the width of petals. According to the experiments in Table 3 and Figs. 15, 16, 17, 18 and 19, the traditional GG models perform better than the traditional SOM models in all criteria. The constrained SOM models perform better than the traditional SOM models in all criteria except TE. The fine-tune CGG models outperform the traditional GG models, evaluated by the criteria of AQE, error entropy and BAR. The full CGG models significantly outperform the traditional GG models in all criteria except TE.

Fig. 15 A convergence comparison of models evaluated by AQE for an iris data set

Fig. 16 A convergence comparison of models evaluated by error entropy for an iris data set

Fig. 17 A convergence comparison of models evaluated by BAR for an iris data set

Fig. 18 A convergence comparison of models evaluated by AMS for an iris data set

Fig. 19 A convergence comparison of models evaluated by TE for an iris data set

Table 3 Performance Comparison Evaluated by AQE, error entropy, BAR, AMS and TE for an iris data set

7.4 The ionosphere data set

The ionosphere data set contains 351 radar samples with 34 continuous attributes and a class label, forming a 351 by 34 input matrix [37]. Since the GG model is an unsupervised neural clustering approach, the class label is not included in this research. According to the experiments in Table 4 and Figs. 20, 21, 22, 23 and 24, the traditional GG models perform better than the traditional SOM models, and the constrained SOM models perform better than the traditional SOM models, on all criteria. The full CGG models are significantly better than the traditional GG models on the criteria of AQE, error entropy and BAR, but worse on the criteria of AMS and TE. According to the convergence maps in Figs. 20, 21, 22, 23 and 24, the TE criterion appears to be unstable during learning for all GG models.

Fig. 20 A convergence comparison of models evaluated by AQE for an ionosphere data set

Fig. 21 A convergence comparison of models evaluated by error entropy for an ionosphere data set

Fig. 22 A convergence comparison of models evaluated by BAR for an ionosphere data set

Fig. 23 A convergence comparison of models evaluated by AMS for an ionosphere data set

Fig. 24 A convergence comparison of models evaluated by TE for an ionosphere data set

Table 4 Performance Comparison Evaluated by AQE, error entropy, BAR, AMS and TE for an ionosphere data set

7.5 Sentiment polarity data set

In order to test the proposed approaches in a high-dimensional data set, a well-known movie review data set is used. The sentiment polarity data set contains 2000 movie reviews extracted from Internet movie database (IMDb, http://www.imdb.com/) by Pang and Lee [38]. After text processing in the work of Hung and Lin [39], this data set contains 3131 attributes and builds a 2000 by 3131 input matrix. According to the experiments in Table 5 and Figs. 25, 26, 27, 28 and 29, the constrained SOM models outperform traditional SOM models for all evaluation criteria. The traditional GG models perform better than the traditional SOM models, evaluated by the criteria of error entropy, AMS and TE. The fine-tune CGG models perform significantly better than the traditional GG models, evaluated by the criteria of AQE, error entropy and TE. The full CGG models outperform the traditional GG models in all evaluation criteria.

Fig. 25 A convergence comparison of models evaluated by AQE for a sentiment polarity data set

Fig. 26 A convergence comparison of models evaluated by error entropy for a sentiment polarity data set

Fig. 27 A convergence comparison of models evaluated by BAR for a sentiment polarity data set

Fig. 28 A convergence comparison of models evaluated by AMS for a sentiment polarity data set

Fig. 29 A convergence comparison of models evaluated by TE for a sentiment polarity data set

Table 5 Performance Comparison evaluated by AQE, error entropy, BAR, AMS and TE for a sentiment polarity data set

7.6 Fundamental clustering problem suite (FCPS)

The fundamental clustering problem suite (FCPS) contains ten data sets, which are treated as test beds for neural clustering [40]. A short description of the data sets is shown in Table 6. Based on the criteria of AQE, error entropy and BAR, except in a few cases, the experiments in Tables 7, 8 and 9 demonstrate that the fine-tune CGG models perform better than the traditional GG models. According to the experiments shown in Tables 7, 8, 9 and 10, the full CGG models significantly outperform the traditional GG models when evaluated by the criteria of AQE, error entropy, BAR and AMS. In terms of the TE criterion, the full CGG models perform better than the traditional GG models for seven data sets: Atom, Chain Link, Engy Time, Lsun, Target, Tetra and Wing Nut. Thus, the experiments show that the proposed full CGG models can significantly improve on the traditional GG model based on the five evaluation criteria. Convergence maps of the traditional GG, fine-tune CGG and full CGG are shown in Figs. 30, 31 and 32, respectively. From these convergence maps, we can assess the quality of the growing maps visually and see that the proposed full CGG is more effective than the others (Table 11).

Fig. 30 The convergence maps of growing grid after 0, 20 and 200 epochs for the target data set. A circle denotes an output unit and a grey dot denotes an input vector

Fig. 31 The convergence maps of fine-tune CGG after 0, 20 and 200 epochs for the target data set. A circle denotes an output unit and a grey dot denotes an input vector

Fig. 32 The convergence maps of full CGG after 0, 20 and 200 epochs for the target data set. A circle denotes an output unit and a grey dot denotes an input vector

Table 6 Description of fundamental clustering problem suite (FCPS)
Table 7 Performance Comparison Evaluated by AQE for FCPS
Table 8 Performance Comparison Evaluated by Error Entropy for FCPS
Table 9 Performance Comparison Evaluated by BAR for FCPS
Table 10 Performance Comparison Evaluated by AMS for FCPS
Table 11 Performance Comparison Evaluated by TE for FCPS

8 Conclusion

The growing grid (GG) is a very useful algorithm for mapping a high-dimensional data set onto a low-dimensional topological map without a predefined topological structure before learning. It combines a non-linear projection, a non-parametric statistical technique and an unsupervised competitive learning rule. In this paper, a novel constrained GG (CGG) is proposed, which emphasizes the effect of the lateral connections of output units and constrains the effect of the input vectors on the neighbors of the BMU. The fine-tune CGG uses the constrained neural learning rule only at its fine-tuning stage, whereas the full CGG uses the constrained neural learning rule at both the growing and fine-tuning stages. Fifteen data sets are used and evaluated by five criteria: average quantization error (AQE), error entropy, BMU activation rate (BAR), average map scope (AMS) and topographic error (TE). We compare the performance of the traditional SOM, constrained SOM, GG, fine-tune CGG and full CGG based on the average results of ten runs for each evaluation criterion. Except in some cases, the performance of the traditional GG is improved when the constrained learning rule is used, and applying the constrained neural learning rule at both the growing and fine-tuning stages outperforms applying it only at the fine-tuning stage. Finally, we use the t-test to test whether the improvements from our proposed model achieve statistical significance. Except in some cases, our proposed full CGG model significantly outperforms the models with the original learning rule. These exceptions may occur because some GG models randomly build a twisted map at their growing stage, so that the fine-tune CGG or full CGG cannot produce a significant improvement.

9 Scope for future work

For future work, the following issues may be considered. Firstly, our proposed approach could be applied to other competitive learning models, such as neural gas (NG), growing neural gas (GNG), growing neural gas with utility (GNG-U), grow when required (GWR) and the dynamic adaptive self-organizing hybrid (DASH). Secondly, more evaluation and application methods may be considered. For example, the receiver operating characteristic (ROC) curve could be explored for a recommender system, and other data sets from different domains may also be utilized for comparison. Thirdly, another important issue is how to avoid building a twisted topological map. Twisting may occur due to some uncontrollable random parameters; however, such random parameters, i.e. the initial random unit weights and the random input sample order, should not simply be removed. A twist-free mechanism would be a welcome addition to the field and is a possible direction for further research. Fourthly, according to some experiments, it seems difficult to improve topology preservation, as measured by the topographic error (TE) criterion, for neural growing models. A possible reason may be an inadequate learning length for each growing step. As this research focuses only on the winner-take-more principle, further research could evaluate the relationship between the learning length and the effectiveness of topology preservation. Fifthly, typical mapping algorithms have both online and offline variants, whereas this research applies only to the online growing grid algorithm. In future work, the proposed constrained neural learning could be used in offline mapping algorithms, and it would be interesting to compare the difference between online and offline constrained models.