1 Introduction

Real datasets are generally defined in a fuzzy context: there is no sharp boundary between classes, so fuzzy clustering is better suited to data classification. Classifying such data is therefore a very challenging task (Ayadi et al. 2007, 2010, 2011; Alimi 2000, 2003; Hamdani et al. 2011a; Dhahri and Alimi 2005; Kohonen 1998; Denaïa et al. 2007; Chang and Liao 2006). Clustering is an unsupervised classification mechanism in which a set of entities is grouped into a number of homogeneous clusters with respect to a given similarity measure. Due to the fuzzy nature of many classification problems, a number of fuzzy clustering methods have been developed (Ayadi et al. 2007, 2010, 2011; Alimi 2003; Hamdani et al. 2008, 2011a, b; El Malek et al. 2002; Pascual et al. 2000; Kohonen 1998, 2001; Denaïa et al. 2007; Fritzke 1994; Petterssona et al. 2007; Abraham and Nath 2001; Aguilera and Frenich 2001; Chang and Liao 2006). Fuzzy clustering can be applied as an unsupervised learning technique to classify patterns (Alimi 2000, 2003; Alimi et al. 2003).

In unsupervised training, the network learns to form its own classes from the training data without external adjustment. Unsupervised neural network systems use a competitive learning process based on similarity comparisons in a continuous space; the resulting system places similar inputs close to each other on a one- or two-dimensional grid. Among the architectures suggested for unsupervised neural networks, the self-organizing map (SOM) (Kohonen 1982) is the most effective at creating organized internal representations of input vectors and their abstractions. SOM is an unsupervised learning model used to visualise and explore linear and nonlinear relationships in high-dimensional data; it was first applied to speech recognition by Kohonen (1998, 2001). In the literature, SOM has proved successful in visualising rich and complex phenomena, such as pattern recognition tasks involving different criteria, especially in clustering (Denaïa et al. 2007; Fritzke 1994; Hsu and Halgarmuge 2001). SOM algorithms have since been applied in various research areas: classification (Kohonen 1998), managing ecosystems (Chang and Liao 2006) and GIS for addressing remediation (Petterssona et al. 2007). SOM approaches have also been applied to pattern recognition in big data (El Malek et al. 2002; Tlili et al. 2012; Amarasiri et al. 2004).

Generally, SOM approaches try to preserve the geometric form of the data, called the topology, with the minimum quantization error during training (Alimi et al. 2003; Hamdani et al. 2008; Pascual et al. 2000; Ayadi et al. 2012). However, SOM algorithms do not preserve the topology perfectly because of their predefined structure. Dynamic variants of SOM have therefore been proposed, which define the structure dynamically during the training process.

In this paper we present a new fuzzy dynamic learning method called FMIG (Fuzzy Multilevel Interior Growing Self-Organizing Maps). This algorithm is a hybrid approach integrating the MIGSOM technique (Multilevel Interior Growing Self-Organizing Maps) (Ayadi et al. 2010, 2012) and the FKCN algorithm (Fuzzy Kohonen Clustering Network) (Pascual et al. 2000). Our approach introduces a new form of fuzzy growing process and attempts to overcome some clustering difficulties by combining the best features of the multilevel structure of MIGSOM with the fuzzy clustering model of FKCN. MIGSOM develops its growing process from the node that accumulates the highest quantization error (Fig. 1) (Amarasiri et al. 2004; Ayadi et al. 2012), while FKCN presents a fuzzy clustering model based on Kohonen's Self-Organizing Feature Maps (SOFM) (Kohonen 1997) and Fuzzy C-Means (FCM) (Bezdek 1981, 1987). Both SOFM and FCM have been successfully applied in classification.

Fig. 1 New nodes insertion with the MIGSOM growing process: a from the boundary and b from the interior

One advantage of our approach is a new Growing Threshold (GT) used to control the growth of the map. We observed that the GT structure used in other algorithms such as MIGSOM becomes unwieldy and takes large values as the training process advances (Ayadi et al. 2010; Alimi 2000, 2003; Alimi et al. 2003). In addition, when the dataset is large, the computation becomes more complicated in later iterations and takes more time as the GT values grow. For these reasons, we created a new growing threshold for our FMIG methodology for big data clustering. The result is a new fuzzy dynamic SOM method that maintains the topology of the patterns and considerably decreases the quantization error.

The rationale underlying the presented FMIG algorithm is to combine the advantages of fuzzy clustering and of the MIGSOM approach. It is worth mentioning that FMIG serves two purposes: first, it can produce the best fuzzy distribution of the data by finding the topology corresponding to the fuzzy form of the inputs; second, it gives the minimum quantization error, providing the best cluster structure. FMIG is thus used to produce the best approximation to the real distribution of a given dataset. The algorithm finds a set of codebook vectors that compactly code all the input patterns, and the topological FMIG maps introduce an additional relation between the codebook vectors. This topological relationship is a constraint that has been shown to produce robust results with the FMIG algorithm (Alimi 2000; Alimi et al. 2003; Hamdani et al. 2011a).

The classification of high-dimensional datasets poses a computational problem that grows with the data size. To reduce the computational cost, a two-step procedure has been proposed: first, the datasets are trained with a neural map to find the quantization prototypes; then, the generated prototypes are clustered with a clustering algorithm such as FCM (Bezdek 1981, 1987). The choice of FCM is based on its better clustering compared to SOM and hierarchical algorithms (Mingoti and Lima 2006). However, the challenge of big data clustering is how to evaluate the performance of the resulting partition and find the optimal number of clusters. Several validity indexes have been developed that focus on the geometric structure of the data subject to unsupervised classification. The Dunn validity function (Dunn 1974) is based on the inter-cluster distance and the diameter of a hyperspheric cluster. Inspired by the Xie-Beni validity function, Kim and Ramakrishna developed in 2005 the Xie-Beni Index (XBI), based on a fuzzy compactness criterion (Mingoti and Lima 2006). In 2004, Tsekouras and Sarimveis proposed an index based on a fuzzy measure of separation called the Separation Validity Index (SVI) (Wu and Yang 2005). Lung and Yang proposed in 2004 a fuzzy index evaluating noisy points in the dataset, called Partition Coefficient And Exponential Separation (PCAES) (Hu et al. 2004), later normalized by Tlili et al. as PCAESN (2014). Wang and Lee presented a method that calculates the fuzzy degree of overlap between clusters, called Validity Overlap Separation (VOS) (Pal and Bezdek 1992). In 2009, Tlili et al. presented a new cluster validity function called the Fuzzy Validity Index with Noise-Overlap Separation (FVINOS) for evaluating fuzzy and crisp clustering (Tlili et al. 2014).
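To make this two-step procedure concrete, the sketch below shows the second step as a minimal numpy implementation of FCM applied to the prototypes of a trained map. The function name fcm and its defaults are illustrative assumptions, not taken from the cited implementations.

```python
import numpy as np

def fcm(prototypes, c, m=2.0, n_iter=100, seed=0):
    """Minimal Fuzzy C-Means (Bezdek 1981) over map prototypes (a sketch).

    prototypes: (N, D) array of codebook vectors from the trained neural map.
    Returns (centers, U), where U[i, k] is the membership of prototype k in
    cluster i and each column of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    N = len(prototypes)
    U = rng.random((c, N))
    U /= U.sum(axis=0)                                   # normalize memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ prototypes) / Um.sum(axis=1, keepdims=True)
        # distance of every cluster center to every prototype
        d = np.linalg.norm(centers[:, None, :] - prototypes[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                            # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)         # standard FCM membership update
    return centers, U
```

The prototypes would come from the trained neural map of the first step, e.g. `centers, U = fcm(map_weights, c=4)` for a four-cluster partition (`map_weights` is a hypothetical name).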

Our analysis is based on the cluster validity indexes FVINOS, PCAESN, VOS, SVI, Dunn and XBI, which we studied in our previous research (Tlili et al. 2014) (see Appendix).

In a first step, we undertook extensive performance comparisons with MIGSOM, GSOM and FKCN to establish the robustness of FMIG in mapping several synthetic as well as real-world datasets. We then clustered the generated prototypes with the FCM algorithm. Finally, we applied the validity indexes to the resulting partitions to evaluate FMIG's clustering performance.

This paper is organized as follows: Sect. 2 reviews the MIGSOM and FKCN methods. Section 3 introduces our new FMIG approach and details its steps. Section 4 evaluates the FMIG method against the MIGSOM, GSOM and FKCN algorithms in terms of quantization and topological errors on a variety of synthetic and real datasets. In Sect. 5, we focus on the robustness of our approach in data clustering by validating the obtained partitions with the validity indexes described above. Finally, Sect. 6 summarises our remarks and concludes.

2 Review of crisp and fuzzy SOM algorithms

2.1 Multilevel interior growing self-organizing maps algorithm (MIGSOM)

Proposed by Ayadi et al. (2010, 2012), this SOM variant introduces a new growing process. The MIGSOM algorithm comprises three phases:

  1. The initialization phase, which sets the Growing Threshold (GT) used to control the growth of the map and initializes the weight vectors.

  2. The growing phase, which increases the map size by adding new nodes from the unit presenting the highest quantization error.

  3. The smoothing phase, which refines the map resulting from the previous phase.

The MIGSOM algorithm proceeds as follows (a code sketch of the growing computations follows the list):

  1. Initialize the map grid with (2 \(\times \) 2) or (3 \(\times \) 3) nodes defined randomly and calculate the growth threshold (GT):

    $$\begin{aligned} \mathrm{GT} = -\ln ( n ) \times \ln ( \mathrm{SF} ) \end{aligned}$$
    (1)

    The GT is used to decide when to initiate new node growth; it specifies the spread of the feature map to be generated. The GT value depends on the required map growth: if only a very abstract picture of the data is needed, a large GT yields a map with fewer nodes, while a smaller GT makes the map spread out more. For these reasons, our new NGT form (Sect. 3) also integrates the characteristics of the dataset in order to represent big data faithfully. SF is the Spread Factor, taken in [0, 1], used to control the growth of the map, and n is the size of the given dataset.

  2. The growing phase comprises six steps. Step 1: present the input patterns to the network. Step 2: find the best-matching weight vectors (winners) closest to the input data using the Euclidean distance. Step 3: update the weight vectors according to the following formula:

    $$\begin{aligned} w_i ( {t+1})=\frac{\sum \limits _{j=1}^n {h_{ci} ( t)\times x_j } }{\sum \limits _{j=1}^n {h_{ci} ( t)} } \end{aligned}$$
    (2)

    with \(w_i (t)\) the weight vector of unit \(i\) and \(x_j\) the input pattern (\(j=1 \ldots n)\). \(h_{ci} ( t)\) represents the neighbourhood kernel centred on the BMU (Best Matching Unit), given by

    $$\begin{aligned} h_{ci} ( t)=\exp \left( {-\frac{Ud^2}{2\sigma ^2( t)}}\right) \times l( {\sigma ( t)-Ud}), \end{aligned}$$
    (3)

    where \(Ud\) is the distance between neuron \(i\) and the winning node \(c\) on the map, and \(\sigma ( t)\) represents the neighbourhood radius, i.e., the extent to which neurons in the vicinity of the winning neuron are excited. \(l( {\sigma _i -Ud})\) is an indicator function taking two values:

    $$\begin{aligned} l( {\sigma _i -Ud})=\left\{ {\begin{array}{l@{\quad }l} 1 &{} \text{ if } \sigma _i \ge Ud, \\ 0 &{} \text{ otherwise } \end{array}} \right. \end{aligned}$$
    (4)

    Step 4: compute the error of each node and identify the node \(q\) with the highest quantization error, with \(k\) units mapped to it:

    $$\begin{aligned} E_{rr} =\sum \limits _{j=1}^{k} {\left\| {x_j -w_q } \right\| } \end{aligned}$$
    (5)

    Step 5: if \(Err_q > \mathrm{GT}\), generate new nodes from \(q\), then initialize the weight vectors of the new nodes to match the weights of the neighbouring nodes. Step 6: repeat Steps 1–5 until the iteration limit is reached.

  3. Smoothing phase. Step 1: fix the neighbourhood radius to one. Step 2: train the map as in the growing phase, without adding new nodes.
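The growing-phase computations (Eqs. 1–5) can be summarized in a short numpy sketch. The node-insertion geometry itself (boundary vs. interior, Fig. 1) is omitted, and all names and signatures are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def growth_threshold(n, sf):
    """Eq. (1): GT = -ln(n) * ln(SF), with SF in (0, 1)."""
    return -np.log(n) * np.log(sf)

def neighbourhood(grid, c, sigma):
    """Eqs. (3)-(4): Gaussian kernel centred on unit c, cut off at radius sigma."""
    ud = np.linalg.norm(grid - grid[c], axis=1)          # map-space distance Ud
    return np.exp(-ud ** 2 / (2.0 * sigma ** 2)) * (ud <= sigma)

def growing_iteration(X, W, grid, sigma, gt):
    """One batch iteration of the growing phase (Steps 2-5), as a sketch.

    X: (n, D) inputs; W: (units, D) weights; grid: (units, 2) map coordinates.
    Returns updated weights, the highest-error node q, and a grow flag.
    """
    # Step 2: best-matching unit of every input (Euclidean distance)
    bmus = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    # Step 3 / Eq. (2): batch weight update weighted by the neighbourhood kernel
    H = np.stack([neighbourhood(grid, c, sigma) for c in bmus])   # (n, units)
    W = (H.T @ X) / np.fmax(H.sum(axis=0), 1e-12)[:, None]
    # Step 4 / Eq. (5): quantization error accumulated on each unit
    err = np.zeros(len(W))
    np.add.at(err, bmus, np.linalg.norm(X - W[bmus], axis=1))
    q = int(np.argmax(err))
    # Step 5: grow new nodes from q (boundary or interior) when Err_q > GT
    return W, q, err[q] > gt
```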

2.2 The fuzzy Kohonen clustering network: FKCN

Proposed by Pascual et al. (2000), FKCN is a fuzzy SOM heuristic. It consists of two layers: the input layer is formed by n nodes, with n the number of input patterns, and the output layer is composed of c nodes, where c is the number of output map vectors. Each input node is connected to all output units through a weight vector \(\nu _i \) (\(i=1 \ldots c)\). Based on a pre-defined learning rate, the neurons in the output layer update their weights for an input vector \(x_k \) (\(k=1 \ldots n)\). The FKCN approach integrates the fuzzy membership degree \(u_{ik,t} \) introduced by the fuzzy c-means algorithm (Bezdek 1981). The weight vector \(\nu _{i,t} \) at iteration \(t\) is given by

$$\begin{aligned} \nu _{i,t} =\nu _{i,t-1} +\alpha _{ik,t} \times ( {x_k -\nu _{i,t-1} }) \end{aligned}$$
(6)

The learning rate \(\alpha \) is defined as

$$\begin{aligned}&\alpha _{ik,t} =\left( {u_{ik,t} }\right) ^{m_t }\end{aligned}$$
(7)
$$\begin{aligned}&m_t =m_0 -t\times \Delta m\end{aligned}$$
(8)
$$\begin{aligned}&\Delta m=\frac{\left( {m_0 -1}\right) }{t_{\max } }, \end{aligned}$$
(9)

where \(t_{\max }\) is the value of the iteration limit and \(m_0 \) is a positive constant greater than 1.

The membership degree \(u_{ik,t} \) is given by

$$\begin{aligned} u_{ik,t} =\left[ {\sum \limits _{j=1}^c {\left[ {\frac{\left\| {x_k -\nu _{i,t} } \right\| }{\left\| {x_k -\nu _{j,t} } \right\| }} \right] } ^{2/(m_{t}-1)}} \right] ^{-1} \end{aligned}$$
(10)
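The FKCN updates (Eqs. 6–10) fit in a few lines of numpy. The sketch below is a minimal reading of the equations, with the fuzzifier floored slightly above 1 to avoid numerical blow-up as \(m_t\) approaches 1; the function name and defaults are assumptions for illustration.

```python
import numpy as np

def fkcn(X, c, m0=2.0, t_max=50, seed=0):
    """Sketch of FKCN training (Eqs. 6-10); names and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)].astype(float)  # initial weights
    dm = (m0 - 1.0) / t_max                              # Eq. (9)
    for t in range(1, t_max + 1):
        mt = max(m0 - t * dm, 1.05)                      # Eq. (8), floored near 1
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)   # (n, c)
        d = np.fmax(d, 1e-12)
        inv = d ** (-2.0 / (mt - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)         # Eq. (10) memberships
        alpha = U ** mt                                  # Eq. (7) learning rates
        for k in range(len(X)):                          # Eq. (6), one input at a time
            V += alpha[k][:, None] * (X[k] - V)
    return V
```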

3 FMIG: fuzzy multilevel interior GSOMs

We present a new form of fuzzy self-organizing algorithm called FMIG (Fuzzy Multilevel Interior GSOMs). Our approach is based on the MIGSOM algorithm (Ayadi et al. 2010, 2012), which adapts its structure according to the form of the data. The fuzzy aspect of our method is inspired by the FKCN algorithm (Pascual et al. 2000).

As in MIGSOM, the training process of FMIG is composed of three steps: an initialization phase, a growing phase and a smoothing phase. The main idea of FMIG is to introduce a new growing threshold to improve the control of map growth and the training process. In fact, the GT form used in the MIGSOM algorithm becomes unwieldy as the map grows; in addition, with big datasets, the computation becomes more complex and takes more time as GT values increase.

Our new approach FMIG shows two interesting properties:

  • It produces the best distribution of data by finding out the topology corresponding to the form of inputs.

  • It gives the minimum quantization error, providing the best cluster structure.

The FMIG algorithm consists of the following processes:

3.1 Initialisation

  • Initialize the pre-defined values of the Spread Factor (SF), taken in [0, 1] and used to control the growth of the map, and the fuzzifier factor \(m_{0}\). The spread factor lets the data analyst control the growth of the training process and is independent of the dimensionality of the given dataset. The system uses this factor to calculate the Growing Threshold (GT), which acts as a threshold value for initiating node generation: a high GT value results in a less spread-out map, while a low GT produces a well-distributed map.

  • Taking into account the size of the training dataset (n) and the number of its characteristics (D), the growing threshold gives a faithful image of the specific aspects of the data; mapping such data therefore produces the best distribution and topology. Our New Growing Threshold (NGT) is defined as follows:

    $$\begin{aligned} \mathrm{NGT} =\frac{-\ln ( D )\times \ln ( \mathrm{SF} )}{n} \end{aligned}$$
    (11)
  • Initialize the map grid with \(c_{0}\) nodes, (2 \(\times \) 2) or (3 \(\times \) 3), placed randomly, together with the weight vectors \(\nu _{0}= (\nu _{1,0}, \nu _{2,0}, \ldots , \nu _{c_0,0})\). Initialize the membership matrix randomly, with n the number of data inputs:

    $$\begin{aligned} U=\left[ {u_{ij} } \right] _{({i} =1 \ldots {c}_0,\; {j}=1 \ldots {n})} \end{aligned}$$
    (12)

    Fix \(t_{\max }\), the limit of iterations. (A numpy sketch of this initialisation is given below.)
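A minimal numpy sketch of this initialisation, under the assumption that the initial weight vectors are drawn from the data, could read:

```python
import numpy as np

def fmig_init(X, sf=0.05, c0=9, seed=0):
    """FMIG initialisation sketch: NGT (Eq. 11), initial nodes, memberships (Eq. 12)."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    ngt = -np.log(D) * np.log(sf) / n                    # Eq. (11)
    # c0 = 4 or 9 corresponds to the (2 x 2) or (3 x 3) initial grid;
    # here the weights are initialized from random data points for convenience
    V = X[rng.choice(n, size=c0, replace=False)].astype(float)
    U = rng.random((c0, n))
    U /= U.sum(axis=0)                                   # Eq. (12), columns sum to 1
    return ngt, V, U
```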

3.2 Growing process

For t=1... \(t_{\max }\):

  • Calculate the fuzzifier parameter \(m_t\) as shown in (8).

  • Determine the membership degree \(u_{ik,t} \) for the iteration t using the Euclidean distance between the input pattern \(x_k\) and the weight vector \(\nu _i \), with \(c_{t}\) being the number of weight vectors at the iteration \(t\) as defined in (10).

  • Compute the learning rate \(\alpha _{ik,t} \) as defined in (7).

  • The weight vectors are updated by the following formula (Pascual et al. 2000):

    $$\begin{aligned} \nu _{i,t} =\nu _{i,t-1} +\frac{\sum \limits _{k=1}^n {\alpha _{ik,t} \left( {x_k -\nu _{i,t-1} }\right) } }{\sum \limits _{s=1}^n {\alpha _{is,t} } } \end{aligned}$$
    (13)
  • Calculate the quantization error of each node \(i\) as in Ayadi et al. (2010):

    $$\begin{aligned} QE_i =\sum \limits _{j=1}^{\mathrm{nbu}} {\left\| {x_j -\nu _i} \right\| } \end{aligned}$$
    (14)

    with nbu the number of units mapped to node \(i\). The Quantization Error (QE) is a measure that completely disregards map topology and alignment; it is computed as the average distance between the sample vectors and the prototype vectors that represent them.

  • Identify the node \(q\) presenting the highest quantization error \(QE_{\max}\), with \(k\) units mapped to it, as in (5).

  • If \(QE_{\mathrm{max}} > \mathrm{NGT}\), generate new nodes from \(q\) as in Ayadi et al. (2010). (A sketch of one growing iteration follows the list.)
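Putting these bullets together, one growing iteration can be sketched as follows; the fuzzifier is floored slightly above 1 for numerical stability, and all names are illustrative.

```python
import numpy as np

def fmig_growing_iteration(X, V, t, m0, t_max, ngt):
    """One FMIG growing iteration (Eqs. 7, 8, 10, 13, 14), as a sketch."""
    mt = max(m0 - t * (m0 - 1.0) / t_max, 1.05)          # Eq. (8), floored near 1
    d = np.fmax(np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2), 1e-12)
    inv = d ** (-2.0 / (mt - 1.0))
    U = (inv / inv.sum(axis=1, keepdims=True)).T         # Eq. (10), shape (c, n)
    alpha = U ** mt                                      # Eq. (7)
    s = alpha.sum(axis=1, keepdims=True)
    V = V + (alpha @ X - s * V) / s                      # Eq. (13), batch fuzzy update
    # Eq. (14): quantization error of each node over the inputs it maps
    bmus = np.argmin(d, axis=1)
    qe = np.zeros(len(V))
    np.add.at(qe, bmus, np.linalg.norm(X - V[bmus], axis=1))
    q = int(np.argmax(qe))
    return V, U, q, qe[q] > ngt                          # grow from node q when True
```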

3.3 Smoothing process

  • Initialise the limit of smoothing iterations \(t_{s\mathrm{{max}}}\).

  • Train the map as in the growing process, without adding new nodes (Ayadi et al. 2010).

4 FMIG map quality preservation

To assess the performance of our technique, comparative studies are conducted in terms of quantization and topological errors. These comparisons involve the proposed FMIG algorithm, MIGSOM, FKCN and a variant of the GSOM method (Denaïa et al. 2007).

We first executed FMIG with the two forms of growing threshold, NGT (Eq. 11) and GT (Eq. 1), to demonstrate the performance of the new NGT structure.

4.1 Experimental settings

The experiments run the FMIG, MIGSOM [...], FKCN and GSOM algorithms with the same test parameters: the Euclidean distance function, fuzzy factor \(m_0 = 2\) and Spread Factor SF = 0.05. The iteration limit of the growing process \(t_{\mathrm{max}}\) is set to 30, 50 and 100; the smoothing process uses a maximum of \(t_{s\mathrm{max}} = 10\) iterations.

The algorithms are compared in terms of the quantization and topological errors, which are defined as follows (a computational sketch is given after the list):

  • Topological Error: the Final Topologic Error (FTE) is computed as in Ayadi et al. (2010):

    $$\begin{aligned} \mathrm{{FTE}}=\frac{1}{c_t }\sum \limits _{i=1}^{c_t } {mu( {\nu _i })} \end{aligned}$$
    (15)

    \(mu( {\nu _i })\) equals 1 if the first and the second Best Matching Units (BMUs) of \(\nu _i \) are adjacent, and 0 otherwise.

  • The Final Quantization Error (FQE) is computed as the normalized average distance between each input data point and its BMU:

    $$\begin{aligned} \mathrm{FQE}=\left\{ {\begin{array}{ll} 1 &{} \text{ if } \text{ no } \text{ data } \text{ point } \text{ matches } \text{ the } \text{ unit } \\ \frac{1}{c_t }\sum \limits _{i=1}^{c_t } {\left[ {\frac{\frac{1}{\mathrm{nbu}}\sum \limits _{j=1}^{\mathrm{nbu}} {\left\| {x_j -\nu _{\mathrm{BMU},t}} \right\| } }{\mathrm{norm}( {\nu _i })}} \right] } &{} \text{ otherwise } \end{array}} \right. \end{aligned}$$
    (16)

    \(\nu _{\mathrm{BMU},t} \) is the weight vector of the BMU at iteration t.
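The two measures can be computed as sketched below. Eq. (15) is read here per matched input, as is usual for topographic error, and norm(\(\nu _i\)) is taken as the Euclidean norm of the unit's weight vector; both readings are assumptions where the text leaves room for interpretation.

```python
import numpy as np

def map_quality(X, V, grid):
    """Sketch of the FTE (Eq. 15) and FQE (Eq. 16) computations.

    X: (n, D) inputs; V: (c, D) unit weights; grid: (c, 2) map coordinates.
    """
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)    # (n, c)
    order = np.argsort(d, axis=1)
    b1, b2 = order[:, 0], order[:, 1]                    # first and second BMUs
    # Eq. (15): proportion of cases whose two BMUs are adjacent on the grid
    fte = float((np.linalg.norm(grid[b1] - grid[b2], axis=1) <= 1.0).mean())
    # Eq. (16): per-unit normalized average distance to the matched inputs
    terms = []
    for i in range(len(V)):
        mask = b1 == i
        if not mask.any():
            terms.append(1.0)                            # no data point matches unit i
        else:
            avg = np.linalg.norm(X[mask] - V[i], axis=1).mean()
            terms.append(avg / np.linalg.norm(V[i]))
    fqe = float(np.mean(terms))
    return fqe, fte
```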

Table 1 Datasets description

4.2 Datasets description

To test the effectiveness of the proposed FMIG algorithm, we performed a sensitivity analysis on synthetic datasets of different dimensions (Dataset1, Dataset2 and Dataset3) and on real datasets (IRIS, Ionosphere, Letter, MNIST, DNA and NE) (Blake and Merz 1998). NE (Theodoridis 1996) consists of postal addresses of three metropolitan cities in the US (New York, Philadelphia and Boston). All datasets are described in Table 1.

  • Dataset1 is composed of 1,600 two-dimensional points distributed in 4 overlapping clusters.

  • Dataset2 presents 1,300 two-dimensional points distributed in 13 noisy clusters. (An illustrative generator for comparable synthetic data is sketched below.)
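The paper does not specify the exact generators for Dataset1 and Dataset2; the following hypothetical sketch produces comparable data (Gaussian clusters, optionally with uniform outliers) for readers who wish to reproduce similar settings.

```python
import numpy as np

def make_clusters(n, k, spread, noise_frac=0.0, seed=0):
    """Generate n two-dimensional points in k Gaussian clusters, plus optional noise."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 10.0, size=(k, 2))
    X = np.vstack([c + spread * rng.standard_normal((n // k, 2)) for c in centers])
    if noise_frac > 0.0:                                 # scattered outliers
        X = np.vstack([X, rng.uniform(-2.0, 12.0, size=(int(noise_frac * n), 2))])
    return X

dataset1_like = make_clusters(1600, 4, spread=1.5)                     # overlapping
dataset2_like = make_clusters(1300, 13, spread=0.4, noise_frac=0.05)   # noisy
```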

4.3 Results and analyses

4.3.1 Multilevel map structure with FMIG

Figure 2 illustrates the labelled FMIG map structures for Dataset1, IRIS and NE. These maps result from executing FMIG with 50 growing-phase iterations. As shown, FMIG is able to produce multilevel oriented grid structures; in addition, the maps present homogeneous clusters.

Fig. 2 Grid structure of the FMIG algorithm for a Dataset1, b IRIS and c NE datasets

4.3.2 New growing threshold NGT

To demonstrate the improvement in data quantization and topology preservation introduced by the new NGT structure (11), we performed tests on different datasets. We executed the FMIG algorithm in two versions, using the GT and NGT formulas respectively, to assess the effect of the new NGT parameter. Table 2 shows the FQE and FTE results of executing FMIG 100 times with each of the two forms.

Table 2 Comparison of growing thresholds NGT and GT

Table 2 shows that using NGT yields the minimum quantization error: the FQE obtained with the proposed NGT form is clearly smaller than the FQE resulting from the classic GT form for all datasets, especially for big data (Letter and NE).

The results in Table 2 also show that the new growing threshold improves the topological error: FTE with NGT gives minimum values on the tested databases compared to FTE with GT. Minimum values of the topological error lead to a well-distributed map.

This experimental study confirms that in iterative processes the number of iterations matters for evaluating the results, because more iterations increase the chance of reaching the global solution of the problem. If the stopping condition is chosen correctly, the iterative process can run up to the maximum number of iterations and produce the algorithm's optimal values. For these reasons, we designed the new growing parameter of the FMIG algorithm as the stopping condition of the growing process:

If \(QE_{\max} < \mathrm{NGT}\), stop growing; otherwise continue generating new nodes.

With a small NGT, the process is likely to continue. A high threshold value results in a less spread-out map, whereas a low threshold produces a well-distributed map.

Since our NGT is smaller than the original GT, the iterative process continues until the maximum number of iterations, leading to a well-spread map.
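As a worked illustration with the Sect. 4.1 settings (SF = 0.05) and Dataset1 (n = 1600, D = 2), the two thresholds differ by roughly four orders of magnitude:

$$\begin{aligned} \mathrm{GT} = -\ln (1600) \times \ln (0.05) \approx 22.10, \qquad \mathrm{NGT} = \frac{-\ln (2) \times \ln (0.05)}{1600} \approx 0.0013. \end{aligned}$$

With NGT, the growing condition \(QE_{\max} > \mathrm{NGT}\) is therefore satisfied almost every iteration, so node generation continues up to \(t_{\max}\).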

4.3.3 Comparison of FMIG with MIGSOM, GSOM and FKCN

In this study, we executed the proposed FMIG algorithm on different databases with different numbers of iterations. The experimental results of the comparison with MIGSOM, GSOM and FKCN are presented in Tables 3, 4 and 5, for 100, 200 and 500 iterations respectively.

Table 3 Experimental results with 100 iterations

Table 3 shows that at 100 iterations the quantization and topographic quality of the two methods FMIG and MIGSOM are similar on the synthetic databases; in fact, the two methods give the same quantization error for Dataset3, which is a small dataset. On real data presenting varied geometric aspects, however, FMIG produces the best values compared to MIGSOM at similar map sizes. This improvement can be explained by the combination of MIGSOM's multilevel maps with the fuzzy aspect of FKCN, which lets FMIG produce a well-distributed map that accounts for the characteristics of the dataset. The results given by GSOM and FKCN are far from the optimal values found by our method on both synthetic and real datasets.

Table 4 Experimental results with 200 iterations
Table 5 Experimental results with 500 iterations

Executing the algorithms for 200 and 500 iterations shows that FMIG achieves better quantization and topological errors than MIGSOM, GSOM and FKCN (Tables 4, 5). Our fuzzy algorithm generates the best distribution, producing the minimum FTE values, especially on big real datasets. With the new NGT structure, which takes the data size and its characteristics into account, together with the updated predefined parameters, FQE and FTE reach their minimum values, improving the quantization quality of the map and preserving the topographic aspect of the datasets.

Analysing each database of Tables 4 and 5 separately, we discuss the results given by FMIG compared to those generated by MIGSOM, GSOM and FKCN. The map structure produced by FMIG preserves the distribution of Dataset1 with the minimum quantization and topological errors; the same holds for Datasets 2 and 3.

The real databases are even better preserved by our new algorithm in terms of quantization and topological errors; in fact, FMIG reaches its minimum values at 100 and 200 iterations compared to MIGSOM. We note that GSOM and FKCN become inefficient at training big data and reach their limits at 200 and 500 iterations, leading to unacceptable errors.

For example, MNIST is a big dataset with a single aspect of data distribution, which is why the results given by FMIG are more explicit and faithful to reality.

As shown in Tables 4 and 5, the difference between the crisp and fuzzy algorithms becomes significant as the growing iterations increase: 200 and 500 growing iterations of FMIG generate the best maps in terms of FQE and FTE compared with the crisp MIGSOM and GSOM. These optimal results follow from the new fuzzy form of the weight vectors (13), which takes the fuzzy membership degrees (10) into account. It is worth mentioning that GSOM and FKCN fail to map the databases while preserving the topology criterion; furthermore, these algorithms require considerable execution time on big data to complete their learning phase. The maps resulting from GSOM and FKCN training present large errors compared to FMIG, which can be explained by the limitations of GSOM (Ayadi et al. 2007; Amarasiri et al. 2004) and FKCN in learning big datasets. It is therefore clear that the results shown in the tables above are influenced by the number of iterations in the growing phase of the FMIG, MIGSOM, GSOM and FKCN algorithms.

5 Evaluation of FMIG clustering using validity indexes

In this study, we applied a three-stage clustering process to compare the performance of FMIG with MIGSOM, GSOM and FKCN. We first trained the synthetic and real datasets with the four algorithms for 100 growing iterations. The resulting prototypes were then clustered with the FCM algorithm. Finally, we applied six validity indexes: FVINOS (Tlili et al. 2014), PCAESN (Hu et al. 2004), SVI (Wu and Yang 2005), VOS (Pal and Bezdek 1992), Dunn (Dunn 1974) and XBI (Mingoti and Lima 2006). Table 6 presents the clustering evaluation of FMIG compared with the other algorithms.

Table 6 Clustering evaluation with validity indexes

The results in Table 6 show that the FCM partitions obtained from FMIG neural maps are optimal and yield the best number of clusters, as evaluated by the validity indexes. On databases presenting the multiple geometric aspects described in Table 1, FVINOS detects the optimal number of clusters from FMIG training as well as from MIGSOM, GSOM and FKCN training. In addition, the majority of indexes give their best results with FMIG prototypes, which can be explained by FMIG's ability to produce well-distributed maps that preserve the data characteristics. Moreover, on the noisy Dataset2, FMIG maps generate the best partition: all the indexes detect the outliers and give a good estimate of 13 clusters. The MIGSOM algorithm produces the best partition with FVINOS, PCAESN and VOS, but the other indexes are unable to detect the optimal number of clusters. As presented in Table 6, the GSOM and FKCN methods return the best partition only with FVINOS and PCAESN, because these indexes are able to detect noise (Tlili et al. 2014).

Overlapping datasets such as Dataset1, IRIS and MNIST are well clustered from FMIG maps. As shown in Table 6, all the indexes discover the optimal number of clusters for Dataset1. For IRIS, only XBI fails to give a good estimate with FMIG; the other indexes produce the best partition. MNIST, a big dataset, is well trained with FMIG, which yields 10 clusters in the resulting partition; the FVINOS, VOS, Dunn and PCAESN indexes are able to detect the overlap aspect of MNIST (Tlili et al. 2014). MIGSOM, GSOM and FKCN produce the best partition only with FVINOS and fail to detect the optimal number of clusters in the other cases.

The heterogeneous DNA dataset is well clustered by the FMIG partitioning, which produces the optimal number of clusters as evaluated by each of the indexes. MIGSOM, GSOM and FKCN give a good estimate with FVINOS and VOS, but the other indexes fail to detect the optimal result.

The clustering of huge databases such as Letter and NE proves that FMIG is able to produce the best distribution of prototypes for FCM clustering: with FMIG, the FVINOS, PCAESN, VOS and SVI indexes detect the best partition. MIGSOM, GSOM and FKCN give the optimal number of clusters with FVINOS but fail to produce good results with the other indexes, because they are limited in mapping huge data.

6 Conclusion

In this paper, we have presented a new form of fuzzy dynamic self-organizing map algorithm called FMIG. The new method improves the topological and quantization errors and generates its structure according to the form of the data; its fuzzy aspect adapts the growing process to real problems. We presented a New Growing Threshold (NGT) to improve the control of map growth and the training process, and we showed that the quantization and topological error values are minimized by applying it. FMIG achieves the goals of producing the best distribution of data by finding the topology corresponding to the form of the inputs, and of giving the minimum quantization error, thereby providing the best cluster structure. Furthermore, FMIG is able to generate good representations of big data.

Future research should focus on improving the performance of FMIG algorithm over high-dimensional datasets by incorporating some evolutionary techniques and providing a new form of Fuzzy evolutionary SOM.