1 Introduction

The size and complexity of data collections have been growing since the beginning of the computer era, and in recent years the dimensionality of data sets has also been increasing rapidly. Contemporary and, especially, future technologies allow us to acquire, store and process large high-dimensional data collections. Such collections are commonly available in areas like medicine, biology, information retrieval, web analysis, social network analysis, image processing, financial transaction analysis and many others.

Clustering, considered the most important unsupervised learning problem, is used to reveal structures, to identify "natural" groupings in data collections and to reduce large amounts of raw data by categorizing them into smaller sets of similar items.

There are two main issues when clustering based on unsupervised learning, such as Growing Neural Gas (GNG) [9], is performed on vast high-dimensional data collections:

  1. The fast growth of computational complexity with respect to growing data dimensionality, and

  2. The specific behaviour of similarity measurements in a high-dimensional space, where the Euclidean distances from any given point to its closest and to its farthest point become nearly indistinguishable as the dimensionality grows [2].

These two factors reduce the effectiveness of clustering algorithms on the above-mentioned high-dimensional data collections in many real applications.
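The second issue can be illustrated numerically. The following self-contained sketch (our own illustration, not taken from [2]) compares the nearest and farthest Euclidean distances from a random query point to a random sample as the dimensionality grows; the ratio of the two distances approaches one.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <limits>
#include <random>
#include <vector>

// Our own numerical illustration of distance concentration: with growing
// dimensionality, the nearest and farthest Euclidean distances from a query
// point to a random sample become almost indistinguishable.
int main()
{
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const int nPoints = 1000;
    const int dims[] = {2, 10, 100, 1000, 8000};

    for (int dim : dims) {
        std::vector<double> query(dim);
        for (double& q : query) q = uni(gen);

        double dMin = std::numeric_limits<double>::max(), dMax = 0.0;
        for (int p = 0; p < nPoints; ++p) {
            double d2 = 0.0;
            for (int i = 0; i < dim; ++i) {
                const double diff = uni(gen) - query[i];
                d2 += diff * diff;
            }
            const double d = std::sqrt(d2);
            dMin = std::min(dMin, d);
            dMax = std::max(dMax, d);
        }
        // The ratio dMax / dMin approaches 1 as dim grows.
        std::printf("dim = %5d  min = %8.3f  max = %8.3f  max/min = %.3f\n",
                    dim, dMin, dMax, dMax / dMin);
    }
    return 0;
}
```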

The growth of computational complexity can be partially addressed using parallel computation facilities, such as a High Performance Computing (HPC) cluster with MPI. Naturally, it is necessary to resolve technical and implementation issues specific to this computing platform. An effective parallel implementation of GNG is discussed in this paper, with the main focus on minimizing interprocess communication.

2 Artificial Neural Networks

2.1 Related Works

The methods based on Artificial Neural Networks (ANN) are highly computationally expensive. There are different approaches to improving the efficiency of these methods. One possibility is to optimize the computation itself. The authors of [3] propose two optimization techniques aimed at an efficient implementation of the internal structure of the GNG algorithm. Their optimizations preserve all properties of the GNG algorithm. The first technique enhances the nearest-neighbour search using space partitioning by a grid of rectangular cells, and the second speeds up the handling of node errors using a lazy evaluation approach.

Another possibility for improving the efficiency of ANN-based methods is parallelization. In [4] the authors combine the batch variant of the GNG algorithm with the MapReduce paradigm, resulting in a GNG variant suitable for processing large data sets in scalable, general cluster environments. The paper [16] focuses on the updates of neuron weights in the learning phase of a parallel implementation of SOM. There are two extreme update strategies: in the first, all necessary updates are performed immediately after processing each input vector; in the other, used in Batch SOM, the updates are applied at the end of the whole epoch. The authors study update strategies lying between these two extremes.

Graphics Processing Units (GPUs) are often used for parallelization. In [1] the authors present the results of different parallelization approaches to the GNG clustering algorithm, investigating in particular GPU and multi-core CPU architectures. The authors of [12] explore an alternative approach to the parallelization of growing self-organizing networks, based on an algorithm variant designed to match the large-scale, fine-grained parallelism of GPUs, in which multiple input signals are processed simultaneously. The paper [15] describes the implementation and analysis of a network-agnostic and convergence-invariant coarse-grained parallelization of the deep neural network (DNN) training algorithm. The coarse-grained parallelization is achieved by exploiting batch-level parallelism. This strategy is independent of the support of specialized and optimized libraries, so the optimization is immediately available for accelerating DNN training. The proposal is compatible with multi-GPU execution without altering the algorithm's convergence rate.

2.2 Growing Neural Gas

The principle of this neural network is an undirected graph, which need not be connected; generally, there are no restrictions on the topology. The graph is generated and continuously updated by competitive Hebbian learning [8, 13]. According to pre-set conditions, new neurons are automatically added, and connections between neurons age over time and can be removed. GNG can be used for vector quantization by finding the code-vectors in clusters [7], for clustering data streams [6], in biologically influenced variants [14] and for 3D model reconstruction [10]. GNG works by modifying the graph, where the operations are the addition and removal of neurons and of edges between neurons.

To understand the functioning of GNG, it is necessary to define the algorithm. The algorithm described in our previous article [17] is based on the original algorithm [5, 7], but it is modified for better continuity with the SOM algorithm. The description of the algorithm has been divided into two parts for convenience. Algorithm 1 describes one iteration.
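For orientation, a condensed code sketch of one GNG adaptation step is given below. It follows the standard GNG algorithm [5, 7]; the data layout and names are our own illustration and may differ in detail from Algorithm 1. Neuron insertion (performed every \(\gamma\) iterations) and the global error decay are left out for brevity.

```cpp
#include <cstddef>
#include <iterator>
#include <limits>
#include <map>
#include <vector>

// Condensed sketch of one GNG adaptation step in the spirit of [5, 7]; the
// data layout and names are our own. Neuron insertion (every gamma steps)
// and the global error decay are handled outside this function.
struct Neuron {
    std::vector<double> w;     // weight vector
    double error = 0.0;        // accumulated squared error
    std::map<int, int> edges;  // neighbour id -> edge age
};

static double sqDist(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Assumes the network already contains at least two neurons.
void gngIteration(std::map<int, Neuron>& net, const std::vector<double>& x,
                  double e_w, double e_n, int a_max)
{
    // 1) Find the two nearest neurons N_c1 and N_c2.
    int c1 = -1, c2 = -1;
    double d1 = std::numeric_limits<double>::max(), d2 = d1;
    for (const auto& [id, n] : net) {
        const double d = sqDist(n.w, x);
        if (d < d1)      { d2 = d1; c2 = c1; d1 = d; c1 = id; }
        else if (d < d2) { d2 = d;  c2 = id; }
    }

    // 2) Accumulate the winner's squared error and age its edges.
    net[c1].error += d1;
    for (auto& [nb, age] : net[c1].edges) { ++age; ++net[nb].edges[c1]; }

    // 3) Move the winner (rate e_w) and its neighbours (rate e_n) towards x.
    for (std::size_t i = 0; i < x.size(); ++i)
        net[c1].w[i] += e_w * (x[i] - net[c1].w[i]);
    for (auto& [nb, age] : net[c1].edges)
        for (std::size_t i = 0; i < x.size(); ++i)
            net[nb].w[i] += e_n * (x[i] - net[nb].w[i]);

    // 4) Create (or refresh) the edge between N_c1 and N_c2 with age 0.
    net[c1].edges[c2] = 0;
    net[c2].edges[c1] = 0;

    // 5) Remove edges older than a_max, then neurons left without edges.
    for (auto& entry : net) {
        auto& edges = entry.second.edges;
        for (auto it = edges.begin(); it != edges.end();)
            it = (it->second > a_max) ? edges.erase(it) : std::next(it);
    }
    for (auto it = net.begin(); it != net.end();)
        it = it->second.edges.empty() ? net.erase(it) : std::next(it);
}
```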

Remark. The notation used in the paper is briefly listed in Table 1.

Table 1. Notation used in the paper

3 Parallelization

In the paper [17] we dealt with the parallelization of GNG. The following is a brief description of our parallelization algorithm.

After analysing the GNG learning algorithm, we identified its most time-consuming parts and selected them as candidates for parallelization. The selected areas are:

  • Finding BMU – this part of GNG learning can be significantly accelerated by dividing the GNG output layer into smaller pieces – a distribution of neurons for effective parallelization. Each piece is then assigned to an individual computation process. The crucial point of this part of GNG learning is the calculation of the Euclidean distance between the individual input vector and all the weight vectors in a given part of the GNG output layer. Each process finds its own partial BMU in its part of the GNG output layer. Each partial BMU is then compared with the BMUs obtained by the other processes. Information about the BMU of the whole network is then transmitted to all the processes so that they can perform the updates of the BMU neighbourhood (a minimal MPI sketch of this reduction is given after this list).

  • Updates of weights – updating the weights of the nodes connected by edges incident with \(N_{c1}\) is fast when the neighbours of \(N_{c1}\) reside on the same process. This part could theoretically be accelerated by moving adjacent nodes onto a single process. Unfortunately, transferring a node is very time consuming for high-dimensional data (the test data have a dimension of 8,000+).
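The BMU reduction described above maps naturally onto an MPI_Allreduce with the MPI_MINLOC operator. The following is a minimal sketch of this step, assuming each process keeps its part of the output layer as a dense set of weight vectors; the function and variable names are our own illustration, not the paper's implementation.

```cpp
#include <mpi.h>
#include <cstddef>
#include <limits>
#include <vector>

// Hedged sketch of the parallel BMU search: each MPI process owns a slice of
// the GNG output layer (localWeights), finds its partial BMU, and an
// MPI_Allreduce with MPI_MINLOC selects the global BMU across all processes.
struct DistRank { double dist; int rank; };   // layout matches MPI_DOUBLE_INT

void findGlobalBMU(const std::vector<std::vector<double>>& localWeights,
                   const std::vector<double>& input, MPI_Comm comm,
                   int& bmuRank, int& bmuLocalIndex)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    // 1) Local search: squared Euclidean distance from the input vector to
    //    every neuron stored on this process.
    double bestDist = std::numeric_limits<double>::max();
    int bestIdx = -1;
    for (std::size_t n = 0; n < localWeights.size(); ++n) {
        double d = 0.0;
        for (std::size_t i = 0; i < input.size(); ++i) {
            const double diff = localWeights[n][i] - input[i];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; bestIdx = static_cast<int>(n); }
    }

    // 2) Global reduction: every process learns which rank holds the BMU.
    DistRank local{bestDist, rank}, global{};
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC, comm);
    bmuRank = global.rank;

    // 3) The winning rank broadcasts the local index of its BMU so that all
    //    processes can consistently update the BMU and its neighbourhood.
    bmuLocalIndex = bestIdx;
    MPI_Bcast(&bmuLocalIndex, 1, MPI_INT, bmuRank, comm);
}
```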

A detailed description of our parallelization process is given in Fig. 1.

Fig. 1. Parallel algorithm

The parallelization of GNG learning was performed on an HPC cluster using Message Passing Interface (MPI) technology. MPI is based on efficient communication between processes: one application runs as a set of MPI processes on many cores, and the processes can send and receive messages and data. Detailed information about HPC and MPI technology is provided, for example, in [11].
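For illustration, a minimal MPI skeleton of such a process layout is sketched below, assuming rank 0 distributes each input vector to all processes before the parallel BMU search; this is our own sketch, not the paper's implementation.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Minimal MPI process skeleton: every core runs one process, and rank 0
// distributes the current input vector to all other processes before the
// parallel BMU search. This setup is our own illustration.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int dim = 8707;                 // dimension of the Medlars vectors
    std::vector<double> input(dim, 0.0);

    if (rank == 0) {
        // ... rank 0 would read the next input vector here ...
    }
    // Every process needs the current input vector to search its own part
    // of the GNG output layer for a partial BMU.
    MPI_Bcast(input.data(), dim, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    std::printf("process %d of %d received the input vector\n", rank, size);

    MPI_Finalize();
    return 0;
}
```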

3.1 Distribution of Neurons for Effective Parallelization

In the paper [17] we used Method 1 (the equal distribution of neurons), where new neurons are allocated to the process with the lowest number of neurons (see Fig. 2). The advantage of this distribution is a constant workload of the processes. The disadvantage is increased communication between processes.

Fig. 2. Add the new neuron to the next free \(Process_j\) in a cyclic way - Method 1.

Our goal is to reduce interprocess communication by using the following methods for adding a new neuron:

  1. Method 2

    Neurons are added to the current process until it reaches the predetermined number of neurons (\(\frac{N_{max}}{p}\)). Once a process is filled up, new neurons are added to the next process (see Fig. 3).

  2. Method 3

    The start is similar to Method 1. If \(|V_i|\ge 2\) \(\forall i\le p\), then add \(N_{new}\) to \(Process_i\) where (\(N_{c1}\in V_i\) and \(|V_i|\le \frac{N_{max}}{p}\)), or add \(N_{new}\) to \(Process_k\) where (\(N_{c2}\in V_k\) and \(|V_k|\le \frac{N_{max}}{p}\)), or add \(N_{new}\) to the next free \(Process_j\) (see Fig. 4). A simplified code sketch of all three assignment strategies is given after Fig. 4.

Fig. 3. Add the new neuron to \(Process_x\) in a gradual way - Method 2.

Fig. 4. Add the new neuron to the \(Process_x\) where \(N_{c1}\) or \(N_{c2}\) is located - Method 3.
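The three assignment strategies can be summarized in the following hedged sketch, where neuronsPerProc[i] stands for \(|V_i|\) and quota for \(\frac{N_{max}}{p}\); the helper names are our own, and the start-up phase of Method 3 (each process first receiving at least two neurons, as in Method 1) is omitted for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hedged sketch of the three neuron-to-process assignment strategies.
// neuronsPerProc[i] plays the role of |V_i| and quota the role of N_max / p.

// Method 1: assign the new neuron to the process with the fewest neurons,
// which balances the workload but spreads neighbouring neurons across processes.
int method1(const std::vector<int>& neuronsPerProc) {
    return static_cast<int>(std::min_element(neuronsPerProc.begin(),
                                             neuronsPerProc.end())
                            - neuronsPerProc.begin());
}

// Method 2: fill processes gradually - keep adding neurons to the first
// process that has not yet reached its quota, then move on to the next one.
int method2(const std::vector<int>& neuronsPerProc, int quota) {
    for (std::size_t i = 0; i < neuronsPerProc.size(); ++i)
        if (neuronsPerProc[i] < quota) return static_cast<int>(i);
    return -1;  // every process is full: N_max has been reached
}

// Method 3: prefer the process that already holds N_c1, then the one holding
// N_c2, and only then fall back to the next free process, so that the
// updates of the BMU neighbourhood stay local.
int method3(const std::vector<int>& neuronsPerProc, int quota,
            int procOfC1, int procOfC2) {
    if (neuronsPerProc[procOfC1] < quota) return procOfC1;
    if (neuronsPerProc[procOfC2] < quota) return procOfC2;
    return method2(neuronsPerProc, quota);
}
```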

4 Experiments

4.1 Experimental Datasets and Hardware

One dataset was used in the experiments: the Medlars dataset, which is commonly used in Information Retrieval.

Medlars Dataset. The Medlars dataset consists of 1,033 English abstracts from medical science. A total of 8,567 distinct terms were extracted from the dataset; each term represents a potential dimension of the input vector space, and the term's level of significance (weight) in a particular document gives the value of the corresponding component of the input vector. The resulting input vector space has a dimension of 8,707, and 1,033 input vectors were extracted from the dataset.
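The paper does not state the exact term-weighting scheme; purely as an illustration, a simple tf-idf style construction of such document vectors might look as follows (all names are our own).

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Hedged illustration of building term-weight input vectors from documents
// represented as bags of words. A plain tf-idf weighting is assumed here
// purely for illustration; the paper's actual weighting may differ.
std::vector<std::vector<double>> buildVectors(
    const std::vector<std::map<std::string, int>>& docs,  // term counts per document
    const std::vector<std::string>& vocabulary)            // all distinct terms
{
    const std::size_t nDocs = docs.size();
    const std::size_t dim = vocabulary.size();

    // Document frequency of every term in the vocabulary.
    std::vector<double> df(dim, 0.0);
    for (std::size_t t = 0; t < dim; ++t)
        for (const auto& doc : docs)
            if (doc.count(vocabulary[t])) df[t] += 1.0;

    // One input vector per document: component = tf * idf of the term.
    std::vector<std::vector<double>> vectors(nDocs, std::vector<double>(dim, 0.0));
    for (std::size_t d = 0; d < nDocs; ++d)
        for (std::size_t t = 0; t < dim; ++t) {
            const auto it = docs[d].find(vocabulary[t]);
            if (it != docs[d].end() && df[t] > 0.0)
                vectors[d][t] = it->second
                              * std::log(static_cast<double>(nDocs) / df[t]);
        }
    return vectors;
}
```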

Experimental Hardware. The experiments were performed on a Linux HPC cluster named Anselm with 209 computing nodes, where each node has 16 processor cores and 64 GB of memory. The processors in the nodes are Intel Sandy Bridge E5-2665. The compute network is a fully non-blocking, fat-tree InfiniBand QDR. Detailed information about the hardware can be found on the website of the Anselm HPC cluster.

4.2 The Experiment

The experiment was oriented towards a comparison of the parallel GNG algorithm with its modifications that differ in the assignment of new neurons to processes. The Medlars dataset was used for the experiment. The parallel version of the learning algorithm was run using 2, 8, 16, 24, 32 and 64 MPI processes. The records marked with an asterisk (*) represent the results for a single process, i.e. the original serial learning algorithm with no network communication.

The GNG parameters are the same for all experiments and are as follows: \(\gamma = 200\), \(e_w = 0.05\), \(e_n = 0.006\), \(\alpha = 0.5\), \(\beta = 0.0005\), \(a_{max} = 88\), \(M = 2021\), \(\delta = 1500\). The achieved computing times are presented in Table 2.

Table 2. Computing time with respect to the number of cores, standard GNG algorithm, Medlars dataset
Fig. 5. Graph of computing time with respect to the number of cores, standard GNG algorithm, Medlars dataset

As we can see from Table 2 and Fig. 5, the computing time depends on the number of cores used. With a growing number of processes, the computational efficiency increases and the computation time is substantially reduced (Method 1 corresponds to the approach used in the paper [17]).

5 Conclusion

In this paper a parallel implementation of the GNG neural network algorithm was presented. The achieved speed-up was better than that of our previous approach. However, the effectiveness of a parallel solution depends on the division of the output layer. We introduced three different methods of assigning neurons to processes and achieved better acceleration with the new approaches. These approaches reached the best times, although the additional speed-up is not very significant for the selected dataset. Our methods focus on different ways of assigning a new neuron in the GNG to processes for parallel computation.

In future work we intend to focus on sparse data and to use combinations of neural networks to improve both the results and the acceleration.