1 Introduction

Machine learning is closely related to computational statistics, which also focuses on making predictions by means of computers, and it has strong ties to mathematical optimization, which supplies methods, theory and application domains to the field. A self-organizing map (SOM) is a type of unsupervised neural network that has proven very effective at producing visualizations of high-dimensional data sets. In other words, a SOM is a learning algorithm that supports data exploration in an eclectic range of application fields, because its expressive power is substantially independent of the actual origin of the data to be examined. The value of SOM, which relies on the capability of creating an internal, coherent spatial representation of data coming from differently organized sources, is a precious resource in a phase of computing history characterized by an abundance of multidimensional data from different, coordinated or uncoordinated sources, which led to the rise of Big Data and related applications. In any case, two of the four V's of Big Data (velocity, variety, volume and veracity) push for performance, due to the need to quickly process large amounts of data; and, in general, even when they do not qualify as Big Data, modern applications offer a richness of available data that needs to be harvested quickly in order to produce valuable results [5, 7, 15].

SOMs are a well-established approach to data classification and representation, with many decades of study and application. Here we point out some of the main references, which happen to be spread over the first three decades of their long history, starting in 1981; we rely on the most comprehensive surveys and some interesting additional references, but exclude the earliest works, which are covered in the surveys and remain available, as sources cited therein, to readers who prefer a direct experience of the papers of the first decade. For a comprehensive introduction and a historical perspective on SOM and related applications the reader can refer to [20] and [21], which broadly present SOM and survey some selected application fields. In [23] the authors instead survey a number of effective engineering applications of SOM in the first decades of use, ranging from process and system analysis (e.g. fault identification, visualization of machine states), statistical pattern recognition (e.g. speech recognition, texture identification, computer vision), robotics (e.g. arm control, navigation) and telecommunications (e.g. signal detection, channel equalization, interference cancellation, codification of images) to other fields, including a very large number of useful references. Other examples of SOM-based applications are anomaly detection (e.g. [13]) and face recognition in different conditions ([35], in combination with other techniques). SOMs have also been used in Data Science applications, due to their efficiency in clustering problems, in support of Data Mining applications: interested readers may find an example in [36], while [2] provides a comparison between SOM and other classical analysis techniques of Data Science interest, namely Cluster Analysis and Principal Component Analysis, for the exploration of large data sets. Their classification features have also been applied to document categorization and classification, both on collections (e.g. [11, 24]) and on the web (e.g. [4]), thanks to their capability of semantic classification and, in general, of producing abstractions [20, 21, 36]. It is worth underlining that SOMs may be used in multilevel approaches, composing structures of SOMs with specialized features that collaborate, through an appropriate arrangement in stages, to solve problems with a stepwise logic (e.g. [16, 23, 36]).

From the theoretical point of view, one of the most relevant features of SOMs is their capability of building an inner spatial representation out of external data. The nature and the characteristics of this mapping ability have been extensively studied, due to its importance and to the fact that, since it is generated by an unsupervised approach, the trustworthiness of the emergent representation is a critical factor in applications. For a discussion on topology preservation in the internal representation with respect to the structure of external data we suggest [18], while for a quantitative examination of neighborhood preservation between the two domains we suggest [3], which explores continuity, resolution and coherence of the mapping with the input probability distribution. Finally, [22] deals with the problem of building large SOMs and [17] presents some variants of SOM, with [1] pointing out an extension that allows dynamicity and controlled growth.

Within this quest for performance, in this work we present an implementation of a SOM on CUDA-GPU architectures. GPUs are an important and cost-effective resource that has shown exceptional effectiveness in computing applications, to the point that they emerged as the fundamental stage of what is currently known as GPGPU (General Purpose computing on Graphics Processing Units). The benefits of GPU hardware derive from the efficiency of software libraries that offer access to the computing power of graphics processors, which do not have the architectural constraints of CPUs and are capable of massive parallelism, essentially due to the high number of processors available for computation (Footnote 1). In the following, we assume that the reader is already familiar with the main aspects of GPGPU applications: for a survey, we suggest [27, 38] and [28] as a starting point, while for a quick glance at performance modeling and related issues of GPGPU architectures we suggest [9, 12].

The use of GPUs to support the execution of SOM-based applications is well established in the literature. This is a natural evolution, as SOMs greatly benefit from the massive parallelism offered by GPUs. One of the first proposed solutions is reported in [43], and many interesting alternatives and developments toward distribution exist: for high-dimensional SOM [26, 40, 41], for massively parallel SOM based on cellular approaches [37, 42], or the recent [30]. The number of applications is significant, both exploiting SOM and batch SOM: in [39] the authors deal with text mining in a map-reduce based application; in [31, 34] the authors present applications to computer vision, describing a parallel image processing pipeline; moreover, in [29] a large-data real-time classification application is implemented; finally, in [6, 32] an optical flow estimator is provided.

In this paper, building on the approach proposed in [25], we propose a novel GPU implementation of SOM. Our algorithm and its implementation are characterized by the exploitation of the cuBLAS library (Footnote 2), with its optimized routines for basic linear algebra operations. Moreover, we improve the load balancing both in the input stage and in the computational step with respect to the original parallel version suggested in [25]. This allows us to obtain a fully parallel algorithm in which threads simultaneously work on a single element each, significantly reducing synchronization and waiting times. Compared to the existing parallel methods proposed in the literature, the main advantage of our solution, which will be described in detail in the rest of the paper, is its ability to balance the computing workload.

The paper is organized as follows. Section 2 presents the definition of the SOM algorithm. Section 3 describes the approach of this work. In Sect. 4 we report the experiments performed to show the effectiveness of the approach. Finally, Sect. 5 closes the paper.

2 The Self-Organizing Maps Algorithm

A Self-Organizing Map, also known as SOM, is a kind of unsupervised neural network that produces a representation of the training samples in a low-dimensional space while preserving the topological properties of the samples. This property makes the SOM particularly useful for displaying high-dimensional data.

The first model of SOM was described in [19] and is also known as the Kohonen Map. In this type of neural network the output neurons are organized in low-dimensional grids (2D or 3D). Each input is connected to all output neurons. In other words, the SOM model is a fully connected network where each neuron has a weight vector of the same size as the input vector, and the size of the input vector is generally much larger than the size of the output grid.

The aim of the network is to specialize different parts of the lattice to react to input patterns, reflecting the behaviour of the cerebral cortex in the human brain. A kind of training called competitive learning is used. Each training step is organized as follows:

  • For each input vector the Euclidean distance from all neurons in the map is computed;

  • The most representative neuron of the input vector is the one with minimum distance, and it is called the Best Matching Unit (BMU);

  • A distance based on the BMU position in the map is computed for each neuron, to find its neighbourhood;

  • All neurons in the neighbourhood are updated during the adaptation phase, using a BMU influence function and a learning rate.

The magnitude of each update depends on the distance of the neuron from the BMU.

More precisely, let I(t) be the input vector sent to the network at time t; suitable weights \(W_v\) are defined for each neuron v and updated, in order to determine the most representative neuron, as follows:

$$\begin{aligned} W_v(t + 1) = W_v(t) + \varTheta (v, t)\alpha (t)[I(t) - W_v(t)] \end{aligned}$$
(1)

where \(I(t) - W_v(t)\) is the error term, i.e. the difference between the input and the weight vector at time t,

$$\begin{aligned} \sigma (t)=\sigma _0 \exp \left( -\frac{t}{\frac{t_{end}}{\log {\sigma _0}}}\right) \end{aligned}$$

is the neighbourhood size,

$$\begin{aligned} \alpha (t)=\alpha _0 \exp \left( -\frac{t}{t_{end}}\right) \end{aligned}$$

is the learning rate and \(\varTheta (v, t)\) is the BMU influence, which depends on the distance in the network between the BMU and the neuron v. Formally, it is:

$$\begin{aligned} \varTheta (v, t) = \exp \left( -\frac{dist(BMU, v)^2}{2\sigma (t)^2}\right) \end{aligned}$$
(2)

This process results in a movement of the neurons towards the inputs, so that, at the end of the training process, each neuron is the best representative of some input vectors. The SOM procedure is divided into two main phases: the training phase for network learning, and the classification phase to check whether an input belongs to a certain class (here the only step to perform is the search for the BMU).

It is clear that the computational complexity of the first phase is a critical aspect, because there is a large amount of data to be processed, as shown in Algorithm 1.

Algorithm 1 Sequential SOM training
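
As a reference for the parallel version presented in the next section, a minimal sequential sketch of the training loop of Algorithm 1 can be written as follows; the function and variable names (trainSOM, sigma0, alpha0, and so on) are illustrative and do not come from the original listing.

#include <cfloat>
#include <cmath>

// Sequential SOM training loop: a minimal sketch with illustrative names.
// sigma0 and alpha0 are the initial neighbourhood size and learning rate of Sect. 2.
void trainSOM(float* weights, const float* inputs, int nInputs,
              int mapRows, int mapCols, int weightSize,
              int epochs, float sigma0, float alpha0) {
    int nNeurons = mapRows * mapCols;
    for (int t = 0; t < epochs; ++t) {
        float sigma = sigma0 * expf(-(float)t / (epochs / logf(sigma0)));
        float alpha = alpha0 * expf(-(float)t / epochs);
        for (int s = 0; s < nInputs; ++s) {
            const float* x = &inputs[s * weightSize];
            // Step 1: search of the BMU (minimum Euclidean distance).
            int bmu = 0; float best = FLT_MAX;
            for (int n = 0; n < nNeurons; ++n) {
                float d = 0.0f;
                for (int k = 0; k < weightSize; ++k) {
                    float diff = x[k] - weights[n * weightSize + k];
                    d += diff * diff;
                }
                if (d < best) { best = d; bmu = n; }
            }
            // Steps 2-3: neighbourhood search on map coordinates and adaptation (Eq. 1).
            for (int n = 0; n < nNeurons; ++n) {
                float dr = (float)(n / mapCols - bmu / mapCols);
                float dc = (float)(n % mapCols - bmu % mapCols);
                float theta = expf(-(dr * dr + dc * dc) / (2.0f * sigma * sigma));
                for (int k = 0; k < weightSize; ++k)
                    weights[n * weightSize + k] +=
                        theta * alpha * (x[k] - weights[n * weightSize + k]);
            }
        }
    }
}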

3 Our GPU-Parallel SOM Implementation

Our proposal provides an alternative parallelization of the training phase, which is the bottleneck of the standard SOM. Since we are dealing with an unsupervised learning case, each neuron depends on all the others.

Under these assumptions it is very difficult to execute the neuron training fully in parallel, so our approach focuses on the main points where a speed-up may lead to better performance. We analysed the following three steps of Algorithm 1: (1) search of the BMU; (2) search of the BMU neighbourhood; (3) adaptation (weights' update).

For each of these steps we have implemented a CUDA parallel kernel. The parallel version of Algorithm 1 is therefore listed in Algorithm 2.

Algorithm 2 Parallel (GPU) SOM training
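
A possible host-side structure of Algorithm 2 is sketched below. It assumes that the device buffers (d_inputs, d_weights, d_dist, d_mapDist), the launch configurations and the cuBLAS handle have been prepared as described in the following subsections; the kernel signatures are illustrative, and cublasIsamin is the single-precision counterpart of the cublasIdamin routine discussed in Sect. 3.2.2.

// Host-side training loop corresponding to Algorithm 2 (sketch).
for (int t = 0; t < epochs; ++t) {
    float sigma = sigma0 * expf(-(float)t / (epochs / logf(sigma0)));
    float alpha = alpha0 * expf(-(float)t / epochs);
    for (int s = 0; s < nInputs; ++s) {
        const float* d_input = d_inputs + s * weightSize;

        // (1) search of the BMU: per-neuron distances, then index of the minimum
        findBMU<<<gridBMU, blockBMU, sharedBytes>>>(d_input, d_weights,
                                                    mapRows, mapCols,
                                                    weightSize, d_dist);
        int bmuIdx;
        cublasIsamin(handle, mapRows * mapCols, d_dist, 1, &bmuIdx);
        bmuIdx -= 1;                        // cuBLAS returns a 1-based index

        // (2) search of the BMU neighbourhood on map coordinates
        findBMUDistances<<<gridMap, blockMap>>>(bmuIdx / mapCols, bmuIdx % mapCols,
                                                mapRows, mapCols, d_mapDist);

        // (3) adaptation: weights update according to Eq. 1
        adaptation<<<gridBMU, blockBMU>>>(mapRows, mapCols, weightSize,
                                          d_weights, d_input, d_mapDist,
                                          alpha, sigma);
    }
}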

In the following, for each kernel, we will focus on two fundamental points: how block size and grid size are built, and how the kernel algorithm works. First, however, we provide some information on the environment we use and on the utilities that improve the efficiency of our implementation.

3.1 Structure

The implementation provides a three-dimensional map in which two dimensions are given by the map size, while the third one is given by the weight vector of each point in the map; this map is stored as a one-dimensional array to take advantage of array indexing. All computations are done in the GPU internal memory, to avoid transfer delays between main memory and device memory. We recall that on a GPU device there are two kinds of memory, namely the global memory (shared by all threads executing on the GPU) and the shared memory (shared by all threads in a block), which allows threads to work in a smaller but faster memory. In order to benefit from these kinds of memory, all computations are designed to reduce the number of operations in global memory and to increase those that may be executed in shared memory. Finally, the neurons' weight vectors are initialized with uniformly distributed random values.

A key feature of our algorithm is the use of the cuBLAS library. The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Units (GPUs). Starting with CUDA 6.0, the cuBLAS library exposes two sets of API: the regular cuBLAS API, simply called cuBLAS API, and the CUBLASXT API. To use the cuBLAS API, the application must allocate the required matrices and vectors in the GPU memory space, fill them with data, call the sequence of desired cuBLAS functions, and then copy the results from the GPU memory space back to the host. The cuBLAS API also provides helper functions for writing data to and retrieving data from the GPU. To use the CUBLASXT API, the application keeps the data on the host and the library takes care of dispatching the operations to one or multiple GPUs present in the system, depending on the user request.
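
As an illustration of the regular cuBLAS API workflow just described, a minimal sketch (with error checking omitted, and h_dist assumed to be a host array of nNeurons distances) could be:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Minimal cuBLAS API workflow: allocate device memory, move data,
// call a routine, read the result back (sketch, error checking omitted).
int bmuIndexOnDevice(const float* h_dist, int nNeurons) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    float* d_dist;
    cudaMalloc((void**)&d_dist, nNeurons * sizeof(float));
    cublasSetVector(nNeurons, sizeof(float), h_dist, 1, d_dist, 1);  // host -> device

    int minIdx = 0;                                // 1-based index of the minimum element
    cublasIsamin(handle, nNeurons, d_dist, 1, &minIdx);

    cudaFree(d_dist);
    cublasDestroy(handle);
    return minIdx - 1;                             // convert to 0-based indexing
}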

3.2 Search of BMU

The first step of training consists in searching for the BMU, corresponding to step 1 of Algorithm 1. This is accomplished by computing the Euclidean distance between the input vector and all the neurons' weight vectors. Its complexity depends on the map size and the weight vector size, so parallelizing this step is essential to obtain good performance for the whole network. In the following we describe the kernel findBMU and its configuration for this first step.

3.2.1 Block and Grid Size

Our implementation uses a three-dimensional array, so an accurate thread organization analysis is needed. The kind of network that we are studying easily exceeds the limit on the third dimension of the block size, so in this case it is not possible to take direct advantage of the GPU hardware design. To overcome this problem, a two-dimensional block structure is used, where the number of columns (second dimension) is equal to the weight size, while the number of neurons processed at the same time becomes the first dimension. In this way it is possible to build a block size that depends entirely on the weight size, which is the element on which most computations are executed.

Another relevant efficiency factor of our design is that computations always happen on a number of blocks that is a power of 2, and that an entire warp (corresponding to 32 simultaneous threads) is always used. In order to achieve this, a virtual padding is applied, when needed, to the weight size, to align it to the next power of 2. If needed, the same adjustment is done on the first dimension. These settings can be obtained by following the steps below (a code sketch of the resulting configuration is given after the grid-size description):

  1. Retrieve the maximum number of simultaneous threads;

  2. Compute the next power of 2 of the weight size;

  3. Divide the number of threads by the (padded) weight size, to obtain the maximum number of rows;

  4. Compute the minimum between this maximum number of rows and the number of rows in the matrix.

At this point it is also possible to compute the grid size, i.e. the total number of blocks: its second dimension is equal to the number of map columns, while its first dimension is the next power of 2 of the number of matrix rows divided by the block row size.
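
A minimal sketch of this configuration procedure, assuming the maximum number of threads is queried from the device properties and using an illustrative nextPow2 helper, could be:

#include <cuda_runtime.h>

// Smallest power of 2 greater than or equal to x (illustrative helper).
static unsigned int nextPow2(unsigned int x) {
    unsigned int p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Block/grid configuration for findBMU, following steps (1)-(4) above (sketch).
void configureFindBMU(int mapRows, int mapCols, int weightSize,
                      dim3& block, dim3& grid) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int maxThreads   = prop.maxThreadsPerBlock;                 // (1)
    int paddedWeight = (int)nextPow2((unsigned)weightSize);     // (2) virtual padding
    int maxRows      = maxThreads / paddedWeight;               // (3)
    int blockRows    = maxRows < mapRows ? maxRows : mapRows;   // (4)

    // First dimension: neurons processed together; second: weight components.
    block = dim3(blockRows, paddedWeight);
    // Grid: rows padded to the next power of 2, columns equal to the map columns.
    grid  = dim3(nextPow2((unsigned)mapRows) / blockRows, mapCols);
}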

3.2.2 Kernel

The kernel findBMU is the implementation of a task executed on the GPU. The data needed in device memory to run the kernel include: (1) the input vector; (2) the neurons' weights; (3) the map size; (4) the output array for the distances. As already mentioned, the aim is to compute the Euclidean distances in a parallel fashion. Therefore, we start by computing the squares of the differences between the elements: all differences between the elements of the input vector and of the weight vectors are put in shared memory and, to avoid further global memory accesses, each value in shared memory is multiplied by itself; all threads then need to wait for the end of this first task. Considering the code fragment reported in Algorithm 3, it is possible to observe that it requires one access to global memory for mtx and one for vec, instead of two accesses for each.

Algorithm 3 findBMU: squared differences in shared memory
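
A device-side fragment illustrating this idea, with names taken from the description above (mtx for the weight matrix, vec for the input vector) and an illustrative shared buffer sdata, might look as follows; it is a sketch, not the exact listing of Algorithm 3.

// findBMU fragment: squared differences computed in shared memory (sketch).
extern __shared__ float sdata[];                 // blockDim.x * blockDim.y floats

int row = threadIdx.x;                           // neuron handled by this block row
int col = threadIdx.y;                           // weight component
int neuron = blockIdx.x * blockDim.x + row;      // global neuron index
int sIdx = row * blockDim.y + col;               // position in shared memory

if (neuron < nNeurons && col < weightSize) {
    // one global read for mtx, one for vec
    float diff = mtx[neuron * weightSize + col] - vec[col];
    sdata[sIdx] = diff * diff;                   // square computed on the shared value
} else {
    sdata[sIdx] = 0.0f;                          // padding threads contribute zero
}
__syncthreads();                                 // wait before the row-wise reduction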

Next, all elements on each row are summed. For this sum, as shown in Algorithm 4, a parallel reduction is applied in order to maximize the number of parallel working threads, reducing memory accesses and sum operations (as suggested in [14]).

Algorithm 4 findBMU: parallel reduction of the row sums
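
The row-wise parallel reduction of Algorithm 4 could then be sketched as follows, continuing the fragment above and assuming that blockDim.y is a power of 2 thanks to the virtual padding.

// findBMU fragment: tree-style reduction over each row of shared memory (sketch).
for (int stride = blockDim.y / 2; stride > 0; stride >>= 1) {
    if (col < stride)
        sdata[sIdx] += sdata[sIdx + stride];
    __syncthreads();
}

// The first thread of each row writes the Euclidean distance to global memory.
if (col == 0 && neuron < nNeurons)
    distances[neuron] = sqrtf(sdata[row * blockDim.y]);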

Once this sum has been performed, the first position of each row contains the sum computed on that row. Subsequently, the square root is computed and assigned to the output array. Finally, a search for the minimum of these distances is carried out, to find the position of the BMU in the map. This is done using the cublasIdamin function of the cuBLAS library: through this function, which finds the smallest index of the minimum-magnitude element of a vector, we achieve a significant performance increase.

3.3 Search of BMU Neighborhood

The second step consists in computing the distance between the neurons and the BMU using matrix coordinates instead of weight vectors, corresponding to step 2 of Algorithm 1. This is fundamental, because the BMU needs to influence the other neurons on the basis of their position, so that their weight vectors get closer to the input vector. Although this operation is reasonably cheap even on big maps, as it only deals with two dimensions, a further performance gain is possible because all data are already in device memory. In the following we describe the kernel findBMUDistances and its configuration for this second step.

3.3.1 Block and Grid Size

The main issue of this phase is that a remapping is needed between the three-dimensional indexing and a two-dimensional one: computation happens on a two-dimensional structure and the weight size is not needed. The general algorithm is the same as in Sect. 3.2.1, but a fictitious weight size equal to an arbitrary power of 2 is used; in this case the number of columns of the grid is equal to the number of columns of the matrix, plus one, divided by this fictitious weight size.

3.3.2 Kernel

In the kernel findBMUDistances the computation is based on a squared Euclidean distance in two dimensions. The input data are therefore the position of the BMU on the map, the map size and the output vector. Computations are performed directly on indices, as shown in Algorithm 5.

Algorithm 5 findBMUDistances kernel
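
A minimal sketch of such a kernel, with illustrative names and a simple one-dimensional indexing, could be:

// findBMUDistances: squared Euclidean distance on map coordinates (sketch).
__global__ void findBMUDistances(int bmuRow, int bmuCol,
                                 int mapRows, int mapCols,
                                 float* mapDist) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // linear neuron index
    if (idx < mapRows * mapCols) {
        int row = idx / mapCols;
        int col = idx % mapCols;
        float dr = (float)(row - bmuRow);
        float dc = (float)(col - bmuCol);
        mapDist[idx] = dr * dr + dc * dc;              // squared distance to the BMU
    }
}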

3.4 Adaptation (Weights Update)

In the adaptation step, corresponding to step 3 of Algorithm 1, the weights of the neighbouring neurons are updated, moving them towards the input vector with an influence degree determined by the BMU, using a Gaussian function and a learning rate that decay at each iteration. We observe that this step cannot be considered an RBF interpolation, because no output layer derived from an RBF is used: the Gaussian function is used only to compute the influence on the neighbours. One idea would be to combine the two strategies to obtain better results, see [8, 10, 33]. As before, in the following we describe the kernel adaptation and its configuration for this last step.

3.4.1 Kernel, Block and Grid Size

The configuration of block and grid is similar to the previous sections, while the kernel adaptation requires the following parameters:

  • the BMU position;

  • the map size;

  • the neurons map;

  • the input vector;

  • distances from BMU;

  • the current epoch;

  • the total number of epochs;

  • the learning rate;

  • the current neighborhood size.

All neurons are updated according to their distance from the BMU. Each thread checks whether the distance of the current neuron is less than twice the current neighbourhood size; if so, the influence is computed and each element of the weight vector is updated by the threads according to Eq. 1, as shown in Algorithm 6.

Algorithm 6 adaptation kernel
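
A minimal sketch of such an adaptation kernel, following Eqs. 1 and 2, is reported below; names and indexing are illustrative, and mapDist is assumed to hold the squared map distances produced by findBMUDistances.

// adaptation kernel: weights update of the neurons near the BMU (sketch).
__global__ void adaptation(int mapRows, int mapCols, int weightSize,
                           float* weights, const float* input,
                           const float* mapDist,   // squared map distances to the BMU
                           float alpha, float sigma) {
    int neuron = blockIdx.x * blockDim.x + threadIdx.x;   // neuron handled by this thread row
    int k = threadIdx.y;                                  // weight component
    if (neuron < mapRows * mapCols && k < weightSize) {
        // update only neurons within twice the current neighbourhood size
        if (sqrtf(mapDist[neuron]) < 2.0f * sigma) {
            // Eq. 2: Gaussian influence of the BMU on neuron v
            float theta = expf(-mapDist[neuron] / (2.0f * sigma * sigma));
            int w = neuron * weightSize + k;
            // Eq. 1: move the weight component towards the input
            weights[w] += theta * alpha * (input[k] - weights[w]);
        }
    }
}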

In this case our solution does not use shared memory, because the overall time needed to copy the values into shared memory and to synchronize the threads would exceed the time required to execute the operation directly in global memory.

4 Experiments

In this section a comparison between the CPU and GPU implementations is proposed, to estimate the actual performance improvement. The GPU hardware is an nVidia Quadro K4200 with compute capability 3.0. The load factor varies with the map size and the weight vector size, with a fixed input size of 100 samples and a total number of epochs equal to 1000 (Figs. 1, 2).

The first part of the evaluation was to determine whether the maps produced by our GPU-parallel SOM are reliable and correct. To test how the algorithm carries out the training phase and how the map changes at each step, we chose a classic example, which shows both how learning proceeds and the topological properties of the neural network in question. This test is a color-learning test: we start from a 3D set of random inputs, where each weight represents the 32-bit value of the RGB channels of an image. Each pixel in the image represents a neuron, so the size of the SOM reflects the image size. Initially, the values are chosen randomly, as shown in Fig. 3.

Fig. 1 Random initialization of the neuron weights

Fig. 2 Ordered distribution of colors (Color figure online)

As learning proceeds, the colors begin to be distributed in accordance with the natural color system (see Fig. 4), and at each step we verify that the neighbourhood range of the neurons keeps decreasing, covering a smaller and smaller area, until updates act only locally.

Fig. 3 Random weights' initialization

Fig. 4 Ordered distribution of colors (Color figure online)

Fig. 5 Training in progress

In this case, learning continues until the end of the iterations, but an alternative criterion may be to check the quantization error between neurons and inputs and stop the process when it is small enough. In Fig. 5 we show some learning steps.
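
As an illustration of this alternative stopping criterion, the average quantization error could be computed as in the following host-side sketch (names illustrative); training would stop when the returned value falls below a chosen threshold.

#include <cfloat>
#include <cmath>

// Average quantization error: mean distance between each input and its BMU (sketch).
float quantizationError(const float* weights, const float* inputs,
                        int nInputs, int nNeurons, int weightSize) {
    float total = 0.0f;
    for (int s = 0; s < nInputs; ++s) {
        float best = FLT_MAX;
        for (int n = 0; n < nNeurons; ++n) {
            float d = 0.0f;
            for (int k = 0; k < weightSize; ++k) {
                float diff = inputs[s * weightSize + k] - weights[n * weightSize + k];
                d += diff * diff;
            }
            if (d < best) best = d;
        }
        total += sqrtf(best);           // distance to the BMU for this input
    }
    return total / nInputs;             // compare against a stopping threshold
}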

The first comparison is based on the total time needed to train the networks. In Figs. 6 and 7 the weight vector size is set to 16 and 64, respectively, and it is possible to observe a significant speed-up of the GPU over the CPU. The same results are reported in Tables 1 and 2, in order to show the actual times obtained from the tests.

Fig. 6 Comparison between GPU and CPU execution times in ms (weight \(=\) 16)

Fig. 7 Comparison between GPU and CPU execution times in ms (weight \(=\) 64)

Moreover, results from the sequential SOM (on CPU) and our parallel SOM algorithm (on GPU) were compared to results from SOM_PAK (Footnote 3) (on a single processor) in Table 3. The SOM_PAK program package contains all the programs necessary for the correct application of the Self-Organizing Map algorithm in the visualization of complex experimental data. The first version 1.0 of the package was published in 1992 and since then it has been updated regularly to include the latest improvements in SOM implementations.

For the second test, Figs. 8, 9, 10, 11, 12 and 13 show the average times for each operation described in Sect. 3. It is clear, and in line with our expectations, that the GPU outperforms the CPU.

Further experiments were carried out by means of the CUDA Visual Profiler tool (Footnote 4), which allows us to collect and view profiling data while the software runs. By using it, we observed the performance of our three kernels. In the following we report the results obtained.

Table 1 Training time (in ms) when the weight size is equal to 16
Table 2 Training time (in ms) when the weight size is equal to 64
Table 3 Training time (in ms) for the SOM algorithms on CPU, on CPU with SOM_PAK and on GPU, with weight size equal to 64
Fig. 8 findBMU kernel: average times GPU and CPU, per single operation, weight \(=\) 16

More precisely, the first step in analysing an individual kernel is to determine if the performance of the kernel is bounded by computation, memory bandwidth, or instruction/memory latency. Instruction and memory latency limit the performance of a kernel when the GPU does not have enough work to keep busy. The performance of latency-limited kernels can often be improved by increasing occupancy. Occupancy is a measure of how many warps the kernel has active on the GPU, relative to the maximum number of warps supported by the GPU. Theoretical occupancy provides an upper bound while achieved occupancy indicates the kernel’s actual occupancy.

Fig. 9 findBMUDistances kernel: average times GPU and CPU, per single operation, weight \(=\) 16

Fig. 10 adaptation kernel: average times GPU and CPU, per single operation, weight \(=\) 16

Fig. 11 findBMU kernel: average times GPU and CPU, per single operation, weight \(=\) 64

For the findBMU kernel, results are shown in Table 4. They indicate that the performance of the findBMU kernel is most likely limited by instruction and memory latency; the profiler suggests examining the information in the “Instruction And Memory Latency” section first, to determine how it is limiting performance. The findBMU kernel report analysis exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of the “Quadro K4200”. These utilization levels indicate that the performance of the kernel is most likely limited by the latency of arithmetic or memory operations: achieved compute throughput and/or memory bandwidth below \(60\%\) of peak typically indicates latency issues (see Fig. 14).

Fig. 12 findBMUDistances kernel: average times GPU and CPU, per single operation, weight \(=\) 64

Fig. 13 adaptation kernel: average times GPU and CPU, per single operation, weight \(=\) 64

Table 4 Analysis report for the findBMU kernel
Fig. 14 findBMU kernel report analysis

Similar results are obtained for kernels findBMUDistances and adaptation. The analysis report for these kernels is shown in Tables 5 and 6 and Figs. 15 and 16.

Table 5 Analysis report for the findBMUDistances kernel
Table 6 Analysis report for the adaptation kernel
Fig. 15 findBMUDistances kernel report analysis

Fig. 16 adaptation kernel report analysis

The last test shows the effect of using the GPU when the network size is not a power of 2. The motivation for this test is simple: the hardware architecture performs best on problems whose size is a power of 2, but this condition rarely happens in real-life applications, as it is very unlikely that the input dataset produces a problem with an optimal size. The experiment is conducted on a map of 72 rows and 82 columns with a weight vector size of 58 elements, using a random number generator (in order to test the software on a sequence as generic as possible). Figure 17 shows that the performance enhancement is almost unchanged. These results demonstrate that using the GPU to solve such problems, in this case the training of a not fully parallelizable neural network, can produce large improvements.

Fig. 17 Performance in case the map size is not a power of 2, for GPU (in blue) and CPU (in red) (Color figure online)

5 Conclusions

In this work we proposed a parallel implementation of a machine learning algorithm based on SOM. Our software exploits the computational power of GPU-CUDA and uses the optimized cuBLAS library provided by nVIDIA for linear algebra operations. The parallel strategy implemented provides an alternative parallelization of the training phase, which is the bottleneck of the standard SOM, based on threads working simultaneously, significantly reducing synchronization and waiting times. The results demonstrate very interesting improvements and a significant speed-up of the GPU version over the CPU version.