
1 Introduction

The Self-Organizing Map (SOM) is an unsupervised neural network that has been used as a data analysis method. It is widely applied to clustering and data exploration problems in various domains [1], and many studies in the literature have used SOM to solve clustering problems [2, 3]. Despite its excellent performance, SOM suffers from slow processing when large map sizes are visualized [4]. This imposes a heavy workload on the processor, especially during the winner search and the updating of the neuron weights on the map [1]. In addition, the dimensionality of the dataset also strongly influences SOM processing [5].

This situation has attracted much interest among researchers in improving SOM processing by parallelizing the algorithm. Common ways to parallelize SOM are network or map partitioning [6, 7] and data or example partitioning [8, 9]; there have also been efforts to parallelize the SOM algorithm by combining network and data partitioning [10, 11] to gain the advantages of both forms of parallelism. Most research works on improving SOM aim for efficiency and scalability: an efficient parallel SOM should process faster than previous versions [12, 13], while other works attempt to increase the utilization of the processing elements executing the SOM algorithm [7, 14]. Furthermore, several research works pursue lower power consumption [15] or address computer cluster problems [16, 17].

The two computer architectures mostly used by researchers to improve the SOM algorithm are Single Instruction stream, Multiple Data streams (SIMD) and Multiple Instruction streams, Multiple Data streams (MIMD). SIMD implementations have used Graphics Processing Unit (GPU) computing [18, 19], Field Programmable Gate Arrays (FPGA) [6, 20], and specialized hardware architectures for Artificial Neural Networks (ANN) [21, 22]. The MIMD architectures employed to parallelize SOM consist of different types of computer clusters. Among these architectures, GPU computing offers an efficient solution at lower cost than the others.

The GPU, in this role widely known as a General-Purpose Graphics Processing Unit (GPGPU), is a many-core processor consisting of hundreds or even thousands of compute cores [23]. It has been used to speed up scientific computing and simulation applications, and has proven to deliver high throughput on large floating-point workloads in graphics applications [24]. Since the introduction of GPU programming frameworks such as the Compute Unified Device Architecture (CUDA) in 2007 and the Open Computing Language (OpenCL) in 2009 [24], GPUs have become popular for designing parallel algorithms in the quest for higher speed. Meanwhile, the evolution of hardware technology has made it possible to design high-performance scientific computing software [25]. Essentially, the GPU is an accelerator to the Central Processing Unit (CPU) that was mainly used for graphics before being applied to scientific data processing. The combination of a CPU and a GPU working closely together creates a paradigm known as Heterogeneous Computing (HC) [26].

Many researchers have attempted to capitalize on HC to execute the SOM algorithm in parallel. However, Hasan et al. [5] found that parallelizing SOM with a larger map size and a high attribute dimension significantly slows down processing even when both the CPU and the GPU are used. Many researchers agree that executing SOM on a GPU increases processing speed significantly for large data compared to executing on a CPU only [11, 20, 27]. Moreover, due to the restrictions imposed by past GPU architectures, most frameworks treated the GPU as an accelerator that could only work under close control of the CPU. Consequently, the communication protocol between CPU and GPU is a source of high latency and causes a bottleneck. This drawback is aggravated when using the distributed memory of HC, where memory management has to be configured manually by the programmer to control data movement between CPU and GPU, including data transfers between host code and device code [28].

A more recent hardware design for heterogeneous systems integrates CPUs and GPUs on a single circuit chip, known as Heterogeneous System Architecture (HSA) [28]. This technology provides a unified programming platform that removes the programmability barrier and reduces CPU-GPU communication latency. OpenCL 2.0, which has supported the HSA specification since July 2013, improves communication by allowing the GPU to manage its own resources as well as access some of the CPU's resources. It also introduces Shared Virtual Memory (SVM), which allows the host and the device to share a common virtual address range [29]. This reduces overhead by eliminating deep copies, the complete duplication of objects in memory, during host-to-device and device-to-host data transfers [28].
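To make the SVM feature concrete, the following is a minimal sketch, in C against the standard OpenCL 2.0 API, of allocating a fine-grained SVM buffer that both host and device can access without explicit transfers. The function and parameter names are illustrative and not taken from this paper's implementation.

```c
/* Minimal sketch: fine-grained SVM allocation (OpenCL 2.0).
   Context creation and error handling are omitted. */
#include <CL/cl.h>

float *create_shared_map(cl_context ctx, size_t num_neurons, size_t dim)
{
    /* CL_MEM_SVM_FINE_GRAIN_BUFFER lets host and device read and write
       the same allocation without explicit map/unmap or deep copies. */
    return (float *)clSVMAlloc(ctx,
                               CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                               num_neurons * dim * sizeof(float),
                               0 /* default alignment */);
}
```

A pointer obtained this way is handed to a kernel with clSetKernelArgSVMPointer rather than clSetKernelArg, so the same addresses are valid on both sides.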

In view of the above discussion, this study proposes an enhanced parallel SOM that executes on an HSA-compliant processor to solve clustering problems. The data used are benchmark datasets acquired from the UCI Machine Learning Repository [30]. The performance of the algorithm is evaluated in terms of efficiency, scalability, and accuracy: efficiency and scalability are based on processing time, while accuracy is based on the number of classes generated.

This paper is organized as follows. Section 2 explains the proposed parallel SOM using HSA. Section 3 describes the experimental setup, Sect. 4 presents the experimental results and discussion, and Sect. 5 provides conclusions and future research directions.

2 Proposed Work

The proposed work consists of an enhanced parallel SOM architecture and an enhanced parallel SOM algorithm, explained in Sects. 2.1 and 2.2 respectively.

2.1 Enhanced Parallel SOM Architecture

Previous studies have shown that parallel SOM can be executed successfully on GPUs. Almost all researchers in the literature apply parallelism to the calculate-distance and find-Best-Matching-Unit (BMU) steps, and many also apply it to the update-weight step. Accordingly, this study parallelizes these three steps in the new enhanced parallel SOM architecture. Meanwhile, heterogeneous systems promise a solution for reducing the latency of communication between CPU and GPU. To gain these advantages, the proposed architecture utilizes the OpenCL 2.0 platform, specifically its SVM feature. The implementation is based on fine-grained SVM buffers, which are synchronized while the SVM buffer is in use and thus reduce the communication latency between CPU and GPU. The design of the proposed architecture extends our previous work [31] by introducing two parallel kernels each for distance calculation and finding the BMU, as depicted in Fig. 1. The idea of parallel kernels is based on the analogy of biological neural networks, in which neurons respond to stimuli in parallel [32]; however, this work differs from [32] in terms of SOM training type and hardware architecture. The main reason for duplicating the kernels is to increase the utilization of work units in the GPU.

Fig. 1. Enhanced parallel SOM architecture

The design of the parallel kernels is realized by feeding multiple stimuli into the architecture. OpenCL supports this by allowing a programmer to create more than one queue for execution, with commands processed in out-of-order execution mode [29]; in this mode there is no guarantee that enqueued commands finish in the order they were queued. The first parallel kernel, Parallel Kernel 1, consists of Calc_Dist_Kernel_1 and Calc_Dist_Kernel_2, while Parallel Kernel 2 consists of Find_BMU_Kernel_1 and Find_BMU_Kernel_2. The executions of the kernels within each parallel kernel may overlap, creating the opportunity to increase the utilization of work units on the GPU side.
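As an illustration of this design, the following C sketch creates such a queue with out-of-order execution enabled; ctx and dev are assumed to be an existing OpenCL context and device, and all names are illustrative rather than the paper's own.

```c
/* Minimal sketch: a command queue with out-of-order execution enabled
   (OpenCL 2.0), so that paired kernels such as Calc_Dist_Kernel_1/2
   enqueued on it, or on two such queues, may overlap on the GPU. */
#include <CL/cl.h>

cl_command_queue make_queue(cl_context ctx, cl_device_id dev)
{
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_int err;
    return clCreateCommandQueueWithProperties(ctx, dev, props, &err);
}

/* Usage: one queue per kernel of a parallel-kernel pair, e.g.
   cl_command_queue q1 = make_queue(ctx, dev);
   cl_command_queue q2 = make_queue(ctx, dev); */
```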

This study implements a batch-style learning process through duo stimuli, which reduces the training cycle by half. The solution combines online training and batch training to gain the benefits of both. The enhanced parallel SOM architecture is expected to reduce computation time, owing to the halved training cycle, while maintaining final results equivalent to online training. Essentially, the architecture applies batch processing to the execution of the two calculate-distance kernels and the two find-BMU kernels; for instance, the execution of Calc_Dist_Kernel_1 is considered one task and the execution of Calc_Dist_Kernel_2 another. The rest of the proposed algorithm is similar to the online-training SOM algorithm.

2.2 Enhanced Parallel SOM Algorithm

This section describes the parallel algorithm that applies the enhanced parallel SOM architecture. The enhanced parallel SOM algorithm includes three additional steps compared to the original SOM algorithm. For the reader's convenience, the enhanced parallel SOM algorithm in this study is labeled e-FGPSOM. Figure 2 illustrates the pseudocode of e-FGPSOM.

Fig. 2. Enhanced parallel SOM algorithm

Going deeper into the proposed work, the algorithm begins by initializing SOM parameters, such as the learning factor and the weights, at the host side. The input data is retrieved and stored in an array. The training process begins by selecting duo stimuli randomly; each input is assigned to a command queue. To realize the duo-stimuli method, e-FGPSOM requires two parameter sets for the two calculate-distance kernels and the two find-BMU kernels, because the algorithm employs a separate command queue for each kernel execution. Each kernel requires a map array, and both map arrays must contain identical neuron weights before kernel processing starts: right after the map is initialized in SOM_map_array_1, its values are copied into SOM_map_array_2. The SOM map array is central because it is used in all kernel processing. In step 2, the algorithm obtains duo stimuli from the dataset randomly, and in step 3 the host broadcasts the two inputs to the device side; specifically, the host assigns them to the two calculate-distance kernels, Calc_Dist_Kernel_1 and Calc_Dist_Kernel_2, and the two find-BMU kernels, Find_BMU_Kernel_1 and Find_BMU_Kernel_2. Once all information has been broadcast, the kernels are ready for execution, beginning with the calculate-distance kernels in step 4. Each kernel on the GPU side is invoked by its respective host function, which sets and initializes parameters and calls the kernel; for example, the calculate-distance function is used to call the calculate-distance kernel, and the same is done for the other two kernels.
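A rough host-side sketch of steps 2-4 is given below; the helper select_random_input, the argument indices, and the variable names are hypothetical stand-ins for the paper's actual host functions.

```c
/* Hedged sketch of one e-FGPSOM training iteration (steps 2-4).
   select_random_input is an assumed helper returning one SVM-resident
   input vector chosen at random from the dataset. */
#include <CL/cl.h>

extern const float *select_random_input(void);  /* assumed helper */

void train_step(cl_command_queue q1, cl_command_queue q2,
                cl_kernel calc_dist_1, cl_kernel calc_dist_2,
                size_t num_neurons)
{
    /* Step 2: obtain duo stimuli (two randomly selected inputs). */
    const float *x1 = select_random_input();
    const float *x2 = select_random_input();

    /* Step 3: broadcast one input to each kernel via its SVM pointer
       (argument index 1 is illustrative). */
    clSetKernelArgSVMPointer(calc_dist_1, 1, x1);
    clSetKernelArgSVMPointer(calc_dist_2, 1, x2);

    /* Step 4: launch both calculate-distance kernels, one per queue;
       one work-item per neuron. */
    clEnqueueNDRangeKernel(q1, calc_dist_1, 1, NULL, &num_neurons, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, calc_dist_2, 1, NULL, &num_neurons, NULL, 0, NULL, NULL);
}
```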

Essentially, the original flow of the SOM algorithm is maintained, except that two parallel stimuli are applied to the execution of the two calculate-distance kernels and the two find-BMU kernels in steps 4 and 5 respectively. These executions follow the out-of-order execution mode: there is no guarantee that the enqueued commands finish in order, because kernel execution depends on when the clEnqueueNDRangeKernel calls are made within a command queue.

Calculate Distance Kernel.

The calculate-distance kernel computes the distance between the neurons and the current input vector. The number of work-items employed to parallelize the distance-calculation step equals the number of neurons in the SOM map; specifically, each work-item is responsible for finding the distance between a single neuron and the current input vector. This study applies the Manhattan distance.
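A minimal OpenCL C sketch of such a kernel follows; the argument names and layout (a row-major map of num_neurons × dim weights) are illustrative.

```c
/* One work-item per neuron: compute the Manhattan (L1) distance
   between that neuron's weight vector and the current input. */
__kernel void calc_dist(__global const float *map,   /* num_neurons x dim */
                        __global const float *input, /* dim */
                        __global float *dist,        /* num_neurons */
                        const int dim)
{
    int n = get_global_id(0);            /* neuron index */
    float d = 0.0f;
    for (int j = 0; j < dim; j++)
        d += fabs(map[n * dim + j] - input[j]);
    dist[n] = d;
}
```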

Find BMU Kernel.

The find-BMU kernel applies a two-stage reduction method and utilizes as many work-items as there are neurons on the SOM map. The first stage finds the minimum distance within each local work group; these per-group minima are stored in a local array. The second stage finds the minimum distance for each Compute Unit (CU); the minimum values of each CU are then stored in a global array, and the host determines the winning neurons. After the execution of both Find_BMU_Kernel_1 and Find_BMU_Kernel_2, in step 6 the final BMU, the winner among the two kernels' winners, is selected at the host side as the one with the minimum value. The input vector that produced the final BMU is used to update the map, while the losing input is discarded. The final BMU is broadcast to the device side for the execution of the third kernel, the update-weight kernel.
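The first reduction stage might look like the following OpenCL C sketch, assuming a power-of-two work-group size; the second stage and the final host-side comparison of the surviving candidates follow the same pattern. All names are illustrative.

```c
/* Stage 1: each work-group reduces its slice of the distance array to
   one (minimum, index) candidate, written out per group. */
__kernel void find_bmu_stage1(__global const float *dist,
                              __global float *group_min,  /* one per group */
                              __global int   *group_idx,
                              __local  float *lmin,
                              __local  int   *lidx)
{
    int gid = get_global_id(0), lid = get_local_id(0);
    lmin[lid] = dist[gid];
    lidx[lid] = gid;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction inside the work-group. */
    for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s && lmin[lid + s] < lmin[lid]) {
            lmin[lid] = lmin[lid + s];
            lidx[lid] = lidx[lid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) {                      /* one candidate per group */
        group_min[get_group_id(0)] = lmin[0];
        group_idx[get_group_id(0)] = lidx[0];
    }
}
```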

Update Weight Kernel.

This third kernel of the proposed architecture updates the weights of the neurons based on the learning rate and the neighborhood function. The learning rate defines how much a neuron's vector is altered by an update, with reference to how far the neuron is from the BMU on the map: the BMU and its close neighbors are altered the most, while neurons on the outer edges of the neighborhood change the least. The enhanced version includes an array-copying process at the host side right after the update-weight kernel completes: the updated map array is copied into the other map array, the one whose input was not selected as the final BMU. The copying code simply assigns one array to the other, element by element, through a loop. Immediately after the three kernels have executed, the learning factor and the neighborhood radius are updated with new values. All the steps in the loop block repeat for a given number of iterations (epochs) before the SOM map is generated.
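A sketch of such an update kernel is shown below. The paper does not state its exact neighborhood function, so a Gaussian kernel is assumed here, and all names are illustrative.

```c
/* One work-item per neuron: pull the neuron's weights toward the input,
   scaled by the learning rate and a Gaussian neighborhood around the BMU
   (the Gaussian is an assumption; the paper does not specify it). */
__kernel void update_weights(__global float *map,
                             __global const float *input,
                             const int dim,
                             const int map_w,    /* map width, e.g. 40 */
                             const int bmu,      /* winning neuron index */
                             const float lr,     /* learning factor */
                             const float radius) /* neighborhood radius */
{
    int n = get_global_id(0);
    float dx = (float)(n % map_w - bmu % map_w);
    float dy = (float)(n / map_w - bmu / map_w);
    float d2 = dx * dx + dy * dy;        /* squared grid distance to BMU */
    float h  = exp(-d2 / (2.0f * radius * radius));
    for (int j = 0; j < dim; j++)
        map[n * dim + j] += lr * h * (input[j] - map[n * dim + j]);
}
```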

3 Experimental Setup

Two experiments have been conducted in this study: result validation and result evaluation. These experiments are performed to evaluate the proposed work. For result validation, this study employs four benchmark datasets from the UCI Machine Learning Repository [30]; their details are shown in Table 1. These datasets were selected because the number of classes is known in advance, which eases the validation process; the class information is as provided by Fränti [33]. The datasets are processed by e-FGPSOM, with one experiment performed to validate each dataset.

Table 1. Benchmark datasets for result validation

On the other hand, for evaluation purposes, this study employs the Bank Marketing benchmark dataset from the UCI Machine Learning Repository [34]. Data pre-processing takes place before the experiment is conducted; it is one of the important steps in data mining, producing final datasets that are correct and useful for the subsequent data mining algorithm [35]. There are four methods of data pre-processing: data cleaning, data integration, data transformation, and data reduction [36]. This study uses data reduction, applying a discretization technique to convert the values of selected attributes [37]; all categorical values are converted into numeric values for easy processing in SOM.
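As a hedged illustration of this conversion, the C sketch below assigns integer codes to one categorical attribute of the Bank Marketing dataset; the attribute and the specific codes are illustrative, since the paper does not list its mappings.

```c
/* Illustrative discretization: map the categorical 'marital' attribute
   of the Bank Marketing dataset to numeric codes for SOM processing. */
#include <string.h>

static const char *MARITAL[] = { "single", "married", "divorced" };

int encode_marital(const char *value)
{
    for (int code = 0; code < 3; code++)
        if (strcmp(value, MARITAL[code]) == 0)
            return code;                 /* e.g. "married" -> 1 */
    return -1;                           /* unknown category */
}
```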

Firstly, the experiments start with result validation. The objective of result validation is to ensure that the proposed work generates correct results, by comparing the results from e-FGPSOM with results from other researchers. Four dimension sizes have been used, 4, 9, 13, and 8 parameters, according to the specific benchmark datasets as shown in Table 2. The algorithm is tested on the same map size, 40 × 40, using 250 iterations.

Table 2. Experimental design for enhanced parallel SOM evaluation

Result evaluation, on the other hand, concerns the behavior of the proposed work under different dimension sizes, i.e. numbers of parameters, of the dataset. Three dimension sizes have been used: 3, 5, and 8 parameters. The algorithm is tested on the same map size, 50 × 50, using 30 iterations. Result evaluation applies time comparison and speedup [38] as performance measurements.

Moreover, for analysis purposes, the results of this study are compared with those of our previous work [31], referred to as FGPSOM. FGPSOM is a standard parallel SOM on the HSA platform, and both FGPSOM and e-FGPSOM are implemented on OpenCL 2.0. The experiments are conducted on a laptop equipped with an Intel Skylake i7-6700HQ processor and built-in Intel® HD Graphics 530.
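Speedup here is taken in its standard sense [38], the ratio of the baseline execution time to that of the enhanced version:

```latex
\mathrm{speedup} = \frac{T_{\mathrm{FGPSOM}}}{T_{\mathrm{e\text{-}FGPSOM}}}
```

so a speedup of 1.24x means FGPSOM took 1.24 times as long as e-FGPSOM on the same workload.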

4 Experimental Result and Discussion

4.1 Result Validations

The experimental results start with result validation. As mentioned in Sect. 3, result validation for e-FGPSOM is based on four benchmark datasets: Iris, Glass, Wine, and Yeast. Figure 3 illustrates the map visualizations generated by e-FGPSOM. The figure shows that the Iris and Wine datasets clearly produce three classes each; this agrees with Fränti [33] and confirms that the proposed work is capable of generating similar results. The results for the Glass and Yeast datasets are more subjective to judge; however, the number of classes can be identified by observation, indicating that the proposed work is also capable of generating equivalent results [33].

Fig. 3. Result produced by using the algorithm of the enhanced parallel SOM architecture

4.2 Results Evaluation

For result evaluation, the proposed algorithm is tested on the same dataset size and map size, while the dimension size of the dataset is varied over 3, 5, and 8 parameters. Figure 4 shows the comparison between FGPSOM and e-FGPSOM. The results of e-FGPSOM are better than those of FGPSOM for both datasets: e-FGPSOM reduces the processing time in all cases. The experiment shows that e-FGPSOM successfully improves processing speed over FGPSOM when experimenting on larger dimension sizes of the dataset.

Fig. 4. Processing time comparison between FGPSOM and e-FGPSOM

In detail, for the 10,000-record dataset, e-FGPSOM achieves speedups of 1.24x, 1.13x, and 1.10x for 3, 5, and 8 parameters respectively. For the 15,000-record dataset, e-FGPSOM scores speedups of 1.16x, 1.11x, and 1.10x for 3, 5, and 8 parameters respectively. The proposed algorithm achieves speedup over FGPSOM because it applies the proposed architecture of parallel kernels, whose execution triggers multiple-stimuli processing; processing multiple stimuli increases the utilization of GPU cores compared to FGPSOM. However, due to the complexity of processing larger datasets and larger parameter sizes, the speedup scores decrease as the dataset size and the number of parameters increase.

5 Conclusion

In this study, we proposed an enhanced parallel SOM based on heterogeneous system architecture. The proposed architecture extends earlier parallel SOM research built on three kernels, the calculate-distance kernel, the find-BMU kernel, and the update-weight kernel, and includes two calculate-distance kernels and two find-BMU kernels. It is designed with the aim of increasing the utilization of the processing elements on the GPU.

The overall results indicate that the enhanced parallel SOM, e-FGPSOM, improves efficiency and scalability, because it optimizes the usage of the cores in the GPU. e-FGPSOM also demonstrates advantages from various perspectives, as listed below:

  • The proposed work is capable of generating comparable results (number of classes) to other research on the benchmark datasets; thus, the proposed work maintains the accuracy of the results.

  • The proposed work is more scalable in terms of GPU core utilization owing to the multiple-stimuli method, whose implementation also makes the proposed work more efficient.

The proposed work has a limitation: the synchronization point at the find-BMU step can burden the processing of the find-BMU kernel. This limitation presents an opportunity for improvement. In future work, the time consumed by the synchronization point could be reduced by eliminating the access to BMU values at the host side, so that the final BMU is identified entirely during kernel processing.