
1 Introduction

With the proliferation of data being generated, there is an urgent need for new technologies and architectures that make it possible to capture, analyze, and extract valuable information from it. New sources of data include various sensor-enabled devices such as medical devices, IP cameras, video surveillance cameras, and set-top boxes, which contribute largely to the volume of big data. Owing to this proliferation, it has been predicted that 44 zettabytes, or 44 trillion gigabytes, of data will be generated annually by the end of 2020. Such data are continuously generated by sources such as internet applications and communications; they are of large size and great variety, structured or unstructured, and are collectively referred to as big data. Big data is characterized by three particularly significant V's: Volume, Velocity, and Variety. Volume signifies the plethora of data produced over time by various organizations and institutes. Velocity characterizes the rate at which data is generated from different sources. The third V, Variety, denotes the diverse forms of data, which may be structured, semi-structured, or unstructured, generated by several organizations; for example, data can be in the form of video, image, text, audio, etc. Apart from the characteristics mentioned above, two other key features are their incremental and dispersed nature. Big data are incremental in the sense that new incoming data are dynamically added to the existing pile, and dispersed because they are geographically distributed across different data centers. These distinguishing characteristics set big data apart from traditional databases or data warehouses. Traditional data storage techniques are not adequate to store and analyze such huge volumes of data; in short, the data are so large and complex that most traditional data management tools cannot store or process them efficiently.

There are various challenges associated with big data. First, processing such a large volume of data sequentially takes a lot of time. Second, how do we process and extract valuable information from this huge volume of data within a given timeframe? Addressing these challenges requires an understanding of computational complexity, information security, and the computational methods used to analyze big data. For example, many statistical methods that perform well for small data sizes do not scale to voluminous data. Similarly, many computational techniques that perform well on small data face significant challenges in analyzing big data. Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured, and unstructured data, from different sources, and in different sizes ranging from terabytes to zettabytes.

Predictive analytics provides a set of likely outcomes by establishing patterns in previously observed data for a given situation. It studies present as well as past data and predicts what may happen in the future, or gives the probability of what would happen. We need to make use of such large data in order to make decisions in the future. However, traditional machine learning and statistical methods run in sequential mode take much longer to make predictions, especially in the case of intrusion data [3]. In this work, a traditional machine learning model, KNN, is studied with various proximity measures in both a sequential and a parallel manner.

The major contribution of this paper is a parallel version of the KNN algorithm, referred to here as TUKNN. We also conduct an exhaustive experimental study on a good number of proximity measures in the KNN framework and recommend the measure that achieves the best accuracy with the TUKNN algorithm. Further, we recommend an optimal range of K values to achieve the best possible performance.

2 Related Work

KNN is a non-parametric classification method, which is simple but effective in many cases [5]. It classifies objects based on the closest training examples in the feature space. For any test object t to be classified, its K nearest neighbors are retrieved, and these form the neighborhood of t. The class label of t is then decided by a majority vote among these neighbors.

In [9], the authors use the CUDA (Compute Unified Device Architecture) thread model to implement a CUDA-based KNN algorithm. The Adult dataset from the UCI Machine Learning Repository was used to compare the performance of the CUDA-based implementation on GPU with an ordinary CPU-based implementation, and the authors suggest that the KNN method is efficient for applications with large volumes of data.

In [8], the authors implement the CUKNN algorithm, which constructs two multi-threaded kernels: a distance calculation kernel and a sorting kernel. With CUKNN, the authors claim that the method achieves execution times up to 15.2 times better than the CPU implementation.

In [2], the authors propose a fast and parallel KNN algorithm and show its impact on content-based image retrieval applications. The authors implement the parallel version of KNN in C and MATLAB on GPU using CUDA.

3 Proposed Work

KNN is a widely used classification algorithm and can be considered parallel-friendly because of its large number of independent operations. When the training and testing datasets are large, sequential execution becomes quite slow, which makes KNN a good candidate for parallel implementation. In this work, we implement KNN on the CUDA framework. The proposed framework is depicted in Fig. 1. In our framework, we explore a good number of proximity measures in parallel during the mining process to recommend the best possible measure for better accuracy. The measures used are: Euclidean distance, Manhattan distance, Kulczynski distance, cosine similarity, Chebyshev distance, Soergel distance, Sorensen distance, and Tanimoto distance.

Fig. 1. Framework of the proposed work

3.1 Distance Measures

Proximity computation is an essential component of the KNN algorithm and significantly influences its performance in terms of both speed and accuracy. Since every proximity (similarity or dissimilarity) measure has its own advantages and disadvantages, we conduct an empirical study to evaluate their performance and subsequently recommend the best possible measure for cost-effective performance with TUKNN. Table 1 shows the distance measures and their mathematical expressions used in our work; an illustrative sketch of these measures is given after Table 1.

Table 1. Distance measures and their mathematical expressions [4, 11]
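
As an illustration, the following Python/NumPy sketch gives one common formulation of each of the eight measures. The exact expressions used in this work are those listed in Table 1, so these definitions, in particular the Kulczynski and Tanimoto variants, which have several forms in the literature, should be read as assumptions rather than the paper's precise formulas.

```python
import numpy as np

# Illustrative formulations of the eight proximity measures; the exact
# expressions used in the paper are those in Table 1, and the Kulczynski
# and Tanimoto variants below are assumptions (several forms exist).

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    return np.max(np.abs(x - y))

def cosine_distance(x, y):
    # 1 - cosine similarity, so that smaller values mean "closer"
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def kulczynski(x, y):
    # One common form: L1 distance divided by the sum of element-wise minima
    return np.sum(np.abs(x - y)) / np.sum(np.minimum(x, y))

def soergel(x, y):
    return np.sum(np.abs(x - y)) / np.sum(np.maximum(x, y))

def sorensen(x, y):
    # Also known as the Bray-Curtis dissimilarity
    return np.sum(np.abs(x - y)) / np.sum(x + y)

def tanimoto(x, y):
    # Distance form of the Tanimoto (extended Jaccard) coefficient
    dot = np.dot(x, y)
    return 1.0 - dot / (np.dot(x, x) + np.dot(y, y) - dot)
```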

Further, we carry out exhaustive experimentation on a large number of datasets by varying the K values to identify an optimal range of K values for the best possible performance of TUKNN. Next, we present both the sequential and the parallel versions of the KNN algorithm.

3.2 Sequential KNN Algorithm

  1. For every fold in the 5 folds, perform steps 2 to 8.

  2. Split the dataset into a test set and a training set using 5-fold cross-validation.

  3. For every test instance in the test set, perform steps 4 to 8.

  4. Compute the distance between this test instance and all the training instances in the training set.

  5. From the distances obtained in step 4, find the Kmax smallest values and save the corresponding training instances, where Kmax is the maximum K value in the range of K values chosen for the algorithm.

  6. For every K in the chosen range of values, perform steps 7 and 8.

  7. Find the first K neighbors (i.e., the K training instances with the minimum distances) from the results obtained in step 5.

  8. Perform a majority voting among these neighbors; the dominating class label in the pool becomes the class label of the test instance.

In step 5, instead of applying a sorting algorithm, we find the first Kmax minimum distances and their respective training instances. This is done to reduce the time complexity of the algorithm: a full sort takes O(N log N) time on average (and O(N²) in the worst case for quicksort), whereas finding the first Kmax minimum distances takes O(N·Kmax) time. Here, N represents the size of the training set and Kmax is the maximum K value in the range chosen for the algorithm.
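
To make this selection step concrete, the following minimal Python sketch predicts the label of a single test instance for every K in a chosen range using the scan-for-minimum strategy described above. The helper names (k_smallest, knn_predict) and the plain-Python style are ours, not the paper's, and the surrounding 5-fold cross-validation loop is omitted.

```python
import numpy as np
from collections import Counter

def k_smallest(distances, k_max):
    """Return indices of the k_max smallest distances without a full sort.

    Repeatedly scanning for the next minimum costs O(N * k_max), matching
    the selection step described for step 5 of the sequential algorithm.
    """
    distances = distances.astype(float).copy()
    neighbour_idx = []
    for _ in range(k_max):
        i = int(np.argmin(distances))   # O(N) scan for the next minimum
        neighbour_idx.append(i)
        distances[i] = np.inf           # exclude it from later scans
    return neighbour_idx

def knn_predict(train_X, train_y, test_x, k_values, distance_fn):
    """Predict the label of one test instance for every K in k_values."""
    # Step 4: distance from the test instance to every training instance
    dists = np.array([distance_fn(test_x, row) for row in train_X])
    # Step 5: keep only the Kmax nearest candidates (already in rank order)
    candidates = k_smallest(dists, max(k_values))
    predictions = {}
    for k in k_values:                  # Steps 6-8: vote for each K
        votes = Counter(train_y[i] for i in candidates[:k])
        predictions[k] = votes.most_common(1)[0][0]
    return predictions
```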

3.3 The Proposed Parallel KNN Algorithm

The algorithm for the parallel KNN implementation is stated below; an illustrative GPU kernel sketch follows the list.

  1. For every fold in the 5 folds, perform steps 2 to 8.

  2. Split the dataset into a test set and a training set using 5-fold cross-validation.

  3. For every batch of n test instances (n = 2500 in our setup), perform steps 4 to 8.

  4. Compute the distances between these n instances and all the training instances in the training set simultaneously by invoking the GPU kernel.

  5. From the distances obtained in step 4, find the Kmax smallest values for each test instance and save the corresponding training instances, where Kmax is the maximum K value in the range of K values chosen for the algorithm. This step is performed for all n instances simultaneously with the help of the GPU kernel.

  6. For every K in the chosen range of values, perform steps 7 and 8.

  7. Find the first K neighbors (i.e., the K training instances with the minimum distances) from the results obtained in step 5.

  8. Perform a majority voting among these neighbors; the dominating class label in the pool becomes the class label of the test instance. Steps 6, 7, and 8 are performed for all n instances simultaneously by invoking the GPU kernel.
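
Since the paper describes the GPU kernels but does not list their code, the following Py-CUDA sketch illustrates step 4 only: a Euclidean distance kernel with one thread per (test instance, training instance) pair, invoked for one batch of test instances. The kernel and function names (pairwise_dist, gpu_distances) are ours, and the actual implementation also covers the other proximity measures and the selection and voting steps.

```python
import numpy as np
import pycuda.autoinit            # creates a CUDA context on the default GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# One thread computes the Euclidean distance for one (test, train) pair.
mod = SourceModule("""
__global__ void pairwise_dist(const float *test, const float *train,
                              float *out, int n_test, int n_train, int n_feat)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_test * n_train) return;
    int t = idx / n_train;        // test instance index
    int r = idx % n_train;        // training instance index
    float acc = 0.0f;
    for (int f = 0; f < n_feat; ++f) {
        float d = test[t * n_feat + f] - train[r * n_feat + f];
        acc += d * d;
    }
    out[idx] = sqrtf(acc);
}
""")
pairwise_dist = mod.get_function("pairwise_dist")

def gpu_distances(test_batch, train_X):
    """Distances between a batch of test rows (e.g. 2500) and all training rows."""
    test_batch = np.ascontiguousarray(test_batch, dtype=np.float32)
    train_X = np.ascontiguousarray(train_X, dtype=np.float32)
    n_test, n_feat = test_batch.shape
    n_train = train_X.shape[0]
    out = np.empty(n_test * n_train, dtype=np.float32)
    threads = 256
    blocks = (n_test * n_train + threads - 1) // threads
    pairwise_dist(drv.In(test_batch), drv.In(train_X), drv.Out(out),
                  np.int32(n_test), np.int32(n_train), np.int32(n_feat),
                  block=(threads, 1, 1), grid=(blocks, 1))
    return out.reshape(n_test, n_train)
```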

4 Implementation and Results

For the parallel KNN, we compute the distances between a batch of test instances and all the training instances simultaneously, so all the distances for the batch are computed in parallel at once. To calculate these distances in parallel, we use the many cores of the GPU platform and develop CUDA kernels to carry out the computation. The most crucial task for a KNN classifier is to compute the distance d for finding the nearest neighbors; implementing this distance computation on the GPU platform results in a considerable improvement in KNN performance.

The graphics card used in our work is an NVIDIA Tesla K40c GPU accelerator with 12 GB of memory. With this 12 GB of memory, we are able to compute the distances between 2,500 test instances and all the training instances in the training set simultaneously on the GPU.

4.1 Datasets Used

We perform our experimentation on the following three types of datasets.

  1. Ransomware Dataset: For our experiment, we use the dataset of Sgandurra et al. [10]. The dataset has a total of 582 ransomware and 942 goodware instances; the 582 ransomware instances comprise 11 different variants. It also has a total of 30,692 features, which collectively represent the characteristics of both goodware and ransomware. A detailed description of the dataset is given in Table 2.

     Table 2. Ransomware dataset characteristics

  2. SWaT Dataset: The Secure Water Treatment (SWaT) dataset [1] is also used in our experimentation. It contains a total of 946,722 instances, of which 54,620 belong to the attack category. The dataset has 51 attributes and two labels, namely attack and normal.

  3. UCI datasets: A total of 20 datasets from the UCI repository are also used in our work. The list of datasets used is given in Table 3.

4.2 Results and Observation

In our framework, an optimal range of K values is determined based on an experimental study on twenty datasets from the UCI Machine Learning Repository. This reduces the overhead of searching for the best possible K value for the highest accuracy and makes our model faster. As we can see from Table 4, in the majority of cases (15 out of 20) the optimal K values lie within the range of 2–9. Table 5 shows the ratio of CPU to GPU execution time for all the datasets used in our work. The optimal K value of each proximity measure, for which the highest accuracy is obtained, is reported in Tables 6, 7, and 8.

4.2.1 Accuracy Comparison

The accuracy comparison plots for the three datasets are discussed below.

  1. Accuracy of Binary Class Ransomware Dataset: The classification accuracy of the KNN algorithm with all eight distance measures on the binary-class ransomware dataset is shown in Fig. 2. As shown in the figure, the highest accuracy, 95.27%, is obtained with the Kulczynski, Soergel, Sorensen, and Tanimoto measures.

     Table 3. Characteristics of 20 datasets obtained from UCI repository

     Fig. 2. Accuracy of ransomware dataset (binary class)

  2. Accuracy of Multi Class Ransomware Dataset: Our observation from Fig. 3 is that the highest accuracy, 82.32%, is given by KNN with the Kulczynski measure.

     Fig. 3. Accuracy of ransomware dataset (multi class)

  3. Accuracy of SWaT Dataset: As depicted in Fig. 4, the model gives the same accuracy, 94.1%, for all eight measures on this dataset. However, a difference in performance has been observed beyond the fourth decimal place (not reported here).

     Fig. 4. Accuracy of SWaT dataset

4.2.2 Comparison of KNN and TUKNN in Terms of Execution Time

  (a) KNN vs TUKNN Time Comparison for Binary Ransomware Dataset: The execution time comparison of KNN and TUKNN for the binary ransomware dataset is shown in Fig. 5. It is clear from the figure that TUKNN performs significantly better than KNN.

      Fig. 5. Time comparison for 2-class ransomware dataset: a) KNN and b) TUKNN

      Table 4. K values to achieve maximum accuracy

  (b) KNN vs TUKNN Time Comparison for Multi-Class Ransomware Dataset: Fig. 6 shows the performance comparison of KNN and TUKNN for the multi-class ransomware dataset. It is quite clear that TUKNN performs much better than KNN.

      Fig. 6. Time comparison for multi-class ransomware dataset: c) KNN and d) TUKNN

  (c) KNN vs TUKNN Time Comparison for SWaT Dataset: Fig. 7 shows that the TUKNN implementation is significantly faster than KNN for the SWaT dataset.

      Fig. 7. Time comparison for SWaT dataset: e) KNN and f) TUKNN

      Table 5. Ratio of CPU and GPU time (in seconds)

5 Conclusion

Our study reveals that the Kulczynski and Soergel distances are adequate for KNN to handle the 2-class ransomware dataset with high classification accuracy. In the multi-class case, although these two proximity measures again give the winning performance in comparison to their counterparts, the classification accuracies are relatively lower. Interestingly, for the SWaT dataset, six of the eight proximity measures, namely Euclidean, Manhattan, Kulczynski, cosine similarity, Chebyshev, and Soergel distance, give equal winning performance.

Out of all the computations performed, the Chebyshev distance for the binary classification of the ransomware dataset benefits the least from the use of Py-CUDA, with the GPU computation being only 40.86 times faster than the CPU computation, while the cosine similarity for the classification of the SWaT dataset benefits the most, with the GPU computation being 237.5 times faster than the CPU computation.

Table 6. Optimal ‘K’ values for proximity measures for 2-class ransomware dataset
Table 7. Optimal ‘K’ values for proximity measures for n-class ransomware dataset
Table 8. Optimal ‘K’ values for proximity measures for SWaT dataset

When dealing with the binary classification of the ransomware dataset using the KNN model, if accuracy is of high priority, then the Kulczynski or Soergel distance is recommended. Similarly, for the multi-class classification of the ransomware dataset, if accuracy is of high priority, then the Kulczynski distance is recommended. For the classification of the SWaT dataset with high accuracy, any of the six winning proximity measures may be used. If both accuracy and computational time are of high priority, then the Manhattan distance is the better option.

Also, we recommend K values ranging from 2 to 9 for the best possible accuracy on all the datasets used in the study. Exhaustive experimentation was also carried out for optimal feature selection based on some prominent feature selection algorithms [6, 7]. The performance of TUKNN with the optimal feature subset has been found to be significantly better than the performance reported here. However, due to lack of space, those results are not reported.