
1 Introduction

The introduction of new technologies has made computer science one of the fastest-growing fields. With the availability of new software paradigms, dependency on computers and machines to perform extreme and computationally exhaustive tasks is growing day by day. For example, the traditional way of storing files on local file systems is being replaced by distributed and more secure cloud storage systems. Among all these technologies, the Internet has turned out to be the most resourceful one ever created for instantly sharing knowledge and resources with the rest of the world. Its popularity can be assessed from the fact that the total number of Internet users has grown by around 82%, or almost 1.7 billion, in the last 5 years, and this number is forecast to reach 4 billion around the year 2020. The main reasons behind this increasing popularity are the Internet's speed, economic nature and ease of accessibility. The Internet has eased everyday activities to a tremendous extent by reducing the communication delay between users in different parts of the world.

This dependency, together with the drive to improve our methods through research and analysis, has led to an immense rise in the rate of data production. As a result, around 90% of the world's data has come into existence in the past 2 years, and the Internet has turned out to be the largest contributor among its sources. With sufficient data available for research and analytics, machine learning and related techniques have gained popularity [1] for achieving otherwise computationally impractical solutions using well-defined mathematical models. Machine learning [2] is a technique for finding a solution, or more precisely improving an existing solution gradually, by learning from previous observations or experience without human intervention. It is a popular approach for problems that require human-like instinct and decision-making. With the ease and availability of Big Data, machine learning is used to solve complex and interesting problems [1] and to accomplish tasks that normally require special human assistance, intelligence or decision-making skills, for example classification, prediction in advanced robotics, and driverless cars.

1.1 Limitations of Classical ML Methodology

The basic approach to solving a machine learning problem involves two subtasks: (i) training and (ii) testing. Irrespective of the techniques [1] used, both subtasks are common to every machine learning task and are crucial to solving it; the quality of the training and testing methods eventually decides the overall quality of the model or solution. Initially, training is performed to develop a mathematical model/classifier with the help of existing data in huge quantities. Vast amounts of data help in improving the accuracy and performance of the model, but processing these chunks of data is itself a crucial problem. Hence, the overall performance of an ML algorithm is limited by the processing methodology used to prepare the data for training, and this methodology must be improved so that huge datasets can be processed and appreciable results achieved within acceptable time limits. The learning (training) and testing phases of the ML process are depicted in Fig. 1a, b.

Fig. 1 Machine learning as a process: a learning phase, b testing phase

The second task is testing, i.e. analysing the accuracy of the model using related test data. This step is crucial as it measures the accuracy and acceptability of the model. The solution must work for regular as well as new problems [2], i.e. it must be robust against different possible situations, so the model is tested against a large dataset whose samples may or may not belong to the training set. Again, the vastness of the dataset extends the evaluation period of the model, which is undesirable. Several efforts have been made in this field to improve performance on this kind of problem. To address it, in this paper we discuss a strategy using the GPU-based parallel processing platform CUDA to perform the ML tasks within allowable time limits [3, 4]. This approach works efficiently and can make the training and testing phases tractable within the available time.

2 Resolving with CUDA Platform

2.1 CUDA

In 2007, NVIDIA, a GPU-designing firm, released the initial version of CUDA, a parallel computing platform which uses graphics processing units (GPUs) for massive parallelization of computation tasks [4]. It follows the general-purpose computing on graphics processing units (GPGPU) approach to perform general tasks which are computationally intensive and would require a lot of time if executed on the CPU [5]. It allows users to write massively parallel programs using languages like C, C++ and Fortran [6]. CPUs are optimized for sequential processing and contain only a limited number of powerful cores, whereas GPUs are optimized for parallel tasks and contain thousands of smaller and less powerful cores [7]. The architectural difference between a CPU and a GPU is depicted in Fig. 2.

Fig. 2 CPU versus GPU
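As a minimal illustration of this programming model (not part of the original study), the sketch below adds two vectors on the GPU; the kernel name vecAdd, the problem size and the use of managed memory are our own illustrative choices.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal CUDA kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified (managed) memory keeps the host/device transfers implicit.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                        // threads per block
    int blocks = (n + threads - 1) / threads; // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);              // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```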

2.2 Impact on Solving ML Problems

Solving ML problems on traditional systems may turn out to be expensive as the size of the dataset increases; in particular, the time required to process such large datasets grows drastically. Parallel GPU computation can be a perfect fit for such large-scale machine learning problems, harnessing the graphical computation power of the system without any extra increase in machine cost [3, 8].

3 Problem Formulation: Predicting Flight Delay Occurrences

Every year around one-fifth of airline flights suffer a cancellation or delay, which results in a critical loss of time and resources to both airlines and passengers. Hence, a model which can predict whether a given flight with certain parameters such as source, destination and duration will suffer a delay can offer great help to passengers and airline managers in choosing and managing their flights. Here, we use a dataset from the American Statistical Association (ASA) 2009 data expo [9], which includes all commercial flights within the USA from 1987 to 2008, nearly 120 million flight records in total. To predict future occurrences, we consider the Naïve Bayes Classifier (NBC). The sequential form of NBC is described in Sect. 3.1.

3.1 Solution: Sequential Naïve Bayes Classifier

3.1.1 Naïve Bayes Classifier

The Naïve Bayes Classifier (NBC) belongs to a class of conditional probabilistic classifiers which make the major assumption of independence between all features of the dataset in order to calculate the most probable outcome for a given input [10]. It uses Bayesian inference to measure a degree of belief and identify the class to which the input features belong [11]. The input is given in the form of a vector:

$$ X\, = \,(x_{1} ,x_{2} ,x_{3} ,\, \ldots ,\,x_{n} ) $$
(1)

For each class \( C_{k} \), the NBC provides the probability \( P_{k} \) that the input vector X belongs to that class, i.e.

$$ P(C_{k} \,|\,(x_{1} ,x_{2} ,x_{3} ,\, \ldots ,\,x_{n} )) $$
(2)

Using Bayes' theorem, this probability can be decomposed as

$$ P(B\,|\,A)\, = \,\frac{P(B)P(A\,|\,B)}{P(A)} $$
(3)

The above formula can be extended to the case of several classes \( B_{i} \), with the denominator expanded using the law of total probability:

$$ P(B_{i} \,|\,A)\, = \,\frac{{P(B_{i} )P(A\,|\,B_{i} )}}{{\sum\nolimits_{i = 1}^{n} {P(B_{i} )P(A\,|\,B_{i} )} }} $$
(4)

However, the denominator can be ignored since it remains the same for all classes [12]; hence, the classification problem reduces to finding the numerator for a given input vector X. Using the assumption of independence between the features of X, i.e. that the probability of occurrence of one feature is independent of the occurrence of any other feature or collection of features, the numerator simplifies to a joint probability that is easy to compute.
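Concretely, the quantity compared across classes is proportional to the class prior multiplied by the individual feature likelihoods:

$$ P(C_{k} \,|\,X)\, \propto \,P(C_{k} )\prod\limits_{i = 1}^{n} {P(x_{i} \,|\,C_{k} )} $$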

3.1.2 Normal Distribution

For calculating the probability of occurrence of features which are continuous in nature rather than discrete, we use the normal (also known as Gaussian) distribution function [10], which is given as follows.

$$ P(x_{k} \,|\,C_{i} )\, = \,g(x_{k} ,\mu_{Ci} ,\sigma_{Ci} ) = \frac{1}{{\sqrt {2\pi } \sigma_{Ci} }}e^{{ - \frac{{(x_{k} - \mu_{Ci} )^{2} }}{{2\sigma_{Ci}^{2} }}}} $$
(5)

where \( g(x_{k} ,\mu_{Ci} ,\sigma_{Ci} ) \) is the Gaussian (normal) density function for the attribute, while \( \mu_{Ci} \) and \( \sigma_{Ci} \) are the mean and standard deviation, respectively, of that attribute's values for class Ci. The advantages of using NBC over other classifiers are: (i) it is simple and easy to test on new data, (ii) it has a low error rate compared to more complex algorithms, (iii) it achieves a high degree of accuracy with lower computational complexity and (iv) it competes with more advanced algorithms such as SVM.
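For illustration, Eq. (5) maps directly onto a small helper function; the name gaussianPdf and its signature are illustrative assumptions rather than the paper's actual code. Marking it __host__ __device__ lets the same routine be reused in both the sequential baseline and the CUDA kernels sketched later.

```cpp
#include <cmath>

// Gaussian (normal) density from Eq. (5): likelihood of feature value x
// given the class-conditional mean and standard deviation.
__host__ __device__ inline float gaussianPdf(float x, float mean, float stddev) {
    const float PI = 3.14159265358979f;
    float diff = x - mean;
    return (1.0f / (sqrtf(2.0f * PI) * stddev)) *
           expf(-(diff * diff) / (2.0f * stddev * stddev));
}
```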

3.2 Methodology: Predicting Airline Delays

3.2.1 Training the Model

To train the model, we start by calculating the mean and variance of each relevant parameter in the flight description for each of the classes \( C_{Delayed} \) and \( C_{Nodelay} \). The calculated values of mean and variance are shown in Table 1.

Table 1 Sample airline delay database

Using a table similar to the one above, we calculate the probabilities needed to find the numerator for a given set of flight features whose class has to be identified as either delayed or on-time.
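A minimal sequential sketch of this training step is shown below; the FlightRecord and ClassStats structures, the choice of four continuous features and the assumption that every class contains at least one record are illustrative and not taken from the paper.

```cpp
#include <vector>

// Illustrative record: a few continuous features plus a delayed/on-time label.
struct FlightRecord {
    float features[4];   // e.g. departure time, distance, duration, day of week
    int   label;         // 0 = no delay, 1 = delayed
};

// Per-class, per-feature mean and variance, plus the class prior.
struct ClassStats {
    float mean[4], var[4];
    float prior;
};

// Compute mean, variance and prior for one class (sequential baseline).
// Assumes the class contains at least one record.
ClassStats trainClass(const std::vector<FlightRecord> &data, int cls) {
    ClassStats s{};
    int count = 0;
    for (const auto &r : data) {
        if (r.label != cls) continue;
        ++count;
        for (int f = 0; f < 4; ++f) s.mean[f] += r.features[f];
    }
    for (int f = 0; f < 4; ++f) s.mean[f] /= count;
    for (const auto &r : data) {
        if (r.label != cls) continue;
        for (int f = 0; f < 4; ++f) {
            float d = r.features[f] - s.mean[f];
            s.var[f] += d * d;
        }
    }
    for (int f = 0; f < 4; ++f) s.var[f] /= count;
    s.prior = static_cast<float>(count) / data.size();
    return s;
}
```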

3.2.2 Testing

After preprocessing, we use the mean and variance calculated in the previous step to find the posterior probability of each feature for the given test samples. Combining the posterior probabilities of all features yields the combined probability that the test sample belongs to a given class. For each sample, the class with the highest numerator value is taken as the prediction for that sample. We then apply this strategy to all flight records in the test dataset one by one, sequentially. The corresponding algorithm, which includes these two steps, is shown in Algorithm 1, and an illustrative sequential sketch follows it.

Algorithm 1: NBC Classification algorithm

Input: Airline Dataset

Output: Set of probabilities for each class

Step-1. Select the relevant features of each flight to be considered in the flight delay evaluation.

Step-2. Find the mean (μ) and variance (σ²) of the selected features for each class over the complete dataset.

Step-3. For each input vector X, calculate the individual probability of occurrence for each class.

Step-4. Combine all the probabilities to calculate the joint probability in the numerator.

Step-5. Assign each input vector to the class with the highest numerator value.
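Steps 3-5 of Algorithm 1 can be sketched sequentially as follows, reusing the illustrative gaussianPdf helper and ClassStats structure introduced above; this is a simplified outline under those assumptions, not the authors' exact implementation.

```cpp
#include <cmath>

// Sequential classification (Algorithm 1, Steps 3-5): for each test flight,
// multiply the class prior by the per-feature Gaussian likelihoods and pick
// the class with the larger numerator.
int classify(const float features[4], const ClassStats stats[2]) {
    int best = 0;
    float bestScore = -1.0f;
    for (int c = 0; c < 2; ++c) {
        float score = stats[c].prior;
        for (int f = 0; f < 4; ++f)
            score *= gaussianPdf(features[f], stats[c].mean[f],
                                 sqrtf(stats[c].var[f]));
        if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;   // 0 = no delay, 1 = delayed
}
```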

4 Parallelizing the Approach

In this section, we improve the performance of the above algorithm using the parallel CUDA computing model [13, 14] and evaluate the result. To accelerate the computation of the mean, we use a methodology known as tree-based reduction sum. This technique utilizes the parallel architecture of CUDA [6] to efficiently calculate the mean and variance of a given feature belonging to a particular class. An instance of tree-based reduction sum is depicted in Fig. 3.

Fig. 3 Tree reduction sum approach

In the reduction sum, we use the tree reduction technique to evaluate the sum of a large set of values; it reduces the number of sequential addition steps from n to roughly log(n). As CUDA has no global synchronization facility through which all threads can easily communicate [13], this limitation is avoided by using recursive kernel invocation, which adds a small hardware/software overhead for each individual kernel launch [15]. The scheme of recursive kernel invocation is shown in Fig. 4.

Fig. 4 Recursive kernel invocation
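A minimal sketch of such a tree-based reduction kernel is given below, assuming the feature values are already resident in device memory; the kernel name, the block size of 256 threads and the surrounding conventions are our own illustrative choices. The kernel is launched repeatedly on its own partial sums, mirroring the recursive invocation of Fig. 4, until a single value remains, which the host divides by n to obtain the mean.

```cpp
// Tree-based reduction: each block sums BLOCK_SIZE elements in shared memory.
// Launch with exactly BLOCK_SIZE threads per block, then relaunch on the
// per-block partial sums until one value remains.
#define BLOCK_SIZE 256

__global__ void reduceSum(const float *in, float *out, int n) {
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;   // load, padding the tail with zeros
    __syncthreads();

    // Tree reduction within the block: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```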

4.1 Parallelizing Testing Phase

To classify the test dataset, we select an optimum block size (i.e. the number of threads per block) and a total number of blocks, each containing that many threads [16]. After passing the set of means and variances for each feature in the flight description to the GPU, we calculate the individual probabilities of a test flight belonging to each possible class [9]. Figure 5 shows the calculation through the NB model. Each block processes a number of flights equal to its number of threads, i.e. one thread per test flight, and together the blocks evaluate the complete result. Compared to sequential processing of each test flight, the parallel CUDA algorithm only takes time proportional to the work of a single thread per block plus some communication time between the CPU and GPU to distribute [17] the work and recombine the solution. The modified parallel classification algorithm is shown in Algorithm 2.

Fig. 5 Parallel test results calculation through NB model
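A possible form of this per-thread classification kernel is sketched below, with one thread per test flight as described above; the feature count, the flattened row-major memory layout and all names are illustrative assumptions, and the gaussianPdf helper sketched in Sect. 3.1.2 is reused.

```cpp
// One thread per test flight: each thread evaluates the Naive Bayes numerator
// for both classes and writes the predicted label.
__global__ void classifyFlights(const float *features,  // n x NUM_FEATURES, row-major
                                const float *mean,      // 2 x NUM_FEATURES
                                const float *stddev,    // 2 x NUM_FEATURES
                                const float *prior,     // 2 class priors
                                int *prediction, int n) {
    const int NUM_FEATURES = 4;                 // illustrative feature count
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int best = 0;
    float bestScore = -1.0f;
    for (int c = 0; c < 2; ++c) {
        float score = prior[c];
        for (int f = 0; f < NUM_FEATURES; ++f)
            score *= gaussianPdf(features[i * NUM_FEATURES + f],
                                 mean[c * NUM_FEATURES + f],
                                 stddev[c * NUM_FEATURES + f]);
        if (score > bestScore) { bestScore = score; best = c; }
    }
    prediction[i] = best;
}
```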

Algorithm 2: Parallel NBC Classification algorithm

Input: Airline Dataset

Output: Set of probabilities for each class

Step-1. Select the relevant features of each flight to be considered in the flight delay evaluation.

Step-2. Find the mean (μ) and variance (σ²) of the selected features for each class over the complete dataset using the reduction sum technique.

Step-3. Divide the test dataset into blocks, each containing a number of flight vectors equal to the number of threads. For each input vector X, calculate the individual probability of occurrence for each class.

Step-4. Combine all the probabilities to calculate the joint probability in the numerator.

Step-5. Assign each input vector to the class with the highest numerator value.
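A host-side launch corresponding to Step 3 of Algorithm 2 might look as follows; the block size of 256 threads, the function name and the assumption that the device buffers are already populated with the test set and trained statistics are illustrative, and error checking is omitted for brevity.

```cpp
// Host-side launch sketch for the parallel testing phase (Algorithm 2, Step 3).
void runParallelTest(const float *d_features, const float *d_mean,
                     const float *d_stddev, const float *d_prior,
                     int *d_prediction, int numFlights) {
    int threadsPerBlock = 256;                                     // block size
    int blocks = (numFlights + threadsPerBlock - 1) / threadsPerBlock;
    classifyFlights<<<blocks, threadsPerBlock>>>(d_features, d_mean, d_stddev,
                                                 d_prior, d_prediction,
                                                 numFlights);
    cudaDeviceSynchronize();                                       // wait for GPU
}
```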

4.2 Experimental Analysis

For the experiments, we use the airline dataset of the American Statistical Association (ASA) [9]. The test system runs the Windows 10 operating system on an Intel Core i5-3337U processor at 1.80 GHz (turbo boost up to 2.70 GHz) with 6 GB of DDR3L memory. The GPU is an NVIDIA GeForce GT 740M with 1 GB of dedicated DDR3 memory. The CUDA project was coded and debugged in Microsoft Visual Studio 2015 Community Edition with CUDA Toolkit 8.0 installed on the system. The results compared with the sequential algorithmic approach are depicted in Fig. 6.

Fig. 6 Parallel CUDA implementation compared with the sequential method

From the results, we conclude that adding GPU support for training the model and testing the sample data yields a speedup of nearly 2.5 times over the sequential method across the measured execution times. However, this figure may vary with different NVIDIA GPU cards and CUDA core configurations [18].

5 Conclusion and Future Work

This paper describes how the enormous parallelizing power of the CUDA platform can be utilized to accelerate the development of ML models for prediction and classification. To improve performance in this scenario, we used the reduction sum technique to parallelize the calculation of the mean in the training phase and parallelized the testing process using a simple CUDA computation scheme. Together, these techniques yield a tangible improvement over the sequential methodology for building the model and evaluating the results. Similar improvements can be made in developing other ML models by utilizing the resources offered by the CUDA computing platform, improving the time and resources consumed by the overall prediction and classification process without requiring costly machines.