
1 Introduction

Cloud computing is a technology built on a pool of resources drawn from a large number of computers. Computation tasks are distributed over this pool, which also provides practically unlimited storage and computing power and thereby helps us mine large amounts of data.

In a parallel data mining environment, tightly coupled systems such as shared memory systems (SMS), distributed memory machines (DMM), or clusters of SMS workstations are connected by a fast network. In a distributed computing environment, loosely coupled processing nodes/computers are connected by a high-speed network, and each node contributes to the execution of tasks and to the distribution or replication of data. Such a collection is generally called a cluster of nodes, and a cluster framework is usually used to set it up.

HPC clusters exploit parallel computing to bring more computation power to bear on a problem. An HPC cluster consists of a large number of computers called “nodes,” most of which are configured identically. Externally, the cluster looks like a single system. Client programs that run on the nodes are called jobs, and they are constantly monitored through a queuing mechanism so that every accessible resource is used properly. Typical HPC jobs include simulation of numerical models or analysis of data from scientific instrumentation, allowing scientists to produce new science through the use of high-performance computing [1].

Data mining is the process of examining large preexisting databases or raw data to generate new information for further use. Data mining algorithms must be efficient and effective in order to produce meaningful output. Among the numerous available data mining algorithms, the most popular ones are the Apriori, DIC, GSP, SPADE, SPRINT, and K-means algorithms. In earlier days, the data mining process was slow due to limitations in computing power. Nowadays, it has sped up many fold thanks to high-performance parallel and distributed computing environments. However, the data available today are very large and growing at an exponential rate, which demands more effective and accurate data mining algorithms. K-means, despite being one of the most effective algorithms for a parallel computing environment, has some major limitations, and the proposed SKIK algorithm is our attempt to overcome one of them.

The organization of the paper is as follows: Sect. 1 holds the introduction, and Sect. 2 contains a brief discussion of data mining concepts used in the HPC environment along with their pros and cons. Section 3 describes the proposed sifted K-means with independent K-value (SKIK) algorithm, designed around the advantages and disadvantages of the existing algorithms, and Sect. 4 presents the complexity measurement of the SKIK algorithm. Section 5 concludes the paper with a few pointers to future work.

2 Data Mining Concepts and Related Algorithms

Different data mining concepts, including the type of machine, parallelism, load balance, database layout, and candidate handling, are discussed in detail in Sect. 2.1. A performance analysis of some of the most popular algorithms with respect to these concepts is provided in Table 1. In Sect. 2.2, the related and most popular algorithms are explained briefly. A comparative analysis of these algorithms, highlighting their advantages and disadvantages, is given in Sect. 2.3.

Table 1 Comparisons among common concepts used with data mining algorithms [2]

2.1 Concepts

Type of machine used. The two main types of machines are distributed memory machines (DMM) and shared memory systems (SMS). In DMM, the main effort is optimizing communication, and synchronization is implicit in message passing. In SMS, synchronization occurs via locks and barriers, and the aim is to minimize these synchronization points. Data decomposition is very important for DMM but not for SMS. SMS typically use serial I/O, while DMM use parallel I/O [2].

Parallelism type. Task and data parallelism are the two major forms of parallelism used. In data parallelism, the database is partitioned among P processors, and each processor evaluates the candidate patterns/models on its local part of the database. In task parallelism, the processors perform different computations independently but need access to the entire database. SMS can access the whole database directly, whereas DMM can do so only through careful replication of, or selective access to, non-local data. Hybrid parallelism, combining properties of both task and data parallelism, is also possible.

Load balance type. Two main load balancing types are static and dynamic load balancing. In static load balancing, work is partitioned among the processors using heuristic cost function, and there is no subsequent correction of load imbalances resulting from the dynamic nature of mining algorithms. Dynamic load balancing distributes work from heavily loaded processors to lightly loaded ones. Dynamic load balancing is important in multi-user environments and in heterogeneous platforms, which have different processor and network speeds.

Database layout type. The database recommended for data mining is usually a relational table with R rows, called records, and C columns, called attributes. Many data mining algorithms use a horizontal database layout, where each record stores a transaction id (tid) together with the attribute values of that transaction. Other procedures use a vertical database layout, where each attribute stores a list of all tids (called a tidlist) of the transactions containing that item, along with the related attribute value.

Candidate concepts. Different mining procedures use shared, replicated, or partitioned candidate concept generation and evaluation. In the shared approach, all processors work on a single copy of the candidate set. In the replicated approach, the candidate concepts are copied onto each system and checked locally before the overall results are obtained by merging them. In the partitioned approach, each processor creates and examines a disjoint candidate set.

Database type. The database itself can be shared (in SMS or shared-disk architectures), partitioned (using round robin, hash, or range scheduling) among the available nodes (in DMM) or partially or totally replicated.

Table 1 shows a comparison among the common concepts used with most popular data mining algorithms.

2.2 Most Popular Algorithms

Apriori Algorithm. This algorithm is used for mining frequent itemsets in large data sets. Its point of view is “bottom up”: in the candidate-generation step, frequent itemsets are extended one item at a time, and groups of candidates are checked against the data. It is designed to operate on a transaction database.

Frequent Itemsets: All the itemsets whose support is at least the minimum support (the set of frequent k-itemsets is designated by Dk).

Apriori Property: Every subset of a frequent itemset must also be frequent.

Join Operation: A set of candidate k-itemsets is generated by joining Dk−1 with itself; Dk is then found from these candidates.

Prune Step: Any infrequent (k − 1)-itemset cannot be a subset of a frequent k-itemset.

  • Ck: Candidate itemset of size k

  • Dk: frequent itemset of size k

  • D1 = {frequent items};

  • STEP 1: Obtain the support S of each 1-itemset by scanning the given database, compare S with supmin, and build the set of frequent 1-itemsets, D1

  • STEP 2: Join Dk−1 with itself to create a set of candidate k-itemsets, and use the Apriori property to prune the candidates that have an infrequent (k − 1)-subset.

  • STEP 3: Scan the given database to find the support S of each candidate k-itemset in this set, compare S with supmin, and build the set of frequent k-itemsets, Dk

  • STEP 4: If the candidate set is empty, go to STEP 5; otherwise go to STEP 2

  • STEP 5: For every frequent itemset l, produce all nonempty subsets of l,

  • STEP 6: For every nonempty subset s of l, if the confidence C of the rule “s => (l − s)” (= support of l/support S of s) ≥ min_conf, output the rule “s => (l − s)” [3].
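As an illustration of the level-wise flow in the steps above, the following is a minimal Python sketch of candidate generation, pruning, and support counting (the rule generation of STEPs 5–6 is omitted); the function and variable names are illustrative and are not taken from [3].

    from itertools import combinations

    def apriori(transactions, min_support):
        # Level-wise frequent-itemset mining (rule generation omitted).
        transactions = [frozenset(t) for t in transactions]
        n = len(transactions)

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / n

        # D1: frequent 1-itemsets (STEP 1)
        items = {i for t in transactions for i in t}
        frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

        k = 2
        while frequent[-1]:
            prev = frequent[-1]
            # Join step: build candidate k-itemsets from frequent (k-1)-itemsets (STEP 2)
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # Prune step: drop candidates having an infrequent (k-1)-subset
            candidates = {c for c in candidates
                          if all(frozenset(s) in prev for s in combinations(c, k - 1))}
            # STEP 3: scan the database and keep the candidates meeting min_support
            frequent.append({c for c in candidates if support(c) >= min_support})
            k += 1

        return [s for level in frequent for s in level]

    # Example usage: frequent itemsets with support >= 0.5 (illustrative data)
    print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 0.5))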

Dynamic Itemset Counting Algorithm (DIC). This is an alternative to the Apriori algorithm. As the transactions are read, itemsets are dynamically inserted and removed, under the assumption that all subsets of a frequent itemset must be frequent. After every M transactions, the algorithm pauses to add new candidate itemsets. Itemsets are tagged in four different ways while they are counted:

Solid box: confirmed frequent itemset—an itemset whose counting is complete and whose count exceeds the support threshold supmin

Solid circle: confirmed infrequent itemset—an itemset whose counting is complete and whose count is below supmin

Dashed box: suspected frequent itemset—an itemset still being counted whose count already exceeds supmin

Dashed circle: suspected infrequent itemset—an itemset still being counted whose count is below supmin

  • STEP 1: Tag the empty itemset with a solid box. Tag the 1-itemsets with dashed circles. Leave all other itemsets untagged.

  • STEP 2: While any dashed itemsets remain:

    1. Read M transactions (if at the end of the transaction file, continue from the beginning). For each transaction, increment the corresponding counters for the itemsets that appear in the transaction and are tagged with dashes.

    2. If a dashed circle’s count surpasses supmin, turn it into a dashed box. If any immediate superset of it now has all of its subsets tagged as solid or dashed boxes, add a new counter for that superset and tag it with a dashed circle.

    3. If a dashed itemset has been counted through all the transactions, make it solid and stop counting it [4].
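The following is a simplified Python sketch of the dashed/solid bookkeeping described above. It assumes an absolute count threshold min_count and a chunk size M, and it departs from the original algorithm in that candidates are added and itemsets are solidified only at chunk boundaries; all names are illustrative.

    from itertools import combinations

    def dic(transactions, min_count, M=2):
        # Simplified Dynamic Itemset Counting: new candidate itemsets are added
        # while the data is still being scanned, rather than once per full pass.
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        n = len(transactions)

        # Tags: 'DC' dashed circle, 'DB' dashed box, 'SC' solid circle, 'SB' solid box
        state = {frozenset(): 'SB'}
        state.update({frozenset([i]): 'DC' for i in items})
        count = {s: 0 for s in state}
        seen = {s: 0 for s in state}          # transactions counted so far for s

        pos = 0
        while any(tag in ('DC', 'DB') for tag in state.values()):
            for _ in range(M):                # read M transactions (wrap around)
                t = transactions[pos % n]
                pos += 1
                for s in list(state):
                    if state[s] in ('DC', 'DB') and seen[s] < n:
                        seen[s] += 1
                        if s <= t:
                            count[s] += 1
            # A dashed circle whose count reaches the threshold becomes a dashed box
            for s in state:
                if state[s] == 'DC' and count[s] >= min_count:
                    state[s] = 'DB'
            # Add new candidates whose immediate subsets are all boxes
            boxes = {s for s, tag in state.items() if tag in ('DB', 'SB')}
            for s in list(boxes):
                for i in items - s:
                    sup = s | {i}
                    if sup not in state and all(frozenset(c) in boxes
                                                for c in combinations(sup, len(sup) - 1)):
                        state[sup], count[sup], seen[sup] = 'DC', 0, 0
            # Itemsets counted over one full pass become solid
            for s in state:
                if state[s] in ('DC', 'DB') and seen[s] >= n:
                    state[s] = 'SB' if state[s] == 'DB' else 'SC'

        return [set(s) for s, tag in state.items() if tag == 'SB' and s]

    # Example usage: itemsets with an absolute support count of at least 2
    print(dic([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}], min_count=2))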

Generalized Sequential Pattern Algorithm (GSP). A sequence database is formed of ordered elements or events. The GSP algorithm uses the horizontal data format, and candidates are generated from frequent sequences and pruned in an Apriori-like manner.

  • STEP 1: At the beginning, each item in the database is a candidate sequence of length 1.

  • STEP 2: For each level (i.e., sequences of length k) do:

    1. Scan the database to gather the support count for every candidate sequence.

    2. Generate candidate sequences of length (k + 1) from the frequent sequences of length k using the Apriori property.

  • STEP 3: Repeat until no frequent sequence or no candidate can be found [5].
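The following Python sketch illustrates the level-wise loop above for the simplified case where every sequence element is a single item; the candidate prune step is omitted and the names are illustrative.

    from itertools import product

    def is_subsequence(candidate, sequence):
        # True if `candidate` occurs in `sequence` in order (not necessarily contiguously)
        it = iter(sequence)
        return all(item in it for item in candidate)

    def gsp(sequences, min_support):
        # Simplified GSP for sequences whose elements are single items
        items = sorted({i for s in sequences for i in s})
        candidates = [(i,) for i in items]       # STEP 1: every item is a 1-sequence
        frequent = []
        while candidates:
            # STEP 2.1: scan the database to gather support counts
            counts = {c: sum(is_subsequence(c, s) for s in sequences) for c in candidates}
            level = [c for c, n in counts.items() if n >= min_support]
            frequent.extend(level)
            # STEP 2.2: join frequent k-sequences whose (k-1)-suffix/prefix match
            # to form candidate (k+1)-sequences (the prune step is omitted)
            candidates = [a + (b[-1],) for a, b in product(level, level) if a[1:] == b[:-1]]
        return frequent                          # STEP 3: stop when nothing remains

    # Example usage (illustrative data)
    print(gsp([("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d")], min_support=2))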

Sequential PAttern Discovery using Equivalence classes (SPADE). This is an algorithm for frequent sequence mining that uses a vertical ID-list database format, where each sequence is associated with a list of the objects (sequence and event identifiers) in which it appears. Frequent sequences can then be found efficiently using intersections on the ID lists. The procedure lowers the number of database scans and hence also lowers the execution time.

  • STEP 1: The 1-sequences, i.e., sequences consisting of a single item, are counted in a single database scan.

  • STEP 2: For counting the 2-sequences, convert the vertical representation into a horizontal representation in memory and count the number of sequences for each pair of items using a two-dimensional matrix. This step can therefore also be performed in a single scan.

  • STEP 3: Subsequent n-sequences are formed by joining (n − 1)-sequences using their ID lists. The size of an ID list is the number of sequences in which the pattern occurs; if this number is at least minsup, the sequence is frequent.

  • STEP 4: If no frequent sequences available, the algorithm stops.

The algorithm can use a breadth-first or a depth-first search procedure to discover new sequences [6].
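As a small illustration of the vertical format, the following Python sketch builds the ID lists and performs the temporal join used to count 2-sequences, again assuming single-item elements; the full breadth-first/depth-first enumeration of longer sequences is not shown and the names are illustrative.

    from collections import defaultdict

    def vertical_format(sequences):
        # Map each item to its ID list: the (sid, eid) pairs where it occurs
        idlists = defaultdict(set)
        for sid, seq in enumerate(sequences):
            for eid, item in enumerate(seq):
                idlists[item].add((sid, eid))
        return idlists

    def temporal_join(idlist_a, idlist_b):
        # ID list of the 2-sequence "a followed by b": keep b's occurrences that
        # appear after some occurrence of a within the same sequence
        return {(sid_b, eid_b)
                for (sid_a, eid_a) in idlist_a
                for (sid_b, eid_b) in idlist_b
                if sid_a == sid_b and eid_a < eid_b}

    def frequent_2_sequences(sequences, min_support):
        # Support is the number of distinct sequence ids in the joined ID list
        idlists = vertical_format(sequences)
        result = {}
        for a in idlists:
            for b in idlists:
                joined = temporal_join(idlists[a], idlists[b])
                support = len({sid for sid, _ in joined})
                if support >= min_support:
                    result[(a, b)] = support
        return result

    # Example usage (illustrative data)
    print(frequent_2_sequences([("a", "b", "c"), ("a", "c"), ("b", "a", "c")], 2))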

Scalable PaRallelizable INduction of decision Trees (SPRINT) Algorithm. This algorithm builds a model of the classifying attribute based upon the other attributes. The input is a training set of records, each having several attributes, which are either continuous or categorical.

The SPRINT algorithm removes the memory restrictions of the SLIQ algorithm. It is also fast and scalable and can be easily parallelized.


Original Call: Division(Training Dataset)

The prune step is done using the SLIQ algorithm [7].
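A minimal Python sketch of a SPRINT-style recursive partition is given below, assuming numeric attributes and the Gini index as the split criterion; attribute lists, histograms, parallelization, and the SLIQ-based prune step are omitted, and all names are illustrative.

    def gini(labels):
        # Gini index of a list of class labels
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

    def best_split(records, labels, attr):
        # Best binary split "attr <= threshold" by weighted Gini of the two sides
        best = None
        for threshold in sorted({r[attr] for r in records}):
            left = [l for r, l in zip(records, labels) if r[attr] <= threshold]
            right = [l for r, l in zip(records, labels) if r[attr] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, threshold)
        return best

    def partition(records, labels, attrs):
        # Recursive SPRINT-style tree growth (pruning and attribute lists omitted)
        if len(set(labels)) <= 1 or not attrs:
            return {"leaf": max(set(labels), key=labels.count)}
        splits = {a: best_split(records, labels, a) for a in attrs}
        attr, split = min(((a, s) for a, s in splits.items() if s),
                          key=lambda x: x[1][0], default=(None, None))
        if split is None:
            return {"leaf": max(set(labels), key=labels.count)}
        threshold = split[1]
        left = [(r, l) for r, l in zip(records, labels) if r[attr] <= threshold]
        right = [(r, l) for r, l in zip(records, labels) if r[attr] > threshold]
        return {"attr": attr, "threshold": threshold,
                "left": partition(*map(list, zip(*left)), attrs),
                "right": partition(*map(list, zip(*right)), attrs)}

    # Original call, as in the text: partition(training records, labels, attribute names)
    tree = partition([{"x": 1.0}, {"x": 2.0}, {"x": 3.0}, {"x": 4.0}],
                     ["A", "A", "B", "B"], ["x"])
    print(tree)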

K-means Clustering Algorithm. K-means is an unsupervised learning algorithm that classifies a given data set into a certain number of clusters (say, s clusters) fixed a priori. Different initial cluster center positions lead to different outcomes, so we must define s centers, one per cluster, placed in a clever manner; placing them as far apart from each other as possible is a good choice. The algorithm tries to minimize the squared error function given by:

$$ J(X) = \sum_{i=1}^{s} \sum_{j=1}^{s_{i}} \left\| w_{j} - x_{i} \right\|^{2} $$

where ||wj − xi|| is the Euclidean distance between data point wj and cluster center xi, si is the number of data points in the ith cluster, and s is the number of cluster centers.

Let W = {w1, w2, …, wn} be the set of data points and X = {x1, x2, …, xs} be the set of centers.

  • STEP 1: Randomly select “s” cluster centers.

  • STEP 2: Calculate the distance between each data point and cluster centers.

  • STEP 3: Assign each data point to the cluster whose center is nearest.

  • STEP 4: Recalculate the new cluster center using:

$$ x_{i} = \frac{1}{s_{i}} \sum_{j=1}^{s_{i}} w_{j} $$

where si is the number of data points in the ith cluster and the wj are the data points currently assigned to that cluster.

  • STEP 5: Recalculate the distance between each data point and new obtained cluster centers.

  • STEP 6: If no data point was reassigned, then stop, otherwise repeat from STEP 3 [8, 9].
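The steps above translate directly into the following minimal NumPy sketch; the initial centers are chosen at random here purely for illustration, and the names are not taken from the cited sources.

    import numpy as np

    def kmeans(W, X, max_iter=100):
        # Basic K-means: W is an (n, d) array of data points, X an (s, d) array
        # of initial cluster centers.
        for _ in range(max_iter):
            # STEPs 2-3: distance from every point to every center, nearest wins
            distances = np.linalg.norm(W[:, None, :] - X[None, :, :], axis=2)
            assignment = distances.argmin(axis=1)
            # STEP 4: recompute each center as the mean of its assigned points
            new_X = np.array([W[assignment == i].mean(axis=0) if np.any(assignment == i)
                              else X[i] for i in range(len(X))])
            # STEPs 5-6: stop when the centers (and hence the assignment) stop changing
            if np.allclose(new_X, X):
                break
            X = new_X
        return X, assignment

    # Example usage with random initial centers (illustrative data)
    rng = np.random.default_rng(0)
    W = rng.normal(size=(100, 2))
    centers, labels = kmeans(W, W[rng.choice(len(W), size=3, replace=False)])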

2.3 Advantages and Disadvantages of the Algorithms

Table 2 presents a comparative study of the pros and cons of the above-mentioned algorithms.

Table 2 Advantages and disadvantages of popular algorithms

3 Proposed SKIK Algorithm

The K-means algorithm generates K clusters of the given data set, where every cluster is represented by a centroid, a concise summary of all the objects present in that cluster. The main flaws of the K-means algorithm are: (i) it is difficult to anticipate the number of clusters (the value of K) in advance, and (ii) the initial centroids have a strong effect on the final outcome. Here, we introduce a new algorithm, sifted K-means with independent K-value (SKIK), to overcome these issues.

In data mining, we work on very large data sets. We propose to first sort the data on any attribute chosen by the user. We use parallel heap sort [19] because it parallelizes the sorting across the cluster and thus exploits the available architecture.

Steps to find initial centroids:

  1. From the n objects, determine a point by taking their arithmetic mean. This is the first initial centroid.

  2. From the n objects, choose the next centroid as the object whose Euclidean distance from the already chosen centroids is the largest. Keep a count of the centroids.

  3. Repeat Step 2 until n ≤ 3 [20].

This procedure gives us the initial centroids, which we can use in the proposed algorithm to calculate the “optimal” centroids and K-value.
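A small NumPy sketch of this selection is given below, under one reading of Step 2 (each new centroid maximizes the minimum distance to the centroids already chosen) and with an assumed target count k in place of the stopping rule above; the names are illustrative.

    import numpy as np

    def initial_centroids(W, k):
        # Start from the arithmetic mean of all objects (Step 1), then repeatedly
        # take the point farthest from the centroids chosen so far (Step 2)
        centroids = [W.mean(axis=0)]
        while len(centroids) < k:
            dists = np.min(
                [np.linalg.norm(W - c, axis=1) for c in centroids], axis=0)
            centroids.append(W[np.argmax(dists)])
        return np.array(centroids)

    # Example usage (illustrative data): three initial centroids for later refinement
    rng = np.random.default_rng(0)
    W = rng.normal(size=(100, 2))
    print(initial_centroids(W, 3))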

Determination of K:

The K-means algorithm creates compact clusters so as to minimize the sum of squared distances from all points to their cluster centers. We can therefore use the distances of the points from their cluster centers to measure whether the clusters are compact. We thus adopt the inner-cluster distance, i.e., the distance between a point and its cluster center, and take the average of all of these (squared) distances, defined as

$$ D_{wc} = \frac{1}{N} \sum_{i=1}^{k} \sum_{w \in S_{i}} \left\| w - c_{i} \right\|^{2} $$

where N is the number of elements in the data set, k is the number of initial clusters (equal to the number of initially determined centroids), and ci is the center of cluster Si.

We can also measure the between-cluster distance. We take the minimum of the distance between cluster centers, defined as

$$ D_{bc} = \min \left( \left\| c_{i} - c_{j} \right\|^{2} \right), \quad \text{where}\; i = 1, 2, \ldots, k - 1 \;\text{and}\; j = i + 1, \ldots, k. $$

Now genuineness, G = Dwc/Dbc.

We want the inner-cluster distance to be small; since it appears in the numerator, this means the genuineness measure should be small. We also want the between-cluster distance to be large; since it appears in the denominator, this again means the genuineness measure should be small. Hence, the clustering with the lowest value of the genuineness measure gives us the “optimal” value of K for the K-means procedure [21].

We can also evaluate the “optimal” K using both inner-cluster and between-cluster scatter using the method proposed by Kim and Park [22].
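The following NumPy sketch computes Dwc, Dbc, and the genuineness measure for a given clustering and picks the K with the lowest value among a few candidates; it reuses the kmeans() and initial_centroids() sketches from earlier in this section, and the data and names are illustrative.

    import numpy as np

    def genuineness(W, labels, centers):
        # G = Dwc / Dbc for a given clustering of the points W
        N, k = len(W), len(centers)
        d_wc = sum(np.sum((W[labels == i] - centers[i]) ** 2) for i in range(k)) / N
        d_bc = min(np.sum((centers[i] - centers[j]) ** 2)
                   for i in range(k - 1) for j in range(i + 1, k))
        return d_wc / d_bc

    # Example: pick the K with the lowest genuineness among a few candidates
    rng = np.random.default_rng(0)
    W = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (0.0, 5.0, 10.0)])
    scores = {}
    for k in range(2, 6):
        centers, labels = kmeans(W, initial_centroids(W, k))
        scores[k] = genuineness(W, labels, centers)
    best_k = min(scores, key=scores.get)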

Steps of SKIK:

  1. Start.

  2. Load the data set.

  3. Sort the data using parallel heap sort.

  4. Find the initial centroids using the previously mentioned procedure.

  5. Determine K (the number of clusters) from the centroids.

  6. Calculate the distance between each data point and the cluster centers.

  7. Assign each data point to the cluster whose center is nearest.

  8. Recalculate the new cluster centers using:

     $$ x_{i} = \frac{1}{s_{i}} \sum_{j=1}^{s_{i}} w_{j} $$

     where si is the number of data points in the ith cluster.

  9. Recalculate the distance between each data point and the newly obtained cluster centers.

  10. If no data point was reassigned, then stop; otherwise repeat from Step 7.
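Putting the pieces together, the following is an end-to-end sketch of SKIK that reuses the helper sketches above; the serial np.argsort stands in for the parallel heap sort of [19], and the upper bound max_k on the K search is an assumption not present in the original description.

    import numpy as np

    def skik(W, max_k=10):
        # Step 3: sort the data on the first attribute (serial stand-in for the
        # parallel heap sort of [19])
        W = W[np.argsort(W[:, 0])]
        # Steps 4-5: candidate centroids for each K, keep the K with the lowest
        # genuineness measure
        best = None
        for k in range(2, max_k + 1):
            centers, labels = kmeans(W, initial_centroids(W, k))
            g = genuineness(W, labels, centers)
            if best is None or g < best[0]:
                best = (g, centers, labels)
        # Steps 6-10 were already run inside kmeans() for the chosen K
        return best[1], best[2]

    # Example usage (illustrative data)
    rng = np.random.default_rng(1)
    W = np.vstack([rng.normal(loc=m, size=(40, 2)) for m in (0.0, 4.0, 8.0)])
    centers, labels = skik(W)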

4 Complexity Measurement of SKIK

Sorting imposes an initial workload, but once done it reduces the subsequent computation time many fold.

The time complexity of sorting n data elements at Step 3 is [19]

$$ O(n \log n). $$

For Step 4, finding the initial centroids, the time complexity of segregating the n data items into k parts and finding the mean of each part is O(n). Thus, the total time complexity of discovering the initial centroids of a data set containing n elements and m attributes (where m is much smaller than n) is

$$ O(n \log n). $$

Step 5 is again a partitioning procedure with complexity [22]

$$ O(n \log n). $$

Steps 6–10 are the same as in the original K-means algorithm and hence take time

$$ O(nKR), $$

where n is the number of data points, K is the number of clusters, and R is the number of iterations. The algorithm converges in very few iterations because the initial centroids are computed in a way that reflects the data distribution.

So, the overall complexity of SKIK is

$$ O(n \log n) + O(n \log n) + O(n \log n) + O(nKR) = O(n \log n + nKR) = O\left( n(\log n + KR) \right). $$

5 Conclusion

This paper has provided a detailed comparison among six of the most popular data mining algorithms that have made significant contributions to high-performance cluster computation and artificial intelligence: Apriori, DIC, GSP, SPADE, SPRINT, and K-means. The paper presents concise algorithmic steps for these algorithms, explains their features, and lists their respective advantages and disadvantages. Several variations of the algorithms exist and have proved suitable for particular scenarios. At present, research is progressing on the data mining algorithms most applicable to parallel and high-performance cloud computing, such as SPRINT and K-means. We have proposed the SKIK algorithm to improve K-means for use with large data sets on an HPC architecture, and we have presented its complexity measurement. In-depth work remains to be done to extend the capabilities of SKIK and to complete a performance analysis against the other available variations of the K-means algorithm.