
1 Introduction

Clustering is the process of identifying classes of objects with similar characteristics. Data clustering segregates the similarities and differences in a database to form groups of related data, called classes or clusters. Unlike classification, clustering is an unsupervised learning method that uncovers the underlying structures and patterns of a given dataset. It is also known as automatic classification in the sense that each cluster of data objects can be treated as an implicit class, and its distinct advantage is that it finds the groupings automatically. Clustering methods fall mainly into the following categories: partitioning, hierarchical, density-based, and grid-based methods. The partitioning method is a popular heuristic approach that improves a segregation by moving objects from one group to another in search of a local optimum; k-means and k-medoids are the most commonly used partitioning methods. The hierarchical method decomposes the data objects into various levels of a hierarchy and has two approaches, agglomerative and divisive: the agglomerative approach builds the hierarchy bottom-up, whereas the divisive approach builds it top-down. The density-based method addresses the difficulty of finding arbitrary-shaped clusters by growing clusters on the basis of density: high-density areas become clusters, whereas sparse areas separate them. The grid-based method is one of the fastest clustering methods; it divides the object space into a number of cells that form a grid structure, and the processing time depends on the number of cells in each dimension of the quantized space [1]. Most clustering algorithms focus on numerical data. However, real data sets are largely mixed in nature, containing both numerical and categorical attributes. The major challenge in mixed data clustering is to find a single clustering solution for both data types with improved performance and reduced complexity.

In this paper, we propose an efficient clustering algorithm for mixed data, FPMC, which performs clustering after three phases: crack, transformation, and merge. In the crack phase, the full data set is divided into a nominal pack and a numerical pack. In the transformation phase, frequent patterns are mined to extract frequency-tokens, which are numerical substitutes for nominal values; attribute reduction is achieved by converting n nominal attributes into a single numerical attribute. The merge phase combines the output of the transformation phase with the numerical attributes, and the result is normalized before any numerical clustering algorithm is applied.

Section 2 reviews the existing algorithms for mixed data clustering and explains the advantages and disadvantages of each. Section 3 describes the proposed work in detail: the process flow, the implementation, and the validation measures used. Section 4 presents the experimental setup and the accuracy of the FPMC algorithm. Section 5 presents the conclusion and extensions of the proposed framework.

2 Related Work

Frequent pattern analysis has been a very active and focused area of research in data mining. It finds itemsets, subsequences, or substructures that appear frequently in a dataset. The frequency is measured in terms of the number of occurrences of a particular pattern in the dataset, with two important metrics: support and confidence. Support is the probability of occurrence, whereas confidence is the certainty of the occurrence of an event. In frequent pattern analysis, only patterns with frequency greater than or equal to a minimum support are retained. For example, consider the following patterns obtained from an application:

$$ \begin{aligned} \{I1, I2, I3\} &\Rightarrow 2 \\ \{I3, I4, I5\} &\Rightarrow 3 \\ \{I2, I3, I5\} &\Rightarrow 3 \\ \{I4, I5, I1\} &\Rightarrow 2 \\ \{I1, I3, I5\} &\Rightarrow 4 \end{aligned} $$

If the minimum support is set at 3, then the frequent patterns obtained are {I3, I4, I5}, {I2, I3, I5}, and {I1, I3, I5}. These are the only patterns that meet the general criterion, i.e., a support count greater than or equal to the minimum support count. The efficiency of association rule generation differs across the main frequent itemset mining algorithms, namely Eclat, Apriori, and FP-growth.
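To make the selection concrete, here is a minimal Python sketch (ours, not from the paper) that filters the example patterns above by the minimum support:

```python
# Candidate patterns and their support counts, taken from the example above.
patterns = {
    frozenset({"I1", "I2", "I3"}): 2,
    frozenset({"I3", "I4", "I5"}): 3,
    frozenset({"I2", "I3", "I5"}): 3,
    frozenset({"I4", "I5", "I1"}): 2,
    frozenset({"I1", "I3", "I5"}): 4,
}

MIN_SUPPORT = 3

# Keep only the patterns whose support count meets the minimum support.
frequent = {p: c for p, c in patterns.items() if c >= MIN_SUPPORT}

for pattern, count in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(pattern), "=>", count)
# {I1, I3, I5}, {I3, I4, I5}, and {I2, I3, I5} survive.
```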

The Eclat algorithm uses a vertical database layout, where each item is stored along with its cover (the set of transactions containing it), and computes support with an intersection-based approach. Its disadvantage is that it is suitable only for small data sets [2]. The Apriori algorithm is based on a level-wise search: it begins with single items and proceeds by extending candidates one item at a time, checking each candidate against the minimum support required by the application [3]. FP-growth solves the candidate-generation problem of Apriori by introducing a compact data structure, the FP-tree; however, its execution time can be large due to the complex data structure, which is also difficult to fit in main memory [4].
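To illustrate the level-wise idea behind Apriori, the following is a naive Python sketch (ours; it omits the subset-pruning step of the full algorithm): frequent k-itemsets are joined into (k+1)-candidates, and only candidates meeting the minimum support survive to the next level.

```python
def apriori(transactions, min_support):
    """Naive level-wise search: join frequent k-itemsets into
    (k+1)-candidates and keep those meeting min_support."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c for c in items if support(c) >= min_support}
    frequent = set(level)

    while level:
        k = len(next(iter(level)))  # all itemsets in a level share a size
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
    return frequent

tx = [{"I1", "I2", "I3"}, {"I1", "I3", "I5"},
      {"I1", "I3", "I5"}, {"I2", "I3", "I5"}]
print(apriori(tx, min_support=3))
# {I1}, {I3}, {I5}, {I1, I3}, {I3, I5} for this toy input
```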

Commonly used algorithms for mixed data are Ralambondrainy's method [5], k-prototypes [6], CLARA [7], and the cluster ensemble approach [8]. Ralambondrainy proposed an algorithm for mixed data sets that converts the categorical attributes into binary attributes and treats the binary values as numerical, so that k-means can be applied directly. The drawback of this approach is the space and computational cost, which grow with the number of binary attributes in a data set; another drawback is that the 'mean' of binary attributes does not characterize the clusters. The k-prototypes algorithm uses the squared Euclidean distance as the dissimilarity measure for numerical attributes ($S_r$) and the number of mismatches between two objects for categorical attributes ($S_c$). The combined dissimilarity measure is $S_r + \gamma S_c$, where $\gamma$ is a weight that balances the importance of the numeric and nominal attributes. It uses k-modes to update the categorical attributes, and all other procedures are the same as in k-means; since k-prototypes is derived from traditional k-means, it inherits the problems that k-means has. CLARA combines sampling with the PAM clustering program: it finds representative objects by k-medoids, so it can cluster categorical attributes, but it is inefficient for large data sets. The cluster ensemble approach splits the data set into nominal and numerical parts, applies clustering to each separately, and finally combines the clustering results, after which a numerical or nominal clustering method is applied to the combined result. Even though it avoids the problems of the other approaches, it has high clustering complexity, and the error rate is compounded because clustering is conducted three times.
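For reference, the combined k-prototypes measure $S_r + \gamma S_c$ described above can be written in a few lines; the data and the value of $\gamma$ below are purely illustrative.

```python
import numpy as np

def k_prototypes_dissimilarity(x_num, y_num, x_cat, y_cat, gamma=1.0):
    """Squared Euclidean distance on the numeric part plus gamma times
    the number of categorical mismatches."""
    s_r = float(np.sum((np.asarray(x_num) - np.asarray(y_num)) ** 2))
    s_c = sum(a != b for a, b in zip(x_cat, y_cat))
    return s_r + gamma * s_c

# Two illustrative mixed records: (numeric part, categorical part).
x = ([1.0, 2.5], ["red", "small"])
y = ([2.0, 2.0], ["blue", "small"])
print(k_prototypes_dissimilarity(x[0], y[0], x[1], y[1], gamma=0.5))
# S_r = 1.0 + 0.25 = 1.25; S_c = 1 mismatch; total = 1.25 + 0.5*1 = 1.75
```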

3 Proposed System

The frequent pattern-based framework for mixed data clustering (FPMC) algorithm addresses major challenges of existing clustering solutions, such as performance degradation and space and computational complexity, by avoiding repeated iterations and by optimizing through attribute reduction. Nominal attributes are replaced with counts of the frequent patterns mined from the data set. For this step, FPMC uses Apriori because of its ease of implementation and the simplicity of its data structure; Apriori is also suitable for large data sets.

The process flow of the FPMC algorithm is given in Fig. 1.

Fig. 1 FPMC flow of execution

The first phase of the algorithm is the crack stage, in which, after preprocessing, the full data set is separated into nominal and numerical parts. The crack stage produces two results: a nominal pack and a numerical pack. The nominal pack then undergoes the second phase, the transformation phase, where frequent pattern analysis is performed to obtain a frequency-token value for each frequent itemset. This value is obtained by scanning the nominal pack against the frequent patterns derived from the Apriori analysis and evaluating Eq. (1).

Let $P_1, P_2, \ldots, P_n$ be the frequent patterns obtained after the Apriori analysis:

$$ \text{RS}_{\text{value}} = \begin{cases} P_{i}, & \text{Count}{+}{+};\ \text{flag} = \text{Valid} \\ \lnot P_{i}, & \text{flag} = \text{Invalid} \end{cases} $$
(1)

where $P_i$ represents the $i$th frequent pattern and $\text{RS}_{\text{value}}$ is the result of the row-wise scan. If a match is found, i.e., $\text{RS}_{\text{value}} = P_i$, the count is incremented and the row is marked valid; otherwise the row is marked invalid.
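The crack and transformation phases can be sketched as follows; this is our reading of Eq. (1), and the function names, the containment test, and the choice of 0 as the token for invalid rows are our assumptions, not the paper's.

```python
def crack(rows, nominal_cols):
    """Crack phase: split each mixed row into its nominal values and its
    numeric values. `nominal_cols` holds the indices of nominal columns."""
    nominal_pack = [tuple(r[c] for c in nominal_cols) for r in rows]
    numeric_pack = [[v for c, v in enumerate(r) if c not in nominal_cols]
                    for r in rows]
    return nominal_pack, numeric_pack

def transform(nominal_pack, frequent_patterns):
    """Transformation phase, per Eq. (1): scan each row against the
    frequent patterns (given as frozensets); on a match, increment that
    pattern's count and emit it as the row's frequency-token, flagging
    the row Valid; otherwise flag it Invalid."""
    counts = {p: 0 for p in frequent_patterns}
    tokens, flags = [], []
    for row in nominal_pack:
        match = next((p for p in frequent_patterns if p <= set(row)), None)
        if match is not None:
            counts[match] += 1
            tokens.append(counts[match])
            flags.append("Valid")
        else:
            tokens.append(0)  # assumed placeholder for invalid rows
            flags.append("Invalid")
    return tokens, flags
```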

The third phase is the merge phase, where the numeric attributes and the frequency-token attribute are merged to form a completely numeric dataset. Normalization is then performed on the dataset for variance stabilization; normalization gives all attributes equal weight, which is particularly useful for the distance measures used in clustering. There are mainly three normalization techniques: min-max, z-score, and decimal scaling. Min-max normalization linearly maps the original data into a specified range: for a value $v$ of an attribute $A$ mapped from the range $[\min_A, \max_A]$ to a new range $[\text{new\_min}_A, \text{new\_max}_A]$, the computation is given by Eq. (2):

$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A $$
(2)

where $v'$ is the transformed value in the new range.

Z-score normalization is based on the mean and standard deviation and is also called zero-mean normalization. The formula is given in Eq. (3):

$$ d^{*} = \frac{{d - {\text{mean}}(P)}}{{{\text{std}}(P)}} $$
(3)

where $\text{mean}(P)$ is the mean of all values of attribute $P$ and $\text{std}(P)$ is their standard deviation. Decimal scaling normalization moves the decimal point according to the maximum absolute value of the attribute. The formula is given in Eq. (4):

$$ d^{*} = \frac{d}{10^{j}}, \quad \text{where } j \text{ is the smallest integer such that } \max(|d^{*}|) < 1 $$
(4)
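All three normalizations are simple to implement; the following NumPy sketch (ours, for illustration) mirrors Eqs. (2)-(4):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Eq. (2): linearly map values into [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Eq. (3): subtract the mean, divide by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Eq. (4): divide by 10^j, with j the smallest integer such that
    max(|d*|) < 1."""
    v = np.asarray(v, dtype=float)
    j = 0
    while np.abs(v / 10 ** j).max() >= 1:
        j += 1
    return v / 10 ** j

x = [120.0, 45.0, 980.0, 310.0]
print(min_max(x))          # values in [0, 1]
print(z_score(x))          # zero mean, unit standard deviation
print(decimal_scaling(x))  # divided by 1000, so max |value| < 1
```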

Z-score normalization is used in the FPMC algorithm as it takes the dispersion of the data set (standard deviation/variance) into account. After normalization, an efficient numerical clustering algorithm is applied. Pseudocode of FPMC is given in Table 1.

Table 1 FPMC Algorithm
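Since the body of Table 1 is not reproduced here, the following Python sketch strings the phases together in the order described above, as a reading aid; it reuses the crack() and transform() sketches from earlier in this section, and a library k-means stands in for "an efficient numerical clustering algorithm".

```python
import numpy as np
from sklearn.cluster import KMeans

def fpmc(rows, nominal_cols, frequent_patterns, k):
    """FPMC sketch: crack -> transform -> merge -> normalize -> cluster.
    crack() and transform() are the sketches given earlier."""
    # Crack: split into nominal and numerical packs.
    nominal_pack, numeric_pack = crack(rows, nominal_cols)

    # Transform: replace the nominal attributes with one frequency-token.
    tokens, _flags = transform(nominal_pack, frequent_patterns)

    # Merge: a single, fully numeric matrix (token as the last column).
    data = np.column_stack([np.asarray(numeric_pack, dtype=float),
                            np.asarray(tokens, dtype=float)])

    # Normalize: z-score per attribute (guarding constant columns).
    stds = data.std(axis=0)
    stds[stds == 0] = 1.0
    data = (data - data.mean(axis=0)) / stds

    # Cluster the numeric data; k-means is one possible choice.
    return KMeans(n_clusters=k, n_init=10).fit_predict(data)
```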

4 Experiments and Results

To evaluate the effectiveness of the FPMC algorithm, three real data sets of different types were taken into account. Accuracy is evaluated using the sum of squared errors (SSE) and the percentage of incorrectly clustered instances. The experiments consider two algorithms for the analysis: simple k-means and Cobweb. For simple k-means, SSE is used for data sets without class labels and the percentage of incorrectly clustered instances for data sets with class labels; the Cobweb algorithm is used to report the percentage of incorrectly clustered instances on data sets with class labels. We have taken three data sets with different characteristics for a better analysis: the automobile, labor, and post-operative patient datasets.
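Here SSE is the usual within-cluster sum of squared distances from each instance to its assigned centroid; a minimal sketch (ours):

```python
import numpy as np

def sse(data, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    data, labels = np.asarray(data, dtype=float), np.asarray(labels)
    return sum(np.sum((data[labels == c] - centroids[c]) ** 2)
               for c in range(len(centroids)))
```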

The automobile dataset has eighteen numeric and six nominal attributes. The post-operative patient dataset contains one numeric and eight nominal attributes; 64 of its instances represent patients sent to the general hospital floor, 24 represent patients prepared to go home, and two represent patients sent to the intensive care unit. The labor dataset contains eight numeric and eight nominal attributes. Together, the three data sets cover numeric-dominant, nominal-dominant, and balanced mixes of numerical and nominal attributes. Table 2 shows the SSE comparison results for the data sets with and without class labels; FPMC is compared with simple k-means. SSE validation cannot be used with Cobweb because it is not based on distance measures.

Table 2 Comparison results of SSE

Table 3 gives the comparison results of FPMC with simple k-means and Cobweb, using the percentage of incorrectly clustered instances as the validation measure. This validation technique is applicable only to data sets with class labels.

Table 3 Comparison results of percentage of incorrectly clustered instances
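One common way to compute this measure, and the one we assume here, is to map each cluster to its majority class and count the remaining instances as incorrectly clustered:

```python
from collections import Counter

def pct_incorrect(labels_true, labels_pred):
    """Map each cluster to its majority class label, then report the
    percentage of instances that disagree with that label."""
    wrong = 0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        majority = Counter(members).most_common(1)[0][0]
        wrong += sum(m != majority for m in members)
    return 100.0 * wrong / len(labels_true)
```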

5 Conclusions and Future Work

The main objective of clustering is to group similar instances of a data set on the basis of similarity measures. Although many distance measures are available, most of them apply either to numeric or to nominal data, while real-world data are usually mixed in nature, so these distance measures cannot be applied directly. For this reason, most algorithms for mixed data require partitioning the dataset into nominal and numeric parts, which increases the complexity and degrades the clustering result. In our proposed work we address this problem by transforming nominal data into numeric data. Our future plan is to improve performance using more efficient pattern generation and clustering algorithms on multidimensional datasets.