
1 Introduction

Similarity or distance between two objects plays a significant role in many data mining and machine learning tasks such as classification, clustering and outlier detection. In general, distance computation is a built-in step in these learning algorithms, and different distance measures can be plugged in conveniently. However, the choice of distance measure/metric usually has a significant influence on tasks such as classification and clustering.

For numerical data sets, distance computation is well understood, and commonly used measures such as the Minkowski and Mahalanobis distances can be applied. However, measuring similarity or distance is not straightforward for categorical data sets, as there is no explicit ordering of categorical values and it is difficult to determine how much one symbol differs from another. By categorical data we mean values that are nominal, e.g. \(color = \lbrace black, blue, green \rbrace \), or ordinal, e.g. \(size = \lbrace small, large, very large \rbrace \).

In this paper we study similarity/distance measures for categorical data objects and propose a hybrid approach based on Set Based Context Selection (SBCS) and distance computation over the selected context using the KL divergence method. In our Hybrid Approach (HA), the context of an attribute refers to the subset of attributes which gives a contextual interpretation of that attribute.

The proposed approach quantifies distance well in a supervised learning environment. In this paper we also focus on data-driven methods for selecting a good context for a given attribute, and we provide information-theoretic approaches for context selection for each attribute. The underlying assumption in our approach is that if the similarity function has a high value for the given context, then the objects represented by the given context description are similar. Recently, increasing attention has been paid to finding the grouping structure/classification of categorical data.

The rest of the paper is organized as follows: In Sect. 2 we discuss the state of the art in categorical similarity/dissimilarity measures. In Sect. 3 we present the theoretical details of the Hybrid Approach. In Sect. 4 we present the technical details, and in Sect. 5 we present the results of a set of experiments on low-dimensional UCI benchmark datasets.

2 Background

According to Michalski [16], the conventional measures of similarity are “Context-free” i.e. the distance between any two data objects X and Y is a function of these points only, and does not depend on the relationship of these points to other data points.

$$\begin{aligned} Similarity(\textit{X},\textit{Y}) = f(\textit{X},\textit{Y}) \end{aligned}$$

Context-free similarity measures may be inadequate in some clustering/classification problems. Hence recent approaches for finding similarity measures include “Context-sensitive” methods, where

$$\begin{aligned} Similarity(\textit{X},\textit{Y}) = f(\textit{X},\textit{Y},Context) \end{aligned}$$

Here the similarity between X and Y depends not only on X and Y, but also on the relationship of X and Y to other data points, represented by Context.

In a user driven approach context refers to the subset of attributes of interest and is application dependent.

Pearson’s [17] Chi-square statistic was often used in the early 1900s to test independence between categorical variables in a contingency table. Sneath and Sokal [19] were among the first to put all the similarity measures together. Conventional similarity measures binarize the attribute values, where bit 1 indicates the presence and bit 0 the absence of a possible attribute value; binary similarity measures are then applied to the resulting vectors. The major drawback is that transforming objects into binary vectors may discard many fine-grained insights into the dataset. The most commonly used similarity measure for categorical data is overlap, where matching values are assigned a similarity of 1 and mismatching values a similarity of 0. The drawback of this measure is that it does not distinguish between the different values taken by an attribute.
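
For concreteness, a minimal sketch of the overlap measure as just described (the attribute names and values are purely illustrative):

```python
def overlap_similarity(x, y):
    """Overlap similarity: each matching attribute value contributes 1, each mismatch 0."""
    return sum(1 for xi, yi in zip(x, y) if xi == yi) / len(x)

# Two illustrative objects described by (color, size, shape)
a = ("black", "small", "round")
b = ("black", "large", "round")
print(overlap_similarity(a, b))  # 2/3, regardless of *which* values differ
```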

In general, similarity measures for categorical data are categorized based on how they utilize the context of the given attributes. Several supervised and unsupervised measures exist in the literature for finding the similarity between categorical feature values.

  • The supervised similarity measures consider the class attribute information. These measures are further divided into learning and non-learning approaches. Learning approaches include IVDM [21], WVDM [21] and the Learned Dissimilarity measure [22]; non-learning approaches include VDM [20] and MVDM [7].

  • The unsupervised similarity measures do not consider the class attribute information. These measures are further classified into probabilistic, information-theoretic and frequency-based approaches.

    • Probabilistic approaches: Goodall [9], Smirnov [18], Anderberg [2].

    • Information theoretic approaches: Lin [15], Lin1 [4], Burnaby [6].

    • Frequency based approaches: OF [11], IOF [11].

Boriah et al. [4] classified the similarity measures based on how the similarity values are stored in the similarity matrix, on the parameters used to define the measure, and on how the frequencies of the attribute values are weighted.

Jierui Xie [22] proposed a learning algorithm which learns a dissimilarity measure by mapping each categorical value to a random number. The learning is guided by the classification error, for effective classification of the data points.

The taxonomy of various distance/similarity measures for categorical data is explored in [1].

Apart from these categories of similarity measures, recent attention has been paid to context-based approaches. Dino et al. [10] present a context-based approach to compute the distance between a pair of values of a categorical attribute. They proposed a distance learning framework for context selection and validated it in a hierarchical clustering algorithm.

Zeinab et al. [12] proposed a novel approach for distance learning based on context information; this method is also used to compute the dissimilarity between probability distributions. Both approaches use context information for dissimilarity computation.

We extend the current context based approaches by introducing a Hybrid Approach which utilizes information-theoretic measures. The proposed approach is explored in Sect. 3.

3 Proposed Hybrid Approach

In this section we present a Hybrid Approach for computing the distance between any pair of values of a categorical attribute. We also introduce the Set Based Context Selection (SBCS) algorithm for effective selection of the context, followed by computing the distance using the KL divergence method [13, 14]. Our Hybrid Approach is formulated in the following two steps.

  1. Set Based Context Selection: This step selects a subset of correlated features for a given attribute, i.e. a meta attribute set that is relevant to the given attribute in terms of an information-theoretic measure.

  2. Distance Computation: This step computes the distance between a pair of values of an attribute using the meta attribute set defined in the previous step. The KL divergence method is applied on the context to measure the difference between the probability distributions.

The essential premise in formulating this algorithm is fairness in the importance of the attributes: no weight or bias is given to any particular set of attributes unless it is explicitly captured by the context.

The notation used in this algorithm is as follows: Consider the set \( F = \lbrace A_{1},A_{2},\ldots A_{m}\rbrace \) of m categorical attributes and let \( D = \lbrace X_{1}, X_{2}\ldots X_{n} \rbrace \) be the set of instances. We denote by a lower case letter \(a_{i} \epsilon A_{i}\) a specific value of an attribute \(A_{i}\).

3.1 Set Based Context Selection

The selection of a good context is not trivial, when data are high dimensional. In the SBCS, we use mutual information normalized with joint entropy to get the context for each and every attribute.

The Set Based Context Selection algorithm we propose considers a score for each feature independently of the others. A useful set of features should not only be individually relevant but also non-redundant with respect to each other. The selection criterion for a context is a relevance index which quantifies whether a particular feature should be included in the context set or not.

In the following subsection we introduce some basic concepts of information-theoretic measures, followed by how they are utilized to tackle the problem of context selection.

3.2 Entropy and Mutual Information

The entropy of a random variable [8] is the fundamental measure of information, quantifying the amount of uncertainty present in the distribution of the random variable.

The entropy of a random variable \(A_{i}\) is defined as,

$$\begin{aligned} H(A_{i})=-\displaystyle \sum _{k\epsilon A_{i}}p(a_{k}^{i})log_{2}p(a_{k}^{i}) \end{aligned}$$
(1)

where \(p(a_{k}^{i})\) is the probability of value \(a_{k}\) of attribute \(A_{i}\).
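
As an illustration, Eq. 1 can be estimated from a single data column as in the following minimal sketch (the attribute values are illustrative):

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (Eq. 1) of a categorical attribute, in bits."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

color = ["black", "blue", "blue", "green", "black", "black"]
print(entropy(color))  # uncertainty in the empirical distribution of 'color'
```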

The entropy of a random variable can be conditioned on other variables. The conditional entropy of \(A_{i}\) given \(A_{j}\) is,

$$\begin{aligned} H(A_{i}|A_{j})=-\displaystyle \sum _{k\epsilon A_{j}}p(a_{k}^{j})\displaystyle \sum _{l\epsilon A_{i}}p(a_{l}^{i}|a_{k}^{j})log_{2}p(a_{l}^{i}|a_{k}^{j}) \end{aligned}$$
(2)

where \(p(a_{l}^{i}|a_{k}^{j})\) is the probability that \(A_{i}=a_{l}\) after observing the value \(A_{j}=a_{k}\). This can be interpreted as the amount of uncertainty present in \(A_{i}\) after observing the variable \(A_{j}\).

The amount of information shared between \(A_{i}\) and \(A_{j}\), also called the mutual information, is defined as,

$$\begin{aligned} I(A_{i};A_{j})=H(A_{i})-H(A_{i}|A_{j}) \end{aligned}$$
(3)

This difference between two entropies can be interpreted as the amount of uncertainty in \(A_{i}\) that is removed by knowing \(A_{j}\).

The mutual information between two attributes thus measures the average reduction in uncertainty about one attribute given knowledge of the other. A smaller value of mutual information indicates weaker dependence and a larger value indicates stronger dependence.
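
A minimal sketch of Eqs. 2 and 3, estimated directly from two paired attribute columns (the variable names and data are ours):

```python
import math
from collections import Counter

def entropy(column):
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def conditional_entropy(a_i, a_j):
    """H(A_i | A_j) as in Eq. 2, estimated from paired observations."""
    n = len(a_j)
    h = 0.0
    for v_j, c_j in Counter(a_j).items():
        # restrict A_i to the rows where A_j takes the value v_j
        subset = [v_i for v_i, w_j in zip(a_i, a_j) if w_j == v_j]
        h += (c_j / n) * entropy(subset)
    return h

def mutual_information(a_i, a_j):
    """I(A_i; A_j) = H(A_i) - H(A_i | A_j) as in Eq. 3."""
    return entropy(a_i) - conditional_entropy(a_i, a_j)

color = ["black", "blue", "blue", "green", "black", "black"]
size  = ["small", "large", "large", "small", "small", "small"]
print(mutual_information(color, size))
```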

The main drawback of this measure is that the mutual information value increases with the number of distinct values each attribute can take. To overcome this problem, Au et al. [3] proposed the interdependence redundancy measure, in which the mutual information is normalized by the joint entropy, defined as,

$$\begin{aligned} IDR(A_{i},A_{j}) = \frac{I(A_{i};A_{j})}{H(A_{i},A_{j})} \end{aligned}$$
(4)

where the joint entropy \(H(A_{i},A_{j})\) is calculated as,

$$\begin{aligned} H(A_{i},A_{j})=-\displaystyle \sum _{k\epsilon A_{i}}\displaystyle \sum _{l\epsilon A_{j}}p(a_{k}^{i},a_{l}^{j}) log_{2}p(a_{k}^{i},a_{l}^{j}) \end{aligned}$$
(5)

According to [3], the interdependence redundancy measure evaluates the degree of dependency between two attributes. Unlike mutual information, which is affected by the number of possible values an attribute can take, the interdependence measure is not affected by the number of distinct values taken by an attribute. Hence the IDR measure is a more suitable index for ranking the attributes in terms of dependency. \(IDR(A_{i},A_{j}) = 1 \) means that the attributes \(A_{i}\) and \(A_{j}\) are fully dependent on each other, while \(IDR(A_{i},A_{j}) = 0 \) indicates that the attributes are statistically independent. When the value of IDR lies between 0 and 1, the attributes are partially dependent. Using this IDR measure, we maintain an \(m \times m\) matrix IDR that stores the dependency degree of each pair of attributes.
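
The joint entropy and the IDR measure of Eqs. 4 and 5 can be estimated in the same way; a self-contained sketch (the data are illustrative, and the mutual information is computed here via \(I(A_{i};A_{j}) = H(A_{i})+H(A_{j})-H(A_{i},A_{j})\)):

```python
import math
from collections import Counter

def entropy(column):
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def joint_entropy(a_i, a_j):
    """H(A_i, A_j) as in Eq. 5, estimated from paired observations."""
    return entropy(list(zip(a_i, a_j)))

def idr(a_i, a_j):
    """Interdependence redundancy (Eq. 4): I(A_i; A_j) / H(A_i, A_j)."""
    h_ij = joint_entropy(a_i, a_j)
    mi = entropy(a_i) + entropy(a_j) - h_ij
    return mi / h_ij if h_ij > 0 else 0.0

color = ["black", "blue", "blue", "green", "black", "black"]
size  = ["small", "large", "large", "small", "small", "small"]
print(idr(color, size))  # 1 => fully dependent, 0 => statistically independent
```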

For each attribute \(A_{i}\) we find all the attributes that are interdependent with it and store them in a context set. In order not to unnecessarily increase the size of the context set, we introduce a threshold t so that only significant attributes are included.

$$\begin{aligned} context(A_{i}) = \left\{ A_{k}\,|\,IDR(A_{i},A_{k})>\textit{t},\ A_{i},A_{k}\epsilon F \right\} \end{aligned}$$
(6)

We conjecture that for lower values of the threshold t we may obtain higher classification accuracy, whereas for larger values of the threshold the classification accuracy may drop.
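
A sketch of the selection rule in Eq. 6, assuming an idr function such as the one sketched above and a dataset stored column-wise (the function and parameter names are ours):

```python
def select_context(columns, target, t, idr):
    """context(A_i) per Eq. 6: every other attribute whose IDR with A_i exceeds t.

    columns: dict mapping attribute name -> list of values
    target:  name of the attribute A_i whose context is being built
    t:       interdependence threshold
    idr:     function (list, list) -> float implementing Eq. 4
    """
    return {name for name, col in columns.items()
            if name != target and idr(columns[target], col) > t}

# Illustrative usage with a hypothetical column-wise dataset:
# columns = {"color": [...], "size": [...], "shape": [...]}
# context_of_color = select_context(columns, "color", t=0.1, idr=idr)
```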

The relationship between these quantities is explored in [5] and can be observed from Fig. 1.

Fig. 1 Relationship of various information-theoretic measures

To get the context of a given attribute we make use of these information-theoretic measures by adding/removing the relevant features in the context set.

3.3 Distance Computation

The distance between a pair of values \((x_{j},y_{j})\) of a categorical attribute \(A_{j}\) is formulated using the KL divergence method as,

$$\begin{aligned} d_{A_{j}}(x_{j},y_{j})=\displaystyle \sum _{A_{i}\epsilon context(A_{j})}\displaystyle \sum _{v_{i}\epsilon A_{i}}\left[ p(v_{i}|x_{j})log \frac{p(v_{i}|x_{j})}{p(v_{i}|y_{j})} + p(v_{i}|y_{j})log\frac{p(v_{i}|y_{j})}{p(v_{i}|x_{j})}\right] \end{aligned}$$
(7)

The distance defined above depends on the meta attribute set associated with each attribute, where the meta attribute set is derived from the SBCS algorithm. Since Eq. 7 uses the symmetrized form of the KL divergence between the two probability distributions, the distance between a pair of values of an attribute is symmetric.

Let X and Y be two instances of the dataset; the pairwise distances between attribute values computed above are then aggregated into the total distance as,

$$\begin{aligned} D(X,Y)=\sum _{j=1}^{m}d_{A_{j}}(x_{j},y_{j}) \end{aligned}$$
(8)
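
A minimal sketch of Eqs. 7 and 8, assuming the data are stored column-wise and the context of each attribute has already been selected; note that the small eps smoothing of zero probabilities is our addition and is not part of Eq. 7:

```python
import math
from collections import Counter

def conditional_dist(context_col, attr_col, value, eps=1e-9):
    """p(v_i | x_j): distribution of a context attribute restricted to the rows
    where the target attribute equals `value` (eps smoothing avoids log(0))."""
    subset = [c for c, a in zip(context_col, attr_col) if a == value]
    counts = Counter(subset)
    domain = set(context_col)
    total = len(subset) + eps * len(domain)
    return {v: (counts.get(v, 0) + eps) / total for v in domain}

def value_distance(columns, contexts, attr, x, y):
    """d_{A_j}(x, y) from Eq. 7: symmetrized KL divergence over the context of attr."""
    d = 0.0
    for ctx_attr in contexts[attr]:
        p = conditional_dist(columns[ctx_attr], columns[attr], x)
        q = conditional_dist(columns[ctx_attr], columns[attr], y)
        for v in p:
            d += p[v] * math.log(p[v] / q[v]) + q[v] * math.log(q[v] / p[v])
    return d

def object_distance(columns, contexts, X, Y):
    """D(X, Y) from Eq. 8: sum of per-attribute value distances.
    X and Y are dicts mapping attribute name -> value."""
    return sum(value_distance(columns, contexts, a, X[a], Y[a]) for a in X)
```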

4 Hybrid Approach Implementation

In this section we present the algorithmic details for the implementation of (i) Set Based Context Selection and (ii) Distance Computation of the Hybrid Approach.

In Algorithm 1 we propose interdependence-redundancy-based context selection for each attribute in the feature set. Initially the selected set S is empty and the unselected set US contains all features from the feature set. At line 6 the algorithm selects an attribute to be included in the context set based on the threshold t, and at line 8 it removes that attribute from the unselected set US. Once the context is chosen for a given attribute, the Distance Computation function computes the distance matrix between each pair of values of the attribute.

In Algorithm 2 the distance is computed between each pair of values of \(A_{i} \epsilon F\) using the context set derived from the first step of the Hybrid Approach.

The total distance between two objects is then calculated by using Eq. 8 defined in Sect. 3.3.

Algorithm 1. Set Based Context Selection (SBCS)
Algorithm 2. Distance Computation
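
Since the algorithm listings are provided as figures, the following is a rough reconstruction, under our assumptions, of how the two steps fit together; the exact bookkeeping of the selected set S and the unselected set US in Algorithm 1 may differ from ours:

```python
def sbcs(columns, t, idr):
    """Algorithm 1 (sketch): build a context set for every attribute.

    columns: dict mapping attribute name -> list of values
    t:       interdependence threshold
    idr:     function implementing Eq. 4 on two value lists
    """
    contexts = {}
    for target in columns:
        selected = set()                       # S: starts empty
        unselected = set(columns) - {target}   # US: all other features
        for cand in sorted(unselected):
            if idr(columns[target], columns[cand]) > t:
                selected.add(cand)             # include in the context set
                unselected.discard(cand)       # remove from US
        contexts[target] = selected
    return contexts

def value_distance_matrix(columns, contexts, attr, value_distance):
    """Algorithm 2 (sketch): distance between every pair of values of `attr`,
    using a value_distance function implementing Eq. 7."""
    values = sorted(set(columns[attr]))
    return {(x, y): value_distance(columns, contexts, attr, x, y)
            for x in values for y in values}
```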

5 Experimental Results

To evaluate the proposed Hybrid Approach, we compare it with the baseline similarity measures explored in Sect. 2 and with other classifiers. We present results on 5 benchmark categorical datasets taken from the UCI machine learning repository (Table 1).

Table 1 Dataset description
Table 2 Performance comparison with various similarity measures with kNN (k = 7)
Table 3 Performance comparison with various classifiers

5.1 Results Description

We compare our Hybrid Approach with the 5 similarity measures Overlap, Lin, Gambaryan, OF and IOF described in Sect. 2. We evaluate the classification accuracy of the nearest neighbour classifier (k = 7) with five-fold cross validation. The last row of Tables 2 and 3 gives the average performance over all the datasets. In summary, the proposed Hybrid Approach achieves the best rank on two datasets, namely Balance Scale and Hayes-Roth, and is best in average classification accuracy.
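
For reference, the precomputed object distances of Eq. 8 can be plugged into an off-the-shelf nearest-neighbour classifier; a toy sketch using scikit-learn's precomputed-metric option (the distance values are invented for illustration, and k = 1 is used only because the toy training set is tiny, whereas our experiments use k = 7):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy precomputed distances: in practice these come from Eq. 8.
D_train = np.array([[0.0, 2.0, 5.0],   # pairwise D(X, Y) between training objects
                    [2.0, 0.0, 4.0],
                    [5.0, 4.0, 0.0]])
y_train = np.array([0, 0, 1])
D_test = np.array([[1.0, 2.5, 6.0]])   # distances from one test object to the training objects

knn = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn.fit(D_train, y_train)
print(knn.predict(D_test))  # -> [0], the label of the closest training object
```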

We also compare our Hybrid Approach with algorithms implemented in Weka 3.6.10, including SVM, C4.5 and Naive Bayes (NB). Our method uses set based context selection with KL divergence as the distance measure, whereas the other methods use the Euclidean distance with simple matching between categorical objects. As shown in Table 3, the Hybrid Approach performs best on Tic-Tac-Toe and Hayes-Roth and is best in average performance across the 5 datasets.

6 Conclusions

In this paper we propose a hybrid approach to measure the similarity between categorical attribute values. The algorithm uses a Set Based Context Selection method inspired by information-theoretic measures. We tested our approach on five benchmark datasets from the UCI machine learning repository. The proposed approach gives promising results for low-dimensional datasets.

The proposed approach gives superior results for datasets with dimensionality approximately below 10. Context selection is very expensive to compute for high-dimensional datasets, which is a limitation of the proposed approach. In future work we will explore generalizations of dimensionality reduction techniques to categorical attributes so that the proposed approach can be combined with them, and we will also investigate the impact of the threshold parameter t on the proposed distance measure for large datasets.