
1 Introduction

Similarity or distance between two objects plays a significant role in many data mining and machine learning tasks such as classification, clustering and outlier detection. In general, distance computation is a built-in step in these learning algorithms, and different distance measures can be plugged in conveniently. However, the choice of distance measure/metric usually has a significant influence on tasks such as classification and clustering.

For numerical data sets, distance computation is well understood, and commonly used measures such as the Minkowski and Mahalanobis distances can be applied. However, measuring similarity or distance is not straightforward for categorical data sets, as there is no explicit ordering of categorical values and it is difficult to determine how much one symbol differs from another. By categorical data we mean values that are nominal, e.g. \(color = \lbrace black, blue, green \rbrace \), or ordinal, e.g. \(size = \lbrace small, large, very large \rbrace \).

In this paper we study similarity/distance measures for categorical data objects and propose a hybrid approach based on Set Based Context Selection (SBCS) and distance computation over the selected context using the KL divergence method. In our Hybrid Approach (HA), the context of an attribute refers to the subset of attributes which gives a contextual interpretation of that attribute.

The proposed approach quantifies distance well in a supervised learning environment. In this paper we also focus on data-driven methods for selecting a good context for a given attribute, and we provide information-theoretic approaches for context selection for each attribute. The underlying assumption in our approach is that if the similarity function has a high value for the given context, then the objects represented by the given context description are similar. Recently, increasing attention has been paid to finding the grouping structure/classification of categorical data.

The rest of the paper is organized as follows: In Sect. 2 we discuss the state of the art in categorical similarity/dissimilarity measures. In Sect. 3 we present the theoretical details of the Hybrid Approach. In Sect. 4 we present the technical details, and in Sect. 5 we present the results of a set of experiments on low-dimensional UCI benchmark datasets.

2 Background

According to Michalski [16], the conventional measures of similarity are “Context-free” i.e. the distance between any two data objects X and Y is a function of these points only, and does not depend on the relationship of these points to other data points.

$$\begin{aligned} Similarity(\textit{X},\textit{Y}) = f(\textit{X},\textit{Y}) \end{aligned}$$

Context-free similarity measures may be inadequate in some clustering/classification problems. Hence recent approaches for finding similarity measures include “Context-sensitive” methods, where

$$\begin{aligned} Similarity(\textit{X},\textit{Y}) = f(\textit{X},\textit{Y},Context) \end{aligned}$$

Here the similarity between X and Y depends not only on X and Y, but also on the relationship of X and Y to other data points, represented by Context.

In a user driven approach context refers to the subset of attributes of interest and is application dependent.

Pearson’s [17] Chi-square statistic was often used in the early 1900s to test independence between categorical variables in a contingency table. Sneath and Sokal [19] were among the first to put all the similarity measures together. Conventional similarity measures binarize the attribute values, where bit 1 indicates the presence and bit 0 the absence of a possible attribute value; binary similarity measures are then applied to the resulting vectors. The major drawback is that transforming objects into binary vectors may discard many fine-grained insights into the dataset. The most commonly used similarity measure for categorical data is overlap, where matching values are assigned a similarity of 1 and mismatching values a similarity of 0. The drawback of this measure is that it does not distinguish between the different values taken by an attribute.
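
For concreteness, a minimal sketch of the overlap measure as just described (the attribute names and values are purely illustrative):

```python
def overlap_similarity(x, y):
    """Overlap similarity: each matching attribute value contributes 1, each mismatch 0."""
    return sum(1 for xi, yi in zip(x, y) if xi == yi) / len(x)

# Two illustrative objects described by (color, size, shape)
a = ("black", "small", "round")
b = ("black", "large", "round")
print(overlap_similarity(a, b))  # 2/3, regardless of *which* values differ
```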

In general, similarity measures for categorical data are categorized based on how they utilize the context of the given attributes. Several supervised and unsupervised measures exist in the literature for finding the similarity between categorical feature values.

  • The supervised similarity measures consider the class attribute information. These measures are further divided into learning and non-learning approaches. Learning approaches include IVDM [21], WVDM [21] and the Learned Dissimilarity measure [22]; non-learning approaches include VDM [20] and MVDM [7].

  • The unsupervised similarity measures do not consider the class attribute information. These measures are further classified into probabilistic, information-theoretic and frequency-based approaches.

    • Probabilistic approaches: Goodall [9], Smirnov [18], Anderberg [2].

    • Information theoretic approaches: Lin [15], Lin1 [4], Burnaby [6].

    • Frequency based approaches: OF [11], IOF [11].

Boriah et al. [4] classified the similarity measures based on how the similarity values are stored in the similarity matrix, on the parameters used to define the measure, and on how the frequencies of the attribute values are weighted.

Jierui Xie [22] proposed a learning algorithm which learns a dissimilarity measure by mapping each categorical value to a random number. The learning is guided by the classification error, for effective classification of the data points.

The taxonomy of various distance/similarity measures for categorical data is explored in [1].

Apart from these categories of similarity measures, recent attention has been paid to context-based approaches. Dino et al. [10] present a context-based approach to compute the distance between a pair of values of a categorical attribute. They proposed a distance learning framework for context selection and validated it in a hierarchical clustering algorithm.

Zeinab et al. [12] proposed a novel approach for distance learning based on context information; this method is also used to compute the dissimilarity between probability distributions. Both approaches use context information for dissimilarity computation.

We extend the current context based approaches by introducing a Hybrid Approach which utilizes information-theoretic measures. The proposed approach is explored in Sect. 3.

3 Proposed Hybrid Approach

In this section we present a Hybrid Approach for computing the distance between any pair of values of a categorical attribute. We also introduce the Set Based Context Selection (SBCS) algorithm for effective selection of the context, followed by computing the distance using the KL divergence method [13, 14]. Our Hybrid Approach is formulated in the following two steps.

  1. Set Based Context Selection: This step selects a subset of correlated features for a given attribute, i.e. a meta attribute set that is relevant to the given attribute in terms of an information-theoretic measure.

  2. Distance Computation: This step computes the distance between a pair of values of an attribute using the meta attribute set defined in the previous step. The KL divergence method is applied on the context to measure the difference between the probability distributions.

The essential premise in formulating this algorithm is fairness in the importance of the attributes: no weight or bias is given to any particular set of attributes unless it is explicitly captured by the context.

The notation used in this algorithm is as follows: Consider the set \( F = \lbrace A_{1},A_{2},\ldots A_{m}\rbrace \) of m categorical attributes and let \( D = \lbrace X_{1}, X_{2}\ldots X_{n} \rbrace \) be the set of instances. We denote by a lower case letter \(a_{i} \epsilon A_{i}\) a specific value of an attribute \(A_{i}\).

3.1 Set Based Context Selection

The selection of a good context is not trivial, when data are high dimensional. In the SBCS, we use mutual information normalized with joint entropy to get the context for each and every attribute.

The Set Based Context Selection algorithm we propose considers a score for each feature independently of the others. A useful set of features should not only be individually relevant but also non-redundant with respect to each other. The selection criterion for a context is a relevance index which quantifies whether a particular feature should be included in the context set or not.

In the following subsection we introduce some basic concepts of information-theoretic measures, followed by how they are utilized to tackle the problem of context selection.

3.2 Entropy and Mutual Information

The entropy of a random variable [8] is the fundamental measure of information, quantifying the amount of uncertainty present in the distribution of the random variable.

The entropy of a random variable \(A_{i}\) is defined as,

$$\begin{aligned} H(A_{i})=-\displaystyle \sum _{k\epsilon A_{i}}p(a_{k}^{i})log_{2}p(a_{k}^{i}) \end{aligned}$$
(1)

where \(p(a_{k}^{i})\) is the probability of value \(a_{k}\) of attribute \(A_{i}\).
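
As an illustration, Eq. 1 can be estimated from a single data column as in the following minimal sketch (the attribute values are illustrative):

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (Eq. 1) of a categorical attribute, in bits."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

color = ["black", "blue", "blue", "green", "black", "black"]
print(entropy(color))  # uncertainty in the empirical distribution of 'color'
```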

The entropy of a random variable can be conditioned on other variables. The conditional entropy of \(A_{i}\) given \(A_{j}\) is,

$$\begin{aligned} H(A_{i}|A_{j})=-\displaystyle \sum _{k\epsilon A_{j}}p(a_{k}^{j})\displaystyle \sum _{l\epsilon A_{i}}p(a_{l}^{i}|a_{k}^{j})log_{2}p(a_{l}^{i}|a_{k}^{j}) \end{aligned}$$
(2)

where \(p(a_{l}^{i}|a_{k}^{j})\) is the probability that \(A_{i}=a_{l}\) after observing the value \(A_{j}=a_{k}\). This can be interpreted as the amount of uncertainty present in \(A_{i}\) after observing the variable \(A_{j}\).

The amount of information shared between \(A_{i}\) and \(A_{j}\), also called the mutual information, is defined as,

$$\begin{aligned} I(A_{i};A_{j})=H(A_{i})-H(A_{i}|A_{j}) \end{aligned}$$
(3)

This difference between two entropies can be interpreted as the amount of uncertainty in \(A_{i}\) that is removed by knowing \(A_{j}\).

The mutual information between two attributes thus measures the average reduction in uncertainty about one attribute given knowledge of the other. A smaller value of mutual information indicates weaker dependence and a larger value indicates stronger dependence.
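
A minimal sketch of Eqs. 2 and 3, estimated directly from two paired attribute columns (the variable names and data are ours):

```python
import math
from collections import Counter

def entropy(column):
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def conditional_entropy(a_i, a_j):
    """H(A_i | A_j) as in Eq. 2, estimated from paired observations."""
    n = len(a_j)
    h = 0.0
    for v_j, c_j in Counter(a_j).items():
        # restrict A_i to the rows where A_j takes the value v_j
        subset = [v_i for v_i, w_j in zip(a_i, a_j) if w_j == v_j]
        h += (c_j / n) * entropy(subset)
    return h

def mutual_information(a_i, a_j):
    """I(A_i; A_j) = H(A_i) - H(A_i | A_j) as in Eq. 3."""
    return entropy(a_i) - conditional_entropy(a_i, a_j)

color = ["black", "blue", "blue", "green", "black", "black"]
size  = ["small", "large", "large", "small", "small", "small"]
print(mutual_information(color, size))
```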

The main drawback of this measure is that the mutual information value increases with the number of distinct values each attribute can take. To overcome this problem, Au et al. [3] proposed the interdependence redundancy measure, in which the mutual information is normalized by the joint entropy, defined as,

$$\begin{aligned} IDR(A_{i},A_{j}) = \frac{I(A_{i};A_{j})}{H(A_{i},A_{j})} \end{aligned}$$
(4)

where the joint entropy \(H(A_{i},A_{j})\) is calculated as,

$$\begin{aligned} H(A_{i},A_{j})=-\displaystyle \sum _{k\epsilon A_{i}}\displaystyle \sum _{l\epsilon A_{j}}p(a_{k}^{i},a_{l}^{j}) log_{2}p(a_{k}^{i},a_{l}^{j}) \end{aligned}$$
(5)

According to [3], the interdependence redundancy measure evaluates the degree of dependency between two attributes. Unlike mutual information, which is affected by the number of possible values an attribute can take, the interdependence measure is not affected by the number of distinct values taken by an attribute. Hence the IDR measure is a more suitable index for ranking the attributes in terms of dependency. \(IDR(A_{i},A_{j}) = 1 \) means that the attributes \(A_{i}\) and \(A_{j}\) are fully dependent on each other, while \(IDR(A_{i},A_{j}) = 0 \) indicates that the attributes are statistically independent. When the value of IDR lies between 0 and 1, the attributes are partially dependent. Using this IDR measure, we maintain an \(m \times m\) matrix IDR that stores the dependency degree of each pair of attributes.
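
The joint entropy and the IDR measure of Eqs. 4 and 5 can be estimated in the same way; a self-contained sketch (the data are illustrative, and the mutual information is computed here via \(I(A_{i};A_{j}) = H(A_{i})+H(A_{j})-H(A_{i},A_{j})\)):

```python
import math
from collections import Counter

def entropy(column):
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def joint_entropy(a_i, a_j):
    """H(A_i, A_j) as in Eq. 5, estimated from paired observations."""
    return entropy(list(zip(a_i, a_j)))

def idr(a_i, a_j):
    """Interdependence redundancy (Eq. 4): I(A_i; A_j) / H(A_i, A_j)."""
    h_ij = joint_entropy(a_i, a_j)
    mi = entropy(a_i) + entropy(a_j) - h_ij
    return mi / h_ij if h_ij > 0 else 0.0

color = ["black", "blue", "blue", "green", "black", "black"]
size  = ["small", "large", "large", "small", "small", "small"]
print(idr(color, size))  # 1 => fully dependent, 0 => statistically independent
```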

For each attribute \(A_{i}\) we find all the attributes that are interdependent with it and store them in a context set. In order not to unnecessarily increase the size of the context set, we introduce a threshold t so that only significant attributes are included.

$$\begin{aligned} context(A_{i}) = \left\{ A_{k}\,|\,IDR(A_{i},A_{k})>\textit{t},\ A_{i},A_{k}\epsilon F \right\} \end{aligned}$$
(6)

We conjecture that for lower values of the threshold t we may obtain higher classification accuracy, whereas for larger values of the threshold the classification accuracy may drop.
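
A sketch of the selection rule in Eq. 6, assuming an idr function such as the one sketched above and a dataset stored column-wise (the function and parameter names are ours):

```python
def select_context(columns, target, t, idr):
    """context(A_i) per Eq. 6: every other attribute whose IDR with A_i exceeds t.

    columns: dict mapping attribute name -> list of values
    target:  name of the attribute A_i whose context is being built
    t:       interdependence threshold
    idr:     function (list, list) -> float implementing Eq. 4
    """
    return {name for name, col in columns.items()
            if name != target and idr(columns[target], col) > t}

# Illustrative usage with a hypothetical column-wise dataset:
# columns = {"color": [...], "size": [...], "shape": [...]}
# context_of_color = select_context(columns, "color", t=0.1, idr=idr)
```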

The relationship between these quantities is explored in [5] and can be observed from Fig. 1.

Fig. 1 Relationship of various information-theoretic measures

To get the context of a given attribute we make use of these information-theoretic measures by adding/removing the relevant features in the context set.

3.3 Distance Computation

The distance between a pair of values \((x_{j},y_{j})\) of a categorical attribute \(A_{j}\) is formulated using the KL divergence method as,

$$\begin{aligned} d_{A_{j}}(x_{j},y_{j})=\displaystyle \sum _{A_{i}\epsilon context(A_{j})}\displaystyle \sum _{v_{i}\epsilon A_{i}}\left[ p(v_{i}|x_{j})log \frac{p(v_{i}|x_{j})}{p(v_{i}|y_{j})} + p(v_{i}|y_{j})log\frac{p(v_{i}|y_{j})}{p(v_{i}|x_{j})}\right] \end{aligned}$$
(7)

The distance defined above depends on the meta attribute set associated with each attribute, where the meta attribute set is derived from the SBCS algorithm. Since Eq. 7 uses the symmetrized form of the KL divergence between the two probability distributions, the distance between a pair of values of an attribute is symmetric.

Let X and Y be two instances of the dataset; the pairwise distances between attribute values computed above are then aggregated into the total distance as,

$$\begin{aligned} D(X,Y)=\sum _{j=1}^{m}d_{A_{j}}(x_{j},y_{j}) \end{aligned}$$
(8)
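
A minimal sketch of Eqs. 7 and 8, assuming the data are stored column-wise and the context of each attribute has already been selected; note that the small eps smoothing of zero probabilities is our addition and is not part of Eq. 7:

```python
import math
from collections import Counter

def conditional_dist(context_col, attr_col, value, eps=1e-9):
    """p(v_i | x_j): distribution of a context attribute restricted to the rows
    where the target attribute equals `value` (eps smoothing avoids log(0))."""
    subset = [c for c, a in zip(context_col, attr_col) if a == value]
    counts = Counter(subset)
    domain = set(context_col)
    total = len(subset) + eps * len(domain)
    return {v: (counts.get(v, 0) + eps) / total for v in domain}

def value_distance(columns, contexts, attr, x, y):
    """d_{A_j}(x, y) from Eq. 7: symmetrized KL divergence over the context of attr."""
    d = 0.0
    for ctx_attr in contexts[attr]:
        p = conditional_dist(columns[ctx_attr], columns[attr], x)
        q = conditional_dist(columns[ctx_attr], columns[attr], y)
        for v in p:
            d += p[v] * math.log(p[v] / q[v]) + q[v] * math.log(q[v] / p[v])
    return d

def object_distance(columns, contexts, X, Y):
    """D(X, Y) from Eq. 8: sum of per-attribute value distances.
    X and Y are dicts mapping attribute name -> value."""
    return sum(value_distance(columns, contexts, a, X[a], Y[a]) for a in X)
```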

4 Hybrid Approach Implementation

In this section we present the algorithmic details for the implementation of (i) Set Based Context Selection and (ii) Distance Computation of the Hybrid Approach.

In Algorithm 1 we propose interdependence-redundancy-based context selection for each attribute in the feature set. Initially the selected set S is empty and the unselected set US contains all features from the feature set. At line 6 the algorithm selects an attribute to be included in the context set based on the threshold t, and at line 8 it removes that attribute from the unselected set US. Once the context is chosen for a given attribute, the Distance Computation function computes the distance matrix between each pair of values of the attribute.

In Algorithm 2 the distance is computed between each pair of values of \(A_{i} \epsilon F\) using the context set derived from the first step of the Hybrid Approach.

The total distance between two objects is then calculated by using Eq. 8 defined in Sect. 3.3.

Algorithm 1. Set Based Context Selection (SBCS)
Algorithm 2. Distance Computation
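
Since the algorithm listings are provided as figures, the following is a rough reconstruction, under our assumptions, of how the two steps fit together; the exact bookkeeping of the selected set S and the unselected set US in Algorithm 1 may differ from ours:

```python
def sbcs(columns, t, idr):
    """Algorithm 1 (sketch): build a context set for every attribute.

    columns: dict mapping attribute name -> list of values
    t:       interdependence threshold
    idr:     function implementing Eq. 4 on two value lists
    """
    contexts = {}
    for target in columns:
        selected = set()                       # S: starts empty
        unselected = set(columns) - {target}   # US: all other features
        for cand in sorted(unselected):
            if idr(columns[target], columns[cand]) > t:
                selected.add(cand)             # include in the context set
                unselected.discard(cand)       # remove from US
        contexts[target] = selected
    return contexts

def value_distance_matrix(columns, contexts, attr, value_distance):
    """Algorithm 2 (sketch): distance between every pair of values of `attr`,
    using a value_distance function implementing Eq. 7."""
    values = sorted(set(columns[attr]))
    return {(x, y): value_distance(columns, contexts, attr, x, y)
            for x in values for y in values}
```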

5 Experimental Results

To evaluate the proposed Hybrid Approach, we compare it with the baseline similarity measures explored in Sect. 2 and with other classifiers. We present results on 5 benchmark categorical datasets taken from the UCI machine learning repository (Table 1).

Table 1 Dataset description
Table 2 Performance comparison with various similarity measures with kNN (k = 7)
Table 3 Performance comparison with various classifiers

5.1 Results Description

We compare our Hybrid Approach with the 5 similarity measures Overlap, Lin, Gambaryan, OF and IOF described in Sect. 2. We evaluate the classification accuracy of the nearest neighbour classifier (k = 7) with five-fold cross validation. The last row of Tables 2 and 3 gives the average performance over all the datasets. In summary, the proposed Hybrid Approach achieves the best rank on two datasets, namely Balance Scale and Hayes-Roth, and is best in average classification accuracy.
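
For reference, the precomputed object distances of Eq. 8 can be plugged into an off-the-shelf nearest-neighbour classifier; a toy sketch using scikit-learn's precomputed-metric option (the distance values are invented for illustration, and k = 1 is used only because the toy training set is tiny, whereas our experiments use k = 7):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy precomputed distances: in practice these come from Eq. 8.
D_train = np.array([[0.0, 2.0, 5.0],   # pairwise D(X, Y) between training objects
                    [2.0, 0.0, 4.0],
                    [5.0, 4.0, 0.0]])
y_train = np.array([0, 0, 1])
D_test = np.array([[1.0, 2.5, 6.0]])   # distances from one test object to the training objects

knn = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn.fit(D_train, y_train)
print(knn.predict(D_test))  # -> [0], the label of the closest training object
```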

We also compare our Hybrid Approach with algorithms implemented in Weka 3.6.10, including SVM, C4.5 and Naive Bayes (NB). Our method uses set based context selection with KL divergence as the distance measure, whereas the other methods use the Euclidean distance with simple matching between categorical objects. As shown in Table 3, the Hybrid Approach performs best on Tic-Tac-Toe and Hayes-Roth and is best in average performance across the 5 datasets.

6 Conclusions

In this paper we propose a hybrid approach to measure the similarity between categorical attribute values. The algorithm uses a Set Based Context Selection method inspired by information-theoretic measures. We tested our approach on five benchmark datasets from the UCI machine learning repository. The proposed approach gives promising results for low-dimensional datasets.

The proposed approach gives superior results for datasets with dimensionality approximately below 10. Context selection is very expensive to compute for high-dimensional datasets, which is a limitation of the proposed approach. In future work we will explore generalizations of dimensionality reduction techniques to categorical attributes so that the proposed approach can be combined with them, and we will also investigate the impact of the threshold parameter t on the proposed distance measure for large datasets.