
1 Introduction

Real-world datasets often contain attributes that are irrelevant or redundant for the classification problem at hand. Such features can degrade performance and interfere with the learning mechanism, typically reducing the quality and generality of the discovered patterns/model and causing the model to overfit the training data. The basic principle of feature subset selection is to find the necessary and sufficient subset of features or attributes that simplifies the discovered knowledge model and improves its generalisation power, while at the same time not compromising the accuracy of the classification task.

Association rule mining, one of the most popular techniques for discovering interesting associations among data objects, has also been utilized for the classification task, where it can contribute to discovering strong associations between occurring attribute values and class values [26]. An associative classification framework was first proposed in [28], consisting of an algorithm to generate all class association rules, from which a classifier is constructed. Many works [10, 45, 49, 50] have developed various extensions and refinements of this initial framework and reported high accuracy and efficiency for the classification problem. Similarly, in the tree-structured data domain, the XRules structural classifier [52] is based on association rules generated by an ordered embedded subtree mining algorithm [51].

When dealing with pattern selection, one faces the quantity problem caused by the large volume of output, as well as the quality assurance problem of ensuring that rules reflect real, significant associations in the domain under investigation [25]. In a recent work presented in [24], the search space of Apriori-like algorithms is pruned so that discovered rules are interesting with respect to the Jaccard measure, rather than the support constraint for which an optimal threshold is often unknown. To deal with the quality problem, many interestingness measures have been developed and utilized in various knowledge discovery tasks [12, 29]. One line of thought holds that, since data mining techniques are data-driven by nature, the generated rules can often be effectively validated by a statistical methodology in order for them to be useful in practice [13, 22]. Interesting rules can then be interpreted as those rules that have a sound statistical basis and are neither redundant nor contradictory. The aforementioned works [12, 13, 22, 29] have mainly focused on relational data. There is relatively less work in this area when it comes to tree-structured data (an overview is given in the next section). Tree-structured data has underlying complex structural characteristics which typically need to be preserved in the knowledge patterns discovered during a data mining task [17, 52]. These structural characteristics pose difficulties for the application of traditional classifiers and interestingness measures, whose mechanisms typically do not take structural aspects of the data into account.

In [38], a unified framework was proposed that systematically combines several statistical/heuristic techniques to assess rule quality and remove redundant and unnecessary rules for the classification problem. In this chapter, the focus is on exploring the application of this framework to tree-structured data, enabled by the recently proposed structure-preserving flat data format for tree-structured data [14]. The work presented in [14] is based on the extraction of a database structure model (henceforth DSM), to which every tree instance from the database can be matched and which preserves the structural information in the generated flat representation. The implication of this representation, in contrast to the traditional tree mining field, is that every subtree pattern or rule is constrained by the pre-order positions of the constituent tree nodes of the subtree w.r.t. the DSM. In this work, we explore the application of a feature subset selection measure and statistical interestingness measures via this method to filter out unnecessary and irrelevant subtree patterns for the structural classification task. A feature subset selection method is applied prior to association rule generation. Once the initial set of rules is obtained, irrelevant rules are determined as those comprised of attributes not found to be statistically significant for the classification task. The experiments are performed using a real-world web access tree dataset and a property management dataset from a real estate company. The results indicate that where the dataset has a more standard structure, the use of statistical measures discards a large number of insignificant rules and at the same time increases the accuracy of the rule sets. At the other extreme, where the tree instances vary greatly in terms of structure and label distribution among nodes, as is the case in the web access tree dataset, many rules are removed and the accuracy increases, but there is a significant reduction in the coverage rate of the rule set. Furthermore, we compare some of the results with a structural classifier based on traditional subtrees, and highlight some important differences and implications when subtree-based rules are constrained by their position. The results also show that including associations that do not necessarily result in connected trees can be important, whereas such associations are typically ignored within the tree mining field. These findings indicate that a structural classifier could be improved and complemented by including disconnected subtrees and constraining the subtrees by their exact occurrence in the database. However, more work is required to identify the domains and applications where including such association rules can be beneficial, and the right way to combine them with traditional subtree patterns for optimal performance.

The rest of the chapter is organized as follows. Related work is discussed in Sect. 10.2, while Sect. 10.3 defines the concepts and the rule set optimization problem focused on in this study. In Sect. 10.4, we describe the steps involved in the proposed approach, which is evaluated using real-world datasets; the experimental findings are discussed in Sect. 10.5. Section 10.6 concludes the chapter.

2 Related Work

To date, limited work has been done on feature selection, rule evaluation and interestingness measures for tree-structured data. Many of the well-developed rule interestingness measures target relational data, where they have had great success in evaluating rule interestingness, as discussed in [44]. Several works on the evaluation of discovered patterns based on statistical significance [2, 22, 46] are likewise limited to relational data. The existence of a large body of well-developed techniques for evaluating the interestingness of rules from relational data offers great opportunities for adapting these techniques to verifying significant subtrees from semi-structured data. The applicability of these interestingness measures needs to be explored in the context of frequent subtree mining, where necessary adjustments and extensions need to be made to ascertain the validity of the methods given the more complex structural aspects in the data, which often need to be preserved in the rules.

One line of work aiming at more interesting subtree patterns reduces the number of patterns through the application of plausible constraints. The problem of mining mutually dependent ordered subtrees has been addressed in [32]. The proposed algorithm utilizes the hyperclique method [47] in the tree mining context so that all the components of a subtree are highly correlated with one another. These hyperclique subtree patterns are discovered using an h-confidence measure, which is the minimum probability of an item from a pattern in one transaction implying the presence of all other items in the same transaction. Hence, the extracted hyperclique subtree patterns will satisfy the minimum h-confidence threshold. The work done in [3] uses the method proposed for database compression in itemset mining [39] to demonstrate how the same minimum description length principle can yield good results for sequential and tree-structured data. The work presented in [31] extends the idea of the item constraint [41] to that of a node-inclusion constraint in subtrees. Furthermore, Knijf and Feelders [20] proposed the use of constraints in frequent subtree mining, namely monotone, anti-monotone, convertible and succinct constraints. Using these constraints, the frequent subtrees are mined using an opportunistic pruning strategy, and the set of frequent subtrees is reduced to only those satisfying the specific user pre-defined constraints.

Besides the aforementioned constraint-based techniques, to the best of our knowledge, there are limited works on verifying the significance of discovered frequent subtrees. Hashimoto et al. [19] proposed and developed an application of statistical hypothesis testing to re-rank the significant frequent subtrees. This approach ranks the significant patterns according to P-values obtained from Fisher's Exact test of significance. The significant patterns were then used for Glycan classification problems. Recently, Yan et al. [48] proposed a mining framework called LEAP (Descending Leap Mine) for checking and mining significant frequent subgraphs, which helps to discard redundant frequent subgraphs. For predicting a predefined class label of XML documents, the efficient XRules classifier has been proposed in [52]. This approach offers promising results in terms of a structural classifier for semi-structured data, but utilizes standard measures of interestingness based on support and confidence.

2.1 Relationship Between Feature Subset Selection and Frequent Subtree Interestingness

In general, the objective of feature subset selection as defined in [18] is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Han and Kamber [18] asserted that domain expertise can be employed to pick out useful attributes. However, because the data mining task involves large volumes of data whose behaviour during mining is hard to predict, doing so is often expensive and time consuming.

The test of statistical significance is one of the prominent approaches to evaluating attribute/feature usefulness. Stepwise forward selection, stepwise backward selection and a combination of both are three commonly used heuristic techniques utilized in statistical significance tests such as linear regression and logistic regression [18]. Moreover, correlation analysis such as the chi-square test is also valuable in identifying redundant variables for feature subset selection. Another powerful technique for this purpose is the Symmetrical Tau [54], a statistical-heuristic feature selection criterion that measures the capability of an attribute in predicting the class of another attribute. Additionally, information gain is another attribute relevance analysis method, employed in the popular ID3 [33] and C4.5 [34] as reported in [18]. An extensive overview and comparison of the different approaches to the feature subset selection problem has been provided in [6, 11, 21, 30].

While the original purpose of feature subset selection is to reduce the number of attributes to only those relevant for a certain data mining task, such measures can nevertheless be utilized to measure the interestingness of the rules/patterns generated. For example, if a rule/pattern consists of irrelevant attributes, the aforementioned measures can give an indication that the rule/pattern is not interesting. Moreover, [12] states that interestingness measures play three roles. The first is their ability to discard uninteresting patterns during the mining process, thereby narrowing the search space and improving mining efficiency. The second role is to calculate an interestingness score for each pattern, which allows the ranking of patterns according to specific needs. The final role is the use of interestingness measures during the post-processing stage to select interesting patterns. Interestingness measures such as the chi-square test [8], Symmetrical Tau [54] and Mutual Information [44] are capable of measuring the interestingness of rules and at the same time identifying useful features for frequent patterns.

Since frequent patterns are generated based solely on frequency without considering their predictive power, using frequent patterns without selecting appropriate features will still result in a huge feature space, which leads to a larger volume and complexity of rules. This may not only slow down the model learning process but, even worse, degrade the classification accuracy (another kind of overfitting, arising because the features are so numerous) [9].

3 Problem Background

The problem of finding association rules \(x\rightarrow y\) was first introduced in [1] as a data mining task for finding frequently co-occurring items in large databases. Let \(I=\{i_1, i_2,\ldots ,i_{|I|}\}\) be a set of items. Let \(D\) be a transaction database in which each record/transaction \(R\) is a set of items, such that \(R\subseteq I\). An association rule is an implication of the form \(x\rightarrow y\) where \(x\subseteq I\), \(y\subseteq I\) and \(x\cap y=\emptyset \). The absolute support of a rule \(x\rightarrow y\) is the number of transactions that contain both \(x\) and \(y\). Typically, the relative support is used: if the support of rule \(x\rightarrow y\) (denoted as \(\sigma (x\rightarrow y)\)) is \(s\) %, then \(s\) % of the transactions in \(D\) contain the itemsets \(x\) and \(y\); in other words, the probability \(P(x\cup y)=s\) %. An itemset is frequent if it satisfies the user-specified minimum support threshold. The confidence of a rule \(x\rightarrow y\) is the estimate of the conditional probability of a transaction containing the consequent (\(y\)) given that the transaction contains the antecedent (\(x\)), and is calculated as \(\sigma (x\rightarrow y)/\sigma (x)\).
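To make the support and confidence computations concrete, the following minimal Python sketch evaluates both measures over a toy transaction database; the item names and the small database are purely illustrative and not taken from the chapter's datasets.

def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of x -> y, i.e. sigma(x -> y) / sigma(x)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Toy transaction database: each transaction is a set of items
D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(support({"a", "c"}, D))       # 0.5  (2 of 4 transactions contain both a and c)
print(confidence({"a"}, {"c"}, D))  # ~0.667 (= 0.5 / 0.75)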

Association rule discovery finds all rules that satisfy specified constraints such as the minimum support and confidence thresholds, as is the case in the Apriori algorithm [1]. When tree-structured data such as XML is in question, the underlying associations are tree-structured by nature. Thus, the pre-requisite for the discovery of (structural) association rules becomes the task of frequent subtree mining. A tree-structured document can be modeled as a rooted ordered labeled tree. A rooted ordered labeled tree can be denoted as \(T =(v_0,V,L,E)\), where (1) \(v_0\in V\) is the root vertex; (2) \(V\) is the set of vertices or nodes; (3) \(L\) is a labelling function that assigns a label \(L(v)\) to every vertex \(v\in V\); (4) \(E=\{(v_1,v_2)|v_1, v_2\in V\) AND \(v_1\ne v_2\}\) is the set of edges in the tree, and (5) for each internal node, the children are ordered from left to right.

This problem is generally defined as: given a database of trees \(T_{db}\) and a minimum support threshold \(\sigma \), find all subtrees that occur at least \(\sigma \) times in \(T_{db}\). The most commonly considered subtrees are induced and embedded subtrees. The formal definitions of induced and embedded subtrees are as follows [42]: Given a tree \(S=(vs_0,V_S,L_S,E_S)\) and a tree \(T=(vt_0,V_T,L_T,E_T)\), \(S\) is an ordered induced subtree of \(T\) iff (1) \(V_S\subseteq V_T\); (2) \(L_S\subseteq L_T\), and \(L_S(v)=L_T(v)\); (3) \(E_S\subseteq E_T\); (4) the left-to-right ordering of sibling nodes in the original tree is preserved. Moreover, \(S\) is an ordered embedded subtree of \(T\) iff (1) \(V_S\subseteq V_T\); (2) \(L_S\subseteq L_T\), and \(L_S(v)=L_T(v)\); (3) if \((v_1,v_2)\in E_S\) then \(parent(v_2)=v_1\) in \(S\) and \(v_1\) is an ancestor of \(v_2\) in \(T\), and (4) the left-to-right ordering of sibling nodes in the original tree is preserved. If \(S=(vs_0,V_S,L_S,E_S)\) is an embedded subtree of \(T=(vt_0,V_T,L_T,E_T)\), and two vertices \(v_1\in V_S\) and \(v_2\in V_S\) form an ancestor-descendant relationship, the level of embedding (LoE) [42] between \(v_1\) and \(v_2\), denoted by \(\vartriangle (v_1,v_2)\), is defined as the length of the path between \(v_1\) and \(v_2\) in \(T\). Hence, a maximum level of embedding constraint (MaxLoE) \(M_\vartriangle \) can be imposed on the subtrees extracted from \(T\), such that any two connected nodes in an embedded subtree of \(T\) will be connected in \(T\) by a path of length at most \(M_\vartriangle \). Examples of induced and embedded subtrees are given in Fig. 10.1 (the numbers on the left of the nodes indicate their pre-order positions in the original tree \(T\)).

Fig. 10.1 Example of induced/embedded subtrees (\(T_1, T_2, T_4, T_6\)) and embedded subtrees (\(T_3, T_5\)) of tree \(T\)

In this chapter, the focus is on evaluating rules based on embedded and induced subtrees that satisfy the minimum support and confidence thresholds, and discarding any rules determined to be irrelevant to the classification task at hand. Let us denote the subtree patterns from the frequent subtree set \(SF\) that have a class label (value) as \(SFC\), their accuracy as \({ ac}({ SFC})\) and their coverage rate as \({ cr}({ SFC})\). The problem focused on in this work can then be generally defined as follows: given \(SFC\) with accuracy \({ ac}({ SFC})\) and coverage rate \({ cr}({ SFC})\), obtain \({ SFC}'\subseteq { SFC}\) such that \({ ac}({ SFC}') \ge ({ ac}({ SFC})-\varepsilon )\) and \({ cr}({ SFC}')\ge ({ cr}({ SFC})-\varepsilon )\) (\(\varepsilon \) is an arbitrary user-defined small value used to reflect the noise that is often present in real-world data).

Fig. 10.2 Example of a tree-structured database \(T_{db}\) consisting of 7 transactions

In what follows, we discuss the common way of representing trees, which lays the necessary groundwork for understanding the positional constraint imposed by the DSM approach [14]. A pre-order traversal can be computed as follows: if an ordered tree \(T\) consists only of a root node \(r\), then \(r\) is the pre-order traversal of \(T\). Otherwise, let \(T_1, T_2,\ldots , T_n\) be the subtrees occurring at \(r\) from left to right in \(T\). The pre-order traversal begins by visiting \(r\) and then traversing all the remaining subtrees in pre-order, starting from \(T_1\) and finishing with \(T_n\). The string encoding (\(\varphi \)) is generated by listing the vertex labels in the pre-order traversal of a tree \(T=(v_0, V, L, E)\) and appending a backtrack symbol (e.g., ‘/’, ‘/’ \(\notin L\)) whenever we backtrack from a child node to its parent node. Figure 10.2 and Table 10.1 depict a tree database consisting of 7 tree instances (or transactions) and the string encodings for this tree database, respectively.
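As a minimal illustration of the encoding, the following Python sketch produces the pre-order string encoding of a small tree represented as nested (label, children) pairs; it uses ‘-1’ as the backtrack symbol, matching the encodings used in the examples that follow, and the tree itself is only a toy example.

def string_encoding(label, children):
    """Pre-order string encoding of a rooted ordered labeled tree.
    A node is a (label, [children]) pair; '-1' marks a backtrack to the parent."""
    parts = [str(label)]
    for child_label, grandchildren in children:
        parts.append(string_encoding(child_label, grandchildren))
        parts.append("-1")
    return " ".join(parts)

# Root 'a' with children 'b' (a leaf) and 'c', where 'c' has a single child 'd'
toy_tree = ("a", [("b", []), ("c", [("d", [])])])
print(string_encoding(*toy_tree))   # 'a b -1 c d -1 -1'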

Table 10.1 Example of tree transactions

3.1 Feature Subset Selection

Feature subset selection is an important pre-processing step in the data mining process. If irrelevant attributes are left in the dataset, they can interfere with the data mining process and the quality of the discovered patterns may deteriorate, creating problems such as overfitting [9]. This is particularly the case in associative classifiers, since frequent patterns are typically generated without considering their predictive power [9], resulting in a huge feature space of possible frequent patterns. The removal of irrelevant attributes results in a much smaller dataset, thereby reducing the number of rules that need to be generated by the association rule mining algorithm, while closely maintaining the integrity of the original data [18]. Additionally, rules described with fewer attributes are also expected to perform better when classifying future cases; hence, they will have better generalization power than the more specific rules that take many attributes into account. Moreover, the patterns extracted will be simpler and easier to analyse and understand. Determining the relevant and irrelevant attributes poses a great challenge to many data mining algorithms [36].

The feature subset selection problem can be more formally described as follows. Given a relational database \(D\), let \({ AT}=\{at_1,at_2,\ldots ,at_{|{ AT}|}\}\) be the set of distinct items in \(D\), and \(Y=\{y_1,y_2,\ldots ,y_{|Y|}\}\) the class attribute with its set of class labels in \(D\). Let an association rule mining algorithm be denoted as \({ AR}_{{ AL}}\), the set of association rules for predicting the value of a class attribute \(Y\) from \(D\) extracted using \({ AR}_{{ AL}}\) as \({ AR}(D)\), and the accuracy of \({ AR}(D)\) as \({ ac}({ AR}(D))\). The problem of feature subset selection is to reduce \(D\) into \(D'\) such that \({ AT}'\subseteq { AT}\) and \({ ac}({ AR}(D'))\ge { ac}({ AR}(D))-\varepsilon \), where \(\varepsilon \) is an arbitrary user-defined small value to reflect noise present in real-world data. In other words, the task is to find the optimal set of attributes, \({ AT}_{{ OPT}}\subseteq { AT}\), such that the accuracy of the association rule set obtained using \({ AR}_{{ AL}}\) is maximized.

3.2 Modeling Tree-Structured Data

An example of three user sessions logged by an academic institution's web server is used here to illustrate the process of representing tree-structured data for data mining purposes. Table 10.3 shows an example of a string-to-integer mapping for the user sessions in Table 10.2. The mapping from string to integer can be done with a hash function as discussed in [51]. Representing a label as an integer instead of a string has considerable performance and space advantages [42].

Table 10.2 Example of user session
Table 10.3 Integer mapping for web pages from Table 10.2

As mentioned earlier, a common way of representing trees is to use the pre-order (depth-first) string encoding (\(\varphi \)) as described in [51]. For example, the pre-order string encodings of the underlying tree structures of the user navigation in Table 10.2 are \(\varphi \)(session 0) \(=\) ‘0 1 2 3 \(-1\) 4 \(-1\) \(-1\) 5 6 \(-1\) \(-1\) \(-1\) 7 8 9 \(-1\) \(-1\) \(-1\) 10 11 \(-1\) 12 \(-1\) \(-1\)’, \(\varphi \)(session 1) \(=\) ‘0 1 13 14 \(-1\) 15 \(-1\) \(-1\) 16 \(-1\) \(-1\) 17 \(-1\)’ and \(\varphi \)(session 2) \(=\) ‘0 1 18 19 \(-1\) \(-1\) \(-1\) 20 21 \(-1\) \(-1\) 7 22 \(-1\) \(-1\)’. The access sequence of web pages from Table 10.2 can be represented in a tree-structured way as shown in Fig. 10.3. The order of pages accessed is reflected by the pre-order traversal of the tree. The corresponding tree structure is more informative than just a sequence of pages accessed, as it captures the structure of the web site and the navigational patterns over this website, and the discovered knowledge patterns will as a result be more informative and useful, as already elaborated on in works such as [16, 17, 51, 52]. With this approach, specific pages can be considered within the same context. An example of this is the two pages grouped under the ‘centres and labs’ parent node with label 13 in the tree of session 1, and the two pages under the ‘research’ parent node with label 1 in the tree of session 0. Session 0 came from an IP within the university and is most likely an example of a student acquiring some general information about the institute and then seeking information related to postgraduate study. Session 1 also came from an IP internal to the university, where the user was interested in looking for jobs by browsing institutional centres and labs, and contacted the institute for more information. Session 2 may come from a potential external student who is searching for a potential supervisor by browsing some related conference papers and is interested in finding a research training program.

The integer-indexed tree is then formatted as shown in Table 10.4. This dataset format was proposed in [51]. Please note that the second column (\(cid\)) could be used to refer to a specific entity which the record describes (e.g. user id). However, in many domains such information is often unavailable, or it has been intentionally omitted or related through the transaction id (\(tid\)). Hence, in most of the tree databases represented in this format, the \(cid\) column will simply be a repetition of the \(tid\) column. This is the common format used in the frequent subtree mining field [17].

Fig. 10.3 Integer-indexed tree of user sessions in Table 10.2

3.3 Database Structure Model (DSM)

The definition given in [14] is utilized here to describe the Database Structure Model (DSM). Generally, the string-like representation of a tree database (an example is given in Table 10.4) is converted into a flat data format while preserving the ancestor-descendant and sibling node relationships. Henceforth, this structure-preserving flat data representation will simply be referred to as ‘table’. The header of the table contains the DSM without any specific attribute names. It represents only the most general structure, to which every instance from the tree database can be matched. This ensures that when the labels of a particular transaction from the tree database are processed, they are placed in the correct column, corresponding to the position in the DSM that this label matches. To illustrate the complete conversion process using DSM, please refer to Fig. 10.2. Using the string encoding format [51], the tree database \(T_{db}\) from Fig. 10.2 would be represented as shown in Table 10.1, where the left column corresponds to the transaction identifiers, and the right column is the string encoding of each tree instance.

In this example, the DSM is reflected in the structure of \(T_6\) in Fig. 10.2, and it becomes the header of the table to reflect the attribute names, as explained previously. The string encoding is used to represent this uniform structure and, since the order of the nodes (and backtracks (‘-1’)) is important, the nodes and backtracks are labeled sequentially according to their occurrence in the string encoding. For nodes (labels in the string encoding), \(x_i\) is used as the attribute name, where \(i\) corresponds to the pre-order position of the node in the tree, while for backtracks, \(b_j\) is used as the attribute name, where \(j\) corresponds to the backtrack number in the string encoding. Hence, from our example in Fig. 10.2 and Table 10.1, \(\varphi \)(DSM) \(=\) ‘\(x_0,x_1,x_2,x_3,b_0,x_4,b_1,b_2,b_3,x_5,x_6,x_7,b_4,x_8,b_5,b_6,b_7\)’.

Table 10.4 The integer-indexed trees in Fig. 10.3 formatted as a string-like representation as used in [51]

To fill in the remaining rows, every transaction from \(T_{db}\) is scanned and, when a label is encountered, it is placed in the matching column (i.e. under the matching node (\(x_i\)) in the DSM), and when a backtrack (‘-1’) is encountered, a value ‘1’ (or ‘yes’) is placed in the matching column (i.e. under the matching backtrack (\(b_j\)) in the DSM). The remaining entries are assigned values of ‘0’ (or ‘no’, indicating non-existence). The flat data format of \(T_{db}\) from Table 10.1 (and Fig. 10.2) is illustrated in Table 10.5.
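The conversion just described can be sketched in a few lines of Python. The sketch below parses the string encodings, names the DSM columns (\(x_i\) for nodes, \(b_j\) for backtracks) and fills one table row per instance; it assumes, for simplicity, that the \(k\)-th child of an instance node always matches the \(k\)-th child of the corresponding DSM node, which is a simplification of the matching procedure in [14].

def parse_encoding(encoding):
    """Parse a pre-order string encoding ('-1' = backtrack) into (label, [children])."""
    tokens = encoding.split()
    def build(i):
        node = (tokens[i], [])
        i += 1
        while i < len(tokens) and tokens[i] != "-1":
            child, i = build(i)
            node[1].append(child)
        return node, i + 1                     # skip the closing '-1'
    root, _ = build(0)
    return root

def dsm_columns(dsm_encoding):
    """Column names: x_i for the i-th node in pre-order, b_j for the j-th backtrack."""
    cols, n, b = [], 0, 0
    for tok in dsm_encoding.split():
        if tok == "-1":
            cols.append(f"b{b}"); b += 1
        else:
            cols.append(f"x{n}"); n += 1
    return cols

def flatten(dsm_encoding, instance_encoding):
    """Fill one flat-table row for an instance, matched positionally against the DSM."""
    row = {c: "0" for c in dsm_columns(dsm_encoding)}
    counters = {"x": 0, "b": 0}

    def visit(dsm_node, inst_node):
        x_col = f"x{counters['x']}"; counters["x"] += 1
        if inst_node is not None:
            row[x_col] = inst_node[0]          # place the instance label under x_i
        inst_children = inst_node[1] if inst_node else []
        for k, dsm_child in enumerate(dsm_node[1]):
            inst_child = inst_children[k] if k < len(inst_children) else None
            visit(dsm_child, inst_child)
            b_col = f"b{counters['b']}"; counters["b"] += 1
            if inst_child is not None:
                row[b_col] = "1"               # backtrack exists in the instance

    visit(parse_encoding(dsm_encoding), parse_encoding(instance_encoding))
    return row

# Toy DSM 'b c -1 d -1' (columns x0 x1 b0 x2 b1) and an instance missing the second child
print(flatten("b c -1 d -1", "b c -1"))
# {'x0': 'b', 'x1': 'c', 'b0': '1', 'x2': '0', 'b1': '0'}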

Table 10.5 Flat representation of \(T_{db}\) in Fig. 10.2 and Table 10.1

The conversion process can be formalized as follows. Let the tree database consisting of \(n\) transactions be denoted as \(T_{db}=\{tid_0,tid_1,\ldots ,tid_{n-1}\}\), and let the string encoding of the tree instance at transaction \(tid_i\) be denoted as \(\varphi (tid_i)\). The DSM is extracted from \(T_{db}\) using the procedure explained earlier. Further, let \(|\varphi (tid_i)|\) denote the number of elements in \(\varphi (tid_i)\), and \(\varphi (tid_i)_k\) (\(k=\{0,1,\ldots ,|\varphi (tid_i)|-1\}\)) denote the \(k\)th element (a label or a backtrack ‘-1’) of \(\varphi (tid_i)\). The flat data format or table \(F_T(C,R)\) (\(C=\) columns, \(R=\) rows) is set up where \(C=\{c_0,c_1,\ldots ,c_{m-1}\}\) (\(m =|C|=|\varphi (DSM)|\)), and \(R=\{r_0, r_1,\ldots , r_{p-1}\}\) (\(p=|R|=n+1\)) (i.e. an extra row for the attribute names). The value in column number \(x\) and row number \(y\) is denoted as \(F_T (c_x, r_y)\). Hence, to set the attribute names, \(F_T (c_i, r_0)=\varphi (DSM)_k\) where \(i=k=\{0,1,\ldots ,(|\varphi (DSM)|-1)\}\).

In addition, during the conversion process, as mentioned in [16], one can incorporate the minimum support threshold \(s\) so that the DSM captures only those structural characteristics that occur in at least \(s\) % of the tree database. Hence, in some cases only a fraction of a tree instance can be matched to the DSM due to low occurrence in the tree database, but the partial information still needs to be included in the resulting flat table. As an example, referring to the tree database \(T_{db}\) in Fig. 10.2 and Table 10.1, when mining subtrees with a minimum support threshold of 3 the resulting DSM would be ‘\(x_0\), \(x_1\), \(x_2\), \(x_3\), \(b_0\), \(b_1\), \(b_2\), \(x_4\), \(b_3\)’, and the new table is shown in Table 10.6.

Table 10.6 Flat representation of Tdb in Fig. 10.2 and Table 10.1 when minimum support \(= 3\)

3.4 Tree to Flat Conversion Example Using Academic Institution WebLogs Data

Referring to the Academic Institution WebLogs data example in Sect. 10.3.2, the pre-order string encoding of the tree database needs to be converted into a flat representation as proposed in [14]. The application of the DSM was described earlier in Sect. 10.3.3. In this section, an illustrative example is provided using the Academic Institution WebLogs example as a reference. The DSM is reflected in the structure of \(T_0\) in Table 10.4, and the corresponding tree is shown in Fig. 10.3 (Session 0). Transaction \(T_0\) thus provides the general structure of the DSM and becomes the header in Table 10.7, reflecting the attribute names. Every remaining transaction in \(T_{db}\) is matched against the DSM and every node label is placed in the matching column (i.e. under the matching node (\(x_i\)) in the DSM). The flat data format of \(T_{db}\) from Table 10.4 is illustrated in Table 10.7.

Table 10.7 Flat representation of an Academic Institution WebLogs \(T_{db}\) in Table 10.4

3.5 Representing Disconnected Trees w.r.t. DSM

As discussed earlier in Sect. 10.3.3, the rules from the DSM can be converted into the pre-order string encodings of subtrees, and hence represented as subtrees of the tree database. However, some rules may not represent valid subtrees. For example, it is possible that some items in a rule correspond to sibling nodes in the original tree, while the parent or any ancestor node connecting them in the original tree is not present in the rule discovered using the DSM approach. Hence, this results in an invalid subtree, as the nodes are disconnected. To address this matter, one can add the other nodes that make it a valid subtree but flag them as irrelevant. The process consists of sequentially listing the values of each matched node in the DSM, while retaining the level of embedding information of each current node in the DSM and in the subtree pattern. Since the DSM itself is ordered according to the pre-order traversal, this results in pre-order string encodings of the subtrees.
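The following Python sketch illustrates one way this rendering could be done: given the DSM encoding and a rule expressed as a mapping from matched DSM node positions to labels, it emits the pre-order encoding of the smallest connected subtree containing those positions, flagging the connector nodes that are not part of the rule with ‘*’. It is a simplified illustration rather than the exact procedure of [14].

def rule_to_subtree(dsm_encoding, rule_items):
    """Render a rule (a map from DSM pre-order position to label, e.g. {3: 'phd-msc'})
    as a pre-order encoding; ancestors added only for connectivity are flagged with '*'."""
    tokens = dsm_encoding.split()

    # First pass: record the parent of every DSM node position.
    parent, stack, pos = {}, [], -1
    for tok in tokens:
        if tok == "-1":
            stack.pop()
        else:
            pos += 1
            parent[pos] = stack[-1] if stack else None
            stack.append(pos)

    # Keep every matched node plus the ancestors needed to connect it.
    keep = set()
    for p in rule_items:
        while p is not None and p not in keep:
            keep.add(p)
            p = parent[p]

    # Second pass: emit the pre-order encoding restricted to the kept positions.
    out, stack, pos = [], [], -1
    for tok in tokens:
        if tok == "-1":
            if stack.pop():
                out.append("-1")
        else:
            pos += 1
            kept = pos in keep
            stack.append(kept)
            if kept:
                out.append(rule_items.get(pos, "*"))   # '*' = connector node, not in the rule
    return " ".join(out)

# Rule items at positions 2 and 3 are siblings; their parent (position 1) and the root
# (position 0) are added as '*' connector nodes to obtain a valid, connected subtree.
print(rule_to_subtree("r a b -1 c -1 -1 d -1", {2: "scholarships", 3: "management"}))
# '* * scholarships -1 management -1 -1'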

As a simple illustrative example, consider the following associations/patterns extracted from the Academic Institution WebLogs data:

\({{\varvec{P}}}_{{\varvec{1}}}\): business-intelligence human-space-computing phd-msc,

\({{\varvec{P}}}_{{\varvec{2}}}\): scholarships management phd-a-msc.

With respect to pattern \(P_1\) in Fig. 10.4 and pattern \(P_2\) in Fig. 10.5, the items (nodes) in each rule correspond to sibling nodes in the original tree, while the parent or any ancestor node connecting them in the original tree is not present in the rule. Hence, each rule on its own corresponds to an invalid subtree, as the nodes are disconnected. This is illustrated in both Figs. 10.4 and 10.5, where the irrelevant nodes are shaded grey. One can also choose to display the labels of the nodes that are there to contextualize the information, i.e. scholarships, management and phd-a-msc, which essentially contextualizes the specific rule constraints within the tree. In this work, these rules are referred to as FullTree rules.

Fig. 10.4 Displaying pattern (\(P_1\)) w.r.t. DSM in Table 10.7

Fig. 10.5 Displaying pattern (\(P_2\)) w.r.t. DSM in Table 10.7

4 Method and Experimental Setup

The method used here is the integration of the rule optimization framework presented in [37, 38] with the structure-preserving flat representation of tree-structured data presented in [14], which as a result allows the direct application of standard statistical measures to tree-structured data. Figure 10.6 shows the proposed framework, which in itself describes the experimental process. The database structure model (DSM) [14] is extracted from the tree-structured data/XML documents to preserve the structural characteristics of the data. The extracted DSM is used to create the flat representation of the tree-structured data (shown in Fig. 10.6 within the square dashed region). An example of the conversion process is given in Sect. 10.3. Once the tree-structured dataset has been converted to a flat table format (FDT), the dataset is divided into two parts. The first part is used for frequent pattern mining, statistical evaluation and the rule filtering process, while the second part acts as sample data drawn from the dataset used to verify the accuracy and coverage of the discovered rules. In the pre-processing phase, missing values are handled using common distribution-based missing value imputation [27], and an equal-width binning approach is utilised to discretise the values of any continuous attributes. The equal-width binning approach groups the data into several buckets or bins of the same interval size. It is implemented in the following steps [35]: (1) calculate the range of the variable to be binned; (2) using the specified number of bins, calculate the boundary (width) of each bin; (3) using the specified boundaries, assign each value of the variable to a bin for each record (a small sketch is given below). The data partitioning, missing value imputation and discretization were performed using the SAS Enterprise Miner software (please refer to [35] for further detail on the use of the software for data pre-processing). Secondly, feature subset selection based on attribute ranking according to the Symmetrical Tau measure [54] of predictive capability is performed as described in [15].
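As a reference, the three binning steps above can be sketched in a few lines of Python (here with NumPy rather than SAS Enterprise Miner, so this is only an illustration of the idea, not of the actual tool used):

import numpy as np

def equal_width_bins(values, n_bins):
    """Equal-width binning: split the value range into n_bins intervals of equal width
    and assign each value the index of the interval it falls into."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()            # step 1: range of the variable
    edges = np.linspace(lo, hi, n_bins + 1)        # step 2: bin boundaries
    bins = np.digitize(values, edges[1:-1])        # step 3: assign each value to a bin
    return bins, edges

bins, edges = equal_width_bins([1, 3, 7, 8, 12, 20], n_bins=4)
print(edges)   # [ 1.    5.75 10.5  15.25 20.  ]
print(bins)    # [0 0 1 1 2 3]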

Fig. 10.6 Method and experimental setup

An association rule mining algorithm is utilized to discover frequent rules from the FDT, and a rule filtering process is performed based on a sequence of the chi-square test, logistic regression model selection, redundant rule removal (based on the minimum improvement redundant rule constraint [4]) and optional filtering based on a higher confidence threshold. The extracted association rules are mapped onto the DSM (by the pre-order position of each item) to re-generate the pre-order string encodings of subtrees, thereby representing them as subtrees of the tree database.

These rules may contain both valid and invalid (disconnected) subtrees, and we will refer to this rule set as FullTree. In addition, the rules based on embedded subtrees and the rules based on induced subtrees (the rule sets that exclude disconnected subtrees) have also been extracted from within this set. Finally, the rule accuracy and coverage rate are calculated for all rule sets at the different stages.

Tree-Structured Data Format Conversion: For given tree-structured data, the enumeration of all possible subtrees in a complete, non-redundant and efficient way is the major problem one needs to tackle [43]. A significant delay in the subtree pattern analysis and interpretation process may occur at lower support thresholds. Additionally, as a large number of frequent subtree patterns may be discovered, many of which may not be useful, one needs to filter out many of the irrelevant/uninteresting patterns.

The flat data format (relational or vectorial data) has proven effective with many well-established data mining techniques. Thus, an effective approach proposed in [14], known as the Database Structure Model (DSM), is utilized in this research to represent tree-structured data in a structure-preserving flat data format. This approach offers a way of preserving both the tree-structured and the attribute-value information. With the application of DSM, the structural characteristics are preserved during the data mining process. The rules extracted by the data mining application can be mapped onto the DSM to re-generate the pre-order string encodings of subtrees.

Let tree-structured data in flat table format (FDT) be denoted as a dataset \(D\), \(I=\{i_1,i_2,\ldots ,i_{|I|}\}\) the set of distinct items in \(D\), \(AT=\{at_1,at_2,\ldots ,at_{|AT|}\}\) the set of input attributes in \(D\), and \(Y=\{y_1,y_2,\ldots ,y_{|Y|}\}\) the class attribute with a set of class labels in \(D\). Assume that \(D\) contains a set of \(n\) records \(D=\{x_r,y_r\}_{r=1}^n\), where \(x_r\subseteq I\) is an item or a set of items and \(y_r\in Y\) is a class label; then \(|x_r|=|AT|\) and \(x_r=\{at_1val_r, at_2val_r,\ldots ,at_{|AT|}val_r\}\) contains the attribute names and corresponding values for record \(r\) in \(D\) for each attribute \(at\) in \(AT\). The training dataset is denoted as \(D_{tr}\subseteq D\), the testing dataset as \(D_{ts}\subseteq D\), and the filtered database after feature selection as \(D_{tr}'\), where \(I'\subseteq I\).

The rule sets extracted from the flat table format (FDT) satisfying the minimum support and confidence thresholds are denoted as \(F(A)\). Individual rules are denoted as \(fA\in F(A)\), of the form \(x\rightarrow y\), where \(x\) is the antecedent and \(y\) the consequent, \(\exists \{x_r, y_r\}\in D_{tr}',x\subseteq x_r,x_r=\{at_1val_r,at_2val_r,\ldots , at_{|AT|}val_r\}\) and \(y\in Y \) is a class label. For generating \(F(A)\), the SAS Enterprise Miner software was used.

Feature Subset Selection: The Symmetrical Tau (ST) measure [54] was derived from Goodman and Kruskal's Asymmetrical Tau measure of association for cross-classification tasks in the statistical domain. Zhou and Dillon [54] used the Asymmetrical Tau measure as a feature selection criterion during decision tree building, and found that it tends to favour attributes with more values. When the classes of an attribute A are increased by class subdivision, more is known about attribute A and the probability of error in predicting the class of another attribute B may decrease. On the other hand, attribute A becomes more complex, potentially causing an increase in the probability of error in predicting its category according to the category of B. This trade-off effect inspired Zhou and Dillon [54] to combine the two asymmetrical measures in order to obtain a balanced feature selection criterion which is, in turn, symmetrical. Note, however, that in the case of Boolean variables the symmetrical and asymmetrical tau will have the same value. Some powerful properties of ST, as reported in [54], are: noise handling through its built-in statistical strength; conveying potential classification uncertainties through dynamic error estimation; no bias towards multi-valued attributes; independence of the sample size; a proportional-reduction-in-error nature that allows measuring sequential variation in predictive capability; and the handling of Boolean combinations of logical features.

Let there be \(R\) rows and \(C\) columns in the contingency table for attributes \(at_i\) and \(Y\). The probability that an individual belongs to row category \(r\) and column category \(c\) is represented as \(P(rc)\), and \(P(r+)\) and \(P(+c)\) are the marginal probabilities of row category \(r\) and column category \(c\), respectively. The measure is based on the probabilities of one attribute value occurring together with the value of the second attribute, and for the classification task the second attribute corresponds to the special attribute in the dataset defined as the class. The ST measure for the capability of input attribute \(at_i\) in predicting the class attribute \(Y\) is defined in [54] as follows.

$$\begin{aligned} Tau(at_i,Y) = \frac{\sum _{c=1}^C\sum _{r=1}^R\frac{P(rc)^2}{P(+c)}+\sum _{r=1}^R\sum _{c=1}^C\frac{P(rc)^2}{P(r+)}-\!\sum _{r=1}^{R}P(r+)^2-\!\sum _{c=1}^CP(+c)^2}{2-\sum _{r=1}^RP(r+)^2-\sum _{c=1}^CP(+c)^2} \end{aligned}$$
(10.1)

Higher values of the ST measure indicate better discriminating criteria (features) for the class that is to be predicted in the domain. As performed in [15], the attributes are ranked according to decreasing ST value and a relevance cut-off point is chosen, at and below which all attributes are considered irrelevant and are discarded. The relevance cut-off was selected based on a significant difference (less than half of the previous value in the ranking) between consecutive ST values in decreasing order. This prevents the generation of rules that would later need to be discarded when found to be comprised of irrelevant attributes. In accordance with [5], we have found that mutual information typically ranks attributes with more values higher than the ST measure does.
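A direct implementation of Eq. 10.1 from an \(R\times C\) contingency table is straightforward; the following Python sketch (with an arbitrary toy table) shows how the ST value could be computed when ranking attributes:

import numpy as np

def symmetrical_tau(contingency):
    """Symmetrical Tau (Eq. 10.1) from an R x C contingency table of attribute vs. class."""
    P = np.asarray(contingency, dtype=float)
    P = P / P.sum()                                   # joint probabilities P(rc)
    Pr = P.sum(axis=1)                                # row marginals P(r+)
    Pc = P.sum(axis=0)                                # column marginals P(+c)
    num = ((P**2) / Pc).sum() + ((P**2) / Pr[:, None]).sum() \
          - (Pr**2).sum() - (Pc**2).sum()
    den = 2 - (Pr**2).sum() - (Pc**2).sum()
    return num / den

# Toy table: a binary attribute vs. a binary class; higher Tau = better predictor
print(symmetrical_tau([[40, 10],
                       [ 5, 45]]))   # ~0.49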

Chi-square: A natural way to express the dependence between the antecedent and the consequent of an association rule is the correlation based on the chi-square test for independence [7]. The chi-square test is applied as follows: for a given \(D_{tr}'\), the occurrence of \(at_i\), where \(at_i\in AT\) \((i=1,\ldots ,|AT|)\), is independent of the occurrence of \(y_r\in Y\) if \(P(at_i\cup y_r)=P(at_i)P(y_r)\); otherwise \(at_i\) and \(y_r\) are dependent and correlated. The correlation between \(at_i\) and \(y_r\in Y\) is measured using the lift measure [40] in Eq. 10.2, and the chi-square (\(\chi ^2\)) statistic is used to determine whether the correlation is statistically significant.

$$\begin{aligned} lift(at_i,y_r) = \frac{P(at_i\cup y_r)}{P(at_i)P(y_r)} \end{aligned}$$
(10.2)

Hence, the chi-square test discards any \({ fA}_k\in F(A)\) for which there exists an \(at_i\) contained in \(x\) of \(x\rightarrow y\) whose \(\chi ^2\) value is not significant for \(y\in Y\) (correlation analysis in Eq. 10.2).
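A small sketch of this filtering criterion is given below; it computes the lift of Eq. 10.2 from co-occurrence counts and uses SciPy's chi-square test of independence on the corresponding 2x2 contingency table (the counts are invented for illustration):

from scipy.stats import chi2_contingency

def lift(n_xy, n_x, n_y, n):
    """Lift (Eq. 10.2): P(x and y) / (P(x) * P(y)), from co-occurrence counts."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))

def is_significant(n_xy, n_x, n_y, n, alpha=0.05):
    """Chi-square test of independence on the 2x2 attribute-value vs. class table."""
    table = [[n_xy,       n_x - n_xy],
             [n_y - n_xy, n - n_x - n_y + n_xy]]
    _, p, _, _ = chi2_contingency(table)
    return p < alpha

# Attribute value in 300 of 1000 records, class in 400, both together in 200
print(lift(200, 300, 400, 1000))            # ~1.67 > 1: positively correlated
print(is_significant(200, 300, 400, 1000))  # True: the correlation is significant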

Logistic Regression: Another form of statistical analysis applied is logistic regression. The relationship between the antecedent and consequent in association rule mining can be presented as a relationship between a target variable and the input variables in logistic regression. The logistic regression model involved in the framework is defined as follows. For a given \(D_{tr}'\), several logistic regression models were developed based on \(ln(Y)=\beta _0+\beta _1at_1+\beta _2at_2+\cdots +\beta _{|AT|}at_{|AT|}+e\), where \(ln(Y)\) is the natural logarithm of the odds of the class, \(\beta _0,\beta _1,\ldots ,\beta _{|AT|}\) are the coefficients of the input attributes \(at_i\), \(e\) is the error variable and \(Y\) the dichotomous class attribute. The coefficient \(\beta _i\) of \(at_i\) is determined based on the log likelihood value given in Eq. 10.3, where \(at_ival_r\) denotes the value of attribute \(at_i\) occurring in record \(r\).

$$\begin{aligned} \beta _iat_i=\sum _{r=1}^{n}\{y_r ln[\pi (at_ival_r)]+(1-y_r)ln[1-\pi (at_ival_r)]\} \end{aligned}$$
(10.3)

A statistical hypothesis test is then used to determine whether the input attributes are significantly related to the class attribute. A number of models can be developed from the logistic regression analysis, and each produces a different selection of attributes. The model that fits the data well and has the highest predictive capability is selected. Hence, logistic regression is used to discard any \(fA_k\in F(A)\) for which there exists an \(at_i\) contained in \(x\) of \(x\rightarrow y\) whose \(\beta _iat_i\) value is not significant towards the class attribute \(Y\) (logistic regression analysis in Eq. 10.3).
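As an illustration of this step, the sketch below fits a logistic regression on synthetic binary attributes with statsmodels and inspects the Wald p-values of the coefficients; in the spirit of the framework, an attribute whose coefficient is not significant would cause the rules containing it to be discarded (the data and the 0.05 threshold here are purely illustrative).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
at1 = rng.integers(0, 2, size=200)                  # relevant attribute
at2 = rng.integers(0, 2, size=200)                  # irrelevant attribute
p = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * at1)))       # true model depends on at1 only
y = (rng.random(200) < p).astype(int)               # binary class labels

X = sm.add_constant(np.column_stack([at1, at2]))
result = sm.Logit(y, X).fit(disp=0)
print(result.params)     # fitted coefficients beta_0, beta_1, beta_2
print(result.pvalues)    # attributes with p >= 0.05 would be treated as insignificant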

Redundant and Contradictory Rule Removal: To remove redundant rules, we utilize the concept of productive rules [4]. This approach is based on the minimum improvement redundant rule constraint [4], which discards any rule \(x\rightarrow y\) if confidence\((x\rightarrow y)\le \) max(confidence\((z\rightarrow y)) \,\forall z\subset x \). In other words, a rule \(x\rightarrow y\) with confidence value \(c1\) is considered redundant if there exists another rule \(z\rightarrow y\) with confidence value \(c2\), where \(z\subset x\) and \(c1\le c2\). The contradictory rule constraint [53] is then utilised to discard two or more rules that have the same antecedent but imply a different class value.
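The two constraints can be sketched as follows in Python; each rule is an (antecedent, class, confidence) triple, and the sketch drops every member of a conflicting group under the contradictory-rule constraint (one possible reading of [53]), with toy rules used for illustration.

def is_redundant(rule, rules):
    """Minimum-improvement constraint: x -> y is redundant if a more general rule
    z -> y (z a proper subset of x) has at least the same confidence."""
    x, y, conf = rule
    return any(z < x and y2 == y and conf <= conf2 for (z, y2, conf2) in rules)

def filter_rules(rules):
    kept = [r for r in rules if not is_redundant(r, rules)]
    # Contradictory-rule constraint: drop rules sharing an antecedent
    # but predicting different class values.
    classes_per_antecedent = {}
    for x, y, _ in kept:
        classes_per_antecedent.setdefault(x, set()).add(y)
    return [r for r in kept if len(classes_per_antecedent[r[0]]) == 1]

rules = [(frozenset({"a"}), "c1", 0.80),
         (frozenset({"a", "b"}), "c1", 0.75),   # redundant: no improvement over {a} -> c1
         (frozenset({"d"}), "c1", 0.70),
         (frozenset({"d"}), "c2", 0.65)]        # contradicts {d} -> c1
print(filter_rules(rules))                      # keeps only ({'a'}, 'c1', 0.8)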

Rule Accuracy and Rule Coverage: A measure needs to be applied to verify that removing a large volume of rules, based on the statistical analysis and the redundancy and contradiction assessment methods, still retains all the interesting and significant subtree patterns. As such, the quality of the subtree patterns is demonstrated through their accuracy and coverage values. The values for rule accuracy and coverage are measured at every stage and sequence of this task. This measure is crucial as it can determine the quality of the discovered rules. Additionally, this analysis reveals the balancing/optimization issues with regard to the trade-off between the accuracy rate and the coverage rate.

5 Experimental Evaluation

In this section we present the experiments performed using the CRM dataset (real estate property management records in XML), the CSLOGS dataset (web access trees) and an academic institution dataset (web access trees), whose structural characteristics are shown in Table 10.8, using the following notation: \(|Tr|\)—number of transactions (independent tree instances); \(|L|\)—number of unique labels; \(|T|\)—number of nodes (size) in a transaction; \(|D|\)—depth; \(|F|\)—fan-out factor (or degree). Please note that in [52], where structural/XML classification was first proposed, it was demonstrated that a simpler classifier that does not take the structure into account cannot achieve equally good results. Similarly, in [51] it was empirically shown that tree-structured web-browsing patterns are more informative and useful than their itemset/sequential pattern counterparts. Hence, this study is not repeated in this work; rather, an experimental study is presented on the use of standard statistical techniques to reduce the huge number of rules typically generated during frequent subtree mining, in the context of associative classification. As such, the focus is on the use of basic accuracy and coverage rate rule evaluation measures to observe the gradual difference in the rule set accuracy and coverage as different feature/rule filtering techniques are applied.

Each dataset underwent conversion into a structure-preserving flat data format (henceforth FDT) using the DSM approach. The backtrack attribute information was kept in the DSM as this is important for preserving the structural information, and hence can be used to represent the resulting rules as trees/subtrees. The backtrack attributes can optionally be kept in the FDT, as when present in rules they indicate the existence/non-existence of a node irrespective of the label, as discussed in [16]. We have compared the results when rules are generated from itemsets including the backtrack attributes and without them, and the difference was not substantial enough to be worth reporting. Inclusion of the backtrack attributes typically resulted in slightly better results in terms of increased rule set coverage rate, and thus all experiments presented are done using this option. When reporting the results, the following notation will be used: ST—Symmetrical Tau; AR—accuracy rate; CR—coverage rate; \(FullTree\)—the initial rule set containing disconnected subtree and backtrack attribute based rules; \(Embedded\)—after itemsets have been mapped to the DSM (by pre-order positions) to generate valid connected subtrees; and \(Induced\)—only subtrees where the maximum level of embedding is limited to 1 (i.e. parent-child relationships among the nodes, see Sect. 10.3).

Table 10.8 Structural characteristics of the data

5.1 Experiment Set 1—CRM Data

CRM data is a real-world dataset relating to the handling of complaints in the area of real estate. Each complaint relates to a particular defect in a property, and a property manager will assign a case to each defect, containing information such as case managers, contractors, areas of defect, district and building type. The classification problem considered corresponds to “WorkCompletion”, with 2 possible values (within a month and more than a month duration). Attributes containing similar information or referring to work/task completion duration have been removed. The dataset consists of 1,181 instances with 675 attributes, of which 66 % was used for training and 34 % for testing. There are many complex classes within this CRM data which may interest the users of the data; nevertheless, as our main purpose is not to analyse the CRM problem itself, but to use the CRM data as an example of tree-structured data, attention is confined to the aforementioned class. The resulting DSM-based flat data format contains 675 attributes (including the class), of which 586 attributes were selected by the Symmetrical Tau (ST) feature selection. The rules are then generated based on a support of 5 % and a confidence of 50 %. Note that initially the dataset with backtrack attributes was used, which caused memory issues in the SAS software, and hence we applied the ST feature selection prior to generating association rules, which removed all of the backtrack attributes in this dataset. Furthermore, for this dataset, all subtrees generated are of induced type, and hence we do not report any results for the embedded subtree variation as it is identical to induced for this data.

Table 10.9 Subtree association rule evaluation for CRM data

Table 10.9 shows the results as the statistical analysis and the redundancy assessment have progressively been applied to evaluate the interestingness of rules. Note that the chi-square analysis is not presented as it did not result in any rule removal at that stage, and all of the connected subtrees in this dataset were of induced subtree type. As one can see, a significant number of rules was removed by applying the logistic regression analysis, and in the \(FullTree\) rule set a further 40 rules were detected as redundant. This reduced the AR by about 3 %, but after rules whose confidence was below 60 % had been removed (last row) the accuracy increased, at the cost of not covering around 5 % of the instances. In this experiment, the \(FullTree\) rule set is the optimal one, as it is not only more accurate in classifying/predicting specific instances in the database, but also achieves a higher coverage rate in the final step compared to the \(Induced\) rule set. The \(FullTree\) rule set can contain rules that do not convert to valid (connected) subtrees when matched to the DSM. Nevertheless, these are important to include as they may represent important associations that should not be lost simply because they do not convert to connected valid subtrees. Note that we have tried to run the XRules structural classifier [52] on this data, but since there are quite a few repeating node labels in single tree instances, caused by the repetition of defects and individual cases within a single record, the tree mining algorithm [51] on which XRules is based has difficulties in extracting subtrees at the required low support thresholds.

5.2 Experiment Set 2—CSLOGS Data

The CSLogs data comprises the web access trees of the computer science department of the Rensselaer Polytechnic Institute, previously used in [52] to evaluate the XRules structural classifier. All three datasets (US1924, US2430, and US304) were combined and instances were replicated (in both training and test data) to make the class distribution even. The tree instances are labelled according to two classes, namely internal and external web site access. The total number of combined instances is 68,302. The training set comprised 66 % of the data and the remainder was left as the test set. Since different support thresholds were used, in our approach the flat data representation of the dataset is created separately for each support threshold, as the extracted database structure model (DSM) varies, and hence so does the number of attributes used during frequent pattern generation. The general characteristics of the flat data format (including backtrack attributes) and the initial number of rules extracted for the CSLogs data (50 % minimum confidence) at varying support thresholds are provided in Table 10.10. Note that when using association rules for the classification task, it is natural that performance will vary depending on the support threshold used. Hence, different support thresholds were tried from a larger to a smaller extreme, and, as expected, for larger support thresholds there is a trade-off in limited coverage, as only the very frequent subtrees will be extracted to form part of the model.

Table 10.10 CSLOGS flattened data characteristics and initial number of rules for varying support

For this dataset, the best results were achieved for the lowest examined support threshold of 1 %, and detailed results of progressively filtering the rules based on the statistical analysis and redundancy removal are presented in Table 10.11 for support of 1 % (at the end of this subsection we present the performance of the final rule sets for all the support thresholds). The number of rules is shown in brackets below each reported AR and CR value. The results reveal that, by selecting important input attributes with ST and evaluating the rules with the statistical analysis and redundancy assessment method, there is a significant reduction in the number of rules. While an increase in AR can be observed, this comes at the cost of reduced CR capabilities. The characteristics of the \(FullTree\) rule set are similar to those of the \(Embedded\) and \(Induced\) rule sets, and the AR and CR are very similar or the same for the different rule sets. This is because the rules from the \(Embedded\) and \(Induced\) rule sets are subsets of \(FullTree\), and in this dataset there were not many variations among the rule sets w.r.t. the level of embedding in subtrees or frequent patterns that produce disconnected subtrees. To conclude, the increase in prediction/classification accuracy comes with a trade-off, since fewer instances are captured from the datasets. On the positive side, a smaller number of rules is expected to have better generalization power and to be easier for the user to understand and utilize for decision support purposes.

Table 10.11 Subtree association rule evaluation for CSLOG data (1 % support 50 % confidence)

Comparison with XRules for varying support thresholds. In Table 10.12 we compare the AR and CR of the final \(FullTree\) rule sets with the XRules approach for varying support thresholds. Note that the approaches are fairly different in terms of the rule filtering performed in the process. Nevertheless, the comparison serves mainly as a benchmark for the kind of accuracy and coverage rate that can be obtained when basing the classification on frequent patterns/subtrees extracted using the support and confidence thresholds. As such, in no way do the results indicate that one approach performs better than the other, as the internal mechanisms are rather incompatible. The XRules approach is based on the TreeMiner [51] algorithm for extracting ordered embedded subtrees, and hence the number of rules extracted at varying support thresholds is larger (shown in brackets), since the likelihood that a subtree will be frequent when it does not need to occur at the same position is much higher. On this note, the rule sets of the XRules approach will typically have a higher coverage rate, especially in the CSLOGS dataset, where subtrees do in fact occur at many different positions due to variations in website navigation. However, one can see that this at times comes at a cost of a reduction in AR, and constraining the subtrees by position can be seen as more precise, though it naturally covers fewer cases. To give a simple example, please refer to Table 10.13, where we show the rule sets for the support value of 10 %. One can observe that the \(FullTree\) rule set does not contain a rule that corresponds to rule number 3 in XRules even though it was considered frequent by XRules. The reason for this is that the particular node with label “6” with “Class(0)”, where “6” occurs at the same node/position in the DSM, did not occur in 10 % of the instances, and so could not be considered frequent and become part of the \(FullTree\) rule set. The two matching rules correspond to the first page accessed during the site navigation session, as it is labelled with pre-order position 1, namely X1 in our approach (note that X0 is a virtual node in the CSLOGS dataset, always labelled with 0, and is removed in both approaches). For support thresholds of 20 and 30 % no rules were extracted in our approach, while XRules only had the single default rule for the majority class.

Table 10.12 Comparison of rules accuracy and coverage rate for CSLogs data using the XRules and \(FullTree\) final rule set
Table 10.13 Rule sets at support 10 %

5.3 Experiment Set 3—Academic Institution Web Log Data

The Academic Institution WebLogs data consists of Apache2 (v2.2.3) web server log files. The WebLogs data was initially used in [16] to demonstrate the DSM application, and for the work in this chapter a similar setting of the WebLogs data as described in [16] has been utilized. The data was collected over a four-month period in its native (default) format. During this period, all accesses to the website were stored in log files, while messages stored in the normal error logs were excluded. Each access to the website was then classified as “internal” (within the university) or “external” (outside the university). The grouped user sessions were converted to trees as explained with the illustrative example in Sect. 3.1. The resulting dataset had 18,836 instances, of which 66 % was used for training and the remainder for testing. The details of the WebLogs access setting can be found in [16]. The general characteristics of the flat data format (including backtrack attributes) and the initial number of rules extracted for the academic institution data (50 % minimum confidence) at varying support thresholds are provided in Table 10.14.

Table 10.14 Academic Institution flattened data characteristics and initial number of rules for varying support

In this dataset, similarly to the experiments described in Sect. 10.5.2, rules from the \(FullTree\), \(Embedded\) and \(Induced\) rule sets have been progressively assessed with the statistical analysis and redundancy assessment method. The results demonstrate that the conversion of the original tree-structured data into the flat data format representation creates a very large number of input attributes, especially at lower support thresholds. When the Apriori algorithm is used to generate all frequent rules, the number of rules satisfying the given support and confidence constraints can therefore become difficult to analyze.
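To make the rule generation step concrete, the sketch below shows one common way of mining association rules from such a flat representation with an off-the-shelf Apriori implementation (here mlxtend), assuming the flat format is one-hot encoded as boolean columns. The attribute names (X1=1, b1, Class=int, ...) and the thresholds are placeholders for illustration; this is not necessarily the implementation used for the reported experiments.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical flat (one-hot) representation: each column is a positional
# attribute-value pair (node or backtrack) or a class value.
flat = pd.DataFrame({
    "X1=1":      [True, True, False, True],
    "X2=6":      [True, False, True, True],
    "b1":        [False, True, False, True],   # backtrack attribute
    "Class=int": [True, True, False, True],
    "Class=ext": [False, False, True, False],
})

# Frequent itemsets at a 50 % support threshold (illustrative only).
frequent = apriori(flat, min_support=0.5, use_colnames=True)

# Association rules at a 50 % confidence threshold; rules whose consequent
# is a class value form the initial class association rule set.
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
class_rules = rules[rules["consequents"].apply(
    lambda items: any(str(i).startswith("Class=") for i in items))]
print(class_rules[["antecedents", "consequents", "support", "confidence"]])
```

Even on this toy table the number of rules grows quickly, which mirrors the behaviour observed for the WebLogs data at lower support thresholds.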

As Table 10.15 shows, even with the given support constraint, the number of extracted rules (Initial Rule Set) is large. A large volume of rules may be discovered due to the presence of irrelevant attributes in the dataset. The capability of ST to select appropriate attributes, and thereby remove irrelevant ones, was demonstrated in our previous experiments on relational data problems. For this particular task of evaluating tree-structured rules, similar experiments were conducted: for each support threshold the attributes were ranked in decreasing order of their ST value and a relevance cut-off point was chosen.
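For reference, the Symmetrical Tau of an attribute with respect to the class can be computed from their contingency table. The sketch below follows the formulation commonly attributed to Zhou and Dillon, \(\tau = \frac{\sum_{i}\sum_{j} P_{ij}^2/P_{+j} + \sum_{j}\sum_{i} P_{ij}^2/P_{i+} - \sum_i P_{i+}^2 - \sum_j P_{+j}^2}{2 - \sum_i P_{i+}^2 - \sum_j P_{+j}^2}\), where \(P_{ij}\) is the joint probability of attribute value \(i\) and class value \(j\) and \(P_{i+}\), \(P_{+j}\) are the marginals; the counts in the example are hypothetical, and the exact ranking and cut-off procedure used in the experiments may differ in detail.

```python
import numpy as np

def symmetrical_tau(counts):
    """Symmetrical Tau of an attribute w.r.t. the class.

    counts : 2-D array-like contingency table with attribute values as rows
             and class values as columns (assumes no empty rows/columns).
    """
    P = np.asarray(counts, dtype=float)
    P = P / P.sum()                      # joint probabilities P_ij
    row = P.sum(axis=1)                  # marginals P_i+
    col = P.sum(axis=0)                  # marginals P_+j
    num = (np.sum(P**2 / col[np.newaxis, :])    # sum_ij P_ij^2 / P_+j
           + np.sum(P**2 / row[:, np.newaxis])  # sum_ij P_ij^2 / P_i+
           - np.sum(row**2) - np.sum(col**2))
    den = 2.0 - np.sum(row**2) - np.sum(col**2)
    return num / den

# Attribute with two values vs. a binary class (hypothetical counts):
print(symmetrical_tau([[30, 10],
                       [ 5, 55]]))
```

Note that under this formula an attribute with a single value obtains \(\tau = 0\), which is consistent with the observation below that single-valued attributes cannot distinguish the class values.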

Table 10.15 shows, for each support threshold, the number of initial input attributes and the number of attributes remaining after Symmetrical Tau (ST) is applied, with the respective number of rules shown below each. All attributes removed from the WebLogs data are backtrack attributes, which indicates that the inclusion of these backtrack nodes may not be useful, or has low capability, in predicting the class attribute in this dataset. An input attribute that contains a single value cannot distinguish between the class values; such input attributes have been discarded as irrelevant based on the ST value calculated. With the application of the ST feature selection technique, rules that contain attributes that failed the ST measure are discarded. The large number of rules was reduced by applying, in sequence, the ST feature selection, the statistical analysis and the redundancy assessment method. According to Table 10.15, with the reduction of the number of rules for the \(FullTree\), \(Embedded\) and \(Induced\) rule sets for the Academic Institution WebLogs (10 % support), the AR increases but at the cost of a decrease in CR. One can also notice that the AR for the \(FullTree\) rule set is initially slightly lower than the AR of the \(Embedded\) and \(Induced\) rule sets, but after Symmetrical Tau is applied the accuracy of \(FullTree\) is higher and remains higher after chi-square rule filtering. Note that for this data no further rules were removed via logistic regression and the redundancy check, and hence these stages are not shown in Table 10.15.
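As an illustration of the chi-square filtering step, the sketch below tests whether a rule's antecedent and the class are independent on the training data and keeps only rules whose association is significant. The 2x2 contingency table construction, the rule encoding and the 5 % significance level are assumptions made for the example, not a specification of the method used in the experiments.

```python
from scipy.stats import chi2_contingency

def rule_is_significant(antecedent, predicted_class, instances, alpha=0.05):
    """Chi-square independence test between antecedent occurrence and class.

    antecedent      : set of attribute-value items, e.g. {("X1", "1")}
    predicted_class : class value the rule predicts
    instances       : list of (items, class_value) pairs from the training data
                      (assumes both antecedent presence/absence and both class
                      outcomes occur, so no row or column of the table is empty)
    """
    # 2x2 table: antecedent present/absent vs. class is / is not the predicted class.
    table = [[0, 0], [0, 0]]
    for items, cls in instances:
        row = 0 if antecedent <= items else 1
        col = 0 if cls == predicted_class else 1
        table[row][col] += 1
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha

# Rules failing the test would be removed from the rule set at this stage.
```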

Table 10.15 Subtree association rule evaluation for Academic Institution data (10 % support 50 % confidence)

6 Conclusion and Future Work

The work presented in this chapter has explored the application of a number of statistical methods to optimize subtree-based associative classification for tree-structured data. It has utilized a structure-preserving flat format representation to progressively apply a number of statistical methods, first filtering out irrelevant attributes and then removing irrelevant and redundant rules. An implication of this method is that the subtree-based association rules are restricted to those that occur at the same position in the original tree database, and that the initial rule set (before subtree reconstruction) can contain rules based on disconnected subtrees. Experiments were performed on three real datasets, and using the proposed approach a large number of rules were removed in all cases without negatively affecting the accuracy of the rule set, while for more structurally varied data this optimization came at the cost of a large reduction in coverage rate. Some of the results were compared with a structural classifier based on traditional subtrees, and some important differences and implications were highlighted. The results show that associations based on disconnected subtrees can be useful, while the positional constraint can often result in more precise rules for structurally varied data, but at the cost of a lower coverage rate. From these findings one can conclude that when forming association rules for tree-structured data, one should not be constrained to a valid and connected subtree, because an interesting association can occur anywhere in a tree instance and does not need to be a connected subtree of that instance. These findings indicate that including disconnected subtrees and constraining the subtrees by their exact occurrence in the database, in addition to traditional subtree patterns, could improve classifiers for tree-structured data. The method used in this chapter is to be seen as complementary to, and in no way a replacement of, the traditional way in which subtrees are mined.

Our future work will investigate the application domains where including such association rules can be beneficial, and the right way to combine them with traditional subtree patterns for optimal performance.

Furthermore, the chi-square and logistic regression measures were used as a case in point for statistic-based rule filtering, while Symmetrical Tau was utilized in the feature subset selection process. However, by no means do we claim that these are the optimal measures for their specific purposes. In fact, we have used the confidence constraint here because of the stronger focus on statistical quality assessment and on the differences between the rule sets discovered using the traditional support and confidence framework. Many other measures could be applied instead of the support and/or confidence constraints, which, as discussed in several works [12, 23, 29], can yield more interesting rules. Therefore, another direction of future work is to evaluate combinations of other constraints, statistical measures and techniques for rule removal and attribute relevance determination in the context of the tree-structured data domain.