1 Introduction

Association rule mining is a popular approach in data mining that can discover interesting patterns hidden in a database. It is widely used in many application domains, such as market basket analysis, web usage mining, intrusion detection, and bioinformatics [27]. Compared to other approaches, the patterns or rules produced by association rule mining are easy to interpret because they are defined as a relationship \(X\rightarrow Y\), where X, Y are each an item or a set of items occurring in the data. However, one of the major drawbacks of association rule mining is that it produces a large number of patterns that include both trivial and non-trivial relationships. Downstream analysis therefore still too often consists of manually sifting through lists of thousands of patterns. Prioritizing those patterns that are most likely to be interesting is thus a central problem in data mining research.

Unexpectedness is one of the nine criteria put forward to describe the relevance of a pattern for a given research question [16]. It is an important criterion because it highlights patterns that contradict users’ expectations or existing knowledge. Such patterns allow us to identify failings in previous knowledge and may reveal an aspect of the data that needs further investigation. As a result, several studies have tackled mining unexpectedness from different kinds of data, such as transaction data [24, 28, 30], sequential data [23] and, more recently, attributed graphs [4, 5, 21]. The unexpected patterns investigated in these studies are diverse as well: they can be unexpected transactions [29], unexpected association rules [24, 28, 30], unexpected sequences [23] or unexpected sub-graphs [4, 5, 21].

In this study, we focus on detecting unexpected association rules from transaction datasets. We present a novel method to discover unexpected patterns based on beliefs that are automatically derived from the data. Our method uses a clustering-based framework in which rule clusters play the role of a belief system in the mining process. To this end, we integrate DBSCAN, a density-based clustering algorithm, into the framework. Clustering of association rules has been explored previously [17, 20, 22], but unique to our proposed clustering procedure is that patterns are represented as feature vectors. The feature vectors are designed to capture both semantic and lexical relationships between patterns. Therefore, these feature vectors not only facilitate processing but also account for logical relationships, such as contradiction or similarity, in the distance between rules. This refines how the relationships between rules are defined, which in turn improves the rule clusters that serve as the major beliefs within the dataset. Outliers identified by the clustering process can then be prioritized as candidate unexpected patterns. The implementation of our model can be found at https://github.com/banhdzui/UnexpectedPatternMining.

2 Related works

Unexpected pattern mining methods can be divided into those based on objective measures and those based on subjective measures. Objective measures are based only on the pattern structure and the underlying data, while subjective measures incorporate users’ knowledge in the form of a belief system and an interestingness measure that estimates the unexpectedness of patterns with respect to the beliefs.

In studies based on subjective measures [24, 28, 30], the belief systems were constructed manually using domain knowledge. They are represented as first-order logic formulations, which allow the system to easily verify the unexpectedness of patterns against the beliefs. The exact manner in which unexpectedness is defined differs among these studies: Silberschatz et al. [30] considered the relative difference in belief degree of the system with and without the pattern; Padmanabhan et al. [28] proposed a matching function in terms of logical contradiction, while Liu et al. [24] employed lexical comparison of the pattern content. In 2005, Jaroszewicz et al. [19] proposed representing domain knowledge as a Bayesian network instead of first-order logic and estimating unexpectedness as the absolute difference between a pattern’s probability inferred from the network and from the observations. More recently, De Bie [10] proposed a Maximum Entropy model to formalize prior knowledge. The strategy treats prior knowledge as constraints on a probabilistic model that represents the uncertainty in the data.

The largest drawback of subjective measures is the difficulty of building a “good” belief system. Interviewing domain experts is a commonly proposed approach but is extremely hard in practice for two reasons. First, experts may have trouble enumerating their beliefs; they often make vague and contradictory statements, which are hard to translate into logic. Second, there may be too many beliefs, and it is not trivial to know which of them are really important and worth including in the belief system.

Compared to subjective measures, objective measures avoid the need to construct a belief system. Dong and Li [12] argued that unexpectedness can be interpreted in a statistical sense. They considered a significant variation in confidence between a rule and its neighbors as unexpectedness. However, their neighbor distance was defined in terms of lexical difference, which does not reflect the true logical relationships between rules. We believe that these logical relationships also play an important role in estimating rule distance, so we include this information in the rule feature vectors, which in turn affects the distance measure. In 2016, Chang et al. [6] proposed a hybrid approach for mining unexpected patterns, specific to a case study of endometriosis data.

In addition, our study shares some concepts with exception rule mining and exceptional model mining. Exception rules [7, 8] can be considered negative forms of strong association rules. For example, \(X \rightarrow \sim Y\), \(\sim X \rightarrow Y\) and \(\sim X \rightarrow \sim Y\) are negative forms of the rule \(X \rightarrow Y\). In other studies [18, 31], the authors defined exception rules as those that represent a change of conclusion caused by adding an extra condition to common sense rules. Rule pairs of the form \(\{ A \rightarrow X, AB \rightarrow X^{\prime }\}\) are evaluated with different metrics to judge whether \(AB \rightarrow X^{\prime }\) is exceptional. In 2005, Suzuki et al. [32] argued that there are 11 categories of exception rules based on rule triplets \(\{y \rightarrow x, z \rightarrow x, yz \rightarrow x\}\). They proposed an algorithm that performs a depth-first search over triplets of literals a, b, c and left an algorithm for transaction datasets as future work. Meanwhile, exceptional model mining (EMM) is a supervised local pattern mining framework in which several target attributes are pre-selected and a model over these attributes is chosen as the target concept. EMM aims to find interesting subgroups where the target model behaves differently from the complement of these subgroups or from the whole dataset [14]. The target model can be a correlation, an association, a classifier, a regression model or a Bayesian network. This line of research focuses on describing how much the relationship between the target attributes changes when they are placed in the context of those subgroups. [25] extends [14] by discovering exceptional relationships in terms of association rules instead of item-sets. In this study, we focus on unexpectedness arising when a pattern’s conclusion changes depending on the specific context, that is, when one outcome of a rule is expected given the bulk of the dataset but another outcome is observed.

3 Methods

Figure 1 illustrates our method for detecting unexpected rules from data. It starts with mining association rules from the transaction database, followed by pruning redundant rules. The non-redundant rules are then converted into numerical feature vectors and clustered. Based on the clustering result, the rules are categorized into two groups: beliefs and potential candidates for unexpected rules. Finally, a contradiction check is applied to reveal the true unexpected rules among the candidates.

Fig. 1 Unexpected pattern discovery workflow with example

3.1 Pruning redundant rules

Redundant rules can present a major challenge in association rule mining. Several previous studies have defined prior assumptions that form the basis of custom pruning frameworks [3].

For this study, the redundancy pruning was inspired by the following case. Consider a rule \(X, \alpha \rightarrow Y\) in which α is a ubiquitous item in the database, or in which α always occurs wherever X occurs, i.e. P(α|X) = 1, due to an underlying relationship. Rules with such items do not provide any new knowledge about the data compared to \(X \rightarrow Y\); furthermore, such rules are longer and hence harder to interpret. These types of redundant rules can be filtered based on the conditional probability \(P(\alpha |S), S \subseteq X\setminus \{\alpha \}\). We therefore apply Procedure 1 during association rule mining to prune redundant rules.

Procedure 1 Pruning redundant rules
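For concreteness, the following Python sketch illustrates this pruning idea. It assumes transactions are given as sets of items and checks only the full remainder \(S = X\setminus \{\alpha \}\); the helper names and the near-1 threshold are our own illustration, not the published Procedure 1.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def has_redundant_item(antecedent, transactions, threshold=0.999):
    """True if some item alpha in the antecedent is (almost) always implied by the
    remaining items, i.e. P(alpha | S) >= threshold with S = antecedent minus {alpha}.
    A rule containing such an item adds nothing over the shorter rule without it."""
    antecedent = frozenset(antecedent)
    for alpha in antecedent:
        rest = antecedent - {alpha}
        p_rest = support(rest, transactions)   # equals 1.0 when rest is empty (ubiquitous alpha case)
        if p_rest > 0 and support(rest | {alpha}, transactions) / p_rest >= threshold:
            return True
    return False

# usage: rules is a list of (lhs_itemset, rhs_itemset) pairs
# pruned = [r for r in rules if not has_redundant_item(r[0], transactions)]
```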

3.2 Rule representation

We represent rules as numerical feature vectors. This representation is the basis for calculating distance between the rules, and enables subsequent rule clustering.

An association rule is defined by two itemsets: one on the left-hand side, named the antecedent or LHS, and one on the right-hand side, named the consequent or RHS. The feature vector is constructed by concatenating features for the LHS and the RHS of the rule. Each feature corresponds to a unique item that can potentially occur on the corresponding side. Let \(x_{i}^{(l)}, x_{j}^{(r)}\) be the ith feature of the LHS and the jth feature of the RHS, and let kl, kr be the feature vector lengths for the LHS and RHS respectively. Figure 2 illustrates the structure of the complete feature vector. In the case of unsupervised pattern mining, kl = kr = |I|, as any item can occur on either side of the rule. However, in the case of classification rule mining, only decision items occur in the RHS, thus kl + kr = |I|.

Fig. 2 An illustration of the feature vector structure. It consists of two parts: antecedent (or LHS) features and consequent (or RHS) features

A straightforward representation of a rule’s feature vector would be a binary encoding of the absence or presence of items in the rule. However, the difference between two binary feature vectors only captures lexical differences. Consider the following examples from a hypothetical medical dataset that contains \(``sick^{\prime \prime }\) and \(``healthy^{\prime \prime }\) patients:

Example 1

\(X \rightarrow ``sick^{\prime \prime }\) and \(X \rightarrow ``healthy^{\prime \prime }\). This pair of rules differs only in two items, \(``sick^{\prime \prime }\) and \(``healthy^{\prime \prime }\), but these items are contrary to each other. Thus, the distance between these rules should be greater than that between rules that differ in two arbitrary, non-exclusive items.

Example 2

\(X, \alpha \rightarrow ``sick^{\prime \prime }\) and \(X, \beta \rightarrow ``sick^{\prime \prime }\), where α and β are positively correlated with each other. This means that α is likely to occur wherever β occurs and vice versa. Thus, these rules can be considered as highly similar and should have a small distance.

As these examples show, a binary representation of item presence is not sufficient to distinguish rules in an interestingness context. Therefore, we propose a new feature value domain in the range [− 1, 1], which represents how strongly a given feature is related to the LHS or RHS of the rule. A negative value implies a contradiction while a positive value implies similarity. More specifically, we define each feature value as the relationship score between the feature and the most correlated item in the rule.

This relationship score between a feature and an item is defined as the Spearman rank-order correlation coefficient of their occurrences across the transaction data. These coefficients can then be interpreted as similarity or contradiction, depending on their sign. This interpretation is based on the assumption that items occurring together might be similar, while items that appear in different transactions might be contradictory. To remove trivial relationships, we only keep coefficients with a p-value ≤ 0.05; the other coefficients are set to 0. Because small correlation coefficients indicate little if any correlation, we additionally set a magnitude threshold, which is dataset-dependent, as more transactions allow minor relationships to become significant. Here the correlation coefficient threshold was set empirically to 0.1.
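As an illustration of this step, the sketch below builds such a correlation matrix with SciPy’s spearmanr over a binary item-occurrence matrix; the matrix layout and the default thresholds mirror the description above but are otherwise our own assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def item_correlation_matrix(occurrence, p_max=0.05, min_abs_rho=0.1):
    """occurrence: (n_transactions, n_items) binary matrix of item occurrences.
    Returns an (n_items, n_items) matrix of Spearman coefficients, with
    insignificant or negligible entries set to 0."""
    rho, pval = spearmanr(occurrence)          # pairwise correlations over columns
    rho, pval = np.atleast_2d(rho), np.atleast_2d(pval)
    keep = (pval <= p_max) & (np.abs(rho) >= min_abs_rho)
    M = np.where(keep, rho, 0.0)               # drop trivial relationships
    np.fill_diagonal(M, 1.0)
    return M
```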

The steps for extracting the feature values of a rule are described in Procedure 2, where M is the correlation matrix between items in the data after removing insignificant correlation coefficients. For each LHS feature, its value is the correlation coefficient between the feature and the LHS item showing the highest (negative or positive) correlation with that feature. The same procedure is followed for the RHS features using the RHS items. Figure 3 illustrates the feature vector for the rule “\(age=40..49, tumor\_size=10..14 \rightarrow recurrence=no\)” from the Breast Cancer dataset [13]. By comparing these feature vectors, we can judge how the rules are related to each other, which is useful for subsequently grouping the rules into clusters and discovering unexpected rules.

Fig. 3 Feature vector of the rule “\(age=40..49, tumor\_size=10..14 \rightarrow recurrence=no\)” from the Breast Cancer dataset. The first row is the list of features while the second row contains the corresponding feature values. Here the LHS contains an item that is highly similar to “age = 40..49” but contradictory to “age = 30..39”. Meanwhile its RHS is “recurrence = no”, so the value of the feature “rec = no” is 1.0 while the feature “rec = yes” is − 1.0 (contradiction)

Procedure 2 Rule feature extraction
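The sketch below mirrors this extraction step: each LHS (respectively RHS) feature takes the coefficient of the item on that side with the strongest absolute correlation to the feature. The data structures (correlation matrix M, feature lists, item index) are our own illustration rather than the published Procedure 2.

```python
import numpy as np

def rule_feature_vector(lhs_items, rhs_items, M, lhs_features, rhs_features, item_index):
    """Build the concatenated [LHS features | RHS features] vector of a rule.
    M: item-item correlation matrix with insignificant entries already zeroed.
    lhs_features / rhs_features: ordered lists of items used as features on each side.
    item_index: maps an item to its row/column in M."""
    def side_values(features, side_items):
        values = np.zeros(len(features))
        for i, feat in enumerate(features):
            # correlation of this feature with every item on that side of the rule
            corrs = [M[item_index[feat], item_index[it]] for it in side_items]
            values[i] = max(corrs, key=abs) if corrs else 0.0   # strongest |correlation|
        return values
    return np.concatenate([side_values(lhs_features, lhs_items),
                           side_values(rhs_features, rhs_items)])
```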

3.3 Rule clustering and contradiction check function

To detect unexpectedness, we cluster the association rules and subsequently check for belief contradictions. Prior to clustering, a Principal Component Analysis (PCA) is applied to the feature vectors representing the association rules, which reduces the computational complexity of clustering on large rule sets. The clustering is done by the DBSCAN algorithm [15] using the Euclidean distance on the proposed numerical feature vectors. DBSCAN classifies rules as core points, reachable points or outliers. The rules that are core or reachable points are considered beliefs because they are supported by several similar neighbors. The outliers are candidate unexpected rules. To determine whether they are unexpected or not, these outliers are finally compared to the beliefs by a contradiction check function.
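This step maps directly onto scikit-learn; the snippet below is a minimal sketch with placeholder parameter values rather than the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def cluster_rules(rule_vectors, n_components=10, eps=0.5, min_pts=3):
    """Reduce rule feature vectors with PCA, then cluster them with DBSCAN.
    Returns the reduced vectors and the cluster labels: labels >= 0 form the
    belief clusters, label -1 marks outliers (candidate unexpected rules)."""
    reduced = PCA(n_components=n_components).fit_transform(np.asarray(rule_vectors))
    labels = DBSCAN(eps=eps, min_samples=min_pts, metric="euclidean").fit_predict(reduced)
    return reduced, labels
```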

Definition 1

A rule \(X \rightarrow Y\) is unexpected with respect to a belief \(X^{\prime } \rightarrow Y^{\prime }\) on the dataset D if the following conditions hold:

  • Y and \(Y^{\prime }\) logically contradict each other.

  • X and \(X^{\prime }\) co-occur in a substantial subset of transactions in D. A subset is substantial if it satisfies a user-defined support threshold.

  • X and \(X^{\prime }\) are similar to each other (defined by the Cosine similarity).

By combining the above conditions, we obtain the following knowledge base: \(\lbrace X \rightarrow \sim Y^{\prime }, X^{\prime } \rightarrow Y^{\prime }, X, X^{\prime } \rbrace \). This knowledge base is unsatisfiable, which can be proven by resolution. Because \(X^{\prime } \rightarrow Y^{\prime }\) is a belief, the existence of \(X \rightarrow Y\) is unexpected.

To estimate the contradiction or similarity between two itemsets, we propose computing the Cosine similarity of their feature vectors. If the Cosine value exceeds a given threshold, we conclude that these itemsets are contrary or similar to each other. We denote the similarity threshold for the LHS as δ1 and the contradiction threshold for the RHS as δ2, where δ1 needs to be positive and δ2 needs to be negative for an unexpected rule to occur.
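A sketch of this contradiction check is given below. It assumes each rule object carries precomputed LHS and RHS feature vectors and that the joint support of the two LHS itemsets is supplied by the caller; the threshold defaults are placeholders rather than recommended values.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two feature vectors (0 if either is a zero vector)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def is_unexpected(candidate, belief, lhs_pair_support,
                  delta1=0.0, delta2=-1.0, min_support=0.01):
    """Definition 1: the candidate is unexpected w.r.t. the belief if their LHSs are
    similar, their RHSs contradict, and the two LHS itemsets co-occur often enough.
    candidate/belief expose .lhs_vec and .rhs_vec; lhs_pair_support is the fraction
    of transactions containing both LHS itemsets."""
    similar_lhs   = cosine(candidate.lhs_vec, belief.lhs_vec) > delta1    # X similar to X'
    contradict_rhs = cosine(candidate.rhs_vec, belief.rhs_vec) <= delta2  # Y contradicts Y'
    co_occur      = lhs_pair_support >= min_support                       # X, X' co-occur
    return similar_lhs and contradict_rhs and co_occur
```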

The DBSCAN algorithm used to identify the candidate unexpected rules requires two parameters, minPts and eps. minPts is the minimum number of data points required to form a dense region, and eps is the maximum distance between a data point and its nearest data point for them to be considered part of the same cluster. The choice of these parameter values directly impacts the belief system and the identified outliers. However, a belief system should not contain any contradictory rules, and this can be used to reject invalid belief system solutions. This observation can thus serve as a ’no-contradiction constraint’ on the eps parameter for the DBSCAN clustering. The constraint specifies that no contradiction is allowed between clusters, or, as a tunable relaxation, no contradiction between clusters that consist of a large number of rules. The maximum size of a contradicting cluster is L𝜖 = 𝜖|R| rules, where R is the set of all rules and 𝜖 is a small fraction. We then reduce the eps value steadily until DBSCAN finds a set of clusters that does not break the no-contradiction constraint.

Procedure 3 Detecting unexpected rules

The complete framework for detecting unexpected rules is described in Procedure 3. This procedure requires the following parameters: the DBSCAN parameter minPts; an initial value for eps, maxEps; a reducing step for eps, epsStep; a small proportion 𝜖 for the no-contradiction constraint; and two thresholds δ1, δ2 for the contradiction check function. After unexpected rules are discovered from the database, we can rank them according to the similarity in LHS between the unexpected rules and their contrary beliefs.
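Building on the sketches above (cosine, cluster_rules and is_unexpected), the following illustrates how such a procedure could be organized; the helper pair_support and the loop details are our own simplification, not the published Procedure 3.

```python
from collections import defaultdict

def no_contradiction(rules, labels, max_size, delta1, delta2):
    """True if no two clusters larger than max_size contain a pair of rules whose
    LHS vectors are similar (> delta1) while their RHS vectors contradict (<= delta2)."""
    clusters = defaultdict(list)
    for rule, label in zip(rules, labels):
        if label >= 0:                                  # ignore outliers
            clusters[label].append(rule)
    big = [c for c in clusters.values() if len(c) > max_size]
    for i, ci in enumerate(big):
        for cj in big[i + 1:]:
            if any(cosine(a.lhs_vec, b.lhs_vec) > delta1 and
                   cosine(a.rhs_vec, b.rhs_vec) <= delta2
                   for a in ci for b in cj):
                return False
    return True

def detect_unexpected_rules(vectors, rules, max_eps, eps_step, min_pts,
                            epsilon, delta1, delta2, pair_support):
    """Decrease eps until DBSCAN yields clusters satisfying the no-contradiction
    constraint, then compare outliers against the belief rules.
    pair_support(a, b) is assumed to return the joint support of the two LHS itemsets."""
    eps, labels = max_eps, []
    while eps > 0:
        _, labels = cluster_rules(vectors, eps=eps, min_pts=min_pts)
        if no_contradiction(rules, labels, epsilon * len(rules), delta1, delta2):
            break
        eps -= eps_step
    beliefs  = [r for r, l in zip(rules, labels) if l >= 0]
    outliers = [r for r, l in zip(rules, labels) if l == -1]
    unexpected = [o for o in outliers
                  if any(is_unexpected(o, b, pair_support(o, b), delta1, delta2)
                         for b in beliefs)]
    return beliefs, unexpected
```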

4 Experiments and discussion

In this section, we present experiments on six different datasets to test our method in pruning redundancy, constructing the belief systems and discovering the unexpected rules. These datasets are classification data, which allows us to evaluate the performance of the unexpected rules. First, we generate rules from the data and report the results of redundancy pruning. Next, we evaluate the unexpected rules generated by our method and compare them with two other methods, by Hussain et al. [18] and by Suzuki et al. [31]. Finally, we show some examples of unexpected rules obtained from the data and their interpretation.

Most of our datasets are obtained from the UCI repository [13], including: Adult, Breast cancer, Credit approval, Smtp (KDDCup99) and Pima. Also included is a T-cell receptor (TCR) dataset, an example of a higher-dimensional dataset derived from Dash et al. [9]. The Smtp (KDDCup99) dataset is a subset of the KDD CUP 1999 network intrusion data consisting of transactions with service = smtp. Similar to [33], we do not use all features of the Smtp data. We restricted the number of features to 9 (count, dst_bytes, flag, src_bytes, logged_in, diff_srv_rate, dst_host_srv_serror_rate, same_srv_rate, dst_host_serror_rate), based on the cross-validated information gain in the training dataset. Most of these datasets are imbalanced, i.e. the class distributions are not uniform among the classes. In the Adult dataset, for example, the label “≤ 50K” occurs in 75.92% of the transactions while “> 50K” occurs in only 24.08%. Such datasets potentially contain abnormal patterns hidden in minority classes that are not easy to discover. The properties of the datasets and the parameters used for association rule mining are summarized in Table 1. Except for Adult and Smtp (KDDCup99), where separate training and test data are available, we randomly split every dataset 8 : 2 into training and testing sets, respectively.

Table 1 A summary of experimental datasets and the used parameters for association rule mining

Prior to association rule mining, we convert continuous attributes into categorical attributes and then consider each attribute–value pair as a transaction item. We used the Apriori algorithm [2] with low minimum support and high minimum confidence to generate the pre-mined rules. Apriori is one of the most well-known algorithms in association mining. In this step, Apriori could be replaced by other association mining algorithms such as FP-Growth [34], ECLAT [35] and LCM [36], which have been proposed to improve the mining process in terms of running time [37]. As this step is not the main objective of this study, we leave the exploration of these algorithms for mining unexpectedness to future work. The high minimum confidence guarantees the quality of the beliefs, while the low minimum support helps reveal unexpected rules. Spearman rank-order correlation coefficients between items are computed with the spearmanr function of the SciPy library.
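For reference, this mining step can be reproduced with, for example, the mlxtend library; the toy data and thresholds below are placeholders, and this is not the implementation used in our experiments.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# toy discretized data: each attribute=value pair becomes a binary item column
df = pd.DataFrame({"age": ["40..49", "30..39", "40..49"],
                   "recurrence": ["no", "no", "yes"]})
onehot = pd.get_dummies(df).astype(bool)
frequent = apriori(onehot, min_support=0.3, use_colnames=True)                 # low minimum support
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)   # high minimum confidence
```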

All experiments were run on an Intel Core i7 machine with 16GB of RAM running at 2.8GHz. The experimental results are described in the following sub-sections.

4.1 Redundant rule pruning

To evaluate the performance of pruning redundant rules, we used the redundancy ratio [1] as defined in (1). The higher the redundancy ratio, the more redundancy exists in the generated rules.

$$ \partial = \frac{\text{Total rules generated }(T)}{\text{Non-redundant rules }(E)} $$
(1)

Table 2 shows the redundancy ratio which was obtained by our method on a set of rules generated by the Apriori algorithm for each dataset.

Table 2 Redundancy ratio detected by our method on the set of rules generated by the Apriori algorithm

4.2 Unexpected rules evaluation

To evaluate the quality of the unexpected rules, denoted UnexpRules, that are automatically generated by our framework, we examine their contribution to the performance of classifiers on the minority class. The basic idea behind this evaluation is that classifiers learn and capture knowledge hidden in the data, so they can be considered black-box belief systems. If rules improve the performance of the classifiers, then these rules are meaningful but unexpected with respect to the classifier-based belief system; in other words, they are actually true unexpected patterns.

The evaluation procedure is as follows. A classifier and UnexpRules are generated from the same training data. An independent testing dataset is then used to validate the performance of the classifier with and without UnexpRules. When UnexpRules are used, samples satisfying the LHS of an UnexpRule follow the prediction of that rule instead of the classifier’s. In this study, we used the F1 measure to validate the classifiers’ performance. The two classifier models deployed in our experiments are Support Vector Machines (SVM, polynomial kernel with degree = 3) and Random Forest (RF, number of trees = 20), from the Python scikit-learn package. These models were not chosen to represent the state of the art in imbalanced data classification, but only serve as a stand-in for beliefs that can be easily learned from the data.
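A minimal sketch of this override-based evaluation is shown below; the rule representation, the subset test for LHS matching and the use of a binary minority label are our own assumptions.

```python
from sklearn.metrics import f1_score

def f1_with_unexp_rules(clf, X_test, y_test, item_sets, unexp_rules, minority_label):
    """item_sets[i]: set of attribute=value items describing test sample i.
    Each rule in unexp_rules exposes .lhs (an itemset) and .prediction (a class label).
    Samples whose items satisfy a rule's LHS take that rule's prediction;
    the remaining samples keep the classifier's prediction."""
    y_pred = list(clf.predict(X_test))
    for i, items in enumerate(item_sets):
        for rule in unexp_rules:
            if rule.lhs <= items:              # LHS satisfied: override the classifier
                y_pred[i] = rule.prediction
                break
    return f1_score(y_test, y_pred, pos_label=minority_label)
```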

The UnexpRules set was obtained using minPts = 3, maxEps = 2.0, epsStep = 0.1 and δ1 = 0.0. The values of the other parameters, 𝜖 and δ2, depended on the dataset. In most cases, δ2 = − 1. If more than two prediction classes were present, the correlation coefficients among these classes could not reach − 1.0, so δ2 = − 0.5 was used instead. Table 3 reports 𝜖 and a summary of the outcomes of the cluster-based framework, while Fig. 4 shows the contribution of the generated UnexpRules to the SVM and Random Forest over the six datasets.

Table 3 𝜖 and the outcomes from the cluster-based framework for each dataset
Fig. 4 F1 score on the minority class for the SVM and RF, with and without UnexpRules, on the testing data. Overall, the obtained UnexpRules improve the performance of the classifiers on the minority class, indicating that these rules are unexpected and actionable. The largest improvement is found for the Smtp dataset, where the F1-score on the minority class increases from 0.021 to 0.621. This dataset is extremely imbalanced compared to the others: only 1.22% of the training data belongs to the “attack” class compared to 98.78% for the “non-attack” class. This can prevent the classifier from learning knowledge related to the “attack” class, and the UnexpRules are helpful in this case

In terms of running time, the DBSCAN algorithm takes only a few seconds for small rule sets but can take up to several hours for larger sets, e.g. the Adult data, which consists of more than \(27 \times 10^{4}\) rules. The computational complexity of detecting UnexpRules from the clusters is \(O(n^{2})\) for a rule set of size n, as Procedure 3 requires comparisons between all outliers and belief rules.

We also examined the influence of the threshold δ1 on our framework. The threshold δ1 is used as the criterion for the similarity between the LHS of an unexpected rule and a belief. In addition, it impacts the search for the optimal eps for the DBSCAN algorithm and thus affects the quality of both the belief system and the UnexpRules. Figure 5 illustrates the effect of δ1 on the clustering result. The figure shows that as δ1 increases, the clusters grow and the number of outliers decreases. This is because δ1 influences eps by relaxing or restricting the no-contradiction constraint among clusters, and eps in turn guides the clustering process. Figure 6 reports the F1 score variation over four datasets when δ1 increases from 0.0 to 1.0. For each dataset, the maximum improvement is found at δ1 = 0.0 and decreases as δ1 approaches 1.0.

Fig. 5 The effect of δ1 on the clustering result for the Breast cancer dataset. The red points are outliers while the others are beliefs. Data points in the same cluster share the same color. The axes are the first three principal components of the rules

Fig. 6 F1 score on the minority class of the SVM with and without UnexpRules on four datasets, with δ1 varying from 0.0 to 1.0. This value is the minimal similarity between the LHSs of an unexpected rule and a belief. None of the datasets have any UnexpRules at δ1 = 1.0

Given the conceptual similarities between the unexpected patterns described here and previously known exception rules, the two were compared on the different datasets. Table 4 reports the performance of exception rules based on the approaches of Hussain et al. [18] and Suzuki et al. [31], and of the unexpected rules found by our method. Hussain et al. [18] generate exception rules by pairing common sense rules and rank them with a relative interestingness measure; only exception rules with a high relative score were kept to determine performance. As this method was designed to work on binary datasets, only the datasets with two classes were used in this comparison. Suzuki et al. [31] introduced 5 constraints that an exception rule pair \(\{A \rightarrow X, AB \rightarrow X^{\prime }\}\) has to satisfy for the rule \(AB \rightarrow X^{\prime }\) to be considered an exception. These constraints require the following user-defined thresholds: \({\theta _{1}^{S}}, {\theta _{1}^{F}}, {\theta _{2}^{S}}, {\theta _{2}^{F}}, {\theta _{2}^{I}}\), which define the bounds for five derived scores related to Pr(A), Pr(X|A), \(Pr(AB), Pr(X^{\prime }|AB)\) and \(Pr(X^{\prime }|B)\) respectively, and δ for the significance level of the rules. Based on the discussion in their study and our experimental examination, we chose \({\theta _{1}^{F}}=0.5, {\theta _{2}^{F}}=0.75, {\theta _{2}^{I}}=0.5\) and δ = 0.9; the other thresholds depend on the experimental datasets. In general, the rules from the three methods improved the F1 score on the minority classes. However, the Hussain et al. approach lowered the F1 score for the majority classes of two datasets, Adult and Credit, while the others did not. The Suzuki et al. approach gave a significant improvement on the Smtp data, while UnexpRules did so on Breast Cancer. Overall, our method gave the best score on most of the experimental datasets. This difference in performance is caused by the different types of patterns these methods mine. The Hussain et al. and Suzuki et al. approaches suggest the rules \(AB \rightarrow X^{\prime }\) as exceptional when \(A \rightarrow X\) is high in both support and confidence and \(B \rightarrow X^{\prime }\) is low in confidence. Meanwhile, we focus on \(B \rightarrow X^{\prime }\) being high in confidence and use AB as a reference for unexpectedness. The rules from our method are therefore more concise and more reliable.

Table 4 F1 score on minority and majority classes when doing prediction on testing data with SVM and SVM plus unexpected rules generated from Hussain et al. [18], Suzuki et al. [31] and our method

4.3 Unexpected rules and their interpretation

To evaluate the validity and interpretability of both the beliefs and the unexpected rules, we take a closer look at some examples from the TCR, Credit approval and Breast cancer datasets.

TCR dataset:

the rules represent relationships between the T-cell receptor beta-chain CDR3 amino acid sequence and the antigen that is targeted by the T-cell. The CDR3 beta-chains consist of 12 to 17 amino acids. The dataset contains human T-cell receptor data for three antigens, namely BMLF, M1 and pp65. The classification goal is to predict, from the CDR3 amino acid sequence, which antigen it is associated with; this has recently been tackled with an RF [11], matching the black-box belief system presented here. Each CDR3 amino acid sequence was encoded as a set of 3-grams, where one 3-gram is a sequence of three adjacent amino acids. Table 5 lists the unexpected rules generated from this dataset. From the table, we can see that the contradiction revolves around two 3-grams, EQY and GGE. The proposed method found the belief that EQY is associated with the M1 antigen. This belief matches observations made about these data in the past: the EQY 3-gram is derived from the TRBJ2-7*01 gene, which for this dataset can in turn be associated with the M1 antigen [26]. Thus our approach captured a belief that had been formulated by experts working with these data. However, the appearance of GGE breaks the rule and turns the conclusion into the BMLF antigen. This is highly unexpected, as the 3-gram GGE can also be derived from the same TRBJ2-7*01 gene. In this instance, however, the physicochemical properties of the T-cell receptor allow it to be associated with the BMLF antigen. This is a departure from what has been previously observed, and is a good lead for further investigation.

Credit approval dataset:

the rules represent associations between customer information and the credit approval decision. Table 6 shows 2 of the 6 unexpected rules discovered by our method. One belief says that if customers have the following properties: a8 < 7, a9 = t, a10 = t, a12 = f, a13 = g, they will be approved for credit with high confidence. However, some transactions in the database, about 0.67%, show that customers who satisfy these properties but whose profile also includes the property a6 = ff are not approved for credit, with an even higher confidence.

Breast cancer dataset:

the mined rules correspond to the relationship between 9 attributes describing women who have successfully undergone breast cancer treatment and whether their cancer recurred at a later date. Table 7 shows two of the unexpected rules found in this dataset. In the first example, the belief says that “if menopause=ge40, node-caps = no and irradiat=no then recurrence=no”. However, when inv-nodes= 3..5 is added to the condition, cancer recurrence happens with a confidence of 100%. This contradiction can be interpreted as follows: normally we do not expect a recurrence in a woman who became menopausal after 40 years of age, who did not originally receive radiation treatment and had no so-called node caps, except if she had 3 to 5 axillary lymph nodes containing metastatic breast cancer visible on histological examination. This follows the belief that if the cancer was not severe enough to warrant radiation treatment, it has likely been entirely eliminated by the first treatment. The contradiction, however, is that some of these women already had axillary lymph node metastasis, which makes the cancer much harder to remove entirely.

Table 5 Unexpected rule examples of the TCR dataset. Reference is the LHS co-occurrence of belief and unexpected rule in data
Table 6 Unexpected rule examples from the Credit approval dataset. Reference is the LHS co-occurrence of belief and unexpected rule in data
Table 7 Unexpected rule examples related to the class attribute on the Breast cancer data. Reference is the LHS co-occurrence of belief and unexpected rule in data

Thus far, only the class attribute has been considered as the unexpected outcome of a rule; however, the approach is equally valid for other attributes. Table 8 shows two examples of such patterns. One belief describes that “if patients went through menopause at age 40 or older and their tumor occurs in the lower-left quadrant of the breast, then they receive no radiation treatment”. Unexpectedly, if those patients have node-caps unknown, then they do receive radiation treatment. In the second example, we find a belief that claims “if patients are aged 60 to 69 and do not receive any radiation treatment, then the number of involved axillary lymph nodes is between 0 and 2”. Unexpectedly, several cases in the dataset with the same initial conditions have 6 to 8 involved axillary lymph nodes. In addition, the rule adds the outcome node-caps = yes, which does not occur in the case of inv-nodes = 0..2. In this manner, the approach is able to find novel outcomes that do not match the beliefs and warrant further study.

Table 8 Unexpected rule examples related to the other attributes on the Breast cancer data. Reference is the LHS co-occurrence of belief and unexpected rule in data

5 Conclusion

Unexpectedness is considered one of the nine relevant criteria for evaluating pattern interestingness. It identifies failings in previous knowledge or may reveal an aspect of the data that needs further study. In this work, we developed a clustering-based method to identify unexpected rules from different datasets. An advantage of our method is that it can discover unexpected rules without the need to build a belief system manually. We evaluated the quality of the unexpected rules based on the performance of classifiers. The experimental results show that the discovered unexpected rules are actionable and have great potential for imbalanced datasets, which appear in many real-world applications such as fault detection, fraud detection and medical diagnosis.

To facilitate rule clustering, we proposed representing association rules as numerical feature vectors. We have extended this representation to also summarize and visualize the association rules and their relationships, which could be useful for developing a system in which users can examine and select interesting rules interactively. However, such a representation requires a high memory capacity when the number of unique items in the dataset is large. Moreover, the clustering process scales poorly to large rule collections. Various options exist that are likely to speed up this aspect of the algorithm, such as using an approximate nearest neighbor search, but these are beyond the scope of this study. We also proposed a way of pruning redundant rules based on associations between items. This approach holds great promise for integration with other interestingness discovery approaches to form a complete pattern-prioritizing workflow.