Keywords

1 Introduction

Association has been highlighted, among the data mining tasks, due to its comprehensibility even by non-experts in the field. To have an idea, the Apriori algorithm, broadly used to obtain association patterns, was elected as one of the ten data mining algorithms most employed by the community [1]. For this and other reasons, association has been applied in many domains, as noticed in [26] works.

In the last years, researches have adopted some strategies to aid the user to identify the relevant associative patterns of the domain. One of these strategies is to pre-process the data before obtaining the rules. For that, many approaches have been proposed, being clustering a promising one. In this case, as seen in Fig. 1, the data are initially grouped into \(n\) groups (\(GD_1\),\(GD_2\),...,\(GD_n\)). Association rules are extracted within each group and, in the end, \(n\) groups of rules are obtained (\(GR_1\),\(GR_2\),...,\(GR_n\)). All these rules compose the rule set. According to [7], each group expresses its own associations without the interference of the other groups that contain different association patterns. The aim is to obtain potentially interesting rules that would not be extracted from unpartitioned data setsFootnote 1. The user must set the minimum support to a low value to discover these same patterns from unpartitioned data sets, causing a rapidly increase in the number of rules. Recent works have used these ideas in different domains as [8] in maintenance systems, [9] in banking context and [10, 11] in automotive data.

Fig. 1.
figure 1

Overview of the process to extract association rules through clustering in comparison to the traditional one.

Distinct methodologies have been proposed to enable the described process. Each methodology uses a different combination of similarity measures with clustering algorithms to obtain the groups of rules (see Fig. 1). However, little has been done to analyze the performance of the methodologies or even to compare the results. For example, given a specific problem of a domain, how could the user identify the most suitable methodology to use in his problem? Or, how the user could check if the selected methodology was good enough for the problem, considering that different interests may be important for his decision? To aid the user in these tasks, considering the aims of the clustering process above discussed, there are some issues that have to be investigated (see Fig. 1):

Issue 1. Is there overlap between a rule set obtained through partitioned data, i.e., extracted from clustered data, in relation to a rule set obtained through unpartitioned data, i.e., extracted from traditional process? A rule set obtained through a partitioned data is named here as RsP and a rule set obtained through a traditional process is named here as RsT.

Issue 2. Is there overlap between the rules in RsT and RsP regarding the interesting knowledge? In other words, has RsP, in fact, more interesting patterns than RsT?

Issue 3. What is the process behavior regarding the number of rules that are obtained in RsP?

Based on the exposed arguments and on the three presented issues, eleven metrics are proposed in this paper to provide an assessment procedure in order to support the evaluation of the methodologies that use clustering in the pre-processing step. However, to propose the metrics, a subjective evaluation was done with some users to understand each issue. The analysis was done as a way to comprehend the important aspects to be considered by a user when comparing the performance of some methodologies or even their results. Thereby, this paper contributes with current researches since the metrics provide criteria to: (a) analyze the methodologies, (b) identify their positive and negative aspects, (c) carry out comparisons among them and, therefore, (d) help the users to select the most suitable solution for their problems. Besides, the paper does the users think about many aspects to be considered in the presented context and provides to them a flexible way to explore their problems. This paper is an extension of [12]’s work, in which two more metrics are proposed (\(M_{P}\), \(M_{RF}\)) and where the subjective evaluation is used to attest the importance of the previous obtained results.

This paper is organized as follows. Section 2 surveys the related researches. Section 3 presents the subjective evaluation and Sect. 4 the proposed metrics. Section 5 describes some experiments that were carried out to show how the metrics can be used. Section 6 discusses the results obtained in the experiments. Conclusion is given in Sect. 7.

2 Related Works

There are many researches that initially cluster data aiming to discover and facilitate the search for the interesting pattern of the domain. Some of these works are described below.

Plasse et al. [11] propose to split the transactions’ items into groups in order to extract the rules. The aim is to find interesting associations among rare items (less frequent) that would not be discovered if the traditional process were applied, mainly in sparse data. Besides, the authors show that if the data are clustered it is possible to reduce both the amount and the complexity of obtained rules compared to non-clustered data. However, these results depend on the used clustering algorithms and similarity measures. The authors evaluate many hierarchical algorithms (Single, Complete, Average, Ward, Varclus) combined with many similarity measures (Jaccard, Russel Rao, Dice, Ochiai, Pearson). Nevertheless, it is not understandable how the rules are obtained within the groups, since it is necessary to have a set of transactions and not a set of items. This means that it is not clear how the transactions are distributed over the groups. Among the similarity measures used by them, we emphasize Jaccard (Eq. 1) and Russel Rao (Eq. 2) – the Russel Rao due to its good performance in the experiments presented in [11] and the Jaccard due to its use by the measure described below (Agg). The Jaccard between two items \(i_{1}\) and \(i_{2}\), named here as P-J (Plasse Jaccard), is the ratio between the transactions \(t\) the items cover simultaneously and the total of transactions the items cover. An item covers a transaction \(t\) if the item is in \(t\). The Russel Rao between two items \(i_{1}\) and \(i_{2}\), named here as P-RR (Plasse Russel Rao), is computed considering the transactions \(t\) the items cover simultaneously over all the transactions. In fact, this measure is equivalent to the support of \(i_{1} \cap i_{2}\). In Eq. 2, \(N_{t}\) represents the total number of transactions.

$$\begin{aligned} {\text {P-J}}(i_{1},i_{2})=\frac{|\{t \ covered \ by \ i_{1}\} \cap \{t \ covered \ by \ i_{2}\}|}{|\{t \ covered \ by\ i_{1}\} \cup \{t \ covered \ by \ i_{2}\}|}. \end{aligned}$$
(1)
$$\begin{aligned} {\text {P-RR}}(i_{1},i_{2})=\frac{|\{t \ covered \ by \ i_{1}\} \cap \{t \ covered \ by \ i_{2}\}|}{N_{t}}. \end{aligned}$$
(2)

Aggarwal et al. [13] propose an algorithm, named CLASD, to split the transactions aiming to discover associations on small segments (subsets) of the data. The authors justify that the approach has a considerable impact on the rules that are obtained, since the patterns that cannot be identified in the whole set can be identified in the subsets. To cluster the transactions, [13] use a similarity measure proposed by them (Eq. 3), named here as Agg (Aggarwal). The clustering is done by CLASD. In this case, the similarity between two transactions \(t_{1}\) and \(t_{2}\) is computed by the affinity (Af) average among the transactions’ items. It can be noted that the affinity (Af) is equivalent to the measure P-J. Therefore, after computing P-J among the \(m\) items in \(t_{1}\) and the \(n\) items in \(t_{2}\), the average among them is obtained. The higher the affinity among the items the more similar the transactions. From this initially clustering the rules are then extracted. Different from [11], the number of rules that are extracted by their process is higher compared to the traditional process, but smaller if the support in the traditional process had to be set with a very low value to obtain the same associations. A limitation of the approach is the number of parameters that has to be set (five plus two constants).

$$\begin{aligned} {\text {Agg}}(t_{1},t_{2})=\frac{\sum _{p=1}^{m}\sum _{q=1}^{n}Af(i_{p},j_{q})}{m*n}, \text{ Af(i,j) }=\frac{sup(\{i,j\})}{sup(\{i\})+sup(\{j\})-sup(\{i,j\})}. \end{aligned}$$
(3)

Koh and Pears [7] propose a methodology to cluster transactions and then extract a rare association rule set. The algorithm they proposed, named Apriori Inverse, applies upon groups of transactions in order to verify if it is possible to extract rare rules that are not discovered considering the whole data. Therefore, the idea is similar to the works above described. The authors demonstrate that the clustering contributes to the discovery of new associations. To cluster the transactions, their algorithm initially finds \(k\) seeds (centroids), where \(k\) indicates the number of frequent itemsets in transactions that match some conditions. Each seed forms a group. After the seed generation, each transaction \(t\) is allocated to one of the groups (seed) considering the similarity obtained through the measure presented in Eq. 4. The equation is an adaptation of the Jaccard coefficient and it compares the number of common items that occur between the transaction (\(t\)) and the group centroid (\(c_{k}\)). In this case, the higher the overlap between \(t\) and a centroid the higher the similarity value (Sim = similarity). In this approach it is necessary to set two additional values beyond the minimum support, hindering a little the exploration.

$$\begin{aligned} Sim(t,c_{k})=\frac{|t \cap c_{k}|}{|t \cup c_{k}| - |t \cap c_{k}| + 1}. \end{aligned}$$
(4)

There are other researches concerned with the clustering of transactions that, although not related to the extraction of association rules, could be used for that purpose [1417]. In [18] the authors propose an approach to identify, a priori, the potentially interesting items to appear in the antecedents and in the consequents of the association rules without extracting them. The approach is divided in two steps: the clustering of the transactions and the selection of the interesting items. To do the clustering the authors propose the use of incremental K-means with the similarity measure presented in Eq. 5, named here as Denza. Note that this measure is the Jaccard between transactions. Therefore, the similarity between two transactions \(t_{1}\) and \(t_{2}\) is computed considering the items the transactions share. After the grouping, statistics are applied upon the groups to identify the items that are relevant to the application.

$$\begin{aligned} {\text {Denza}}(t_{1},t_{2})=\frac{|\{items \ in \ t_{1}\} \cap \{items \ in \ t_{2}\}|}{|\{items \ in \ t_{1}\} \cup \{items \ in \ t_{2}\}|}. \end{aligned}$$
(5)

Among the papers above described, little has been done to analyze the performance of the methodologies, allowing to identify their positive and negative aspects, or even to compare the results among them. In general, the researchers compare the number of rules and/or itemsets that are obtained from unpartitioned data and clustered data to expose the usefulness of the methodologies. This strategy can be found in [7, 11, 13] and is related to “Issue 3” of Sect. 1. However, [11] also analyze the process considering the complexity of the rules that are obtained – the greater the number of items that compose a rule the higher its complexity. Koh and Pears [7] and Aggarwal et al. [13] discuss about some rules found through clustering to show that the process provides the discovery of interesting patterns, but the analysis of the process is subjective. Aggarwal et al. [13] also consider the execution time. Finally, [7] is the only work that allows a better analysis considering the existing overlap between the rules obtained from unpartitioned data and clustered data. This strategy is related to “Issue 1” of Sect. 1. Based on the presented arguments, as mentioned before, the necessity of an assessment procedure becomes evident.

It is important to mention that cluster validation indices (CVI), independently of being relative, internal, or external [19, 20], are not reviewed here. It is understood that these CVI can be used to evaluate the groups of transactions before obtaining the association rules: in this case, the aim is to check if the partition fits the structure underlying the data. However, after obtaining the groups of transactions, the rules are obtained and, therefore, the CVI can no longer be used, since the data used as input (for example, transactions) must be the same of the output (for example, groups of transactions); however, in this case, the input are transactions and the output are rules. Thus, the metrics presented here are important as a final step to evaluate the usefulness of a clustering process in obtaining groups of rules. Therefore, the works described in this section are the ones that aim to apply clustering as a previous step to obtain association rules and their limitations regarding an assessment procedure.

3 Understanding the Issues Through a Subjective Evaluation

A subjective evaluation was done with some users to understand each issue before proposing the metrics. The analysis was done as a way to comprehend the important aspects to be considered by a user when comparing the performance of some methodologies or even their results.

The evaluation was done through a questionnaire, which was elaborated to present to the users scenarios that can occur in the considered context. For each scenario a question was formulated in order to comprehend the importance of the aspects related to the exposed problem. The full questionnaire can be found in Appendix. Before presenting the Questions, a brief introduction was given to the users about the questionnaire. The questionnaire comprehends 10 Questions: Questions 1 and 2 are related to “Issue 1”, Questions 3, 4, 5 and 6 to “Issue 2” and Question 7 to “Issue 3”. Question 8 is related to a more general advantage the clustering process can provide. Questions 9 and 10 are general questions about the evaluation. The questionnaire was answered by 5 users, all of them researches on data mining and/or text mining. All of them have knowledge on clustering and association rules, including their particularities, problems, means of validation, etc., and, so, they could understand the exposed problem and contribute with the current research. A brief overview of their experiences is here presented: (user (1)) worked with objective measures to post-process association rules obtained to extract related words in a collection of documents to construct a enriched bag-of-words representation; worked with clustering to construct a topic hierarchy based on features extracted through association rules; (user (2)) worked with clustering to post-process association rules and developed a labeling method to label groups of rules; worked with complex networks and semi-supervised learning to post-process association rules; (user (3)) worked with clustering and semi-supervised learning to build a dynamic topic hierarchy; worked with association rules in some applications; (user (4)) worked with objective and subjective measures to post-process association rules; worked with clustering to organize document collections; (user (5)) worked with taxonomies to generalize association rules to help users during the post-processing phase; worked with association rules to develop recommendation systems; worked with clustering in some applications. The users’ answers are presented in Table 1, which are discussed below along with each Question.

Table 1. Users’ evaluation about the exposed scenarios.
  • Question 1 (Issue 1). The idea behind this question, presented in Fig. 2, was to analyze if it is important not to lose knowledge during the clustering process. The other idea was to analyze, regarding the repetition over the clusters, if the knowledge must be in subsets that express its own associations. All of them considered that is desirable that rules extracted in RsT be found in RsP, directing the analysis to the aspects embedded in question. Besides, all of them stated that this aspect is important for an assessment procedure (1.a.) – the “yes” answers, in all the questions, regarding the importance of the considered aspect, are omitted in Table 1. In relation to the comments made by the users (1.b.), we can mention: (a) the necessity to differentiate the rules that don’t repeat over the clusters from the ones that repeat; (b) the understanding that if the patterns are present in both sets it is because they are relevant (even if they, probably, represent a frequent relation of the domain), being desirable that this scenario occurs. To formalize these aspects, two metrics are proposed in the next section: \(M_{O-RsP}\) and \(M_{R-O-RsP}\).

  • Question 2 (Issue 1). The idea behind this question, presented in Fig. 3, was to analyze if it is important to obtain new knowledge during the clustering process. The other idea was to analyze, regarding the repetition over the clusters, if the new knowledge must be in subsets that express its own associations. Almost all of them (four out five) considered that is desirable that rules extracted in RsP be not found in RsT, directing the analysis to the aspects embedded in question. However, all of them stated that this aspect is important for an assessment procedure (2.a.). In relation to the comments made by the users (2.b.), we can mention: (a) the necessity to differentiate the rules that don’t repeat over the clusters from the ones that repeat; (b) the understanding that the patterns appearing only in RsP represent new knowledge, probably interesting to the users, that would be difficult to be found through the traditional process. To formalize these aspects, two metrics are proposed in the next section: \(M_{N-RsP}\) and \(M_{R-N-RsP}\).

  • Question 3 (Issue 2). The idea behind this question, presented in Fig. 4, was to analyze if it is important to discovery new interesting knowledge; if so, the cost of the clustering process would be minimized since new interesting knowledge would be found. Almost all of them (four out five) considered that is desirable that some (or none) of the \(n\) most interesting rules in RsP be not found in RsT, directing the analysis to the aspect embedded in question. Besides, almost all of them (four out five) stated that this aspect is important for an assessment procedure (3.a.). In relation to the comments made by the users (3.b.), we can mention: (a) the importance of identifying new interesting rules not extracted through the traditional process; (b) the understanding that these new interesting patterns are the ones that stand in a certain group of data. To formalize this aspect, one metric is proposed in the next section: \(M_{N-I-RsP}\).

  • Question 4 (Issue 2). The idea behind this question, presented in Fig. 5, was to analyze if it is important not to lose interesting knowledge during the clustering process. Most of them (three out five) considered that is indifferent that some (or none) of the \(n\) most interesting rules in RsT be not found in RsP; thus, in this case, any conclusion was done in relation to the aspect embedded in question. However, most of them (three out five) stated that this aspect is important for an assessment procedure (4.a.). In relation to the comments made by the users (4.b.), there were few relevant observations. Only one user mentioned that it is important to analyze the considered scenario because it is understood that the clustering process could be losing knowledge, making the user to think about the truth quality of the clusters (directing to the aspect embedded in question). Although no conclusion could be made in relation to the main question, considering the importance for an assessment procedure (4.a.), one metric is proposed in the next section to formalize the aspect: \(M_{O-I-N-RsP}\).

  • Question 5 (Issue 2). The idea behind this question, presented in Fig. 6, was to analyze if it is important that an intersection exists between the new interesting knowledge, discovery through the clustering process, in relation to the interesting knowledge already found through the traditional process. Most of them (three out five) considered as desirable the existing intersection between the \(n\) most interesting rules in RsP and the \(n\) most interesting rules in RsT, directing the analysis to the aspect embedded in question. Besides, almost all of them (four out five) stated that this aspect is important for an assessment procedure (5.a.). In relation to the comments made by the users (5.b.), we can mention: (a) the understanding that interesting rules appearing in both sets have a high probability of being really interesting to the users; (b) however, as an opposed idea, the understanding that different knowledge must be extracted from the clustering process compared to the traditional one (indifferent cases). Thus, although distinct views exist (5.b.), considering the importance for an assessment procedure (5.a.), one metric is proposed in the next section to formalize the aspect: \(M_{C-I}\) (which can be interpreted according to the user’s view).

  • Question 6 (Issue 2). The idea behind this question, presented in Fig. 7, was to analyze if it is important that the new interesting knowledge be found in a small number of the groups. Most of them (three out five) considered as desirable the spread of the \(n\) most interesting rules in RsP in a small number of clusters, directing the analysis to the aspect embedded in question. Besides, all of them stated that this aspect is important for an assessment procedure (6.a.). In relation to the comments made by the users (6.b.), we can mention: (a) the understanding that the user could explore a group of interesting knowledge, directing the user to a reduced exploration space; (b) however, as an opposed idea, the understanding that it could be more relevant to explore the interesting rules within each cluster, since each group express different concepts of the domain (no desirable cases). Thus, although distinct views exist (6.b.), considering the importance for an assessment procedure (6.a.), one metric is proposed in the next section to formalize the aspect: \(M_{NC-I-RsP}\) (which can be interpreted according to the user’s view).

  • Question 7 (Issue 3). The idea behind this question, presented in Fig. 8, was to analyze the clustering process regarding the number of extracted rules in relation to the traditional one. Two of them considered that the amount of rules to be extracted through clustering, compared to the traditional process, should be low, two of them average and one high. However, all of them stated that this aspect is important for an assessment procedure (7.a.). In relation to the comments made by the users (7.b.), we can mention: (a) the understanding that, since each group contains a specific bunch of items, the amount of rules should be low (low cases); (b) the understanding that each group will contain a small number of rules (due to the same arguments of (a)) and, as a consequence, an average amount considering all the groups (average cases); (c) the understanding that since each group will contain similar items, the amount of rules should be high (due to the low diversity of items compared to the traditional process considering the same value of support/confidence) (high cases). Thus, although distinct views exist (7.b.), considering the importance for an assessment procedure (7.a.), one metric is proposed in the next section to formalize the aspect: \(M_{NR-RsP}\) (which can be interpreted according to the user’s view).

  • Question 8 (Others). The idea behind this question, presented in Fig. 9, was to analyze the clustering process regarding the spread of the concepts of the domain. All of them considered that is desirable that the clustering process should, as a consequence, enable each cluster to express a distinct topic of the domain. Besides, all of them stated that this aspect is important for an assessment procedure (8.a.). In relation to the comments made by the users (8.b.), we can mention, unanimously, the understanding that the aim of the clustering process is to split the distinct topics of the domain in order to aid rule exploration. To formalize this aspect, two metrics are proposed in the next section: \(M_{P}\) and \(M_{RF}\).

  • Question 9 and 10 (Others). Questions 9 and 10 are general questions about the evaluation. In relation to “Question 9”, presented in Fig. 10, none of them gave any suggestion. However, in relation to “Question 10”, also presented in Fig. 10, two comments were made: (a) the understanding that an assessment procedure should consider two or more scenarios but maintaining a good trade-off among them; (b) the understanding that the presentation of the rules in different scenarios, as the ones exposed, can aid the user during an evaluation process, since the user can give more attention to some aspects according to his aims. Therefore, as observed, this paper contributes with current researches making the users think about some aspects to be considered in the presented context and providing to them a flexible way to explore their problems (since the users can interpret a metric according to their view).

Fig. 2.
figure 2

Question 1 (see Appendix for details).

Fig. 3.
figure 3

Question 2 (see Appendix for details).

Fig. 4.
figure 4

Question 3 (see Appendix for details).

Fig. 5.
figure 5

Question 4 (see Appendix for details).

Fig. 6.
figure 6

Question 5 (see Appendix for details).

Fig. 7.
figure 7

Question 6 (see Appendix for details).

Fig. 8.
figure 8

Question 7.

Fig. 9.
figure 9

Question 8.

Fig. 10.
figure 10

Questions 9 and 10.

4 Evaluation Metrics: Providing an Assessment Procedure

Eleven metrics are proposed to provide an assessment procedure in order to support the evaluation of the methodologies that use clustering in the pre-processing step (as the ones described in Sect. 2). These metrics formalize the aspects related to each issue, which were analyzed by some users through a subjective evaluation. Each metric, which was implicit explored through a subjective question, is related to an issue. For each issue there are one or more metrics. All metrics, with exception to \(M_{NR-RsP}\), range from 0 to 1. Since RsP contains all the rules extracted within each group, repeated rules may exist in the set; therefore, RsP can be, in some cases, a multiset. In RsT the same doesn’t occur since the rules are unique.

Issue 1. Regarding the existing overlap among the rules in RsP and RsT, four metrics are proposed, which are described as follows:

  • \(M_{O-RsP}\) Related to “Question 1”, measures the ratio of “old” rules in RsP, i.e., the ratio of rules in RsT found in RsP (Eq. 6). A rule is considered “old” if it is in RsT, i.e., in the rule set obtained through the traditional process. Therefore, considering users’ analysis, the higher the value the better the metric, since the value indicates that there was no loss of knowledge during the process.

    $$\begin{aligned} M_{O-RsP}=\frac{|RsT \cap RsP|}{|RsT|}. \end{aligned}$$
    (6)
  • \(M_{R-O-RsP}\) Related to “Question 1”, measures the ratio of “old” rules that repeat in RsP (Eq. 7). In this work, it is assumed that a rule should exist in only one of the clustering groups, since it has to be in a subdomain that expresses its own associations. Therefore, considering users’ analysis and authors’ understanding on the problem, the lower the value the better the metric, since the value indicates that the knowledge, already known, is in subsets that express its own associations.

    (7)
  • \(M_{N-RsP}\) Related to “Question 2”, measures the ratio of “new” rules in RsP, i.e., the ratio of rules in RsP not found in RsT (Eq. 8). A rule is “new” if it isn’t in RsT, i.e., in the rule set obtained through the traditional process. Although it is important that any knowledge be lost (metric \(M_{O-RsP}\)), it is expected that the ratio of “new” rules in RsP be greater than the ratio of “old” rules. Therefore, considering users’ analysis, the higher the value the better the metric, since the value indicates the amount of knowledge, previously unknown, obtained during the process.

    (8)
  • \(M_{R-N-RsP}\) Related to “Question 2”, measures the ratio of “new” rules that repeat in RsP (Eq. 9). Idem to \(M_{R-O-RsP}\). Therefore, considering users’ analysis and authors’ understanding on the problem, as in \(M_{R-O-RsP}\), the lower the value the better the metric, since the value indicates that the knowledge, previously unknown, is in subsets that express its own associations.

    (9)

Issue 2. Regarding the existing overlap among the rules in RsP and RsT considering the interesting aspect of the knowledge, four metrics are proposed, which are described as follows:

  • \(M_{N-I-RsP}\) Related to “Question 3”, measures the ratio of “new” rules among the \(h\)-top interesting rules in RsP (Eq. 10). Given a subset of \(h\)-top interesting rules, selected from RsP, it is expected that the ratio of “new” rules in this subset be as large as possible. The \(h\)-top rules are the \(h\) rules that contain the highest values regarding an objective measure, where \(h\) is a number to be chosenFootnote 2. Therefore, considering users’ analysis and authors’ understanding on the problem, the higher the value the better the metric, since the value indicates that the cost of the process is minimized by the discovery of interesting knowledge, previously unknown, in RsP.

    (10)
  • \(M_{O-I-N-RsP}\) Related to “Question 4”, measures the ratio of “old” rules not in RsP among the \(h\)-top interesting rules in RsT (Eq. 11). Given a subset of \(h\)-top interesting rules, selected from RsT, it is expected that all these rules are present in RsP. Since any conclusion could be done considering users’ analysis, it is understood, in this work, that is not desirable that patterns in RsT disappear in RsP, which would imply in the loss of relevant knowledge. Thus, this metric measures the ratio of “old” interesting rules not in RsP. The \(h\)-top rules are as described in \(M_{N-I-RsP}\). Therefore, considering the importance of this aspect for an assessment procedure, according to users’ view, and authors’ understanding on the problem, the lower the value the better the metric, since the value indicates that the interesting knowledge in RsT was not lost during the process.

    (11)
  • \(M_{C-I}\) Related to “Question 5”, measures the ratio of common rules among the \(h\)-top interesting rules in RsP and the \(h\)-top interesting rules in RsT (Eq. 12). Consider two subsets, \(S_1\) and \(S_2\), containing, respectively, the \(h\)-top interesting rules in RsP and the \(h\)-top interesting rules in RsT. This metric measures the existing intersection between these two subsets, which is expected to be as large as possible according to the users (different from what was assumed in [12], that presented a metric interpretation as in the indifferent cases (see the justification below)). Therefore, the higher the value the better the metric, since the users understand that rules present in both sets have a high probability of being really interesting to them. However, the value of the metric can be interpreted according to the user’s view, being the lower the value the better the metric (indifferent cases – in this condition, the process would not provide any additional relevant information, since all the knowledge already known as interesting in RsT is also identified as interesting in RsP, as understood in [12]). Therefore, as noticed, the metric provides a flexible way to analyze the problem.

    (12)
  • \(M_{NC-I-RsP}\) Related to “Question 6”, measures the ratio of groups in the clustering related to RsP that contains the \(h\)-top interesting rules in RsP (Eq. 13). Therefore, considering users’ analysis, the lower the value the better the metric. This means that just some of the groups would have to be explored by the user, which will contain the “new” relevant knowledge extracted during the process. However, the value of the metric can be interpreted according to the user’s view, being the higher the value the better the metric (indifferent cases – in this condition, each cluster would express its own interesting knowledge). As noticed, the metric provides a flexible way to analyze the problem.

    (13)

Issue 3. Regarding the process behavior related to the number of rules that are obtained in RsP, a unique metric is proposed, which is described as follows:

  • \(M_{NR-RsP}\) Related to “Question 7”, measures the ratio of rules in RsP in relation to RsT (Eq. 14). It is important to analyze the process in relation to the number of rules in RsP. It is not desirable, according to the authors’ understanding, to have a large increase in the volume of rules, because even if new patterns are discovered, it will be harder for the user to identify them. Therefore, the lower the value the better the metric, since the value indicates that although new patterns have been extracted, the number of extracted rules is not big enough to overload the user. However, the value of the metric can be interpreted according to the user’s view, since each user has a different opinion (although all of them agree with the importance of this aspect for an assessment procedure). As noticed, the metric provides a flexible way to analyze the problem.

    $$\begin{aligned} M_{NR-RsP}=\frac{|RsP|}{|RsT|}. \end{aligned}$$
    (14)

Others. “Question 8” tried to capture a more general advantage the clustering process can provide: to enable each cluster expresses a distinct topic of the domain. According to the users, this is a desirable aspect. Carvalho et al. [21] evaluated some labeling methods for association rule clustering through two measures. Precision (\(P\)) measures how much a labeling method finds labels that represent as accurately as possible the rules contained in their own groups – if the labels don’t represent the knowledge inside each cluster the user will have difficult to explore the existing concepts related to the topics the labels express. Repetition Frequency (\(RF\)) measures how the labels are distributed over the clusters – if the labels appear repeatedly over the clusters the user will have difficult to identify the existing topics. As known, a labeling method is applied over the clusters obtained through a clustering process. Thus, considering that a good labeling method existsFootnote 3, it is assumed that a methodology, in the presented context, provides a good distribution of the topics if it presents high values for \(P\) and \(RF\), since which will impact the results is the clustering itself. Therefore, these measures are used as metrics in the considered context, which are described as follows:

  • \(M_{P}\) Considering that a good labeling method is available, measures how much the labels, built over the obtained clustering, represent the rules contained in the clusters (Eq. 15). The more a methodology provides groups that express their own associations, more specific domain knowledge the groups will contain and, probably, the more the labels will represent the rules in the clusters. Therefore, the higher the value the better the metric, since the value indicates that the methodology succeed to enable a suitable distribution of the domain topics.

    (15)
  • \(M_{RF}\) Considering that a good labeling method is available, measures how much the distinct labels, built over the obtained clustering, don’t repeat (Eq. 16). The more a methodology provides groups that express their own associations, more specific domain knowledge the groups will contain and, probably, the more distinct the labels will be over the clusters. Therefore, the higher the value the better the metric, since the value indicates that the methodology succeed to enable a suitable distribution of the domain topics.

    $$\begin{aligned} M_{RF}=1-\frac{\#\{distinct \ labels \ that \ repeat \ in \ the \ clusters\}}{\#\{distinct \ labels \ in \ the \ clusters\}}. \end{aligned}$$
    (16)
Table 2. Summary and recommended use of the proposed evaluation metrics.

Table 2 summarizes the metrics above described, indicating the suitable use of each one. As noticed, the metrics provide a flexible way to analyze the problem, since the user can analyze the results according to his interests (users can disagree about some aspects), both in relation to the metrics values’ interpretation as in relation to the importance (weight) of each one to a given domain – as in other contexts, such as objective measures for association rules [23], where the user chooses the measures (and sometimes their thresholds cut) to evaluate the obtained patterns according to his interests. Besides, the metrics also contribute in the sense of enabling the users to think about important aspects related to the presented context. Finally, relating the proposed metrics by the researchers found in the literature (Sect. 2), it can be observed that: (a) [7] is the only work that provides a similar analysis related to the metrics \(M_{O-RsP}\) and \(M_{N-RsP}\) in Issue 1; (b) none of them provide an analysis related to the aspects covered by Issue 2 and Others; [7, 11, 13] provide a similar analysis related to the metric \(M_{NR-RsP}\) in Issue 3. Thus, as mentioned in Sect. 2, the necessity of an assessment procedure becomes evident.

5 Experiments

Some experiments were carried out in order to present how the metrics can be used. Therefore, two contexts were defined. Suppose a user decides to apply clustering in the pre-processing step. First of all, he has to find the most suitable methodology to use in his problem. After that, he has to check if the selected methodology was good enough for the problem, considering that different interests may be important for his decision. Thus, two different situations were regarded: (i) identify among some clustering setups the most suitable; (ii) analyze the process itself. A clustering setup is obtained by the application of a clustering algorithm combined with a similarity measure. Therefore, the metrics provide the support to evaluate each situation under the discussed issues: while in (i) the data is initially clustered through some clustering setups in order to identify the one that obtains a good association set, in (ii) the usefulness of the process itself is analyzed. Four data sets and four clustering setups were selected to be used in the experiments. It is important to mention that the choices below could be changed without affecting the paper relevance, since the aim here is to present the metrics and to demonstrate how they can be used, independently of the clustering setup.

The four data sets were Adult (48842;115), Income (6876;50), Groceries (9835;169) and Sup (1716;1939). The numbers in parenthesis indicate, respectively, the number of transactions and the number of distinct items in each data set. The first three are available in the R Project for Statistical Computing through the package “arules”Footnote 4. The last one was donated by a supermarket located in São Carlos city, Brazil. Adult and Income are relational sets and Groceries and Sup transactional. Therefore, before extracting the rules on Adult and Income, the sets were converted to a transactional format, where each transaction was composed by pairs of the form “attribute=value”.

The four clustering setups were obtained by the combination of the algorithms and similarity measures presented in Table 3. Each combination gives a clustering setup, i.e., a different way to analyze the process. PAM is a partitional medoid-based algorithm that splits \(n\) objects in \(k\) groups, in which each object is closer to the medoid that defines its own cluster than to the medoid of any other cluster – a medoid is an object that has the minimal average dissimilarity to all the other objects in the same cluster [24]. Therefore, the algorithm works searching for the \(k\) representative objects in \(n\) to built the \(k\) clusters by assigning each object to its nearest medoid. On the other hand, Ward is an agglomerative hierarchical algorithm that starts by considering each object as a singleton cluster and then, at each iteration, merges the two closest clusters until a single cluster remains. The two closest clusters are the ones that minimize the increase of the within-cluster sum of the squared errors (SSE) [24]. Despite the existence of algorithms designed for transactions, such as ROCK, the choices of the algorithms were made based on works that cluster the rules in the post-processing phase, as [25, 26], aiming a posteriori comparison. The similarity measures were chosen considering the works described in Sect. 2 – only the similarities among transactions were selected based on previous experiments. Finally, although it is necessary to set \(k\), the number of groups, in order to obtain a clustering setup, this value was only used to analyze the clustering setups on different views. We understand that even though \(k\) is an important parameter, its values were ranged and then averaged (see Sect. 6) without affecting the experiments’ results, since the aim of the paper, as mentioned before, is to present some metrics and to demonstrate how they can be used, independently of the clustering setup. Besides, as it will be seen in Sect. 6, almost all the metrics present a low standard deviation, showing a good homogeneity among the results obtained over the values of \(k\) Footnote 5.

As described before, the rules are extracted within each group after clustering the data. The values of the minimum support (min-sup) and minimum confidence (min-conf) have to be set in order to extract a set of association rules. To automate the specification of the min-sup in each group, the following procedure was adopted: (i) find the 1-itemsets of the group with their supports, (ii) compute the average of these supports, (iii) use this average support as the min-sup of the group. This strategy was adopted since some groups can have few transactions and others many transactions, which implies in choosing suitable parameter values to avoid an explosion of rules within a group or the obtainment of an empty rule set. Regarding min-conf, the following values were used for each data set: Adult 50 %; Income 50 %; Groceries 10 %; Sup 100 %. Thus, the same min-conf value was applied in all the groups of a given data set. These values were chosen experimentally. Although it is known that min-sup and min-conf impact on the set of rules that are obtained, it was assumed that the focus was on the use of the metrics and, so, that the values were adequate for the proposed problem (the same argument is also applied to algorithms, similarity measures and \(k\)). The rules were extracted with an Apriori implementation developed by Christian BorgeltFootnote 6 with a minimum of two items and a maximum of five items per rule.

Considering the four clustering setups, the RsP sets were obtained. However, once almost all the metrics are based on the rules obtained through the traditional process, the four data sets were also processed to obtain the RsT sets. For that, the min-sup was set automatically, as described before. Regarding min-conf, the same values used in RsP were considered, i.e., Adult 50 %, Income 50 %, Groceries 10 % and Sup 100 %. Furthermore, as some of the metrics are based on the \(h\)-top interesting rules of a given rule set, an objective measure should be selected. Instead of choosing a specific measure, the average rating obtained through 18 objective measures (see Table 3) was considered as follows: (i) the value of 18 measures was computed for each rule; (ii) each rule received 18 ID’s, each ID corresponding to the rule position in one of the ranks related to a measure; (iii) the average was then calculated based on the rank positions (ID’s). Thus, the \(h\)-top rules were selected considering the best average ratings. \(h\), also a number to be set, was defined, in all the sets (RsT and RsP), to assume 10 % of the total of rules in RsT (always the smallest set). Therefore, each rule set contains its own values that are proportional in all of them.

To finish, as mentioned before, the labeling method applied over the clusters of each obtained clustering was the one presented by [22], named GLM (Genetic Labeling Method). In GLM the labels of each cluster are chosen by optimizing Precision (\(P\)) and Repetition Frequency (\(RF\)), the two measures previous described in Sect. 4. In fact, GLM is a genetic algorithm approach that aims to ensure a good tradeoff between \(P\) and \(RF\). The fitness function of an individual is defined by \(Fitness(I)=(P+RF)-\left( \frac{Max(P,RF)}{Min(P,RF)}*10^{-5}\right) \), where (P+RF) shows how good are the measures according to the labels and \(\left( \frac{Max(P,RF)}{Min(P,RF)}*10^{-5}\right) \) the penalty proportional to the distance between \(P\) and \(RF\). Table 3 summarizes the configurations used to apply the proposed metrics.

Table 3. Configurations used to apply the proposed metrics.

6 Results and Discussion

Considering the configurations presented in Table 3 and the RsT sets above described, the experiments were carried out and the values of each metric obtained. Regarding the first proposed situation, i.e., identify among some clustering setups the most suitable (Sect. 5), an analysis based on the average of each metric was carried out. Table 4 presents the results for each data set. In this case, the metrics will help the users to find a suitable methodology for their problems. In order to aid the comparison of the results, all the metrics that present better results when their values are the smallest (\(M_{R-O-RsP}\), for example) were processed to store the complement of the information. Therefore, all the metric, with exception to \(M_{NR-RsP}\), have the same interpretation: the higher the value the better the performance. Furthermore, all the metrics can be seen in terms of percentage if multiplied by 100 (ex.: 0.807*100 = 80.7 %).

Table 4. Average of the proposed metrics, for each data set, in the considered clustering setups.

Each average in Table 4 was obtained from the results of the experiments related to the presented configuration. The value 0.807 in \(M_{O-RsP}\) at Adult:PAM:Agg, for example, was obtained by the average of the values in \(M_{O-RsP}\) at Adult:PAM:Agg over the values of \(k\). The table also presents, between “[ ]”, the standard deviations; regarding the last example, the standard deviation is \(\pm \)0.02 (0.807 [\(\pm \)0.02]). It can be observed that almost all the metrics present a low standard deviation, showing a good homogeneity among the results obtained over the values of \(k\). The major exception is \(M_{NR-RsP}\), which presents high standard deviation values, since the higher the number \(k\) of groups the higher the number of rules. For each data set, the highest averages are marked with in each metric. The only exception is \(M_{NR-RsP}\), where the lowest averages are highlighted. For Adult, for example, the best average for \(M_{R-O-RsP}\) is the one related to PAM:Agg (0.807). Thereby, it is possible to visualize, for each data set, the most suitable clustering setup. It is important to mention that the results are deterministic and, therefore, no statistical test was done to check if there is a significant difference among the averages. It can be noticed that:

Adult. The most suitable configuration for this data set is PAM:Agg, since it presents better results in 8 of the 11 metrics. Furthermore, it can be noticed that in some cases the values in PAM:Agg are more representative than the others – observe, for example, that while in PAM:Agg \(M_{O-RsP}\) presents a performance above 80 %, the others presents a performance below 60 % (see also \(M_{O-I-N-RsP}\) and \(M_{NR-RsP}\)).

Income. The most suitable configuration for this data set is PAM:Denza, since it presents better results in 4 of the 11 metrics. However, PAM:Agg can also be useful, since it presents better results in 3 of the 11 metrics and the two setups present close values in almost all the metrics. Thus, in this case, the user can choose one of them based on the importance each metric represents to him in the considered problem.

Groceries. The most suitable configuration for this data set is Ward:Agg, since it presents better results in 7 of the 11 metrics (in 5 excluding the ties). Furthermore, it can be noticed that in some cases the values in Ward:Agg are more representative than the others – observe, for example, that while in Ward:Agg \(M_{R-O-RsP}\) presents a performance above 89 %, the others presents a performance below 50 % (see also \(M_{NR-RsP}\) and \(M_{P}\)).

Sup. The most suitable configuration for this data set is Ward:Agg, since it presents better results in 6 of the 11 metrics (in 5 excluding the ties). Furthermore, it can be noticed that in some cases the values in Ward:Agg are more representative than the others (see \(M_{NR-RsP}\) and \(M_{P}\)). However, PAM:Agg can also be useful, since it presents better results in 5 of the 11 metrics (also 5 excluding the ties). Thus, in this case, the user can choose one of them based on the importance each metric represents to him in the considered problem.

Therefore, considering the user selected and used the configurations presented in Table 3, it can be observed that the most suitable clustering setup according to the metrics is PAM:Agg for Adult, PAM:Denza for Income and Ward:Agg for Groceries and Sup. In other words, the user will obtain better results, i.e., reasonable rule sets, if he initially clusters Adult through PAM:Agg, Income through PAM:Denza and Groceries and Sup through Ward:Agg. However, in other situations, different aspects can be of interesting, providing to the user a flexible way to explore his problems: the user can choose the metrics to apply and their better interpretations (lower/higher values (as mentioned in some of the metrics (Sect. 4))). Thus, in this first situation, the metrics provide criteria to carry out comparisons, helping the user to select the most suitable methodology for his problem.

From that point, supposing that PAM:Agg is a suitable solution for the user’s problem regarding Adult, it is possible to analyze the process itself, i.e., to check if good results are really obtained (the same is valid for the other data sets). Observe that different interests may be important for his decision. Thus, the metrics provide criteria not only to analyze the process, but also to identify its positive and negative aspects, helping the user to reach a conclusion. To discuss this second situation, Table 5 presents the values of the metrics, in the selected clustering setup, regarding Adult (although this second situation is only discussed on Adult, the same analysis can be done to the other data sets). These values are the ones presented in Table 4, but in their original scales, since the smaller scales (\(\downarrow \)) were previous converted – the larger scales (\(\uparrow \)) remain the same. The scale, in each metric, is found between “[ ]”. It can be noticed that:

  • \(M_{O-RsP}\): little knowledge is lost during the process, around 20 %, since more than 80 % of the rules in RsT are found in RsP – being a positive aspect.

  • \(M_{R-O-RsP}\): the repetition of “old” rules in RsP is high, around 66 %, indicating that the knowledge, already known, is not in subdomains that express its own associations – being a negative aspect.

  • \(M_{N-RsP}\): most of the rules in RsP are “new”, around 83 %, indicating the discovery of a great amount of knowledge previously unknown – being a positive aspect.

  • \(M_{R-N-RsP}\): the repetition of “new” rules in RsP is low, around 11 %, indicating that the knowledge, previously unknown, is in subdomains that express its own associations – being a positive aspect.

  • \(M_{N-I-RsP}\): nearly half of the \(h\)-top interesting rules in RsP are “new”, around 59 %, indicating that the cost of the process is minimized by the discovery of interesting knowledge, previously unknown, in RsP – being a positive aspect.

  • \(M_{O-I-N-RsP}\): the loss of “old” and interesting knowledge is low, around 16 %, since a great amount of the \(h\)-top interesting rules in RsT are found in RsP, around 84 % \((100\,\%-16\,\%)\) – being a positive aspect.

  • \(M_{C-I}\): the intersection between the \(h\)-top interesting rules in RsP and the \(h\)-top interesting rules in RsT is high, around 78 %, indicating that a great amount of the knowledge already known as interesting in RsT is found in RsP – being a positive aspect.

  • \(M_{NC-I-RsP}\): the number of groups that contain the \(h\)-top interesting rules in RsP is high, around 45 % – being a negative aspect, since many groups would have to be explored once the “new” relevant knowledge of the domain would be spread over the clusters.

  • \(M_{NR-RsP}\): the number of rules in RsP is only around 12 times greater in relation to RsT – being a positive aspect (it can be seen in Table 4 that this ratio can be very high).

  • \(M_{P}\): the selected clustering setup provides a suitable distribution of the domain topics, since the labels represent the rules in the clusters at a ratio around 70 % – being a positive aspect.

  • \(M_{RE}\): the selected clustering setup provides a suitable distribution of the domain topics, since the distinct labels don’t repeat over the clusters at a ratio around 83 % – being a positive aspect.

Table 5. Average of the proposed metrics in Adult, PAM:Agg clustering setup.

As seen, considering the positive and negative aspects of the process, the user can analyze the results, according to his interests, and conclude if good results were reached. It is relevant to mention that the importance of each percentage depends on the user’s needs, on the data sets, etc., and, therefore, the metrics provide a flexible way for him to explore the problems. Regarding the presented context, it can be said that the process obtains reasonable results, since 9 of the 11 aspects were considered positives. However, if the weight of the negative aspects is more important to the user, he can discard the results. Moreover, he can try to improve the process to obtain better results in these metrics, since he has an overview of all the aspects. Thus, in this second situation, the metrics provide criteria to analyze the process based on different interests, identifying its positive and negative aspects, helping the user to reach a conclusion.

Finally, to complement and finalize the discussion, Fig. 11 presents the variation of \(h\) parameter in the four metrics that depend on that value: \(M_{N-I-RsP}\), \(M_{O-I-N-RsP}\), \(M_{C-I}\), \(M_{NC-I-RsP}\). The clustering setup related to each data set is the one above identified as the most suitable. Axis \(x\) is related to \(h\) and axis \(y\) to the metrics’ values. The metrics are represented by the different lines in the graphics. Note that the metrics’ values related to \(h\)=10 % are the same as the ones presented in Table 4 (as before, each metric’s value was obtained by the average of the values in the metric over the values of \(k\)). It can be seen that: (a) \(M_{N-I-RsP}\) [\(\uparrow \)] (blue line) tends to become higher as \(h\) increases or tends to assume a value close to the other \(h\) values; therefore, this metric presents better results with high values of \(h\); (b) \(M_{O-I-N-RsP}\) [\(\downarrow \)] (red line) tends to decrease as \(h\) increases or tends to assume a value close to the other \(h\) values; therefore, this metric presents better results with high values of \(h\); (c) \(M_{C-I}\) [\(\uparrow \)] (green line) does not present a pattern; however, while in the relational data sets the values are close and high regardless the \(h\) value in the transactional ones the values are close and low regardless the \(h\) value; (d) \(M_{NC-I-RsP}\) [\(\downarrow \)] (purple line) tends to decrease as \(h\) increases or tends to assume a value close to the other \(h\) values; therefore, this metric presents better results with high values of \(h\).

Fig. 11.
figure 11

Behavior of the \(h\) parameter in \(M_{N-I-RsP}\), \(M_{O-I-N-RsP}\), \(M_{C-I}\), \(M_{NC-I-RsP}\) in each data set (Color figure online).

7 Conclusion

In this paper, eleven metrics were proposed to provide an assessment procedure in order to support the evaluation of methodologies that use clustering in the pre-processing step. The metrics were developed to answer three main issues. However, to propose the metrics, a subjective evaluation was done with some users to understand each issue. Some experiments were carried out in order to present how the metrics can be used. For that, two different situations were regarded: (i) identify among some clustering setups the most suitable; (ii) analyze the process itself. Through the experiments, it could be noticed that the metrics provide criteria to: (a) analyze the methodologies, (b) identify their positive and negative aspects, (c) carry out comparisons among them and, therefore, (d) help the users to select the most suitable solution for their problems. Besides, the metrics do the users think about aspects to be considered in the presented context and provide to them a flexible way to explore the problems. Finally, this paper complements [12]’s work, since the subjective evaluation is used to attest the importance of the previous obtained results.