Keywords

1 Introduction

Pattern mining is an active subfield of data mining used to find interesting knowledge patterns in a large database, where mined rules are used for decision support. The Apriori algorithm [1, 2] considers item frequencies in a binary database, but the number of items sold or their importance is ignored. Utility mining (UM) was thus proposed [3], where items in a database include the purchase quantities and relative importance indicating unit profits or weights. Its goal is to find high-utility itemsets, which indicate potential importance; however, the downward closure (DC) property does not hold in UM. Two-stage mining is used to improve mining efficiency [4]. In the UM mining process, larger itemsets in a transaction tend to have a greater utility value than that of their sub-itemsets. Hence using the same threshold to evaluate itemsets, regardless of their length, is an unfair strategy. The average-utility mining algorithm is used to normalize itemset lengths [6]. In contrast to UM, FUM jointly considers the characteristics of UM and fuzzy reasoning to identify high fuzzy utility itemsets and handle quantity information better via its transformed linguistic terms [5]. However, as with UM, the DC property does not hold for FUM either. Thus, two-stage [9] and two-stage tree-structure [8] based approaches were proposed to find desired itemsets using an average-utility measurement. However, their two-stage nature makes these methods computationally expensive. To efficiently extract fuzzy average-utility itemsets, a single-stage tree structure method based on FP-tree [7] is proposed, in which each node in the tree has an external array list to store mined information. Experiments demonstrate improved execution times with respect to [8]. However, adding list information for each node also increases memory consumption compared to [8].

2 Related Work

The frequent itemset mining, named Apriori [1, 2], is used to find knowledge patterns in which their frequency counting is executed by scanning multiple databases. To account for the performance, the FP-Growth [7] is then proposed by applying tree structure to store mined information, reducing the database scans. The Apriori is useful, but it does not take into account item quantities or unit profits for items. To overcome this, relative importance based on profit and item quantity is considered using utility mining [3]. The utility values of itemsets are used to evaluate whether they are useful. One phenomenon of UM is that since the utility value of an itemset in a transaction may be larger than those of its subsets, it is unfair to use the same threshold to determine different itemsets. Average-utility measurement accounts for those [6, 11,12,13]. FUM [5] is superior to UM in that it efficiently explains quantitative information. By using the membership function of items, item quantities are transformed into fuzzy terms where they possess semantic meaning in item amounts. The FUM process derives actual itemsets with their fuzzy utility values satisfying the threshold along with the quantitative values of items, profits, and semantic meaning in item amounts. However, FUM shares the limitation of UM: the actual value for a larger itemset may be higher than that of a smaller itemset. Two fuzzy average-utility methods for FUM have thus been proposed [8, 9]. An over-estimation model is used to avoid information loss and a two-stage algorithm with this model is designed for efficient mining [9]. To improve the efficiency in [9], Hong et al. consider a two-stage tree-based method [8]. Here, we propose an alternative with shorter execution times than [8]. An external list containing mined information is embedded within each node in the tree, performing for single-stage operation.

3 Definition

Let D be a transaction database, and the items in D are represented as I = {i1, i2, …, iQ}, where each item in has its own profit, denoted as p(in). The database is the set of transactions denoted as D = {t1, t2, …, tP}. Each tm in D contains purchased item in with quantities vmn. A set of membership functions (MFs) is given in advance, which represents the membership degree of each item. Given the MF for an item, each quantity value vmn in D is converted into a fuzzy set \({f}_{mn}=(\frac{{f}_{mn1}}{{R}_{n1}}+\frac{{f}_{mn2}}{{R}_{n2}}+\dots +\frac{{f}_{mnl}}{{R}_{nl}}+\frac{{f}_{mnh}}{{R}_{nh}})\), where h is the number of membership functions for in, Rnl is the l-th fuzzy term of in, and fmnl is the fuzzy membership value of vmn in Rnl.

Shown in Table 1 is a transaction database that contains the items and the item quantities for each transaction. Table 2 is the utility table, which records the unit profit of each item. The membership functions are shown in Fig. 1, where we assume that the MFs of all the items are the same. We use the MFs to divide the quantities into fuzzy regions L, M, and H. The above information is used as an example of the definition.

Table 1. A transaction database
Table 2. A utility table
Fig. 1.
figure 1

Membership functions

Definition 1.

The fuzzy average utility of a fuzzy item Rnl in in in tm is \({fau}_{mnl}={f}_{mnl}*{v}_{mn}*p\left({i}_{n}\right)\), where vmn is the quantity of in in tm, fmnl is the fuzzy value of Rnl according to the MF of in, and p(in) is the individual profit for in. According to the MF in Fig. 1, {A} with quantity 4 in t2 in Table 1 is converted to {0.33/A.L, 0.67/A.M}, yielding a fau value of 0.67 * 4 * 5 (= 13.4) for {A.M}. All fuzzy items from Table 1 are calculated and shown in Table 3.

Definition 2.

The fuzzy average utility of each fuzzy itemset S in tm is \({fau}_{mS}=\frac{1}{|S|}*{f}_{mS}*\sum_{{R}_{nl}\in S}\left[{v}_{mn}*p\left({i}_{n}\right)\right]\), where |S| is the number of Rnl and fmS is the minimum fuzzy value for Rnl, where \({R}_{nl}\in S\). Take {A.L, B.M} in t1 as an example. According to the MF in Fig. 1, its integrated fuzzy value is min{0.67, 1}, which is 0.67. Thus, its fau1,{A.L, B.M} is \(\frac{1}{2}*0.67*\left(2*5+6*6\right)=15.41\).

Table 3. Fuzzy average utility values

Definition 3.

The actual fuzzy average utility of each fuzzy itemset S in D is expressed as \({afau}_{S}=\sum_{S\subseteq {t}_{m}\cap {t}_{m}\in D}{fau}_{mS}\). For example, the afau{A.L, B.M} in D is fau1,{A.L, B.M} + fau2,{A.L, B.M} = 0.5 * 0.67 * (2 * 5 + 6 * 6) + 0.5 * 0.67 * (4 * 5 + 5 * 6) = 32.16.

Definition 4.

Let MinFAUtil be the given threshold. A fuzzy itemset S is considered a high fuzzy average-utility itemset HFAUI and afauS ≥ MinFAUtil holds. Let MinFAUtil = 30. Since the afau{A.L, B.M} is 32.16, {A.L, B.M} is an HFAUI. However, afau{A.L} is 0.67 * 10 + 0.67 * 20 = 20.1, so {A.L} is not a HFAUI, because the DC in fuzzy average-utility mining does not hold.

To take this into account, we use the over-estimation model [8] for fuzzy average-utility mining. The definitions for this model are given below.

Definition 5.

The maximum fuzzy average utility of an item in in tm is \({mfau}_{mn}=\underset{{R}_{\mathit{nl}}\subseteq {i}_{n}\cap {i}_{n}\in {t}_{m}}{\mathit{max}}{\{fau}_{mn1},{fau}_{mn2},\dots ,{ fau}_{mnh}\}\). For example, the mfauA in t2 is 13.33.

Definition 6.

The maximum transaction fuzzy average utility in tm is \({mtfau}_{m}=\underset{{i}_{n}\subseteq {t}_{m}}{\mathit{max}}{mfau}_{mn}\). For example, the mtfau2 is 20.

Definition 7.

The fuzzy average-utility upper bound of a fuzzy itemset S is \({fauub}_{S}=\sum_{S\subseteq {t}_{m}\cap {t}_{m}\in D}{mtfau}_{m}\). Since {A.L} exists in t1 and t2, its fauub{A.L} is 56.

Definition 8.

The fuzzy itemset S is considered the high fuzzy average-utility upper-bound itemset HFAUUBI and fauubS ≥ MinFAUtil holds. For example, fauub{A.L} is 56, which is greater than MinFAUtil, so {A.L} is an HFAUUBI.

4 Proposed FHFAUIM Algorithm

This algorithm, called Fast High Fuzzy Average-Utility Itemset Mining (FHFAUIM), enhances the performance for fuzzy average-utility mining compared to High Fuzzy Average-Utility Itemset Mining (HFAUIM) [8]. An external list that stores fuzzy item’s transaction ID, fuzzy value, and utility value is added to the tree node. Thus, the mined process can be performed directly in a single phase. Below we list the steps of the algorithm:

Step 1. Based on the MFs for all items, convert the quantities in D into a fuzzy set.

Step 2. Calculate the mfau value of each item in each transaction.

Step 3. Find the mtfau value of each transaction.

Step 4. Initialize the candidate 1-table (HFAUUBI1) table into an empty table with three attributes: fuzzy itemset S, its fauubS value, and its frequency.

Step 5. Store fuzzy items in D into the HFAUUBI1 table and get the fauubS for each.

Step 6. Filter each fuzzy itemset S in the HFAUUBI1 table: if fauubS is not less than MinFAUtil, keep it in the table; otherwise, remove it.

Step 7. Calculate the frequency of the fuzzy items in the HFAUUBI1 table. Sort all fuzzy items in the table by decreasing frequency; this is the header table.

Step 8. Trim fuzzy items in D that do not appear in the HFAUUBI1 table as UD.

Step 9. Build a tree structure similar to an FP-tree. Each node in the tree stores a fuzzy item and its mtfau value. In addition, each node contains an external list that stores the identifier transaction, the fuzzy value of the fuzzy item, and its utility value. According to the UD, each fuzzy item in a transaction is inserted into the tree structure from the first transaction to the end, one by one.

Step 10. The HFAUUBI1 table is considered the header table. All fuzzy items in the HFAUUBI1 table are directed to the nodes of the tree’s corresponding fuzzy items.

Step 11. After completing the tree structure, find HFAUIs. First, each fuzzy item in the HFAUUBI1 table is used to establish its conditional FP-tree by traversing the tree from the bottom up. After going through the conditional FP-tree with each fuzzy item’s node, the afau values of the fuzzy itemsets are calculated using the nodes’ external lists for fuzzy itemsets. If their afau \(\ge \) MinFAUtil, they are considered HFAUIs.

Step 12. Output all HFAUIs.

5 Experiments

We compared the previous HFAUIM [8] with the proposed FHFAUIM on the test datasets, T25I2N1KD10K and T24I2N1KD10K [10]. Two methods were implemented in Java, and experiments were conducted on a computer with an Intel CPU at 3.00GHz and 8GB of RAM. Various thresholds were used to evaluate the performance of the two methods, with the execution time and the memory consumption results shown in Figs. 2 and 3. Moreover, to evaluate the execution time, the FHFAUIM uses a single-stage strategy to reduce the number of candidates generated compared to HFAUIM. Execution times decrease as MinFAUtil is increased. Also, when MinFAUtil = 0.01, the single-stage strategy in FHFAUIM generally yields significantly reduced computation times in comparison with HFAUIM. Therefore, the maximum efficiency improvement rate of execution time is 95.39%. In memory strategy, given different thresholds: the memory usage of HFAUIM is less than that of FHFAUIM, because FHFAUIM accelerates the runtimes by using an external list for each node to store mined data, which requires extra memory.

Fig. 2.
figure 2

Execution times and memory consumption in database T25I2N1KD10K

Fig. 3.
figure 3

Execution times and memory consumption in database T24I2N1KD10K

6 Conclusion

We propose a fast method for mining fuzzy average-utility itemsets. The proposed algorithm integrates a single-stage strategy with a tree structure to reduce the search space by storing information in node-level external lists. Experimental results show that the method requires far less computation time than the previous approach [8].