
1 Introduction

High Utility Itemset Mining (HUIM) is one of the most effective methods for pattern detection; it considers both the quantity and the profit of an itemset to discover the set of high utility itemsets (HUIs). Several algorithms, such as Two-Phase [1], tree-based [2,3,4], and one-phase [5,6,7,8] approaches, were introduced to find HUIs. The two-phase algorithms search for potential candidate itemsets in the first phase and for HUIs in the second phase; all of them suffer from scalability issues. The tree-based algorithms use a tree structure to produce the candidate itemsets and HUIs. The one-phase algorithms utilize a vertical list structure to find candidates and HUIs in a single phase. Several one-phase algorithms, such as FHM [6], HUP-Miner [7], and EFIM [8], were introduced with efficient pruning strategies. The EFIM algorithm introduces several novel ideas to improve time efficiency and memory consumption compared to HUI-Miner [5]. It presents database projection and transaction merging to reduce the size of the database and the cost of merging. EFIM also uses two upper bounds, the sub-tree utility and the local utility, for efficient pruning, along with a fast utility counting method that uses an array structure to compute the utilities.

The main focus of the existing HUI mining algorithms [9, 10] is to find the set of high utility itemsets in a transactional dataset. However, they do not consider the mutual relationship among the items of the discovered patterns. For example, a high-profit item like gold may appear with a low-profit item like cotton, resulting in a high utility itemset; yet cotton and gold are unrelated and may co-occur purely by chance. Hence, it is important to develop a framework that considers the correlation factor and produces high utility itemsets with inherent correlation. Many correlation measures have been used for pattern mining, e.g., support [11], confidence [12], frequency affinity [13], all-confidence [14], and coherence [14]. The HUIPM [13] and FDHUP [15] algorithms were introduced in utility-based pattern mining to discover HUIs using frequency affinity. Both algorithms use co-occurrence to calculate the correlation measure, which limits their effectiveness. A projection-based algorithm called CoHUIM [16] was introduced, which considers the inherent correlation instead of the co-occurrence frequency. It discovers the set of correlated high utility itemsets using a correlation factor called Kulc [17]. The Kulc measure satisfies the null-transaction invariance property, which ensures that the correlation value is not affected by the number of null transactions in the database. However, the projection-based CoHUIM is inefficient, as it produces a huge number of candidate itemsets, resulting in extra memory consumption and reduced efficiency. Correlated high-Utility Pattern Miner (CoUPM) [18] is a one-phase algorithm that uses the utility-list structure to store the essential information of the itemsets. It also uses the Kulc correlation factor to discover correlated patterns efficiently.

We observed that CoUPM [18] is an extension of HUI-Miner [5], which is computationally expensive for small values of the minimum utility threshold. We propose an algorithm, ECHUM, which builds on the efficient EFIM [8]. Our choice is based on the fact that EFIM is roughly three times faster and consumes about eight times less memory than HUI-Miner. We use the Kulc measure to find the inherent correlation in the discovered patterns. We adopt three major techniques to search for Correlated High Utility Itemsets (CoHUIs). First, database projection and transaction merging are used to reduce the size of the dataset and the cost of merging. Second, the sub-tree utility and local utility upper bounds are used to prune unpromising candidate itemsets. Third, a fast utility counting method that stores utilities in an array is adopted to compute the upper bounds efficiently. We performed comprehensive experiments to compare ECHUM with the existing CoUPM algorithm and observed that ECHUM is consistently faster.

The remainder of the paper is structured as follows. Section 2 discusses the preliminaries. The proposed algorithm is described in Sect. 3, and its performance evaluation is carried out in Sect. 4. We conclude the paper in Sect. 5.

2 Preliminaries

Let D be a dataset consisting of a collection of transactions, where each transaction contains items from a set I. An itemset X with k distinct items is called a k-itemset. If X is a subset of a transaction Tq, then X is said to be contained in that transaction. Examples of a transaction table and a profit table are shown in Tables 1 and 2.

Table 1. Transaction table
Table 2. Profit table

Definition 1.

The utility of an item i in a transaction T, denoted U(i, T), is the product of its profit p(i) and its purchase quantity q(i, T) in that transaction.

$$ U\left( {i,T} \right) = p\left( i \right) \times q(i,T) $$
(1)

The utility value for an itemset X in a transaction T and dataset D is calculated as follows.

$$ U(X,T) = \sum\nolimits_{i \in X \wedge X \subseteq T} {U(i,T)} $$
(2)
$$ U(X) = \sum\nolimits_{X \subseteq T \wedge T \in D} {U(X,T)} $$
(3)

E.g., U(B2, T4) = 4 × 4 = 16; U(A1B2,T6) = 1 × 3 + 1 × 4 = 7; U(A1B2) = (1 × 3 + 1 × 4) + (2 × 3 + 1 × 4) = 17.
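For illustration, the following minimal Java sketch recomputes U(A1B2) = 17 from Eqs. (1)–(3) for the two transactions of the running example that contain {A1, B2}; the map-based encoding of transactions and profits (and the transaction labels) is our own, not part of the formal model.

```java
import java.util.*;

public class UtilityDemo {
    // Profit table p(i): item id -> unit profit (values taken from the running example).
    static Map<String, Integer> profit = Map.of("A1", 3, "B2", 4);

    // u(X, T): sum of p(i) * q(i, T) over the items i of X, if X is contained in T (Eqs. 1, 2).
    static int utility(Set<String> X, Map<String, Integer> T) {
        if (!T.keySet().containsAll(X)) return 0;   // X must be contained in T
        int u = 0;
        for (String i : X) u += profit.get(i) * T.get(i);
        return u;
    }

    public static void main(String[] args) {
        // The two transactions containing {A1, B2}; labels t6/t7 are illustrative.
        Map<String, Integer> t6 = Map.of("A1", 1, "B2", 1);
        Map<String, Integer> t7 = Map.of("A1", 2, "B2", 1);
        Set<String> X = Set.of("A1", "B2");
        // u(X) sums u(X, T) over all transactions containing X (Eq. 3): 7 + 10 = 17.
        System.out.println(utility(X, t6) + utility(X, t7));
    }
}
```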

Definition 2.

The transactional utility associated with a transaction T is represented by TU(T) and calculated as follows.

$$ TU(T) = \sum\nolimits_{i \in T} {u(i,T)} $$
(4)

E.g., TU(T4) = (2 × 5 + 4 × 4) = 26.

Definition 3.

The transaction weighted utility of an itemset X is represented by TWU(X) and calculated as follows.

$$ TWU(X) = \sum\nolimits_{X \subseteq T \wedge T \in D} {TU(T)} $$
(5)

E.g., TWU(C1) = 1 + 12 + 31 + 16 + 17 = 77. The TWU values for the running dataset are depicted in Table 3.

Table 3. Transactional weighted utility

Definition 4 (High Utility Itemset).

If the utility of an itemset X is greater than or equal to the minimum utility threshold minutil, it is said to be a high utility itemset.

$$ HUI = \{ X|u\left( X \right) \ge minutil\} $$
(6)

Definition 5 (Remaining Utility).

Let T/X be the set of items appearing after X in a transaction T. The remaining utility of X in T is denoted as ru(X, T) and calculated as follows.

$$ ru(X,T) = \sum\nolimits_{i \in T/X} {u(i,T)} $$
(7)

E.g., ru(B1) = (4 × 4) + (3 × 4 + 4 × 1) = 32.

Property 1.

If the transaction weighted utility of X, i.e., TWU(X), is smaller than minutil, then X and all its supersets can be pruned.

Definition 6.

The remaining utility upper bound of an itemset X, denoted reu(X), is calculated as follows.

$$ reu\left( X \right) = u\left( X \right) + ru\left( X \right) $$
(8)

E.g., reu(B1) = u(B1) + ru(B1) = 60 + 32 = 92.

Property 2.

If the remaining utility upper bound of X is smaller than minutil, then X and all its supersets can be pruned.

Definition 7.

The correlation value among the items of an itemset determines the strength of their inherent correlation. ECHUM uses the Kulc measure to compute the correlation of an itemset X.

$$ Kulc(X) = \frac{1}{k}\sum\nolimits_{i \in X} {\frac{SUP(X)}{SUP(i)}} $$
(9)

The value of Kulc lies between 0 and 1; higher values indicate a stronger positive correlation among the items.
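As a small illustration of Eq. (9), the following Java sketch computes the Kulc value of an itemset from its support count and the supports of its items; the numeric values are hypothetical (in ECHUM these counts come from the utility-list).

```java
/** Minimal sketch of the Kulc measure (Eq. 9); support values are hypothetical inputs. */
public class KulcMeasure {
    // Kulc(X) = (1/k) * sum_{i in X} sup(X) / sup(i)
    public static double kulc(int supX, int[] supItems) {
        double sum = 0.0;
        for (int supI : supItems) sum += (double) supX / supI;
        return sum / supItems.length;                 // k = |X|
    }

    public static void main(String[] args) {
        // sup(X) = 2 with item supports 4 and 5: (2/4 + 2/5) / 2 = 0.45
        System.out.println(kulc(2, new int[]{4, 5}));
    }
}
```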

2.1 Pruning Strategies

2.1.1 Exploring the Search Space

ECHUM explores the search space recursively in a depth-first manner. Items in the search space are arranged in increasing order of their TWU values, since this order reduces the number of candidates generated.

Definition 8.

The set of extension items of an itemset α is denoted by E(α) and defined as E(α) = {y | y ∈ I ∧ y ≻ x, ∀x ∈ α}. An itemset Z is an extension of α if Z = α ∪ W for some W ∈ 2^{E(α)} with W ≠ Ø, and Z is a single-item extension of α if Z = α ∪ {x} for some x ∈ E(α).

E.g., the single-item extensions of itemset {A3} are {A3, B2}, {A3, A1}, {A3, C1}, etc. Other extensions are {A3, B2, B1}, {A3, A1, C1}, etc.

2.1.2 Cost Reduction Using Projections

Definition 9.

When calculating the utility of an itemset α, the items that do not belong to E(α) can be disregarded. The database restricted to the remaining items is called the projected database. The projection of a transaction T is denoted by α − T and equals {i | i ∈ T ∧ i ∈ E(α)}. The projection of the database is denoted by α − D and equals {α − T | T ∈ D ∧ α − T ≠ Ø}.

E.g., if α = {B2}, then the projected database consists of 3 transactions, i.e., T1{A2, A2, B2}, T4{B1, B2}, and T7{B1, B2, C1}.

ECHUM uses database projection to reduce the database size. Projection is carried out efficiently as pseudo-projection by sorting the database according to the total order; the method is explained in EFIM [8].
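The following copy-based Java sketch illustrates the effect of Definition 9; it materializes the projected transactions, whereas EFIM's pseudo-projection only records an offset per transaction. The `project` helper and the integer item encoding are our own assumptions.

```java
import java.util.*;

/** Copy-based sketch of Definition 9; EFIM's pseudo-projection stores offsets instead. */
public class Projection {
    // Keep only the items of each transaction that belong to E(alpha); drop empty results.
    public static List<int[]> project(List<int[]> db, Set<Integer> extAlpha) {
        List<int[]> out = new ArrayList<>();
        for (int[] t : db) {
            int[] kept = Arrays.stream(t).filter(extAlpha::contains).toArray();
            if (kept.length > 0) out.add(kept);
        }
        return out;
    }
}
```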

2.1.3 Cost Reduction by Transaction Merging

Identical transactions are merged into a single transaction. A transaction is a duplicate of another if both contain exactly the same items, although the quantities of the items may differ.

Definition 10.

Transaction merging is applied to the projected database α − D: all duplicate projected transactions are combined into a single transaction. Given a transactional database D, if the transactions are read from end to start and sorted in lexicographical order, the resulting order is known as the total order, denoted by ≻_T.

Transaction merging combines identical transactions efficiently; after applying it to the projected database, the number of transactions is drastically reduced. The shorter the transactions, the greater the probability of finding matches. To merge identical transactions efficiently, the database is sorted according to the ≻_T order, using the property introduced in EFIM [8]: transactions are first sorted lexicographically and then neighbouring transactions are compared.

Property 3.

In a transactional database sorted according to the ≻_T order, duplicate transactions appear consecutively. All duplicates can therefore be removed in a single pass by comparing each transaction with the next one.
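A minimal Java sketch of this single-pass merging, assuming the transactions are already sorted so that duplicates are adjacent; the `Txn` record and the parallel-array quantity layout are our own encoding.

```java
import java.util.*;

/** Sketch of Property 3: after sorting by the total order, duplicates are adjacent. */
public class TransactionMerging {
    // Each transaction: sorted items[] plus parallel quantities[].
    record Txn(int[] items, int[] qty) {}

    public static List<Txn> merge(List<Txn> sorted) {
        List<Txn> out = new ArrayList<>();
        for (Txn t : sorted) {
            Txn last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && Arrays.equals(last.items(), t.items())) {
                // Same item set: add quantities item-wise into the merged transaction.
                for (int k = 0; k < t.qty().length; k++) last.qty()[k] += t.qty()[k];
            } else {
                out.add(new Txn(t.items().clone(), t.qty().clone()));
            }
        }
        return out;
    }
}
```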

2.1.4 Pruning Using Upper Bounds

Definition 11.

Consider an itemset α and an item y ∈ E(α) that extends it. The sub-tree utility of y with respect to α is denoted su(α, y) and calculated as follows.

$$ su(\alpha ,y) = \sum\nolimits_{T \in g(\alpha \cup \{ y\} )} {[u(\alpha ,T) + u(y,T) + \sum\nolimits_{i \in T \wedge i \in E(\alpha \cup \{ y\} )} {u(i,T)} ]} $$
(10)

E.g., if α = {A3} and y = {B2}, then su(α, y) = (1 + 8) + (2 + 4 + 3) = 18.

Property 4.

Consider an itemset α and an item y ∈ E(α). The utility of the single-item extension α ∪ {y}, and of any of its extensions, is at most the sub-tree utility. Mathematically, su(α, y) ≥ u(Y), where Y is α ∪ {y} or any extension of α ∪ {y}.

Pruning Method 1.

If the sub-tree utility su(α, y) is smaller than minutil, then the single-item extension α ∪ {y} and all its extensions cannot be HUIs, and they are therefore pruned.

Definition 12.

Consider an itemset α and an item y ∈ E(α). The local utility of y with respect to α is denoted lu(α, y) and calculated as follows.

$$ lu(\alpha ,y) = \sum\nolimits_{T \in g(\alpha \cup \{ y\} )} {[u(\alpha ,T) + ru(\alpha ,T)]} $$
(11)

Property 5.

For an itemset α and an item y ∈ E(α), the local utility is always greater than or equal to the utility of any extension of α containing y. Mathematically, lu(α, y) ≥ u(Y), where Y is any extension of α containing y.

Pruning Method 2.

For an itemset α, if the local utility lu(α, y) is smaller than minutil, then no extension of α containing y can be a high utility itemset; therefore, y is pruned from all extensions of α.

Property 6.

For any itemset α and any item y ∈ E(α), the following relation always holds.

$$ TWU\left( \alpha \right) \ge lu\left( {\alpha ,y} \right) \ge su\left( {\alpha ,y} \right) $$
(12)

Definition 13.

The primary items of an itemset α are the extension items of α whose sub-tree utility is at least minutil. The secondary items of α are the extension items of α whose local utility is at least minutil. Mathematically, Primary(α) = {y | y ∈ E(α) ∧ su(α, y) ≥ minutil} and Secondary(α) = {y | y ∈ E(α) ∧ lu(α, y) ≥ minutil}. By Property 6, the primary items are a subset of the secondary items.
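Assuming the sub-tree (or local) utilities have already been accumulated, e.g., in a map from item to value, Primary(α) and Secondary(α) reduce to a simple threshold filter, as in this sketch:

```java
import java.util.*;

/** Sketch of Definition 13, assuming su and lu were computed beforehand (e.g., via utility bins). */
public class PromisingItems {
    public static List<Integer> primary(Map<Integer, Long> su, long minutil) {
        List<Integer> out = new ArrayList<>();
        for (var e : su.entrySet())
            if (e.getValue() >= minutil) out.add(e.getKey());   // keep promising extensions
        return out;
    }
    // Secondary(alpha) is obtained the same way from the local-utility map lu.
}
```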

2.1.5 Pruning Using Correlation Properties

Property 7.

In a transactional database where the items of each transaction are arranged according to their support values, the Kulc measure satisfies the following sorted downward closure property.

$$ Kulc\left( {i_1 ,i_2 , \ldots ,i_k ,i_{k + 1} } \right) \le Kulc\left( {i_1 ,i_2 , \ldots ,i_k } \right) $$
(13)

Pruning Method 3.

Consider a k-itemset Xk whose Kulc value is smaller than minCor. By Property 7, the Kulc value of any extension of Xk in the sorted order is at most Kulc(Xk); hence none of these supersets can be a CoHUI, and they are all pruned.

2.2 Improving Efficiency Using Fast Utility Counting

The previous section presented two utility-based pruning methods to eliminate unpromising itemsets. To compute the utility upper bounds efficiently, EFIM introduced a technique called Fast Utility Counting (FUC), which stores utilities in an array data structure called a utility-bin [19]. Let D be a dataset containing |I| items, and let U[y] denote the utility-bin entry of an item y ∈ I. The array is initialized to 0. The upper bounds are calculated using the utility-bin array as follows.

TWU Upper-Bound.

The TWU upper bound is calculated using the utility-bin array as follows. Initially, all values in the array are set to 0. Then, proceeding transaction by transaction through the database, the entry U[y] of every item y contained in the current transaction T is increased by the transaction utility TU(T), i.e., U[y] = U[y] + TU(T). At the end of the scan, U[y] equals TWU({y}) for every item y.
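A minimal Java sketch of this single scan, assuming each transaction is encoded as two parallel arrays of item ids and quantities (our own layout, not the authors'):

```java
import java.util.*;

/** Sketch of TWU computation with a utility-bin array (one pass over the database). */
public class TwuBins {
    // db: each transaction as {items[], quantities[]}; profit[i] = p(i).
    public static long[] twu(List<int[][]> db, int[] profit, int numItems) {
        long[] bin = new long[numItems];          // U[y], initialized to 0
        for (int[][] t : db) {
            int[] items = t[0], qty = t[1];
            long tu = 0;                          // TU(T) = sum of u(i, T)
            for (int k = 0; k < items.length; k++) tu += (long) profit[items[k]] * qty[k];
            for (int item : items) bin[item] += tu;   // U[y] = U[y] + TU(T)
        }
        return bin;                               // bin[y] now equals TWU({y})
    }
}
```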

Sub-tree Utility Upper-Bound.

The sub-tree utility upper bound associated with an itemset α is calculated using the utility bins as follows. Proceeding through the projected dataset transaction by transaction, the entry U[p] of every extension item p contained in the current transaction T is updated as follows.

$$ U\left[ p \right] = U\left[ p \right] + u\left( {p,T} \right) + u\left( {\alpha ,T} \right) + \sum\nolimits_{i \in T \wedge i \succ p} {u(i,T)} $$
(14)

where p is an extension item of α in transaction T. At the end of the scan, each utility-bin value U[p] equals the sub-tree utility su(α, p).

Local-Utility Upper-Bound.

For the local utility upper bound associated with an itemset α, the dataset is processed transaction by transaction, and the entry U[p] of every extension item p contained in the current transaction T is increased by the local-utility contribution of that transaction.

$$ U\left[ p \right] = U\left[ p \right] + u\left( {\alpha ,T} \right) + ru\left( {\alpha ,T} \right) $$
(15)

where p is an extension item of α in transaction T. At the end of the scan, each utility-bin value U[p] equals the local utility lu(α, p).
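The following Java sketch applies Eqs. (14) and (15) to one projected transaction of α; it assumes the transaction holds only items of E(α) sorted by the processing order, and the parallel-array layout and suffix-sum trick for the inner summation are our own implementation choices.

```java
/** Sketch of Eqs. (14) and (15) for one projected transaction of alpha. */
public class SuLuBins {
    public static void update(long[] suBin, long[] luBin, int[] items, int[] qty,
                              int[] profit, long uAlpha) {      // uAlpha = u(alpha, T)
        // suffix[k] = sum of u(i, T) over items strictly after position k.
        long[] suffix = new long[items.length + 1];
        for (int k = items.length - 1; k >= 0; k--)
            suffix[k] = suffix[k + 1] + (long) profit[items[k]] * qty[k];
        long ruAlpha = suffix[0];                               // ru(alpha, T)
        for (int k = 0; k < items.length; k++) {
            long uItem = (long) profit[items[k]] * qty[k];
            suBin[items[k]] += uAlpha + uItem + suffix[k + 1];  // Eq. (14)
            luBin[items[k]] += uAlpha + ruAlpha;                // Eq. (15)
        }
    }
}
```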

2.3 Problem Statement

An itemset X is called a CoHUI if u(X) is at least minutil and Kulc(X) is at least the minimum correlation threshold minCor.

$$ {\text{CoHUI}} = \{ X \mid u\left( X \right) \ge minutil \wedge Kulc\left( X \right) \ge minCor\} $$
(16)

3 ECHUM Algorithm Design

The ECHUM algorithm combines the two algorithms EFIM and CoHUIM. It uses an updated utility-list that follows the utility-list construction of EFIM and additionally stores the support value of each item, as used in CoHUIM. The algorithm takes as input the transactional database, the utility table, the minimum utility threshold minutil, and the minimum correlation threshold minCor. Initially, the itemset X and its support values are initialized to Ø. The local utility of every item is calculated with the utility-bin array by scanning the database, and the secondary items are derived from these local utilities. The secondary items are then sorted in increasing order of their TWU values. The database is scanned again to remove the items that do not appear among the secondary items, and empty transactions are deleted. Finally, D is sorted once more according to the ≻_T order. The sub-tree utility su(X, i) of every secondary item i is then calculated using the utility-bin array, and the primary items are derived from these values. The primary items, secondary items, minutil, minCor, and database D are then passed to the Explore procedure. Algorithm 1 depicts the pseudo-code of ECHUM; a hedged Java-style reconstruction of this flow is sketched below.
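In this sketch, every helper name (localUtilityBins, keepAtLeast, byIncreasingTwu, subtreeUtilityBins, and the Database type) is a placeholder of ours, not the authors' API; it mirrors the prose above rather than reproducing Algorithm 1 verbatim.

```java
// Hedged reconstruction of ECHUM's main procedure; helper names are placeholders.
void echum(Database db, long minutil, double minCor) {
    long[] lu = localUtilityBins(db);              // scan 1: lu(Ø, i), i.e., TWU(i)
    List<Integer> secondary = keepAtLeast(lu, minutil);
    secondary.sort(byIncreasingTwu(lu));           // process items in increasing TWU order
    db.removeItemsNotIn(secondary);                // scan 2: shrink transactions
    db.dropEmptyTransactions();
    db.sortByTotalOrder();                         // the ≻_T order of Property 3
    long[] su = subtreeUtilityBins(db);            // scan 3: su(Ø, i)
    List<Integer> primary = keepAtLeast(su, minutil);
    explore(Set.of(), db, primary, secondary, minutil, minCor);
}
```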

The Explore procedure takes as input an itemset X, its projected database X − D, the primary items Primary(X), the secondary items Secondary(X), and the user-defined minutil and minCor thresholds. For every item i in Primary(X), an itemset Y is created as the single-item extension of X, and the support count in the utility-list of Y is incremented. The projected database X − D is then scanned to compute the utility of Y and to create the projection Y − D. If the utility of Y is at least minutil and its Kulc value is at least minCor, then Y is reported as a CoHUI. Next, the sub-tree utility su(Y, y) and the local utility lu(Y, y) are calculated for every item y in Secondary(X) by scanning the projected database Y − D, and Primary(Y) and Secondary(Y) are derived from Secondary(X) by applying Properties 4, 5, and 6. Finally, Explore is called recursively to search the extensions of Y. The output is accumulated in the set of CoHUIs; a sketch of this recursion follows.
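Again, all helper names (projectAndMerge, itemSupports, keepAtLeast, and the Projected type) are placeholders of ours rather than the authors' API; the Kulc-based skip corresponds to Pruning Method 3.

```java
// Hedged sketch of the Explore recursion; helper names are placeholders.
void explore(Set<Integer> X, Database xdb, List<Integer> primary,
             List<Integer> secondary, long minutil, double minCor) {
    for (int i : primary) {
        Set<Integer> Y = union(X, i);                    // single-item extension of X
        Projected ydb = xdb.projectAndMerge(i);          // scan X-D: builds Y-D, u(Y), sup(Y)
        double cor = kulc(ydb.support(), itemSupports(Y));
        if (ydb.utility() >= minutil && cor >= minCor)
            output(Y);                                   // Y is a correlated HUI
        if (cor < minCor) continue;                      // Pruning Method 3: no superset qualifies
        long[] su = ydb.subtreeUtilityBins(secondary);   // Eq. (14) on Y-D
        long[] lu = ydb.localUtilityBins(secondary);     // Eq. (15) on Y-D
        explore(Y, ydb.database(),
                keepAtLeast(su, secondary, minutil),     // Primary(Y), Pruning Method 1
                keepAtLeast(lu, secondary, minutil),     // Secondary(Y), Pruning Method 2
                minutil, minCor);
    }
}
```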

4 Experimental Analysis

We compare the ECHUM algorithm with CoUPM, analyzing execution time and memory usage while varying minutil and minCor. The experiments were carried out on an Ubuntu machine with a 2.7 GHz 6th-generation Intel i5 processor and 8 GB of RAM, and both algorithms were implemented in Java. Four datasets collected from the SPMF library [20] were used to test the algorithms; their properties are given in Table 4.

Table 4. Dataset properties

4.1 Runtime Comparison

The runtime comparisons are shown in Figs. 1 and 2. In Fig. 1, the minutil for foodmart, chess, mushroom, and retail is kept fixed at 0.007%, 20%, 8%, and 0.12%, respectively. The minCor value is varied from 0.01 to 0.06 for foodmart, from 0.74 to 0.79 for chess, from 0.4 to 0.5 for mushroom, and from 0.1 to 0.2 for retail. For almost all values, ECHUM runs faster than the existing CoUPM algorithm; as minCor increases, the runtime of each algorithm remains nearly constant. In Fig. 2, the minCor for foodmart, chess, mushroom, and retail is kept fixed at 0.03, 0.76, 0.42, and 0.1, respectively. The minutil value is varied from 0.006% to 0.011% for foodmart, from 17 to 22 for chess, from 8 to 13 for mushroom, and from 0.01 to 0.015 for retail. Again, ECHUM runs faster than CoUPM for almost all values, and the runtime of both algorithms decreases as minutil increases.

When minCor is fixed and minutil is increased, the time taken by both CoUPM and ECHUM decreases non-linearly. For the foodmart dataset, the time taken by the two algorithms is almost similar, though slightly lower for ECHUM. For the chess and mushroom datasets, there is a drastic time difference between the two algorithms, with ECHUM being the faster one. For the retail dataset, the time difference is the largest: CoUPM generates more unpromising candidates because of the huge number of transactions and the many distinct items, and these additional candidates also require extra memory. ECHUM outperforms CoUPM in all cases because its efficient pruning keeps the number of generated candidates very low.

Fig. 1. Runtime comparison for fixed minutil and variable minCor

4.2 Memory Comparison

The memory consumption of the two algorithms, ECHUM and CoUPM, is shown in Figs. 3 and 4. In Fig. 3, the minutil for foodmart, chess, mushroom, and retail is kept fixed at 0.007%, 20%, 8%, and 0.12%, respectively. The minCor value is varied from 0.01 to 0.06 for foodmart, from 0.74 to 0.79 for chess, from 0.4 to 0.5 for mushroom, and from 0.1 to 0.2 for retail. ECHUM consumes more memory on some datasets and less on others, but its overall memory consumption is lower than that of CoUPM.

Fig. 2. Runtime comparison for fixed minCor and variable minutil

Fig. 3. Memory consumption comparison for fixed minutil and variable minCor

For the foodmart results (the first graph in Fig. 3), ECHUM's memory consumption is slightly higher than CoUPM's: around 100 MB on average for ECHUM versus around 90 MB for CoUPM. On the chess dataset, ECHUM uses less than half the memory of CoUPM in all cases: on average 120 MB versus 230 MB. ECHUM performs best on the mushroom dataset, where it is almost five times more memory-efficient than CoUPM. For the retail dataset, memory consumption is similar for both algorithms, though slightly lower for ECHUM.

Fig. 4. Memory consumption comparison for fixed minCor and variable minutil

In Fig. 4, the minCor for foodmart, chess, mushroom, and retail is kept fixed at 0.03, 0.76, 0.42, and 0.1, respectively. The minutil value is varied from 0.006% to 0.011% for foodmart, from 17 to 22 for chess, from 8 to 13 for mushroom, and from 0.01 to 0.015 for retail. The first graph in Fig. 4 shows the foodmart dataset, for which ECHUM consumes more memory than CoUPM at every minutil value: on average 115 MB versus 90 MB for CoUPM. For the chess dataset, ECHUM consumes less memory than CoUPM in all cases. ECHUM works best on the mushroom dataset, where its average memory consumption is almost five times lower than CoUPM's. For the retail dataset, both algorithms consume a lot of memory because of the large number of transactions, but ECHUM again consumes less than CoUPM. Overall, ECHUM consumes less memory than CoUPM.

The runtime and memory consumption graphs show that ECHUM is roughly two to three times faster than CoUPM and takes nearly 50% less memory. The pruning methods help to reduce the search space, and the improved utility-list structure minimizes memory usage.

5 Conclusion and Future Scope

This paper presents ECHUM, an efficient algorithm that reduces the time and space required to search for CoHUIs. The key idea is to combine the two algorithms EFIM and CoHUIM so that the overall complexity is reduced. ECHUM uses two utility upper bounds, the sub-tree utility and the local utility, together with fast utility counting, which improves efficiency. It also uses database projection and transaction merging, which reduce the dataset size and further enhance efficiency. In future work, we plan to develop tighter upper bounds to further reduce the search space.