1 Introduction

Over the past decades, many approaches have been extensively studied to mine interesting patterns, such as FIM (Frequent Itemset Mining) [1] and SPM (Sequential Pattern Mining) [2]. FIM extracts the patterns whose frequency or support is no less than a user-specified minimum threshold in a database. Traditional FIM algorithms treat each item in the database equally, considering only whether the item appears in a transaction or not: the quantity is either 0 or 1, and the utility of each item is always 1. However, in most applications, frequent itemsets account for only a small fraction of the total profit, while infrequent itemsets generate a larger portion of it. Hence, retail businesses do not obtain significant benefits from these methods.

To overcome the limitations of FIM algorithms, various HUIM (High Utility Itemset Mining) algorithms [3, 4] have been proposed that measure the usefulness (or interestingness) of items. The objective of HUIM is to mine HUIs (High Utility Itemsets), i.e., itemsets that generate high profit. Utility is defined as a measure of the usefulness of an itemset. The utility value of an item in a transaction database consists of two parts: the item profit (external utility) and the quantity of the item in a transaction (internal utility). The utility of an item is its external utility multiplied by its internal utility, and the utility of an itemset is the sum of the utilities of its items. HUI mining facilitates crucial business decisions to maximize revenue, minimize marketing expenditure, and reduce inventory.

HUIM algorithms have various applications, for example, market basket analysis [5], web click-stream analysis [6], web mining [7], cross-marketing [8], gene regulation [9], mobile commerce environments [10], e-commerce [11], etc. However, mining HUIs is a challenging and complex task because the utility measure does not satisfy the DCP (Downward Closure Property): a superset of a low utility itemset may be a high utility itemset. Hence, it is difficult to prune the search space of the mining process. A brute-force method that enumerates all itemsets of the database (i.e., exhaustive search) can solve the problem, but it obviously suffers from combinatorial explosion, especially for databases containing many long transactions or for low minimum utility thresholds. Therefore, effectively pruning the search space while capturing all HUIs without misses are crucial aspects of HUI mining.

To resolve this problem, Liu et al. [12] proposed the Two-Phase algorithm, which introduced a TWU (Transaction Weighted Utilization) based anti-monotonic property. Various algorithms were then developed on the basis of the TWU model [13]. However, these algorithms generate numerous candidates and need multiple database scans. To address these issues, tree-based HUIM algorithms [14, 15] have been proposed that avoid the excessive candidate generation-and-test strategy when mining HUIs. However, they still produce too many candidates and require more than one database scan. To overcome these issues, utility-list-based algorithms [16,17,18] have been proposed that avoid excessive candidate generation and multiple database scans. However, they suffer from costly join operations and inefficient memory usage. To address these issues, projection-based algorithms [13, 19] have been presented to enhance the efficiency of the mining process. Finally, to provide a concise and compact representation of HUIs, closed HUIM algorithms [20] have been proposed to overcome the challenges of conventional HUIM approaches. Closed HUIM approaches mine only those high utility itemsets that have no proper superset with the same support count. However, they sometimes consume a large amount of memory.

This article presents a survey of HUIM algorithms for transactional databases. The survey emphasizes classifying the existing literature, developing a perspective on the area, and evaluating trends. As shown in Fig. 1, interest in HUIM has grown rapidly in recent years in terms of the number of research papers published. Figure 1 covers research articles published in the area of HUIM at large, including HUIM from on-shelf, sequential, uncertain, transactional, and dynamic databases, among others. This survey, however, presents a detailed analysis of HUIM approaches for transactional databases.

Fig. 1

Year-wise number of papers published for “High Utility Itemsets Mining”. The year-wise publication data were obtained from Google Scholar, searching for the exact phrase “utility pattern” together with at least one of the terms itemset, utility, pattern, or sequential

Many surveys [21,22,23,24,25,26,27,28] exist on HUI mining. In the literature, only three articles [22, 23, 28] present a comprehensive review of HUI mining algorithms. The major contributions of the existing state-of-the-art surveys are discussed below:

  • In 2019, Fournier-Viger et al. [21] presented a book on HUIM whose chapters cover different ways of finding HUIs across various types of data, such as transactional, incremental, dynamic, on-shelf, and sequential databases. One chapter mainly focuses on transactional-database-based approaches and includes only a few basic papers on HUI mining from transactional databases. It extensively explores three approaches: Two-Phase [12], pattern-growth [14], and FHM [17]. Our work is not limited to three approaches; it covers and compares more than 60.

  • The scope of the article [26] is very limited: it only covers privacy-preserving utility mining. The study [27] covers high utility itemset mining over incremental (dynamic) databases. Similarly, [24] says very little about HUI mining; its main concern is different types of data, such as voluminous, dynamic, and continuous data with high velocity and uncertainty. Later, Singh et al. [25] presented a survey on high average-utility itemset mining. None of these works [24,25,26,27, 29] addresses HUIM for transactional databases. We do not compare them with ours because they target other data mining tasks.

  • In 2019, Rahmati et al. [23] provided a systematic review of basic HUI mining techniques. It categorizes the methods into four groups: Apriori-like, pattern-growth-based, utility-list-based, and other methods. It reviewed only 20 approaches.

  • In 2019, Gan et al. [28] presented a general, comprehensive overview of the state-of-the-art methods. It is a very concise survey article that categorizes HUI mining algorithms into four groups: Apriori-based, tree-based, projection-based, and new-data-format-based algorithms. It surveys both HUIM and high-utility sequential pattern mining approaches, but discusses only 26 approaches related to HUIM.

  • In early 2020, Zhang et al. [22] presented a brief survey that classifies the algorithms into five categories: Apriori-based, tree-based, projection-based, list-based, and vertical & horizontal data-based algorithms. It is a very concise survey that only discusses the advantages and disadvantages of the current state-of-the-art methods.

All the existing surveys consider fewer approaches than the present survey. Our survey differs from the existing surveys [22, 23, 28] in various aspects.

  • Our survey focuses specifically on HUIM for transactional databases, whereas the existing surveys are general and cover HUIM along with other related topics.

  • The existing surveys include fewer approaches and none published in 2020 or later, whereas our work covers more approaches, including all methods to date.

  • Our work presents a detailed overview, along with the pros and cons, of all current HUIM approaches, whereas the existing surveys are not as detailed.

  • The existing surveys did not describe the databases used by HUIM approaches. Our survey includes a detailed description of these databases, which helps readers find relevant databases for experimental evaluation.

  • Most of the existing surveys do not describe applications, open-source resources, or detailed future research directions; our work presents all three.

This survey only includes transaction-based algorithms and gives a detailed overview of the existing algorithms. Our paper focuses comprehensively on the level-wise, tree-based, utility-list-based, and projection-based HUIM algorithms for transactional databases, while the other survey papers do not extensively highlight the advancements in this area. Moreover, all the above-discussed state-of-the-art works consider fewer algorithms than our work. Furthermore, we have also reviewed other domains of HUIM in considerable depth, such as incremental [27], negative utility [30], average-utility [25], multiple minimum utility thresholds [31], data stream [32], on-shelf [33], periodic, and privacy-preserving [26] utility mining, which the other survey papers did not review in depth. Our feature representation is also completely different from that of the state-of-the-art works.

This work provides a comprehensive survey of HUIM algorithms from transactional databases that can serve as a recent advancement and opportunity in the field of data mining. The main contributions of this work are:

  1.

    This paper presents a taxonomy of more than 60 state-of-the-art HUIM approaches for transactional databases, including level-wise, tree-based, utility-list-based, projection-based, and miscellaneous approaches.

  2.

    It presents detailed comparison tables of more than 60 approaches across various parameters, i.e., phases, number of database scans, data structure used, type of patterns mined, search type, pruning strategies used, type of utility value, state-of-the-art algorithms, and base approach. The survey also discusses the pros and cons of each category of HUIM approaches in detail.

  3.

    It also provides an in-depth summary and discussion of other existing HUIM approaches, including incremental, negative utility, average utility, soft computing with HUIM, multiple minimum utility thresholds, data stream, on-shelf, periodic, and privacy-preserving utility mining.

  4.

    The survey briefly discusses 16 real-world databases and some open-source HUIM resources. Furthermore, the article presents research opportunities and future directions for HUIM.

The rest of this paper is organized as follows: In Section 2, we introduce the preliminaries, definitions, properties, and applications. In Section 3, we categorize and describe basic HUIM algorithms along with their pros and cons. In Section 4, we discuss other advanced topics of HUIM algorithms. In Section 5, we present some open-source resources and transactional databases. In Section 6, we discuss the research opportunities and future directions of HUIM algorithms. Finally, we conclude the paper in Section 7.

2 Preliminaries and applications

This section provides the preliminaries, important definitions, and applications of high utility itemset mining.

2.1 Preliminaries and definitions

Let \(I=\{i_1, i_2, \dots , i_n\}\) be a finite set of n distinct items, and let the transactions \(t_1, t_2, \dots , t_m\) form the transactional database D, where each transaction \(t_q \subseteq I\). An itemset X is a finite set of k items such that \(X \subseteq I\), where k is the length of the itemset; X is then called a k-itemset. Each item i in a transaction \(t_q\) has an internal utility, denoted \(iu(i, t_q)\), and is associated with an external utility, denoted eu(i). The internal utility is simply the quantity of an item purchased during a visit to a retail shop or supermarket, while the external utility is the price or profit of an item, as depicted in Table 2. A minimum utility threshold, denoted \(\delta \), is set according to the user's preference.

Table 1 Transaction database
Table 2 External utility value

For example, the transaction database consists of eight transactions \(t_1, t_2, \dots , t_8\), as shown in Table 1, which will be used as a running example. It contains five items a, b, c, d, e. The external utility of each item is shown in Table 2. For example, transaction \(t_3\) contains the four items a, b, d, e with internal utilities 3, 1, 5, and 2, respectively. The external utilities of a, b, c, d, e are 3, 2, 1, 4, and 1, respectively.

Definition 1

Internal Utility of an item (\(iu(i, t_q)\))

Each item \(i \in I\) in a transaction \(t_q\) is associated with an internal utility, denoted as \(iu(i, t_q)\).

For instance, the internal utility of item \( d \) in \(t_3\) is 5, as depicted in Table 1.

Definition 2

External Utility of an item (eu(i))

Each item \(i \in I\) is associated with an external utility, denoted as eu(i).

For instance, the external utility of item \( d \) is 4, as shown in Table 2.

Table 3 Transaction utility of the running example

Definition 3

Utility of an item (\(u(i,t_q)\))

The utility of an item \(i \in t_q\) is denoted as \(u(i,t_q)\) and is defined as:

$$\begin{aligned} u(i, t_q) = iu(i, t_q) \times eu(i) \end{aligned}$$
(1)

For instance, the utility of item \( d \) in \(t_3\) is computed as: \(u(d, t_3) = iu(d, t_3) \times eu(d) = 5 \times 4 = 20\). The utility values of all the items for the running example are depicted in the \(2^{nd}\) column of Table 3.
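As a concrete illustration of Definitions 1-3, the following minimal Python sketch encodes the external utilities from Table 2 together with a three-transaction subset of the running example. The quantities of \(t_3\) come from the text; the quantities of \(t_7\) and \(t_8\) are assumptions chosen to be consistent with the utility values reported later in this section (Table 1 itself is not reproduced here).

```python
# External utilities (profit per unit) of items a..e, from Table 2.
eu = {"a": 3, "b": 2, "c": 1, "d": 4, "e": 1}

# A three-transaction subset of the running example: t3 matches the text;
# the quantities of t7 and t8 are assumed, chosen to agree with the
# utilities reported in the survey (e.g. u({a,d}) = 29 + 25 + 13 = 67).
db = {
    "t3": {"a": 3, "b": 1, "d": 5, "e": 2},
    "t7": {"a": 3, "b": 2, "d": 4, "e": 1},
    "t8": {"a": 3, "d": 1, "e": 3},
}

def item_utility(item, tq):
    """u(i, t_q) = iu(i, t_q) x eu(i)  (Definition 3)."""
    return tq[item] * eu[item]

print(item_utility("d", db["t3"]))  # 5 * 4 = 20
```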

Definition 4

Utility of an itemset in the transaction (\(u(X\!,\!t_q)\))

The utility of an itemset X in the transaction \(t_q,\) \(X \subseteq t_q,\) is denoted as \(u(X,t_q)\) and is defined as:

$$\begin{aligned} u(X, t_q) = \sum _{i \in X \wedge X \subseteq t_q } u(i,t_q) \end{aligned}$$
(2)

For example, the utility of an itemset \(\{a, d\}\) in \(t_3\) is calculated as: \(u(\{a,d\}, t_3) = (3 \times 3) + (5 \times 4) = 29\).

Definition 5

Utility of an itemset in the transactional database (u(X))

The utility of an itemset X in the transactional database D is denoted as u(X) and is defined as:

$$\begin{aligned} u(X) = \sum _{X \subseteq t_q \in D} u(X,t_q) \end{aligned}$$
(3)

For example, the utility of an itemset \(\{a, d\}\) in the transactional database D is calculated as: \(u(\{a, d\}) \!=\! u(\{a, d\},t_3) + u(\{a, d\},t_7) + u(\{a, d\},t_8)\) \(= 29 + 25 + 13 = 67\).
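Definitions 4 and 5 translate directly into code. The sketch below reuses a three-transaction subset of the running example, where the quantities of \(t_7\) and \(t_8\) are assumptions consistent with the stated utilities \(u(\{a,d\}, t_7) = 25\) and \(u(\{a,d\}, t_8) = 13\).

```python
eu = {"a": 3, "b": 2, "c": 1, "d": 4, "e": 1}
db = {
    "t3": {"a": 3, "b": 1, "d": 5, "e": 2},  # from the text
    "t7": {"a": 3, "b": 2, "d": 4, "e": 1},  # assumed quantities
    "t8": {"a": 3, "d": 1, "e": 3},          # assumed quantities
}

def u_in_tx(X, tq):
    """u(X, t_q): utility of itemset X in one transaction (Definition 4)."""
    if not set(X) <= set(tq):
        return 0  # X is not contained in t_q
    return sum(tq[i] * eu[i] for i in X)

def u(X):
    """u(X): utility of X over the whole database (Definition 5)."""
    return sum(u_in_tx(X, tq) for tq in db.values())

print(u_in_tx({"a", "d"}, db["t3"]))  # (3*3) + (5*4) = 29
print(u({"a", "d"}))                  # 29 + 25 + 13 = 67
```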

Table 4 High utility itemsets of the running example

Definition 6

Transaction utility (\(tu(t_q)\))

The transaction utility of a transaction \(t_q\) is denoted as \(tu(t_q)\) and is defined as:

$$\begin{aligned} tu(t_q) = \sum _{i \in t_q} u(i, t_q) \end{aligned}$$
(4)

For example, the transaction utility of \(t_3\) is calculated as: \(tu(t_3) = u(a, t_3) + u(b, t_3) + u(d, t_3) + u(e, t_3) = 9 + 2 + 20 + 2 = 33\). The tu values of all the transactions are shown in the fourth column of Table 3.

Definition 7

Total utility of a transactional database (\(tu^{(D)}\))

The total utility of a transactional database D is denoted as \(tu^{(D)}\) and is defined as:

$$\begin{aligned} tu^{(D)} = \sum _{t_q \in D} tu(t_q) \end{aligned}$$
(5)

For example, the total utility \(tu^{(D)}\) of the transactional database D is calculated as: \(tu^{(D)} = tu(t_1) + tu(t_2) + tu(t_3) + tu(t_4) + tu(t_5) + tu(t_6) + tu(t_7) + tu(t_8) = (15 + 11 + 33 + 8 + 6 + 10 + 30 + 16) = 129\).
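Equations (4) and (5) can be sketched in the same style. Note that over the three-transaction subset used below (with assumed quantities for \(t_7\) and \(t_8\)) the total is \(33 + 30 + 16 = 79\), not the full database's 129.

```python
eu = {"a": 3, "b": 2, "c": 1, "d": 4, "e": 1}
db = {
    "t3": {"a": 3, "b": 1, "d": 5, "e": 2},  # from the text
    "t7": {"a": 3, "b": 2, "d": 4, "e": 1},  # assumed quantities
    "t8": {"a": 3, "d": 1, "e": 3},          # assumed quantities
}

def tu(tq):
    """tu(t_q): sum of the utilities of all items in t_q (Definition 6)."""
    return sum(qty * eu[i] for i, qty in tq.items())

# tu^(D): total utility of the database (Definition 7).
total_utility = sum(tu(tq) for tq in db.values())
print(tu(db["t3"]), total_utility)  # 33 79
```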

Definition 8

High Utility Itemset (HUI)

An itemset X is called a high utility itemset (HUI) in a transactional database D if it satisfies the following condition:

$$\begin{aligned} HUI \longleftarrow \sum _{X \subseteq t_q \in D} u(X,t_q) \ge \delta \times tu^{(D)} \end{aligned}$$
(6)

For example, suppose the minimum utility threshold \(\delta \) is set to 10%. In the running example, the itemset \( \{a,d\} \) is an HUI in the transactional database D because its utility is \(u(\{a,d\}) = 67\), and \(67 > \delta \times tu^{(D)} (= 10\% \times 129 = 12.9)\). On the other hand, the itemset \(\{a,c\}\) is not an HUI because \(u(\{a,c\}) = 10\) and \(10 < 12.9\). All the HUIs of the running example are shown in Table 4.
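Definition 8 suggests the brute-force miner mentioned in the introduction: enumerate every itemset and keep those whose utility reaches \(\delta \times tu^{(D)}\). The sketch below runs on the three-transaction subset (assumed quantities for \(t_7\) and \(t_8\)) and uses a 30% threshold rather than 10% so that the toy result set stays small; the exponential enumeration is exactly what the pruning properties discussed below avoid.

```python
from itertools import combinations

eu = {"a": 3, "b": 2, "c": 1, "d": 4, "e": 1}
db = {
    "t3": {"a": 3, "b": 1, "d": 5, "e": 2},  # from the text
    "t7": {"a": 3, "b": 2, "d": 4, "e": 1},  # assumed quantities
    "t8": {"a": 3, "d": 1, "e": 3},          # assumed quantities
}

def u(X):
    """u(X) as in Definition 5."""
    return sum(sum(tq[i] * eu[i] for i in X)
               for tq in db.values() if set(X) <= set(tq))

items = sorted({i for tq in db.values() for i in tq})
total_utility = sum(q * eu[i] for tq in db.values() for i, q in tq.items())
min_util = 0.30 * total_utility  # delta = 30% -> 23.7 on this subset

# Brute force: check all 2^|I| - 1 candidate itemsets against Definition 8.
huis = {X: u(X)
        for k in range(1, len(items) + 1)
        for X in combinations(items, k)
        if u(X) >= min_util}
print(huis[("a", "d")])  # 67
```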

Definition 9

Transaction weighted utility of an itemset (TWU(X))

The transaction weighted utility (TWU) of an itemset X is denoted as TWU(X) and is defined as:

$$\begin{aligned} TWU(X) = \sum _{X \subseteq t_q \in D} tu(t_q) \end{aligned}$$
(7)

For example, the TWU of itemset \( \{d\} \) is calculated as: TWU(d) \(= tu(t_1) + tu(t_3) + tu(t_6) + tu(t_7) + tu(t_8) = 15 + 33 + 10 + 30 + 16 = 104\). Similarly, \(TWU(\{a, d\}) = tu(t_3) + tu(t_7) + tu(t_8) = 33 + 30 + 16 = 79\). The TWU values of all single items for the running example are shown in Table 5.
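A sketch of Definition 9, again over the three-transaction subset (assumed quantities for \(t_7\) and \(t_8\)): since \(\{a, d\}\) occurs in all of \(t_3\), \(t_7\), \(t_8\), its TWU equals \(33 + 30 + 16 = 79\), matching the value computed above.

```python
eu = {"a": 3, "b": 2, "c": 1, "d": 4, "e": 1}
db = {
    "t3": {"a": 3, "b": 1, "d": 5, "e": 2},  # from the text
    "t7": {"a": 3, "b": 2, "d": 4, "e": 1},  # assumed quantities
    "t8": {"a": 3, "d": 1, "e": 3},          # assumed quantities
}

def tu(tq):
    return sum(qty * eu[i] for i, qty in tq.items())

def twu(X):
    """TWU(X): sum of tu(t_q) over transactions containing X (Definition 9)."""
    return sum(tu(tq) for tq in db.values() if set(X) <= set(tq))

print(twu({"a", "d"}))  # 33 + 30 + 16 = 79
print(twu({"b"}))       # b occurs only in t3 and t7: 33 + 30 = 63
```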

Problem statement: Given a transaction database D and a user-specified minimum utility threshold \(\delta \), the objective of HUIM is to discover the complete set of itemsets whose utility is no less than the minimum utility count (\( \delta \times tu^{(D)}\)).

It is a challenging task to prune the search space in HUIM because the downward closure property of ARM [1, 34] does not hold for the utility measure of an item in the database. To address this issue, the TWDCP (Transaction Weighted Downward Closure Property) [12] was proposed, which is based on the following definitions.

Table 5 TWU value of the items

2.1.1 \({\textbf {TWU}}\) based pruning strategy

The following properties related to the TWU measure are used to prune the search space:

Property 1

Overestimation. The TWU of an itemset X is higher than or equal to its utility [12] and is defined as: \(TWU(X) \ge u(X)\)

For instance, TWU(d) is 104 and u(d) is 60. Similarly, TWU(ad) is 79 whereas u(ad) is 67.

Property 2

Anti-monotone. The TWU of an itemset follows the anti-monotone property [12] and is defined as: let X and Y be two itemsets and \(X \subset Y\), then \(TWU(X) \ge TWU(Y)\)

Property 3

Pruning. Let X be an itemset. If \(TWU(X) \ge (\delta \times tu^{(D)})\), then the itemset X is a candidate high utility itemset, otherwise, X is a low utility itemset.

Definition 10

HTWUI (High Transaction Weighted Utilization Itemset) (\(HTWU^{(D)}(X)\))

An itemset X is HTWUI in a transactional database D if it satisfies the following condition:

$$\begin{aligned} HTWUI \longleftarrow TWU(X) \ge \delta \times tu^{(D)} \end{aligned}$$
(8)

For example, the itemset \(\{a, b\}\) is an HTWUI in the transactional database D as \(TWU(\{a, b\}) = tu(t_3) + tu(t_7) = (33 + 30) = 63\) and \((63 > (10\% \times 129 = 12.9))\). On the other hand, the itemset \(\{a, c\}\) is not an HTWUI as \(TWU(\{a, c\}) = tu(t_2) = 11\) and \((11 < 12.9)\).

Table 6 Closed high utility itemsets of the running example

Property 4

TWDCP (Transaction Weighted Downward Closure Property). The TWDCP indicates that if an itemset X is not an HTWUI, then X and all its supersets are low utility itemsets.

Using the TWDCP property of TWU, many HUIM algorithms [12, 15] have been proposed that significantly prune the search space and eliminate numerous unpromising candidates. These algorithms consist of two phases. During the first phase, they discover all HTWUIs from the transactional database, while the HUIs are obtained from the set of HTWUIs in the second phase. Although these algorithms mine all the HUIs, they may generate a large number of candidates in phase one, which degrades the overall performance of the mining process. To overcome this problem, many closed HUIM algorithms [20] have been proposed that provide a concise and lossless representation of HUIs, referred to as CHUIs (closed HUIs). For more information, the reader may refer to the closed itemsets in [35, 36].
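The two-phase scheme can be sketched as an Apriori-style level-wise search in which Property 4 justifies pruning: phase one keeps only itemsets whose TWU reaches the threshold, and phase two rescans to compute the exact utilities of the surviving candidates. This is a simplified sketch over the three-transaction subset used earlier (assumed quantities for \(t_7\) and \(t_8\), 30% threshold), not the authors' implementation.

```python
eu = {"a": 3, "b": 2, "c": 1, "d": 4, "e": 1}
db = {
    "t3": {"a": 3, "b": 1, "d": 5, "e": 2},  # from the text
    "t7": {"a": 3, "b": 2, "d": 4, "e": 1},  # assumed quantities
    "t8": {"a": 3, "d": 1, "e": 3},          # assumed quantities
}

def tu(tq):
    return sum(qty * eu[i] for i, qty in tq.items())

def twu(X):
    return sum(tu(tq) for tq in db.values() if set(X) <= set(tq))

def u(X):
    return sum(sum(tq[i] * eu[i] for i in X)
               for tq in db.values() if set(X) <= set(tq))

min_util = 0.30 * sum(tu(tq) for tq in db.values())
items = sorted({i for tq in db.values() for i in tq})

# Phase 1: level-wise HTWUI generation. TWDCP guarantees that no superset
# of a pruned (non-HTWUI) itemset can be high utility, so dropping it is safe.
htwuis, level = [], [(i,) for i in items if twu((i,)) >= min_util]
while level:
    htwuis.extend(level)
    joined = {tuple(sorted(set(x) | set(y)))
              for x in level for y in level
              if len(set(x) | set(y)) == len(x) + 1}
    level = [X for X in sorted(joined) if twu(X) >= min_util]

# Phase 2: one extra database scan computes the exact utility of each candidate.
huis = {X: u(X) for X in htwuis if u(X) >= min_util}
print(len(htwuis), len(huis))  # candidates surviving phase 1 vs. true HUIs
```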

Definition 11

Support count of an itemset (sc(X))

The support count of an itemset X is the number of transactions containing X in D and is denoted as sc(X). The support of itemset X is denoted as s(X) and is defined as:

$$\begin{aligned} s(X) = \frac{sc(X)}{|D|} \end{aligned}$$
(9)

The complete set of all the itemsets in D is denoted as L and is defined as:

$$\begin{aligned} L = \{X | X \subseteq I, sc(X) > 0\} \end{aligned}$$
(10)

For instance, the support counts of itemsets \(\{d\}\) and \(\{abd\}\) are 5 and 2, respectively. The supports of itemsets \(\{d\}\) and \(\{abd\}\) are \(5/8 = 0.625\) and \(2/8 = 0.25\), respectively (both \(> 0\)).

Definition 12

CHUIs (Closed High Utility Itemsets)

An itemset X is a CHUI if it is an HUI and there does not exist any HUI Y such that \(X \subset Y\) and \(sc(X) = sc(Y)\).

Compared to the HUIs shown in Table 4, the itemsets \(\{a, b\}\), \(\{a, b, d\}\), \(\{a, b, e\}\), and \(\{b, d, e\}\) are not CHUIs because they are subsets of the itemset \(\{a, b, d, e\}\) with the same support count, which is 2. Similarly, the itemsets \(\{a, d\}\) and \(\{d, e\}\) are subsets of \(\{a, d, e\}\) with support count 3, and the itemset \(\{c, d\}\) is a subset of \(\{b, c, d\}\) with the same support count. Table 6 shows all the CHUIs for the running example.
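Definition 12 can be checked mechanically: an HUI is closed iff no proper HUI superset has the same support count. The sketch below mines HUIs brute-force over the three-transaction subset (assumed quantities for \(t_7\) and \(t_8\), 30% threshold) and then filters them down to the closed ones.

```python
from itertools import combinations

eu = {"a": 3, "b": 2, "c": 1, "d": 4, "e": 1}
db = {
    "t3": {"a": 3, "b": 1, "d": 5, "e": 2},  # from the text
    "t7": {"a": 3, "b": 2, "d": 4, "e": 1},  # assumed quantities
    "t8": {"a": 3, "d": 1, "e": 3},          # assumed quantities
}

def u(X):
    return sum(sum(tq[i] * eu[i] for i in X)
               for tq in db.values() if set(X) <= set(tq))

def sc(X):
    """Support count (Definition 11): number of transactions containing X."""
    return sum(1 for tq in db.values() if set(X) <= set(tq))

items = sorted({i for tq in db.values() for i in tq})
min_util = 0.30 * sum(q * eu[i] for tq in db.values() for i, q in tq.items())
huis = [X for k in range(1, len(items) + 1)
        for X in combinations(items, k) if u(X) >= min_util]

# Keep an HUI only if no proper HUI superset has the same support count.
chuis = [X for X in huis
         if not any(set(X) < set(Y) and sc(X) == sc(Y) for Y in huis)]
print(chuis)  # [('a', 'd', 'e'), ('a', 'b', 'd', 'e')]
```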

2.2 Applications of high utility itemsets mining

In this section, we briefly discuss various applications of HUIM.

2.2.1 Market basket analysis

Market basket analysis [5] is a data mining technique used to analyze data from large databases, such as purchase histories, to learn product groupings and which products are often purchased together. It considers data across stores and purchases from different customer groups at different times. It is useful for increasing sales and customer satisfaction, provides more benefits to retailers, and makes the shopping experience more valuable and productive for customers. HUIM approaches use the utility of each itemset, which can be predefined based on user preferences. Hence, HUIM approaches [5, 37] are quite useful in market basket analysis to obtain detailed information about customers' purchasing behavior.

2.2.2 Web click stream analysis

Web click-stream analysis [38] works with a rich set of click-stream data, including the pages visited, the time spent on each page, and the detailed history of web pages. Detailed analysis is used to report user behavior on a specific website. HUIM approaches can discover valuable items and information from website click-streams [38].

2.2.3 Web mining

Web mining [39] is a data mining technique that automatically discovers useful knowledge from the world wide web and its usage patterns. It improves the power of web search engines by identifying web pages and web documents, and it is used to predict user behavior; Yahoo and Google, for example, are powerful web search engines. HUIM approaches discover high utility itemsets from web logs, such as high utility traversal items [40] and high utility access items [39]. These approaches are quite useful for improving web services, web browsing, and web pages.

2.2.4 Cross marketing

Cross-marketing [32, 41, 42] helps to identify customer purchasing behavior, improve customer service, enhance sales, focus on customer retention, and reduce business costs. Gan et al. [5] designed a novel method to extract non-redundant correlated purchase behaviors by considering both utility and correlation factors, resulting in high recall and accurate precision.

2.2.5 Gene regulation

The gene regulation [9] technique is used to acquire information about sequence structure, gene interactions, and genetic pathways. This information significantly complements the analysis of data and improves its results. The HUIM approach in [43] is used to obtain promising gene regulation patterns from large gene expression data, which is useful for discovering new drugs for health care.

2.2.6 E-commerce

E-commerce [11] warehouse management significantly improves the efficiency of warehouse operations and customer service. The method in [44] solves the item storage assignment problem by using top-k HUIM and a heuristic algorithm. Similarly, in mobile commerce environments [10, 15], it is useful to discover sequential purchasing patterns (or mobile sequential patterns) by combining movement paths with purchase transactions; this is also useful for analyzing and managing online shopping websites. Yun et al. [45] proposed a method that integrates movement paths with sequential patterns to discover mobile sequential patterns.

Although there are plenty of applications of HUIM in the field of data mining, due to space limitations we have considered only a limited number of them. In the next section, we provide an up-to-date and comprehensive survey of high utility itemset mining algorithms for transactional databases, along with their pros and cons.

3 High utility itemsets mining algorithms

Traditional HUIM approaches for transactional databases are generally divided into the following groups: level-wise, tree-based, utility-list-based, and projection-based. The level-wise approaches [12, 46] are based on candidate generation-and-test methods. They discover HUIs in two phases: during the first phase, they overestimate the utility of itemsets based on the TWU method; in the second phase, they identify the actual HUIs. However, they generate numerous candidates and require multiple database scans. To resolve these problems, tree-based algorithms [47,48,49,50,51] were proposed, which mainly consist of three steps: (1) tree construction; (2) candidate generation from the constructed tree; (3) identification of the HUIs from the set of candidates. However, they still generate a large number of candidates, which consumes considerable memory and degrades the performance of the mining process. To resolve these issues, utility-list-based algorithms [16,17,18] were proposed that generate no candidates and avoid multiple database scans. They are more run-time-efficient than the level-wise and tree-based algorithms, but they must perform costly join operations that lead to high memory consumption and complexity. Finally, projection-based algorithms [13, 19] were proposed to enhance the effectiveness of the mining process by reducing excessive candidate generation. They are highly memory-efficient and scalable; however, they do not perform well on some benchmark databases at high threshold values.

Fig. 2

Taxonomy of high utility itemsets mining algorithms for transactional databases

In the past decade, numerous HUIM algorithms have been proposed to discover HUIs from transactional databases. This paper divides them into five categories: (1) level-wise; (2) tree-based; (3) utility-list-based; (4) projection-based; (5) miscellaneous. Figure 2 presents the taxonomy of the state-of-the-art HUIM approaches for transactional databases.

3.1 Level-wise high utility itemsets mining

Level-wise high utility itemset mining algorithms follow the TWU property to mine HUIs from the transaction database. They perform a level-wise and/or depth-first search over the items present in the transactional database. Low utility itemsets are pruned using strategies such as TWU and upper-bound utilities. The objective of these algorithms is to determine the usefulness of items so as to extract maximum profit for businesses.

Chan et al. [52] proposed the OOA (Objective-Oriented utility-based Association) approach, which mines the top-k closed utility patterns. It uses a weaker but anti-monotone condition [1] to reduce the search space and derive all OOA rules efficiently. Its main advantage is that no user-specified minimum utility is required, which makes the algorithm very effective; however, there is still some overhead in obtaining the intended results.

In 2005, Yao et al. [53] proposed the MEU (Mining with Expected Utility) algorithm, which considers both the internal and external utility of an itemset to find HUIs. It uses a heuristic model to limit the search space and provides a theoretical foundation for efficient utility mining algorithms. However, it may miss some high utility itemsets and therefore cannot guarantee complete results. Moreover, MEU does not follow the DCP of Apriori [34].

To address the limitations of MEU [53], Liu et al. [12, 54] proposed the Two-Phase algorithm. They proposed a TWU mining model that maintains the TWDCP in phase one, while in phase two only one extra database scan is required to filter out the overestimated itemsets. However, it generates too many candidates and performs multiple database scans.

The main limitation of frequency-based association rule mining [34] is that it does not take into account the statistical correlation between itemsets. To overcome this limitation, Yao et al. [46] proposed two algorithms, UMining and UMining_H, incorporating the utility upper-bound and support upper-bound pruning strategies, respectively. UMining is better than UMining_H because it discovers all HUIs. However, both algorithms suffer from excessive candidate generation.

Li et al. [8] proposed the IIDS (Isolated Items Discarding Strategy) method to identify HUIs. It can be applied to any existing level-wise method, including ShFSM and DCG (Direct Candidates Generation) [41], to decrease the number of candidates and hence improve mining performance. By applying IIDS to ShFSM and DCG, the authors implemented FUM (Fast Utility Method) and DCG+, respectively. However, these methods have the same performance issues as Apriori: they generate a large number of candidates and require multiple database scans.

HUIM methods lack the business insights that would let customers obtain maximum benefits [53]. To provide such insights, Lee et al. [55] developed the HURM (High Utility Rule Mining) method, which allows firms to quantitatively represent users' preferences as utility values. The method handles large amounts of data from various heterogeneous business environments and hence gives firms opportunities to increase their business benefits. In terms of business context, the performance of HURM is much better than that of HUIM methods (best case 133%, normal case 120%, worst case 113%).

Wu et al. [35] provided the CHUD (Closed\(^+\) High Utility itemset Discovery) algorithm to mine closed\(^+\) HUIs. This work was later extended by Tseng et al. [20], who proposed three algorithms, namely AprioriCH (Apriori-based algorithm for mining High utility Closed\(^+\) itemsets), AprioriHC-D (AprioriHC algorithm with Discarding unpromising and isolated items), and CHUD, which produce a compact and lossless representation of HUIs. Further, they proposed DAHU (Derive All High Utility itemsets), which recovers all HUIs from the CHUIs (Closed High Utility Itemsets); CHUD is used to mine the closed itemsets. However, AprioriCH and AprioriHC-D perform poorly on dense databases because they suffer from excessive candidate generation, while CHUD suffers from high memory consumption.

Table 7 An overview of key characteristics of level-wise high utility itemsets mining algorithms
Table 8 Pros and cons of level-wise high utility itemsets mining algorithms

3.1.1 Summary

Level-wise HUIM approaches allow a user to conveniently express his or her perspective on the usefulness of itemsets as utility values and then find itemsets whose utility exceeds the specified threshold. The challenge of level-wise HUIM lies in restricting the size of the candidate set and simplifying the utility computation. Level-wise HUIM algorithms generate a large number of candidates and, like Apriori, perform multiple database scans to obtain the desired results. Therefore, they have long execution times and high memory consumption, and they are not suitable for large databases with long transactions. A detailed overview of level-wise HUIM algorithms is shown in Table 7. Furthermore, the pros and cons of all the level-wise HUIM algorithms are depicted in Table 8.

3.2 Tree-based high utility itemsets mining

In this section, tree-based HUIM algorithms are discussed, which address the limitations of level-wise approaches. The level-wise approaches [46, 53, 54] require two phases to find HUIs, which causes numerous candidate generations and multiple database scans. To overcome these issues, tree-based HUIM algorithms [47,48,49,50,51] were proposed, based on the pattern-growth approach [57] and compact tree structures. These algorithms significantly reduce the number of candidates through overestimation methods and effectively reduce the search space for finding HUIs. They need only two or three passes over the database and are much faster than Apriori-like approaches. The objective of these algorithms is to significantly decrease the number of candidates and database scans by means of efficient pruning strategies and compact tree structures.
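The overestimation these algorithms rely on is the TWU (transaction-weighted utilization): the sum of the utilities of the transactions containing an item, which upper-bounds the utility of any itemset containing that item. A minimal sketch, using an invented toy database and profit table (all item names and numbers are illustrative):

```python
# TWU (transaction-weighted utilization): for each item, the sum of the
# utilities of the transactions that contain it. Since TWU is an upper bound
# on the utility of any itemset containing the item, items whose TWU falls
# below the minimum utility threshold can be pruned safely.

profit = {"a": 5, "b": 2, "c": 1}          # external utility per item
db = [                                      # each transaction: item -> quantity
    {"a": 1, "b": 2},                       # TU = 5*1 + 2*2 = 9
    {"b": 1, "c": 4},                       # TU = 2*1 + 1*4 = 6
    {"a": 2, "c": 1},                       # TU = 5*2 + 1*1 = 11
]

def transaction_utility(t):
    return sum(profit[i] * q for i, q in t.items())

def twu(db):
    out = {}
    for t in db:
        tu = transaction_utility(t)
        for item in t:
            out[item] = out.get(item, 0) + tu
    return out

print(twu(db))  # {'a': 20, 'b': 15, 'c': 17}
```

With a minimum utility threshold of, say, 16, items b and c would be pruned here, shrinking every later scan.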

To address the problem of identifying high utility itemsets, Hu et al. [58] proposed an efficient approximation by introducing a novel binary partition tree called the HYP (High Yield Partition) tree. It works with respect to a pre-defined utility, objective function, or performance metric, and finds a segment of the data defined by a combination of a few items and/or rules. However, when the HYP-tree becomes large, it consumes a high amount of memory. Hence, it is not suitable for large databases.

To address the limitation of the HYP-tree [58], Erwin et al. [47] proposed the CTU-Mine (Compressed Transaction Utility Mine) algorithm, which mines HUIs using the pattern-growth approach. It introduces a compact data representation called the CTU-tree (Compressed Transaction Utility tree). This approach avoids candidate generation-and-test and is best suited to dense databases and long patterns. However, the CTU-tree is very complicated and stores excessive information, which leads to high memory consumption. Moreover, the Two-Phase algorithm [12, 54] is relatively faster than CTU-Mine when the threshold value is set high.

The previous works [12, 46, 47, 58] handle only static databases and do not consider dynamic databases, in which insertions, deletions, or modifications of transactions occur. To address this issue, Ahmed et al. [32] proposed the IHUP (Incremental High Utility Pattern) algorithm, which is based on the "build once, mine many" property to mine HUIs in incremental databases. It provides three efficient tree structures, namely the IHUP\(_L\)-Tree (Incremental HUP Lexicographic Tree), the IHUP\(_{TF}\)-Tree (IHUP Transaction Frequency Tree), and the IHUP\(_{TWU}\)-Tree (IHUP Transaction Weighted Utilization Tree). The first tree structure, the IHUP\(_L\)-Tree, is arranged according to the items' lexicographic order. The second, the IHUP\(_{TF}\)-Tree, obtains a compact size by arranging items in decreasing order of their transaction frequency. The third, the IHUP\(_{TWU}\)-Tree, is designed to decrease the mining time by arranging items in decreasing order of their TWU values. The main advantage of IHUP is that the proposed tree structures discover patterns efficiently and achieve scalability for incremental and interactive HUP mining. However, the algorithm produces a large number of HTWUIs during the first phase.

Ahmed [48] proposed HUC-Prune (High Utility Candidates Prune) to avoid level-wise candidate generation. It adopts the pattern-growth approach and uses the proposed HUC-tree (High Utility Candidates tree) structure. In the first database scan, HUC-Prune finds the single-element candidate patterns. During the second database scan, the HUC-tree is used to capture the TWU values of items in transactions. Finally, a third database scan finds all the HUIs among the candidate patterns. Hence, it requires a maximum of three database scans to discover the HUIs for a given minimum utility threshold. The advantage of HUC-Prune is that it avoids excessive candidates and multiple database scans; it performs well on large dense databases, is highly scalable, and is memory efficient. However, the constructed tree is complex and large.

The previous level-wise and tree-based HUIM algorithms find an excessive number of promising itemsets in step one, which causes the problem of mining congestion. To avoid this problem, Lin et al. [49] proposed the UMMI (high Utility Mining using the Maximal Itemset property) algorithm, which uses the maximal itemset property. They developed the HTP (High TWU Pattern) tree structure to find all the maximal high-TWU itemsets. In step one, UMMI significantly decreases the number of promising itemsets, while in step two, it utilizes the MLexTree (Maximal Lexicographic Tree) structure, based on the LexTree (Lexicographic Tree), to find all HUIs. UMMI is a highly efficient algorithm for finding HUIs in very large databases. Moreover, it scales linearly in the number of transactions. However, it needs additional memory to construct the HTP tree structure.

The previous algorithms do not perform well when there is a large number of PHUIs (Potential HUIs), especially when the database consists of long transactions or a low threshold is set. To avoid this problem, Tseng et al. [14] proposed UP-Growth (Utility Pattern-Growth) to effectively discover HUIs from transaction databases. They proposed the compact UP-Tree (Utility Pattern Tree) structure to keep the information about HUIs. The algorithm generates the HUIs in two database scans. It uses four pruning strategies, namely DGU (Discarding Global Unpromising items), DGN (Discarding Global Node utilities), DLU (Discarding Local Unpromising items), and DLN (Decreasing Local Node utilities). UP-Growth+ [15] extends UP-Growth by utilizing the minimal node utilities of each node in each path of the UP-tree. Compared to UP-Growth, UP-Growth+ significantly decreases the overestimated utilities of PHUIs and effectively reduces the number of candidates. The limitation of UP-Growth+ is that it consumes more time to recursively process all the conditional prefix-trees to generate the candidates.

Yun et al. [59] proposed the MU-Growth (Maximum Utility Growth) algorithm, which efficiently reduces the number of candidates. They suggest the MIQ-Tree (Maximum Item Quantity Tree) structure, which holds the data information in a single pass. The MIQ-tree can be restructured to reduce overestimated utilities. The approach has the following three steps: (1) In the first step, the initial tree is constructed using the items and their associated quantities in the transactions; the tree is then restructured in decreasing order of TWU values. (2) In the second step, the MU-Growth algorithm generates candidates from the restructured tree. (3) Finally, in the last step, the actual HUIs are identified. The algorithm performs efficiently on databases that contain long transactions or when a low minimum utility threshold is set. MU-Growth performs better than UP-Growth+ [15] across all threshold values. However, it suffers from high execution time when the number of transactions in the database increases.

The previous pattern-growth approaches [14, 15, 47, 48, 59] generate excessive conditional trees, which leads to high cost in terms of space and time. To address this challenge, Song et al. [50] designed an efficient algorithm called CHUI-Mine (Concurrent High Utility Itemsets Mine), which dynamically prunes the tree structure to discover the HUIs. They proposed the CHUI-tree structure, which is based on the pattern-growth approach, and the mining can be completed within two database scans. The main advantage of CHUI-Mine is that it can discover HTWUIs during the process of tree construction and avoids generating the whole tree structure. Hence, CHUI-Mine efficiently reduces memory consumption because the pruned trees are usually small. However, its evaluation does not include a comparison with the faster tree-based algorithm IHUP [32] with respect to memory usage.

To avoid excessive candidate generation, HUI-Miner [16] and d2HUP [4] were proposed to mine HUIs in a single phase. Then, HUP-Miner [60] and FHM [17] were designed, which are up to six times faster than HUI-Miner [16]. However, despite all these efforts, the computation required to find HUIs remains very expensive. Zida et al. [61] developed the EFIM (EFficient high utility Itemset Mining) algorithm, which efficiently finds the HUIs in a single phase. It uses two upper bounds, namely the sub-tree utility and the local utility, which effectively limit the search space. A novel array-based utility counting method, called FAC (Fast utility Counting), is introduced to calculate these upper bounds in linear time and space. The authors proposed efficient database projection and transaction merging techniques to reduce the cost of database scans. EFIM is two to three times faster and consumes up to eight times less memory than UP-Growth+ [15], HUI-Miner [16], d2HUP [4], HUP-Miner [60], and FHM [17] on various benchmark databases. This work was later extended in [51], which significantly improves the performance of EFIM [61] with regard to run-time and memory usage. However, in some cases, the recursive projection takes considerable time and consumes much memory. Moreover, the transaction merging techniques do not scale well on sparse databases.
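The transaction merging idea can be sketched briefly: once unpromising items are removed, transactions over exactly the same items become identical and can be combined by summing their quantities. A minimal Python sketch on an invented toy database (names and numbers are illustrative, not EFIM's implementation):

```python
# Transaction merging in the style of EFIM: transactions containing exactly
# the same set of items are merged into one by summing quantities, shrinking
# the database for subsequent scans.

def merge_transactions(db):
    merged = {}
    for t in db:
        key = tuple(sorted(t))              # itemset signature of the transaction
        if key not in merged:
            merged[key] = dict(t)
        else:
            for item, q in t.items():       # same items: sum the quantities
                merged[key][item] += q
    return list(merged.values())

db = [{"a": 1, "b": 2}, {"a": 3, "b": 1}, {"a": 1, "c": 2}]
print(merge_transactions(db))
# [{'a': 4, 'b': 3}, {'a': 1, 'c': 2}]
```

Merging pays off mostly on dense databases, where pruning leaves many duplicate transactions; on sparse data few transactions coincide, which matches the scalability caveat above.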

Shao et al. [62] proposed the CUARM (Combined Utility Association Rules Mining) algorithm to mine actionable high utility association rules, named CUARs (Combined Utility Association Rules). The proposed structure generates patterns that are both strongly associated and of high utility. They also proposed AUG (Associated Utility Growth), which integrates utility and association. The algorithm is based on the UG-Tree (Utility Growth Tree) structure, which requires only two database scans. The advantage of CUARM is that it generates HUIs without loss of representativeness (strong association). However, for utility-decrement itemsets, it does not consider utilities that are lower than those of the underlying itemsets.

The previous works [14, 15] are based on the UP-tree, which generates a large number of candidates whose utility values must be computed to estimate the itemsets. To address this issue, Dawar et al. [63] proposed the efficient UP-Hist Growth algorithm and the UP-Hist tree structure to discover HUIs from a transaction database. The UP-Hist tree keeps a histogram of item quantities with each node of the tree, and its construction requires two database scans. The histograms yield better utility estimates that significantly limit the search space. However, the histograms require extra storage space in the UP-Hist tree.

CHUD [20, 35] incurs a high computational cost to mine CHUIs. To resolve this challenge, Fournier-Viger et al. [64] proposed the EFIM-Closed (EFficient high utility Itemset Mining Closed) algorithm, which uses three strategies, namely CJU (Closure Jumping), FCC (Forward Closure Checking), and BCC (Backward Closure Checking), to efficiently mine CHUIs. EFIM-Closed also uses two efficient techniques, namely HDP (High utility Database Projection) and HTM (High utility Transaction Merging), to reduce the cost of database scans. The algorithm is up to 71 times faster and consumes up to 19 times less memory than CHUD [35]. However, EFIM-Closed does not perform well on dense databases.

The previous algorithms [14, 15] require two database scans to construct the UP-tree. Ryang et al. [65] proposed the SIQ-Tree (Sum of Item Quantities Tree), which captures the data information in a single database scan. The authors suggest a restructuring method with two strategies, namely RPS (Reducing Path Support) and REPU (Reducing Estimated Path Utility using maximum item utility), which effectively decrease the number of candidate patterns by reducing the overestimation of utilities. A fast algorithm for identifying HUIs is also considered to enhance the mining performance. However, the approach requires a large amount of computing time as the number of items increases.

The previous works [8, 14, 15, 32] require two phases to discover HUIs, which results in high costs in terms of candidate generation and utility calculation. To address these issues, Qu et al. [66] first present BIA (Basic Identification Algorithm) to find the HUIs. Second, they propose a new candidate tree structure to store the candidates; the candidate tree is a modified prefix-tree [67] in which candidates with the same prefix share a common path. Finally, FIA (Fast Identification Algorithm) is developed to quickly identify the HUIs. The proposed approach performs well when the candidate tree fits completely in memory. However, it does not consider the case in which the candidate tree is too large to fit in memory.

To overcome the problem of mining a large number of tiny itemsets, concise HUIM is required. To address this issue, Singh et al. [68] proposed the EHIL (Efficient High utility Itemsets with Length constraints) algorithm, which applies length constraints when finding HUIs. The minimum and maximum length constraints, respectively, remove tiny itemsets and restrict the maximum length of the itemsets. The algorithm uses two database compaction techniques, namely database projection and transaction merging, to speed up execution. EHIL utilizes two pruning strategies, namely RLU (Revised Local Utility) and RSU (Revised Sub-tree Utility), which use depth-first search to eliminate a large number of unpromising candidates. An efficient array-based utility counting technique computes the upper bounds to speed up utility counting; it computes the utility of itemsets without scanning the original database. EHIL incorporates the length constraints into the redefined sub-tree and TWU pruning strategies. EHIL outperforms the FHM+ algorithm [69] with regard to memory consumption and run-time: memory consumption is up to 28 times lower than that of FHM+, while the execution time improvements range from 5 percent to two orders of magnitude across the compared databases. However, the performance of the algorithm depends on the user-defined minimum utility threshold and the minimum and maximum lengths. Moreover, it does not perform well on dense databases.
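How length constraints cut the search can be illustrated with a naive depth-first enumeration (this is only a sketch of the constraint logic, not EHIL's pruning; the utility table and all names are invented):

```python
# Length constraints in the spirit of EHIL: a maximum length stops depth-first
# extension early, and a minimum length keeps tiny itemsets out of the output.

def mine(prefix, items, utility, minutil, minlen, maxlen, out):
    """Naive depth-first enumeration with min/max length constraints."""
    for i, item in enumerate(items):
        itemset = prefix + [item]
        u = utility(itemset)
        if u >= minutil and minlen <= len(itemset) <= maxlen:
            out.append((tuple(itemset), u))
        if len(itemset) < maxlen:            # prune: never extend past maxlen
            mine(itemset, items[i + 1:], utility,
                 minutil, minlen, maxlen, out)
    return out

# Toy utility lookup over a fixed database (hypothetical numbers).
utils = {("a",): 4, ("b",): 6, ("c",): 2, ("a", "b"): 12,
         ("a", "c"): 3, ("b", "c"): 9, ("a", "b", "c"): 15}
u = lambda s: utils.get(tuple(s), 0)
print(mine([], ["a", "b", "c"], u, minutil=8, minlen=2, maxlen=2, out=[]))
# [(('a', 'b'), 12), (('b', 'c'), 9)]
```

With maxlen=2 the 3-itemset {a, b, c} is never visited, and with minlen=2 the single items are never reported, even when they exceed the utility threshold.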

The algorithms [15,16,17, 51] do not perform consistently across sparse and dense databases. To overcome this issue, Dawar et al. [70] proposed UT-Miner (Utility Tree Miner), which mines HUIs in one phase without generating any candidates. The algorithm uses a lightweight method that constructs the projected database while exploring the search space. They proposed the UT (Utility Tree) structure to compactly store transaction information with each node of the tree. UT-Miner is one of the top-performing algorithms across dense and sparse databases.

To design efficient pruning strategies, Wu et al. [71] proposed HUI-PR (HUIM with PRuning strategies) to efficiently find the HUIs. The work is an extension of EFIM [61]. It proposes two pruning strategies, namely the strict local utility and the strict remaining utility, to significantly limit the search space and effectively estimate the candidates. HUI-PR has the following advantages: (1) it estimates fewer candidates than d2HUP [4] and EFIM [61]; (2) it generates fewer branches of the search space than EFIM [61]; (3) it significantly reduces memory consumption on large databases; (4) its multi-threshold method helps speed up mining in large databases without using projection operations; (5) it performs well on dense databases. However, it suffers from the following disadvantages: (1) the newly proposed upper bounds increase the complexity; (2) it does not perform well on sparse databases.

The FCHM (Fast Correlated High-utility itemset Miner) algorithm [72] incurs two major drawbacks: (1) correlation measures must satisfy the DCP; (2) each measure requires different calculation and pruning strategies to enhance the performance of the corresponding mining task. To address these issues, Hoa et al. [73] proposed a HUIM-class algorithm, namely GMCHM (General Method for Correlated High-utility itemset Mining), based on the EFIM algorithm [51], which considers the high-correlation property among itemsets within customer transaction databases. It stores all the information required to compute the given measure in only one database scan. The GMCHM algorithm utilizes efficient techniques such as High-utility Database Projection (HDP) and High-utility Transaction Merging (HTM), and tighter upper bounds, such as the local utility and sub-tree utility, to prune the candidates. The experimental results show that the algorithm performs well compared to the existing FCHM [72] in terms of execution time, memory consumption, and the number of candidates checked. GMCHM is up to 7.5 times faster than the FCHM algorithm using the bond measure and up to 2,600 times faster when the \( all\_confidence \) measure is set to 0.1.

EFIM-Closed [64], HMiner-Closed [74], and CHUI-Miner(Max) [75] provide concise representations of HUIs. However, they incur long execution times, high memory consumption, and scalability issues, especially for dense and large datasets. To solve these problems, Hai Duong et al. [76] proposed two efficient algorithms, namely C-HUIM and MaxC-HUIM, the first of their kind, to simultaneously mine CHUIs (Closed HUIs) and MaxHUIs (Maximal HUIs), respectively. A novel weak upper bound (WUB) named FWUB and a corresponding pruning strategy named SPWUB are proposed to quickly prune low utility itemsets. Moreover, two pruning strategies, PSNonCHUB (Pruning Strategy of NonCHU Branches in the Process of Constructing CHUIs) and LPSNonCHUB (Local Pruning Strategy of NonCHU Branches Without Checking the Subpattern Relationship), are proposed to reduce the search space. PSNonCHUB only needs to check the inclusion relationship among a small number of itemsets, while LPSNonCHUB needs no inclusion checks at all. The first optimization technique designs a new minimum support threshold, called newms, based on the minimum utility threshold mu and the maximal TWU value. The second optimization technique reduces the number of inclusion-relation checks between itemsets when the pruning strategy PSNonCHUB is used. A structure named MPUN-list, a modified version of the PUN-list (PU-tree-Node list) structure [77], is adopted to efficiently store and compute the utility and support information of each itemset. The experimental results show that the proposed algorithms are up to 100 times faster, more memory efficient, and more scalable than the state-of-the-art algorithms, namely CHUI-Miner [78], EFIM-Closed [64], HMiner-Closed [74], and CHUI-Miner(Max) [75], with regard to execution time, memory consumption, and scalability.

3.2.1 Summary

As discussed above, tree-based HUIM algorithms improve over level-wise approaches [8, 46, 54] in terms of candidate generation and database scans. The reason is that most of these approaches require only two scans of the database and need less memory, as they explore the search space in a depth-first manner. However, tree-based algorithms suffer from the following main disadvantages: (1) the tree structure is complex and stores too much information, which leads to high memory usage; (2) they require more time to recursively process all the prefix-trees to generate candidates; (3) a compact tree is generally expensive to build; (4) a constructed tree may not fit completely in memory; (5) they do not perform well on both dense and sparse databases; (6) they do not scale well on very large databases. A detailed comparison of the characteristics of tree-based HUIM algorithms is shown in Table 9. Furthermore, the pros and cons of all the tree-based HUIM algorithms are depicted in Table 10.

Table 9 An overview of key characteristics of tree-based high utility itemsets mining algorithms
Table 10 Pros and cons of tree-based high utility itemsets mining algorithms

3.3 Utility-list based high utility itemsets mining

Utility-list-based HUIM algorithms [16,17,18, 78, 84] have the following advantages over level-wise [8, 46, 54] and tree-based approaches [15, 32, 47, 48, 50, 51]: (1) use vertical [17, 60] and/or horizontal [4] data structure to efficiently perform in one phase only to find the HUIs; (2) the search space spans for the set-enumeration tree [34] obtain the total utility of itemsets to build a utility-list by performing join operations; (3) the upper-bound remaining utility is used to obtain the utility lists; (4) the depth-first search method is used to quickly compute the value of utilities; (5) avoid the costly candidate’s generation and utility computation; (6) scalable for the large number of items and transactions; (7) effectively reduce the execution time and memory usage.

Liu et al. [81] proposed the HUI-Miner (High Utility Itemset Miner) algorithm to efficiently mine HUIs in one phase. The authors proposed a vertical data structure called the utility-list, which stores the utility information of an itemset along with heuristic information for pruning the search space. However, the join operations between the utility-lists of k-itemsets and (k+1)-itemsets consume a large amount of time. Moreover, the algorithm suffers from high space and time complexity.
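The utility-list and its join can be sketched compactly. Each entry is a triple (tid, iutil, rutil): the itemset's utility in that transaction and the remaining utility of the items after it. The example data below is invented for illustration:

```python
# Utility-list join in the style of HUI-Miner: joining the lists of the
# extensions Px and Py of a common prefix P yields the list of Pxy. Only
# transactions containing both extensions survive; the prefix utility is
# subtracted once so it is not double-counted.

def join(xl, yl, pl=None):
    """Join the utility-lists of Px and Py (pl is the list of prefix P)."""
    pmap = {tid: iu for tid, iu, _ in pl} if pl else {}
    ymap = {tid: (iu, ru) for tid, iu, ru in yl}
    out = []
    for tid, xiu, _ in xl:
        if tid in ymap:
            yiu, yru = ymap[tid]
            out.append((tid, xiu + yiu - pmap.get(tid, 0), yru))
    return out

# Utility-lists of the single items {x} and {y}; the prefix P is empty.
x_list = [(1, 5, 7), (2, 10, 3)]
y_list = [(1, 4, 2), (3, 6, 0)]
print(join(x_list, y_list))  # [(1, 9, 2)], i.e. {x, y} occurs only in tid 1
```

Summing the iutil fields of the joined list gives the itemset's utility, and iutil + rutil per entry gives the upper bound used to stop extending unpromising branches.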

The existing utility mining algorithms [14, 32, 42] do not cope well with long transactions or small minimum utility thresholds. To overcome these challenges, Liu et al. [4] proposed d2HUP (Direct Discovery of High Utility Patterns), which uses a pattern-growth approach to directly discover the HUIs in a single phase without candidate generation. It searches a prefix-extension tree in a depth-first manner and enumerates each itemset to compute its utility. The original utility information is maintained by the proposed CAUL (Chain of Accurate Utility Lists) structure, which computes tight bounds for pruning and directly identifies the HUIs. Moreover, an upper bound on the utility of the prefix extensions of an itemset prunes the search space so that HUIs are obtained directly. A look-ahead strategy, based on the closure propertyFootnote 4 [4] and the singleton propertyFootnote 5 [4], is incorporated to enhance the efficiency of the algorithm on dense databases. This work was further extended in [16] with several features such as efficient computation by pseudo-projection, controlled irrelevant-item filtering, and optimization by partial materialization. However, it suffers from the following disadvantages: (1) the tree structure and CAUL both consume a large amount of memory; (2) the efficiency of the algorithm is low when the look-ahead strategy is of little use; (3) the algorithm computes upper-bound and utility values for a large number of candidates; (4) the performance of the extended algorithm [16] is not compared with recent algorithms [17, 60].

HUI-Miner [81] performs costly join operations to calculate the utility of each itemset. To address this limitation, Fournier-Viger et al. [17] proposed the FHM (Fast High utility Miner) algorithm with a novel pruning strategy, EUCP (Estimated Utility Co-occurrence Pruning). EUCP is based on the EUCS (Estimated Utility Co-occurrence Structure) [17], which reduces the number of costly join operations by analyzing item co-occurrences. Experimental results show that FHM is up to six times faster and performs up to 95 percent fewer join operations than HUI-Miner [81]. However, it has the following disadvantages: (1) it consumes slightly more memory than HUI-Miner and performs poorly on dense databases; (2) it suffers from high space and time complexity; (3) it treats all items equally, which does not exactly reflect the characteristics of real-world databases.
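The EUCS can be sketched as a map from item pairs to their pair TWU, built in one scan; EUCP then skips any join whose pair falls below the threshold. A minimal sketch on an invented toy database (numbers are illustrative):

```python
# EUCS in the style of FHM: one scan records, for every co-occurring item
# pair, the sum of the utilities of the transactions containing both items.
# If that pair TWU is below minutil, no itemset containing the pair can be
# high utility, so its utility-list join is skipped (EUCP).

from itertools import combinations

def build_eucs(db, profit):
    eucs = {}
    for t in db:
        tu = sum(profit[i] * q for i, q in t.items())   # transaction utility
        for a, b in combinations(sorted(t), 2):
            eucs[(a, b)] = eucs.get((a, b), 0) + tu
    return eucs

profit = {"a": 5, "b": 2, "c": 1}
db = [{"a": 1, "b": 2}, {"b": 1, "c": 4}, {"a": 2, "c": 1}]
eucs = build_eucs(db, profit)
minutil = 10
print(eucs)                                 # {('a','b'): 9, ('b','c'): 6, ('a','c'): 11}
print(eucs.get(("a", "b"), 0) >= minutil)   # False: prune extensions containing {a, b}
```

The structure is triangular (pairs stored in sorted order), which is why its memory overhead stays modest relative to the joins it avoids.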

The traditional HUIM algorithms [35, 52] compactly represent HUIs but fail in the case of association rules. To resolve this issue, Sahoo et al. [85] proposed the HUCI-Miner (High Utility Closed Itemset Miner) algorithm, which extracts HUCIs (High Utility Closed Itemsets) along with their generators. It provides a condensed representation of association rules within the support-confidence framework for mining HUIs. The algorithm achieves a significant compression of the number of HUIs.

CHUD [78] overestimates the utility of many low utility candidates, which results in high memory consumption and long run-time. To address this issue, Wu et al. [78] proposed the EU-List (Extended Utility-List) structure to maintain the utility information of the itemsets in the transactions. The EU-List efficiently calculates the utility of itemsets in memory without scanning the original database. Further, the CHUI-Miner (Closed\(^+\) High Utility Itemset mining without candidates) approach [78] is proposed, which uses the divide-and-conquer method to find all the CHUIs without generating candidates. However, it consumes too much time to recursively process all the conditional prefix-trees.

The author of [60] proposed the HUP-Miner algorithm to efficiently discover the HUIs. A partitioned utility-list structure is proposed to maintain the utility information at the granularity of partitions. Two novel pruning strategies, namely PU-Prune (Partitioned Utility Pruning) and LA-Prune (Look-Ahead utility Pruning), are proposed to limit the mining search space and hence improve the efficiency of the mining process. HUP-Miner performs well on sparse databases. However, it shows poor performance on dense databases. Moreover, the number of partitions needs to be explicitly set by the user.

To enhance the effectiveness of the mining process by pruning the search space with length constraints, Fournier-Viger et al. [69] proposed the FHM+ (Fast High utility itemset Mining+) algorithm, which finds the HUIs under length constraints. It extends FHM [17] with LUR (Length Upper-bound Reduction). FHM+ reduces the upper bounds on the utility of itemsets using the length constraints to prune the search space. The advantage of FHM+ is that the number of patterns is effectively decreased, which improves the performance of the mining process. However, detailed results evaluating the efficiency of the LUR concept are not provided.

List-based methods [4, 17, 81] require a large number of comparison operations between two given items in a transaction and also need to construct lists for them. To overcome this limitation, Ryang et al. [86] proposed the IMHUP (Indexed-list based Mining of High Utility Patterns) algorithm, which is based on the IU-List (Indexed Utility-List) structure. The IU-List effectively reduces the comparison operations when constructing the local lists to mine the HUIs. Moreover, the RUI (Reducing upper-bound Utilities in IU-lists) technique is developed, which decreases the search space by reducing the upper-bound utilities in the IU-List. Further, the CHI (Combining High utility patterns without constructing IU-lists) technique is developed, which efficiently generates HUIs from the lists without constructing a local IU-List when the lists contain only information on the same revised transactions. The experimental results show that IMHUP outperforms HUI-Miner [81] and FHM [17] on both dense and sparse databases.

Traditional HUIM results may contain weakly correlated items, which can lead to incorrect or useless decisions. To avoid this problem, Fournier-Viger et al. [87] proposed the FCHM algorithm to efficiently find CHIs (Correlated High utility Itemsets) using the bond measure [88]. It integrates four strategies, namely DOS (Directly Outputting Single items), PSN (Pruning Supersets of Non-correlated itemsets), PBM (Pruning using the Bond Matrix), and AUL (Abandoning Utility-List construction early), to discover CHIs efficiently. The algorithm is more than two orders of magnitude faster than FHM [17]. In some cases, by pruning a large number of weakly correlated itemsets and mining only CHIs, it discovers more than five orders of magnitude fewer patterns. However, detailed results for the PBM and AUL strategies are not presented.

EFIM [51] uses expensive sort operations to identify duplicate transactions in the database. To address this issue, Krishnamoorthy [18] proposed the HMiner algorithm, which utilizes the CUL (Compact Utility-List) structure to efficiently store the utility information. It develops a virtual hyperlink structure that discovers duplicate transactions in the database. Further, it applies several pruning strategies (TWU-Prune, U-Prune, LA-Prune, C-Prune, and EUCS-Prune) to efficiently mine the HUIs. The algorithm achieves execution time improvements ranging from a modest thirty percent to three orders of magnitude across several benchmark databases. Moreover, its memory consumption shows up to an order of magnitude improvement over HUI-Miner [81], FHM [17], IMHUP [86], and EFIM [51]. HMiner performs well in the dense regions of both dense and sparse benchmark databases. However, for moderately dense databases with long transactions, it shows poor performance with regard to memory consumption and run-time.

The previous approaches [16, 17, 61] do not perform well on sparse databases. To address this issue, Peng et al. [82] proposed the one-phase mHUIMiner (modified HUI-Miner) algorithm, which achieves the best running time on sparse databases. It incorporates the IHUP tree structure [32] into the original HUI-Miner algorithm [16] to avoid unnecessary utility-list constructions. mHUIMiner is the fastest algorithm on sparse databases and has performance comparable to the benchmark methods on dense databases. Moreover, it remains efficient as density decreases. Its execution time and memory usage increase with the input size, but not exponentially.

The algorithms [16, 17] require high memory and long execution times. To overcome this problem, Duong et al. [89] proposed ULB-Miner (Utility-List Buffer Miner) to find the HUIs. They proposed an improved ULB (Utility-List Buffer) structure that efficiently stores and retrieves the utility-lists and reuses memory during the mining process. The ULB constructs utility-list segments in linear time. The algorithm is up to ten times faster and consumes up to six times less memory than HUI-Miner [16] and FHM [17]. Moreover, it achieves better performance on both sparse and dense databases. However, existing utility-list techniques cannot be directly applied to the utility-lists stored in the ULB.

CHUI-Miner [78] suffers from the following drawbacks: (1) it needs to perform costly join operations to evaluate the utility of each itemset; (2) a large number of candidates are generated to find the set of CHUIs; (3) a pruning check is applicable to an itemset only when its utility-list is fully constructed; (4) it considers a large number of non-closed itemsets, which significantly reduces the performance of the mining process. To address these issues, Dam et al. [84] proposed the CLS-Miner algorithm to efficiently mine the CHUIs. The algorithm integrates three pruning strategies, namely Chain-EUCP, LBP, and pruning by coverage, which prune the search space before the utility-lists are constructed. The concept of coverageFootnote 6 is inspired by the definition of frequent closed itemsets in FIM [67]. Moreover, a pre-check method is proposed that quickly determines whether an itemset is a subset of another itemset. It optimizes the closure computations and subsumption checks, which significantly reduces the time to discover the CHUIs. The algorithm outperforms CHUD [20] and CHUI-Miner [78] on both dense and sparse benchmark databases. Moreover, it scales linearly in the number of transactions and the number of items. However, it needs to keep the structures for its pruning strategies, Chain-EUCP and coverage, in memory.
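A fast subset pre-check of the kind described above can be sketched with bit masks (an illustrative technique, not necessarily CLS-Miner's exact implementation): encoding itemsets as integers reduces the test X subset-of Y to a single bitwise operation.

```python
# Subset pre-check via bit masks: X is a subset of Y exactly when the
# bitwise AND of their masks equals X's mask. This makes the subsumption
# test used in closure checking a constant-time operation per pair.

def to_mask(itemset, item_index):
    m = 0
    for i in itemset:
        m |= 1 << item_index[i]     # one bit per distinct item
    return m

item_index = {"a": 0, "b": 1, "c": 2, "d": 3}
x = to_mask({"a", "c"}, item_index)
y = to_mask({"a", "b", "c"}, item_index)
print(x & y == x)   # True: {a, c} is subsumed by {a, b, c}
print(y & x == y)   # False: the converse does not hold
```

The trade-off matches the caveat above: the masks (and any cached co-occurrence structure) must be kept in memory for the duration of the mining run.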

The concept of MHUIs (Maximal High Utility Itemsets) [75, 90] was proposed to mine all the HUIs along with their utility values, with or without additional database scans, while requiring less memory to store the results. However, existing approaches either require two phases to mine MHUIs or are incomplete. Moreover, they generate a large number of candidates, or mine compact forms of HUIs indirectly through CHUIs and then remove the non-maximal patterns. To address these issues, Nguyen et al. [91] proposed the CHUI-Mine (Maximal) algorithm, which efficiently mines MHUIs from transaction databases. It utilizes two pruning techniques, namely EUCP and CUIP (Continuous Unpromising Item Pruning), that significantly prune the search space to enhance the mining performance. The proposed algorithm significantly reduces the memory requirement and execution time needed to store the discovered patterns. However, it may generate unnecessary candidate patterns.
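The distinction between all HUIs and maximal ones can be illustrated with a naive post-filter: an HUI is maximal when no proper superset of it is also an HUI. This clarifies only the output, not the algorithm; miners such as CHUI-Mine (Maximal) avoid materializing all HUIs first.

```python
def maximal(huis):
    """Keep only the HUIs that have no proper superset in the collection."""
    sets = [frozenset(h) for h in huis]
    return [s for s in sets if not any(s < t for t in sets)]

# {'a'} is absorbed by its superset {'a', 'b'}; {'c'} survives
print(maximal([{'a'}, {'a', 'b'}, {'c'}]))
```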

Closed and concise rule mining approaches take additional time to construct the latticeFootnote 7 used to mine closed HUIs. To address this issue, Merugula et al. [92] proposed the CG-Algorithm (Closed and Generator Algorithm), which extracts closed and generator HUIs from the same single closure check in the given database. It uses a hash-based data structure to maintain HUIs. The advantage of the CG-Algorithm is that it consumes less memory and has a lower execution time for both closed and generator HUIs on the standard databases.

Many efficient algorithms [20, 64, 78, 84] provide concise profitable commodity combinations to managers. However, operators may need commodities that not only generate generous profits but are also purchased often by customers. To resolve this issue, Wei et al. [93] proposed the FCHUIM (Frequent Closed High Utility Itemset Mining) algorithm to find all FCHUIs (Frequent and Closed High Utility Itemsets). They proposed an efficient list structure, named TSL (Total Summary List), to reuse memory and provide fast access to item information. A pre-check method is proposed that prunes the search space to efficiently reduce the non-closed HUIs. Moreover, a nested list structure is used to quickly find FCHUIs among a large number of candidates. The proposed algorithm uses an upper bound on the utility of items that works like the Z-element in [94]. FCHUIM significantly reduces the number of candidates and shows high performance on both dense and sparse databases.

One-phase HUIM algorithms [4, 16, 17] are time-consuming and memory-consuming, especially on dense databases with long transactions. To address these issues, Shan et al. [95] proposed an efficient one-phase HUIM algorithm, named EHUIM-DS, based on a novel data structure. The data structure reorganizes the transaction database so that all HUIs can be obtained effectively. The algorithm calculates utility values with one or two scans of ITems Data (ITDs) instead of scanning utility-list structures or the entire database, and it significantly reduces memory usage by using a depth-first search. Two upper bounds, namely extension utility and local transaction weighted utility, are proposed to prune the search space in both width and depth. The experimental results show that the proposed algorithm performs well compared with the state-of-the-art methods HUI-Miner [16], d2HUP [4], FHM [17], and UFH [83] in terms of the number of candidates, run-time, and memory usage on sparse and dense databases.

Wu et al. [96] proposed an efficient algorithm, named UBP-Miner (Utility Bit Partition Miner), to improve the utility-list construction process. A novel set of bit-wise operations, called BEO (Bit mErge cOnstruction), is proposed to speed up the construction process. Besides, a novel data structure called UBP (Utility Bit Partition) is designed to support BEO. This structure is integrated into the UBP-Miner algorithm, which also applies several search space reduction strategies. Experimental results show that UBP-Miner is faster than several state-of-the-art algorithms, such as HUI-Miner [16], HUP-Miner [60], and ULB-Miner [89], in terms of run-time, memory usage, and scalability on benchmark datasets. However, the proposed algorithm requires some additional memory.

The existing utility-list-based algorithms [16, 60, 89] are time-consuming and memory-consuming when storing the itemset information in utility-lists. To solve this problem, Cheng et al. [97] proposed an efficient one-phase utility-list-based HUIM algorithm, named HUIM-SU, to mine HUIs from transactional datasets. A simplified utility-list is designed in which each record represents all the utilities of the transactions of an individual item. A construction tree is proposed to reduce the search space based on the simplified utility-list, and compressed storage is proposed to reduce the memory usage of the construction tree. To further reduce the search space of promising candidates, the extension utility and local TWU utility are utilized to minimize the number of items. The experimental results show that the proposed algorithm performs better than the state-of-the-art algorithms HUI-Miner [16], HUP-Miner [60], FHM [17], and ULB-Miner [89] in terms of the number of candidates, memory usage, and execution time on dense and sparse datasets.

Table 11 An overview of key characteristics of utility-list-based high utility itemsets mining algorithms

3.3.1 Summary

As discussed above, the utility-list-based HUIM algorithms outperform level-wise and tree-based approaches on the benchmark databases in terms of efficiency, memory usage, scalability, etc. However, the utility-list-based approaches suffer from the following drawbacks: (1) they need to perform costly join operations between the utility-lists of (k+1)-itemsets and k-itemsets, which take a large amount of time; (2) they suffer from high space requirements and complexity; (3) they perform well either on sparse databases or on dense databases; (4) they avoid promising itemsets that do not appear in the database; (5) their run-time and memory usage grow quickly as the input size increases; (6) they do not scale well on large databases. The detailed comparative summary of utility-list-based HUIM algorithms is shown in Table 11. The pros and cons of all the utility-list-based HUIM algorithms are depicted in Table 12. The simplicity of the utility-list structure and the high performance of utility-list-based algorithms have led to the development of numerous utility-list-based algorithms for HUIM and variations of the HUIM problem, such as closed high utility itemset mining [84], top-k high utility itemset mining [94, 99], high utility itemset mining in uncertain databases [100], high utility sequential pattern mining [101], and on-shelf high utility itemset mining [102], among others [17, 69, 103]. Although the introduction of the utility-list structure has been a breakthrough in the field of HUIM, utility-list structures still have to be improved. Due to the wide applications of the utility-list structure in high-utility pattern mining, there is an important need for a more effective and efficient utility-list structure that can be constructed in linear time and reduces memory usage.
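The costly join operation behind drawback (1) can be sketched as follows. This is a simplified version of the classic "construct" procedure from HUI-Miner-style algorithms; the names are ours. Given the utility-lists of two extensions Px and Py of a common prefix P (lists of (tid, iutil, rutil) tuples), the list of Pxy is built by intersecting them on transaction ids:

```python
def construct(ul_p, ul_px, ul_py):
    """Join the utility-lists of Px and Py to obtain the list of Pxy.
    ul_p is the common prefix P's list (empty when P is the empty set)."""
    p_iutil = {tid: iu for tid, iu, _ in ul_p}
    py = {tid: (iu, ru) for tid, iu, ru in ul_py}
    out = []
    for tid, iu_x, _ in ul_px:
        if tid in py:  # transaction contains both Px and Py
            iu_y, ru_y = py[tid]
            # the prefix's utility is counted in both iu_x and iu_y
            out.append((tid, iu_x + iu_y - p_iutil.get(tid, 0), ru_y))
    return out

ul_a = [(0, 10, 5), (1, 5, 3)]   # utility-list of {a}
ul_b = [(0, 2, 3), (2, 4, 0)]    # utility-list of {b}
print(construct([], ul_a, ul_b))  # [(0, 12, 3)]
```

Performing one such join per candidate extension is what makes the construction cost dominate on databases with many promising itemsets.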

Table 12 Pros and cons of utility-list-based high utility itemsets mining algorithms

3.4 Projection-based high utility itemsets mining

To overcome the drawbacks of the utility-list-based approaches, projection-based HUIM algorithms [13, 19, 42, 104] were introduced to recursively project the target items into projected sub-databases. They have the following advantages: (1) they reduce excessive candidate generation; (2) they process efficiently by using prefix projection to decrease the size of the projected sub-databases; (3) they improve the efficiency of the mining process by using bi-level and pseudo-projection; (4) they perform well on both sparse and dense databases at most support levels; (5) they are memory-efficient and scalable.
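The prefix projection step behind advantage (2) can be sketched as follows (an illustrative simplification; the names are ours): for a chosen prefix item, keep only the transactions containing it, and within each keep only the items that follow it in the processing order. The resulting smaller sub-database is then mined recursively for extensions of the prefix.

```python
def project(db, prefix_item, order):
    """Return the sub-database projected on prefix_item."""
    rank = {it: i for i, it in enumerate(order)}
    proj = []
    for trans in db:
        if prefix_item in trans:
            suffix = {it: q for it, q in trans.items()
                      if rank[it] > rank[prefix_item]}
            if suffix:  # drop transactions with an empty suffix
                proj.append(suffix)
    return proj

db = [{'a': 2, 'b': 1}, {'a': 1, 'c': 3}, {'b': 4}]
print(project(db, 'a', order=['a', 'b', 'c']))  # [{'b': 1}, {'c': 3}]
```

Pseudo-projection avoids copying these dictionaries by storing only pointers into the original transactions.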

The Two-phase algorithm [12] shows poor performance on dense databases and long patterns. To resolve this problem, Erwin et al. [79] proposed the CTU-PRO algorithm, which discovers HUIs using the pattern-growth approach [57]. Its CUP-tree (Compressed Utility Pattern tree) is a variant of the CTU-tree [47]. The proposed algorithm uses the TWU [12] to prune the search space and discovers the HUIs without re-scanning the database. Extending this work [79], the authors developed CTU-PROL, which mines the HUIs from large databases using the pattern-growth approach [57]. CTU-PROL performs well compared with the benchmark algorithms Two-phase [12] and CTU-Mine [47] on dense and sparse databases at most support levels. However, the global CUP-tree does not fit completely in memory.

The CTU-PRO algorithm [79] suffers from high memory consumption. To address this issue, Lan et al. [105] proposed a novel PB (Projection-based) method that uses an indexing mechanism and a pruning strategy to efficiently find the HUIs. It uses a TC (Temporal Candidate) itemset table that quickly stores and retrieves significant information on the values of the itemsets during the mining process. This work was later extended in [13]. The pruning strategy is based on the TWU [12], which reduces the number of unpromising candidates. The proposed algorithm [13] performs better than Two-phase [12] and CTU-PRO [79] in terms of the number of candidates, run-time, and memory consumption. However, it generates too many redundant candidates.
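TWU-based pruning, used here and in many of the algorithms above, can be sketched in a few lines (a standard formulation; the helper names are ours). The TWU of an item is the sum of the utilities of the transactions containing it; since TWU is an anti-monotone upper bound on utility, items whose TWU falls below the minimum utility threshold cannot appear in any HUI:

```python
def twu_prune(db, profit, minutil):
    """Return the promising items, i.e. those with TWU >= minutil.
    db: list of {item: quantity}; profit: {item: external utility}."""
    twu = {}
    for trans in db:
        tu = sum(profit[it] * q for it, q in trans.items())  # transaction utility
        for it in trans:
            twu[it] = twu.get(it, 0) + tu
    return {it for it, w in twu.items() if w >= minutil}

db = [{'a': 2, 'b': 1}, {'a': 1, 'c': 3}, {'b': 4}]
profit = {'a': 5, 'b': 2, 'c': 1}
print(sorted(twu_prune(db, profit, minutil=12)))  # ['a', 'b']
```

Because the bound is loose, TWU-based pruning alone tends to leave many redundant candidates, which is exactly the drawback noted for [13].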

One-phase HUIM approaches [16, 60, 61, 69] suffer from high memory usage and run-time. To address these issues, Bai et al. [104] proposed SPHUI-Miner (Selective database Projection-based HUI mining algorithm) to enumerate all the HUIs. It uses a compact data format, named HUI-RTPL (High Utility Reduced Transaction Pattern List), that stores unique transactions and the cumulative utilities of items in these transactions. Two new upper bounds, namely tup (Transaction Utility in Projection) and pu (Projection Utility), are proposed to effectively prune the search space. Furthermore, two novel data structures, namely the SUP-List (Selective database projection utility-list) and the Tail-Count list, are developed to reduce the number of database scans and the database projection size, and to store only the information relevant for mining HUIs. The proposed algorithm performs better than the benchmark algorithms [4, 15, 16, 60, 61, 69, 81] in computing time, memory consumption, and candidate generation on very condensed databases. Moreover, the proposed algorithm is scalable because it is independent of the order of the processed projections, which makes it suitable for distributed environments.

HUI-Miner [16], d2HUP [4], and EFIM [61] face high run-time and memory usage costs, and they perform well either on dense databases or on sparse databases. To address these issues, Jaysawal et al. [19] proposed the DMHUPS (Discovering Multiple High Utility Patterns Simultaneously) algorithm to effectively discover the HUIs. It utilizes the IUData List (Item Utility Data list) structure to store the information of promising length-1 itemsets along with their positions in the transactions, which is used to efficiently obtain the initial projected database. It simultaneously utilizes the utility and a tighter extension upper bound to reduce the search space for multiple potential candidates. The proposed algorithm finds multiple HUIs simultaneously to efficiently limit the search space, and it uses transaction merging and look-ahead strategies to efficiently discover longer patterns. DMHUPS shows high performance compared with UP-Growth+ [15], HUI-Miner [16], d2HUP [4], and EFIM [61] on both dense and sparse databases. Moreover, it is memory-efficient and scalable. However, it uses the transaction merging strategy only for dense databases.

HUIPM [106] and FDHUP [107] consider the correlation factor only through the co-occurrence frequency of items in each transaction. To resolve this challenge, Gan et al. [5] proposed the CoHUIM (non-redundant Correlated High Utility Itemset Mining) algorithm to find CoHUIs (non-redundant Correlated High Utility Itemsets) by spanning the sub-projected databases of candidates. CoHUIs take both the utility and a positive correlation measure into account. Moreover, an SDC (global Sorted Downward Closure) property is developed to guarantee global anti-monotonicity and to identify the complete set of CoHUIs. The proposed algorithm prunes a large number of unpromising candidates and accelerates the mining process. CoHUIM outperforms the benchmark algorithms concerning run-time and the generated patterns. Moreover, it avoids excessive amounts of meaningless and redundant information. Furthermore, the obtained CoHUIs are more interesting and valuable than plain HUIs.

Table 13 An overview of key characteristics of projection-based high utility itemsets mining Algorithms

3.4.1 Summary

High utility itemset mining algorithms represent the transaction database through a summarized data structure and mine high utility itemsets by recursively constructing projected databases from the global data structure. The bottleneck of HUIM algorithms is the exponential search space and the time spent constructing the projected databases during recursive calls. Projection-based algorithms represent the transaction database as transactions only and utilize several techniques, such as closure and transaction merging, to mine patterns efficiently in one phase. However, projection-based approaches incur several problems: (1) they generate too many candidate sets in some cases due to the use of the TWU property; (2) they do not perform well for high threshold values on a few databases. The detailed summary of projection-based HUIM algorithms is shown in Table 13. Furthermore, the pros and cons of projection-based HUIM approaches are depicted in Table 14.

Table 14 Pros and cons of projection-based high utility itemsets mining Algorithms

3.5 Miscellaneous approaches of high utility itemsets mining

In this section, several miscellaneous approaches for HUIM are discussed. These approaches use vertical [109, 110] and/or horizontal [109] bitmap representations to reduce memory usage, and they grow linearly with the size of the database.

Song et al. [109] proposed the BAHUI (Bitmap-based Algorithm for High Utility Itemsets) algorithm, which utilizes a divide-and-conquer approach and bitmap representations to mine the HUIs. It uses a vertical bitmap to traverse the itemset lattice and a horizontal bitmap to calculate the real utilities of candidates. It uses efficient bit-wise operations and significantly reduces memory usage. Furthermore, it stores only the promising HUIs using the maximal length and inherits the search from the maximal itemset mining process. The proposed algorithm is efficient and scalable compared with the benchmark methods [12, 48], but it consumes slightly more memory than HUC-Prune [48].
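The vertical bitmap idea can be sketched as follows; this is our simplification of the general technique, not BAHUI's exact layout. Each item gets one bit per transaction, intersecting bitmaps with AND finds the transactions supporting an itemset, and the real utility is then computed from those transactions only:

```python
def vertical_bitmaps(db):
    """One integer bitmap per item: bit t is set iff the item occurs in T_t."""
    bms = {}
    for tid, trans in enumerate(db):
        for it in trans:
            bms[it] = bms.get(it, 0) | (1 << tid)
    return bms

def itemset_utility(itemset, db, profit, bms):
    """AND the bitmaps to locate supporting transactions, then sum utility."""
    bits = ~0  # all ones (Python ints are arbitrary precision)
    for it in itemset:
        bits &= bms.get(it, 0)
    total, tid = 0, 0
    while bits:
        if bits & 1:
            total += sum(profit[it] * db[tid][it] for it in itemset)
        bits >>= 1
        tid += 1
    return total

db = [{'a': 2, 'b': 1}, {'a': 1, 'c': 3}, {'a': 1, 'b': 2}]
profit = {'a': 5, 'b': 2, 'c': 1}
bms = vertical_bitmaps(db)
print(itemset_utility({'a', 'b'}, db, profit, bms))  # 21
```

The bitmaps grow linearly with the number of transactions, which matches the scalability claim for these approaches.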

Song et al. [111] proposed the IHUI-Mine (Index High Utility Itemsets Mine) algorithm to efficiently mine the HUIs. It uses the subsume index [112], a data structure for efficient frequent itemset mining, to enumerate and prune the search space. Moreover, a discovery algorithm is used to effectively compute the TWU for the HUIs. Furthermore, the real utilities of candidates can be verified from the recorded transactions in the database using a bitmap representation. The proposed algorithm is 5.50, 5.76, 1.37, and 2.73 times faster than the state-of-the-art algorithms Two-phase [12], FUM [8], HUC-Prune [48], and UMMI [49], respectively, and it also performs well in terms of memory consumption and scalability.

Song et al. [110] proposed the BPHUI-Mine (Binary Partition-based High Utility Itemsets Mine) algorithm to efficiently find the HUIs. It uses a vertical bitmap representation to effectively represent item expansion based on binary partitioning of the transaction database, which decreases the processing time and memory consumption. Several pruning strategies, such as maximum itemset utility [59] and EUCP [17], are used to prune the search space, and the support count is used to prune HTWUIs [50]. The proposed algorithm outperforms Two-phase [12], FUM [8], HUC-Prune [48], and UMMI [49] concerning memory usage, efficiency, and scalability. It grows linearly with the size of the database and is robust to noise in dense databases.

Hidouri et al. [108] proposed an algorithm, named SATCHUIM (SATisfiability-based Closed High Utility Itemset Mining), that makes original use of a symbolic artificial intelligence technique, propositional satisfiability, to enumerate all closed high utility itemsets embedded in a transaction database. The authors use a SAT-based formulation to specify, in terms of constraints, the task of finding (closed) high utility itemsets over transaction databases. The proposed algorithm performs a tree-based backtracking search with the DPLL (Davis-Putnam-Logemann-Loveland) procedure and prunes the search space in a manner similar to the TWU measure. The experimental results show that the proposed method performs well compared with two baseline HUIM algorithms, namely EFIM [51] and d2HUP [4]. The proposed method is also effective in comparison with three baseline closed HUIM algorithms, namely EFIM-Closed [64], CHUD [35], and CHUI-Miner [78]. However, the number of found itemsets highly depends on the selected threshold values: it decreases as the utility threshold increases, and the number of patterns can become small when the minimum utility threshold is large.

Dahiya et al. [113] proposed an optimization technique for big data, named EAHUIM (Enhanced Absolute High Utility Itemset Miner), an advanced version of AHUIM (Absolute High Utility Itemset Miner). A neutral division approach is proposed for dividing the search space among the computing nodes; it considers both parameters, the TWU and the length of the item in the subspace, when assigning the subspace to a node. The storage complexity of the process is reduced by maintaining subsets of transactions wherever possible instead of the complete transactions, and a divide-and-conquer approach is used for extracting the data from the large dataset. Two pruning techniques, namely Absolute Local Utility (ALU) and Absolute Subtree Utility (ASU), are used to prune the search space and significantly improve the mining process. The experimental results show the improvement of the proposed algorithm over the state-of-the-art algorithms EFIM-Par and PHUI-Miner [114] on benchmark datasets with varying numbers of transactions, distinct items, and average items per transaction.

Table 15 An overview of key characteristics of miscellaneous approaches of high utility itemsets mining algorithms

3.5.1 Summary

Several miscellaneous approaches for HUIM have been discussed that address the high memory consumption of the level-wise and tree-based algorithms. BAHUI uses horizontal and vertical bitmap representations to mine HUIs with a divide-and-conquer method. IHUI-Mine uses the subsume index to mine HUIs. BPHUI-Mine is based on a binary partition representation of the transaction dataset. However, these algorithms could be further improved in terms of memory usage. The details of the miscellaneous HUIM approaches are shown in Table 15. The pros and cons of all the miscellaneous approaches are depicted in Table 16.

Table 16 Pros and cons of miscellaneous approaches of high utility itemsets mining
Fig. 3

Horizontal view of HUIs mining algorithms for transactional databases

3.6 Summary and discussions

We discussed the various approaches of HUIM for transaction databases: level-wise, tree-based, utility-list-based, projection-based, and miscellaneous. It has been observed that they have the following main advantages: (1) they consider the usefulness and semantic information of an itemset; (2) they extract the maximum profit for businesses; (3) they use efficient pruning strategies to significantly prune the search space; (4) they significantly reduce excessive database scans; (5) they significantly reduce the number of candidates; (6) they improve the performance of the mining process; (7) they effectively reduce memory usage and execution time; (8) they are highly scalable; (9) they perform well on various benchmark databases. However, these approaches have the following limitations: (1) they are suitable only for static databases, not for incremental or dynamic databases; (2) they use only positive utility values, whereas real-world businesses deal with both positive and negative utility values; (3) they apply only to transaction databases, not to other databases, e.g. on-shelf, sequential databases, etc.; (4) they address only the basic mining problem, whereas many more complex settings exist, such as sequential pattern mining, data streams, uncertain databases, etc.

Fig. 4

Horizontal view of Closed-HUIs mining algorithms for transactional databases

Furthermore, we present a horizontal view of all the algorithms. We categorize the existing algorithms into two groups: high utility itemsets based algorithms and closed high utility itemsets based algorithms. Figures 3 and 4 show the horizontal classification of HUIs and closed HUIs mining algorithms, respectively.

4 Other high utility itemsets mining algorithms

In this section, we briefly discuss the HUIM approaches for other types of databases, including on-shelf HUIM; HUIM from sequential, uncertain, temporal, incremental, and other databases; databases with negative utility values; data stream mining; periodic mining; and privacy-preserving mining.

4.1 On-shelf high utility itemsets mining

Most HUIs remain undiscovered by the existing traditional HUIM algorithms because, in real applications, the exhibition periods of items differ. Hence, the mining results may be biased when items are not always on shelf. On-shelf utility mining has recently received interest in the data mining field due to its practical considerations: it considers not only the profits and quantities of items in transactions but also their on-shelf time periods in stores. Lan et al. [33] proposed TP-OHUI (Two-phase algorithm for mining On-shelf High Utility Itemsets in temporal databases), which efficiently and effectively mines high on-shelf utility itemsets. However, it works only with positive utilities. To resolve this challenge, Lan et al. [115] proposed an efficient three-scan mining approach, named TS-HOUN, that efficiently discovers high on-shelf utility itemsets with negative profit from temporal databases. An effective itemset generation method is developed to avoid generating a large number of redundant candidates and to effectively reduce the number of data scans during mining. Another efficient algorithm, named FOSHU (Faster On-Shelf High Utility itemset miner) [116], mines HUIs while considering the on-shelf time periods of items and items having positive and/or negative unit profit. The experiments show that FOSHU can be more than 1000 times faster and use up to 10 times less memory than the state-of-the-art algorithm TS-HOUN [115]. Dam et al. [102] proposed the KOSHU (fast top-K On-Shelf High Utility itemset miner) algorithm, which mines the top-k high on-shelf utility itemsets having positive and/or negative unit profits while considering the on-shelf time periods of the items. KOSHU introduces three novel strategies, named efficient estimated co-occurrence maximum period rate pruning, period utility pruning, and concurrence existing of a pair 2-itemset pruning, to reduce the search space.
KOSHU also incorporates several novel optimizations and a faster method for constructing utility-lists. Zhang et al. [117] proposed two methods for OSUM (On-Shelf Utility Mining) of sequence data, OSUMS and OSUMS+, to extract on-shelf high utility sequential patterns. For efficiency, several strategies are designed to reduce the search space and avoid redundant calculations using two upper bounds, the time prefix extension utility (TPEU) and the time reduced sequence utility (TRSU). In addition, two novel data structures are developed to facilitate the calculation of the upper bounds and utilities. OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.
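The core on-shelf idea can be sketched as follows: measure an itemset's utility only over the periods in which all of its items are on shelf, relative to the total utility of those periods. This is a hedged simplification; the exact measures used by TP-OHUI, FOSHU, and the other algorithms above differ in detail, and the names here are ours.

```python
def relative_utility(itemset, db, profit, shelf):
    """db: list of (period, {item: qty}); shelf: {item: set of periods}.
    Returns the itemset's utility in its common on-shelf periods, divided
    by the total utility generated during those periods."""
    periods = set.intersection(*(shelf[it] for it in itemset))
    util = total = 0
    for per, trans in db:
        if per not in periods:
            continue
        total += sum(profit[it] * q for it, q in trans.items())
        if all(it in trans for it in itemset):
            util += sum(profit[it] * trans[it] for it in itemset)
    return util / total if total else 0.0

db = [(1, {'a': 1, 'b': 1}), (2, {'a': 2}), (2, {'b': 1})]
profit = {'a': 5, 'b': 2}
shelf = {'a': {1, 2}, 'b': {1}}  # item b is on shelf only in period 1
print(relative_utility({'a', 'b'}, db, profit, shelf))  # 1.0 (period 1 only)
```

Restricting the denominator to the shared on-shelf periods is what removes the bias against items sold for only part of the year.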

4.2 High utility itemsets mining from sequential databases

The problem of mining high utility sequences aims at discovering subsequences having a high utility (importance) in a quantitative sequential database. High utility sequence mining has been applied in numerous applications. Solving this problem is quite challenging due to the combinatorial explosion of the search space when considering sequences, and because the utility measure of sequences does not satisfy the downward closure property used in pattern mining to reduce the search space [118]. Various extensions of the HUSP problem have been studied, such as hiding high utility sequential patterns in databases to protect sensitive information [119] and discovering high utility sequential rules [119]. Yin et al. [120] proposed an efficient algorithm, named USpan, to mine high utility sequential patterns from sequential databases. A lexicographic quantitative sequence tree is designed to extract the complete set of high utility sequences, and concatenation mechanisms are designed for calculating the utility of a node and its children, together with two effective pruning strategies. The experimental results show that USpan efficiently identifies high utility sequences from large-scale data at very low minimum utility thresholds.

4.3 High utility itemsets mining from uncertain databases

Traditional mining algorithms are designed to mine frequent or high utility patterns by considering support and utility individually in uncertain datasets. They are not designed to mine the required information from uncertain datasets when both measures (utility and uncertainty) are considered together as a multi-objective optimization problem. The utility measure is a semantic method to assess the value of a pattern, whereas the uncertainty measure is an objective method to assess the reliability and existence of a pattern. It is a non-trivial task to consider both measures to mine HUIs from uncertain datasets, because the two measures conflict with each other, which can result in useless extracted patterns. Lin et al. [100] proposed a novel framework, named PHUIM (Potential High-Utility Itemset Mining) in uncertain databases, to efficiently discover not only the itemsets with high utilities but also the itemsets with high existence probabilities in an uncertain database based on the tuple uncertainty model. The PHUI-UP algorithm (Potential High-Utility Itemsets Upper-Bound-based mining algorithm) is first presented to mine PHUIs (Potential High-Utility Itemsets) using a level-wise search. Since PHUI-UP adopts a generate-and-test approach to mine PHUIs, it suffers from the problem of repeatedly scanning the database. To address this issue, a second algorithm named PHUI-List (Potential High-Utility Itemsets PU-list-based mining algorithm) is also proposed. The latter directly mines PHUIs without generating candidates, thus greatly improving the scalability of PHUI mining. Lin et al. [121] proposed an efficient algorithm, named MUHUI (Mining Uncertain High-Utility Itemsets), to efficiently discover PHUIs in uncertain data.
Based on the PU-list (Probability-Utility-list) structure, the MUHUI algorithm directly mines PHUIs without generating candidates and can avoid constructing PU-lists for numerous unpromising itemsets by applying several efficient pruning strategies, which greatly improves its performance. The MUHUI algorithm scales well when mining PHUIs in large-scale uncertain datasets. Ahmed et al. [122] proposed a multi-objective evolutionary framework, named MOEAHEUPM, to discover high expected utility patterns in a limited time period from an uncertain database. The proposed approach considers utility and uncertainty simultaneously to extract the set of non-dominated high expected utility patterns (HEUPs) using evolutionary computation in an uncertain environment. It does not require prior knowledge (a minimum utility threshold or a minimum uncertainty threshold) to discover the information; instead, it mines more meaningful and unique non-dominated patterns to support decision-making.
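The two measures combined in this line of work can be sketched under the tuple uncertainty model, where each transaction carries an existence probability. This is a hedged simplification: PHUIM's exact definitions and thresholds differ in detail, and the names below are ours.

```python
def utility_and_probability(itemset, db, profit):
    """db: list of (transaction_dict, existence_probability).
    Returns (total utility, expected support) of the itemset: utility is
    summed over the supporting transactions, and the expected support sums
    their tuple probabilities."""
    util, prob = 0, 0.0
    for trans, p in db:
        if all(it in trans for it in itemset):
            util += sum(profit[it] * trans[it] for it in itemset)
            prob += p
    return util, prob

db = [({'a': 2, 'b': 1}, 0.9), ({'a': 1}, 0.5), ({'a': 1, 'b': 2}, 0.4)]
profit = {'a': 5, 'b': 2}
print(utility_and_probability({'a', 'b'}, db, profit))  # utility 21, support ~1.3
```

A pattern is then of interest only when both values clear their respective thresholds, which is what makes the problem multi-objective.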

4.4 High utility itemsets mining from temporal databases

The existing traditional HUIM algorithms generate too many candidates, which makes it difficult for the user to identify useful items in large static databases. To resolve this issue, several algorithms have been proposed in data stream mining. Chu et al. [123] proposed the THUI-Mine (Temporal High Utility Itemsets Mine) algorithm, which mines temporalFootnote 8 HUIs from data streams. The proposed algorithm identifies the temporal HUIs with fewer candidate itemsets and with high performance. Shie et al. [90] proposed the GUIDE (Generation of maximal high Utility Itemsets from Data strEams) algorithm, which generates compact and insightful patterns that are both high utility and maximal from the data stream. Ryang et al. [124] proposed the SHU-Grow (Sliding window-based High Utility Grow) algorithm and the SHU-Tree (Sliding window-based High Utility Tree), which efficiently mine the HUIs from continuous data streams. They also presented two techniques, namely RGE (Reducing Global Estimated utilities) and RLE (Reducing Local Estimated utilities), that significantly prune the search space and the candidates by decreasing the overestimated utilities.

4.5 High utility itemsets mining from incremental databases

Traditional HUIM algorithms are designed to handle static databases. However, in real applications, transactions are gradually inserted, deleted, or modified in the database. When new transactions occur, new items may emerge and old items may become irrelevant. The conventional HUIM approaches run in batch mode only: when applied to an updated database, they must be executed from scratch, which ignores the previous results and is time-consuming. In the recent past, several efficient IHUPM (Incremental High Utility Itemsets Mining) algorithms have been proposed to handle the inserted transactions in the updated database.

Gan et al. [27] provide a comprehensive survey of incremental HUIM. The authors have comprehensively reviewed ten incremental HUIM algorithms which mainly fall into the following three groups: (1) Apriori-based; (2) tree-based; (3) utility-list-based. For details, the readers may refer to the survey paper on incremental HUIM by [27].

Now, we briefly discuss several incremental HUIM algorithms that are not covered by [27]. Lee et al. [125] proposed the PIHUP (Pre-large Incremental High Utility Patterns) algorithm, which is based on the pre-large concept, to effectively discover the HUIs in incremental databases. It uses the PIHUP\(_L\)-tree (PIHUP Lexicographic tree) structure to find the patterns quickly. The proposed algorithm needs only one scan to process the dynamic data. It effectively reduces redundant operations and memory space compared with the PRE-HUI algorithm [126]. However, the generated candidate patterns need to maintain the anti-monotone property [1].

Dam et al. [127] proposed the single-phase IncCHUI (Incremental Closed High Utility Itemset miner), which mines closed HUIs from incremental databases using an incremental utility-list structure. This structure stores the information of all single items, both in the original database and in the added transactions. The proposed algorithm constructs the lists of single items by scanning the original database or the updated part only once, and it uses a CHT (Closed Hash Table) to store the discovered closed HUIs. The proposed algorithm is highly scalable with respect to the size of the input databases compared with the benchmark algorithms. Moreover, it is efficient in memory usage and execution time.
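The incremental construction idea can be sketched as follows: when transactions are appended, only the new part of the database is scanned, and each single item's utility-list is extended in place. This is our simplification of the incremental utility-list notion; the entry layout and names are assumptions, not IncCHUI's exact structure.

```python
def extend_ulists(ulists, new_trans, profit, order, start_tid):
    """Append (tid, iutil, rutil) entries for the newly added transactions
    only; entries built from the original database are left untouched."""
    rank = {it: i for i, it in enumerate(order)}
    for off, trans in enumerate(new_trans):
        tid = start_tid + off
        items = sorted(trans, key=rank.get)
        utils = [profit[it] * trans[it] for it in items]
        for i, it in enumerate(items):
            ulists.setdefault(it, []).append((tid, utils[i], sum(utils[i + 1:])))
    return ulists

profit = {'a': 5, 'b': 2}
ulists = {'a': [(0, 10, 2)], 'b': [(0, 2, 0)]}   # built from the original DB
extend_ulists(ulists, [{'a': 1, 'b': 3}], profit, ['a', 'b'], start_tid=1)
print(ulists['a'])  # [(0, 10, 2), (1, 5, 6)]
```

Because old entries are reused, the cost of an update depends only on the size of the inserted part, not on the whole database.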

Nguyen et al. [128] proposed MEFIM (Modified EFficient high utility Itemset Mining), a version of the EFIM algorithm [51] extended to handle dynamic databases. However, MEFIM scans the database repeatedly to compute the utility, local utility, and sub-tree utility of each item. The authors therefore also proposed an optimized version, iMEFIM, which uses a novel P-set structure to reduce the number of transaction scans and improve the mining process; iMEFIM scans 26 percent fewer transactions than MEFIM. Both algorithms perform well compared to the benchmark algorithms on dynamic databases in terms of memory usage and run-time, although iMEFIM consumes more memory because of the P-set structure.

Liu et al. [129] proposed the Id2HUP+ (Incremental Direct Discovery of High Utility Patterns) algorithm, which adopts the one-phase paradigm of d2HUP [4, 16], improving relevance-based pruning and upper-bound-based pruning and introducing a quick merge of identical transactions. It introduces the niCAUL (newly improved Chain of Accurate Utility-Lists) structure to quickly update dynamic databases, together with two pruning strategies for incremental mining: absence-based pruning and legacy-based pruning. The algorithm is up to one to three orders of magnitude more efficient than the benchmark algorithms.

Yun et al. [130] proposed the IIHUM (Indexed-list based Incremental High Utility Pattern Mining) algorithm to discover HUIs from incremental databases. An IIU-List (Incremental Indexed Utility-List) structure is proposed in list form to discover the HUIs without any candidate generation. Furthermore, restructuring and pruning techniques are suggested to efficiently process the incremental data. The algorithm mines HUIs from incremental databases more efficiently than the benchmark algorithms.

4.6 High utility itemsets mining from other databases

Traditional HUIM methods deal with positive utility only; however, real-world applications involve negative utility as well. For example, an outlet may sell items at a loss to promote certain items or to cover the cost of other items in the store. Traditional HUIM approaches do not address this issue. In the past decade, several approaches have been proposed to mine HUIs with negative utility. Singh et al. [30] provide a comprehensive survey of HUIM with negative utility, covering twelve papers that mainly fall into three groups: (1) level-wise; (2) tree-based; and (3) utility-list-based. For details, readers may refer to [30]. Now, we briefly discuss several HUIM algorithms for negative utility that are not covered by [30]. Singh et al. [131] proposed the EHNL (Efficient High utility itemsets mining with Negative utility and Length constraints) algorithm, the first work to mine HUIs from databases with negative utility values under length constraints. It introduces a minimum length constraint to remove excessively small itemsets and a maximum length constraint to restrict overly long ones. EHNL utilizes database projection and transaction merging techniques to reduce the cost of database scans. Moreover, it utilizes a sub-tree-based pruning strategy, based on EHIN [132], to reduce the search space and accelerate the mining process. The algorithm mines HUIs efficiently with low memory consumption on real databases.

In conventional HUIM methods, the utility of an itemset grows with its length, which makes it difficult to judge whether an itemset is actually better than its subsets. To resolve this issue, HAUIM (High Average-Utility Itemsets Mining) algorithms [133] have been proposed that mine average-utility HUIs from databases. The average utility of an itemset is the sum of the utilities of its items over the transactions in which it appears, divided by the number of items in the itemset. However, early HAUIM algorithms spend a large amount of time and generate excessive candidates. To address these limitations, more advanced HAUIM algorithms [134, 135] have been proposed in recent years that perform well in terms of the number of join operations, execution time, memory consumption, and scalability. However, they work only on static databases; incremental HAUIM algorithms [136, 137] address this issue by efficiently mining average-utility HUIs on dynamic databases. Singh et al. [25] provide a comprehensive survey of HAUIM algorithms covering recent advancements and research opportunities; readers may refer to [25] for further details.
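The average-utility measure can be sketched concretely. In the common formulation, the utility of an itemset in a transaction is the sum of quantity times unit profit over its items, and the average utility divides the total over supporting transactions by the itemset's length. The toy profit table and database below are hypothetical:

```python
# Sketch of the average-utility measure on hypothetical toy data.
# utility(X, T) = sum of quantity * profit for items of X in T;
# au(X) = (sum of utility(X, T) over transactions containing X) / |X|.
profit = {"a": 2, "b": 3, "c": 1}   # external utility (unit profit) per item
db = [                               # each transaction: {item: quantity}
    {"a": 1, "b": 2},
    {"a": 3, "c": 4},
    {"b": 1, "c": 2},
]

def utility(itemset, txn):
    return sum(txn[i] * profit[i] for i in itemset)

def average_utility(itemset):
    total = sum(utility(itemset, t) for t in db
                if all(i in t for i in itemset))
    return total / len(itemset)

# {a, b} occurs only in the first transaction: (1*2 + 2*3) / 2 = 4.0
assert average_utility({"a", "b"}) == 4.0
```

Dividing by the itemset length is what keeps long itemsets from dominating merely because they contain many items.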

Traditional HUIM algorithms spend considerable computation time searching for itemsets in large databases. This issue can be mitigated by evolutionary computation. Two genetic algorithms, HUPE\(_{umu}\)-GRAM (High Utility Pattern Extracting using Genetic algorithm with Ranked Mutation using Minimum Utility threshold) [138] and HUPE\(_{wumu}\)-GRAM (High Utility Pattern Extracting using Genetic algorithm without Ranked Mutation using Minimum Utility threshold) [138], were proposed to find HUIs with and without a minimum utility threshold, respectively. However, both algorithms struggle to obtain promising HUIs because the computational cost grows rapidly when the number of distinct items in the database is very large. Another evolutionary algorithm based on particle swarm optimization, Binary Particle Swarm Optimization (BPSO) [139], was proposed to obtain optimized solutions from the large search space, but it requires a large number of computations to achieve high accuracy. Another way to reduce the computational cost is to apply fuzzy theory to transaction databases based on crisp sets. Recently, Kumar et al. [140] provided a comprehensive survey of soft-computing-based HUIM covering evolutionary computation and fuzzy-based approaches; readers may refer to [140] for recent advancements and research opportunities in this area.
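The common thread of these evolutionary approaches is to encode a candidate itemset as a bit vector over the distinct items and use its utility as the fitness function. The following is an illustrative sketch of that encoding with a simple mutation-and-selection loop on toy data; it is not the actual HUPE-GRAM or BPSO procedure.

```python
# Illustrative sketch of evolutionary HUIM encoding (toy data; not the
# actual HUPE-GRAM or BPSO algorithms). Individuals are bit vectors over
# the distinct items; fitness is the utility of the selected itemset.
import random

items = ["a", "b", "c", "d"]
profit = {"a": 2, "b": 3, "c": 1, "d": 4}
db = [{"a": 1, "b": 2}, {"a": 3, "c": 4}, {"b": 1, "d": 2}]

def fitness(bits):
    itemset = [i for i, bit in zip(items, bits) if bit]
    if not itemset:
        return 0
    # total utility over transactions containing the whole itemset
    return sum(sum(t[i] * profit[i] for i in itemset)
               for t in db if all(i in t for i in itemset))

def mutate(bits, rate=0.25):
    # flip each bit with probability `rate`
    return [b ^ (random.random() < rate) for b in bits]

random.seed(0)
population = [[random.randint(0, 1) for _ in items] for _ in range(8)]
for _ in range(20):   # simple elitist evolution loop
    population += [mutate(ind) for ind in population]
    population = sorted(population, key=fitness, reverse=True)[:8]

best = population[0]
print([i for i, b in zip(items, best) if b], fitness(best))
```

As the paragraph above notes, with many distinct items the bit-vector search space grows exponentially, which is exactly why these methods need many evaluations to find promising HUIs.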

Traditional HUIM algorithms use a single minimum utility threshold to discover HUIs. In real-world applications, however, a single threshold is unrealistic, since it is hard to develop efficient business strategies from it. To overcome this problem, many HUIM algorithms with multiple minimum utility thresholds [31, 141] have been proposed that assign a separate minimum utility threshold to each item, identifying more specific and useful HUIs. These approaches generate more benefit than single-threshold HUIM algorithms and can be applied in expert intelligent systems to support more efficient decisions and strategies. However, they consume a large amount of memory.
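A brief sketch of the multiple-threshold idea: each item carries its own minimum utility, and the threshold of an itemset is commonly taken as the minimum over its items' thresholds (the usual MIS-style convention; specific algorithms may differ). The values below are hypothetical:

```python
# Toy sketch of the multiple-minimum-utility convention (illustrative;
# individual algorithms in [31, 141] may define this differently).
mmu = {"a": 50, "b": 30, "c": 80}   # per-item minimum utility thresholds

def itemset_threshold(itemset):
    # common MIS-style rule: an itemset inherits its least strict threshold
    return min(mmu[i] for i in itemset)

def is_hui(itemset, utility_value):
    return utility_value >= itemset_threshold(itemset)

assert itemset_threshold({"a", "c"}) == 50
assert is_hui({"b", "c"}, 35)        # threshold is min(30, 80) = 30
assert not is_hui({"c"}, 60)         # c alone requires 80
```

Taking the minimum keeps rare-but-valuable items discoverable without flooding the result with itemsets of cheap, common items.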

Traditional HUIM methods ignore the timestamps of transactions and do not consider period constraints. Hence, they may discover profitable HUIs that nevertheless seldom occur in the transactions. To address this issue, several algorithms have been proposed that efficiently and effectively mine the complete set of periodic HUIs under a period constraint while pruning a large number of non-periodic items from large databases [142]. These methods may provide significant, reliable, and effective solutions in real-world applications.
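The period constraint can be sketched as follows. In a common formulation (exact definitions vary across algorithms), the periods of an itemset are the gaps between its consecutive occurrence tids, including the gaps to the start and end of the database, and an itemset is periodic when its maximum period does not exceed a user-given bound:

```python
# Sketch of a common period-constraint formulation (illustrative;
# definitions differ across periodic HUIM algorithms).
def max_period(tids, db_size):
    tids = sorted(tids)
    boundaries = [0] + tids + [db_size]
    # gaps between consecutive occurrences, plus the boundary gaps
    return max(b - a for a, b in zip(boundaries, boundaries[1:]))

# Itemset seen in transactions 2, 5, 6 of a 10-transaction database:
# periods are 2, 3, 1, 4, so the maximum period is 4.
assert max_period([2, 5, 6], 10) == 4

maxPer = 3
is_periodic = max_period([2, 5, 6], 10) <= maxPer   # False: too large a gap at the end
```

An itemset failing this check is exactly the "profitable but seldom occurring" case the paragraph above describes.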

HUIM algorithms are vulnerable to privacy issues. To address this, several privacy-preserving HUIM algorithms have been proposed to hide sensitive HUIs [143]. These algorithms not only generate highly profitable HUIs but are also capable of keeping secure or private HUIs hidden. Such methods can be applied to intelligent privacy-preserving approaches in industrial environments.

4.7 Summary

As discussed above, on-shelf HUIM algorithms efficiently and effectively mine high on-shelf utility itemsets. Temporal HUIM algorithms efficiently mine HUIs over a time period from a data stream. Incremental HUIM approaches efficiently discover HUIs in updated databases. HUIM with negative utility discovers HUIs involving items sold at a loss. Average-utility HUIM algorithms normalize utility by itemset length so that itemsets of different lengths can be compared fairly. HUIM algorithms with multiple minimum utility thresholds identify more specific and interesting itemsets, making it easier for businesses to reach efficient decisions. Periodic HUIM mines HUIs under a period constraint. Privacy-preserving HUIM algorithms are capable of hiding sensitive HUIs. In short, these approaches perform well in real-world environments.

5 Databases and open-source resources

5.1 Databases

In this survey, the HUIM algorithms for transactional databases are evaluated on various real-world databases [144,145,146]. A brief description of these publicly available databases follows. They include dense, sparse, and mixed databases with short, moderate, and long transactions. Dense databases have fewer items and longer transactions than sparse databases; they are generated from sources such as the game of chess and species of mushroom, which have very few distinct items and long transactions. Sparse databases are generated by retail giants such as Walmart and Amazon, which sell millions or billions of products while a customer usually purchases only a few of them. The Accident database contains anonymized traffic accident data. The BMS-POS database contains several years' worth of point-of-sale data from a large electronics retailer. The BMSWebview-1 database is taken from [147]; it was used in KDD CUP 2000 and contains click-stream data from e-commerce. The Chess database is prepared from the UCI chess data, and the Connect database from the UCI connect-4 data. The Foodmart database consists of customer transactions from a retail store, acquired and transformed from SQL-Server 2000; it includes 1,112,949 transactions with quantities, 46,086 items, an average transaction length of 7.2, and a utility table. The Mushroom database is prepared from the UCI mushrooms data. OnlineRetail is transformed from the Online Retail database. The Pumsb database consists of census data on population and housing. The Retail database contains customer transactions from an anonymous Belgian retail store. WebDocs is a real-life transactional database available to the data mining community through the FIMI repository; the whole collection contains about 1.7 million documents, mainly written in English, and its size is about 5 GB.
Kosarak is a very large database containing 990,000 sequences of click-stream data from a Hungarian news portal; it was converted into SPMF format from the original data. The Grocery chain store database was obtained from NU-MineBench version 2.0 [146]. The characteristics of all the real-world databases are shown in Table 17.
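Most of these databases are distributed in SPMF's transaction-with-utility format, in which each line lists the items, the transaction utility, and the per-item utilities, separated by colons (e.g. "2 5 6:25:5 10 10"). A minimal parser sketch:

```python
# Minimal parser for SPMF's transaction-with-utility format:
# "items : transaction-utility : per-item utilities",
# e.g. "2 5 6:25:5 10 10".
def parse_spmf_utility_line(line):
    items_part, tu_part, utils_part = line.strip().split(":")
    items = [int(x) for x in items_part.split()]
    utilities = [int(x) for x in utils_part.split()]
    return items, int(tu_part), utilities

items, tu, utils = parse_spmf_utility_line("2 5 6:25:5 10 10")
assert items == [2, 5, 6]
assert tu == 25 and sum(utils) == tu   # transaction utility = sum of item utilities
```

The invariant checked in the last line (the transaction utility equals the sum of the per-item utilities) is a useful sanity check when preparing custom databases in this format.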

Table 17 Characteristics of Real-World Databases [144,145,146]

5.2 SPMF open source framework

SPMF [148] is an open-source data mining framework written in Java and specialized in pattern mining. It offers more than 254 algorithms covering 14 categories, including sequential pattern mining, sequential rule mining, sequence prediction, itemset mining, episode mining, periodic pattern mining, graph pattern mining, high-utility pattern mining, association rule mining, stream pattern mining, clustering, time series mining, classification, and text mining. It is distributed under the GPL v3 license. It offers several resources, such as documentation with examples of how to run each algorithm, a developer's guide, performance comparisons of algorithms, datasets, an active forum, a FAQ, and a mailing list. SPMF provides various kinds of databases for data mining, as well as generators for synthetic transaction databases, sequence databases, and sequence databases with timestamps. It also offers a toolbox for calculating statistics about transaction databases with utility information and about sequence databases; the toolbox can convert a sequence database to a transaction database and vice versa.

6 Research opportunities and future directions

In recent decades, there has been tremendous growth in the field of HUIM. However, many challenges remain for the traditional approaches regarding memory consumption, run-time, scalability, and more. Hence, there is wide scope for improvement. In this section, we highlight some key research opportunities and future research areas in high utility itemsets mining.

6.1 Compact and concise high utility itemsets mining

Several algorithms [20, 35, 84] integrate utility mining with closed itemsets; however, there is still wide scope for integrating HUIM algorithms with other compact representations [36]. This is an interesting research problem that may further reduce the redundant itemsets reported from databases. Moreover, closed HUIM approaches consume large amounts of memory and execution time on dense databases, depending on the minimum utility threshold. Hence, more efficient algorithms could be developed by introducing novel data structures, pruning strategies, and constraints.

6.2 Dynamic complex data

Another promising research opportunity is the extension of HUIM algorithms to dynamic complex data, for example uncertain, dynamic sequential, incremental, and stream data. There is wide scope for improving how efficiently algorithms handle such data when mining HUIs.

6.3 Constraint based high utility itemsets mining

Although there are many efficient HUIM algorithms, users are mainly interested in longer, more actionable candidates; the actionability of HUIM results helps users decide how to increase utility. Several algorithms reduce the total number of generated itemsets by placing constraints on the resulting rules, so users are increasingly interested in constraint-based HUIM. Pei et al. [149] present many constraints for FIM that could be pushed into HUIM as well. Moreover, length-based HUIM plays a significant role in constraint-based mining, as it can remove many small itemsets and generate more interesting and actionable high-utility itemsets. Although a HUIM algorithm with length thresholds was proposed in [69], there are still many research opportunities to design more efficient algorithms that push constraints as deep into the mining process as possible.

6.4 High utility itemsets mining with big data

Big data platforms include grid computing, multi-core computing, MapReduce, Graphics Processing Units (GPUs), the Spark framework, and Apache Hadoop. HUIM approaches could be extended with distributed and parallel algorithms to mine HUIs from very large databases, similar to those proposed in [150], with applications in big data analytics. Previous works [151, 152] explore big data and various operations on it. Multi-threaded approaches in cloud computation, such as the MapReduce framework, may be further investigated to find HUIs in large databases. Furthermore, a distributed version of the concise mining algorithm [84] could be developed to run on cloud computing platforms such as Hadoop or Spark, or on the TensorFlow system, to process very large databases. Hence, there are many research opportunities to address the scalability of HUIM algorithms, from constraint-based to more collaborative and hybrid models.

6.5 Other problems

The latest preprocessing techniques improve the completeness, consistency, and precision of mining algorithms. Previous works [153, 154] combine such techniques with HUIM algorithms; this could be further investigated to develop more efficient algorithms. A few researchers have combined association rule mining with classification models [155]; the integration of classification and utility mining could likewise be investigated further. Sampling-based approximation [156] could be explored to mine HUIs at significantly lower computational cost. There are many possibilities to extend variations of the HUIM problem and reuse them in areas such as high utility association rules [157], incremental HUIs [32, 158], top-k HUIs [94], on-shelf HUIs [116], and periodic HUIs [103]. HUIM approaches may also be investigated for optimization methods involving HUIs in data streams.

7 Conclusions

High utility itemsets mining (HUIM) is a powerful technique for discovering interesting patterns in transactional databases. HUIM algorithms have various applications, including market basket analysis, web click-stream analysis, web mining, cross-marketing, gene regulation, mobile commerce, and e-commerce. In this survey, we highlighted HUIM approaches for transactional databases to help readers select appropriate methods for their applications. This work should help practitioners find suitable algorithms and improve business outcomes: it benefits not only retailers, who can extract useful information about high utility items and gain more profit, but also customers, who can make better choices when selecting items.

In this survey, we provided an up-to-date and comprehensive review of HUIM algorithms for transactional databases, discussing the key concepts, algorithms, and applications of HUIM. The paper provided a taxonomy and presented state-of-the-art HUIM algorithms for transactional databases, categorized as level-wise, tree-based, utility-list-based, projection-based, and miscellaneous, and discussed the pros and cons of each category in depth. We also summarized other existing HUIM algorithms, including those for on-shelf mining, sequential databases, uncertain databases, temporal databases, incremental databases, negative utility values, average utility, soft computing, multiple minimum utility thresholds, data streams, periodic mining, and privacy-preserving utility mining. We also presented 16 real-world databases utilized by various HUIM approaches. Finally, the paper identified several key research opportunities and future directions for HUIM algorithms.