1 Introduction

Association mining [1,2,3] of a transaction database is performed to determine association rules between a set of itemsets, for example, a set of events containing {diaper, beer} and {milk, bread}, which are often analyzed in market basket analysis. Several useful and interesting phenomena can be explored based on the association rules. By discovering the connection between items/objects, we can create a suitable market plan, form a suitable marketing strategy [4,5,6], and effectively apply the data to all types of analysis. Different data-mining methods are used depending on the requirements of various applications. The examples of such methods include frequent pattern mining (FPM) [1, 3, 7], utility-driven pattern mining [8,9,10,11], sequential pattern mining [12,13,14], and rare pattern mining (RPM) [15,16,17,18]. FPM is commonly adopted to extract association rules from a transaction database. In general, association rules are categorized as “frequent” or “rare” according to the specified minimum support threshold (minSup). “Frequent” refers to common or anticipated phenomena, while “rare” represents infrequent or previously unknown phenomena; thus, varied information can be extracted from the database.

In real life, it is possible to buy multiple copies of the same item in a transaction database, and mining association rules from such a quantitative database is an important task. The fuzzy set theory, which was first proposed by Zadeh in 1965 [19], is more suitable for dealing with quantitative values and expressing appropriate language values because it can help people better understand knowledge. Each element in the fuzzy set can be assigned a membership degree to indicate the degree to which it belongs, such as {A.low}, {B.mid}. Chan and Au [20] first proposed an Apriori-like [1], namely F-APACS, to discover fuzzy association rules. Kuok et al. [21] proposed a method for processing quantitative data. Hong et al. [22] proposed an FDTA algorithm to process a quantitative database. In contrast to these Apriori-like algorithms, several fuzzy pattern mining algorithms based on tree structures were proposed. Lin et al. [23] proposed a fuzzy frequent pattern tree that can effectively discover fuzzy frequent itemsets (FFIs). Lin et al. [24] proposed a compressed FFP (CFFP)-tree algorithm. Although this CFFP-tree-based algorithm can reduce the number of tree nodes by using an additional array on each node, it is requires computational expenses to save the array. In addition, if the transactions are large, the spatial complexity of each node is high. Therefore, Lin et al. [25] proposed the UBFFP-tree algorithm to solve the problem of CFFP-tree overhead, which uses the same global sort strategy as in CFFP-tree to construct trees. Each term in the transaction is obfuscated by retaining only the language terms with the largest cardinality later in the process. Subsequently, Lin et al. [26] proposed the FFI-Miner algorithm to discover a complete set of FFIs without generating candidates. By adopting the fuzzy-list structure, the necessary information is preserved in the mining process. They also proposed an effective pruning strategy that can reduce the search space and further accelerate the mining process. Some effective algorithms for mining FFIs for association rules are still being studied.

Fuzzy-based data mining is simple and similar to human reasoning. According to the algorithms mentioned above, a fuzzy theory has been extensively developed [19] for application to FPM. However, the rare pattern mining (RPM) remains to be explored further. FPM, which is relatively mature, can find often-appearing or expected phenomena. In contrast, RPM can usually discover unknown or unexpected phenomena. Based on the existing algorithms, it is known that RPM can discover phenomena that are important in real life. For example, we assume that A and B represent two symptoms, and C represents a disease. In general, symptom A may lead to disease C. After rare pattern mining, we may find that symptom B also leads to disease C, which will be of great help to the medical industry. The performance of students with poor scores can also be determined, and their learning conditions can be adjusted accordingly. Apriori-like algorithms were initially used to determine frequent or rare association rules. If the minimum support (minSup) threshold is set extremely small, explosive growth occurs. However, if it is set extremely large, some useful association rules may be ignored. Based on the above problems, we attempt to use two thresholds, the minimum rare support (minRSup) and the minimum frequent support (minFSup), to achieve a better effect. As mentioned before, a quantitative transaction database can be processed with linguistic meaning using fuzzy theory. In the FFI-Miner algorithm [26], which aims to mine FFIs, while fuzzy rare but quite interesting itemsets are ignored.

To the best of our knowledge, RPM based on fuzzy theory has not yet been studied. To this end, a novel algorithm named FRI-Miner for mining fuzzy rare itemsets (FRIs) from a quantitative transaction database is proposed in this paper. The major contributions of this study are as follows.

  1. 1.

    This study is the first to formulate the problem of fuzzy RPM with linguistic meaning. The proposed fuzzy-list-based FRI-Miner algorithm addresses this problem successfully. As a fuzzy-theoretic data-driven model, it is explainable and similar to human reasoning that is more useful for decision making.

  2. 2.

    Fuzzy rare itemsets in FRI-Miner are categorized into three types: (1) containing only fuzzy rare items; (2) containing any combination of fuzzy rare items and fuzzy frequent items; and (3) containing only fuzzy frequent items. The first type is easy to understand; in the third type, fuzzy frequent itemsets that are themselves frequent may actually be fuzzy rare itemsets.

  3. 3.

    Several pruning strategies that utilize the properties of fuzziness, rare pattern, and fuzzy support are designed to successfully reduce the search space of FRI-Miner.

  4. 4.

    Experiments on several benchmark databases are conducted to show that the proposed FRI-Miner algorithm has better effectiveness and mining efficiency compared to those of existing methods.

The remainder of this paper is organized as follows: In Section 2, we review previous related work in the field of FPM and RPM. In Section 3, we present the definition and basic concepts of rare itemsets and fuzzy theory and then formulate the problem of FRI mining. The details of the proposed FRI-Miner algorithm are presented in Section 4. In addition, we use an example to illustrate the details of FRI-Miner for ease of understanding. The experimental evaluation is provided in detail in Section 5. Finally, the conclusions and future work are presented in Section 6.

2 Literature review

In this section, we briefly discuss fuzzy pattern mining and RPM. Then, we highlight the importance of fuzzy RPM and further discuss several applications.

2.1 Fuzzy pattern mining

According to the existing FPM algorithms, the data types can be roughly categorized as Boolean, ordered, or quantitative [2, 3, 7]. Traditional FPM aims at processing data that do not contain quantitative values, although quantitative data are common in real life. To solve the processing problem of a quantitative database, the quantitative value of a project can be converted into a language term with a fuzzy degree by using fuzzy set theory [19]. Meaningful fuzzy association rules can be discovered in real life. The addition of fuzzy concepts fuzzifies the quantitative values. This is also convenient to better understand the true meaning of the data and make the data simpler and easier to understand. For example, if the database does not contain the quantity value corresponding to each item, then the frequency of the item is only related to the Boolean value. However, for databases containing quantitative values, the frequency of occurrence of each item is determined by the quantitative values of the items.

Frequent itemset mining based on fuzzy theory was further elaborated by the Apriori algorithm [1]. Fuzzy association rules are induced by the hierarchical intelligent mining of the FFI. First, according to the predefined membership function, the quantization value of the item is converted into a language term, which not only reflects the occurrence frequency of the item in the database, but also reflects the support of the itemset. Therefore, several algorithms for mining fuzzy itemsets that meet the minimum support threshold through deterministic factors have been designed. The F-APACS algorithm for mining fuzzy association rules was first proposed by Chan and Au [20]. Kuok et al. [21] proposed an algorithm by processing quantitative attributes to discover fuzzy association rules, and Hong et al. [22, 27] proposed the FDTA algorithm to process quantitative databases using fuzzy set theory. Subsequently, a fuzzy frequent pattern tree [28] was proposed. Several tree-based algorithms for mining FFIs have been proposed [23, 25], which are based on the FP-growth method [29] for mining frequent itemsets. These algorithms first construct a tree structure of fuzzy frequent items through the reserved frequent itemsets and then further extract more fuzzy frequent patterns from the constructed tree structure. However, the mining process with a pattern-growth mechanism [29] is often complicated and requires the storage of numerous tree nodes with additional information. To summarize, all these fuzzy-based FPM algorithms can only discover expected frequent phenomena.

2.2 Rare pattern mining

Based on the Apriori algorithm [1], many frequency-based data mining algorithms have been extended, but the mining results mostly correspond to common or expected phenomena. To address this, another data mining framework, rare pattern mining (RPM) [16, 30,31,32], has been introduced, and many algorithms for RPM have been proposed in recent years. In contrast to frequent and common patterns, discovering rare patterns may be more useful in some cases (e.g., itemsets and association rules), which are important for real-life applications.

Most of static rare itemset mining algorithms can be roughly divided into support threshold, no support threshold, and constraints [16]. Because we use the support threshold in this study, we provide a brief overview of RPM with the support threshold. Because the support threshold for rare itemsets is lower than that for frequent itemsets, the generation of rare itemsets is better realized by setting lower or distinct support thresholds. To address the “rare item problem” in FPM, in 1999, Liu et al. [33] proposed the MSApriori algorithm that adopted multiple minimum supports (MMS) to successfully discover rare itemsets. In contrast to this Apriori-like method, several set-enumeration tree based methods were proposed for RPM, such as the frequency-driven FP-ME algorithm [30] and the utility-driven HUIM-MMU [34] and HIMU [35] algorithms. The Apriori-reverse algorithm proposed by Koh and Rountree [36] aims to discover rare rules with respect to completely dispersed itemsets, which only contain items below the maximum support threshold. Later, a rare itemset mining algorithm (ARIMA) was proposed by Szathmary et al. [17]. Troiano et al. [37] introduced the rarity algorithm, which is a top-down rare itemset mining algorithm. Up to now, a number of RPM algorithms have been extensively proposed, such as CFP-growth++ [15]. Among them, several studies are designed to deal with dynamic data streams [38,39,40]. There are numerous candidates for most algorithms based on Apriori mechanism. The RP-tree-based algorithm [18] discovers rare patterns that meet the conditions between minRSup and minFSup.

2.3 Applications of fuzzy rare pattern mining

For some real applications, unusual rare patterns are more important and useful than frequent ones. In some application domains, RPM is more suitable for intelligent systems. The result of network intrusion can be obtained by detecting whether the network is abnormal. In medicine, sudden changes are diagnosed by finding data that are different from those corresponding to normal health. In an insurance company, by finding rare people who need high-risk claims, making reasonable marketing strategies, and so on. In recent years, the field of RPM [16] has been further developed. The emergence of the RPM has also made a significant contribution to the research community. However, these previous RPM algorithms rarely involve fuzzy theory. In contrast, many FPM algorithms based on fuzzy theory have been widely used in the mining of quantitative data, while RPM dealing with quantitative data is extremely rare. To the best of our knowledge, RPM dealing with Boolean or quantitative data based on fuzzy theory has not yet been studied. To this end, we propose an effective algorithm that uses fuzzy theory to discover the interesting rare patterns from a quantitative transaction database.

3 Preliminaries and problem formulation

In this section, we introduce some basic concepts, principles of fuzzy-driven pattern mining and RPM. Some definitions in previous research are adopted here to present common concepts clearly. Further details regarding the background of the fuzzy FPM can be found in Ref. [23, 26].

3.1 Preliminaries

I = {i1, i2, \(\dots \), im} represents a finite set I composed of m different items. D = {t1, t2, \(\dots \), tn} represents a transactional database D (1 ≤ qn) composed of n different items, in which each transaction tq is a subset of I. Each transaction contains a unique identifier tid, and each item lists the number of quantities as viq. If an itemset consists of k different items, we call it k-itemset, and if each item in the k-itemsets is included in the transaction tq, we call the itemset a subset of the transaction tq. The minimum support is defined as minRSup, and the maximum support is defined as minFSup. The specified member function is set to μ, which can be adjusted according to the user’s needs. In this study, all items in our database are represented using a single fuzzy membership function, as shown in Fig. 1. We set three language terms in the adopted membership functions: low, middle, and high. The quantitative value of each item is obscured by the same member function as the corresponding language term.

Fig. 1
figure 1

The used linear membership functions of linguistic 3-terms

In this paper, we present a running example using a quantitative transactional database, as shown in Table 1. There are six items, (A), (B), (C), (D), (E), (F), and eight transactions {t1, t2, \(\dots \), t8}. We set the minimum support threshold to minRSup (= 25%) and the maximum support threshold to minFSup (= 50%). The idea is to filter out items that seem insensible and similar to noise; thus, this can help to reduce unnecessary search space.

Definition 1

(The attributes of a quantitative database) They are represented by the language variable Ri, and the fuzzy language terms are represented by the natural language as (Ri1, Ri2, \(\dots \), Ril). Ri can be defined by the membership function μ.

Table 1 A quantitative database

For example, in this study, the membership function μ applied in the running example is shown in Fig. 1. There are five items in Table 1: (A), (B), (C), (D), (E), and (F). The three language terms are expressed as Low(L), Middle(M), and High(H). In transaction t1, item B is denoted by (B.L), (B.M), and (B.H), as shown in Fig. 1. The other items in these transactions are calculated similarly to those of item B.

Definition 2

(The quantitative value of an item) In the quantitative database, the value of an item is expressed as viq, which represents the number of this item i in transaction tq.

For example, in transaction t2, the quantitative values of the items as (B) and (D) are vB2 (= 8) and vD2 (= 3), respectively. Here, we can clearly see what it means.

Definition 3

(Fuzzy set) In the fuzzy stage, a fuzzy set refers to the set of fuzzy language terms with a membership degree (fuzzy value) converted from the quantitative value viq of the item i in the transaction database tq by the membership function μ. Its specific expression is as follows:

$$ f_{iq} = \mu_{i}(v_{iq}) = \left( \frac{fv_{iq1}}{R_{i1}} + \frac{fv_{iq2}}{R_{i2}} + {\dots} + \frac{fv_{iqh}}{R_{il}}\right), $$
(1)

where l refers to the number of fuzzy language terms in which the membership function μ converts to I, Ril represents the l-th fuzzy language term of item i, and fviql represents the membership of item i in the quantitative value viq (fuzzy value) of the l-th fuzzy language term Ril’s, where \( fv_{iql}\subseteq [0, 1]\).

For example, in transaction t2, we convert the quantitative value of the items in Table 1 to the membership degree using the membership function. Here, we provide a detailed description of the quantitative value of (B) in Fig. 1 as \(\left (\frac {0.6}{B.L} + \frac {0.4}{B.M}\right )\). The transformation of all other transactions is similar to the transformation of item (B) in transaction t2. The specific results are listed in Table 2.

Table 2 Transformed results from Table 1

Definition 4

(Membership degree) The support after the membership function is converted to membership degree and expressed as sup(Ril). The sum of the scalar cardinality of the fuzzy value of Ril is expressed as:

$$ sup(R_{il})=\sum\limits_{R_{il}\subseteq t_{q}\wedge t_{q}\in Q^{\prime}}fv_, $$
(2)

where database \( Q^{\prime } \) is the database transformed by the membership function μ, which is the same as the original database D.

For example, the support of the fuzzy terms (B.M) appears in transactions t1, t2, t3, t4, t5, t6, t7, and t8, as shown in Table 2. Thus, its support in the running database can be calculated as sup(B.M) = 0.8 + 0.6 + 0.6 + 0.8 + 0.8 + 0.8 + 0.6 + 0.8 = 5.8.

Definition 5

(The minimum fuzzy value of an k-itemset) An itemset X is composed of k-items (k≥ 1), and its support degree is expressed as sup(X), which represents the sum of the minimum fuzzy values of k-items contained in X. The definition is as follows:

$$ \begin{array}{@{}rcl@{}} sup(X) & =& \{X\in R_{il}| \sum\limits_{X\subseteq t_{q}\wedge t_{q}\in Q^{\prime}}min(fv_{i_{j}ql}, fv_{i_{m}ql}),\\&& i_{j}, i_{m}\in X, i_{j}\notin i_{m}\}, \end{array} $$
(3)

where the items in X do not intersect each other.

For example, the fuzzy 2-itemset (A.L,D.H) appears in transactions t1, t3, t6, and t7, as shown in Table 2. Thus, the support of (A.L,D.H) can be calculated as sup(A.L, D.H) = {min(0.6, 0.8) + min(0.6, 0.6) + min(0.8, 0.2) + min(0.8, 0.6)} = {0.6 + 0.6 + 0.2 + 0.6} = 2.0. Note that here the fuzzy-based support is different from the support concept of FPM, and the later equals to the number of transactions contains X.

In a transaction database, the mining of some common itemsets aims at discovering frequent itemsets. In previous methods, a minimum frequent support minFSup is generally set, and itemsets that meet the minFSup specified by the user are preserved. Thus, more frequent itemsets that meet these criteria are discovered. For RPM, we set the minimum rare support minRSup. RPM discovers itemsets that meet the criteria of minRSup. However, the support of these rare itemsets cannot be greater than that of minFSup. For example, we assume that sup(A) = 1.0, and sup(B) = 2.5. If we set the minRSup to 2.0, and the minFSup to 3.0, then, we find that item A does not meet the minRSup. The support of item B is greater than that of minRSup and less than minFSup. Thus, item B is a rare pattern that we intend to discover.

Definition 6

(Fuzzy rare itemset) Considering quantitative data, the user specifies the specific minimum rare support threshold and minimum frequent support threshold, which are minRSup and minFSup, respectively. Only the itemsets that satisfy two conditions can be considered as fuzzy rare itemsets (FRIs), as follows:

$$ \begin{array}{@{}rcl@{}} FRIs &\leftarrow& \{X|\ minRSup \times |D|\ \leq sup(X)\\& \leq& minFSup \times |D|\}. \end{array} $$
(4)

We take an example of 1-itemset. Through the calculation of the fuzzy values, we find that sup(A.L) = 2.8, sup(B.M) = 5.8, sup(C.L) = 2.4, and sup(D.H) = 4.0. According to the definition of FRIs, we find that (A.L) and (C.L) are rare itemsets, whereas (B.M) and (D.H) are frequent itemsets.

3.2 Problem statement

To date, many algorithms have been developed for mining frequent patterns in quantitative databases. Corresponding with the emergence of quantitative values, fuzzy theory has also merged. Therefore, a mining algorithm for fuzzy frequent patterns is proposed accordingly. However, there are many interesting but rare patterns that are ignored, mostly because of setting a low support.

In this study, the problem of fuzzy rare pattern mining (fuzzy RPM for short) is formulated as follows. Given a quantitative database, the specific minimum support threshold and maximum support threshold, which are minRSup and minFSup, respectively. The goal of fuzzy RPM is to discover the complete set of FRIs that satisfies two conditions.

4 Proposed fuzzy mining algorithm: FRI-miner

For the mining of quantitative databases, previous studies have shown that fuzzy theory can play an important role. Through many previous statements, we find that it has been relatively mature in the mining of frequent itemsets, while RPM based on fuzzy theory has not yet been studied. In this study, we apply fuzzy theory to process the quantitative database, obtain the candidates, and finally discover FRIs by using the fuzzy-list structure [26]. In summary, the specific steps of the FRI-Miner algorithm include: 1) fuzzification phase, 2) construction of fuzzy-list, and 3) recursively mining FRIs, which are presented as follows.

4.1 Fuzzification phase

FRI-Miner first fuzzes the quantitative database and converts viq in quantitative data into the membership degree of the corresponding fuzzy terms through a membership function. Next, it forms a new fuzzy database by using the maximum scalar cardinality and support-ascending order.

Definition 7

(Maximum scalar cardinality) For an item i, it uses the corresponding fuzzy term Ril to denote its corresponding quantitative value viq. In the expressed fuzzy terms, we find the fuzzy terms that should be reserved according to the maximum scalar cardinality. This can represent the corresponding language variables as item i.

For example, in Table 2, the transformed fuzzy terms with their summed fuzzy values of the linguistic variable (A) are (A.L: 2.8, A.M: 1.2, A.H: 0). Thus, the fuzzy term (A.L) is used to represent the linguistic variable of (A), which can be used for the later mining process of FRIs.

Based on the reserved Ril of the maximum scalar cardinality, the fuzzy quantitative database was modified to form a new fuzzy-based database. The revised database from Table 2 is presented in Table 3. Fuzzy items reserved by each transaction tq are then sorted into ordered fuzzy items according to the support-ascending order. Note that the items retained here meet the minimum level of support. In addition to rare items, these items also have frequent items. The cross combination between them produces a new itemset that meets the two conditions.

Table 3 A revised database

Definition 8

(The support-ascending order) In the fuzzy transaction database, fuzzy items that satisfy the fuzzy value condition are retained according to the maximum scalar cardinality in the transaction tq. Based on the fuzzy values of the reserved fuzzy items, they are sorted according to the support-ascending order to prepare for the subsequent computation.

For example, we can obtain the result through the transformed fuzzy database, based on the maximum scalar cardinality and the support-ascending order.

4.2 Fuzzy-list construction phase

In FRI-Miner, first, viq in a quantitative database is converted into a membership degree corresponding to the corresponding fuzzy terms through the membership function. Then, we form a novel fuzzy database and then construct the corresponding fuzzy-list structure [26]. We keep the fuzzy terms in L1, which uses three fields that make up the fuzzy-list: the transaction identifier (tid), the internal fuzzy value (if), and the remaining fuzzy value (rf). We briefly introduce these details.

Definition 9

(Transaction identifier) It represents a fuzzy term Ril in transaction tq, which is a subset of the corresponding transaction in tq is denoted as \( R_{il}\subseteq t_{q} \). Here, we use the corresponding (tid) to represent the presence in that transaction.

According to the initial construction of the fuzzy-list, as shown earlier in L1, the support ascending order is used to obtain (C.L < A.L < D.H < B.M) in Fig. 2. For example, in Table 3 and Fig. 2, the fuzzy item (B.M) exists in t1, t2, t3, t4, t5, t6, t7, and t8.

Fig. 2
figure 2

The constructed fuzzy-list structures

Definition 10

(Internal fuzzy value) The fuzzy value (if) of the fuzzy term Ril in the transaction tq is denoted as if(Ril,tq).

For example, in Table 3, the internal fuzzy values of (B.M) in t1 and t2 are if(B.M,t1) = 0.8, and if(B.M,t2) = 0.6, respectively.

Definition 11

(The resting fuzzy value [26]) In the fuzzy-list, the resting fuzzy value of Ril is expressed as rf(Ril,tq). It represents the maximum fuzzy value obtained by performing a union operation. In other words, the upper bound value of all fuzzy terms Ril in tq, as shown in Fig. 2. It is defined as:

$$ rf(R_{il}, t_{q}) = max\{if(z, t_{q}) |z\in (t_{q}/R_{il})\}. $$
(5)

According to Table 3, formed by the support-ascending order, the rf (C.L, t4) is calculated as max{0.8,1.0} = 1.0, and rf (C.L, t5) is calculated as max{0.8} = 0.8, the rf (C.L, t6) is counted as max{0.8,0.2,0.8} = 0.8, and the rf (C.L, t8) is max{0.8,0.8} = 0.8. Details of the fuzzy-list structure can be referred to Ref. [26].

In Fig. 2, we can observe that the element (4, 0.4, 1.0) in the constructed fuzzy-list structure of (C.L) indicates (tid,if,rf), 4 represents the transaction t4, 0.4 represents the internal fuzzy value is 0.8, and 1.0 represents the resting fuzzy value after (C.L) is 1.0. The other information for the fuzzy item is similarly represented.

The fuzzy-list of the 1-itemset performs the intersection operation to form a new fuzzy-list structure of k-itemsets (k ≥ 2). In the recombination process, the ones with the same tid are combined. Except for the item that needs to be calculated, the internal fuzzy value of the transaction corresponding to all eligible fuzzy k-items (k ≥ 2) takes the minimum value of the merged item as the remaining fuzzy value. In Fig. 3, the combined results of fuzzy 2-itemsets are shown as (C.L,A.L), (C.L,D.H), (C.L,B.M), and so on.

Fig. 3
figure 3

The fuzzy-lists of fuzzy 2-itemsets

According to the similar fuzzy-list in Fig. 2, we obtain the corresponding values of the three columns, and the corresponding support can be obtained according to the values of the second and third columns. They are defined as follows:

Definition 12

In the fuzzy-list, we can calculate the sum of the inner fuzzy values of an itemset Ril in a quantitative database, and it is denoted as SUM(Ril.if) [26] and is defined as follows:

$$ SUM(R_{il}.if) = \sum\limits_{R_{il} \subseteq t_{q} \wedge t_{q}\in Q^{\prime}} if(R_{il}, t_{q}). $$
(6)

For example, in Fig. 2, the sum of the internal fuzzy values of (C.L) in D is calculated as (0.4 + 0.6 + 0.6 + 0.8) = 2.4.

Definition 13

In the fuzzy-list, we calculate the sum of the resting fuzzy values of Ril in the quantitative database D is defined as SUM(Ril.rf) [26]. Here, the rf value is calculated from the third column as follows:

$$ SUM(R_{il}.rf) = \sum\limits_{R_{il} \subseteq t_{q} \wedge t_{q}\in Q^{\prime}}rf(R_{il}, t_{q}). $$
(7)

For example, in Fig. 2, the sum of the resting fuzzy values for (C.L) in D is calculated as (1.0 + 0.8 + 0.8 + 0.8) = 3.4.

The fuzzy value in the fuzzy-list was used to obtain the total support. Here, we use the support-ascending order to construct a set of fuzzy-lists. According to the measurement of the support of the corresponding item, we obtain the rare itemsets between the support tends to the minimum and the maximum specified support threshold, and the fuzzy itemsets with the support above the maximum support. We also obtain a set of items that we can intersect further in Fig. 3. Then, it sorts the results according to the support-ascending order and adopts the pruning strategies to prune the search space with respect to an enumerated tree [41]. Then, the 1-itemset is gradually expanded to 2-itemsets, 3-itemsets, and other combinations, which are shown in Fig. 4. Finally, a set of rare itemsets that satisfy support can be obtained.

In Fig. 4, we narrow the search space by pruning to remove items below the minimum threshold. The remaining items are combined to form a new k-itemset. The structure of the fuzzy-list of the k-itemset is similar to that of the 1-itemset. The specific description is as follows:

Fig. 4
figure 4

An enumeration tree of the used example

Theorem 1

According to the concept of FRIs, for an itemset X in the fuzzy-list structure, if its SUM(X.if) is no less than the minimum fuzzy rare support and no more than the maximum fuzzy frequent support, it is seen as a FRI. If min(SUM(X.if), SUM(X.rf)) of X is no less than the minimum fuzzy rare support, new itemsets are required to be generated. If the sum of the resting fuzzy values of Ril is no less than the minimum fuzzy rare support (minRSup × \(|Q^{\prime }|\)), extensions of Ril may be a FRI. If the summation of the resting fuzzy values of Ril is smaller than the minimum fuzzy rare support, any extensions of Ril will neither be a FRI nor a FFI. Thus, there is no need to construct a new fuzzy-list structure for its extension.

Proof

For \(\forall t_{q}\supseteq X^{\prime }\), suppose fuzzy term Ril is denoted as X, and X’ is the extension of X (\(X\subset X^{\prime } \subseteq t_{q} \Rightarrow X^{\prime }.tids \subseteq X.tids\)), thus \( (X^{\prime }-X)\) = \((X^{\prime }/X) \) and \((X^{\prime }/X)\) \(\subseteq (t_{q}/X) \). Since \(if(X^{\prime },t_{q})\) = min{if(X,tq), rf(Ril, tq)}, it holds \(if(X^{\prime })\) = SUM(X.rf) [26]. Thus, with the definition of a FRI, both minRSup and minFSup are able to determine the promising candidates for FRIs by quickly pruning the search space. □

For example, we can consider the 3-itemsets (A.L, D.H, B.M), which is an extension of 2-itemsets (A.L, D.H). Because the sum of the resting fuzzy values of (A.L, D.H) is calculated as (0.8 + 0 + 0.8 + 0.4) = 2.0. The extension (A.L, D.H, B.M) in the search space of FRI-Miner with respect to an enumeration tree [41] must be generated. For the 3-itemsets (C.L, A.L, D.H), which is the extension of 2-itemsets (C.L, A.L), the sum of the resting fuzzy values of (C.L, A.L) is calculated as 0.8 < 2.0. Thus, the extension of (C.L, A.L, D.H) is unnecessary because the fuzzy value of the 3-itemsets is lower than the specified minimum fuzzy value. A detailed description of the FRI-Miner algorithm is presented in the following section. The pseudocode for the construction of the fuzzy-list here is similar to that of the frequency-based FFI-Miner algorithm [26].

4.3 Fuzzy rare itemset mining phase

Note that we have learned about both rare and frequent itemsets from the previous sections. In this section, we go into more details about the mining processes of FRIs. In general, there are three types of rare itemsets, as follows:

  1. 1.

    The itemsets with only fuzzy rare items: all items in an itemset are rare, and the itemset may be rare.

  2. 2.

    The itemsets with fuzzy rare and fuzzy frequent items: an itemset that contains rare and frequent items internally may be a rare itemset. For example, we can calculate the sum of the inner fuzzy values of itemset (A.L) is calculated as (= 2.8 > 2.0), and the sum of the inner fuzzy values of the itemset (D.H) is calculated as (= 4.0 ≥ 4.0), but we can calculate the sum of the inner fuzzy values of the itemset (A.L,D.H) is calculated as (= 2.0 ≥ 2.0).

  3. 3.

    The itemsets with only fuzzy frequent items: an itemset in which all items are frequent may be rare. For example, we can calculate the sum of the inner fuzzy values of the itemset (D.H) as 4.0. The sum of the inner fuzzy values of the itemset (B.M) is calculated as (= 5.8 > 4.0), but the value of the itemset (D.H,B.M) is calculated as (4.0 > 3.8 > 2.0).

In the above discussion, we have carried out detailed concepts, data structure, theorem, and strategies of the proposed FRI-Miner algorithm. Algorithm 1 describes the complete details of FRI-Miner for discovering FRIs. Notice that the itemset that satisfies SUM(X.if)minRSup ×|D| will be rare or frequent (lines 2-3), and it is used to generate the new itemsets (extensions of X, lines 11-12). This guarantees the correctness and completeness of the final discovered results of FRI-Miner.

figure e

According to the above pseudo-code, it can be known that the quantitative database as follows shown in Table 1 is converted to a fuzzy database as follows shown in Table 2 at the beginning of the algorithm, and the membership function μ is used to blur the quantitative value at this phase. The fuzzy items in the fuzzy database obtained in the first phase are clearly planned in the fuzzy-list structure, and the fuzzy value that meets the user-specified threshold is selected by the maximum scalar cardinality and support-ascending order strategy. It keeps the rare and frequent items that meet the conditions and outputs rare items. It combines the remaining fuzzy items in a tree structure and deletes the itemsets that do not meet the conditions through the pruning strategy. Finally, it outputs complete FRIs that meet the conditions.

5 Experimental evaluation

In this study, we address the problem of mining FRIs in a quantitative database. To the best of our knowledge, FRI-Miner is the first fuzzy-based algorithm for mining rare itemsets with linguistic meanings. It can discover rare itemsets that satisfy the two conditions of fuzzy values. The RP-Growth algorithm [18], which aims to mine rare itemsets, was selected as the baseline to evaluate the validity and performance of the proposed FRI-Miner algorithm.

A total of six benchmark datasets were used in this experiment: retail, accidents, foodmart, chess, mushroom, and kosarak. Real-life datasets are available in a public repository.Footnote 1 Their characteristics and descriptions can be found in previous studies [7, 42]. All algorithms were implemented using Java language and executed on an Intel Core 6 Duo 3.00 GHz machine with 8 GB of RAM running Windows 10.

We transform the quantitative datasets according to a predefined membership function. For the foodmart dataset, we used a different membership function, which is [100, 600, 1100]. The reason is to ensure that the data in the dataset are evenly distributed in the set membership function, so that the final result obtained is more accurate. The other datasets were distributed between the specified [1, 21, 31]. According to the previous algorithm, to evaluate the performance of the compared algorithms, we analyzed them from three aspects: running time, memory usage, and the number of generated patterns. For example, the changes in runtime can reflect whether the designed data mining algorithm is acceptable within a reasonable execution time. The results obtained with the compared algorithms are shown in Figs. 57, Table 4, and Fig. 8, respectively. The detailed results are as follows.

Fig. 5
figure 5

Runtime vs. minSup/maxSup

Table 4 Number of patterns (candidates and final results) under various parameter settings

5.1 Effectiveness

As mentioned before, FPM and RPM are two different mining tasks. Moreover, the fuzzy-theoretic-based RPM is also different from the traditional RPM. In the mining rare itemsets, two thresholds minRSup and minFSup are set in FRI-Miner. As shown in Table 4 and Fig. 5, the constraints of the two thresholds further enable the FRI-Miner algorithm to run efficiently, shorten the running time, and identify meaningful rare itemsets. First, FRI-Miner uses minRSup to remove itemsets that are lower than minRSup; that is, these itemsets without practical significance. Then, fuzzy-based rare itemsets are discovered from the potential candidates. In the experiments, minRSup and minFSup are expressed as minSup and maxSup, respectively, as shown in Table 4. Note that the number of #FRIs is derived by FRI-Miner, and the number of #RIs is derived by the RP-Growth.

In general, in the experimental process, the running time reflects the execution efficiency of the algorithm. According to the experimental results, we find that the running time of the FRI-Miner algorithm is relatively constantly shortened, and the number of discovered patterns also increases. This RP-tree-based RP-Growth algorithm is also a prominent algorithm in mining rare itemsets, and FRI-Miner is also based on its efficient performance.

It can be observed that the setting of two thresholds also affects the mining time of the itemset. In addition, normal behavior/patterns often appear in the form of infrequent itemsets, but fewer rare behaviors/patterns can be found with FRI-Miner using fuzzy theory. For example, consider the retail dataset, when setting parameters as minSup from 95 to 100, and maxSup from 200 to 500, the number of FRIs is changed from 1,261 to 1,593, while the number of RIs can reach 1,740 to 3,552. It is clear that the two types of patterns, FRIs and RIs, are different, and the fuzzy-based patterns have more useful meaning. Thus, the concept of FRI has excellent fuzzy modeling capabilities and has a linguistic meaning.

5.2 Efficiency w.r.t. runtime

Runtime vs. support

The effects of the minSup and maxSup thresholds were evaluated first. For each dataset, we uniquely specified the corresponding thresholds. When the minimum support threshold we choose continues to increase and the maximum support threshold remains unchanged, we find that the running time of FRI-Miner will continue to decrease. The results of the detailed experiments are shown in Fig. 7a–f. Similarly, when the minimum support threshold is unchanged and the maximum support threshold continuously increases, the running time continuously increases as the mining range increases. These results are shown in Fig. 6a–f. For RP-Growth previously studied, it is a mining rare itemsets algorithm based on the RP-tree structure. This is especially true when minSup is set to be very small, and the runtime increases sharply. For example, in Figs. 6 and 7, the running time of the two compared algorithms always increases with an increase in maxSup, and the running time of the two compared algorithms always increases with a decrease in minSup. This result corresponds to the following experiments.

Fig. 6
figure 6

Runtime vs. maxSup

Fig. 7
figure 7

Runtime vs. minSup

Runtime vs. size/density of dataset

After the runtime analysis on the benchmark datasets, we found that the size of the tested dataset also had a significant impact on the running time of the experiments. For a larger dataset, we need to spend more time on the mining process. In addition, we found that, for a dense dataset, although the dataset is not large, it takes a longer time in the mining process.

Discussion

Based on the results of the datasets in the experimental run, we found that the running time of the FRI-Miner algorithm was longer than that of the RP-Growth algorithm. The reason is that in the process of mining rare itemsets, we need to fuzzify the quantitative values; thus, the running time is relatively longer. Another reason for the longer running time is that FRI-Miner considers that the combination of frequent itemsets may also be rare itemsets when mining rare itemsets. In other words, the branches of frequent itemsets are not removed in the pruning stage, which makes we find more rare itemsets that are ignored. Owing to the existence of pruning strategies, FRI-Miner not only reduces the search space, but also improves the efficiency of the mining task.

5.3 Efficiency w.r.t. pattern

Pattern

In this subsection, we compare the number of patterns discovered by the algorithms. Our intuition suggests that in the FRI-Miner algorithm, the number of patterns mined is greater than that of the RP-Growth algorithm. This has also been confirmed in some datasets in the experiment, such as dataset accidents and BMS, which well reflect this phenomenon. The results of the experiment may be unclear or conflicting. The reason for this experimental phenomenon is that we use quantitative values and thus use fuzzification and pruning strategies to prune many inconsistent itemsets in the mining process. This pruning strategy ensures that the number of itemsets we finally discover is relatively small, but the practicability of the itemsets that we discover is higher. In the experiments, we found that most datasets reflect this phenomenon, including retail, chess, mushroom, and kosarak. The specific experimental results are listed in Table 4.

Pattern vs. support

For each dataset, we choose a threshold that meets their own thresholds; when the minimum support threshold selected continues to increase and the maximum support threshold remains unchanged, we find that the number of itemsets extracted will continue to decrease. Figure 9 shows the experimental results. Similarly, when the minimum threshold remains unchanged and the maximum threshold value continuously increases, we find that the number of itemsets mined continuously increases. The experimental results are presented in Fig. 8. This is especially true when minSup is set to be very small, and the runtime increases sharply. The effects of the minSup and maxSup thresholds were evaluated first. For example, in Figs. 8 and 9, the number of patterns of the two compared algorithms always increases with an increase in maxSup, and the number of patterns of the two compared algorithms always increases with a decrease in minSup.

Fig. 8
figure 8

Pattern vs. maxSup

Fig. 9
figure 9

Pattern vs. minSup

5.4 Efficiency w.r.t. memory

In this subsection, we further analyze the maximum memory occupied during the mining processes of the algorithms. According to the experimental results of six datasets, we found that in sparse datasets (e.g., retail, kosarak), the FRI-Miner algorithm can extract meaningful rare itemsets, that is, ignore most of the itemsets that do not meet the actual situation, but also the memory usage is lower than that of RP growth. However, for relatively dense datasets, a relatively large memory space is required during the mining process. In the dataset BMS, FRI-Miner shows good performance, not only mining more eligible itemsets but also taking up relatively less memory space. The experimental results of these datasets are shown in Fig. 10a–f.

Fig. 10
figure 10

Memory vs. minSup/maxSup

6 Conclusion and future work

In this paper, we propose a novel fuzzy-theoretic-based algorithm for mining FRIs determined by fuzzy set theory and pattern mining. The purpose of this effective mining algorithm is to find meaningful rare itemsets that meet the minimum thresholds. In contrast to existing algorithms, an effective FRI-Miner algorithm is proposed based on a fuzzy list structure. The algorithm requires specifying two minimum support thresholds; it utilizes several pruning strategies to prune unqualified itemsets and discovers the complete set of rare itemsets containing rare and frequent items. The experimental analysis shows that this algorithm performs well and has an improved overall mining quality compared to that of the existing algorithm.

There are still some limitations in our research, such as the membership function and the specified minimum thresholds, which are defined in advance. If the threshold is specified to be extremely small, many candidate itemsets are generated, which occupy a large storage space. If it is specified as extremely high, some meaningful itemsets are ignored, which results in poor mining quality. It is hoped that there will be a more convenient algorithm in the future, which can automatically detect the most suitable threshold value and the setting of the membership function to effectively discover rare patterns.