
1 Introduction

We have been following rough set based rule generation from table data sets [10, 14, 22] and Apriori based rule generation from transaction data sets [1, 2, 9], and we are investigating a new framework of rule generation from table data sets with information incompleteness [17,18,19,20,21].

Table 1 is a standard table. We term such a table a Deterministic Information System (DIS). For DISs, several rough set based rule generation methods have been proposed [3, 5, 10, 14, 16, 22, 23]. Furthermore, missing values ‘?’ [6, 7, 11] (Table 2) and Non-deterministic Information Systems (NISs) [12, 13, 15] (Table 3) have been investigated to cope with information incompleteness. In [12], question-answering based on possible world semantics was investigated, and an axiom system was given for translating queries to one equivalent normal form.

In a NIS, some attribute values are given as a set of possible attribute values due to information incompleteness. In Table 2, \(\{2,3\}\) in x2 means ‘either 2 or 3 is the actual value, but there is no information to decide which’, and ‘?’ means that there is no information at all. We replace each ‘?’ with the set of all possible attribute values and obtain Table 3. Thus, we can handle ‘?’ within the NIS framework (some discretization may be necessary for continuous attribute values). Formerly, question-answering and information retrieval from NISs were investigated; we are coping with rule generation from NISs.

Table 1. An exemplary DIS \(\psi \).
Table 2. An exemplary NIS \(\varPhi \) with missing value ‘?’, whose value is one of 1, 2, 3.
Table 3. An exemplary NIS \(\varPhi \). Each ‘?’ is replaced with a set \(\{1,2,3\}\) of possible attribute values.

The Apriori algorithm [1] was proposed by Agrawal for handling transaction data sets. We adjust this algorithm to DISs and NISs by using the characteristics of table data sets. The highlights of this paper are the following.

(1) A brief survey of Apriori based rule generation and a rule generator,
(2) Some improvements of the Apriori based algorithm and a rule generator,
(3) An experiment with the improved rule generator in Python.

This paper is organized as follows: Sect. 2 surveys our framework on NISs and the Apriori algorithm [1, 2, 9]. Section 3 connects table data sets to transaction data sets and copes with the manipulation of candidates of rules. Then, more effective manipulation is proposed in DISs and NISs. Section 4 describes a new NIS-Apriori based system in Python and presents the improved results. Section 5 concludes this paper.

2 Preliminary: An Overview of Rule Generation and Examples

This section briefly reviews rule generation from DISs and NISs.

2.1 Rules and Rule Generation from DISs

In Table 1, we consider implications like \([P,3]\Rightarrow [Dec,a]\) from x1 and \([R,2]\wedge [S,1]\Rightarrow [Dec,b]\) from x3. Generally, a rule is defined as an implication satisfying some constraint. The definition below is one standard definition of rules [1, 2, 9, 14, 22]. We follow it and consider the following rule generation from DISs.

(A rule from DIS). A rule is an implication \(\tau \) satisfying \(support(\tau )\ge \alpha \) and \(accuracy(\tau )\ge \beta \) (\(0< \alpha ,~\beta \le 1.0\)) for given threshold values \(\alpha \) and \(\beta \).

(Rule generation from DIS). If we fix \(\alpha \) and \(\beta \) in DIS, the set of all rules is also fixed, but we generally do not know them. Rule generation is to generate all minimal rules (we term a rule with minimal condition part a minimal rule).

Fig. 1. All minimal rules (\(support(\tau )\ge 0.2\), \(accuracy(\tau )\ge 0.9\)) obtained from Table 1. Our system ensures that there are no rules other than these. In the table rule1, the first rule is \(\tau : [P,1]\Rightarrow [Dec,b]\). Even though \(\tau ': [P,1]\wedge [Q,2]\Rightarrow [Dec,b]\) satisfies the constraint of rules, \(\tau '\) is a redundant implication of \(\tau \), and hence \(\tau '\) is not minimal.

Here, \(support(\tau )\) is the ratio of objects supporting an implication \(\tau \) among all objects, and \(accuracy(\tau )\) is the ratio of consistent objects among those satisfying the condition part of \(\tau \). For example, let us consider \(\tau : [R,2]\wedge [S,1]\Rightarrow [Dec,b]\) from x3. Since \(\tau \) occurs once in five objects, we have \(support(\tau )=1/5\). Since \([R,2]\wedge [S,1]\) occurs twice, we have \(accuracy(\tau )=1/2\). Figure 1 shows all minimal rules (redundant rules are not generated) from Table 1.
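These two measures can be computed directly from a table. Below is a minimal Python sketch (the toy table is a hypothetical stand-in consistent with the values quoted above, not the actual Table 1; our rule generator itself is described in Sect. 4):

```python
# A stand-in for a DIS: objects x1..x5 over attributes P, Q, R, S, Dec.
# It is chosen so that [R,2] and [S,1] hold together on exactly two
# objects, only one of which has Dec = b, as in the example above.
table = {
    'x1': {'P': 3, 'Q': 2, 'R': 1, 'S': 2, 'Dec': 'a'},
    'x2': {'P': 1, 'Q': 2, 'R': 1, 'S': 1, 'Dec': 'b'},
    'x3': {'P': 1, 'Q': 2, 'R': 2, 'S': 1, 'Dec': 'b'},
    'x4': {'P': 2, 'Q': 3, 'R': 1, 'S': 2, 'Dec': 'a'},
    'x5': {'P': 3, 'Q': 1, 'R': 2, 'S': 1, 'Dec': 'a'},
}

def support(rows, conds, val):
    """Ratio of objects satisfying both the condition part and the decision."""
    hit = sum(1 for r in rows
              if all(r[a] == v for a, v in conds) and r['Dec'] == val)
    return hit / len(rows)

def accuracy(rows, conds, val):
    """Ratio of consistent objects among those satisfying the condition part."""
    match = [r for r in rows if all(r[a] == v for a, v in conds)]
    hit = sum(1 for r in match if r['Dec'] == val)
    return hit / len(match) if match else 0.0

rows = list(table.values())
tau = ([('R', 2), ('S', 1)], 'b')   # tau: [R,2] and [S,1] => [Dec,b]
print(support(rows, *tau))          # 0.2  (= 1/5)
print(accuracy(rows, *tau))         # 0.5  (= 1/2)
```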

2.2 Rules and Rule Generation from NISs

From now on, we employ the symbols \(\varPhi \) and \(\psi \) for a NIS and a DIS, respectively. In a NIS \(\varPhi \), we replace each set of possible values with one of its elements, and then we have one DIS. We term such a DIS a derived DIS from NIS, and let \(DD(\varPhi )\) denote the set of all derived DISs from \(\varPhi \). Table 1 is a derived DIS from Table 3. In NISs like Table 3, we consider the following two types of rules:

(1) a rule which we certainly conclude from NIS (a certain rule),
(2) a rule which we may conclude from NIS (a possible rule).

These two types of rules seem natural for rule generation with information incompleteness. Yao recalls three-valued logic in rough sets and proposes three-way decisions [23, 24]. Such types of rules concerning missing values were also investigated in [6, 11], and we coped with the following two types of rules based on possible world semantics [18, 20]. The definition in [6, 11] and the following definition are semantically different [18].

(A certain rule from NIS). An implication \(\tau \) is a certain rule if \(\tau \) is a rule in every derived DIS from NIS.

(A possible rule from NIS). An implication \(\tau \) is a possible rule if \(\tau \) is a rule in at least one derived DIS from NIS.

(Rule generation from NIS). If we fix \(\alpha \) and \(\beta \) in NIS, the set of all certain rules and the set of all possible rules are also fixed. Rule generation is to generate all minimal certain rules and all minimal possible rules.

The two types of rules depend on all derived DISs from NIS, and the number of derived DISs increases exponentially. For Table 3, this number is 324 (=\(2^2\times 3^4\)), and it exceeds \(10^{100}\) for the Mammographic data set [4]. Thus, the realization of a system handling the two types of rules seemed hard; however, we gave one solution to this problem.

(Proved Property). For each implication \(\tau \), we developed formulas to calculate the following:

(1) \(minsupp(\tau )=\min _{\psi \in DD(\varPhi )}\{support(\tau ) \text{ in } \psi \}\),
(2) \(minacc(\tau )=\min _{\psi \in DD(\varPhi )}\{accuracy(\tau ) \text{ in } \psi \}\),
(3) \(maxsupp(\tau )=\max _{\psi \in DD(\varPhi )}\{support(\tau ) \text{ in } \psi \}\),
(4) \(maxacc(\tau )=\max _{\psi \in DD(\varPhi )}\{accuracy(\tau ) \text{ in } \psi \}\).

This calculation employs rough set based concepts and is independent of the number of derived DISs [18, 20, 21]. By using these formulas, we proved a method to examine whether \(\tau \) is a certain rule and whether \(\tau \) is a possible rule. This method is also independent of the number of derived DISs [18, 20, 21].
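To make these definitions concrete, the following brute-force sketch enumerates every derived DIS of a tiny, hypothetical NIS and takes the min/max values. This is feasible only for toy data; the point of our proved formulas is precisely to avoid this enumeration:

```python
import itertools

def derived_diss(nis):
    """Yield every derived DIS of a NIS given as
    {object: {attribute: set_of_possible_values}}."""
    objs = sorted(nis)
    attrs = sorted(next(iter(nis.values())))
    cells = [(o, a) for o in objs for a in attrs]
    for combo in itertools.product(*(sorted(nis[o][a]) for o, a in cells)):
        dis = {o: {} for o in objs}
        for (o, a), v in zip(cells, combo):
            dis[o][a] = v
        yield dis

def supp_acc(rows, conds, val):
    """(support, accuracy) of conds => [Dec,val] in one DIS."""
    match = [r for r in rows if all(r[a] == v for a, v in conds)]
    hit = sum(1 for r in match if r['Dec'] == val)
    return hit / len(rows), (hit / len(match) if match else 0.0)

# A two-object NIS: x2 has non-deterministic P and Dec values.
nis = {'x1': {'P': {1}, 'Dec': {'a'}},
       'x2': {'P': {1, 2}, 'Dec': {'a', 'b'}}}
tau = ([('P', 1)], 'a')                      # tau: [P,1] => [Dec,a]
vals = [supp_acc(list(d.values()), *tau) for d in derived_diss(nis)]
print(min(s for s, _ in vals), max(s for s, _ in vals))  # minsupp, maxsupp
print(min(a for _, a in vals), max(a for _, a in vals))  # minacc,  maxacc
```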

Fig. 2. All minimal certain rules (\(support(\tau )\ge 0.2\), \(accuracy(\tau )\ge 0.9\)) obtained from Table 3. There are no certain rules other than these.

Fig. 3. All minimal possible rules (\(support(\tau )\ge 0.2\), \(accuracy(\tau )\ge 0.9\)) obtained from Table 3. There are no possible rules other than these.

We apply this property to the Apriori algorithm for realizing a rule generation system. The Apriori algorithm effectively enumerates itemsets (candidates of rules), and the support and accuracy values of every candidate are calculated by the Proved Property. Figures 2 and 3 show the minimal certain rules and minimal possible rules obtained from Table 3. We discuss the execution time in Sect. 4.

2.3 A Relation Between Rules in DISs and Rules in NISs

Let \(\psi ^{actual}\) be a derived DIS with actual information from NIS \(\varPhi \) (we cannot decide \(\psi ^{actual}\) from \(\varPhi \), but we suppose there is an actual \(\psi ^{actual}\) for \(\varPhi \)), then we can easily have the next inclusion relation.

$$\begin{aligned} \{\tau ~|~\tau \text{ is a certain rule in } \varPhi \} &\subseteq \{\tau ~|~\tau \text{ is a rule in } \psi ^{actual}\} \\ &\subseteq \{\tau ~|~\tau \text{ is a possible rule in } \varPhi \} \end{aligned}$$

Due to information incompleteness, we know lower and upper approximations of a set of rules in \(\psi ^{actual}\). This property follows the concept of rough sets based approximations.

2.4 The Apriori Algorithm for Transaction Data Sets

Let us consider Table 4, which shows four persons’ purchases of items. Such structured data is termed a transaction data set. In this data set, let us focus on a set \(\{ham,beer\}\). Such a set is generally termed an itemset. For this itemset, we consider two implications \(\tau _{1}: ham\Rightarrow beer\) and \(\tau _{2}: beer\Rightarrow ham\). For \(\tau _{1}\), \(support(\tau _{1})\) = 3/4 and \(accuracy(\tau _{1})\) = 3/3. For \(\tau _{2}\), \(support(\tau _{2})\) = 3/4 and \(accuracy(\tau _{2})\) = 3/4. For an itemset \(\{ham,beer,corn\}\), we consider six implications, \(ham\wedge beer\Rightarrow corn\), \(\cdots \), \(beer\Rightarrow corn\wedge ham\). In this way, Agrawal proposed a method to obtain rules from transaction data sets, which is known as the Apriori algorithm [1, 2, 9]. This algorithm makes use of the following.

Table 4. An exemplary transaction data set

(Monotonicity of support). For two itemsets P and Q, if P \(\subseteq \) Q, \(support(Q)\le support(P)\) holds.

By using this property, the Apriori algorithm enumerates all itemsets satisfying \(support\ge \alpha \). Each such itemset is termed a frequent itemset. Let us consider the manipulation of itemsets in Table 4 under \(support\ge 0.5\). Since there are four transactions, each frequent itemset must occur at least twice. Let \(CAN_{i}\) and \(FI_{i}\) (\(i\ge 0\)) denote the set of all candidate itemsets and the set of all frequent itemsets consisting of \((i+1)\) items, respectively. We have the following.

$$\begin{aligned}&CAN_{0}=\{\{bread\}(\text{occurrence}{=}1),\{milk\}(1),\{ham\}(3),\{beer\}(4),\{corn\}(2), \\&\qquad \{cheese\}(2),\{apple\}(1),\{potato\}(1),\{cake\}(1)\}, \\&FI_{0}=\{\{ham\}(3),\{beer\}(4),\{corn\}(2),\{cheese\}(2)\}, \\&CAN_{1}=\{\{ham,beer\},\{ham,corn\},\{ham,cheese\},\{beer,corn\},\\&\qquad \{beer,cheese\},\{corn,cheese\}\}, \\&FI_{1}=\{\{ham,beer\}(3),\{ham,corn\}(2),\{beer,corn\}(2),\{beer,cheese\}(2)\}, \\&CAN_{2}=\{\{ham,beer,corn\},\{ham,beer,cheese\},\{ham,corn,cheese\},\\&\qquad \{beer,corn,cheese\}\}, \\&FI_{2}=\{\{ham,beer,corn\}(2)\}. \end{aligned}$$

Each element of \(CAN_{i}\) (\(i\ge 1\)) is generated by combining two itemsets in \(FI_{i-1}\) [1, 2]. Then, every itemset satisfying the support condition becomes an element of \(FI_{i}\). For example, for \(A:\{ham,corn\}\), \(B:\{beer,cheese\}\in FI_{1}\), we add one element of B to A and obtain \(\{ham,corn,beer\}\), \(\{ham,corn,cheese\}\in CAN_{2}\). We also do the converse and obtain \(\{beer,cheese,ham\}\), \(\{beer,cheese,corn\}\in CAN_{2}\). Only one itemset \(\{ham,beer,corn\}\) satisfies the support condition and becomes an element of \(FI_{2}\). In this way, \(FI_{1}\), \(FI_{2}\), \(\cdots \), \(FI_{n}\) are obtained first, and then the accuracy value of each implication defined by a frequent itemset is evaluated. In the subsequent sections, we change this manipulation by using the characteristics of table data sets.
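The following Python sketch reproduces this enumeration of frequent itemsets. Table 4 itself is not reproduced here, so the four transactions below are a hypothetical reconstruction consistent with the occurrence counts listed above:

```python
from itertools import combinations

# Hypothetical transactions matching the counts in CAN_0 and FI_0..FI_2.
transactions = [
    {'bread', 'milk', 'ham', 'beer', 'corn'},
    {'ham', 'beer', 'corn', 'cheese'},
    {'beer', 'cheese', 'apple'},
    {'ham', 'beer', 'potato', 'cake'},
]

def frequent_itemsets(transactions, alpha):
    """Enumerate all frequent itemsets (support >= alpha), Apriori-style."""
    n = len(transactions)
    occ = lambda s: sum(1 for t in transactions if s <= t)
    items = {i for t in transactions for i in t}
    fi = [frozenset([i]) for i in items if occ(frozenset([i])) / n >= alpha]
    result, k = list(fi), 2
    while fi:
        # join step: unions of two frequent (k-1)-itemsets having size k
        can = {a | b for a, b in combinations(fi, 2) if len(a | b) == k}
        fi = [c for c in can if occ(c) / n >= alpha]   # prune by support
        result += fi
        k += 1
    return result

for s in frequent_itemsets(transactions, 0.5):
    print(sorted(s))
```

Running the sketch with \(\alpha =0.5\) prints exactly the itemsets of \(FI_{0}\), \(FI_{1}\), and \(FI_{2}\) above.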

3 Some Improvements of the NIS-Apriori Based Rule Generator

We describe the improvements in our framework based on Sect. 2.

3.1 From Transaction Data Sets to Table Data Sets

We translate Table 1 to Table 5 and identify each descriptor with an item. Then, we can see that Table 5 is a transaction data set. Thus, we can apply the Apriori algorithm to rule generation.

Table 5. A transaction data set for DIS \(\psi \) in Table 1.

We define the following sets \(IMP_{1}\), \(IMP_{2}\), \(\cdots \), \(IMP_{n}\):

  • \(IMP_{1}=\{[A,val_{A}]\Rightarrow [Dec,val]\}\),

  • \(IMP_{2}=\{[A,val_{A}]\wedge [B,val_{B}]\Rightarrow [Dec,val]\}\),

  • \(IMP_{3}=\{[A,val_{A}]\wedge [B,val_{B}]\wedge [C,val_{C}]\Rightarrow [Dec,val]\}\),

Here, \(IMP_{i}\) denotes the set of implications whose condition parts consist of i attributes. A minimal rule is an implication \(\tau \in \cup _{i}IMP_{i}\), and we could examine every \(\tau \in \cup _{i}IMP_{i}\) exhaustively. However, in the subsequent sections, we consider more effective manipulations to generate minimal rules in \(IMP_{1}\), \(IMP_{2}\), \(\cdots \), sequentially.

3.2 The Manipulation I for Frequent Itemsets by the Characteristics of Table Data Sets

Here, we make use of the characteristics of table data sets below.

(TA1). The decision attribute Dec is fixed. So, it is enough to consider only itemsets including exactly one descriptor whose attribute is Dec. For example, we handle neither \(\{[P,3],[Q,2]\}\) nor \(\{[P,3],[Dec,a],[Dec,b]\}\) in Table 5.

(TA2). Each descriptor is related to an attribute. So, we handle only itemsets whose descriptors have mutually different attributes. For example, we do not handle an itemset like \(\{[P,3],[P,1],[Q,2],[Dec,b]\}\) in Table 5.

(TA3). To consider implications, we handle \(CAN_{1}\), \(FI_{1}\) (\(\subseteq IMP_{1})\), \(CAN_{2}\), \(FI_{2}\) (\(\subseteq IMP_{2})\), \(\cdots \), which are defined in Sect. 2.4.

Fig. 4. The manipulation I for itemsets.

Fig. 5. The Apriori algorithm adjusted to a table data set DIS \(\psi \). We can examine the accuracy value in each while loop (the rectangular area circled by the dotted line in Fig. 4). This examination is not done in the Apriori algorithm for transaction data sets.

Based on the above characteristics, we can consider Fig. 4, where itemsets satisfying (TA1) and (TA2) are enumerated. Generally, in the Apriori algorithm, the accuracy value is examined after all \(FI_{i}\) are obtained, because the decision attribute is not fixed, and each itemset in \(FI_{i}\) corresponds to plural implications. In a table data set, however, one implication corresponds to one frequent itemset. We employed this property and proposed the Apriori algorithm adjusted to table data sets [20, 21] in Fig. 5. We term this algorithm the DIS-Apriori algorithm. Here, we calculate the accuracy value of every frequent itemset in each while loop (the rectangular area circled by the dotted line in Fig. 4 and lines 5-7 in Fig. 5). We can easily handle certain rules and possible rules in NISs by extending the DIS-Apriori algorithm.
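A minimal Python sketch of this loop follows. It is an illustrative reading of Fig. 5 under the conventions of the sketch in Sect. 2.1 (a DIS as a list of attribute-value dictionaries), not the system evaluated in Sect. 4:

```python
def dis_apriori(rows, alpha, beta, dec='Dec'):
    """Sketch of the DIS-Apriori loop: unlike plain Apriori, support AND
    accuracy are examined inside every while loop."""
    n = len(rows)

    def stats(conds, val):
        match = [r for r in rows if all(r[a] == v for a, v in conds)]
        hit = sum(1 for r in match if r[dec] == val)
        return hit / n, (hit / len(match) if match else 0.0)

    attrs = [a for a in rows[0] if a != dec]
    # CAN_1: one condition descriptor plus one decision descriptor (TA1)
    can = {(((a, r[a]),), r[dec]) for r in rows for a in attrs}
    rules, fi1 = [], None
    while can:
        rule_i, rest_i = [], []
        for conds, val in can:
            s, ac = stats(conds, val)
            if s >= alpha:                    # otherwise tau is in NOrule_i
                (rule_i if ac >= beta else rest_i).append((conds, val))
        rules += rule_i
        fi_i = rule_i + rest_i                # FI_i = Rule_i + Rest_i
        if fi1 is None:
            fi1 = fi_i
        # CAN_{i+1}: extend frequent implications with one descriptor from
        # FI_1 carrying the same decision and a fresh attribute (TA1, TA2)
        can = {(tuple(sorted(conds + ((b, vb),))), val)
               for conds, val in fi_i
               for ((b, vb),), v1 in fi1
               if v1 == val and b not in {a for a, _ in conds}}
    # keep minimal rules only (drop implications extending a shorter rule)
    return [(c, v) for c, v in rules
            if not any(set(c2) < set(c) and v2 == v for c2, v2 in rules)]
```

For the toy table of Sect. 2.1 with \(\alpha =0.2\) and \(\beta =0.9\), `dis_apriori(rows, 0.2, 0.9)` returns minimal rules such as \([P,2]\Rightarrow [Dec,a]\).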

Proposition 1

[20, 21]

(1) We replace DIS \(\psi \) with NIS \(\varPhi \), and support and accuracy with minsupp and minacc, respectively. Then, this algorithm generates all minimal certain rules.
(2) We replace DIS \(\psi \) with NIS \(\varPhi \), and support and accuracy with maxsupp and maxacc, respectively. Then, this algorithm generates all minimal possible rules.
(3) We term the algorithm consisting of (1) and (2) the NIS-Apriori algorithm.

Both the DIS-Apriori and NIS-Apriori algorithms are logically sound and complete for rules. They generate all rules without excess or deficiency.
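As an illustration of Proposition 1 only (the actual NIS-Apriori system applies the proved formulas of Sect. 2.2 rather than this brute force), the evaluation step of the DIS-Apriori sketch above can be swapped for min/max values over derived DISs, reusing `derived_diss` and `supp_acc` from the sketch in Sect. 2.2:

```python
def nis_stats(nis, conds, val, kind='certain'):
    """(support, accuracy) criteria for NIS rule generation:
    min over derived DISs for certain rules (minsupp, minacc),
    max over derived DISs for possible rules (maxsupp, maxacc)."""
    pairs = [supp_acc(list(d.values()), conds, val)
             for d in derived_diss(nis)]
    pick = min if kind == 'certain' else max
    return pick(s for s, _ in pairs), pick(a for _, a in pairs)
```

Replacing `stats` in the DIS-Apriori sketch with `nis_stats(nis, conds, val, 'certain')` generates minimal certain rules, and with `'possible'` minimal possible rules, as Proposition 1 states.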

Figures 1, 2 and 3, produced by the rule generator in SQL, are based on the algorithm in Fig. 5 and Proposition 1.

3.3 The Manipulation II for Frequent Itemsets by the Characteristics of Table Data Sets

Now, we advance the manipulation I to the manipulation II. We focus on the statement ‘create \(FI_{i}\)’ in lines 2 and 10 in Fig. 5. In every while loop, we examine each \(\tau \in FI_{i}\subseteq CAN_{i}\subseteq IMP_{i}\), so reducing the sets \(CAN_{i}\) and \(FI_{i}\) will improve the performance of execution. Regarding Fig. 5, we first remark the following.

(Rule generation). The purpose of rule generation is to generate every minimal implication \(\tau \in \cup _{i}IMP_{i}\) satisfying \(support(\tau )\ge \alpha \) and \(accuracy(\tau )\ge \beta \). We obtain \(Rule_{1}, Rest_{1}\subseteq IMP_{1}\) in the 1st while loop, \(Rule_{2}, Rest_{2}\subseteq IMP_{2}\) in the 2nd while loop, \(Rule_{3}, Rest_{3}\) in the 3rd while loop, \(\cdots \).

(Relation between sets in Fig. 5). We clarify the relation and the definition of \(NOrule_{i}\) below.

(1) \(Rule_{i}=\{\tau \in IMP_{i}~|~support(\tau )\ge \alpha ,~accuracy(\tau )\ge \beta \}\),
(2) \(Rest_{i}=\{\tau \in IMP_{i}~|~support(\tau )\ge \alpha ,~accuracy(\tau )<\beta \}\),
(3) \(FI_{i}=\{\tau \in IMP_{i}~|~support(\tau )\ge \alpha \}\),
(4) \(NOrule_{i}=\{\tau \in IMP_{i}~|~support(\tau )<\alpha \}\),
(5) \(IMP_{i}=FI_{i}\cup NOrule_{i}=(Rule_{i}\cup Rest_{i})\cup NOrule_{i}\).

(A case of \(\tau \in Rule_{i}\)). If \(\tau : \wedge _{j} [A_{j},val_{j}]\Rightarrow [Dec,val]\in Rule_{i}\), we do not deal with any redundant implication \(\tau ': (\wedge _{j} [A_{j},val_{j}])\wedge [B,b]\Rightarrow [Dec,val]\in IMP_{i+1}\), because \(\tau '\) cannot be a minimal rule.

(A case of \(\tau \in NOrule_{i}\)). If \(\tau : \wedge _{j} [A_{j},val_{j}]\Rightarrow [Dec,val]\in NOrule_{i}\), any redundant implication \(\tau ': (\wedge _{j} [A_{j},val_{j}])\wedge [B,b]\Rightarrow [Dec,val]\) satisfies \(support(\tau ')<\alpha \). So, \(\tau '\in IMP_{i+1}\) cannot be a rule. Thus, we do not deal with any redundant implication \(\tau '\).

(A case of \(\tau \in Rest_{i}\)). For the accuracy value, the monotonicity enjoyed by support does not hold (an example is in [20]). For instance, if \([A,1]\) matches two objects whose decisions are a and b, then \(accuracy([A,1]\Rightarrow [Dec,a])=1/2\), but adding a descriptor \([B,1]\) matching only the first object yields \(accuracy([A,1]\wedge [B,1]\Rightarrow [Dec,a])=1/1\). Thus, if \(\tau : \wedge _{j} [A_{j},val_{j}]\Rightarrow [Dec,val]\in Rest_{i}\), \(accuracy(\tau ')\ge \beta \) may hold for a redundant implication \(\tau ': (\wedge _{j} [A_{j},val_{j}])\wedge [B,b]\Rightarrow [Dec,val]\in FI_{i+1}\).

Proposition 2

Let us suppose that we obtained \(Rule_{i}\) and \(Rest_{i}\) (\(IMP_{i}=Rule_{i}\cup Rest_{i}\cup NOrule_{i}\)) in the i-th while loop in Fig. 5. Then, every candidate of a minimal rule in \(IMP_{i+1}\) is a redundant implication of some \(\tau \in Rest_{i}\).

(Proof)

For every implication \(\tau \in IMP_{i}\) with \(\tau \not \in FI_{i}\), each of its redundant implications \(\tau '\) satisfies \(support(\tau ')\le support(\tau )<\alpha \). Thus, \(\tau '\) cannot be a minimal rule in \(IMP_{i+1}\). Based on the Apriori algorithm, we need to combine two frequent itemsets in \(FI_{i}=Rule_{i}\cup Rest_{i}\) (an example of this combination is described in Sect. 2.4). However, by the minimality condition of rules, we do not handle any redundant implication of \(\tau \in Rule_{i}\). Thus, we conclude that every candidate of a minimal rule in \(IMP_{i+1}\) is a redundant implication of some \(\tau \in Rest_{i}\).

Definition 1

We define a set \(RCAN_{i}~(\subseteq CAN_{i})\), whose elements are the candidates of minimal rules in \(IMP_{i}\) w.r.t. the rules \(\cup _{j=1,\cdots ,(i-1)}Rule_{j}\), and a set \(RFI_{i}=\{\tau \in RCAN_{i}~|~support(\tau )\ge \alpha \}\) \((\subseteq FI_{i}\subseteq IMP_{i})\).

In the Apriori algorithm, the concept of redundancy is not introduced, so some redundant rules may be generated. The sets \(CAN_{i}\) and \(FI_{i}\) in Fig. 4 are generated from \(FI_{i-1}\) (=\(Rule_{i-1}\cup Rest_{i-1}\)). However, we can generate \(RCAN_{i}\) (\(\subseteq CAN_{i}\)) and \(RFI_{i}\) (\(\subseteq FI_{i}\)) from \(Rest_{i-1}\). Furthermore, our previous system generated itemsets \(\{[A,a],[B,b],[Dec,v1]\},\{[A,a],[B,b],[Dec,v2]\}\in RCAN_{2}\) from \(\{[A,a],[Dec,v1]\}, \{[B,b],[Dec,v2]\}\in Rest_{1}\); we now remove this combination, because no object satisfies both \([Dec,v1]\) and \([Dec,v2]\), and this combination formerly generated meaningless itemsets. This revision is another improvement in the manipulation of itemsets.

Proposition 3

The sets \(RCAN_{i}\) and \(RFI_{i}\) are given as follows:

$$\begin{aligned}&(i=1)~~RCAN_{1}=CAN_{1} \text{ and } RFI_{1}=FI_{1}, \\&(i\ge 2)~~RCAN_{i}=\{\tau : (\wedge _{j} [A_{j},val_{j}])\wedge [B,b]\Rightarrow [Dec,val]~|~\\&\qquad \wedge _{j} [A_{j},val_{j}]\Rightarrow [Dec,val]\in Rest_{i-1},~[B,b]\Rightarrow [Dec,val]\in Rest_{1}\}, \\&\qquad RFI_{i}=\{\tau \in RCAN_{i}~|~support(\tau )\ge \alpha \}. \end{aligned}$$
Fig. 6. The new manipulation II of itemsets. We can handle \(RCAN_{i}\subseteq CAN_{i}\) and \(RFI_{i}\subseteq FI_{i}\) for generating minimal rules. In the Apriori algorithm, \(CAN_{i}\) and \(FI_{i}\) are employed, so redundant rules may be generated. By using \(RCAN_{i}\) and \(RFI_{i}\), the candidates of rules are reduced, and the performance of execution is improved.

(Proof)

(In case of i = 1) \(RCAN_{1}=CAN_{1}\) and \(RFI_{1}=FI_{1}\) hold, because redundant implications occur only from the 2nd while loop onward.

(In case of \(i\ge 2\)) We add one descriptor \([B,b]\) to \(\wedge _{j} [A_{j},val_{j}]\Rightarrow [Dec,val]\in Rest_{i-1}\) and obtain a redundant implication \(\tau : (\wedge _{j} [A_{j},val_{j}])\wedge [B,b]\Rightarrow [Dec,val]\in IMP_{i}\) due to Proposition 2.

(1) In order to handle the same decision, \([B,b]\) must be the condition part of some \(\tau ': [B,b]\Rightarrow [Dec,val]\in RFI_{1}=FI_{1}\). (If \(\tau '\not \in FI_{1}\), then \(support(\tau )<\alpha \) holds and \(\tau \) cannot be a rule, because \(\tau \) is a redundant implication of \(\tau '\).)
(2) \(FI_{1}=Rule_{1}\cup Rest_{1}\) holds. If \(\tau '\in Rule_{1}\), \(\tau \) cannot be a minimal rule, because \(\tau '\) is a minimal rule.

Based on the above discussion, we conclude \(\tau '\in Rest_{1}\).

We propose the manipulation II in Fig. 6 based on the above propositions. In the Apriori algorithm, \(CAN_{i}\) is generated from \(FI_{i-1}\), but we can remove the redundant implications of \(\tau \in Rule_{i-1}\). Thus, we can handle \(RCAN_{i}\), which is a subset of \(CAN_{i}\). If the number of elements in \(Rule_{i-1}\) is large, the number of elements in \(RCAN_{i}\) will be much smaller than that of \(CAN_{i}\).
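The change to the DIS-Apriori sketch of Sect. 3.2 is small: \(CAN_{i}\) generated from \(FI_{i-1}\) is replaced by \(RCAN_{i}\) generated from \(Rest_{i-1}\) and \(Rest_{1}\), following Proposition 3. A hedged sketch of this candidate generation, under the same data conventions as before:

```python
def rcan_next(rest_prev, rest1):
    """RCAN_i per Proposition 3: extend each implication of Rest_{i-1}
    with one descriptor [B,b] such that [B,b] => [Dec,val] is in Rest_1
    (same decision val, fresh attribute B)."""
    return {(tuple(sorted(conds + ((b, vb),))), val)
            for conds, val in rest_prev
            for ((b, vb),), v1 in rest1
            if v1 == val and b not in {a for a, _ in conds}}
```

In the earlier sketch, keeping \(Rest_{1}\) from the first loop and setting `can = rcan_next(rest_i, rest1)` realizes the manipulation II; the final minimality filter is retained, but it now works over far fewer candidates.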

Proposition 4

The DIS-Apriori algorithm with the manipulation II is sound and complete for minimal rules in DIS, and the NIS-Apriori algorithm with the manipulation II is also sound and complete for minimal certain rules and minimal possible rules in NIS. They do not miss any rule defined in DIS \(\psi \) or NIS \(\varPhi \).

(Sketch of Proof). We proved that the DIS-Apriori and NIS-Apriori algorithms are sound and complete [20, 21]. We newly introduced the sets \(RCAN_{i}\subseteq CAN_{i}\) and \(RFI_{i}\subseteq FI_{i}\) by using the redundancy of rules, and we extended the previous two algorithms to those with the manipulation II. The proposed algorithm does not examine each \(\tau \in \cup _{j}IMP_{j}\), but examines each \(\tau \in \cup _{j}RCAN_{j}\). By Propositions 2 and 3, no minimal rule is excluded, so this algorithm generates the same rules as the procedure ‘examine each \(\tau \in \cup _{j}IMP_{j}\)’.

Table 6. The Car Evaluation data set (Objects: 1728, condition attributes: 6). A:\(|Rule_{1}|\), B:\(|CAN_{2}|\) or \(|RCAN_{2}|\), C:\(|Rule_{2}|\), D:\(|CAN_{3}|\) or \(|RCAN_{3}|\), E:\(|Rule_{3}|\), F:\(|CAN_{4}|\) or \(|RCAN_{4}|\), G:\(|Rule_{4}|\).
Table 7. The Phishing data set (Objects: 1353, condition attributes: 9). Here, A, B, \(\cdots \), G are the same as Table 6.
Table 8. The Congressional Voting data set (Objects: 435, condition attributes: 16). There are 392 missing values, thus \(|DD(\varPhi )|\) = \(2^{392}\ge 10^{100}\) (the number of derived DISs exceeds \(10^{100}\)). A certain rule is a rule in each of more than \(10^{100}\) derived DISs. A possible rule is a rule in at least one derived DIS. Here, A, B, \(\cdots \), G are the same as in Table 6.
Table 9. The Lithology data set (Objects: 1923, condition attributes: 10). There are 519 missing values, therefore there are more than \(10^{100}\) (\(2^{519}\fallingdotseq (2^{10})^{50}> (10^{3})^{50}>10^{100}\)) derived DISs. Here, A, B, \(\cdots \), G are the same as Table 6.

4 An Improved Apriori Based Rule Generator and Some Experiments

This section compares the NIS-Apriori algorithm and the NIS-Apriori algorithm with the manipulation II. Of course, the two algorithms generate the same rules due to Propositions 1 and 4, and the latter makes use of the redundancy concept. We newly implemented the two systems in Python (Windows PC, CPU: Intel i7-4600U, 2.7 GHz). Table 6 shows the results on the Car Evaluation data set [4], and Table 7 shows the results on the Phishing data set [4]. They are cases of DISs, and the property \(RCAN_{i}\subseteq CAN_{i}\) is effectively employed.

Now, we show two examples by the NIS-Apriori algorithm: one is the Congressional Voting data set [4], and the other is the Lithology data set [8]. As described in Proposition 1, the NIS-Apriori algorithm (certain rule generation) is the DIS-Apriori algorithm with the criterion values minsupp and minacc. Thus, the number of candidate itemsets is also reduced by the manipulation II. The experiments clearly show the improvement achieved by the manipulation II (Tables 8 and 9).

5 Concluding Remarks

We recently adjusted the Apriori algorithm to table data sets and proposed the DIS-Apriori and NIS-Apriori algorithms. This paper made use of the characteristics of table data sets (one fixed decision attribute Dec) and improved these algorithms. If we did not handle table data sets, there would be no necessity for considering Fig. 6. The framework of the manipulation II (Fig. 6) improves Apriori based rule generation by using the characteristics of table data sets. We can generate minimal rules by using \(RCAN_{i}\subseteq CAN_{i}\) and \(RFI_{i}\subseteq FI_{i}\), which reduces the candidate itemsets. We newly implemented the proposed algorithm in Python and examined the improvement of the performance of execution by experiments.