
1 Introduction

Rough Set (RS) theory was introduced by Pawlak [1] and is used for inducing if-then rules from a dataset called the decision table (DT). To date, various methods and algorithms for inducing rules based on the theory have been proposed [2,3,4,5], since induced rules simply and clearly express the structure of rating and/or knowledge hidden behind the table. The basic idea is to approximate a concept in the DT by use of the lower and/or upper approximation sets, which are derived from the equivalence relations and their equivalence sets in the given DT. However, those methods and algorithms pay little attention to the fact that the DT is just a sample set gathered from the population of interest. If the DT is resampled from the population, or from the DT itself by the Bootstrap method for example, the new DT will have different equivalence relations, equivalence sets, and lower and/or upper approximation sets, so the induced rules will change and fluctuate. Those methods and algorithms also have the problem that the induced rules are not arranged from a statistical point of view. We therefore proposed a rule induction method named STRIM (Statistical Test Rule Induction Method) that takes the above problems into consideration [6,7,8,9,10,11,12,13,14,15,16]. Specifically, STRIM

(1) Proposed a data generation model for generating a DT. This model recognized the DT as an input-output system which transformed a tuple of condition attribute values occurring by chance (the input) into the decision attribute value (the output) through pre-specified if-then rules (generally unknown) under some hypotheses. That is, the input was recognized as an outcome of random variables, and the output was the outcome of a random variable dependent on the input and the pre-specified rules. Accordingly, the pairs of input and output formed the DT containing the rules.

(2) Assumed a trial condition part of if-then rules and judged whether it was a rule candidate by statistically testing whether the condition caused a bias in the distribution of the decision attribute's values.

(3) Arranged the candidates having inclusion relationships by representing them with the one of the highest bias, and finally induced if-then rules at a statistical significance level after systematically exploring the trial condition parts of rules.

The validity and capacity of STRIM have been confirmed by simulation experiments showing that STRIM can induce the pre-specified if-then rules from a DT generated by the model in (1). In this way, the data generation model proposed in (1) can also be used as a verification system for a rule induction method (VSofRIM). This validity and capacity also secure a certain degree of confidence in the rules induced by STRIM from DTs of real-world datasets. The VSofRIM has also been used for confirming the validity and capacity of other previously proposed rule induction methods [11, 14].

The conventional STRIM systematically explores the domain of the condition attributes looking for rule candidates causing the bias, and statistically judges their validity by use of the DT, which is accumulated through the rules intervening between the inputs of the condition attributes and the corresponding outputs of the decision attribute. This paper reconsiders the process after (2) from the view of Bayes' law, which generally infers causes from results, and proposes a new rule induction method named apriori-STRIM. Specifically, the method explores co-occurrence sets of condition attribute values in the DT against a specific decision attribute value. The concept of the co-occurrence set plays an important role in the association rule learning (ARL) field [17], and such sets can be found efficiently using the well-known Apriori algorithm [18]. That is, apriori-STRIM focuses on the property that a specific decision attribute value will occur together with specific condition attribute values through the rules' intervention, whereas the conventional STRIM focuses on the bias. The rules' intervention can likewise be judged by a statistical test using the co-occurrence sets in the DT. The validity and capacity of apriori-STRIM are confirmed by the same experiment as for the conventional STRIM, and the two-way confirmation by both STRIMs secures the validity and capacity of the rule induction method. This paper also shows interesting features of ARL by applying it to the VSofRIM and clarifies the differences between apriori-STRIM and ARL.

2 Conventional Rough Sets and STRIM

The Rough Set theory is used for inducing if-then rules from a decision table S. S is conventionally denoted by \(S=(U,A=C \cup \{D\}, V, \rho )\). Here, \(U=\{u(i)\vert i=1,...,\vert U\vert =N\}\) is a sample set, A is an attribute set, \(C=\{C(j) \vert j=1,...,\vert C \vert \}\) is a condition attribute set, C(j) is a member of C and a condition attribute, and D is a decision attribute. Moreover, V is a set of attribute values denoted by \(V=\cup _{a \in A} V_{a}\) and is characterized by the information function \(\rho \): \(U \times A \rightarrow V\).
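For concreteness, a toy decision table in this notation can be written down directly. The following is a minimal sketch in R (the language used later in this paper); the values are illustrative only:

```r
# A toy decision table S: |C| = 2 condition attributes C1, C2 and a
# decision attribute D; the values are illustrative only.
U <- data.frame(
  C1 = c(1, 1, 2, 2, 1),
  C2 = c(1, 2, 2, 2, 1),
  D  = c(1, 2, 2, 1, 1))
# The information function rho(u(i), a) is simply the table lookup:
U[3, "C2"]   # rho(u(3), C(2)) = 2
```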

The conventional Rough Set theory first focuses on the following equivalence relation and the equivalence set of indiscernibility within the decision table S of interest:

$$ I_{B}=\{(u(i),u(j)) \in U^{2}| \rho (u(i),a)=\rho (u(j),a), \forall a \in B \subseteq C\}. $$

\(I_{B}\) is an equivalence relation in U and derives the quotient set \(U/I_{B}=\{[u_{i}]_{B}|i=1,2,...,|U|=N\}\). Here, \([u_{i}]_{B}=\{u(j) \in U|(u(j),u_{i}) \in I_{B},u_{i} \in U\}\) is an equivalence set with the representative element \(u_{i}\).

Let X be an arbitrary subset of U; then X can be approximated as \(B_{*}(X) \subseteq X \subseteq B^{*}(X)\) through the use of the equivalence sets. Here, \(B_{*}(X)=\{u_{i} \in U \vert [u_{i}]_{B} \subseteq X\}\) and \(B^{*}(X)=\{u_{i} \in U \vert [u_{i}]_{B} \cap X \ne \emptyset \}\); \(B_{*}(X)\) and \(B^{*}(X)\) are referred to as the lower and upper approximations of X by B, respectively. The pair \((B_{*}(X), B^{*}(X))\) is usually called a rough set of X by B.
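The approximations can be computed from the quotient set. The following is a minimal sketch continuing the toy table above, assuming \(B=\{C(1),C(2)\}\) and the concept \(X=U(d=1)\):

```r
# Quotient set U/I_B and rough approximations of X for the toy table above.
B <- c("C1", "C2")
classes <- split(seq_len(nrow(U)), interaction(U[B], drop = TRUE))  # U/I_B
X <- which(U$D == 1)                              # the concept X = U(d = 1)
lower <- unlist(classes[vapply(classes, function(cl) all(cl %in% X), logical(1))])
upper <- unlist(classes[vapply(classes, function(cl) any(cl %in% X), logical(1))])
# Here B_*(X) = {u(1), u(5)} and B^*(X) = {u(1), u(3), u(4), u(5)},
# so B_*(X) is a subset of X, which is a subset of B^*(X).
```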

Specifically, let \(X=\{u(i)|\rho (u(i),D)=d\}\) \(=U(d)=\{u(i)|u^{D=d}(i)\}\) be called the concept of \(D=d\), and define \(U(CP)=\{u(i)|u^{C=CP}(i)\}\), the set of u(i) whose tuple of condition attribute values \(u^{C}(i)\) satisfies CP. If U(CP) equals \(B_{*}(X)\), then CP can be used as the condition part of an if-then rule of \(D=d\) with necessity. That is, the following expression of if-then rules with necessity is obtained: if \(CP=\wedge _{j}(C(j_{k})=v_{j_{k}})\) then \(D=d\). In the same way, \(B^{*}(X)\) derives the condition part CP of an if-then rule of \(D=d\) with possibility.

However, the approximation of \(X=U(d)\) by the lower or upper approximation is respectively too strict or too loose, so the rules induced by these approximations are often of no use. Ziarko therefore expanded the original RS by introducing an admissible error in two ways [4]: \(\underline{B}_{\varepsilon }(U(d))=\{u(i) \vert accuracy \ge 1-\varepsilon \}\) and \(\overline{B}_{\varepsilon }(U(d))=\{u(i) \vert accuracy > \varepsilon \}\), where \(accuracy=\vert [u(i)]_{B} \cap U(d) \vert / \vert [u(i)]_{B} \vert \) and \(\varepsilon \in [0,0.5)\). The pair \((\underline{B}_{\varepsilon }(U(d)), \overline{B}_{\varepsilon }(U(d)))\) is called the \(\varepsilon \)-lower and \(\varepsilon \)-upper approximation, which satisfies the following properties: \(B_{*}(U(d)) \subseteq \underline{B}_{\varepsilon }(U(d)) \subseteq \overline{B}_{\varepsilon }(U(d)) \subseteq B^{*}(U(d))\), \(\underline{B}_{\varepsilon =0}(U(d))=B_{*}(U(d))\), and \(\overline{B}_{\varepsilon =0}(U(d))=B^{*}(U(d))\). The \(\varepsilon \)-lower and/or \(\varepsilon \)-upper approximation induces if-then rules with admissible errors in the same way as the lower and/or upper approximation.
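Continuing the toy example, the \(\varepsilon \)-approximations replace the all/any conditions of the lower/upper approximations with thresholds on the per-class accuracy. A minimal sketch follows; \(\varepsilon =0.4\) is an arbitrary choice for illustration:

```r
# epsilon-lower and epsilon-upper approximations (Ziarko) for the toy table.
eps <- 0.4
acc <- vapply(classes, function(cl) mean(cl %in% X), numeric(1))  # per-class accuracy
lower_eps <- unlist(classes[acc >= 1 - eps])   # epsilon-lower approximation
upper_eps <- unlist(classes[acc > eps])        # epsilon-upper approximation
```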

As mentioned above, the conventional RS theory basically focuses on the equivalence relation \(I_{B}\) and its equivalence sets \({U/I}_{B}\) in the U given in advance, and induces rules approximating the concept by use of the approximation sets derived from \({U/I}_{B}\). However, \(I_{B}\) depends strongly on the DT provided. Every DT obtained from the same population differs from the others, so \(I_{B}\), \({U/I}_{B}\) and the approximation sets differ for each DT, which leads to inducing different rule sets. That is, rule induction methods based on the conventional RS theory lack a statistical view.

Fig. 1. A data generation model: the rule box contains if-then rules R(d, k): if sCP(d, k) then \(D=d\) (\(d=1,2,\ldots \); \(k=1,2,\ldots \)).

Table 1. Hypotheses with regard to the decision attribute value.

Then, STRIM [6, 9, 10, 12, 15] proposed a data generation model (DGM) for the DT and a rule induction method based on the model. Specifically, STRIM considers the decision table to be a sample dataset obtained from an input-output system including a rule box, as shown in Fig. 1, together with hypotheses regarding the decision attribute values, as shown in Table 1. A sample u(i) consists of its condition attribute values \(u^{C}(i)\) and its decision attribute value \(u^{D}(i)\). \(u^{C}(i)\) is the input to the rule box and is transformed into the output \(u^{D}(i)\) using the rules (generally unknown) contained in the rule box and the hypotheses. The hypotheses cover three cases corresponding to the input: the uniquely determined, indifferent, and conflicted cases (see Table 1). The pair \(u(i)=(u^{C}(i), u^{D}(i))\) is measured by an observer, as shown in Fig. 1. NoiseC causes missing values in \(u^{C}(i)\), and NoiseD changes \(u^{D}(i)\) into another value; these noises bring the system closer to a real-world system. The data generation model suggests that the pairs \((u^{C}(i),u^{D}(i))\) \((i=1,\ldots ,N)\), i.e. a decision table, are outcomes of the random variables \((C, D)=((C(1),\ldots ,C(|C|)),D)\) observed from the population.

Based on the data generation model, STRIM (1) extracts significant pairs of a condition attribute and its value, such as \(C(j_{k})=v_{j_{k}}\), for rules of \(D=d\) by the local reduct [10, 12, 13]; (2) constructs a trial condition part of the rules, such as \(CP=\wedge _{j} (C(j_{k})=v_{j_{k}})\), using the reduct results; and (3) investigates whether U(CP) causes a bias at \(n_{d}\) in the frequency distribution of the decision attribute values \(f=(n_{1},n_{2},\ldots ,n_{M_{D}})\), where \(n_{m}=|U(CP)\,\cap \,U(m)|\) \((m=1,\ldots ,|V_{a=D}|=M_{D})\) and \({U}(m)=\{u(i) \vert u^{D=m}(i)\}\), since any \(u^{C}(i)\) coinciding with an sCP(d, k) in the rule box is transformed into \(u^{D}(i)\) based on Hypothesis 1 or 3. Accordingly, a CP coinciding with one of the rules in the rule box produces a bias in f. Specifically, STRIM uses a statistical test for this investigation, specifying the null hypothesis H0: f does not have any bias, that is, CP is not a rule; its alternative hypothesis H1: f has a bias, that is, CP is a rule; and a proper significance level, and tests H0 by use of the sample dataset, that is, the decision table, with a proper test statistic, for example,

$$ z=\frac{(n_{d}+0.5-np_{d})}{(np_{d}(1-p_{d}))^{0.5}}, $$

where \(n_{d}=\max _{m} n_{m}\) for \(f=(n_{1},\ldots ,n_{m},\ldots ,n_{M_{D}})\), \(p_{d}=P(D=d)\), and \(n=\sum _{j=1}^{M_{D}} n_{j}\). z obeys the standard normal distribution under the test conditions \(np_{d} \ge 5\) and \(n(1-p_{d}) \ge 5\) [19], and is considered an index of the bias of f. (4) If H0 is rejected, the assumed CP becomes a candidate for the rules in the rule box. (5) After repeating processes (1) to (4) and obtaining the set of rule candidates, STRIM arranges the rule candidates and induces the final results (see [12, 13] for details).
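For concreteness, steps (3) and (4) can be computed as follows. This is a minimal sketch in R; the frequency distribution f is the one reported for rule “0022002” in Sect. 3:

```r
# Test statistic z for a trial CP, with f the frequency distribution of the
# decision attribute values over U(CP) and p_d = P(D = d) = 1/5.
f  <- c(4, 216, 8, 5, 4)                    # (n_1, ..., n_{M_D}) from Sect. 3
n  <- sum(f); nd <- max(f); pd <- 1 / 5
stopifnot(n * pd >= 5, n * (1 - pd) >= 5)   # test conditions [19]
z  <- (nd + 0.5 - n * pd) / sqrt(n * pd * (1 - pd))
pv <- pnorm(z, lower.tail = FALSE)          # one-sided p-value under H0
```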

To summarize, STRIM directly induces rules at a statistical significance level by assuming the condition part of rules, \(CP=\wedge _{j}(C(j_{k})=v_{j_{k}})\), and statistically testing it by use of U. STRIM does not require the basic concept of approximation, which is the key point of rule induction by RS theory. Conversely, RS theory has nothing directly to do with statistical significance.

3 Studies on STRIM by Simulation Experiment

Table 2. An example of pre-specified rules R(d, k) in the rule box: if sCP(d, k) then \(D=d\) (\(d=1, \ldots , 5\), \(k=1, 2\)).
Table 3. An example of estimated rules by the conventional STRIM for the DT with \(N_{B}=5,\!000\) generated by the data generation model in Fig. 1 and the pre-specified rules in Table 2.

We implemented the data generation process and verified the capacity of the conventional STRIM to induce the rules as follows (a code sketch of this process is given below): (1) Specified the rules in Table 2 (the number of rules \(N_{rule}=10\)) in the rule box in Fig. 1, where \(|C|=6\) and \(V_{a}=\{1,2,\ldots ,5\}\) (\(a=C(j)\) \((j=1,\ldots ,|C|)\), \(a=D\)); for example, \(sCP(1,1)=110000\) denoted \(sCP(1,1)=(C(1)=1) \wedge (C(2)=1)\) and was called a rule of rule length 2 (\(RL=2\)), having two conditions. (2) Generated \(v_{C(j)}(i)\) \((j=1,\ldots ,|C|=6)\) with a uniform distribution and formed \(u^{C}(i)=(v_{C(1)}(i),\ldots ,v_{C(6)}(i))\) \((i=1,\ldots ,N=10,\!000)\). (3) Transformed \(u^{C}(i)\) into \(u^{D}(i)\) using the pre-specified rules in Table 2 and the hypotheses in Table 1, without generating NoiseC and NoiseD for a plain experiment, and thus generated the decision table. After randomly selecting \(N_{B}=5,\!000\) samples, newly forming the DT, and applying STRIM to it, Table 3 was obtained. The table shows the following: for example, the estimated Rule No. 1 (\(RN=1\)) “0022002”, denoting if \((C(3)=2) \wedge (C(4)=2)\) then \(D=2\), has \(f=(n_{1},n_{2},\ldots ,n_{5})=(4,216,8,5,4)\) and a bias at \(D=2\). The probability of such a bias occurring is around 5.87E-173 under H0, which leads to rejecting H0 and adopting H1. As a result, “0022002” was adopted as a rule. It should be noted that the reason it was adopted was not its high \(accuracy=216/237=0.911\). STRIM induced exactly the pre-specified rules in Table 2. This experiment suggests that the conventional STRIM works well and that the DGM in Fig. 1 can be useful as a verification system for a rule induction method.
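The following is a minimal sketch of steps (1)–(3) in R. The text gives \(sCP(1,1)=110000\) explicitly and Table 3 implies the rule “002200” for \(D=2\); the rule box below uses only these two patterns instead of reproducing all ten rules of Table 2:

```r
# Data generation model of Fig. 1 without NoiseC/NoiseD (plain experiment).
set.seed(1)
nC <- 6; Va <- 1:5; N <- 10000
rules <- data.frame(              # reduced rule box; "0" means "don't care"
  sCP = c("110000", "002200"),    # sCP(1,1) from the text; "002200" from Table 3
  d   = c(1, 2))
matches <- function(uC, sCP) {    # does u^C(i) satisfy the pattern sCP?
  p <- as.integer(strsplit(sCP, "")[[1]])
  all(p == 0 | p == uC)
}
uC <- t(replicate(N, sample(Va, nC, replace = TRUE)))  # step (2): uniform inputs
D  <- apply(uC, 1, function(u) {                       # step (3): apply rule box
  hit <- rules$d[vapply(seq_len(nrow(rules)),
                        function(k) matches(u, rules$sCP[k]), logical(1))]
  if (length(unique(hit)) == 1) hit[1]  # uniquely determined case (Table 1)
  else sample(Va, 1)                    # indifferent/conflicted cases, simplified
})
DT <- data.frame(uC, D)                 # the generated decision table
```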

4 New Rule Induction Method by Co-occurrence Set

As mentioned in Sect. 2, the conventional STRIM regards the condition attributes C and the decision attribute D as random variables, where the output D depends on the input C, the rules, and the hypotheses, and the rule induction method focuses on \(P(D=d \vert CP)\) in \(P(CP, D=d)=P(CP)P(D=d \vert CP)\). That is, STRIM regards \(P(D=d \vert CP)\) as P(if CP then \(D=d\)) and explores the \(CP=\wedge _{j}(C(j_{k})=v_{j_{k}})\) which causes a bias at \(n_{d}\) in \(f=(n_{1}, n_{2}, \ldots ,n_{M_{D}})\). The bias can be detected by use of the DT and the statistical test.

Fig. 2. An algorithm for apriori-STRIM written in R language style.

Table 4. An example of FIS extracted from \(U(D=1)\) of the DT corresponding to Table 3.

From the view of Bayes' law, however, another strategy, focusing on \(P(CP \vert D=d)\) in \(P(CP, D=d)=P(D=d)P(CP \vert D=d)\), can be considered for rule induction. That is, after receiving the outputs of \(D=d\), a strategy that explores and estimates the \(CP=\wedge _{j}(C(j_{k})=v_{j_{k}})\) of the corresponding inputs can also be valid. Specifically, when the outputs are received, the corresponding inputs are classified into two cases: one is the uniquely determined and/or conflicted case, and the other is the indifferent case (see Table 1). The two cases can be easily distinguished from each other by a statistical test specifying the null hypothesis H0: the event \(D=d\) has occurred by chance (the indifferent case), and the alternative hypothesis H1: the event \(D=d\) has not occurred by chance (the uniquely determined and/or conflicted case). Under H0, \(P(CP \vert D=d)=P(CP)\), and the intervention of rules transforming the inputs into the output is denied. If H0 is rejected, H1 is adopted and CP becomes a rule candidate, which means \(P(CP \vert D=d) \ne P(CP)\). Such hypothesis testing can be easily executed by finding the co-occurrence sets with the event \(D=d\), since the concept of the co-occurrence set is well known in the field of association rule learning (ARL) [17] and the search can be executed efficiently by the Apriori algorithm [18]. This paper therefore names the rule induction method apriori-STRIM. ARL and the Apriori algorithm are summarized in Appendix A, and the differences in the idea of rule induction between ARL and apriori-STRIM are shown through the same experiment as in Sect. 3, that is, by the VSofRIM. See Appendix A for background that makes the following easier to understand.

The specific procedure for apriori-STRIM is shown in Fig. 2, written in R language style since the Apriori algorithm is already implemented in that language as the well-regarded apriori() function. The outline is as follows: Line No. 1 (\(LN=1\)) installs the ARL package “arules” [20] via the internet, and \(LN=2\) loads the “arules” library. \(LN=3\) reads the DT into “data”. From \(LN=4\) to \(LN=15\), every value iD of the decision attribute (\(=1 \sim M_{D}\)) is processed. \(LN=5\) assigns \(U(iD)=\{u(i) \vert u^{D=iD}(i)\}\) to dataiD; its condition part dataCiD is extracted at \(LN=6\) and transformed into the transaction form CiD.tra at \(LN=7\). \(LN=8\) extracts the co-occurrence sets of CiD.tra, that is, the co-occurrence sets of the condition attribute values of U(iD), as frequent item sets whose support is greater than or equal to the parameter supp0, and assigns them to CiD.ap. \(LN=9\) extracts the set of frequent item sets (SFIS). From \(LN=10\) to \(LN=13\), the p-value of every SFIS(iCo) (\(iCo=1, \ldots , \vert SFIS \vert \)) is calculated and tested for whether it is less than a pre-specified p-value0, and SFIS(iCo) is saved as a candidate if it satisfies the condition. \(LN=14\) arranges the candidates having inclusion relationships by representing them with the candidate with the least p-value, and finally decides the rules for iD.
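The following is a sketch reconstructing this procedure from the line-by-line description; it is not the authors' exact Fig. 2 listing. It assumes the decision table DT from the generation sketch in Sect. 3 and uniform \(V_{a}\), so that \(P(CP)\) under H0 is \((1/\vert V_{a} \vert )^{RL}\), as explained in Sect. 5:

```r
# Reconstruction of the Fig. 2 procedure (see the lead-in caveats).
library(arules)                                      # LN=1-2: install once, then load
data <- DT                                           # LN=3: the decision table
MD <- 5; nVa <- 5; pvalue0 <- 1.0e-10                # pvalue0 = p-value0 in Fig. 2
for (iD in 1:MD) {                                   # LN=4-15
  dataiD  <- data[data$D == iD, ]                    # LN=5: U(iD)
  dataCiD <- dataiD[, names(dataiD) != "D"]          # LN=6: condition part only
  CiD.tra <- as(as.data.frame(lapply(dataCiD, factor)), "transactions")   # LN=7
  supp0   <- 5 * nVa / nrow(dataiD)                  # see Sect. 5 for this choice
  CiD.ap  <- apriori(CiD.tra,                        # LN=8: frequent item sets
               parameter = list(support = supp0, target = "frequent itemsets"))
  n <- nrow(dataiD); cand <- list()
  for (iCo in seq_len(length(CiD.ap))) {             # LN=10-13: test each FIS
    RL  <- size(CiD.ap)[iCo]                         # rule length |items|
    p   <- (1 / nVa)^RL                              # P(CP) under H0
    cnt <- round(quality(CiD.ap)$support[iCo] * n)   # co-occurrence count
    pv  <- pbinom(cnt - 1, n, p, lower.tail = FALSE) # P(X >= cnt) under H0
    if (pv < pvalue0)                                # LN=11-13: save candidates
      cand[[length(cand) + 1]] <- list(items = labels(CiD.ap[iCo]), p.value = pv)
  }
  # LN=14: among candidates in an inclusion relationship, represent them by
  # the one with the least p-value to decide the rules for D = iD.
}
```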

5 Studies on Apriori-STRIM by Simulation Experiment

An experimental result for rule induction by the conventional STRIM was shown in Table 3, obtained by applying it to the DT containing the pre-specified rules. We also applied apriori-STRIM to the same DT and show the results following the process in Fig. 2.

Table 5. An example of rule candidates extracted from Table 4.

Table 4 shows part of the SFIS obtained at \(LN=8\) and 9 for \(iD=1\), that is, \(D=1\). Here, \(supp0=5 \cdot \vert V_{a} \vert / \vert U(iD) \vert \), \(\forall a \in C\), was used for exploring the FISs. This specification secures \({freq}({FIS}) \ge 5 \cdot \vert V_{a} \vert = count0\) for the hypothesis testing at \(LN=11\) and 12, and induced \(\vert {SFIS} \vert =403\). The table shows the following: FIS Nos. 1–30 (\(NFIS=1 - 30\)) are \({FIS}(\vert {items} \vert =1)\), \(NFIS=31 - 353\) are \({FIS}(\vert {items} \vert =2)\), and \(NFIS=354 - 403\) are \({FIS}(\vert {items} \vert =3)\). For example, \(NFIS=387\) indicates that the co-occurrence set \(items=\{C(3)=1,C(5)=2,C(6)=3\}\), that is, the pattern \(CP=(C(3)=1) \wedge (C(5)=2) \wedge (C(6)=3)\), occurred \(count=25\) times in \(U(D=1)=\{u(i) \vert u^{D=1}(i)\}\) with \(\vert U(D=1) \vert =1,\!033\). The pre-specified rules for \({D=1}\) in Table 2 are R(1, 1) and R(1, 2), which appear at \(NFIS=350\) and 351 respectively.

\(LN=12\) induced the significant FIS patterns in Table 4 by hypothesis testing under H0. The frequency X of a co-occurrence pattern CP obeys the binomial distribution Bn(n, p) with expectation np, where \(n=\vert U(D=1) \vert \) and \(p=\prod _{a \in CP} \frac{1}{\vert V_{a} \vert }\); for example, \(p=(\frac{1}{5})^{3}\) at \(NFIS=387\) since \(RL=3\). One specification for supp0 was to satisfy the requirement \(Xp \ge 5\) for \(RL=1\), by analogy with the conventional STRIM (see the test conditions [19] in Sect. 2). That is, \(\min X = count0 = \frac{5}{p}=5 \cdot \vert V_{a} \vert \). Indeed, \(count \ge count0=25\) is satisfied in Table 4. As shown in Appendix A, a small count0 tends to generate a large number of meaningless FISs, and conversely a large one increases the possibility of missing valid FISs.
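For instance, the test at \(LN=12\) for the pattern of \(NFIS=387\) can be reproduced with the quoted values (a minimal sketch):

```r
# One-sided binomial test at LN = 12 for NFIS = 387:
# RL = 3, count = 25, n = |U(D=1)| = 1033.
n  <- 1033; p <- (1 / 5)^3                       # P(CP) under H0
pv <- pbinom(25 - 1, n, p, lower.tail = FALSE)   # P(X >= 25) under Bn(n, p)
```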

Table 6. Finally estimated rules by apriori-STRIM for the DT corresponding to those of Table 3.

\(LN=12\) reduced the 403 FISs in Table 4 to 46 rule candidates, using p-value0 \(=1.0E-10\) this time, and saved them with their p-value, accuracy, coverage, and so on. Table 5 shows the first nine candidates after sorting them in ascending order of p-value. \(RN=1\) and 2 coincide with R(1, 2) and R(1, 1) respectively. \(RN=3\) can be represented by \(RN=2\), which has the smaller p-value, and in the same manner all the following candidates were arranged and represented by \(RN=1\) or 2 at \(LN=14\). Table 6 shows the final rule induction results, including those for \(D=2, \ldots ,5\), by apriori-STRIM.

Comparing Table 6 by apriori-STRIM with Table 3 by the conventional STRIM, the following can be seen:

(1) Both methods statistically induce the pre-specified rules in Table 2 in proper quantities, and the corresponding figures coincide with each other.

(2) The difference between the two tables lies in their headings: \(f=(n_{1},\) \(n_{2},\) \(\ldots ,\) \(n_{5})\) versus count. The former focuses on \(P(D=d \vert CP)\) and adopts the strongest bias of the distribution of D given CP. The latter focuses on \(P(CP \vert D=d)\) and adopts the strongest intervention by rules, which appears in the p-value of the co-occurrence set (pattern) in Table 5.

In the same way, comparing Table 6 and/or Table 3 with Table 9 and/or Table 10 by association rule learning (ARL), the following can be seen:

(3) ARL first focuses on the co-occurrence set of (CP, D), whose count directly connects to P(CP, D), and induces rules by use of parameters such as support, confidence, and so on. However, ARL has no way of distinguishing the co-occurrence sets produced by rules from those produced by chance, since ARL has no model for the distinction.

(4) Related to (3), ARL also has no way of arranging a large number of rule candidates, as shown in Appendix A, although it has useful indexes such as support, confidence, lift, and so on. That is, ARL, based on P(CP, D), does not seem to focus closely on inducing if-then rules, although it can induce the co-occurrence sets.

6 Conclusion

This paper summarized the rule induction methods based on the conventional RS theory and their statistically improved method STRIM, showing the validity and capacity of STRIM in the VSofRIM. The conventional STRIM focused on \(P(D=d \vert CP)\), which can be recognized as a probabilistic structure transforming the input CP into the corresponding output D, and used the structure and the DT for inducing the if-then rules that cause the bias in the distribution of D. From this view, a new rule induction method focusing on \(P(CP \vert D=d)\) was proposed. Specifically, the method estimates the inputs after receiving the outputs, by exploring the co-occurrence sets of \(U(d)=\{u(i) \vert u^{D=d}(i)\}\) and statistically testing each explored set under H0: \(P(CP \vert D=d)=P(CP)\). The exploration is executed by the Apriori algorithm developed in the field of ARL, and the new method was named apriori-STRIM. The validity and capacity of apriori-STRIM were confirmed by applying it to the same DT as the conventional STRIM. The similarities and differences among the conventional STRIM, apriori-STRIM, and ARL were clarified through the same simulation dataset, that is, the VSofRIM.

Table 7. An example of transaction dataset.
Table 8. An example of the set of FIS (SFIS) for Table 7 (\(\theta _{0}=3\)).

Future studies will focus on the following:

(1) To examine the performance differences among the above three methods, which were studied here with a plain data generation model, using a model much closer to the real world.

(2) To apply the three methods to real-world datasets after finishing study (1).

(3) To expand the DT to a transaction database and study whether both STRIMs can be applied to such a database and work effectively.