1 Introduction

Rough Sets theory was introduced by Pawlak [1] and is used for inducing if-then rules from a dataset called the decision table. The induced if-then rules simply and clearly express the structure of rating and/or knowledge hidden behind the decision table. Such rule induction methods are needed for disease diagnosis systems, discrimination problems, decision problems and other applications. The first step of rule induction is to find the condition attributes which have no relationship with the decision attribute, to remove them and finally to reduce the table. This process of obtaining the reduced table is useful for efficiently inducing rules and is called a reduct. The conventional Rough Sets theory for inducing if-then rules is based on the indiscernibility of the samples in the table. The reducts of the conventional methods also use the same concept, and various types of indiscernibility, methods to find them and algorithms for the reducts have been proposed to date [2,3,4,5,6,7].

This paper retests the conventional reduct methods through the use of a simulation dataset and points out their problems after summarizing the conventional Rough Sets and reduct methods. Then a new reduct method is proposed to overcome these problems from a statistical point of view. Specifically, the new method regards each sample in the decision table as an outcome of the random variables of the tuple of the condition attributes and the decision attribute, since the dataset is obtained from the population of interest. Accordingly, the reduct problem can be replaced by the problem of finding the condition attributes which are statistically independent of the decision attribute and/or its values. The statistical independence can be easily tested, for example, by a Chi-square test using the dataset. The validity of the new reduct method is confirmed by applying it to the simulation dataset. The experiment also gives an idea for improving STRIM (Statistical Test Rule Induction Method [8,9,10,11]) to include the reduct function and to induce if-then rules more efficiently. The usefulness of the reduct method and the improved STRIM is also confirmed by applying them to a UCI dataset [12] prepared for machine learning.

Table 1. An example of a decision table.

2 Conventional Rough Sets and Reduct Method

Rough Sets theory is used for inducing if-then rules from a decision table S. S is conventionally denoted \(S=(U, A=C \cup \{D\}, V, \rho )\). Here, \(U=\{u(i) | i=1, ..., |U|=N\}\) is a sample set, A is an attribute set, \(C=\{C(j) | j=1, ..., |C|\}\) is a condition attribute set, C(j) is a member of C and a condition attribute and D is a decision attribute. V is a set of attribute values denoted by \(V=\cup _{a\in A} {V_{a}}\) and is characterized by an information function \(\rho \): \(U \times A \rightarrow V\). Table 1 shows an example where \(|C|=6\), \(|V_{a=C(j)}|=M_{C(j)}=6\), \(|V_{a=D}|=M_D=6\), \(\rho (x=u(1), a=C(1))=5\), \(\rho (x=u(2), a=C(2))=5\), and so on.

Rough Sets theory focuses on the following equivalence relation and equivalence set of indiscernibility:

$$ I_{C} = \{(u(i),u(j)) \in U^{2} \vert \rho (u(i),a) = \rho (u(j),a), \forall a \in C\}. $$

\(I_{C}\) derives the quotient set \(U/I_{C} =\{[u_{i}]_{C} \vert i=1,2,...\}\). Here, \([u_{i}]_{C} = \{u(j) \in U \vert (u(j),u_{i}) \in I_{C}, u_{i} \in U\}\). \([u_{i}]_{C}\) is an equivalence set with the representative element \(u_{i}\) and is called an element set of C in Rough Sets theory [2]. Let \(X \subseteq U\); then X can be approximated as \(C_{*}(X) \subseteq X \subseteq C^{*}(X)\) by use of the element sets. Here,

$$\begin{aligned} C_{*}(X) = \{u_{i} \in U \vert [u_{i}]_{C} \subseteq X\}, \end{aligned}$$
(1)
$$\begin{aligned} C^{*}(X) = \{u_{i} \in U \vert [u_{i}]_{C} \cap X \ne \emptyset \}. \end{aligned}$$
(2)

\(C_{*}(X)\) and \(C^{*}(X)\) are called the lower and upper approximations of X by C, respectively. The pair \((C_{*}(X),C^{*}(X))\) is usually called a rough set of X by C. Specifically, let \(X=D_{d} = \{u(i) \vert \rho (u(i),D)=d\}\) be called the concept \(D=d\); then \(C_{*}(X)\) is surely a set satisfying \(D=d\), since \(C_{*}(X) \subseteq X\), and it derives if-then rules of \(D=d\) with necessity.
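As an illustration of the element sets and of Eqs. (1)–(2), the following is a minimal Python sketch; the function names, the toy table and the attribute encoding are illustrative and not part of the original formulation.

```python
from collections import defaultdict

def element_sets(U, C, rho):
    """Group samples by their value tuple on the attributes in C,
    i.e. the element sets [u]_C forming the quotient set U/I_C."""
    classes = defaultdict(set)
    for u in U:
        classes[tuple(rho(u, a) for a in C)].add(u)
    return list(classes.values())

def lower_upper(U, C, rho, X):
    """Lower and upper approximations C_*(X) and C^*(X) of Eqs. (1)-(2)."""
    lower, upper = set(), set()
    for eq in element_sets(U, C, rho):
        if eq <= X:            # [u]_C lies entirely inside X
            lower |= eq
        if eq & X:             # [u]_C intersects X
            upper |= eq
    return lower, upper

# toy usage: three condition attributes (indexes 0-2), decision D (index 3)
toy = {1: (1, 1, 2, 1), 2: (1, 1, 2, 2), 3: (2, 3, 1, 1)}   # u -> attribute values
rho = lambda u, a: toy[u][a]
U, C = set(toy), [0, 1, 2]
X = {u for u in U if rho(u, 3) == 1}                         # concept D = 1
print(lower_upper(U, C, rho, X))                             # ({3}, {1, 2, 3})
```

In the toy table, u(1) and u(2) share the same condition attribute values but differ on D, so only u(3) enters the lower approximation of the concept \(D=1\).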

The conventional Rough Sets theory seeks a minimal subset of C denoted with \(B (\subseteq C)\) satisfying the following two conditions:

  (i) \(B_{*}(D_{d})=C_{*}(D_{d})\), \(d=1,2,...,M_{D}\);

  (ii) there is no \(a \in B\) satisfying \((B-\{a\})_{*}(D_{d})=C_{*}(D_{d})\) \((d=1,2,...,M_{D})\).

\(B (\subseteq C)\) is called a relative reduct of \(\{D_{d} \vert d=1,...,M_{D}\}\) preserving the lower approximation and is useful for finding if-then rules since redundant condition attributes have been already removed from C.

Fig. 1. An example of the LEM1 algorithm.

The LEM1 algorithm [2] and the discernibility matrix method (DMM) [3] are well known as representative methods for deriving reducts. Figure 1 shows an example of LEM1; A and \(\{d\}^{*}\) at Line 1 of the figure correspond respectively to C and \(\{D_{d} \vert d=1,...,M_{D}\}\) in this paper. LEM1, from Line 6 to 16 in the figure, in principle checks conditions (i) and (ii) for all the combinations of the condition attributes.
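Conditions (i) and (ii) can also be checked directly by brute force. The sketch below does so using the lower_upper() helper from the previous sketch; it is only a naive illustration of the two conditions, not a reproduction of the LEM1 procedure of Fig. 1, and `concepts` stands for the family \(\{D_{d} \vert d=1,...,M_{D}\}\).

```python
def is_reduct(U, C, rho, B, concepts):
    """Direct brute-force check of conditions (i) and (ii) for a candidate
    B subset of C, reusing lower_upper() from the sketch above."""
    def preserves(A):
        return all(lower_upper(U, A, rho, X)[0] == lower_upper(U, C, rho, X)[0]
                   for X in concepts)
    # (i) B preserves every lower approximation; (ii) no a in B can be dropped
    return preserves(list(B)) and \
           not any(preserves([a for a in B if a != b]) for b in B)
```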

DMM [3] first forms a symmetric \(N \times N\) matrix whose \((i,j)\) element \(\delta _{ij}\) is defined as follows:

$$ \delta _{ij}= \begin{cases} \{a \in C \vert \rho (u(i),a) \ne \rho (u(j),a)\}, & \exists d \in D,\ \rho (u(i),d) \ne \rho (u(j),d) \mathrm{~and~} \{u(i),u(j)\} \cap Pos(D) \ne \emptyset ,\\ *, & \mathrm{otherwise.} \end{cases} $$

Here, \(Pos(D)=\cup _{d=1}^{M_{D}} C_{*}(D_{d})\) and \(*\) denotes “don’t care”. Then, a relative reduct preserving the lower approximation can be obtained by the following expression:

$$\begin{aligned} F^{reduct}=\bigwedge _{i,j:i<j} \bigvee \delta _{ij}. \end{aligned}$$
(3)
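A small sketch of DMM in the same Python setting is shown below. It builds the non-“don’t care” \(\delta _{ij}\) entries and evaluates (3) by brute force, reading off the minimal attribute subsets that intersect every entry; `pos` stands for \(Pos(D)\), which can be obtained from the lower approximations above. The names and the brute-force search are illustrative only, not the author's implementation.

```python
from itertools import combinations

def discernibility_entries(U, C, D, rho, pos):
    """The non-'don't care' delta_ij entries: attribute sets on which u(i), u(j)
    differ, kept only when they differ on D and at least one lies in Pos(D)."""
    entries = []
    for i, j in combinations(sorted(U), 2):
        if rho(i, D) != rho(j, D) and (i in pos or j in pos):
            diff = frozenset(a for a in C if rho(i, a) != rho(j, a))
            if diff:
                entries.append(diff)
    return entries

def reducts_by_dmm(U, C, D, rho, pos):
    """Brute-force reading of Eq. (3): minimal attribute subsets intersecting
    every delta_ij, i.e. satisfying the conjunction of disjunctions."""
    # pos = Pos(D) = union of the lower approximations C_*(D_d), e.g. via lower_upper()
    entries = discernibility_entries(U, C, D, rho, pos)
    reducts = []
    for r in range(1, len(C) + 1):
        for B in combinations(C, r):
            if all(set(B) & e for e in entries) and \
               not any(set(h) <= set(B) for h in reducts):
                reducts.append(B)
    return reducts
```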
Fig. 2. A data generation model for a decision table contaminated with noise.

Table 2. Hypotheses with regard to the decision attribute value.

3 Retests of the Conventional Reduct Method

Here we retest the ability of the reducts obtained through LEM1 and DMM by use of a simulation dataset. Figure 2 [8,9,10,11] shows how the simulation datasets are generated. Specifically, (a) generate the condition attribute values of u(i), that is, \(u^{C}(i)=(v_{C(1)}(i), v_{C(2)}(i), ..., v_{C(\vert C \vert )}(i))\), by use of random numbers with a uniform distribution, (b) determine the decision attribute value \(u^{D}(i)\) of u(i) (without NoiseC and NoiseD, for a plain experiment) by use of if-then rules specified in advance and the hypotheses shown in Table 2, and repeat processes (a) and (b) N times. Table 1 shows an example dataset generated by these procedures with the following if-then rules R(d) specified in advance:

$$\begin{aligned} R(d): \mathrm{~if~} Rd \mathrm{~then~} D=d \quad (d=1,...,M_{D}=6), \end{aligned}$$
(4)

where \(Rd=(C(1)=d) \wedge (C(2)=d) \vee (C(3)=d) \wedge (C(4)=d)\).
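Under the stated assumptions (uniformly distributed condition attribute values, rules (4), and the hypotheses of Table 2 as described in Sect. 4, with NoiseC and NoiseD omitted for the plain experiment), a decision table of this kind can be simulated roughly as in the following sketch; the exact generator used in [8,9,10,11] may differ.

```python
import random

M_D, NUM_C = 6, 6          # |V_D| and |C| of Table 1

def decide(c):
    """Decision value under rules (4) and the hypotheses of Table 2 as described
    in Sect. 4: a unique matching rule fixes D; no match gives a uniformly random
    D; conflicting matches pick uniformly among the matched d's."""
    matched = [d for d in range(1, M_D + 1)
               if (c[0] == d and c[1] == d) or (c[2] == d and c[3] == d)]
    return random.choice(matched) if matched else random.randint(1, M_D)

def generate_table(N, seed=0):
    random.seed(seed)
    rows = []
    for _ in range(N):
        c = [random.randint(1, 6) for _ in range(NUM_C)]     # step (a): uniform C
        rows.append(c + [decide(c)])                          # step (b): D by rules
    return rows

table = generate_table(10000)    # rows are [v_C(1), ..., v_C(6), u_D]
```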

The results of retesting both methods using the \(N=10,000\) dataset showed \(F_{LEM1}^{reduct} = F_{DMM}^{reduct} = C(1) \wedge C(2) \wedge C(3) \wedge C(4) \wedge C(5) \wedge C(6)\), while the expected result was \(F^{reduct}=C(1) \wedge C(2) \wedge C(3) \wedge C(4)\) from the rules (4) specified in advance. The retest experiment was repeated three times, changing the generated dataset, and the same results were obtained.

These results clearly derive from the indiscernibility and/or discernibility based on the element sets, which cannot distinguish between the samples determined by the if-then rules (see Hypothesis 1 in Table 2) and those obtained by chance (see Hypotheses 2 and 3 in Table 2).

4 Proposal of Statistical Reduct Method

As mentioned in Sect. 3, the conventional reduct methods are unable to reproduce the forms of the reducts specified in advance from the decision table, due to a lack of ability to adapt to the indifferent and conflicting samples in datasets, despite the fact that real-world datasets will contain such samples. This paper studies this problem of reducts from the viewpoint of STRIM (Statistical Test Rule Induction Method) [8,9,10,11]. STRIM regards the decision table as a sample set obtained from the population of interest based on the input-output system shown in Fig. 2. According to this statistical model, \(u(i)=(u^{C}(i), u^{D}(i))=(v_{C(1)}(i), v_{C(2)}(i), ..., v_{C(\vert C \vert )}(i), u^{D}(i))\) is an outcome of the random variables \(A=(C,D)=(C(1),C(2),...,C(|C|),D)\) (hereafter, the names of the attributes are used as the random variables). Then, the following probability model is specified: for any j, \(P(C(j)=v_{C(j)}(k))=p(j,k)\), \(\sum \nolimits _{k=1}^{M_{C(j)}} {p(j,k)}=1\), and for any \(j1 \ne j2\), C(j1) and C(j2) are independent of each other for simplicity. According to the rules specified in (4), if \(C=(1, 1, 2, 3, 4, 5)\) (hereafter (112345) for short), for example, then \(P(D=1|C=(112345))=1.0\) by use of Hypothesis 1 in Table 2. If \(C=(123456)\) then \(P(D=1|C=(123456))=1/M_{D}=1/6\) by use of Hypothesis 2. If \(C=(112256)\) then \(P(D=1|C=(112256))=1/2\) by use of Hypothesis 3. Generally, the outcome of the random variable D is determined by the outcome of C, the if-then rules (generally unknown) and the hypotheses shown in Table 2. Consequently, the following expression is obtained:

$$\begin{aligned} P(D=l,C=u^{C}(i))=P(D=l \vert C=u^{C}(i)) P(C=u^{C}(i)). \end{aligned}$$
(5)

Here, \(P(D=l \vert C=u^{C}(i))\) is the conditional probability of \(D=l\) given \(C=u^{C}(i)\) and strongly depends on the if-then rules to be induced.

In the special case where C(j) does not appear in the condition part of the if-then rules of \(D=l\), the event \(D=l\) is independent of C(j), that is, \(P(D=l,C(j))=P(D=l \vert C(j))P(C(j))=P(D=l)P(C(j))\). This independence between \(D=l\) and C(j) can be used for a reduct of the decision table for the concept \(D=l\). The problem of whether they are independent or not can be easily dealt with using a statistical test of hypotheses by the use of \(\{u(i)=(v_{C(1)}(i),v_{C(2)}(i),...,v_{C(\vert C\vert )}(i),u^{D}(i)) | i=1,...,N\}\). Specifically, the following null hypothesis H0(jl) and its alternative hypothesis H1(jl) \((j=1,...,|C|,\ l=1,...,M_{D})\) are specified and tested:

  • H0(jl): C(j) and \(D=l\) are independent of each other.

  • H1(jl): C(j) and \(D=l\) are not independent of each other.

This paper adopts a Chi-square test since it is a standard method for testing the independence of two categorical variables, by use of an \(M_{C(j)} \times 1\) contingency table. The test statistic \(\chi ^{2}\) for C(j) vs. \(D=l\) is

$$\begin{aligned} \chi ^{2}=\sum \limits _{k=1}^{M_{C(j)}} \frac{(f_{kl}-e_{kl})^{2}}{e_{kl}}, \end{aligned}$$
(6)

where \(f_{kl}=\vert U(C(j)=k) \cap U(D=l) \vert \), \(U(C(j)=k)=\{u(i) \vert \rho (u(i),C(j))=k\}\), \(U(D=l)=\{u(i) \vert u^{D}(i)=l\}\), \(e_{kl}=n\hat{p}(j,k)\hat{p}(D,l)\), \(n=\sum _{k=1}^{M_{C(j)}} \sum _{l=1}^{M_{D}} f_{kl}\), \(\hat{p}(j,k)=\frac{f_{k\_}}{n}\), \(\hat{p}(D,l)=\frac{f_{\_l}}{n}\), \(f_{k\_}=\sum _{l=1}^{M_{D}} f_{kl}\), \(f_{\_l}=\sum _{k=1}^{M_{C(j)}} f_{kl}\). \(\chi ^{2}\) obeys a Chi-square distribution with degrees of freedom \(df=(M_{C(j)}-1)\) under H0(jl) and the testing condition [13]: \(n\hat{p}(j,k)\hat{p}(D,l) \ge 5\). This paper proposes a reduct method that adopts only the C(j)s for which H0(jl) was rejected and constructs a decision table for \(D=l\) composed of them, since the hypothesis test can control only type I errors by a significance level, not type II errors. This paper names the proposed method the statistical reduct method (SRM) to distinguish it from the conventional methods.
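The statistic (6) and its p-value can be computed, for example, as in the following sketch; it assumes SciPy for the Chi-square distribution and reuses the simulated `table` from the sketch in Sect. 3, and the names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def srm_chi2(table, j, l, M_Cj=6, M_D=6):
    """Test statistic (6) and its p-value for C(j) vs. the concept D = l;
    `table` holds rows [v_C(1), ..., v_C(|C|), u_D] as in the Sect. 3 sketch."""
    f = np.zeros((M_Cj, M_D))
    for row in table:                          # full contingency table f_kl
        f[row[j - 1] - 1, row[-1] - 1] += 1
    n = f.sum()
    p_jk = f.sum(axis=1) / n                   # \hat{p}(j,k) = f_k_ / n
    p_Dl = f[:, l - 1].sum() / n               # \hat{p}(D,l) = f__l / n
    e = n * p_jk * p_Dl                        # e_kl
    assert (e >= 5).all(), "testing condition n p(j,k) p(D,l) >= 5 violated"
    stat = ((f[:, l - 1] - e) ** 2 / e).sum()
    return stat, chi2.sf(stat, df=M_Cj - 1)    # df = M_C(j) - 1 under H0(jl)

# keep C(j) for the reduct of D=l only when H0(jl) is rejected (small p-value)
for j in range(1, 7):
    print(j, srm_chi2(table, j, l=1))
```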

Table 3. Example of contingency table by a statistical reduct (\(N=3000\), \(df=5\)).

A simulation experiment was conducted to confirm the validity of the proposed method, using the decision table of \(N=10,000\) samples used in Sect. 3 and the following procedure:

Step 1: Randomly select \(N_{B}=3000\) samples from the decision table \((N=10,000)\), and form a new decision table.

Step 2: Apply SRM to the new table, and calculate \(\chi ^{2}\) for every C(j) vs. \(D=l\).

 

Table 3 shows an example of the contingency table for the case of \(D=1\) vs. C(j) \((j=1,...,6)\) and the results of a Chi-square test of them with \(df=(M_{C(j)}-1)\), and suggests the following knowledge:

  (1) The p-values of C(5) and C(6) are quite high compared with those of the other condition attributes and indicate that C(5) and C(6) are independent of \(D=1\); that is, they are redundant and should be removed from the viewpoint of the reduct.

  (2) The frequencies \(f_{kl=1}\) of \(C(1)=1\), \(C(2)=1\), \(C(3)=1\) and \(C(4)=1\) are relatively high compared with those of the rest of the same C(j) \((j=1,...,4)\). Accordingly, the combinations of \(C(j)=1\) \((j=1,...,4)\) will most likely construct the rules of \(D=1\), which coincides with the rules specified in (4).

The above knowledge of (1) and (2) was also confirmed for the case of \(D=l\) \((l=2,...,6)\) and coincided with the specifications of Rules (4), and thus through them the validity and usefulness of SRM have been confirmed.

5 Proposal of Improved STRIM

STRIM has been proposed as a method to induce if-then rules from decision tables in two stages [8,9,10,11]. The first stage searches for rule candidates by the following procedure:

 

Step 1: Specify a proper condition part of a trying if-then rule:

$$\begin{aligned} CP(k)=\bigwedge _{j} (C(j_{k})=v_{k}). \end{aligned}$$
(7)
Step 2: Test the condition part against the null hypothesis (H0) that CP(k) is not a rule candidate and its alternative hypothesis (H1), specifying a proper significance level. Specifically, use the test statistic \(z=\frac{(n_{d}+0.5-np_{d})}{\sqrt{np_{d}(1-p_{d})}}\), which obeys the normal distribution \(N(0,1^{2})\) under H0 if \(np_d \ge 5\) and \(n(1-p_d) \ge 5\) (testing condition [14]). Here, \(p_{d}=\frac{1}{M_{D}}\), \(n_{d}=\max (n_{1},n_{2},...,n_{M_{D}})\), \(n=\sum \nolimits _{m=1}^{M_{D}} n_{m}\), \(n_{m}=|U(CP(k)) \cap U(D=m)|\), \(U(CP(k))=\{u(i) \vert u^{C}(i) \mathrm{~satisfies~} CP(k)\}\), \(U(D=m)=\{u(i) \vert u^{D}(i)=m\}\).

Step 3: If H0 is rejected, add the trying rule to the rule candidates.

Step 4: Repeat Steps 1 to 3, changing the trying rule systematically, until its patterns are exhausted.

 

The basic notion of STRIM is that a rule creates a bias in the distribution of decision attribute values \((n_{1},n_{2},...,n_{M_{D}})\). It should be noted that \(P(D \vert CP(k))= P(\mathrm{~if~} CP(k) \mathrm{~then~} D)\) corresponds to (5) and can be estimated by \((n_{1},n_{2},...,n_{M_{D}})/n\) using the sample set.
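For reference, the test statistic z of Step 2 can be computed as in the following sketch; it reuses the simulated `table` and a dictionary representation of CP(k) introduced here for illustration only.

```python
import math

def strim_z(table, cp, M_D=6):
    """Test statistic z of Step 2 for a trying condition part CP(k); `cp`
    maps a 1-based attribute index j to the required value of C(j)."""
    matched = [row for row in table
               if all(row[j - 1] == v for j, v in cp.items())]
    n = len(matched)
    f = [sum(1 for row in matched if row[-1] == m) for m in range(1, M_D + 1)]
    n_d, p_d = max(f), 1.0 / M_D
    if n * p_d < 5 or n * (1 - p_d) < 5:       # testing condition [14]
        return None
    z = (n_d + 0.5 - n * p_d) / math.sqrt(n * p_d * (1 - p_d))
    return z, f.index(n_d) + 1, f              # z, the most frequent D value, f

# the trying rule C(1)=1 and C(2)=1 should bias f toward D=1 under rules (4)
print(strim_z(table, {1: 1, 2: 1}))
```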

The second stage arranges the rule candidates having inclusion relationships by representing them with the CP(k) of the maximum bias.

Fig. 3. An algorithm for STRIM including a reduct function.

However, the conventional STRIM [8,9,10,11] did not have the reduct function studied in Sect. 4, so it had to search CP(k)s in (7) including even the C(j)s that should be reducted, and induced many kinds of rule candidates that burdened the second stage. The knowledge of (1) and (2) studied in Sect. 4 can drastically reduce the search space by excluding idle C(j)s and/or their values. Figure 3 shows an algorithm for the improved STRIM described in a C language style, including the reduct function “attribute_reduct()” (Line 14–18) studied in Sect. 4. The first stage (Line 7–11) is executed for every \(D=l\) in the function “rule_check()” (Line 19–33) after operating the reduct (Line 8). The algorithm develops the patterns of trying rules implemented by the array “rule[]” (Line 4), for example, for \(D=1\) as (100000) \(\rightarrow \) (110000) \(\rightarrow \) (111000) \(\rightarrow \) (111100) \(\rightarrow \) (110100) \(\rightarrow \) (101100) \(\rightarrow \) (100100) \(\rightarrow \) (010000) ... \(\rightarrow \) (001100) \(\rightarrow \) (000100), by the operation “rule[ci] = rdct_max[cj]” (Line 22) and the recursive call “rule_check(ci,rule)” (Line 28), since “rdct_max[] = [1, 1, 1, 1, 0, 0]” and “rdct[] = [1, 1, 1, 1, 0, 0]” have been obtained at Line 9 (see Table 3). Accordingly, the number of trying rule patterns for R(d) specified in (4) is \((2^{4}-1) \times 6=90\). If the reduct function is not implemented, the number is \((6^{6}-1) \times 6=279,930\), which burdens the second stage with a heavy load. From this, the effectiveness of the improved STRIM can be seen.
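The reduction of the search space can be illustrated with the following sketch, which enumerates trying-rule patterns restricted to the attributes retained by the reduct. It simplifies the recursive rule_check() of Fig. 3 by assuming, as in rules (4) and Table 3, a single candidate value (d itself) per kept attribute for each concept \(D=d\); the names are illustrative.

```python
from itertools import product

def trying_patterns(kept_mask, M_D=6):
    """Enumerate trying condition parts restricted to the attributes kept by
    the reduct (a simplified, non-recursive view of rule_check(): one candidate
    value, d itself, per kept attribute and per concept D = d)."""
    kept = [j for j, keep in enumerate(kept_mask, start=1) if keep]
    patterns = []
    for d in range(1, M_D + 1):
        for mask in product([0, 1], repeat=len(kept)):
            if any(mask):                       # at least one attribute in CP(k)
                patterns.append((d, {j: d for j, use in zip(kept, mask) if use}))
    return patterns

pats = trying_patterns([1, 1, 1, 1, 0, 0])      # C(1)..C(4) kept, C(5), C(6) reducted
print(len(pats))                                # (2**4 - 1) * 6 = 90
```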

Table 4. An example of rule candidates.

Table 4 shows examples of the rule candidates arranged in descending order of the z value obtained from the distribution of decision attribute values \(f=(n_{1},n_{2},...,n_{M_{D}})\), by applying the improved STRIM to the dataset used for Table 3. We can see the following from the table:

  (1) The trying rule \(CP(k=1)=004400\) intensively biases the distribution of D, like \(f=(2,1,1,101,2,1)\), toward \(D=4\). Accordingly, the candidate is a rule for \(D=4\). The intensity of the bias can be measured by the z value as mentioned in Step 2.

  (2) There are inclusion relationships between rule candidates, for example, \(U(CP(1)=004400) \subset U(CP(14)=004000)\) and \(U(CP(17)=404000) \subset U(CP(14)=004000)\), while the z value of CP(1) > that of CP(14) and the z value of CP(14) > that of CP(17).

Table 5. Estimated rules for the decision table in Table 3 by improved STRIM.

The second stage at Line 12 in Fig. 3 arranges these candidates and represents them with only CP(1), which happens to coincide with the rule specified in (4). Table 5 shows the final results induced through the first and second stages, with D, f, p-value, accuracy and coverage besides CP(k). Here, accuracy and coverage are defined as follows:

$$ \mathrm {accuracy}= \frac{\vert U(CP(k)) \cap U(D=d) \vert }{\vert U(CP(k))\vert }, \quad \mathrm {coverage}= \frac{\vert U(CP(k)) \cap U(D=d) \vert }{\vert U(D=d)\vert }, $$

and they are often used as indexes of the validity of the induced rules in Rough Sets theory. The improved STRIM induced all twelve rules specified in advance from the decision table with \(N=3,000\), and also one extra rule. However, there are clear differences between them in the indexes of accuracy and coverage.
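With the dictionary representation of CP(k) used in the earlier sketches and the simulated `table`, accuracy and coverage can be computed, for example, as follows; the rule shown in the usage line is one of the rules specified in (4).

```python
def accuracy_coverage(table, cp, d):
    """Accuracy and coverage of 'if CP(k) then D = d' as defined above; `cp`
    is the dictionary representation of CP(k) used in the earlier sketches."""
    U_cp = [row for row in table if all(row[j - 1] == v for j, v in cp.items())]
    U_d  = [row for row in table if row[-1] == d]
    both = sum(1 for row in U_cp if row[-1] == d)
    return both / len(U_cp), both / len(U_d)

# e.g. the rule 'if C(3)=4 and C(4)=4 then D=4' specified in (4)
print(accuracy_coverage(table, {3: 4, 4: 4}, d=4))
```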

Table 6. An arrangement of Car Evaluation dataset of UCI.

6 An Example of Application for an Open Dataset

This paper applied SRM to the “Car Evaluation” dataset included in the literature [12]. Table 6 shows the summaries and specifications of the dataset: \(|C|=6\), \(|C(1)|=4\), ..., \(|\{D\}|=4\), \(N=|U|=|C(1)| \times \cdots \times |C(6)|=1,728\), which consists of every combination of the condition attributes’ values, and there were no conflicting or identical samples. The frequencies of D are heavily skewed toward \(D=1\), as shown in Table 6.

Table 7. Results by SRM for Car Evaluation dataset.

Table 7 shows the results obtained by SRM and suggests the following:

  (1) Given \(1.0E-5\) as the critical p-value, C(3) is commonly redundant at \(D=l\) \((l=1,...,4)\).

  (2) With regard to the if-then rule of \(D=1\), C(5) is redundant besides C(3). In the same way, so are C(2) and C(5) at \(D=2\), as well as C(5) at \(D=3\).

Table 8. Examples of contingency table and \(\chi ^{2}\) test by SRM ((a): \(D=1\) vs. C(j), (b): \(D=4\) vs. C(j) \((j=1,...,6)\)).

Table 8 shows examples of the contingency tables of \(D=1\) (a) and \(D=4\) (b), and their \(\chi ^{2}\) by SRM, and suggests the following knowledge:

  (1) With regard to the if-then rules of \(D=1\) (a), the frequencies of \(C(1)=1\), \(C(2)=1\), \(C(4)=1\) and \(C(6)=1\) are distinctively high. Accordingly, the if-then rules of \(D=1\) are supposed to be constructed by combinations of them, as shown in the knowledge (2) studied in Sect. 4.

  (2) In the same way, the if-then rules of \(D=4\) are constructed by combinations of \(C(1)=4\), \(C(2)=4\), \(C(5)=3\) and \(C(6)=3\).

Table 9. Estimated rules of \(D=1\) (a) and \(D=4\) (b) for Table 6.

Corresponding to Table 8, Table 9 shows the estimated rules of \(D=1\) (a) and \(D=4\) (b). With regard to the rules of \(D=1\), the improved STRIM clearly induces them. Expressed with the original notation in Table 6, the rule if person = “2” \(\vee \) safety = “low” \(\vee \) buying = “vhigh” \(\wedge \) maint = “vhigh” then class = “unacc” is obtained with accuracy \(=1.0\) and coverage \(=1,008/1,210 \approx 0.833\). In the same way, three examples of trying rules of \(D=4\) satisfying the testing condition \(n\hat{p}(j,k) \ge 5\) (for \(D=4\), \(n \ge \frac{5}{0.04}=125\)) are shown in Table 9 (b), although \(n_{d}=\max (n_{1},n_{2},...,n_{M_{D}})\) is not attained at \(D=4\) (in the table, \(D=4\) is forcibly entered). The first rule, if buying = “low” \(\wedge \) safety = “high” then class = “vgood”, is thought to be proper since its p-value is the best and its indexes of accuracy and coverage are moderate among the trying rules for \(D=4\). Both estimated rules, for \(D=1\) and \(D=4\), coincide with our common sense.

7 Conclusions

Rough Sets theory has been used for inducing if-then rules from the decision table. The first step in inducing the rules is to find reducts of the condition attributes. This paper retested the conventional reduct methods LEM1 [2] and DMM [3] by a simulation experiment, after summarizing the conventional Rough Sets theory, and pointed out their problems. This paper then proposed a new statistical reduct method (SRM) to overcome the problems of the conventional methods from the viewpoint of STRIM [8,9,10,11]. An improved STRIM including SRM was developed, and its validity and usefulness were confirmed by a simulation experiment and by application to an open UCI dataset for machine learning. The improved STRIM should be recognized as particularly useful not only for reducts of condition attributes but also for inducing if-then rules.