
1 Introduction

Rough Sets theory was introduced by Pawlak [1] and is used for inducing if-then rules from a dataset called the decision table. The induced if-then rules simply and clearly express the structure of rating and/or knowledge hidden behind the decision table. Such rule induction methods are needed for disease diagnosis systems, discrimination problems, decision problems and other applications. Each sample in the decision table is a tuple of condition attribute values and a decision attribute value, and the first step of rule induction is to find the condition attributes which have no relationship with the decision attribute, to remove them and thereby reduce the table. The reduced set of condition attributes obtained by this process is called a reduct. The conventional rough sets theory for inducing if-then rules is based on the indiscernibility of the samples in the given table by their attributes; reducts by the conventional methods rely on the same concept, and various types of indiscernibility and corresponding methods and algorithms for reducts have been proposed to date [2–6].

This paper retests the conventional reduct methods using a simulation dataset, after summarizing the conventional rough sets and reduct methods, and points out that they lack the ability to adapt to datasets containing conflicting and/or indifferent data, such as real-world datasets. A new reduct method is then proposed to overcome these problems from a statistical point of view. Specifically, the new method regards each sample in the decision table as an outcome of the random variables of the tuple of the condition attributes and the decision attribute, since the dataset is obtained from a population of interest. Accordingly, the problem of finding a reduct through the concept of discernibility can be replaced by the problem of finding the condition attributes which are statistically independent of the decision attribute. This paper proposes two reduct methods: one is called a global reduct, which examines the statistical independence between each condition attribute and the decision attribute, and the other is called a local reduct, which finds the condition attributes statistically independent of each decision attribute value. The validity and usefulness of both reduct methods are confirmed by applying them to the simulation dataset as well as a UCI dataset [7] prepared for machine learning. The local reduct in particular gives important clues for finding the if-then rules hidden behind the decision table of interest.

2 Conventional Rough Sets and Reduct Method

Rough Sets theory is used for inducing if-then rules from a decision table S. S is conventionally denoted \(S = (U, A=C \cup \{D\}, V, \rho )\). Here, \(U = \{u(i)|i=1,...,|U|=N\}\) is a sample set, A is an attribute set, \(C = \{C(j)|j=1,...,|C|\}\) is a condition attribute set, C(j) is a member of C and a condition attribute, and D is a decision attribute. V is a set of attribute values denoted by \(V = \bigcup _{a\in A} {V_{a}}\) and is characterized by an information function \(\rho \): \(U \times A \rightarrow V\). Table 1 shows an example where \(|C| = 6\), \(|V_{a=C(j)}| = M_{C(j)} = 6\), \(|V_{a=D}| = M_D = 6\), \(\rho (x=u(1), a=C(1)) = 5\), \(\rho (x=u(2), a=C(2)) = 5\), and so on.

Table 1. Example of a decision table

Rough Sets theory focuses on the following equivalence relation and equivalence set of indiscernibility: \( I_{C} = \{(u(i), u(j)) \in U^{2} | \rho (u(i), a) = \rho (u(j), a), \forall a \in C\}. \) \(I_{C}\) is an equivalence relation in U and derives the quotient set \(U / I_{C} = \{ [u_{i}]_{C} |\) \(i = 1,2,... \}\). Here, \([u_{i}]_{C} = \{ u(j) \in U | (u(j), u_{i}) \in I_{C}, u_{i} \in U \}\). \([u_{i}]_{C}\) is an equivalence set with the representative element \(u_{i}\) and is called an element set of C in Rough Sets theory [2]. For any \(X \subseteq U\), X can be approximated as \(C_{*}(X) \subseteq X \subseteq C^{*}(X)\) by use of the element sets. Here,

$$\begin{aligned} C_{*}(X)&= \{ u_{i} \in U | [u_{i}]_{C} \subseteq X \}, \end{aligned}$$
(1)
$$\begin{aligned} C^{*}(X)&= \{ u_{i} \in U | [u_{i}]_{C} \cap X \ne \emptyset \}. \end{aligned}$$
(2)

\(C_{*}(X)\) and \(C^{*}(X)\) are called the lower and upper approximations of X by C, respectively. The pair \((C_{*}(X)\), \(C^{*}(X))\) is usually called a rough set of X by C. Specifically, let \(X = D_{d} = \{ u(i) | \rho (u(i), D) = d \}\); then \(C_{*}(X)\) is the set of samples surely satisfying \(D=d\) and \(C^{*}(X)\) the set of samples possibly satisfying it, and they derive if-then rules of \(D = d\) with necessity and possibility, respectively.
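For concreteness, the following minimal sketch (illustrative Python, not code from the paper; the toy table, attribute names and list-of-dicts representation are assumptions) computes the element sets \([u_{i}]_{C}\) and the pair \((C_{*}(X), C^{*}(X))\) for a small decision table.

```python
from collections import defaultdict

def element_sets(table, attrs):
    """Quotient set U/I_C: group sample indices by their value tuple on attrs."""
    classes = defaultdict(set)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def lower_upper(table, attrs, X):
    """Return the rough set (C_*(X), C^*(X)) for a target index set X."""
    lower, upper = set(), set()
    for eq in element_sets(table, attrs):
        if eq <= X:          # element set entirely inside X
            lower |= eq
        if eq & X:           # element set intersecting X
            upper |= eq
    return lower, upper

# Toy decision table: condition attributes C1, C2 and decision attribute D.
table = [
    {"C1": 1, "C2": 1, "D": 1},
    {"C1": 1, "C2": 2, "D": 2},
    {"C1": 1, "C2": 2, "D": 1},   # conflicts with the previous sample
    {"C1": 2, "C2": 2, "D": 2},
]
D1 = {i for i, r in enumerate(table) if r["D"] == 1}
print(lower_upper(table, ["C1", "C2"], D1))   # ({0}, {0, 1, 2})
```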

The conventional Rough Sets theory seeks a minimal subset of C denoted with \(B(\subseteq C)\) satisfying the following two conditions:

  • (i) \(B_{*} (D_{d}) = C_{*} (D_{d})\), \(d = 1,2,...,M_{D}\).

  • (ii) \(a(\in B)\) satisfying \((B-\{a\})_{*} (D_{d}) = C_{*} (D_{d})\) \((d = 1,2,...,M_{D})\) does not exist.

\(B(\subseteq C)\) is called a relative reduct of \(\{D_{d} | d=1,...,M_{D}\}\) preserving the lower approximation, and is useful for finding if-then rules with necessity, since redundant condition attributes have already been removed from C. In the same way, a relative reduct preserving the upper approximation can also be defined and obtained.
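Conditions (i) and (ii) can also be checked directly by brute force, as in the following sketch (an illustrative enumeration over all attribute subsets, not LEM1 or DMM; the helper names and the toy table are assumptions). Such enumeration is only practical for small |C|, which is why dedicated algorithms are used in practice.

```python
from collections import defaultdict
from itertools import combinations

def lower_approx(table, attrs, X):
    """C_*(X) computed from the element sets of `attrs`."""
    classes = defaultdict(set)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in attrs)].add(i)
    return {i for eq in classes.values() if eq <= X for i in eq}

def preserves_lower(table, C, B):
    # Condition (i): B_*(D_d) = C_*(D_d) for every decision value d.
    d_values = {row["D"] for row in table}
    return all(
        lower_approx(table, B, {i for i, r in enumerate(table) if r["D"] == d})
        == lower_approx(table, C, {i for i, r in enumerate(table) if r["D"] == d})
        for d in d_values)

def relative_reducts(table, C):
    # Condition (ii): keep only the minimal subsets B satisfying (i).
    good = [set(B) for k in range(1, len(C) + 1)
            for B in combinations(sorted(C), k) if preserves_lower(table, C, set(B))]
    return [B for B in good if not any(S < B for S in good)]

table = [{"C1": 1, "C2": 1, "C3": 9, "D": 1},
         {"C1": 1, "C2": 2, "C3": 9, "D": 2},
         {"C1": 2, "C2": 1, "C3": 9, "D": 2}]
print(relative_reducts(table, {"C1", "C2", "C3"}))   # [{'C1', 'C2'}] -- C3 is redundant
```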

The LEM1 algorithm [2] and the discernibility matrix method (DMM) [3] are well known as representative ways of computing reducts. Figure 1 shows the LEM1 algorithm; A and \(\{d\}^{*}\) at Line 1 of the figure correspond to C and \(\{D_{d} | d = 1,...,M_{D}\}\) in this paper, respectively. In principle, LEM1 (Lines 6 to 15 in the figure) checks conditions (i) and (ii) for all combinations of the condition attributes.

Fig. 1. Example of LEM1 algorithm

DMM [3] at first forms a symmetric \(N \times N\) matrix having the following (i, j) element \(\delta _{ij}\):

$$\begin{aligned} \delta _{ij} = \left\{ \begin{array}{ll} \{a \in C \,|\, \rho (u(i),a) \ne \rho (u(j),a)\}, &amp; \exists d \in D,\ \rho (u(i),d) \ne \rho (u(j),d) \mathrm{~and~} \{u(i),u(j)\} \cap Pos(D) \ne \emptyset ,\\ *, &amp; \mathrm{otherwise~} (U-Pos(D)). \end{array} \right. \end{aligned}$$

Here, \(Pos(D) = \cup _{d=1}^{M_{D}} C_{*} (D_{d})\) and \(*\) denotes “don’t care”. Then, a relative reduct preserving the lower approximation can be obtained by the following expression:

$$\begin{aligned} F^{reduct} = \wedge _{i,j:i<j} \vee \delta _{ij}. \end{aligned}$$
(3)

Here, \(F^{reduct}\) is called a discernibility function and indicates that reducts are obtained by combining all of the respective discernibility conditions.
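A rough sketch of how the \(\delta _{ij}\) entries might be built is given below (illustrative Python with an assumed list-of-dicts table representation); the Boolean simplification of \(F^{reduct}\) into minimal conjunctions is omitted.

```python
from collections import defaultdict

def positive_region(table, C):
    """Pos(D): samples whose C-element set is decision-consistent (union of C_*(D_d))."""
    classes = defaultdict(set)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in C)].add(i)
    return {i for eq in classes.values()
            if len({table[j]["D"] for j in eq}) == 1 for i in eq}

def discernibility_matrix(table, C):
    """delta_ij: attributes discerning u(i) and u(j), or '*' (don't care)."""
    pos, delta = positive_region(table, C), {}
    for i in range(len(table)):
        for j in range(i + 1, len(table)):
            if table[i]["D"] != table[j]["D"] and ({i, j} & pos):
                delta[(i, j)] = {a for a in C if table[i][a] != table[j][a]}
            else:
                delta[(i, j)] = "*"
    return delta

# F^reduct in (3) is then the conjunction, over all non-'*' entries, of the
# disjunction of the attributes in delta_ij (simplified by Boolean absorption).
```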

3 Retests of the Conventional Reduct Method

Here we retest the ability of the reducts obtained through LEM1 and DMM by use of a simulation dataset. The literature [8–11] shows how to generate such simulation datasets. Specifically, (a) generate the condition attribute values of u(i), that is, \(u^{C}(i) = (v_{C(1)}(i), v_{C(2)}(i),...,v_{C(|C|)}(i))\), by the use of random numbers with a uniform distribution, and (b) determine the decision attribute value of u(i), that is, \(u^{D}(i)\), by use of if-then rules specified in advance and the hypotheses shown in Table 2; repeat processes (a) and (b) N times. Table 1 shows an example dataset generated by the use of those procedures with the following if-then rule R(d) specified in advance:

$$\begin{aligned} R(d): \mathrm{if~} Rd \mathrm{~then~} D=d \quad (d=1,...,M_{D}=6), \end{aligned}$$
(4)

where \(Rd = (C(1)=d) \bigwedge (C(2)=d) \bigvee (C(3)=d) \bigwedge (C(4)=d)\). The results of retesting both methods using the \(N = 10000\) dataset were \(F_{LEM1}^{reduct} = F_{DMM}^{reduct} = C(1) \wedge C(2) \wedge C(3) \wedge C(4) \wedge C(5) \wedge C(6)\), whereas \(F^{reduct} = C(1) \wedge C(2) \wedge C(3) \wedge C(4)\) was expected from the rules (4) specified in advance. The retest experiment was repeated three times with newly generated datasets, and the same results were obtained; in other words, the expected reduct was never obtained.
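As one concrete reading of procedures (a) and (b) with rule (4), the following sketch generates such a dataset; the handling of indifferent and conflicting samples is an assumption about Table 2 inferred from the probabilities quoted in Sect. 4.1, not the authors' exact generator.

```python
import random

M_D = 6          # number of decision classes; also |V_{C(j)}| = 6 here
NUM_COND = 6     # |C| = 6

def decide(v):
    """Decision value for a condition tuple v according to rule (4) and Table 2 (as read here)."""
    matched = [d for d in range(1, M_D + 1)
               if (v[0] == d and v[1] == d) or (v[2] == d and v[3] == d)]
    if len(matched) == 1:
        return matched[0]               # Hypothesis 1: uniquely determined by a rule
    if len(matched) > 1:
        return random.choice(matched)   # Hypothesis 3: conflicting rules, resolved at random
    return random.randint(1, M_D)       # Hypothesis 2: indifferent sample, random decision

def generate(n, seed=1):
    random.seed(seed)
    table = []
    for _ in range(n):
        v = [random.randint(1, M_D) for _ in range(NUM_COND)]   # process (a)
        table.append(v + [decide(v)])                           # process (b)
    return table

table = generate(10000)   # rows: (v_C(1), ..., v_C(6), u^D)
```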

Table 2. Hypotheses with regard to the decision attribute value

These results clearly stem from the indiscernibility and/or discernibility induced by the element sets in the conventional Rough Sets theory. The element sets cannot distinguish samples generated by the if-then rules (see Hypothesis 1 in Table 2) from those obtained by chance (see Hypotheses 2 and 3 in Table 2). On the other hand, real-world datasets will contain all kinds of data generated by Hypotheses 1, 2 and 3. In other words, the conventional Rough Sets theory has no means of adapting to sample data caused by chance.

4 Proposal of Statistical Reduct

4.1 Statistical Global Reduct Method

As mentioned in Sect. 3, the conventional reduct methods are unable to reproduce the reducts specified in advance from the decision table because they cannot adapt to the indifferent and conflicting samples in datasets, despite the fact that real-world datasets will contain such samples. This paper studies this reduct problem from the viewpoint of STRIM (Statistical Test Rule Induction Method) [8–11]. STRIM regards the decision table as a sample set obtained from the population of interest. In this statistical model, \(u(i) = (v_{C(1)}(i), v_{C(2)}(i),...,v_{C(|C|)}(i), u^{D}(i))\) is an outcome of the random variables \(A = (C(1),C(2),...,C(|C|),D)=(C, D)\) (hereafter, the names of the attributes are also used as the random variables). The following probability model is then specified: for any j, \(P(C(j)=v_{C(j)}(k)) = p(j,k)\), \(\sum \nolimits _{k=1}^{M_{C(j)}} p(j,k) = 1\); for any \(j1 \ne j2\), \(P( C(j1) = v_{C(j1)}(k1), C(j2) = v_{C(j2)}(k2) ) = {p(j1,k1) p(j2,k2)}\), that is, C(j1) and C(j2) are independent of each other. According to the rules specified in (4), if \(C = (1, 1, 2, 3, 4, 5)\) (hereafter written briefly as (112345)), for example, then \(P(D=1) = 1.0\) by use of Hypothesis 1 in Table 2. If \(C=(123456)\) then \(P(D=1) = 1/M_{D} = 1/6\) by use of Hypothesis 2. If \(C=(112256)\) then \(P(D=1) = 1/2\) by use of Hypothesis 3. Generally, the outcome of the random variable D is determined by the outcome of C, the if-then rules (generally unknown) and the hypotheses shown in Table 2. Consequently, the following expression is obtained:

$$\begin{aligned} P(D,C) = P(D|C) P(C=(v_{c(1)}(i), v_{c(2)}(i), ..., v_{c(|C|)}(i))). \end{aligned}$$
(5)

Here, P(D|C) is the conditional probability of D given C and depends on the if-then rules to be induced from the dataset \(\{ u(i) = (v_{C(1)}(i), v_{C(2)}(i), ...,\) \(v_{C(|C|)}(i), u^{D}(i)) | i=1,...,N \}\).

In the special case where C(j) does not appear in the condition part of the if-then rules to be induced, D is independent of C(j), that is, \(P(D,C(j)) = P(D|C(j)) P(C(j)) = P(D)P(C(j))\). This independence between D and C(j) can be used for a reduct of the decision table. Whether they are independent or not can easily be judged by a statistical test of hypotheses using \(\{ u(i) = (v_{C(1)}(i), v_{C(2)}(i), ...,\) \(v_{C(|C|)}(i), u^{D}(i)) | i=1,...,N \}\). Specifically, the following null hypothesis H0(j) and its alternative hypothesis H1(j) \((j=1,...,|C|)\) are specified and tested:

  • H0(j): C(j) and D are independent of each other.

  • H1(j): C(j) and D are not independent of each other.

This paper adopts a Chi-square test since it is a standard method for testing the independence of two categorical variables by use of the \(M_{C(j)} \times M_{D}\) contingency table. Chi-square tests are widely used in the data mining area for feature selection [12]. The test statistic \(\chi ^{2}\) of C(j) vs. D is

$$\begin{aligned} \chi ^{2} = \sum \limits _{k=1}^{M_{C(j)}} \sum \limits _{l=1}^{M_{D}} \frac{(f_{kl}-e_{kl})^{2}}{e_{kl}}, \end{aligned}$$
(6)

where \(f_{kl} = |U(C(j)=k) \bigcap U(D=l)|\), \(U(C(j)=k) = \{u(i) | v_{C(j)}(i) = k\}\), \(U(D=l) = \{u(i) | u^{D}(i) = l\}\), \(e_{kl} = n \hat{p}(j,k) \hat{p}(D,l)\), \(n = \sum \nolimits _{k=1}^{M_{C(j)}} \sum \nolimits _{l=1}^{M_{D}} f_{kl}\), \(\hat{p}(j,k) = f_{k\_}/n\), \(\hat{p}(D,l) = f_{\_ l}/n\), \(f_{k\_} = \sum \nolimits _{l=1}^{M_{D}} f_{kl}\), \(f_{\_ l} = \sum \nolimits _{k=1}^{M_{C(j)}} f_{kl}\). \(\chi ^{2}\) obeys a Chi-square distribution with degrees of freedom \(df = (M_{C(j)} -1) \times (M_{D} -1)\) under H0(j) and the test condition [13] \(n\hat{p}(j,k)\hat{p}(D,l) \ge 5\). This paper proposes a reduct method that adopts only those C(j)s whose H0(j) is rejected and constructs a decision table composed of them, since the test of hypotheses controls only type I errors by a significance level and cannot control type II errors. This paper names the proposed method the statistical global reduct (SGR) to distinguish it from the conventional methods.
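A minimal sketch of the SGR test based on (6) is shown below; pandas/scipy and the column names are implementation assumptions, not part of the method's specification.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def sgr(df, cond_cols, dec_col="D", alpha=0.05):
    """Return the condition attributes whose independence of D (H0(j)) is rejected."""
    kept = []
    for c in cond_cols:
        observed = pd.crosstab(df[c], df[dec_col])                  # f_kl
        chi2, p, dof, expected = chi2_contingency(observed, correction=False)
        if (expected >= 5).all() and p < alpha:                     # test condition and rejection of H0(j)
            kept.append(c)
    return kept

# df = pd.DataFrame(table, columns=[f"C{j}" for j in range(1, 7)] + ["D"])
# sgr(df, [f"C{j}" for j in range(1, 7)])   # expected to return ['C1', 'C2', 'C3', 'C4']
```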

A simulation experiment was conducted to confirm the validity of the proposed method, using the decision table of \(N = 10000\) samples from Sect. 3 and the following procedures (a code sketch of these steps is given after the list):

  • Step1: Randomly select samples by \(N_{B}=3000\) from the decision table (\(N=10000\)), and form a new decision table;

  • Step2: Apply SGR to the new table, and calculate \(\chi ^{2}\) for every C(j);

  • Step3: Repeat Step1 and Step2 \(N_{r}=100\) times.
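The following sketch implements Steps 1-3 (the resampling parameters follow the text; whether sampling is with or without replacement is not stated, so it is left as a switch).

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def bootstrap_chi2(df, cond_cols, dec_col="D", n_b=3000, n_r=100, replace=False, seed=0):
    """Steps 1-3: resample n_b rows n_r times and collect chi2 of every C(j) vs. D."""
    rng = np.random.default_rng(seed)
    stats = {c: [] for c in cond_cols}
    for _ in range(n_r):
        sample = df.iloc[rng.choice(len(df), size=n_b, replace=replace)]   # Step 1
        for c in cond_cols:                                                # Step 2
            chi2, _, _, _ = chi2_contingency(pd.crosstab(sample[c], sample[dec_col]),
                                             correction=False)
            stats[c].append(chi2)
    return {c: {"AVE": np.mean(v), "S.D.": np.std(v), "Max": np.max(v), "Min": np.min(v)}
            for c, v in stats.items()}                                     # Step 3 summary
```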

Table 3 summarizes the results of the simulation experiment, that is, the average (AVE), standard deviation (S.D.), maximum (Max) and minimum (Min) of \(\chi ^{2}\), and their corresponding p-values for every C(j) \((j=1,...,6)\). The table shows the following:

  • (1) There are clear differences and large gaps in the \(\chi ^{2}\) values between \(C(1)-C(4)\), which are dependent on D, and C(5) and C(6), which are independent of D, and their fluctuation ranges do not overlap.

  • (2) Accordingly, there are clear differences in p-values between them.

  • (3) This reduct method, which adopts only the C(j)s whose H0(j) is rejected and constructs a decision table composed of them, is valid and useful since it reproduces the expected reduct.

Table 3. Results of test for independence by Bootstrap method (\(N=10000\), \(N_{B}=3000\), \(Nr=100\)): C(j) \((j=1,...,6)\) vs. D

4.2 Statistical Local Reduct Method

In this section, a statistical local reduct (SLR) method, which is applied for every decision attribute value, that is, for every if-then rule, is proposed as a counterpart of the SGR method. The SLR method is applicable even to a decision table that has no relative reduct and/or no SGR. Specifically, (5) also holds for \(D=l\), that is,

$$\begin{aligned} P(D=l,C) = P(D=l|C) P(C=(v_{c(1)}(i), v_{c(2)}(i), ..., v_{c(|C|)}(i))). \end{aligned}$$
(7)

Accordingly, the discussion in Sect. 4.1 holds even when D is replaced with \(D=l\) in H0(j) and H1(j), and the test of statistical independence between C(j) and \(D=l\) can be used as the reduct method for every \(D=l\), that is, for every if-then rule.
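A minimal sketch of the SLR test is given below: for a fixed decision value l, each C(j) is tested against the binary indicator \(D = l\) on a \(2 \times M_{C(j)}\) contingency table, giving \(df = M_{C(j)} - 1\) as stated above (pandas/scipy and the column names are assumptions).

```python
import pandas as pd
from scipy.stats import chi2_contingency

def slr(df, cond_cols, dec_value, dec_col="D"):
    """Test every C(j) against the indicator D == dec_value; high p-values flag redundancy."""
    results = {}
    for c in cond_cols:
        observed = pd.crosstab(df[c], df[dec_col] == dec_value)      # 2 x M_{C(j)} table
        chi2, p, dof, _ = chi2_contingency(observed, correction=False)
        results[c] = {"chi2": chi2, "p": p, "df": dof}               # dof = M_{C(j)} - 1
    return results
```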

Table 4 shows a set of new rules specified in advance in order to confirm the validity and usefulness of the SGR and SLR methods. For example, (1100001) in Rule 1 of the table means that there is a rule: if \((C(1)=1) \bigwedge (C(2)=1)\) then \(D=1\). The set of new rules does not include redundant condition attributes, that is, D depends on all condition attributes. The same simulation experiment as in Sect. 4.1 was conducted, replacing (4) with the set of rules in Table 4, and the results of applying the SGR method are summarized in Table 5 in the same way as in Table 3. The validity of the SGR method is again confirmed, since the table suggests that no condition attribute is independent of D, which coincides with the rules specified in advance. However, the structure of the if-then rules specified in advance remains almost entirely unknown.

Table 4. Example of if-then rules for a simulation experiment
Table 5. Results of test for independence by Bootstrap method using sample dataset generated by rules in Table 4 (\(N=10000\), \(N_{B}=3000\), \(Nr=100\)): C(j) \((j=1,...,6)\) vs. D

Then, the SLR method was applied to one of the \(N_{r} = 100\) resampled datasets used for Table 5. As an example, Table 6 shows the contingency table of \(D = 1\) vs. C(j) \((j=1,...,6)\) and the results of a chi-square test of them with \(df = M_{C(j)} -1\), and suggests the following knowledge:

  • (1) The p-values of C(5) and C(6) are quite high compared with those of the other condition attributes and indicate that C(5) and C(6) are independent of \(D=1\), that is, they are redundant and should be removed from the viewpoint of the reduct.

  • (2) The frequencies of \(C(1)=1\), \(C(2)=1\), \(C(3)=1\) and \(C(4)=1\) are relatively high compared with those of the other values of the same C(j) \((j=1,...,4)\). Accordingly, combinations of \(C(j)=1\) \((j=1,...,4)\) will most likely construct the rules of \(D=1\).

The above knowledge (1) and (2) coincides with the specifications of Rules 1 and 2 in Table 4, and the same results have also been obtained for the cases of \(D=l\) \((l=2,...,6)\); thus the validity and usefulness of the SLR method have been confirmed.

Table 6. Example of contingency table by local reducts (\(N=3000\), \(df=5\)): \(D=1\) vs. C
Table 7. Arrangement of Car Evaluation dataset of UCI

5 An Example Application on an Open Dataset

The literature [7] provides many datasets for machine learning. This paper applied the SGR and SLR methods to the “Car Evaluation” dataset included there. Table 7 shows the summary and specifications of the dataset: \(|C| =6\), \(|V_{C(1)}|=|V_{C(2)}|=|V_{C(3)}|=4\), \(|V_{C(4)}|=|V_{C(5)}|=|V_{C(6)}|=3\), \(|V_{D}|=4\), \(N=|U|=|V_{C(1)}| \times |V_{C(2)}| \times |V_{C(3)}| \times |V_{C(4)}| \times |V_{C(5)}| \times |V_{C(6)}| = 1728\). The dataset consists of every combination of the condition attributes’ values, and there are no conflicting or identical samples. As shown in Table 7, the frequencies of the decision attribute values are heavily skewed toward \(D=1\).
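For reproducibility, the dataset can be loaded and recoded to integer values roughly as follows; the URL, the assignment of attributes to \(C(1),...,C(6)\) and the integer coding produced by factorization are assumptions for illustration and may not match the paper's coding exactly.

```python
import pandas as pd

# Assumed URL and column ordering; C(1)..C(6) = buying, maint, doors, persons,
# lug_boot, safety, which matches the attribute-value counts in Table 7.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
cols = ["C1", "C2", "C3", "C4", "C5", "C6", "D"]
df = pd.read_csv(URL, header=None, names=cols)

# Recode every attribute to integer values 1, 2, ...; factorize assigns codes in order
# of appearance, so they need not coincide with the coding used in the paper's tables.
for c in cols:
    df[c] = pd.factorize(df[c])[0] + 1
```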

Table 8 shows the results obtained by the SGR method, that is, \(\chi ^{2}\) and the corresponding p-values for every C(j) \((j=1,...,|C|)\), and suggests that C(3) is independent of D and redundant compared to the other condition attributes, and should be removed from the viewpoint of the reduct.

Table 8. Results of global reduct for Car Evaluation dataset: D vs. C(j) \((j=1,...,6)\)
Table 9. Results of local reduct for Car Evaluation dataset: \(D=\ell \) \((\ell =1,...,4)\) vs. C(j) \((j=1,...,6)\)
Table 10. Examples of contingency tables and \(\chi ^{2}\) tests by local reducts

Table 9 shows the results obtained by the SLR method, that is, \(\chi ^{2}\) and the corresponding p-values for every \(D=l\) \((l=1,...,4)\) and C(j) \((j=1,...,|C|)\), and suggests the following:

  • (1) C(3) is commonly redundant for every \(D=l\) \((l=1,...,4)\), which coincides with the results of the SGR method.

  • (2) With regard to the if-then rules of \(D=1\), C(5) is redundant. In the same way, C(2) and C(5) are redundant for \(D=2\), and C(5) for \(D=3\).

Table 11. Examples of estimated rules by use of the local reduct results in Table 10

Table 10 shows examples of the contingency tables of \(D=1\) (a) and \(D=4\) (b) and their \(\chi ^{2}\) values obtained by the SLR method, and suggests the following knowledge:

  • (1) With regard to the if-then rules of \(D=1\) (see (a)), the frequencies of \(C(1)=1\), \(C(2)=1\), \(C(4)=1\) and \(C(6)=1\) are distinctively high, which is statistically confirmed by their p-values. Accordingly, the if-then rules of \(D=1\) are supposed to be constructed by combinations of the set \(\{C(1)=1, C(2)=1, C(4)=1 \mathrm{~and~} C(6)=1\}\), as in the knowledge obtained in Sect. 4.2.

  • (2) In the same way, the if-then rules of \(D=4\) are supposed to be constructed by the combinations of the set of \(\{C(1)=4, C(2)=4, C(4)=3, C(5)=3 \mathrm{~and~} C(6)=3\}\) (see (b)).

Table 11 shows the trial rules (Rules 1-9 for \(D=1\), Rules 10-16 for \(D=4\)) based on knowledge (1) and (2) obtained above. Here, Rules 15 and 16 do not have p-values since they do not satisfy the test condition \(n\hat{p}(j,k) \ge 5\) (for \(D=4\), \(n \ge \frac{5}{0.04}=125\)). Rules 1 and 2 can be selected as proper rules of \(D=1\) since their p-values are extremely high and their accuracy and coverage indexes are good compared with those of the other trial rules. Expressed in the original notation of Table 7, the rule: if person\(=\)2 \(\bigvee \) safety\(=\)low then class\(=\)unacc is obtained, with coverage\(=\) \((576+576-192)/1210 \approx 0.793\) and accuracy\(=\)1.0.

In the same way, the rule: if buying\(=\)low \(\bigwedge \) safety\(=\)high then class\(=\)vgood (Rule 13) is thought to be proper since its p-value is the best and its accuracy and coverage indexes are moderate among the trial rules for \(D=4\), although the number of samples for \(D=4\) is small compared with that for \(D=1\). Both estimated rules coincide with our common sense, and thus the validity and usefulness of the SGR and SLR methods have been experimentally confirmed through the use of an open dataset.
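The accuracy and coverage indexes quoted for the trial rules can be computed as in the following sketch, assuming the usual definitions (accuracy \(= |U(rule) \cap D_{d}| / |U(rule)|\), coverage \(= |U(rule) \cap D_{d}| / |D_{d}|\)) and a pandas DataFrame recoded as sketched above.

```python
def accuracy_coverage(df, conditions, dec_value, dec_col="D"):
    """conditions: dict such as {"C4": 1}; returns (accuracy, coverage) of a conjunctive trial rule."""
    covered = df
    for c, v in conditions.items():
        covered = covered[covered[c] == v]          # samples matching the rule's condition part
    hits = covered[covered[dec_col] == dec_value]   # matching samples with the target decision
    accuracy = len(hits) / len(covered) if len(covered) else 0.0
    coverage = len(hits) / int((df[dec_col] == dec_value).sum())
    return accuracy, coverage

# Rules 1 and 2 in Table 11 are single-condition rules; combining them gives the disjunctive
# rule quoted in the text, whose coverage (576 + 576 - 192) / 1210 ≈ 0.793 counts the
# overlap of the two conditions only once, with accuracy 1.0.
```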

6 Conclusions

The Rough Sets theory has been used for inducing if-then rules from the decision table and clarifying the structure of rating and/or knowledge hidden behind the dataset. The first step in inducing the rules is to find reducts of the condition attributes. This paper retested the conventional reduct methods LEM1 [2] and DMM [3] by a simulation experiment, after summarizing the conventional rough sets theory, and pointed out their problems. Then, this paper proposed a new reduct method to overcome the problems of the conventional methods from the viewpoint of STRIM [8–11]. Specifically, the SGR and SLR methods were proposed, and their validity and usefulness were confirmed by a simulation experiment and by application to an open dataset from the UCI machine learning repository. The SLR method should be recognized as particularly useful not only for reducts of condition attributes but also for inducing if-then rules.