
1 Introduction

With the growth of various networked societies, numerous electronic datasets are generated and stored under different policies and/or business strategies, and such datasets are often arranged in a form suitable for each application. This paper considers a Rakuten Travel dataset (R-dataset) [1, 12], a real-world dataset (RWD) of questionnaire surveys in which raters score several feature items of each accommodation (object) together with its overall category; it is a typical decision table (DT) in the field of Rough Sets (RSs) [2]. This paper proposes a new method that arranges the R-dataset into a rule table (RT) presenting the relationships between the feature items of each object and the overall category, and applies these relationships to classify a new object into its overall category. The classification results are evaluated by comparison with the random forest (RF) method [3].

In addition, as stated in the principle of incompatibility [4], both an accurate arrangement and/or summarization of the dataset and comprehensible expressions for decision making are necessary to support real-world activities. However, the arrangement by the RT is too complex for human beings, as is also the case with RF, and loses comprehensibility. After showing that the RT includes if-then rule candidates (RCs) with a proper statistical significance level, induced by the previous statistical test rule induction method (STRIM) [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], as well as RCs without it, the former RCs are further shown to include intuitively comprehensible basic rules. These three types of RCs in the dataset can organize a rule set that balances accuracy and comprehensibility depending on the target matter. Meanwhile, RF does not provide such knowledge or information behind the dataset, although it remains a useful method for classification problems. In this way, the usefulness of the proposed method is confirmed.

2 Conventional RS and Its Rule Induction Method

The R-dataset is a typical DT in the field of the RSs [2] and the DT is formulated with an observation system S as follows: \(S=(U,A=C\cup \{D\},V,\rho )\), where \(U=\{u(i)|i=1,...,N=|U|\}\) is a dataset, u(i) denotes an object in a population, A denotes a set of given attributes of U, \(C=\{C(j)|\,j=1,...,|C|\}\) denotes a set of the condition attribute C(j), and D denotes a decision attribute. Meanwhile, V denotes a value set of the attribute, i.e., \(V=\bigcup \nolimits _{a\in A}V_{a}\), where \(V_{a}\) denotes the set of values for an attribute a and \(\rho :U\times A\rightarrow V\) is called an information function. For example, let a = C(j) \((j=1,...,|C|)\), then \(V_{a}=\{1,2,\ldots ,M_{C(j)}\}\). If a = D, then \(V_{a}=\{1,2,\ldots ,M_{D}\}\). Corresponding relationships with the R-dataset are given as follows: \(|C|=6\), \(A=\{C(1)=Location,\,C(2)=Room,\,C(3)=Meal,\,C(4)=Bath\,(HotSpring),\,C(5)=Service,\,C(6)=Amenity,\,D=Overall\}\), and \(V_{a}=\{1:Dissatisfied,\,2:Slightly\,Dissatisfied,\,3:Neither\,Dissatisfied\,nor\,Satisfied,\,4:Slightly\,Satisfied,\,5:Very\,Satisfied\}\), \(a\in A\) , i.e., \(|V_{a=D}|=M_{D}=|V_{a=C(j)}|=M_{C(j)}=5\).

The conventional RS theory finds the following subsets of U through C and D:

$$\begin{aligned} C_{*}(D_{d})\subseteq D_{d}\subseteq C^{*}(D_{d}). \end{aligned}$$
(1)

Here, \(C_{*}(D_{d})=\{u_{i}\in U|[u_{i}]_{C}\subseteq D_{d}\}\), \(C^{*}(D_{d})=\{u_{i}\in U|[u_{i}]_{C}\cap D_{d}\ne \emptyset \}\), \([u_{i}]_{C}=\{u(j)\in U|(u(j),u_{i})\in I_{C},u_{i}\in U\}\), and \(I_{C}=\{(u(i),u(j))\in U^{2}|\rho (u(i),a)=\rho (u(j),a),\forall a\in C\}\), where \([u_{i}]_{C}\) denotes the equivalence class with the representative \(u_{i}\) induced by the equivalence relation \(I_{C}\), and \(D_{d}=\{u(i)|\rho (u(i),D)=d\}\). In equation (1), \(C_{*}(D_{d})\) and \(C^{*}(D_{d})\) are called the lower and upper approximation of \(D_{d}\), respectively, and \((C_{*}(D_{d}),C^{*}(D_{d}))\) is the rough set for \(D_{d}\). Once \(C_{*}(D_{d})=\{u(i)|\wedge _{j}(\rho (u(i),C(j))=v_{j_{k}})\}\) is found using the DT, the following if–then rule with necessity is obtained from the inclusion relation in equation (1): if CP then \(D=d\), where the condition part (CP) is specifically \(CP=\wedge _{j}(C(j)=v_{j_{k}})\). Similarly, the if–then rule with possibility is induced using \(C^{*}(D_{d})\). Thus, the conventional RS theory derives relationships between \(C=(C(1),\ldots ,C(6))\) and D. The specific algorithm and details of RSs can be found in the literature [2, 21].
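As an illustration of these definitions only (not the algorithm of [21]), the following R sketch computes the lower and upper approximations of \(D_{d}\) from a DT held in a data frame; the column names C1–C6 and D are assumptions made for this sketch.

```r
# Minimal sketch: lower and upper approximations of D_d from a decision table.
# Assumes a data frame `dt` with condition columns C1..C6 and decision column D.
lower_upper <- function(dt, cond_cols = paste0("C", 1:6), d = 1) {
  key <- apply(dt[, cond_cols], 1, paste, collapse = "-")  # equivalence classes [u]_C
  lower <- rep(FALSE, nrow(dt))
  upper <- rep(FALSE, nrow(dt))
  for (k in unique(key)) {
    idx <- which(key == k)              # one equivalence class
    in_Dd <- dt$D[idx] == d
    if (all(in_Dd)) lower[idx] <- TRUE  # [u]_C is included in D_d
    if (any(in_Dd)) upper[idx] <- TRUE  # [u]_C intersects D_d
  }
  list(lower = which(lower), upper = which(upper))  # C_*(D_d) and C^*(D_d) as row indices
}
```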

However, in most cases, \(u(i)=(u^{C}(i),u^{D}(i))\) is randomly collected from a population of interest, so that the attribute values \(u^{C}(i)=(v_{C(1)}(i),...,v_{C(6)}(i))\) (\(v_{C(j)}(i)\,(\in V_{C(j)})\)) and \(u^{D}(i)=v_{d}(i)\,(\in V_{D})\) are subject to random variation. Collecting another dataset from the same population means that U will vary, and hence the induced rules will also vary, since the conventional RS theory directly uses the DT attribute values [5, 8, 13]. As a result, the rules induced by the conventional RSs do not function well, e.g., for the classification problem. In statistics, C(j) and D are recognized as random variables, and \(v_{C(j)}(i)\,(\in V_{C(j)})\) and \(v_{d}(i)\,(\in V_{D})\) are their respective outcomes. The conventional RS theory lacks this statistical view and does not have a model of how the DT values are collected.

Fig. 1. Data generation model: Rule Box contains if-then rules and Hypotheses regulate how to apply the rules to Input and transform Input into Output.

3 Outlines of Data Generation Model

The previous STRIM proposed a data generation model, shown in Fig. 1, in which Rule Box and Hypotheses transform an input \(u^{C}(i)\) into the output \(u^{D}(i)\). Here, Rule Box contains pre-specified if-then rules, and Hypotheses regulate how to apply those rules to the input and transform the input into the output. The model was used for generating a dataset in a simulation experiment as follows: (1) specifying some proper if-then rules in Rule Box, (2) randomly generating an input and transforming it into the output based on those rules and Hypotheses, and (3) repeating (2) N times to form \(U=\{u(i)=(u^{C}(i),\,u^{D}(i))|i=1,...,N=|U|\}\). The generated U was used for investigating the rule induction ability of any rule induction method applied to it (see details in [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]). The previous STRIM [20] could induce the pre-specified rules, while methods such as RSs [15, 21], CART [14, 22], or association rules [19, 23] induced many meaningless rules that hardly corresponded with the pre-specified ones.
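A minimal R sketch of this simulation procedure is given below; the two pre-specified rules, the uniform input distribution, and the uniformly random output when no rule fires are illustrative assumptions, not the specific hypotheses of the previous studies.

```r
# Minimal sketch of the data generation model in Fig. 1 (illustrative assumptions:
# two hypothetical rules, uniform inputs, uniformly random D when no rule fires).
set.seed(1)
rules <- list(list(cp = c(C1 = 1, C2 = 1), d = 1),
              list(cp = c(C5 = 5, C6 = 5), d = 5))
gen_one <- function() {
  u_c <- sample(1:5, 6, replace = TRUE)        # input u^C(i)
  names(u_c) <- paste0("C", 1:6)
  d <- sample(1:5, 1)                          # default output when no rule matches
  for (rl in rules) {
    if (all(u_c[names(rl$cp)] == rl$cp)) { d <- rl$d; break }  # Rule Box + Hypotheses
  }
  c(u_c, D = d)                                # u(i) = (u^C(i), u^D(i))
}
U <- as.data.frame(t(replicate(10000, gen_one())))  # repeat N times to form U
```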

4 Introduction of RT and Its Application for Classification Problem

The validity and usefulness of the previous STRIM have been confirmed in simulation experiments that generate datasets obeying pre-specified rules, that is, well-behaved datasets. This paper newly expands Rule Box and Hypotheses in Fig. 1 so as to adapt them to the R-dataset, an RWD that is ill-behaved because of various raters with different rating standards, and investigates the rule induction ability under this setting. Generally, the result of rule induction from an RWD cannot be directly ascertained, and the descriptions of the induced rules become complex. This paper therefore investigates the ability on the classification problem, which is complementary to the rule induction problem in the sense that the induced rules directly affect the classification rate.

Let us derive a new relationship between C(j) \((j=1,...,6)\) and D using the R-dataset. This section shows how to form the RT and how to apply the new relationship to a classification problem. The R-dataset of N = 10,000 was first formed by randomly selecting 2,000 samples for each \(D=m\) (\(m= 1,\ldots ,5\)) from about 400,000 surveys in the 2013–2014 dataset. The R-dataset was then randomly divided into two groups: the \(R_{L}\)-dataset with \(N_{L}\) = 5,000 for learning, and the \(R_{C}\)-dataset with \(N_{C}\) = 5,000 for the classification experiment.
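A minimal R sketch of this preparation is shown below; the data-frame name `surveys`, the column names C1–C6 and D, and the fixed random seed are assumptions for illustration only.

```r
# Minimal sketch: balanced sampling of the R-dataset and the R_L/R_C split.
# `surveys` is assumed to hold the raw surveys with columns C1..C6 and D.
set.seed(1)
R_data <- do.call(rbind, lapply(1:5, function(m) {
  pool <- surveys[surveys$D == m, ]
  pool[sample(nrow(pool), 2000), ]          # 2,000 samples for each D = m
}))
idx <- sample(nrow(R_data), 5000)           # random halves
RL_data <- R_data[idx, ]                    # R_L-dataset (learning), N_L = 5,000
RC_data <- R_data[-idx, ]                   # R_C-dataset (classification), N_C = 5,000
```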

Regarding the learning process, let us consider a specific example of learning data: \((C(1),\ldots ,C(6),D)\) = (1, 2, 3, 4, 5, 1, 3). This data can be derived from an if–then rule: if \(CP=(C(1)=1)\wedge (C(2)=2)\wedge (C(3)=3)\) (hereafter denoted by \(CP=(123000)\)) then \(D=3\). This rule is called a rule with rule length 3 (\(RL=3\)) as its condition part involves three attributes. Assuming \(RL=3\), \(CP=(023400)\) can be considered another RC. Thus, the attribute combinations for RCs with \(RL=3\) in this example can be formed in \(_{6}C_{r}|_{r=3}=20\) different ways, and accordingly, the number of all possible RCs with \(RL=3\) is \(_{6}C_{r}(5)^{r}|_{r=3}=2,500\). The \(R_{L}\)-dataset was arranged into the RT with \(RL=3\) (\(RT(r=3)\)) as shown in Table 1. The first row of the table, (1,2,3) = (1,1,1), represents the CP \((C(1)=1)\wedge (C(2)=1)\wedge (C(3)=1)\), and (79,0,0,0,0) is the frequency distribution of D satisfying this condition in the \(R_{L}\)-dataset; that is, \(D=1\) has the maximum frequency (if several values have the same frequency, D is randomly selected among them). Most of the distributions of D in Table 1 fluctuate widely for the same CP, which is caused by different raters with varying standards. An RC with a frequency distribution of (0,0,0,0,0) is called an empty RC. Each RC in Table 1 is called a sub-rule of the RT. In addition, using RT(r), the rule set \(\cup _{r=1}^{|C|=6}RT(r)\) includes all the rules behind the \(R_{L}\)-dataset. This RT is the newly expanded Rule Box in Fig. 1.
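A minimal R sketch of this arrangement is shown below, continuing the assumed RL_data from the sketch above; it simply counts the frequency distribution of D for every CP observed under each of the \(_{6}C_{r}\) attribute combinations.

```r
# Minimal sketch: build RT(r) as frequency distributions of D per condition part.
build_RT <- function(RL_data, r = 3, cond_cols = paste0("C", 1:6)) {
  combos <- combn(cond_cols, r, simplify = FALSE)   # 6C_r attribute combinations
  rt <- list()
  for (cols in combos) {
    cp <- interaction(RL_data[cols], drop = TRUE)   # observed CPs for this combination
    rt[[paste(cols, collapse = ",")]] <-
      table(cp, factor(RL_data$D, levels = 1:5))    # frequency distribution of D per CP
  }
  rt
}
RT3 <- build_RT(RL_data, r = 3)                     # RT(r = 3) as in Table 1
```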

With respect to the classification process of transforming an input \(u^{C}(i)\) into the output \(u^{D}(i)\), all \(_{6}C_{r}|_{r=3}=20\) sub-rules that match the input pattern should be considered in order to adapt to different rating standards. Therefore, this paper adopted a vote over each sub-rule's output. Figure 2 provides a specific example where the input \(u^{C}(i)=(4,5,1,4,4,3)\) (\(u^{D}(i)=4\)) is classified by the 20 sub-rules with \(RL=3\) using the RT in Table 1. In the first row, the input values (\(C(1)=4,C(2)=5,C(3)=1\)) correspond to the CP of the sub-rule (1,2,3) = (4,5,1) in the RT, and the input is classified as \(D=1\) by selecting the maximum frequency. Similarly, in the 19th row, \(D=2\) or 3 is randomly selected. Arranging the non-empty cases (i.e., deleting empty sub-rules) and counting the votes, the maximum number of votes is 9 at \(\hat{D}=4\), which is the final result and happens to coincide with \(u^{D}(i)\). In the case of tied vote counts, one of them is randomly selected. These processes are the new Hypotheses in Fig. 1.
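The voting step can be sketched in R as follows, again under the assumed data layout; ties within a sub-rule and in the final vote are broken at random as described above, and the random guess for an input matched only by empty sub-rules is an assumption of this sketch.

```r
# Minimal sketch: classify one input by voting over the matching sub-rules of RT(r).
classify_rt <- function(RT_r, u_c) {
  votes <- integer(0)
  for (name in names(RT_r)) {
    cols <- strsplit(name, ",")[[1]]
    key <- paste(u_c[cols], collapse = ".")        # label used by interaction() above
    if (!key %in% rownames(RT_r[[name]])) next     # empty sub-rule: CP unseen in R_L-dataset
    freq <- RT_r[[name]][key, ]
    d_hat <- which(freq == max(freq))              # sub-rule output (maximum frequency of D)
    votes <- c(votes, if (length(d_hat) > 1) sample(d_hat, 1) else d_hat)
  }
  if (length(votes) == 0) return(sample(1:5, 1))   # no usable sub-rule: guess (assumption)
  v <- table(votes)
  as.integer(sample(names(v)[v == max(v)], 1))     # final vote, ties broken at random
}
classify_rt(RT3, c(C1 = 4, C2 = 5, C3 = 1, C4 = 4, C5 = 4, C6 = 3))  # example of Fig. 2
```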

Table 1. Example of RT with RL = 3 of the \(R_{L}\)-dataset.
Fig. 2. Example of classification process by RT with \(RL=3\).

5 Classification Experiments on R-Dataset Using RT and Comparison With RF

The RT, together with the classification method proposed in Sect. 4, was applied to the R-dataset to investigate its ability and confirm its usefulness. The experiment was executed for every RT(r) (\(r=1,\ldots ,6\)) as follows: 1) composing RT(r) using the \(R_{L}\)-dataset, 2) forming a classification dataset by randomly sampling \(N_{b}\) = 500 objects from the \(R_{C}\)-dataset, 3) classifying the \(N_{b}\) objects according to the classification process (see Fig. 2), and 4) repeating the previous procedures \(N_{r}\) = 50 times. Table 2 summarizes the results classified by RT(r) (\(r=1,\ldots ,6\)) as \((m_{RT(r)},SD_{RT(r)})\) [%], where \(m_{RT(r)}\) and \(SD_{RT(r)}\) denote the mean of the classification rate over the \(N_{r}\) repetitions and its standard deviation, respectively. The following comparisons can be made between the results by \(RT(RL=r)\) (\(r=1,\ldots ,6\)):

(1):

When RL is small, the accuracy of the sub-rules tends to be low, while their coverage tends to be high. Consequently, there are rarely any empty sub-rules, and the bias of the frequency distribution of D is low, leading to a lower classification ability. Here, \(accuracy=|U(d)\bigcap U(CP)|/|U(CP)|=P(D=d|CP)\), \(coverage=|U(d)\bigcap U(CP)|/|U(d)|=P(CP|D=d)\), \(U(d)=\{u(i)|u^{D}(i)=d\}\), and \(U(CP)=\{u(i)|u^{C}(i)\,satisfies\,CP\}\).

(2):

When RL becomes too large, the accuracy of the sub-rules tends to increase, while the coverage decreases. Consequently, there are many empty sub-rules, leading to a decrease in the classification ability.

(3):

Suppose that \(X_{r1}\sim N(\mu _{r1},\sigma _{r1}^{2})\) and \(X_{r2}\sim N(\mu _{r2},\sigma _{r2}^{2})\), where \(X_{r1}\) and \(X_{r2}\) denote random variables of the classification rate by RT(r1) and RT(r2), respectively. The \(N_{r}\)-times means \(\overline{X_{r1}}\) and \(\overline{X_{r2}}\) are computed, and their normalized difference is given as follows:

$$\begin{aligned} Z=\frac{\overline{X_{r1}}-\overline{X_{r2}}-(\mu _{r1}-\mu _{r2})}{\left( \frac{\sigma _{r1}^{2}}{N_{r}}+\frac{\sigma _{r2}^{2}}{N_{r}}\right) ^{0.5}}. \end{aligned}$$
(2)

Under the null hypothesis H0: \(\mu _{r1}=\mu _{r2}\), \(Z=\frac{\overline{X_{r1}}-\overline{X_{r2}}}{\left( \frac{\sigma _{r1}^{2}}{N_{r}}+\frac{\sigma _{r2}^{2}}{N_{r}}\right) ^{0.5}}\sim N(0,1)\). For example, setting \(\overline{X_{r1}}=m_{RT(4)}\), \(\sigma _{r1}=SD_{RT(4)}\), \(\overline{X_{r2}}=m_{RT(3)}\), and \(\sigma _{r2}=SD_{RT(3)}\) gives z = 2.83 with p-value = 2.31E-3, resulting in the rejection of H0, i.e., statistically \(\mu _{RT(4)}>\mu _{RT(3)}\).
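The test of equation (2) can be computed with a few lines of R; the means and standard deviations are those reported in Table 2, so only the function itself is shown here.

```r
# Minimal sketch: one-sided test of equation (2) for two RT(r) settings.
z_test_rates <- function(m1, sd1, m2, sd2, Nr = 50) {
  z <- (m1 - m2) / sqrt(sd1^2 / Nr + sd2^2 / Nr)
  c(z = z, p_value = 1 - pnorm(z))     # one-sided p-value under H0: mu_r1 = mu_r2
}
# e.g., z_test_rates(m_RT4, SD_RT4, m_RT3, SD_RT3) reproduces z = 2.83 in the text
```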

Accordingly, \(RT(r=4)\) was used for the classification experiment on the \(R_{C}\)-dataset, as it gave the highest classification rate. Table 3 presents one of the results classified by \(RT(r=4)\), arranged as the confusion matrix over all the classified data (500 \(\times \) 50). For example, the first row shows that the total number of data with \(D=1\) is (3,827+1,032+95+66+52) = 5,072, the rate classified as \(D=1\) is 3,827/5,072 = 0.755, the rate classified as \(D=2\) is 1,032/5,072 = 0.203, and the class error is 0.245.

The same experiment was conducted on the same \(R_{L}\)-dataset and \(R_{C}\)-dataset by RF for comparison with the classification results of the RT method. Specifically, the same \(R_{L}\)-dataset was first used for learning the RF model as shown in Fig. 3: classmodel = randomForest(x, y, mtry = 2). Here, the function randomForest is in the R-language library [24]; {\(u^{C}(i)\in R_{L}\)-dataset} was set to x, and {\(u^{D}(i)\in R_{L}\)-dataset} was set to y after changing their data classes appropriately. The parameter mtry = 2 was found to give the lowest out-of-bag (OOB) error rate in a preliminary experiment. Figure 3 shows that one of the outputs of "classmodel" is (729 + 201 + 32 + 15 + 11) = 988 for the data with \(D=1\) and class error = (201+32+15+11)/988 = 0.262. Table 4, corresponding to Table 3, shows one of the results of the classification experiment by the RF implemented in Fig. 3, and each class error is similar to that in Fig. 3. The mean classification rate over the \(N_{r}\) repetitions and its standard deviation were \((m_{RF},SD_{RF})\) = (68.6, 1.70) [%], as shown in Table 2.
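For reference, the RF comparison can be reproduced along the following lines with the randomForest package [24]; the call with mtry = 2 follows Fig. 3, while the column names and the prediction/confusion-matrix steps are assumptions filling in details not spelled out in the text.

```r
# Minimal sketch of the RF comparison (assumed column names C1..C6, D).
library(randomForest)
x <- RL_data[, paste0("C", 1:6)]
y <- factor(RL_data$D)                        # classification requires a factor response
classmodel <- randomForest(x, y, mtry = 2)    # as in Fig. 3
print(classmodel)                             # OOB error rate and confusion matrix
pred <- predict(classmodel, RC_data[, paste0("C", 1:6)])
table(factor(RC_data$D, levels = 1:5), pred)  # confusion matrix as in Table 4
```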

The comparison of the classification results on the R-dataset between \(RT(r=4)\) and RF revealed no significant difference in either direction, \(\mu _{RT(4)}>\mu _{RF}\) or \(\mu _{RT(4)}<\mu _{RF}\), using equation (2) (z = 0.075), indicating that \(\mu _{RT(4)}=\mu _{RF}\) for the time being. Figure 4 also compares the class errors over D between Table 3 (by \(RT(r=4)\)) and Table 4 (by RF); both show the same tendency, suggesting that the two methods execute the classification based on rules close to each other. RF is also a classification method that votes over a large number of decision trees built with randomly selected variables so as to decrease the correlation among the trees, improving on the CART method of constructing a single tree structure [3, 22]. However, the difference between them is that the RT explicitly induces rules, whereas RF does not.

Table 2. Summary of classification experiment for R-dataset by RT(r), RF and tr-STRIM.
Table 3. Results of R-dataset classification experiment by \(RT(r=4)\).

6 Proposal of Trunk Rules and Its Consideration

The classification experiments on the R-dataset by the RT method, accompanied by the classification procedures, showed an ability equivalent to that of RF. However, although the method adapted to the ill-behaved RWD, it arranged the R-dataset into numerous sub-rules and used the RT for voting to obtain the classified result, so the RT lost comprehensibility for human beings. Meanwhile, the previous STRIM [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] used the following principle for exploring the CP of if–then rules: \(P(D=d|CP)\ne P(D=d)\), setting the null hypothesis H0: the CP is not a rule candidate (\(P(D=d|CP)=P(D=d)\)) and executing the statistical test using the \(R_{L}\)-dataset. The frequency distribution of D, e.g., (79,0,0,0,0) in the first row of Table 1, rejects the null hypothesis, i.e., this CP is a rule candidate. However, H0 cannot be rejected for the second row (9,2,0,0,0) or the third (9,0,0,0,0), as they do not satisfy the necessary test sample size n (in this specification, approximately \(n\ge 25\), and the sample size of the second row is \(9+2=11\); see [13]). Thus, the RT can be divided into two types of sub-rules: those rejecting H0 and those not rejecting it. Table 5 shows the number of RCs induced from the whole of the RT, \(\cup _{r=1}^{|C|=6}RT(r)\), that satisfy the principle with \(p-value<1.0E-5\), arranged by every \(RL=r\); this coincides with the result by the previous STRIM. That is, the RT expands the range of RCs induced by the previous STRIM, and the RT method with \(RL=r\) was labeled expanded STRIM (\(ex-STRIM|RL=r\)).
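The screening of sub-rules by H0 can be sketched as follows; the normal approximation of the binomial with P(D = d) = 0.2 (the R-dataset is balanced over D) is a simplified stand-in for the test statistic specified in [13], while the thresholds \(n\ge 25\) and p-value < 1.0E-5 follow the text.

```r
# Minimal sketch: screen a sub-rule's frequency distribution of D against
# H0: P(D = d | CP) = P(D = d) = 0.2, via a normal approximation of the binomial.
is_rule_candidate <- function(freq, p_d = 0.2, n_min = 25, alpha = 1.0e-5) {
  n <- sum(freq)
  if (n < n_min) return(FALSE)                     # sample size too small to test
  f_max <- max(freq)                               # frequency of the candidate decision d
  z <- (f_max - n * p_d) / sqrt(n * p_d * (1 - p_d))
  (1 - pnorm(z)) < alpha                           # one-sided p-value against the threshold
}
is_rule_candidate(c(79, 0, 0, 0, 0))   # first row of Table 1: H0 rejected (TRUE)
is_rule_candidate(c(9, 2, 0, 0, 0))    # second row: n = 11 < 25, H0 not rejected (FALSE)
```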

Fig. 3. Results of error rate of OOB and confusion matrix in the \(R_{L}\)-dataset.

Table 4. Results of classification experiment for \(R_{C}\)-dataset by RF
Fig. 4. Class error tendency corresponding to Tables 3 and 4.

Although the details were omitted due to space limitations, all the induced RCs with \(RL=1\) in Table 5 are of the following form:

$$\begin{aligned} if\,C(j)=d\,then\,D=d\,(j=1,\ldots ,6,\,d=1,\ldots ,5). \end{aligned}$$
(3)

Table 6 shows the RCs with \(RL\ge 2\) whose condition parts contain only \(C(j)=d\), extracted from Table 5 and arranged in the same manner as Table 5. The number of all possible RCs with \(RL=r\) constructed only from \(C(j)=d\) is given by \(_{6}C_{r}\), and this number was attained in Table 6 except for the cases of \(D=2\) with RL = 4, 5, and 6. Specifically, eight RCs of \(D=2\) with \(RL=4\) were discovered: CP = (222002), (220202), (220022), (022220), (022202), (022022), (020222), and (002222). The values in the condition and decision parts of these if–then rules coincide with each other, which is natural considering that C(j) (\(j = 1,\ldots ,6\)) and D share the same ordinal rating scale. Consequently, the RT, or the previous STRIM, induced such understandable RCs, which were labeled trunk rules, and the method inducing them was labeled trunk STRIM (tr-STRIM). As shown in Table 2, the classification result by tr-STRIM was \((m_{tr-ST},SD_{tr-ST})\) = (60.5, 2.12) [%], which is positioned between those of RT(r) with r = 1 and r = 2.
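Extracting trunk rules from the induced RCs is straightforward; a minimal sketch is shown below, where a CP is represented by the 6-digit coding used in the text (0 meaning the attribute is not used).

```r
# Minimal sketch: a trunk rule is an RC whose CP uses only the value d of its decision.
is_trunk <- function(cp, d) {
  used <- cp[cp != 0]                  # values of the attributes appearing in the CP
  length(used) >= 1 && all(used == d)  # all of them equal the decision value d
}
is_trunk(c(2, 2, 2, 0, 0, 2), 2)   # CP = (222002), D = 2: a trunk rule (TRUE)
is_trunk(c(1, 2, 3, 0, 0, 0), 3)   # CP = (123000), D = 3: not a trunk rule (FALSE)
```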

The inclusion relationship of RCs induced by the three types of STRIM is arranged as follows:

(1):

\(Rset(ex-STRIM|RL=r)\supset Rset(STRIM|RL=r)\supset Rset(tr-STRIM|RL=r)\),

(2):

\(\cup _{r=1}^{|C|=6}Rset(ex-STRIM|RL=r)\supset Rset(STRIM)\supset Rset(tr-STRIM)\),

(3):

\(Rset(ex-STRIM|r=1)\supset Rset(ex-STRIM|r=2)\supset \cdots \supset Rset(ex-STRIM|r=6)\).

In this context, Rset(method) refers to the rule set induced by a particular method. The average classification results can be summarized as follows (including relationships not shown due to space limitations): \(C(Rset(ex-STRIM|RL=r),data)>C(Rset(STRIM|RL=r),data)>C(Rset(tr-STRIM|RL=r),data)\), where C(Rset(method), data) represents the average classification result on a dataset by Rset(method). In qualitative terms, the classification results of the trunk rules were improved by adding the branch rules (RCs with a statistical significance level) and further the leaf rules (RCs without it). These observations can be beneficial in choosing a level of “the principle of incompatibility” depending on the targeted matter, for instance, in various policies or business strategies where RF cannot explicitly induce rules; the contents of the dataset and the required level then need to be considered.

Table 5. Number of induced rules with statistical significance level from whole of RT.
Table 6. Number of extracted trunk rules for each RL from Table 5

7 Conclusion

The validity and usefulness of the previous rule induction method, STRIM, have been confirmed using simulation experiments. However, the examination of its usefulness on an RWD was left for future study [8, 13, 20]. This study specifically used the R-dataset, a DT in the field of RSs, to experimentally address this issue by newly proposing the RT method adapted to the RWD. The validity and usefulness of the method were examined by applying it to the classification problem, since the correctness of rules induced from an RWD cannot generally be ascertained directly, even though the induced rules directly affect the classification result. The usefulness of the method was confirmed by comparison with the classification result by RF. The following aspects were specifically considered:

(1):

The newly proposed RT method demonstrated how to arrange the dataset into RT(r), the set of sub-rules covering all possible RCs with \(RL=r\), and how to use RT(r) to classify a new object.

(2):

The validity and usefulness of the RT method were experimentally examined by applying it to the classification problem after selecting a proper RT(r). The result of the classification rate showed equivalence to that by RF, i.e., \(\mu _{RT(r=4)}=\mu _{RF}\).

(3):

However, the induced RT(r) and the decision trees generated by RF are both too complex for human beings to understand, and both methods offer limited insights and information regarding the dataset in the form of if–then rules. Thus, the whole of the RT, \(\cup _{r=1}^{|C|=6}RT(r)\), was subsequently reviewed and organized into trunk rules using the tr-STRIM method. The relationships between these rule sets were shown as \(Rset(ex-STRIM|RL=r)\supset Rset(STRIM|RL=r)\supset Rset(tr-STRIM|RL=r)\) and are useful in addressing “the principle of incompatibility” depending on the targeted matter. On the other hand, RF does not provide such knowledge, although it remains a useful method for classification problems.

The investigation of the following points is recommended for future studies:

(1):

Future research should apply the three types of STRIM, namely the previous STRIM, ex-STRIM, and tr-STRIM, to various other RWDs and replicate the findings in order to validate those of this study.

(2):

When using ex-STRIM with \(RL=r\) as a classification method, whether \(\mu _{(ex-ST|r)}=\mu _{RF}\) always holds should be studied, considering factors such as the size of the learning dataset and changes in the RWD.