Keywords

1 Introduction

Nowadays, a large number of electronic datasets are being generated with the growth of a network society. Among such datasets, those generated in the e-commerce area are used for various business strategies and such trials have recently proliferated quickly. The e-commerce takes in the various datasets including their attributes with regard to items for sale as well as their customers so that their relationships, structures and features are easily analyzed and used for strategies of providing it with new items and/or services for sale as well as acquiring new customers. In those processes, the conventional data mining or analyzing methods are used, or new methods are needed and developed for improving their precision and adaptation of new aims. Demands from such a network society generate research and development in those data science areas.

A statistical test rule induction method (STRIM) [1,2,3,4,5,6,7,8] also has been proposed for improving rule induction methods by the conventional Rough Sets methods [9,10,11,12] which are used for inducing if-then rules from a dataset called the decision table. Specifically, STRIM recognized the if-then rules as an input-output system and proposed a data generation model for the decision table in order to clarify the relationship between if-then rules and the decision table, the stochastic uncertainty included in the table and what is a rule hidden in the table. The data generation model made up for faults of the conventional Rough Sets lacking statistical views. An algorithm for the rule induction by STRIM also has been proposed and the validity and the usefulness have been confirmed by applying it to real-world datasets after simulation experiments.

However, the plain hypotheses were used in the process of transforming the input into the output in order to simply study the data generation process. Specifically, the previous data generation process used a complete correspondence hypothesis (CCH) that the input was transformed by the pre-specified rules only when it completely corresponded with them. In the real-world, human beings often use their rules even when the input partially corresponds with them and they decide to compromise with the second best. This paper experimentally studies an if-then rule induction problems from the dataset generated based on a partial correspondence hypothesis (PCH) in order to better match the previous STRIM to the real-world dataset judged in the processes such as human decision-making. Specifically, the previous STRIM is first applied to the PCH dataset in a simulation experiment. The experimental consideration suggests that the interim results by the previous STRIM can be used for inferring the original rules by use of a Hamming distance and a technique of a one-strike sketch. STRIM2 named after the revised STRIM is applied for the real-world dataset, Rakuten Travel dataset and draws results showing interesting suggestions.

2 Introduction of Decision-Making Processes

In statistics, a dataset \(U=\{u(i)|i=1,...,N=|U|\}\) is collected from a population of interest to estimate and/or infer properties and features of the population. Here, u(i) is an object with several attributes, whose properties and features contribute to the estimation and inference of the population. Let us denote an observation system by \(S=(U,A,V)\). Here, A is the set of an attribute and V is the set of the attribute’s values; that is, \(V=\bigcup \nolimits _{a \in A}V_{a}\) and \(V_{a}\) is the set of the value of attribute a. When randomly sampling u(i) from the population, each attribute becomes a random variable with the respective attribute value as its outcome.

Here, there are two main types of datasets, with a division between the response and explanatory variables and those without it. In the former case, the set of attributes A is denoted \(A=C \cup \{D\}\) to distinguish from the latter case. Here, D is a decision attribute and the response variable, and \(C=\{C(j)|j=1,...,|C|\}\) is the set of condition attribute C(j) and C(j) is also an explanatory variable for the response variable. If D and C(j) are qualitative variables, D represents the random variable of the class containing u(i) and is affected by the set C of the random variable C(j). This paper studies the former case dealing with qualitative variables based on the system \(S=(U,A=C \cup D,V)\) called the decision table in the Rough Sets theory.

Figure 1 outlines the data generation process. Randomly sampling u(i) from the population, the outcome of \(C=(C(1),...,C(|C|))\); that is, \(u^{C}(i)=(v_{C(1)}(i),\) \(...,v_{C(|C|)}(i))\) is obtained and becomes the input into the rule box. The rule box transforms \(u^{C}(i)\) into the output \(u^{D}(i)\) using the rule box’s pre-specified rules R(dk): if CP(dk) then \(D=d\) \((d=1,2,...,k=1,2,...)\) and the following partial correspondence hypothesis with the input modifying CCH shown in Table 1.

Partial correspondence hypothesis (PCH): The degree Dgr of \(u^{C}(i)\) for correspondence with the box’s pre-specified rules is estimated and the rule of the highest Dgr is applied for transforming \(u^{C}(i)\) into \(u^{D}(i)\). If there are several rules of ties, one of them is randomly determined in the same way as Hypothesis 3 in Table 1. PCH expands and generalizes three cases for \(u^{C}(i)\) in Table 1 for CCH, taking human decision-making into account. The observer in Fig. 1 records \(u(i)=(u^{C}(i),u^{D}(i))\). NoiseC and NoiseD are introduced to adapt the model for the real-world dataset. NoiseC adjusts the value of \(u^{C}(i)=(v_{C(1)}(i),...,v_{C(|C|)}(i))\) or makes \(v_{C(j)}(i)\) a missing value, and NoiseD adjusts the value of \(u^{D}(i)\).

Generating \(u^{C}(i)=(v_{C(1)}(i),...,v_{C(|C|)}(i))\) using random numbers and transforming it into \(u^{D}(i)\) using the model shown in Fig. 1, including PCH, \(U=\{u(i)=(u^{C}(i),u^{D}(i))|i=1,...,N=|U|\}\) can be obtained and applied to any rule induction method to investigate the extent to which the method applied induces the pre-specified rules.

Fig. 1.
figure 1

A simulation model for data generation and verification of induced rules. The rule box contains if-then rules R(dk): if CP(dk) then \(D=d\) \((d=1,2,...,k=1,2,...)\).

Table 1. Complete correspondence hypothesis with regard to the input.

3 Simulation Experiment by the Previous STRIM

We implemented the data generation process with PCH and the verification process applying the previous STRIM as follows: (1) Specified rules, for example, shown in Table 2 in the rule box in Fig. 1, where \(|C|=6\), \(V_{a}=\{1,2,...,6\}\) (\(a=C(j)(j=1,...,|C|),a=D)\), and \(CP(1,1)=110010\) denoted \(CP(1,1)=(C(1)=1) \bigwedge (C(2)=1) \bigwedge (C(5)=1)\) and was called a rule of the rule length 3 (\(RL=3\)) having three conditions. (2) Generated \(v_{C(j)}(i)\) \((j=1,...,|C|=6)\) with a uniform distribution and formed \(u^{C}(i)=(v_{C(1)}(i),...,v_{C(6)}(i))\) \((i=1,...,N=10,000)\). (3) Transformed \(u^{C}(i)\) into \(u^{D}(i)\) using the pre-specified rules in Table 2 and PCH, without generating NoiseC and NoiseD for a simple experiment. Here, Dgr was simply estimated by the sum of the number of the conditions satisfied for each rule. For example, if \(u^{C}(i)=112251\) then \(Dgr=2\) at R(1, 1), \(Dgr=1\) at R(1, 2), \(Dgr=0\) at R(2, 1), and so on. Accordingly, R(1, 1) or R(2, 2) having the highest \(Dgr=2\) were randomly selected. We will refer to the dataset generated based on the above procedures as the PCH dataset. We randomly sampled \(N_{B}=5,000\) data and formed a new dataset as the decision table.

We applied the previous STRIM [1,2,3,4,5,6,7,8] to the PCH dataset. Figure 2 shows an outline of the algorithm implementing the STRIM written in C-language style (details in [7, 8]). At \(LN=8-9\), for each decision attribute value di, the statistically independent condition attributes against di are reducted. At \(LN=10\), the function rule_check() (the body is at \(LN=19-33\)) systematically forms a trying rule based on the dimension rule[ ] (condition part of a rule CP). At \(LN=25\), we examine the degree of the validity for the trying rule by the z-value, which is the degree of bias in the frequency distribution of D supposing the standard normal distribution and is used to select the rule as a candidate. The selected candidates are finally arranged into the induced rules at \(LN=12\).

Table 3 shows examples of the results of the arranged rules for \(D=1\) and the part of those for \(D=2\) in descending order of z-values for each D. For example, the first row CP(1, 1) of the table means the following: The condition part of the induced rule is \((C(2)=1) \bigwedge (C(5)=1)\). The frequency distribution of D \(f=(n_{1},...,n_{6})\) satisfying the condition is (138, 3, 4, 4, 6, 6), which suggests the maximum frequency \(n_{d}\) of D is \(n_{d=1}=138\) and thus \(D=1\) is the decision part for the rule. The distribution of \(z=\frac{n_{d}+0.5-np_{d}}{(np_{d}(1-p_{d}))^{0.5}}\) obeys the standard normal distribution under the null hypothesis \(H_{0}\): CP is not a rule candidate (the alternative hypothesis \(H_{1}\): CP is a rule candidate) and the testing condition [13]: \(np_{d} \ge 5\) and \(n(1-p_{d}) \ge 5\), where \(n=\sum \nolimits _{m=1}^6 n_{m}\). The p-value corresponding to the z-value is the index of supporting \(H_{0}\), and the accuracy and the coverage are also shown in the table.

Table 3 shows that the previous STRIM doesn’t induce R(1, 1) of the pre-specified rules having \(RL=3\) in Table 2 but induces three rules CP(1, 1), CP(1, 3) and CP(1, 6) with \(RL=2\) including R(1, 1). Hereafter R(1, 1) is called a partial rule of them since it is a special case of them and conversely they are called a including rule of R(1, 1) respectively. The same results apply to R(1, 2) and applied to those for \(D=2,...,6\). Then, all the rule candidates for \(D=1\) were investigated as shown in Table 4 which shows CndCP to distinguish the CP in Table 3. Table 4 shows the following:  

(4-1):

The rules CndCP(1, 1), ..., CndCP(1, 6) with \(RL=2\) including R(1, 1) or R(1, 2) appear in descending order of z-values, which coincides with the CP in Table 3. They suggest us that a lot of inputs partially coinciding with the pre-specified rules by \(Dgr=2\) were transformed into the output by the use of their rules and PCH.

(4-2):

The CndCP(1, 7), ..., CndCP(1, 21) with \(RL=1\) including R(1, 1) or R(1, 2), or those straddling both rules with \(RL=\)2 appear in descending order of z-values. For example, the candidate CndCP(1, 10) with \(RL=2\) straddles both CndCP(1, 8) and CndCP(1, 7) of the rule including R(1, 1) and R(1, 2) respectively. They also suggest the same as that applied to (4-1) by \(Dgr=1\).

(4-3):

All CndCP(1, 7), ..., CndCP(1, 21) in Table 4 were arranged in Table 3, which was conducted at \(LN=12\) in Fig. 2. For example, CndCP(1, 10) is a partial rule of CndCP(1, 7) whereas the z-value of CndCP(1, 7) is larger than that of CndCP(1, 10). Accordingly, the previous STRIM made CndCP(1, 7) represent CndCP(1, 10) based on the index of z. In the same way, CndCP(1, 7) was represented by CndCP(1, 3). In this way, the previous STRIM arranged the rule candidates with inclusion relationships by their z-values.

 

The pre-specified rules R(1, 1) and R(1, 2) did not appear even as rule candidates respectively in Table 3 since each of them did not satisfy the testing condition at \(LN=24\). The following is a summary of the simulation studies using the previous STRIM for the PCH dataset:  

(1):

The previous STRIM can’t induce the pre-specified rules with longer rule lengths since the datasets partially corresponding with those rules will cause increased growth, and overwhelmingly covers those completely corresponding with them which is the PCH effects. As the result, it induces a lot of rules including the pre-specified rules.

(2):

In the case when N is not so large and the rule length of the pre-specified rules is long, the previous STRIM can’t adopt them even as a rule candidate.

 

Table 2. An example of pre-specified rules in the rule box.
Fig. 2.
figure 2

An algorithm for STRIM including a reduct function.

Table 3. Examples of finally induced rule using previous STRIM for the PCH dataset.
Table 4. Rule candidates for \(D=1\) induced by the previous STRIM for the PCH dataset.

4 Improved Algorithm Taking PCH into Account

The PCH effects derive a lot of including rules of the pre-specified rules as shown in (4-1) and (4-2) if the previous algorithm of STRIM is applied to the PCH dataset. In this section, we improve the algorithm based on the considerations obtained by the simulation experiment in Sect. 3. Figure 3 especially shows their relationships for the including rules of \(D=1\). For example, “110000(6)” denotes CndCP(6) in Table 4. The solid line connects each other with one Hamming distance (\(HD=1\)) which is considered to be the closest and solidest relationship since rule candidates derived from the pre-specified rules by the PCH effects as shown in (4-1) and (4-2). For example, one of the methods to estimate R(d, 1) or R(d, 2) is to make the groups of candidates connected to each other with \(HD=1\) in Table 4 and to make each group indicate the pre-specified rules for each \(D=d\) as follows:  

(Step1):

Truncate Table 4 in descending order of z-value until the candidate with \(RL=1\) having the least z-value.

(Step2):

Make the Hamming matrix (HM) having the (ij) element of the HD between CndCP(di) and CndCP(dj) by use of the truncated table. The HM is symmetric.

(Step3):

Make the groups with \(HD=1\) by using the HM and a one-stroke sketch, and estimate the pre-specified rules.

  In the case of \(D=1\), the last term of Table 4 to be truncated in (Step1) is CndCP(1, 14) and the HM obtained in (Step2) is Table 5 showing HM(ij) \((i,j=1,...,14)\). For example, the HM(1, 2) (\(=HM(2,1)\)) is the HD between \(CndCP(1,1)=010010\) and \(CndCP(1,2)=000101\) and is found to be \(HD=4\). The following is the specific procedures of (Step3) by the use of Table 5:  

(1):

Find the i-th element in Table 4 corresponding with \(CP(d=1,k)\) in Table 3 and the least j-th with \(HM(i,j)=1\). Reserve the i for the starting point i0.

(2):

Reset \(HM(i,j)=0\) and \(HM(j,i)=0\) to prevent a loop.

(3):

Substitute i with j.

(4):

If \(i=i0\) then go to (6) else go to (5).

(5):

Find the least j-th element with \(HM(i,j)=1\) if there are and go to (2), else go to (6).

(6):

If \(i=i0\) is satisfied then construct the pre-specified rule by use of the above sequence candidates else discard the sequence.

 

For example, execute procedure (1) by \(k=1\) in Table 3 then \(i=1\) is found in Table 4, \((i,j)=(1,7)\) is obtained and \(i0=1\) is the starting point in Table 5 since (1, 7) is the least j satisfying \(HM(1,j)=1\). Execute the procedures (2)–(5) and then the sequence of H(ij) is \((i,j)=(1,7) \rightarrow (7,3) \rightarrow (3,11) \rightarrow (11,6) \rightarrow (6,13) \rightarrow (13,1)\) and \(i=i0\) is satisfied. The sequence is proved to be the one-stroke sketch of the rule candidates with \(HD=1\) of R(1, 1) (trace the sequence in Fig. 3) and then R(1, 1) is reconstructed. In the same way, for \(k=2\), the sequence satisfying \(i=i0\): \((i,j)=(2,8) \rightarrow (8,5) \rightarrow (5,14) \rightarrow (14,4) \rightarrow (4,9) \rightarrow (9,2)\) is obtained and is proved to be that of R(1, 2) (see Fig. 3). The \(k=3\) in Table 3 derives R(1, 1). All of the k in Table 3 derives R(1, 1) and R(1, 2) by three respectively. The same applied to \(D=2,...,6\).

Adding to an algorithm implementing the above procedure under \(LN=12\) in Fig. 2, STRIM can adapt the PCH dataset and results in a new algorithm we call STRIM2.

Fig. 3.
figure 3

Derived rules from the pre-specified rules for \(D=1\) with one Hamming distance.

Table 5. Examples of Hamming distance against rule candidates for \(D=1\).

5 Another Type of Pre-specified Rule

In order to confirm the availability of the algorithm studied in Sect. 4, let us study it by modifying the rules in Table 2 like \(R(d,2)=00dd0d \rightarrow R(d,2)=0d0d0d\) \((d=1,...,6)\). Having the same condition attribute value like \(C(2)=d\) in R(d, 1) and R(d, 2) is the feature of the modified rules. Generating the PCH dataset based on the modified rules in Fig. 1, and applying STRIM2 to the dataset, Table 6 for \(D=1\) was obtained by arranging the interim results. Table 6 contains the set of CndCP(1, k) which is ordered in descending order of the z-value and truncated at the least z-value of the candidate with \(RL=1\) corresponding to the front side of Table 5, and the HM which corresponds to Table 5 and was constructed by the set of CndCP(1, k). Here, three CndCP(1, k) \((k=1,5,7)\) with an “\(*\)” are the candidates corresponding to CP(1, k) in Table 3.

In the same way as Table 5, STRIM2 induced the rules from Table 6 as follows: By use of \(CndCP(1,1)=010000(*1)\), the sequence: \((1,2) \rightarrow (2,9) \rightarrow (9,5) \rightarrow (5,8) \rightarrow (8,3) \rightarrow (3,1)\) induced \(010101=R(1,2)\) although CndCP(1, 1) is also the including rule of R(1, 1). In the same way, \(CndCP(1,5)=000101(*2)\) derived the sequence: \((5,8) \rightarrow (8,3) \rightarrow (3,1) \rightarrow (1,2) \rightarrow (2,9) \rightarrow \)(9,5) and induced \(010101=R(1,2)\). However, \(CndCP(1,7)=100010(*3)\) derived the sequence: \((7,10) \rightarrow (10,4) \rightarrow (4,1) \rightarrow (1,2) \rightarrow (2,9) \rightarrow (9,5) \rightarrow (5,8) \rightarrow (8,3) \rightarrow (3,1) \rightarrow (1,6) \rightarrow (6,12) \rightarrow (12,7)\) and induced “110111,” which was the compound of R(1, 1) and R(1, 2). Inspecting the sequence in detail, it started from CndCP(1, 7) of the including rule of R(1, 1), and on the way changed that of R(1, 1) into that of R(1, 2) like \((1,2) \rightarrow (2,9)\) and again changed into that of R(1, 1). That is why STRIM2 induced the compound rule. It should be noted that the case when STRIM2 cannot induce the pre-specified rules but the compound rules may happen in the case when they have more than two CP(dk) \((k=1,2,...)\) and the same condition attribute value like \(C(2)=d\) for the same decision attribute value, and/or their including rules are not separated from each other (see Fig. 3).

Table 6. Rule candidates and Hamming distance induced by STRIM2 for the dataset generated by the rules modifying Table 2.

6 Application of STRIM2 to a Real-World Dataset

The Rakuten Institute of Technology provides an open dataset of Rakuten Travel [14]. This dataset contains about 6, 200, 000 questionnaire survey ratings \(A=\{C(1)=\) Location, \(C(2)=\) Room, \(C(3)=\) Meal, \(C(4)=\) Bath (Hot Spring), \(C(5)=\) Service, C(6)=Amenity, \(D=\) Overall\(\}\) for about 130, 000 travel facilities using a set of categorical values \(V_{a}=\{\)Dissatisfied (1), Somewhat dissatisfied (2), Neither satisfied nor dissatisfied (3), Satisfied (4), Very Satisfied (5)\(\}\), \(\forall a \in A\), that is, \(|V_{a=D}|=|V_{a=C(j)}|=5\). We constructed a decision table of \(N=10,000\) surveys by randomly selecting 2, 000 samples, each with \(D=m\) \((m=1,...,5)\), from about 400, 000 surveys of the 2013–2014 dataset because there were heavy biases with respect to the frequency of \(D=m\). Finally, we randomly sampled \(N_{B}=5,000\) from the 10, 000 surveys and re-constructed the decision table.

We applied STRIM2 to the decision table and Table 7 shows the interim results corresponding to Table 3. The HM corresponding to Table 5 or Table 6 is omitted since its size is so large, for example, \(62 \times 62\) for \(D=1\). Table 8 shows the final results by STRIM2 obtained in the same procedures as the simulation experiments in Sects. 4 and 5. Here, the results are shown as CP2(dk) to distinguish the final from the interim. Although the Rakuten Travel dataset is not clear whether it obeys PCH or not, and no one knows the original rules since it is not a simulation experiment, Table 8 suggests the following based on the results obtained from the simulation experiments:  

(1):

For \(D=1\), both of CP(1, 1) and CP(1, 2) with \(RL=1\) induced the same rule CP2(1, 1) with \(RL=3\) respectively. That is, STRIM2 induced the partial rule CP2(1, 1) of CP(1, 1) and CP(1, 2) which represented CP2(1, 1) by use of the previous STRIM and moreover found another factor \(C(6)=1\) affecting \(D=1\). The result seems not to be so strange.

(2):

For \(D=2\), STRIM2 induced the same rule as CP(2, 1) of which accuracy is not so high to compare with the other rules. The frequency distribution of CP(2, 1) spreads widely from \(D=1\) to \(D=3\), which seemed to be caused by the hard decision of “Somewhat dissatisfied.” Accordingly, it is supposed that the original rule of \(D=2\) could not make the one-strike sketch by the including rules with \(RL=1\)

(3):

STRIM2 induced CP2(3, 1) with \(RL=4\) from CP(3, 1) with \(RL=2\) and CP2(4, 1) with \(RL=3\) from CP(4, 1) with \(RL=2\), which seems not to be so strange taking the simulation studies into account.

(4):

STRIM2 induced CP2(5, 1) with \(RL=3\) from CP(5, 1) with \(RL=1\) and CP2(5, 2) with \(RL=4\) from CP(5, 2) with \(RL=2\), and the former rule includes the latter, which remind us of the studies in Sect. 5. However, STRIM2 suggested that the factors: \(C(2)=5\), \(C(3)=5\), \(C(5)=5\), \(C(6)=5\) have an important effect on \(D=5\) while the previous STRIM indicates only the partial effect.

 

Table 7. Induced interim rules from Rakuten Travel dataset by STRIM2.
Table 8. Induced final rules from Rakuten Travel dataset by STRIM2.

7 Conclusion

This paper experimentally studied an algorithm to adapt PCH datasets and improved the previous STRIM. Specifically, this paper focused on rule candidates derived by the STRIM, proposed a method to group them by the solid relationship of a one-stroke sketch having one Hamming distance (\(HD=1\)) and made the groups estimate the pre-specified rules. STRIM incorporating this function was named STRIM2 which clarified its performance and cautions for use by applying it in two typical simulation experiments. STRIM2 was applied to a real-world dataset, that is, Rakuten Travel dataset and the induced rules were considered from the view of those studied by the simulation so that the results were roughly shown to be valid and were full of interesting suggestions although no one knew the pre-specified rules and the domain-knowledge was needed for the review.