1 Introduction

In this paper, data sets with missing attribute values are mined using probabilistic approximations. The probabilistic approximation, with a probability \(\alpha \), is an extension of a standard approximation, a basic idea of rough set theory. If \(\alpha = 1\), the probabilistic approximation becomes the lower approximation; for very small, positive \(\alpha \), the probabilistic approximation is identical with the upper approximation. The idea of the probabilistic approximation was introduced in [20] and further developed in [19, 22–24].

Data sets with missing attribute values need special kinds of approximations, called singleton, subset and concept [12, 13]. Such approximations were generalized to singleton, subset and concept probabilistic approximations in [15]. The first experiments on probabilistic approximations were presented in [1]. In experiments reported in this paper, we used all three kinds of probabilistic approximations: singleton, subset and concept.

In this paper, missing attribute values may be interpreted in two different ways: as attribute-concept values or as “do not care” conditions. Attribute-concept values, introduced in [14], are typical values for a given concept. For example, if the concept is a set of people sick with flu, and the value of the attribute Temperature is missing for some person who is sick with flu, then under this interpretation we would consider typical values of Temperature for other people sick with flu, such as high and very high. A “do not care” condition is interpreted as if the original attribute value were irrelevant: we may replace it by any existing attribute value [8, 17, 21].

The first experiments on data sets with missing attribute values interpreted as lost values and “do not care” conditions, with 35 % of missing attribute values, were reported in [7]. Research on data with missing attribute values interpreted as attribute-concept values and “do not care” conditions was presented in [26]. In [6], two imputation methods for missing attribute values were compared with rough-set approaches based on two interpretations of missing attribute values, lost values and “do not care” conditions, combined with singleton, subset and concept probabilistic approximations. It was shown that the rough-set approaches were better than imputation for five out of six data sets, and that the smallest error rate was associated with data sets with lost values. In [3], experiments concerned the error rate, computed by ten-fold cross validation, for mining data sets with attribute-concept values and “do not care” conditions using only three probabilistic approximations: lower, middle (with \(\alpha = 0.5\)) and upper. Results were not conclusive: in four cases attribute-concept values were better, in two cases “do not care” conditions were better, and in the remaining 18 cases the differences between the two were statistically insignificant. In [4], the error rate was evaluated for data sets with many missing attribute values, interpreted as attribute-concept values and “do not care” conditions. In two cases “do not care” conditions were better, in one case attribute-concept values were better, and in the remaining three cases the differences were statistically insignificant.

With inconclusive results of experiments on the error rate, the question is which interpretation of missing attribute values is associated with the smaller complexity of rule sets. In [2], experiments on the complexity of rule sets induced from data sets with attribute-concept values and “do not care” conditions, using lower, middle and upper approximations, were presented. For half of the cases the number of rules was smaller for attribute-concept values; the same held for the total number of rule conditions. Results on the choice of the best type of probabilistic approximation (singleton, subset or concept) were inconclusive. In [5], experiments were again focused on the complexity of rule sets, this time for data sets with 35 % of attribute-concept values and “do not care” conditions. For 13 combinations (out of 24) attribute-concept values were associated with simpler rule sets and for five combinations “do not care” conditions were better; results for the total number of rule conditions were similar.

The difference in performance between the two interpretations of missing attribute values, attribute-concept values and “do not care” conditions, is clearer for data sets with many missing attribute values. Hence the results of this paper are more conclusive than those of our previous research.

Thus, our first objective was to check which interpretation of missing attribute values should be used to induce simpler rule sets, in terms of the number of rules and the total number of rule conditions, from data sets with many attribute-concept values and “do not care” conditions, using the Modified Learning from Examples Module version 2 (MLEM2) system for rule induction [11]. There is some evidence that “do not care” conditions are better. Our secondary objective was to test which of the three probabilistic approximations used for rule induction, singleton, subset or concept, yields simpler rule sets. The best choice is the subset probabilistic approximation; the singleton probabilistic approximation is the worst.

2 Incomplete Data

In this paper, the input data sets are in the form of a decision table. Rows of a decision table represent cases and columns represent variables; the set of all cases is denoted by U. The dependent variable d is called the decision and the independent variables are called attributes. The set of all attributes is denoted by A. The value of attribute a for a specific case x is denoted by a(x).

There are multiple ways to represent missing attribute values; in this paper we distinguish two interpretations. The first, attribute-concept values, are denoted by −; the second, “do not care” conditions, are denoted by \(*\).

One of the most important ideas of rough set theory [18] is an indiscernibility relation, defined for complete data sets. Let B be a nonempty subset of A. The indiscernibility relation R(B) is a relation on U defined for \(x, y \in U\) as follows:

$$\begin{aligned} (x, y) \in R(B) ~\mathrm{if~and~only~if}~ \forall a \in B \ (a(x) = a(y)). \end{aligned}$$

The indiscernibility relation R(B) is an equivalence relation. Equivalence classes of R(B) are called elementary sets of B and are denoted by \([x]_B\). A subset of U is called B-definable if it is a union of elementary sets of B.
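
To make the definition concrete, the following minimal Python sketch computes the elementary sets of B for a complete decision table. The representation of a case as a dictionary mapping attribute names to values is our illustrative assumption, not part of the formal definition.

```python
from typing import Dict, FrozenSet, List

Case = Dict[str, str]  # a case maps each attribute name to its value

def elementary_sets(U: List[Case], B: List[str]) -> List[FrozenSet[int]]:
    """Partition the indices of U into equivalence classes of R(B): two
    cases belong to the same class iff they agree on every attribute in B."""
    classes: Dict[tuple, set] = {}
    for i, x in enumerate(U):
        profile = tuple(x[a] for a in B)  # the values of case x on B
        classes.setdefault(profile, set()).add(i)
    return [frozenset(s) for s in classes.values()]
```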

The set X of all cases defined by the same value of the decision d is called a concept. The largest B-definable set contained in X is called the B-lower approximation of X, denoted by \(\underline{appr}_B(X)\), and defined as follows

$$\begin{aligned} \cup \{[x]_B \mid [x]_B \subseteq X\}, \end{aligned}$$

while the smallest B-definable set containing X, denoted by \(\overline{appr}_B(X)\) is called the B-upper approximation of X, and is defined as follows

$$\begin{aligned} \cup \{[x]_B \mid [x]_B \cap X \ne \emptyset \}. \end{aligned}$$
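
Both approximations can then be computed directly from the elementary sets, as in the following sketch (continuing the representation assumed above, with X given as a set of case indices):

```python
def lower_approximation(U: List[Case], B: List[str], X: set) -> set:
    """Union of the elementary sets of B that are contained in X."""
    return set().union(*(E for E in elementary_sets(U, B) if E <= X))

def upper_approximation(U: List[Case], B: List[str], X: set) -> set:
    """Union of the elementary sets of B that intersect X."""
    return set().union(*(E for E in elementary_sets(U, B) if E & X))
```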

For a variable a and its value v, (a, v) is called a variable-value pair. For a complete data set, the block of (a, v), denoted by [(a, v)], is the set \(\{ x \in U \mid a(x) = v\}\) [9]. However, for incomplete data sets, with missing attribute values, the definition of a block of an attribute-value pair is modified in the following way.

  • For an attribute a for which there exists a case x such that \(a(x) = -\), the case x should be included in the blocks [(a, v)] for all specified values \(v \in V(x, a)\) of attribute a, where

    $$\begin{aligned} V(x, a) = \{a(y) \mid a(y) ~\mathrm{is~specified}, \ y \in U, \ d(y) = d(x)\}, \end{aligned}$$
  • For an attribute a for which there exists a case x such that \(a(x) = *\), the case x should be included in the blocks [(a, v)] for all specified values v of attribute a.

For a case \(x \in U\) and \(B \subseteq A\), the characteristic set \(K_B(x)\) is defined as the intersection of the sets K(x, a) for all \(a \in B\), where the set K(x, a) is defined in the following way.

  • If a(x) is specified, then K(x, a) is the block [(a, a(x))] of attribute a and its value a(x),

  • If \(a(x) = -\), then the corresponding set K(x, a) is equal to the union of all blocks of attribute-value pairs (a, v), where \(v \in V(x, a)\), if V(x, a) is nonempty. If V(x, a) is empty, \(K(x, a) = U\),

  • If \(a(x) = *\), then the set \(K(x, a) = U\), where U is the set of all cases.
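
The computation of blocks and characteristic sets can be sketched as follows. The encoding of attribute-concept values as '-' and “do not care” conditions as '*', and the helper names, are our assumptions for illustration.

```python
def typical_values(U, a, i, d):
    """V(x, a): the specified values of attribute a among cases whose
    decision equals d(x), where x is the case with index i."""
    dx = U[i][d]
    return {y[a] for y in U if y[d] == dx and y[a] not in ('-', '*')}

def blocks(U, attributes, d):
    """Blocks [(a, v)] for incomplete data: a case with a(x) = '-' joins
    the blocks of all values in V(x, a); a case with a(x) = '*' joins
    every block of attribute a."""
    blk = {}
    for a in attributes:
        values = {x[a] for x in U if x[a] not in ('-', '*')}
        for v in values:
            blk[(a, v)] = {i for i, x in enumerate(U) if x[a] == v}
        for i, x in enumerate(U):
            if x[a] == '-':
                for v in typical_values(U, a, i, d):
                    blk[(a, v)].add(i)
            elif x[a] == '*':
                for v in values:
                    blk[(a, v)].add(i)
    return blk

def characteristic_set(U, blk, B, i, d):
    """K_B(x): intersection of the sets K(x, a) over all a in B."""
    K = set(range(len(U)))
    for a in B:
        v = U[i][a]
        if v == '*':
            Ka = set(range(len(U)))          # K(x, a) = U
        elif v == '-':
            Vxa = typical_values(U, a, i, d)
            Ka = (set().union(*(blk[(a, w)] for w in Vxa))
                  if Vxa else set(range(len(U))))
        else:
            Ka = blk[(a, v)]
        K &= Ka
    return K
```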

3 Lower and Upper Approximations

We quote some definitions from [16]. Let X be a subset of U and let B be a subset of the set A of all attributes. The B-singleton lower approximation of X, denoted by \(\underline{appr}_B^{singleton}(X)\), is defined as follows

$$\begin{aligned} \{x \ | \ x \in U, K_B(x) \subseteq X \}. \end{aligned}$$

The B-singleton upper approximation of X, denoted by \(\overline{appr}_B^{singleton}(X)\), is defined as follows

$$\begin{aligned} \{ x \ | \ x \in U, K_B(x) \cap X \ne \emptyset \}. \end{aligned}$$

The B-subset lower approximation of X, denoted by \(\underline{appr}_B^{subset}(X)\), is defined as follows

$$\begin{aligned} \cup \ \{ K_B(x) \ | \ x \in U, K_B(x) \subseteq X \}. \end{aligned}$$

The B-subset upper approximation of X, denoted by \(\overline{appr}_B^{subset}(X)\), is defined as follows

$$\begin{aligned} \cup \ \{ K_B(x) \ | \ x \in U, K_B(x) \cap X \ne \emptyset \}. \end{aligned}$$

The B-concept lower approximation of X, denoted by \(\underline{appr}_B^{concept}(X)\), is defined as follows

$$\begin{aligned} \cup \ \{ K_B(x) \ | \ x \in X, K_B(x) \subseteq X \}. \end{aligned}$$

The B-concept upper approximation of X, denoted by \(\overline{appr}_B^{concept}(X)\), is defined as follows

$$\begin{aligned} \cup \{ K_B(x) \ | \ x \in X, K_B(x) \cap X \ne \emptyset \} = \cup \{K_B(x) \ | \ x \in X \}. \end{aligned}$$
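
A minimal sketch of these six definitions, assuming the characteristic sets have been precomputed into a list K indexed by case (so K[i] is \(K_B(x_i)\)) and X is given as a set of case indices:

```python
def singleton_lower(U, K, X):
    return {i for i in range(len(U)) if K[i] <= X}

def singleton_upper(U, K, X):
    return {i for i in range(len(U)) if K[i] & X}

def subset_lower(U, K, X):
    return set().union(*(K[i] for i in range(len(U)) if K[i] <= X))

def subset_upper(U, K, X):
    return set().union(*(K[i] for i in range(len(U)) if K[i] & X))

def concept_lower(U, K, X):
    return set().union(*(K[i] for i in X if K[i] <= X))

def concept_upper(U, K, X):
    # K_B(x) intersects X for every x in X, since x belongs to both
    return set().union(*(K[i] for i in X))
```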

4 Probabilistic Approximations

The B-singleton probabilistic approximation of X with the threshold \(\alpha \), \(0 < \alpha \le 1\), denoted by \(appr_{\alpha , B}^{singleton}(X)\), is defined as follows

$$\begin{aligned} \{x \ | \ x \in U, \ Pr(X|K_B(x)) \ge \alpha \}, \end{aligned}$$

where \(Pr(X|K_B(x)) = \frac{|X \cap K_B(x)|}{| K_B(x)|}\) is the conditional probability of X given \( K_B(x)\).

A B-subset probabilistic approximation of the set X with the threshold \(\alpha \), \(0 < \alpha \le 1\), denoted by \(appr_{\alpha , B}^{subset} (X)\), is defined as follows

$$\begin{aligned} \cup \{ K_B(x) \ | \ x \in U, \ Pr(X| K_B(x)) \ge \alpha \}. \end{aligned}$$

A B-concept probabilistic approximation of the set X with the threshold \(\alpha \), \(0 < \alpha \le 1\), denoted by \(appr_{\alpha , B}^{concept} (X)\), is defined as follows

$$\begin{aligned} \cup \{ K_B(x) \ | \ x \in X, \ Pr(X| K_B(x)) \ge \alpha \}. \end{aligned}$$
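
All three probabilistic approximations reduce to one conditional-probability test, as the following sketch shows (same representation as in the sketches above; alpha is the threshold):

```python
def conditional_probability(X, KBx):
    """Pr(X | K_B(x)) = |X intersect K_B(x)| / |K_B(x)|."""
    return len(X & KBx) / len(KBx)

def singleton_probabilistic(U, K, X, alpha):
    return {i for i in range(len(U))
            if conditional_probability(X, K[i]) >= alpha}

def subset_probabilistic(U, K, X, alpha):
    return set().union(*(K[i] for i in range(len(U))
                         if conditional_probability(X, K[i]) >= alpha))

def concept_probabilistic(U, K, X, alpha):
    return set().union(*(K[i] for i in X
                         if conditional_probability(X, K[i]) >= alpha))
```

For example, singleton_probabilistic(U, K, X, 1) coincides with the B-singleton lower approximation, and for alpha equal to the smallest positive conditional probability occurring in the data it coincides with the B-singleton upper approximation.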

In general, all three probabilistic approximations are distinct, even for the same value of the parameter \(\alpha \). Additionally, if for a given set X a probabilistic approximation \(appr_{\beta }(X)\) is not listed among the computed approximations, then \( appr_{\beta }(X)\) is equal to the closest probabilistic approximation \(appr_{\alpha }(X)\) of the same type with \(\alpha \) greater than or equal to \(\beta \).

If the characteristic relation R(B) is an equivalence relation, all three types of probabilistic approximations (singleton, subset and concept) reduce to the same probabilistic approximation.

Fig. 1. Number of rules for the breast cancer data set

Fig. 2. Number of rules for the echocardiogram data set

Fig. 3. Number of rules for the hepatitis data set

Fig. 4. Number of rules for the image segmentation data set

Fig. 5. Number of rules for the lymphography data set

Fig. 6. Number of rules for the wine recognition data set

Fig. 7. Total number of conditions for the breast cancer data set

Fig. 8. Total number of conditions for the echocardiogram data set

Fig. 9. Total number of conditions for the hepatitis data set

Fig. 10. Total number of conditions for the image segmentation data set

Fig. 11. Total number of conditions for the lymphography data set

Fig. 12. Total number of conditions for the wine recognition data set

5 Experiments

Our experimental data sets are based on six data sets available from the University of California at Irvine Machine Learning Repository. Basic information about these data sets is presented in Table 1.

Table 1. Data sets used for experiments

Incomplete data sets were produced from the base data by creating a set of templates. Template creation begins with no missing values; then 5 % of the existing specified attribute values are randomly replaced with attribute-concept values, then an additional 5 %, and so on. The process continues until some row of the decision table would consist entirely of missing attribute values. At each increment, up to three attempts were made to randomly replace specified values with missing values; if an attempt succeeds, a new data set with an extra 5 % of missing values is created, otherwise the process stops. A sketch of this procedure is given below. To produce the “do not care” condition data sets, the same templates were used, replacing − with \(*\).
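
The following Python sketch is our reading of the template-generation procedure described above; the cell-level random sampling and the exact form of the stopping test are assumptions for illustration, not code from the original study.

```python
import random

def make_templates(rows, attributes, step=0.05, attempts=3, seed=0):
    """Replace specified values with '-' in 5 % increments, stopping when
    no attempt leaves every case with at least one specified value."""
    rng = random.Random(seed)
    cells = [(i, a) for i in range(len(rows)) for a in attributes]
    n_step = max(1, int(step * len(cells)))
    templates, current = [], [dict(r) for r in rows]
    while True:
        for _ in range(attempts):
            candidate = [dict(r) for r in current]
            specified = [(i, a) for i, a in cells if candidate[i][a] != '-']
            if len(specified) < n_step:
                return templates
            for i, a in rng.sample(specified, n_step):
                candidate[i][a] = '-'
            # accept only if no row consists entirely of missing values
            if all(any(r[a] != '-' for a in attributes) for r in candidate):
                current = candidate
                templates.append(candidate)
                break
        else:
            return templates

# A "do not care" template is obtained by replacing every '-' with '*'.
```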

In this paper, data sets with many missing attribute values are studied. We chose the maximum percentage of missing attribute values that could be synthesized; for this research, “many” is defined as more than 40 % of the values being replaced. As shown in Table 1, the maximum percentage of missing values ranges between 40.15 % and 69.89 %.

The Modified Learning from Examples Module version 2 (MLEM2) rule induction algorithm was used for our experiments [11]. MLEM2 is a component of the Learning from Examples based on Rough Sets (LERS) data mining system [10]. Results of our experiments are presented in Figs. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12.

First, we compared the two interpretations of missing attribute values, attribute-concept values and “do not care” conditions, with respect to the number of rules in a rule set. For every data set type, separately for singleton, subset and concept probabilistic approximations, the Wilcoxon matched-pairs signed-rank test was used (two-tailed, with a 5 % level of significance). With six data set types and three approximation types, the total number of combinations was 18.
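
A test of this kind can be run, for instance, with SciPy. The rule counts below are placeholders standing in for one data set type and one approximation type, not the paper's measured values:

```python
from scipy.stats import wilcoxon

# Numbers of rules for matched templates (placeholder values),
# one entry per percentage of missing attribute values.
rules_acv = [42, 55, 61, 70, 78, 85, 91, 98]   # attribute-concept values
rules_dnc = [40, 50, 58, 66, 75, 80, 88, 95]   # "do not care" conditions

stat, p = wilcoxon(rules_acv, rules_dnc, alternative='two-sided')
print('significant' if p < 0.05 else 'not significant', 'p = %.4f' % p)
```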

For the number of rules in a rule set, the “do not care” condition interpretation of missing attribute values was the best for five combinations and the attribute-concept value interpretation was the best for two; for the remaining 11 combinations the difference was not statistically significant. Similarly, for the total number of conditions in a rule set, this number was smaller for “do not care” conditions for 11 combinations and for attribute-concept values for two; for the remaining five combinations the difference was not statistically significant.

Next, for a given interpretation of missing attribute values, we compared all three types of probabilistic approximations in terms of the number of rules and the total number of conditions in a rule set, using multiple comparisons based on Friedman’s nonparametric test. Here, with six types of data sets and two interpretations of missing attribute values, the total number of combinations was 12. For the number of rules, the smallest number was associated with the subset probabilistic approximation for three combinations, with one tie between the subset and concept probabilistic approximations; for the remaining combinations the difference was not statistically significant. The singleton probabilistic approximation was never a winner. For the total number of rule conditions, the smallest number was again associated with the subset probabilistic approximation, this time for six combinations, with one tie between the subset and concept probabilistic approximations; for the remaining combinations the difference was not statistically significant. Again, the singleton probabilistic approximation was never a winner.
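
The corresponding omnibus test is also available in SciPy; again, the counts are placeholders, and a post-hoc pairwise procedure is still needed to rank the three approximation types:

```python
from scipy.stats import friedmanchisquare

# Numbers of rules per alpha value for one data set type (placeholder values).
singleton = [101, 97, 95, 90, 88, 85]
subset    = [ 80, 78, 75, 73, 70, 68]
concept   = [ 85, 82, 79, 76, 74, 71]

stat, p = friedmanchisquare(singleton, subset, concept)
print('Friedman chi-square = %.2f, p = %.4f' % (stat, p))
```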

6 Conclusions

As follows from our experiments, there is some evidence that the number of rules and the total number of conditions are smaller for “do not care” conditions than for attribute-concept values. Additionally, the best probabilistic approximation that should be used for rule induction from data with many attribute-concept values and “do not care” conditions is the subset probabilistic approximation. On the other hand, the singleton probabilistic approximation is the worst.