
1 Introduction

Incomplete data sets are affected by missing attribute values. In this paper, we consider an interpretation of missing attribute values called a “do not care” condition. According to this interpretation, a missing attribute value may be replaced by any specified value of the corresponding attribute.

For rule induction we use probabilistic approximations, a generalization of the idea of lower and upper approximations known in rough set theory. A probabilistic approximation of the concept X is associated with a probability \(\alpha \); if \(\alpha \) = 1, the probabilistic approximation becomes the lower approximation of X; if \(\alpha \) is a small positive number, e.g., 0.001, the probabilistic approximation is reduced to the upper approximation of X. Usually, probabilistic approximations are applied to completely specified data sets [18, 20,21,22,23,24,25,26,27]. Such approximations were generalized to incomplete data sets using characteristic sets in [13, 14] and using maximal consistent blocks in [1, 2].

Missing attribute values are usually categorized into lost values and “do not care” conditions. A lost value, denoted by “?”, is unavailable for the process of data mining, while a “do not care” condition, denoted by “*”, represents any value of the corresponding attribute.

Recently, two new types of approximations were introduced, global probabilistic approximations in [3] and saturated probabilistic approximations in [8]. Results of experiments on an error rate, evaluated by ten-fold cross validation, were presented for characteristic sets in [6,7,8] and for maximal consistent blocks in [1, 2]. In these experiments, global and saturated probabilistic approximations based on characteristic sets were explored using data sets with lost values and “do not care” conditions. Results show that among these four methods there is no universally best method.

The main objective of this paper is a comparison of four approaches to mining data, using two probabilistic approximations, global and saturated, based on two granules, characteristic sets and maximal consistent blocks, in terms of an error rate evaluated by ten-fold cross validation.

Rule induction was conducted using a new version of the Modified Learning from Examples Module, version 2 (MLEM2) [5, 12]. The MLEM2 algorithm is a component of the Learning from Examples using Rough Sets (LERS) data mining system [4, 11, 12].

2 Incomplete Data

We assume that the input data sets are presented in the form of a decision table. An example of a decision table is shown in Table 1. Rows of the decision table represent cases, while columns are labeled by variables. The set of all cases will be denoted by U. In Table 1, U = {1, 2, 3, 4, 5, 6, 7, 8}. Independent variables are called attributes and a dependent variable is called a decision and is denoted by d. The set of all attributes will be denoted by A. In Table 1, A = {Temperature, Wind, Humidity} and d is Trip. The value for a case x and an attribute a will be denoted by a(x). For example, Temperature(1) = normal.

Table 1. A decision table

The set X of all cases defined by the same value of the decision d is called a concept. For example, a concept associated with the value yes of the decision Trip is the set {1, 2, 3}.

A block of the attribute-value pair (a, v), denoted by [(a, v)], is the set \(\{ x \in U \mid a(x) = v\}\) [10]. For incomplete decision tables, the definition of a block of an attribute-value pair is modified in the following way:

  • if for an attribute a and a case x we have \(a(x) = \ ?\), the case x should not be included in any blocks [(a, v)] for all values v of attribute a;

  • if for an attribute a and a case x we have \(a(x) = *\), the case x should be included in blocks [(a, v)] for all specified values v of attribute a.

For the data set from Table 1, the blocks of attribute-value pairs are:

figure a
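The two rules above translate directly into code. The following Python sketch (an illustration only, not the LERS implementation; the three-case table in it is invented for the example and is not Table 1) computes the blocks [(a, v)] of an incomplete decision table.

```python
# A minimal sketch (not the LERS/MLEM2 implementation) of computing the blocks
# [(a, v)] of an incomplete decision table; "?" is a lost value, "*" is a
# "do not care" condition.

def attribute_value_blocks(table, attributes):
    """table maps a case to a dict of attribute values; returns (a, v) -> set of cases."""
    # the specified (non-missing) values of every attribute
    specified = {a: {row[a] for row in table.values() if row[a] not in ("?", "*")}
                 for a in attributes}
    blocks = {(a, v): set() for a in attributes for v in specified[a]}
    for case, row in table.items():
        for a in attributes:
            value = row[a]
            if value == "?":              # lost value: the case enters no block of a
                continue
            if value == "*":              # "do not care": the case enters every block of a
                for v in specified[a]:
                    blocks[(a, v)].add(case)
            else:                         # specified value: exactly one block of a
                blocks[(a, value)].add(case)
    return blocks


# An invented three-case table (not Table 1), just to show the two rules in action.
toy = {1: {"Temperature": "normal", "Wind": "*"},
       2: {"Temperature": "?",      "Wind": "low"},
       3: {"Temperature": "high",   "Wind": "high"}}
blocks = attribute_value_blocks(toy, ["Temperature", "Wind"])
# e.g. blocks[("Wind", "low")] == {1, 2}, because Wind(1) = * and Wind(2) = low
```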

For a case \(x \in U\) and \(B \subseteq A\), the characteristic set \(K_B(x)\) is defined as the intersection of the sets K(x, a), for all \(a \in B\), where the set K(x, a) is defined in the following way:

  • if a(x) is specified, then K(x, a) is the block [(a, a(x))] of attribute a and its value a(x);

  • if \(a(x) = \ ?\) or \(a(x) = *\), then \(K(x, a) = U\).

For Table 1 and \(B = A\),

$$\begin{aligned}&K_A(1) = \{1, 3, 4\}, \ \ K_A(2) = \{2\}, \ \ K_A(3) = \{1, 3, 4, 5, 7\}, \ \ K_A(4) = \{1, 3, 4\}, \\&K_A(5) = \{1, 4, 5, 6\}, \ \ K_A(6) = \{3, 6, 7\}, \ \ K_A(7) = \{3, 6, 7\}, \ \ K_A(8) = \{4, 5, 7, 8\}. \end{aligned}$$
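A characteristic set is an intersection of blocks and, for missing values, of the whole universe U. The sketch below (again only an illustration, reusing the hypothetical attribute_value_blocks helper from the previous sketch) follows the two cases of the definition.

```python
def characteristic_set(x, table, attributes, blocks, universe):
    """K_B(x): the intersection of the sets K(x, a) over all attributes a in the set B."""
    result = set(universe)
    for a in attributes:
        value = table[x][a]
        if value in ("?", "*"):
            k = set(universe)            # K(x, a) = U for a missing value
        else:
            k = blocks[(a, value)]       # K(x, a) = [(a, a(x))]
        result &= k
    return result


# With the toy table from the previous sketch:
# characteristic_set(2, toy, ["Temperature", "Wind"], blocks, {1, 2, 3}) == {1, 2}
```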

A binary relation R(B) on U, defined for \(x, y \in U\) in the following way

$$ (x, y) \in R(B) \ \text { if and only if } \ y \in K_B(x) $$

will be called the characteristic relation. In our example R(A) = {(1, 1), (1, 3), (1, 4), (2, 2), (3, 1), (3, 3), (3, 4), (3, 5), (3, 7), (4, 1), (4, 3), (4, 4), (5, 1), (5, 4), (5, 5), (5, 6), (6, 3), (6, 6), (6, 7), (7, 3), (7, 6), (7, 7), (8, 4), (8, 5), (8, 7), (8, 8)}.
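In the same illustrative setting, the characteristic relation can be read off directly from the characteristic sets; assuming a dictionary K that maps every case x to \(K_A(x)\), one line suffices.

```python
# The characteristic relation R(A) as a set of ordered pairs,
# assuming (hypothetically) that K maps every case x to K_A(x).
R = {(x, y) for x in K for y in K[x]}
```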

We quote some definitions from [1]. Let X be a subset of U. The set X is B-consistent if \((x, y) \in R(B)\) for any \(x, y \in X\). If there does not exist a B-consistent subset Y of U such that X is a proper subset of Y, the set X is called a generalized maximal B-consistent block. The set of all generalized maximal B-consistent blocks will be denoted by \(\mathscr {C}(B)\). In our example, \(\mathscr {C}(A)\) = {{1, 3, 4}, {2}, {3, 7}, {5}, {6, 7}, {8}}.

Let \(B \subseteq A\) and \(Y \in \mathscr {C}(B)\). The set of all generalized maximal B-consistent blocks which include an element x of the set U, i.e. the set

$$ \{Y | Y \in \mathscr {C}(B), x \in Y \} $$

will be denoted by \(\mathscr {C}_B(x)\).

For data sets in which all missing attribute values are “do not care” conditions, an idea of a maximal consistent block of B was defined in [19]. Note that in our definition, the generalized maximal consistent blocks of B are defined for arbitrary interpretations of missing attribute values. For Table 1, the generalized maximal A-consistent blocks \(\mathscr {C}_A(x)\) are

$$\begin{aligned}&\mathscr {C}_A(1) = \{\{1, 3, 4\}\}, \ \ \mathscr {C}_A(2) = \{\{2\}\}, \ \ \mathscr {C}_A(3) = \{\{1, 3, 4\}, \{3, 7\}\}, \ \ \mathscr {C}_A(4) = \{\{1, 3, 4\}\}, \\&\mathscr {C}_A(5) = \{\{5\}\}, \ \ \mathscr {C}_A(6) = \{\{6, 7\}\}, \ \ \mathscr {C}_A(7) = \{\{3, 7\}, \{6, 7\}\}, \ \ \mathscr {C}_A(8) = \{\{8\}\}. \end{aligned}$$
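For a table of this size, the generalized maximal consistent blocks can be found by brute force. The sketch below (an illustration only; it is exponential in |U| and therefore unsuitable for large data sets) tests every subset of U for consistency with respect to the characteristic sets and keeps the maximal consistent subsets; applied to the characteristic sets listed above, it returns exactly the collection \(\mathscr {C}(A)\) given earlier.

```python
from itertools import combinations

def generalized_maximal_consistent_blocks(universe, char_sets):
    """char_sets maps a case x to K_A(x); a set S is A-consistent iff y is in K_A(x)
    for all x, y in S.  Returns the consistent subsets of U that are maximal."""
    def consistent(subset):
        return all(y in char_sets[x] for x in subset for y in subset)

    candidates = [set(c) for r in range(1, len(universe) + 1)
                  for c in combinations(sorted(universe), r) if consistent(c)]
    # keep only the consistent sets that are not proper subsets of another one
    return [c for c in candidates if not any(c < d for d in candidates)]
```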

3 Probabilistic Approximations

In this section, we will discuss two types of probabilistic approximations: global and saturated.

Fig. 1. The bankruptcy data set

Fig. 2. The breast cancer data set

Fig. 3. The echocardiogram data set

Fig. 4. The hepatitis data set

Fig. 5. The image segmentation data set

Fig. 6. The iris data set

Fig. 7. The lymphography data set

Fig. 8. The wine recognition data set

3.1 Global Probabilistic Approximations Based on Characteristic Sets

An idea of the global probabilistic approximation, restricted to lower and upper approximations, was introduced in [16, 17], and presented in a general form in [3]. Let X be a concept, \(X \subseteq U\). A B-global probabilistic approximation of the concept X, based on characteristic sets, with the parameter \(\alpha \) and denoted by \(appr^{global}_{\alpha , B}(X)\) is defined as the following set

$$\begin{aligned} \bigcup \ \{K_B(x) \ | \ x \in Y \}, \ \ \text { where } \ Y \subseteq U \ \text { and } \ Pr(X|K_B(x)) \ge \alpha \ \text { for all } \ x \in Y.\end{aligned}$$
(1)

Obviously, for some sets B and X and the parameter \(\alpha \), there exist many B-global probabilistic approximations of X. In addition, an algorithm for computing B-global probabilistic approximations is of exponential computational complexity. Therefore, in our experiments we used a heuristic version of the definition of B-global probabilistic approximation, called a MLEM2 B-global probabilistic approximation of the concept X, associated with a parameter \(\alpha \) and denoted by \(appr^{mlem2}_{\alpha , B}(X)\) [3]. This definition is based on the rule induction algorithm MLEM2 [12]. The MLEM2 algorithm is used in the Learning from Examples using Rough Sets (LERS) data mining system [4, 11, 12]. The approximation \(appr^{mlem2}_{\alpha , B}(X)\) is constructed from the characteristic sets \(K_B(y)\), \(y \in U\), that are the most relevant to the concept X, i.e., with \(|X \cap K_B(y)|\) as large as possible and \(Pr(X|K_B(y)) \ge \alpha \). If more than one characteristic set \(K_B(y)\) satisfies both conditions, we pick the characteristic set \(K_B(y)\) with the largest \(Pr(X|K_B(y))\). If this criterion still ends in a tie, a characteristic set is picked heuristically, as the first one on the list [3].

In this paper, we study MLEM2 B-global probabilistic approximations based on characteristic sets, with B = A. Such approximations are called, for simplicity, global probabilistic approximations associated with the parameter \(\alpha \), denoted by \(appr^{global}_{\alpha }(X)\). Similarly, for B = A, the characteristic set \(K_B(x)\) is denoted by K(x).

Let \(E_{\alpha }(X)\) be the set of all eligible characteristic sets defined as follows

$$\begin{aligned} \{ K(x) \ | \ x \in U, Pr(X | K(x)) \ge \alpha \}. \end{aligned}$$
(2)

A heuristic version of the global probabilistic approximation based on characteristic sets is presented below.

figure d
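The algorithm itself is given in the figure above; the Python sketch below is only one plausible reading of the description in this subsection, with the stopping rule (stop when no eligible characteristic set covers a still-uncovered case of X) taken as an assumption.

```python
def global_probabilistic_approximation(X, granules, alpha):
    """A hedged sketch of the MLEM2-style heuristic described above.  granules maps
    an identifier to a set of cases (here, a characteristic set K(y)); X is a concept.
    The stopping rule is an assumption, not a quotation of the published algorithm."""
    X = set(X)
    eligible = {g: K for g, K in granules.items()
                if len(X & K) / len(K) >= alpha}        # Pr(X | K) >= alpha
    approx, uncovered = set(), set(X)
    while uncovered:
        best_key, best_set = None, None
        for g in sorted(eligible):                      # "first on the list" tie-break
            K = eligible[g]
            if not (K & uncovered):
                continue                                # covers no new case of X
            key = (len(X & K), len(X & K) / len(K))     # relevance, then Pr(X | K)
            if best_key is None or key > best_key:
                best_key, best_set = key, K
        if best_set is None:
            break
        approx |= best_set
        uncovered -= best_set
    return approx
```

The granules argument is kept generic so that the same function can be reused later with generalized maximal consistent blocks instead of characteristic sets.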

For Table 1, all distinct global probabilistic approximations based on characteristic sets are

$$\begin{aligned}&appr_{1}^{global}(\{1, 2, 3\}) = \{2\}, \\&appr_{0.667}^{global}(\{1, 2, 3\}) = \{1, 2, 3, 4\}, \\&appr_{0.4}^{global}(\{1, 2, 3\}) = \{1, 2, 3, 4, 5, 7\}, \\&appr_{1}^{global}(\{4, 5, 6, 7, 8\}) = \{4, 5, 7, 8\}, \\&appr_{0.75}^{global}(\{4, 5, 6, 7, 8\}) = \{1, 4, 5, 6, 7, 8\}. \end{aligned}$$
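For instance, the value of \(appr_{1}^{global}(\{4, 5, 6, 7, 8\})\) can be verified directly from the characteristic sets: \(K_A(8) = \{4, 5, 7, 8\}\) is the only characteristic set contained in the concept, so

$$\begin{aligned} Pr(\{4, 5, 6, 7, 8\} \ | \ K_A(8)) = \frac{|\{4, 5, 6, 7, 8\} \cap \{4, 5, 7, 8\}|}{|\{4, 5, 7, 8\}|} = \frac{4}{4} = 1, \end{aligned}$$

and the approximation with \(\alpha \) = 1 is \(K_A(8)\) itself.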

3.2 Saturated Probabilistic Approximations Based on Characteristic Sets

Another heuristic version of the probabilistic approximation is based on selection of characteristic sets while giving higher priority to characteristic sets with larger conditional probability Pr(X|K(x)). Additionally, if the approximation covers all cases from the concept X, we stop adding characteristic sets.

Let X be a concept and let \(x \in U\). Let us compute all conditional probabilities Pr(X|K(x)). Then, we sort the set

$$\begin{aligned} \{Pr(X|K(x)) \ | \ x \in U\}.\end{aligned}$$
(3)

Let us denote the sorted list of such conditional probabilities by \(\alpha _1\), \(\alpha _2\),..., \(\alpha _n\), where \(\alpha _1\) is the largest. For any i = 1, 2,..., n, the set \(E_i(X)\) is defined as follows

$$\begin{aligned} \{ K(x) \ | \ x \in U, Pr(X | K(x)) = \alpha _i \}. \end{aligned}$$
(4)

If we want to compute a saturated probabilistic approximation, denoted by \(appr^{saturated}_{\alpha }(X)\), for some \(\alpha \), \(0 < \alpha \le 1\), we need to identify the index m such that

$$\begin{aligned} \alpha _m \ge \alpha > \alpha _{m + 1}, \end{aligned}$$
(5)

where \(m \in \{1, 2,..., n\}\) and \(\alpha _{n + 1}\) = 0. Then, the saturated probabilistic approximation \(appr^{saturated}_{\alpha _{m}}(X)\) is computed using the following algorithm.

figure e
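As above, the exact procedure is given in the figure; the sketch below is a plausible reading of the description: characteristic sets are scanned in decreasing order of Pr(X|K(x)), a set is added only if it covers a not-yet-covered case of X, and the construction stops when X is covered or the conditional probability drops below \(\alpha \). For the characteristic sets of Table 1 (with \(\alpha \) taken as an exact fraction such as 2/3 rather than the rounded 0.667), this sketch reproduces the saturated approximations listed below.

```python
def saturated_probabilistic_approximation(X, granules, alpha):
    """A hedged sketch of the saturated approximation described above: granules are
    scanned in decreasing order of Pr(X | K); a granule is added only if it covers a
    not-yet-covered case of X; the construction stops once X is covered or the
    conditional probability drops below alpha."""
    X = set(X)
    prob = {g: len(X & K) / len(K) for g, K in granules.items()}
    approx, uncovered = set(), set(X)
    for g in sorted(granules, key=lambda g: -prob[g]):
        if prob[g] < alpha or not uncovered:
            break
        K = granules[g]
        if K & uncovered:                 # add only granules covering new cases of X
            approx |= K
            uncovered -= K
    return approx
```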

For Table 1, all distinct saturated probabilistic approximations based on characteristic sets are

$$\begin{aligned}&appr_{1}^{saturated}(\{1, 2, 3\}) = \{2\}, \\&appr_{0.667}^{saturated}(\{1, 2, 3\}) = \{1, 2, 3, 4\}, \\&appr_{1}^{saturated}(\{4, 5, 6, 7, 8\}) = \{4, 5, 7, 8\}, \\&appr_{0.75}^{saturated}(\{4, 5, 6, 7, 8\}) = \{1, 4, 5, 6, 7, 8\}. \end{aligned}$$

3.3 Global Probabilistic Approximations Based on Maximal Consistent Blocks

A special case of the global probabilistic approximation, limited only to lower and upper approximations and to characteristic sets, was introduced in [16, 17]. A general definition of the global probabilistic approximation was introduced in [9].

A B-global probabilistic approximation of the concept X, based on maximal consistent blocks, with the parameter \(\alpha \) and denoted by \(appr^{global}_{\alpha , B}(X)\), is defined as follows

$$ \bigcup \ \{ Y \ | \ Y \in \mathscr {C}_B(x), \ x \in X, \ Pr(X | Y) \ge \alpha \}. $$

Obviously, for given sets B and X and the parameter \(\alpha \), there exist many B-global probabilistic approximations of X. Additionally, an algorithm for computing B-global probabilistic approximations is of exponential computational complexity. So, we decided to use a heuristic version of the definition of B-global probabilistic approximation, called the MLEM2 B-global probabilistic approximation of the concept X, associated with a parameter \(\alpha \) and denoted by \(appr^{mlem2}_{\alpha , B}(X)\) [3]. This definition is based on the rule induction algorithm MLEM2. The approximation \(appr^{mlem2}_{\alpha , B}(X)\) is a union of the generalized maximal consistent blocks \( Y \in \mathscr {C}(B)\) that are the most relevant to the concept X, i.e., with \(|X \cap Y|\) as large as possible and with \(Pr(X|Y) \ge \alpha \). If more than one generalized maximal consistent block Y satisfies both conditions, the generalized maximal consistent block Y with the largest \(Pr(X|Y)\) is selected. If this criterion still ends in a tie, a generalized maximal consistent block Y is picked heuristically, as the first one on the list [3].

Special MLEM2 B-global probabilistic approximations, with B = A, are called global probabilistic approximations associated with the parameter \(\alpha \), and are denoted by \(appr^{global}_{\alpha }(X)\).

Let \(E_{\alpha }(X)\) be the set of all eligible generalized maximal consistent blocks defined as follows

$$\begin{aligned} \{ Y \ | \ Y \in \mathscr {C}(A), \ Pr(X | Y) \ge \alpha \}. \end{aligned}$$

A heuristic version of the global probabilistic approximation is computed using the following algorithm

figure f
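Under the same assumptions as the earlier sketches, the only change for maximal consistent blocks is the collection of granules passed to the greedy procedure. The fragment below reuses the hypothetical functions defined earlier; the dictionary K simply restates the characteristic sets read off from R(A).

```python
# Characteristic sets K_A(x) for Table 1, read off from the relation R(A).
K = {1: {1, 3, 4}, 2: {2}, 3: {1, 3, 4, 5, 7}, 4: {1, 3, 4},
     5: {1, 4, 5, 6}, 6: {3, 6, 7}, 7: {3, 6, 7}, 8: {4, 5, 7, 8}}

# Only the granules change: pass the generalized maximal consistent blocks instead.
mcbs = generalized_maximal_consistent_blocks(set(range(1, 9)), K)
granules = {i: block for i, block in enumerate(mcbs)}
approx = global_probabilistic_approximation({4, 5, 6, 7, 8}, granules, alpha=1.0)
# approx == {5, 6, 7, 8}, in agreement with the list below.
```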

For Table 1, all distinct global probabilistic approximations based on maximal consistent blocks are

$$\begin{aligned}&appr_{1}^{global}(\{1, 2, 3\}) = \{2\}, \\&appr_{0.667}^{global}(\{1, 2, 3\}) = \{1, 2, 3, 4\}, \\&appr_{1}^{global}(\{4, 5, 6, 7, 8\}) = \{5, 6, 7, 8\}, \\&appr_{0.5}^{global}(\{4, 5, 6, 7, 8\}) = \{3, 5, 6, 7, 8\}, \\&appr_{0.333}^{global}(\{4, 5, 6, 7, 8\}) = \{1, 3, 4, 5, 6, 7, 8\}. \end{aligned}$$

3.4 Saturated Probabilistic Approximations Based on Maximal Consistent Blocks

Saturated probabilistic approximations are unions of generalized maximal consistent blocks while giving higher priority to generalized maximal consistent blocks with larger conditional probability Pr(X|Y). Additionally, if the approximation covers all cases from the concept X, we stop adding generalized maximal consistent blocks.

Let X be a concept. Let us compute all conditional probabilities Pr(X|Y), where \(Y \in \mathscr {C}(A)\). Then we sort the set

$$\begin{aligned} \{ Pr(X | Y) \ | \ Y \in \mathscr {C}(A) \} \end{aligned}$$

in descending order. Let us denote the sorted list of such conditional probabilities by \(\alpha _1\), \(\alpha _2\),..., \(\alpha _n\). For any i = 1, 2,..., n, the set \(E_i(X)\) is defined as follows

$$\begin{aligned} \{ Y \ | \ Y \in \mathscr {C}(A), \ Pr(X | Y) = \alpha _i \}. \end{aligned}$$

If we want to compute a saturated probabilistic approximation, denoted by \(appr^{saturated}_{\alpha }(X)\), for some \(\alpha \), \(0 < \alpha \le 1\), we need to identify the index m such that

$$\begin{aligned} \alpha _m \ge \alpha > \alpha _{m + 1}, \end{aligned}$$

where \(m \in \{1, 2,..., n\}\) and \(\alpha _{n + 1}\) = 0. The saturated probabilistic approximation \(appr^{saturated}_{\alpha _{m}}(X)\) is computed using the following algorithm

figure g

For Table 1, every saturated probabilistic approximation based on maximal consistent blocks is the same as the corresponding global probabilistic approximation based on maximal consistent blocks for the same concept.

3.5 Rule Induction

Once the global and saturated probabilistic approximations associated with a parameter \(\alpha \) are constructed, rule sets are induced using the rule induction algorithm based on another parameter, also interpreted as a probability, and denoted by \(\beta \). This algorithm also uses the MLEM2 principles [15], and was presented, e.g., in [3].

figure h
figure i

For example, for Table 1 and \(\alpha \) = \(\beta \) = 0.5, using the global probabilistic approximations, the MLEM2 rule induction algorithm induces the following rules:

(Temperature, very-high) & (Headache, yes) \(\rightarrow \) (Flu, yes)

(Temperature, high) & (Cough, yes) \(\rightarrow \) (Flu, yes)

(Headache, no) & (Cough, no) \(\rightarrow \) (Flu, no)

and

(Temperature, normal) \(\rightarrow \) (Flu, no)

4 Experiments

For our experiments, we used eight data sets taken from the Machine Learning Repository at the University of California at Irvine. For every data set, a new data set was created by randomly replacing 35% of the existing specified attribute values with “do not care” conditions.

In our experiments, the parameter \(\alpha \) varied between 0.001 and 1, while the parameter \(\beta \) was equal to 0.5. For any data set, ten-fold cross validation was conducted. Results of our experiments are presented in Figs. 1, 2, 3, 4, 5, 6, 7 and 8, where “CS” denotes a characteristic set, “MCB” denotes a generalized maximal consistent block, “Global” denotes a MLEM2 global probabilistic approximation and “Saturated” denotes a saturated probabilistic approximation. In our experiments, four methods for mining incomplete data sets were used, since we combined two types of granules from which approximations are constructed, characteristic sets and generalized maximal consistent blocks, with two versions of probabilistic approximations, global and saturated.

These four methods were compared by applying the distribution-free Friedman rank sum test, followed by the post-hoc test (distribution-free multiple comparisons based on the Friedman rank sums), with a 5% level of significance.

For three data sets (bankruptcy, image segmentation and iris), both global and saturated probabilistic approximations based on maximal consistent blocks are significantly better (error rates evaluated by ten-fold cross validation are smaller) than global probabilistic approximations based on characteristic sets. Additionally, for the iris data set, saturated probabilistic approximations based on maximal consistent blocks are significantly better than saturated probabilistic approximations based on characteristic sets.

On the other hand, for the lymphography data set, saturated probabilistic approximations based on characteristic sets are better than both global and saturated probabilistic approximations based on maximal consistent blocks. For the wine recognition data set, saturated probabilistic approximations based on characteristic sets are better than both global probabilistic approximations based on characteristic sets and saturated probabilistic approximations based on maximal consistent blocks.

For three data sets (breast cancer, echocardiogram and hepatitis), the pairwise differences in the error rate, evaluated by ten-fold cross validation, between these four approaches to data mining are statistically insignificant.

5 Conclusions

We compared four methods for mining incomplete data sets, combining two types of granules, characteristic sets and generalized maximal consistent blocks, with two types of probabilistic approximations, global and saturated. Our criterion of quality was the error rate evaluated by ten-fold cross validation. As follows from our experiments, none of the four methods is universally best. The main conclusion is that, for data mining, all four methods should be applied.